Category Archives: RDF

Readable URIs

Over the years we’ve been engaged in a number of discussions in which the ‘readability’ of URIs was raised, either as an issue with non-readable URIs or as a requirement in new URI schemes.

At the Registry, we understand and are sensitive to the desire for human readability in URIs. However, embedding a language-specific label in the URIs identifying concepts in multilingual vocabularies has the side effect of locking each concept into the language of the creator. It also unnecessarily formalizes the creator’s particular spelling variant of that language: ‘colour’ vs. ‘color’, for instance.

When creating the URIs for the RDA vocabularies we acceded to requests to make the URIs ‘readable’, specifically to make it easier for programmers to create software that could guess the URI from the prefLabel. We have come to regret that decision as the vocabularies gained prefLabels in multiple languages. It also creates issues for people extending the vocabulary and adding concepts that have no prefLabel in the chosen language of the vocabulary creator.
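The ‘guessing’ approach is easy to illustrate, and so is its failure mode. A minimal sketch, with a hypothetical namespace and labels (not actual RDA URIs):

```python
import re

BASE = "http://example.org/rdavocab/"  # hypothetical namespace, for illustration only

def uri_from_pref_label(label: str) -> str:
    """Guess a 'readable' URI by slugifying the prefLabel."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return BASE + slug

# Works only for the creator's language and spelling:
uri_from_pref_label("colour of resource")  # → http://example.org/rdavocab/colour-of-resource
uri_from_pref_label("color of resource")   # → a *different* URI for the same concept
uri_from_pref_label("couleur")             # → a French prefLabel yields yet another URI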

That said, the case is much less clear for URIs identifying ‘things’, such as Classes and Properties, in RDFS and OWL, since these are less likely to have a need to be semantically ‘understood’ independent of their label and are less likely to be labeled and defined in multiple languages. In that case the semantics of the Class or Property is often best communicated by a language-specific, readable URI.

In the end I personally lean heavily toward non-readable identifiers because of the flexibility in altering the label in the future, especially in the fairly common case of someone wishing to change the label even though the semantics have not changed. This becomes much more problematic when the label applied to the thing at a particular point in time has been locked into the URI.

I’m not trying to start a non-readable URIs campaign, just pointing out that the Registry, in particular, is designed to support vocabulary development by groups of people creating and maintaining multilingual vocabularies, whose collective agreement on labeling things may change over the course of the development cycle. Our non-literal-label URI default is designed to support the understanding we’ve developed of that environment over time.

LCSH, SKOS and subfields

This week, Karen Coyle wrote a post about LCSH as linked data: beyond “dash-dash” which provoked a discussion on the id.loc.gov discussion list.

It seems to me that there are several memes at play in this conversation:

LCSH and SKOS

As Karen points out, LCSH is more than just a simple thesaurus. It’s also a set of instructions for building structured strings in a way that’s highly meaningful for ordering physical cards in a physical catalog. In addition, each string component has specific semantics related to its position in the string, so it’s possible, if everyone knows and agrees on the rules, to parse the string and derive the semantics of each individual component. The result is a pre-coordinated index string.

These stand-alone pre-coordinated strings are perhaps much less meaningful in the context of LOD, but this certainly doesn’t apply to the components. I think what Karen is pointing out is that, while it’s wonderful to have a subset of all of the components that can be used to construct LC Subject Headings published as LOD, there’s enough missing information to reduce the overall value. As I read it, she’s wishing for the missing semantics to be published as part of the LCSH linked data, and hoping that LC doesn’t rest on its well-earned laurels and call it a day.

Structured Strings

Dublin Core calls the rules that define a structured string a "Syntax Encoding Scheme" (SES) and basically, that’s what the rules defining the construction of LC Subject Headings seem to be. It’s structurally no different than saying that the string "05/10/09", if interpreted as a date using an encoding scheme/mask of "mm/dd/yy", ‘means’ day 10 in the month May in the year 2009 using the Gregorian calendar. Fascinatingly, that same ‘date’ can be expressed as a Julian date of "2454962", but I digress.
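The date analogy can be checked directly in code. A quick sketch using Python’s standard library (the constant 1721425 is the standard offset between Python’s proleptic-Gregorian ordinal and the Julian Day Number):

```python
from datetime import datetime

# Interpret the string under the "mm/dd/yy" encoding scheme
d = datetime.strptime("05/10/09", "%m/%d/%y").date()
print(d)  # 2009-05-10: day 10 of May in the year 2009, Gregorian

# The same date as a Julian Day Number: Python's ordinal counts days
# from 0001-01-01 (= day 1), and ordinal + 1721425 gives the JDN.
jdn = d.toordinal() + 1721425
print(jdn)  # 2454962
```

The string only ‘means’ anything once you know which scheme to apply; read under "dd/mm/yy" instead, the same bytes would denote the 5th of October.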

As far as I can tell, no one has figured out a universally accepted (or any) way to define the semantic structure of a SES in a way that can be used by common semantic inference engines, and I don’t think that anyone in this discussion is asking for that. What’s needed is a way to say "Here’s a pre-coordinated string expressed as a skos:prefLabel, it has an identity, and here are its semantic components."

Additional data

So…

"Italy--History--1492-1559--Fiction"

…is expressed in http://id.loc.gov/authorities/sh2008115565#concept as…

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .  
@prefix terms: <http://purl.org/dc/terms/> .  
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://id.loc.gov/authorities/sh2008115565#concept>
    skos:prefLabel "Italy--History--1492-1559--Fiction"@en ; 
    rdf:type skos:Concept ;
    terms:modified "2008-03-15T08:10:27-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ; 
    terms:created "2008-03-14T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ; 
    owl:sameAs <info:lc/authorities/sh2008115565> ; 
    skos:inScheme
        <http://id.loc.gov/authorities#geographicNames> , 
        <http://id.loc.gov/authorities#conceptScheme> ; 
    terms:source "Work cat.: The family, 2001"@en . 

…and has a 151 field expressed in the authority file as…

151 __ |a Italy |x History |y 1492-1559 |v Fiction

…which has the additional minimal semantics of…

<http://id.loc.gov/authorities/sh2008115565#concept>
    loc_id:type "Geographic Name" ; #note that this is also expressed as a skos:inScheme property
    loc_id:topicalDivision "History" ;
    loc_id:chronologicalSubdivision "1492-1559" ;
    loc_id:formSubdivision "Fiction" ;
    loc_id:geographicName "Italy" .
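A mapping like that could be derived mechanically from the subfield codes. A sketch, in which the subfield-to-property table and the loc_id prefix are assumptions for illustration, mirroring the example above:

```python
# Hypothetical mapping of 151 subfield codes to properties; the
# loc_id prefix is invented here, following the sketch above.
SUBFIELD_PROPERTY = {
    "a": "loc_id:geographicName",
    "x": "loc_id:topicalDivision",
    "y": "loc_id:chronologicalSubdivision",
    "v": "loc_id:formSubdivision",
}

def parse_151(field: str) -> list[tuple[str, str]]:
    """Split a flattened 151 field on its '|' subfield delimiters."""
    statements = []
    for chunk in field.split("|")[1:]:
        code, value = chunk[0], chunk[1:].strip()
        statements.append((SUBFIELD_PROPERTY[code], value))
    return statements

parse_151("151 __ |a Italy |x History |y 1492-1559 |v Fiction")
# [('loc_id:geographicName', 'Italy'),
#  ('loc_id:topicalDivision', 'History'),
#  ('loc_id:chronologicalSubdivision', '1492-1559'),
#  ('loc_id:formSubdivision', 'Fiction')]
```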

…and this might also be expressed as…

<http://id.loc.gov/authorities/sh2008115565#concept>
   loc_id:type <http://id.loc.gov/authorities/sh2002011429> ;
   loc_id:topicalDivision <http://id.loc.gov/authorities/sh85061212> ;
   loc_id:formSubdivision <http://id.loc.gov/authorities/sh85048050> ;
   loc_id:geographicName <http://id.loc.gov/authorities/n79021783> ;
   dc:temporal "1492-1559" ;
   dc:spatial <http://sws.geonames.org/3175395/> ;
   dc:spatial <http://id.loc.gov/authorities/n79021783> .

Making sure that those strings in the first example are expressed as resource identifiers is also something that I think Karen is asking for. (By the way, the ability to look up a label by URL at id.loc.gov is really useful.)

I should point out that Ed, Antoine, Clay, and Dan’s DC2008 paper detailing the conversion of LCSH to SKOS goes into some detail (see section 2.7) about the LCSH to SKOS mapping, but doesn’t directly address the issue that Karen is raising about mapping the explicit semantics of the subfields.

DSPs, DCAPs, and WIKIs, oh my!

The Dublin Core Metadata Initiative is looking for someone to build them a wiki. At least I think that’s what they want.

From the Call for Tender page:
“DCMI Call for Tender 2007-03: Wiki format for application profiles convertible into XML”
From the DCMI home page:
“Call for tender for a machine-processable application profile format”

I don’t think that these two descriptions are describing the same thing at all. Of course, that just reflects my sense that “machine-processable application profile” doesn’t mean “application profile that can be scraped from a wiki page and expressed as XML”.

I’m more inclined to think that a “machine-processable application profile” means a DCAP that can be directly used to validate data that has been created with the intention of conforming to a specified DCAP (or is it DSP? — I wish that they wouldn’t suddenly change the terminology just to fit the model).

Increasingly, I’m viewing “machine-processable application profile” as meaning machine-processable-DCAP-derived data-entry forms (XForms) used to generate DCAP-conformant XML data that can be validated using a machine-processable-DCAP-derived RELAX NG schema, W3C XML Schema, or Schematron ruleset. RDF triples would then have to be derived from the validated XML.

The intermediate XML validation is necessary because a sensibly efficient way to validate RDF against a DCAP doesn’t currently exist, although Alistair’s notion of rules-based RDF validation based on SPARQL query assertions looks like it might work in a Schematron-like way. That would then imply the ability to derive SPARQL queries from a machine-processable DCAP.
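In the meantime, the assertion-based idea is easy to prototype over raw triples. A toy sketch, with the graph held as a set of (subject, predicate, object) tuples and an invented constraint (every skos:Concept must carry a skos:prefLabel); the data here is illustrative, not real LC output:

```python
# Toy Schematron-style assertion over an RDF graph represented as a
# set of (subject, predicate, object) tuples. Names are illustrative.
SKOS = "http://www.w3.org/2004/02/skos/core#"

graph = {
    ("ex:c1", "rdf:type", SKOS + "Concept"),
    ("ex:c1", SKOS + "prefLabel", '"Italy--History"@en'),
    ("ex:c2", "rdf:type", SKOS + "Concept"),
    # ex:c2 has no prefLabel -- the rule below should flag it
}

def concepts_missing_preflabel(triples):
    """Analogue of a SPARQL ASK/SELECT assertion: find every
    skos:Concept that lacks a skos:prefLabel."""
    concepts = {s for (s, p, o) in triples
                if p == "rdf:type" and o == SKOS + "Concept"}
    labeled = {s for (s, p, o) in triples if p == SKOS + "prefLabel"}
    return sorted(concepts - labeled)

concepts_missing_preflabel(graph)  # → ['ex:c2']
```

A real implementation would express the same constraint as a SPARQL query run by the validator rather than hand-written set logic, but the shape of the check is the same.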

While the idea of a wiki-based DCAP editor is conceptually interesting, it seems to me that a tender to produce exemplars of the above, based on the current DCAP XML expression, would be far more valuable in providing useful test cases for determining the validity and utility of that expression.