SKOS updated for Vocabularies

Just a quick note that today we updated the version of SKOS that we provide for describing value vocabularies. This deprecates the properties that were removed from the final SKOS release and adds the many new ones. We’ve also restricted the non-mapping relation properties (skos:broader, skos:narrower, skos:related) to the ‘containing’ scheme while providing cross-scheme mapping for the mapping relations.
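
As a rough illustration of that restriction, here is a minimal sketch of the kind of check involved. It is not the Registry's actual code: the function name and the scheme identifiers are hypothetical, and it assumes that mapping relations may point at a concept in any scheme.

```python
# Hypothetical sketch of the scoping rule described above; not the Registry's code.
SEMANTIC_RELATIONS = {"skos:broader", "skos:narrower", "skos:related"}
MAPPING_RELATIONS = {
    "skos:broadMatch",
    "skos:narrowMatch",
    "skos:relatedMatch",
    "skos:closeMatch",
    "skos:exactMatch",
}


def relation_allowed(prop, subject_scheme, object_scheme):
    """Return True if `prop` may link concepts from the two given schemes."""
    if prop in SEMANTIC_RELATIONS:
        # Non-mapping relations stay inside the 'containing' scheme.
        return subject_scheme == object_scheme
    if prop in MAPPING_RELATIONS:
        # Assumption: mapping relations may target any scheme.
        return True
    raise ValueError(f"not a SKOS relation handled by this sketch: {prop}")


print(relation_allowed("skos:broader", "schemeA", "schemeB"))
print(relation_allowed("skos:exactMatch", "schemeA", "schemeB"))
```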

We don’t yet provide a useful interface for building collections, but that’s coming real soon now.

Oh, and we added a SPARQL endpoint.

The German National Library: translating and registering RDA elements and vocabularies

A prerequisite for the registering of our terms in the NSDL Registry and one of the greatest challenges for the German National Library at the moment is the translation of the RDA elements and vocabularies.  Since bibliographic description is executed with a highly specialised vocabulary, we are finding that the process of pinpointing the appropriate terms is interesting but also very involved. Although the existing German rules for bibliographic description (RAK) and the authority files for subject headings (Schlagwortnormdatei, or SWD) have plenty of vocabulary to offer as equivalents to Anglo-American cataloguing terminology, RDA does include concepts relatively new to bibliographic description.

Before resorting to “inventing” words, always a last resort, we launch comprehensive vocabulary mining efforts, in the process of which, beyond checking already existing translations (FRBR, MARC 21), we consult the expertise of such institutions as art libraries and film institutes to get the most up-to-date descriptive terms available in the German language. If we deem a word previously used in a translation suboptimal, we may deviate from its use and in particular cases forgo the advantages of standardisation in the interest of our primary criteria: consistency, currency, usability, and precision. A quick and general Google search can also be helpful to learn how terms are being (in)formally circulated. In the case that we should find it necessary to create a new term in German, as we are experiencing with such an example as the type “unmediated”, we have to weigh up what sort of etymological root we would like to lean towards, Latin or Germanic. If we translate it with “unmediatisiert”, it can ease communication around cataloguing between nations because of its morphological similarity to many European languages. However, leaning on Germanic roots may sometimes be necessary in the interest of standardisation and aligning with existing descriptive language or with the strengths and realities of the German language. In that case, we may be better off choosing “nicht mediatisiert” or “ohne Hilfsmittel zu benutzende Medien”, which seems awkward but conforms to types of uses already in existence in the subject headings. The option of the “new-proposed” status in the Registry for the concepts therefore suits our needs perfectly, since for the reasons just mentioned and outlined in Diane’s blog entry about multiple languages and RDA, none of the translations we have entered are as of yet official.

Once our small team of librarians from the Office for Library Standards has followed these processes and developed a pool of equivalent German terms which we deem worthy of proposing initially for the Registry and subsequently for our official translation of RDA, we make them available to groups of colleagues specialised in bibliographic description or subject headings at the German National Library for comment in a Wiki and working meetings. Our experience with translation has shown us that the translations of descriptive bibliographic elements and vocabulary into German must be handled by librarians (professional translators can potentially pick up from there) and peer-reviewed through the above-mentioned process to ensure accuracy and acceptance in the library community.

Beyond motivating us to begin our RDA translations early, our participation in the Registry has also given us an opportunity to dabble in the semantic web through the process of assigning URIs to our German translations of RDA element and value vocabulary. As a test run, it therefore allows us to toy with the idea of linked data by setting descriptive bibliographic vocabulary up with its prerequisite domain. The lessons learned and questions raised through this experience put us in a better position for strategic planning regarding the nature of the presentation and sharing of bibliographic data in the future.

What has particularly attracted us about the Registry and its connection with the RDA tool is that, if we do decide as an institution to provide linked bibliographic data in the future, the Registry makes it possible to do so in our national language. This is a condition for its widespread usability and acceptance in the German-speaking library and internet community and therefore of primary importance to us, provided of course that the Committee for Library Standards takes the decision to introduce RDA as the official rules for description and access in Germany and Austria.

LCSH, SKOS and subfields

This week, Karen Coyle wrote a post about LCSH as linked data: beyond “dash-dash” which provoked a discussion on the id.loc.gov discussion list.

It seems to me that there are several memes at play in this conversation:

LCSH and SKOS

As Karen points out, LCSH is more than just a simple thesaurus. It’s also a set of instructions for building structured strings in a way that’s highly meaningful for ordering physical cards in a physical catalog. In addition, each string component has specific semantics related to its position in the string, so it’s possible, if everyone knows and agrees on the rules, to parse the string and derive the semantics of each individual component. The result is a pre-coordinated index string.

These stand-alone pre-coordinated strings are perhaps much less meaningful in the context of LOD, but this certainly doesn’t apply to the components. I think what Karen is pointing out is that, while it’s wonderful to have a subset of all of the components that can be used to construct LC Subject Headings published as LOD, there’s enough missing information to reduce the overall value. As I read it, she’s wishing for the missing semantics to be published as part of the LCSH linked data, and hoping that LC doesn’t rest on its well-earned laurels and call it a day.

Structured Strings

Dublin Core calls the rules that define a structured string a "Syntax Encoding Scheme" (SES) and basically, that’s what the rules defining the construction of LC Subject Headings seem to be. It’s structurally no different than saying that the string "05/10/09", if interpreted as a date using an encoding scheme/mask of "mm/dd/yy", ‘means’ day 10 in the month May in the year 2009 using the Gregorian calendar. Fascinatingly, that same ‘date’ can be expressed as a Julian date of "2454962", but I digress.
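
To make the date example concrete, here is a small sketch using only the Python standard library. The helper names are my own, and the mask handling only understands the "mm", "dd" and "yy" tokens used above. The second function computes what is more precisely called the Julian Day Number.

```python
from datetime import datetime


def parse_with_mask(value, mask):
    """Interpret `value` using a simple mask such as "mm/dd/yy"."""
    # Translate the mask tokens into strptime directives.
    fmt = mask.replace("mm", "%m").replace("dd", "%d").replace("yy", "%y")
    return datetime.strptime(value, fmt).date()


def julian_day_number(d):
    """Convert a (proleptic Gregorian) date to its Julian Day Number."""
    # date.toordinal() counts days from 0001-01-01 as day 1;
    # adding 1721425 shifts that count to the Julian Day Number.
    return d.toordinal() + 1721425


d = parse_with_mask("05/10/09", "mm/dd/yy")
print(d.isoformat())         # 2009-05-10
print(julian_day_number(d))  # 2454962
```

The string only 'means' that date because the mask travels with it, which is the whole point of a Syntax Encoding Scheme.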

As far as I can tell, no one has figured out a universally accepted (or any) way to define the semantic structure of an SES in a way that can be used by common semantic inference engines, and I don’t think that anyone in this discussion is asking for that. What’s needed is a way to say "Here’s a pre-coordinated string expressed as a skos:prefLabel, it has an identity, and here are its semantic components."

Additional data

So…

"Italy--History--1492-1559--Fiction"

…is expressed in http://id.loc.gov/authorities/sh2008115565#concept as…

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .  
@prefix terms: <http://purl.org/dc/terms/> .  
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://id.loc.gov/authorities/sh2008115565#concept>
    skos:prefLabel "Italy--History--1492-1559--Fiction"@en ; 
    rdf:type skos:Concept ;
    terms:modified "2008-03-15T08:10:27-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ; 
    terms:created "2008-03-14T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ; 
    owl:sameAs <info:lc/authorities/sh2008115565> ; 
    skos:inScheme
        <http://id.loc.gov/authorities#geographicNames> , 
        <http://id.loc.gov/authorities#conceptScheme> ; 
    terms:source "Work cat.: The family, 2001"@en . 

…and has a 151 field expressed in the authority file as…

151 __ |a Italy |x History |y 1492-1559 |v Fiction

…which has the additional minimal semantics of…

<http://id.loc.gov/authorities/sh2008115565#concept>
    loc_id:type "Geographic Name" ; #note that this is also expressed as a skos:inScheme property
    loc_id:topicalDivision "History" ;
    loc_id:chronologicalSubdivision "1492-1559" ;
    loc_id:formSubdivision "Fiction" ;
    loc_id:geographicName "Italy" .

…and this might also be expressed as…

<http://id.loc.gov/authorities/sh2008115565#concept>
   loc_id:type <http://id.loc.gov/authorities/sh2002011429> ;
   loc_id:topicalDivision <http://id.loc.gov/authorities/sh85061212> ;
   loc_id:formSubdivision <http://id.loc.gov/authorities/sh85048050> ;
   loc_id:geographicName <http://id.loc.gov/authorities/n79021783> ;
   dc:temporal "1492-1559" ;
   dc:spatial <http://sws.geonames.org/3175395/> ;
   dc:spatial <http://id.loc.gov/authorities/n79021783> .

Making sure that those strings in the first example are expressed as resource identifiers is also something that I think Karen is asking for. (BTW, the ability to look up a label by URL at id.loc.gov is really useful.)
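
The step from the 151 field to the string-valued properties is mechanical enough to sketch in a few lines. This is an illustration only: the loc_id: property names are the made-up ones from the example above, the subfield-to-property table is my assumption, and the parser handles just the simple "|code value" layout shown here.

```python
# Assumed mapping from 151 subfield codes to the illustrative loc_id: properties.
SUBFIELD_PROPERTIES = {
    "a": "loc_id:geographicName",
    "x": "loc_id:topicalDivision",
    "y": "loc_id:chronologicalSubdivision",
    "v": "loc_id:formSubdivision",
}


def parse_subfields(field):
    """Split '151 __ |a Italy |x History' into [('a', 'Italy'), ('x', 'History')]."""
    # Everything before the first '|' is the tag and indicators; skip it.
    parts = field.split("|")[1:]
    return [(p[0], p[1:].strip()) for p in parts if p.strip()]


def heading(subfields):
    """Rebuild the pre-coordinated 'dash-dash' string."""
    return "--".join(value for _, value in subfields)


def components(subfields):
    """Pair each subfield value with the property naming its role."""
    return [(SUBFIELD_PROPERTIES[code], value) for code, value in subfields]


field = "151 __ |a Italy |x History |y 1492-1559 |v Fiction"
subfields = parse_subfields(field)
print(heading(subfields))
for prop, value in components(subfields):
    print(f'    {prop} "{value}" ;')
```

Turning those literal values into resource identifiers is the harder part, since it needs a lookup against the authority file rather than string handling.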

I should point out that Ed, Antoine, Clay, and Dan’s DC2008 paper detailing the conversion of LCSH to SKOS goes into some detail (see section 2.7) about the LCSH to SKOS mapping, but doesn’t directly address the issue that Karen is raising about mapping the explicit semantics of the subfields.

Multiple languages and RDA

We’ve been thinking for some time about how to implement multi-lingual (and multi-script) vocabularies in the Registry. Some Registry users have been experimenting with language and script capability for some time (see Daniel Lovins’ Sandbox Hebrew GMD’s). But it was really when we started working with the RDA vocabularies that we got serious about multi-linguality.

At DC-2008 in Berlin, we started talking to the librarians at the Deutsche Nationalbibliothek about adding German language versions of RDA vocabularies into the Registry. I knew how eager the German libraries were to participate more actively in the RDA development, and had been talking to German librarians for some time about their frustrations with the notion that they had to wait until “later” to become involved. Christine Frodl and Veronika Leibrecht have been our primary contacts at the Deutsche Nationalbibliothek on this work, and they’ve been a real pleasure to work with.

We decided collectively to start with some of the value vocabularies, in particular Content Type, Media Type and Carrier Type. We enabled Veronika to become a maintainer on those vocabularies, and she worked within her library and associated German-speaking libraries to translate and develop labels and definitions in German for the existing terms. As she describes the challenge:

“Because RDA was not developed simultaneously in various languages (that would be an even more daunting task!), we are looking for ways to adapt German to English language/cataloguing concepts and must get agreement on the terms in our community. The search for terminology to translate RDA will therefore be an ongoing process in the short term for us. … Now I am looking forward to seeing French and Spanish come along 😉 and would be happy to share a few resources I found which could help people in their search for terminology.”

Those of you who know German (or have an interest in multilingual vocabularies in general) might want to take a look at some of the work done already:

Content Type Vocabulary (you can see that for now, all concepts display in English)

Detail for concept of “computer program”: http://metadataregistry.org/concept/show/id/517.html (the German translation for the label appears in the list of properties of the concept)

Veronika points out that the process behind this effort is a complex one, but solidly based on existing relationships in the German-speaking world:

“[B]ecause of the federal system in Germany, the DNB works very closely with all library consortia in the country and Austria and decisions about cataloguing rules and data formats are reached through consensus with them. The reason for this is that the consortia include and represent libraries which existed long before the German state as such (or the DNB, for that matter) and therefore have traditionally and independently held the written cultural heritage of their individual counties, duchies, kingdoms etc.”

We have had some additional interest from other language communities in this effort, and Jon has added some detail on our wiki to describe how we plan to improve the software to make both building and maintenance of other language versions simpler, and easier to configure at the output end. Do note that this isn’t implemented yet, but is instead a blueprint for moving ahead in this critical area.

Updated Step-by-Step Instructions

Those of you who have actually discovered the Registry and tried to add stuff to it have (I hope) already realized that we had Step-by-step Instructions for doing so. They were old, and we’d added new things (mostly Jon added new things; I just rant, nag and test), so I finally re-did the instructions. They can be found here: http://wiki.metadataregistry.org/Step-By-Step_Instruction.

Looking at the old instructions was, for me at least, a reminder that we have made progress, much as it sometimes seems like we’re moving at a glacial pace. The interface has changed, we’ve added versioning and history, as well as schema registration (read Jon’s posts for more details). There’s still lots more to come, and believe me we have a seemingly endless list of what’s still missing. But writing documentation, even basic stuff like these instructions, is a humbling experience. Trying to do things more linearly than I usually do reminds me yet again where the gaps are.

One of the issues, which I’m not sure I’ve papered over very well in the instructions, is something I call the “eating our own dog food” problem. Those of you who know me personally have heard me use that phrase before—it’s a favorite. It basically means that, if you’re just preaching about how to do something, and not doing it, you’re not eating your own dog food. Not a good thing, and likely as not it will affect your credibility in ways that aren’t very comfortable, because SOMEBODY will call you on it.

Where we managed to step in it (the natural product created from said dog food, that is), was when we extended the registry from value vocabularies only to value vocabularies and schemas. Then, our model of concepts and properties of concepts started getting a little funky. When you’re registering schemas, you’ve got an aggregation of schema properties, and then, um, properties of properties? Uh oh. You can see the problem, I think—it’s about identifying and defining terms (among other things), and isn’t that what we’re supposed to be doing?

So, for the moment, until we’ve figured out how to hold our noses and eat that unappetizing dog food, we’re making a distinction in the schema instructions between “schema properties” and “specific properties.” Not elegant, but until inspiration strikes, somewhat helpful, I hope.

If any of you have occasion to use the instructions or stumble upon them and want to provide some helpful (or not) comments, just send them along to me: metadata.maven@gmail.com.

Heck of a job, Phippsy

It’s been a busy summer, but not on the Registry front.

We’re currently working on integrating the ARC library so we can handle RDF a bit more intelligently. This will give us import capability, a SPARQL endpoint, and the ability to express vocabularies in more RDF serializations. We’ve also made some improvements to our URI-building feature, adding support for ‘hash’ namespaces and tokenized identifiers (rather than simply numeric). This means that a URI like http://www.w3.org/2008/05/skos#Concept will be built for you properly instead of having to edit the current default http://www.w3.org/2008/05/skos/12345 to get what you want. None of this is even on the beta site yet, primarily because we haven’t had time to test it at all, and there are some things we know are still broken.
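
For the curious, the URI-building rule amounts to something like the sketch below. This is an illustration of the idea in Python rather than the Registry's own code, and the function name is hypothetical.

```python
def build_uri(namespace, identifier):
    """Join a vocabulary namespace and a concept identifier into a URI.

    'Hash' namespaces end in '#', so the identifier is appended directly;
    otherwise the two are joined with a single '/'.
    """
    identifier = str(identifier)  # identifiers may be numeric or tokens
    if namespace.endswith("#") or namespace.endswith("/"):
        return namespace + identifier
    return namespace + "/" + identifier


# Hash namespace with a tokenized identifier.
print(build_uri("http://www.w3.org/2008/05/skos#", "Concept"))
# The older default: slash namespace with a numeric identifier.
print(build_uri("http://www.w3.org/2008/05/skos", 12345))
```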

There’s also now a fairly simple PHP script that accesses the new Registry API to retrieve data remotely. You can see this in action at http://rdvocab.info/roles.rdf — there’s no data actually maintained on rdvocab.info, the data is retrieved from the Registry. We’re not publishing the script yet or documenting the API because, like so many things, they’re not quite finished — the script needs to be even simpler, tested with PHP4, and less dependent on .htaccess. The API needs a few more methods and also needs to require a key for some operations.

Expect to see some of this stuff appear in early September.

The grant to work on the Registry runs out in September, but I’ll keep working on it and hope to have some collaborators. I’ve been pretty poor at creating a welcoming collaborative environment, networking, and promotion so that may be a vain hope.

There’s a fairly long list of things yet to do and some of them are major. Application profile management is the biggest, but we also need the ability to follow activity on a vocabulary, Twitter-like, more extensive control over notifications, and integrated discussions to help support the vocabulary development features. The ability to import, export, edit, re-import, and have changes tracked throughout the process is also pretty critical. We want very much to integrate the sandbox into the main Registry, at least integrating user registration and making it possible to easily move a vocabulary from the sandbox to the Registry. And there needs to be much more extensive help, better explanations of what’s going on, and a place to report bugs and make suggestions that integrates with Trac.

I’m off messing about in Canada on holiday for the next 2 weeks, so some of the things that I finished up this week will have to wait until I get back before they’re integrated into the site — I hate to potentially break things and then disappear.

What is a Taxonomy

Bob DuCharme is taking a course and is pleased to find a standard that defines taxonomy, quoting from the ANSI/NISO Z39.19 standard, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, and discussing the various classes of controlled vocabulary.

Well worth reading…

What is a taxonomy? – bobdc.blog

I’ve described ontologies to people as being like taxonomies, except that you (or more likely, people in your field) get to make up new, specialized relationships beyond those standardized for thesauri. For example, in legal publishing, a higher court ruling could have the relationship property “cite” to a lower court ruling, with potential values such as “overturns” or “affirms”.

Registry Installation Instructions

Jeepers, no posts for 3+ months and then two in one day! The truth is that I hadn’t realized the last post was still sitting in my drafts folder more than a month after I wrote it.

Moving on…

A number of folks have been interested in installing the Registry, especially since we’ve talked before about ‘easy installation’ being one of our design goals.

We’re pleased to announce that we have finally tweaked things to make a reasonably simple install from our subversion repository possible and provided some hopefully simple instructions detailing how to get the Registry up and running. We don’t provide enough tweaking or instructions (yet) to fully customize the interface, so once it’s installed it’ll still look exactly like the Registry, just running on your server instead of ours.

Whenever we update the production server, we’ll tag that code in subversion and update the link in the instructions (tying a string around my finger to help me remember as we speak), but there won’t be any other ‘release’ announcement unless we do something major.

Whenever we modify the database structure, we’ll provide a SQL script to alter the database with that release. Each script will always modify the database as it was after the previous release, so if you skip releases you’ll need to run the scripts sequentially, oldest first. But this will all be on the instructions page.
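
If you do skip releases, working out which scripts to run is simple bookkeeping. Here is a minimal sketch, assuming you keep an ordered list of releases and the script (if any) that shipped with each; the release tags and file names are invented for the example and are not our real ones.

```python
def scripts_to_run(installed, target, releases):
    """Return the SQL scripts needed to move from `installed` to `target`.

    `releases` is an ordered list of (tag, script) pairs, oldest first.
    `script` is None for a release that did not change the database.
    """
    tags = [tag for tag, _ in releases]
    start = tags.index(installed) + 1  # first release after the installed one
    end = tags.index(target) + 1       # include the target release itself
    if start > end:
        raise ValueError("target release is older than the installed one")
    return [script for _, script in releases[start:end] if script is not None]


# Hypothetical release history.
releases = [
    ("r1", None),
    ("r2", "alter_r2.sql"),
    ("r3", None),
    ("r4", "alter_r4.sql"),
]
print(scripts_to_run("r1", "r4", releases))
```

Each script assumes the database is in the state the previous one left it in, so the order of the returned list matters.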

We expect to update the production code quite often over the next few months.

Metadata Schema

If you’ve been watching the Registry closely (and we know you have), you’ll have noticed that a few weeks ago we started supporting the registration of metadata schemas. It’s not finished and far from perfect, but the perfect can often be the enemy of the good and at the moment it’s, well, good enough for now.

What makes it tough to get schema registration right is that our approach to what we’re calling registration attempts to be cross-cultural — trying to create a bridge from the technologies supporting the Semantic Web to the somewhat more ‘traditional’ data transfer technologies like XML.

We’re also trying to ‘eat our own dog food’ and are using an internally registered Application Profile to define the properties we’re using to describe metadata schemas and ultimately Application Profiles. This AP helps drive the schema registration user interface and we hope at some point we’ll be able to use a registered AP to generate many different interfaces, both human and application. It’s arguably too ambitious, but baby steps…

Vocabulary Management

The Registry is really more a Vocabulary Management Application than a Registry at this point, since we’ve layered so many management services on top of the basic registry functions. It manages two types of vocabularies:

  • Value vocabularies: unordered lists of values (terms) that we express as SKOS concept schemes in RDF and as simple enumerations in XML Schema
  • Class/Property vocabularies: lists of classes and properties (or attributes, depending on your mental model) that we currently express as instances of rdfs:Class and rdf:Property
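
To give a feel for the two expressions of a value vocabulary, here is a minimal sketch that renders the same term list both ways. It is not the Registry's output code: the scheme URI, type name and token are placeholders, and it assumes labels are plain English strings that need no escaping.

```python
def to_skos_turtle(scheme_uri, terms):
    """Render (token, label) pairs as a SKOS concept scheme in Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{scheme_uri}> a skos:ConceptScheme .",
    ]
    for token, label in terms:
        lines.append("")
        lines.append(f"<{scheme_uri}/{token}> a skos:Concept ;")
        lines.append(f'    skos:prefLabel "{label}"@en ;')
        lines.append(f"    skos:inScheme <{scheme_uri}> .")
    return "\n".join(lines)


def to_xsd_enumeration(type_name, terms):
    """Render the same labels as a simple enumeration in XML Schema."""
    lines = [
        f'<xs:simpleType name="{type_name}" '
        'xmlns:xs="http://www.w3.org/2001/XMLSchema">',
        '  <xs:restriction base="xs:string">',
    ]
    for _, label in terms:
        lines.append(f'    <xs:enumeration value="{label}"/>')
    lines.append("  </xs:restriction>")
    lines.append("</xs:simpleType>")
    return "\n".join(lines)


# Placeholder scheme URI and token; the label is a real Content Type term.
terms = [("computerProgram", "computer program")]
print(to_skos_turtle("http://example.org/vocab/contentType", terms))
print(to_xsd_enumeration("ContentType", terms))
```

The enumeration keeps only the labels, which is a fair picture of how much less the XML Schema form can say.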

Much of our terminology (value vocabularies, metadata schema, application profile) stems from our work with the Dublin Core Community more than the Semantic Web Community and maybe we’ll refactor some of those names as we move forward. But we hope the semweb folks can translate and we hope that the DC folks won’t hold our ultimate departure from some of their terms against us.

In the meantime, feel free to play in the sandbox.

Makes my head hurt

I was talking with Diane this morning about building the schema portion of the Registry and I feel the need to write down some of what we discussed.

For purposes of discussion, we have a draft schema property interface that defines some basic metadata schema property properties. We started the conversation because I was trying to get away from the “property property” nomenclature and because I couldn’t quite figure out the best way to extend the too-simple model to incorporate repeatable, typed notes/annotations.

Over the course of the discussion we came to a few conclusions:

  • What we’re really discussing is an Application Profile in the old DC sense of that term (it has since been changed to “Description Set Profile” to reflect the more DCAM-centric viewpoint of the current DC Community) in which we’re defining schema property restrictions, namespaces, and usage requirements: There can be only one token, definition, label, type and they’re required; ‘Type’ utilizes a controlled vocabulary containing the concepts ‘property’ and ‘subproperty’; etc.
  • We have a Schema Properties Vocabulary registered that identifies these schema property description ‘terms’ as ‘concepts’, but this isn’t really correct because they’re actually properties of a metadata schema ‘property’ (and so we’re back to property properties <sigh>) and as such they should be registered as an Application Profile rather than a Vocabulary.
  • The properties of each schema we register should be based on its own Application Profile, since there will be many different requirements and we’d like to provide some flexibility. For instance the RDA schema may need to have an additional property property that declares a relationship between the property and a FRBR entity.
  • We can’t register a Metadata Schema Properties Application Profile until we can register a Schema
  • In order to register a metadata schema we need a generic Metadata Schema Properties Application Profile
  • We’re stuck with “property properties”
  • This stuff makes my head hurt

In the interest of moving forward, stopping the spinning, and headache relief we’re going to pretend that a generic Metadata Schema Property Application Profile (MSPAP — pronounced ‘ems-pap’) exists and slap something together and make the interface fairly inflexibly tied to it. At some point in the future we’ll (hopefully) make it flexible enough to be based on any registered MSPAP.
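
For what it’s worth, the handful of restrictions listed in the first bullet above are easy enough to hard-code. Here is a rough sketch of that "inflexibly tied" approach, written in Python for illustration rather than taken from the Registry; the function and variable names are made up.

```python
# The restrictions from the generic profile: exactly one of each, all required.
REQUIRED_ONCE = ("token", "definition", "label", "type")
# 'Type' draws on a controlled vocabulary with these two concepts.
TYPE_VOCABULARY = {"property", "subproperty"}


def validate(statements):
    """Check one schema property description against the generic profile.

    `statements` is a list of (name, value) pairs. Returns a list of error
    messages, which is empty when the description conforms.
    """
    errors = []
    for name in REQUIRED_ONCE:
        count = sum(1 for n, _ in statements if n == name)
        if count != 1:
            errors.append(f"{name}: expected exactly 1, found {count}")
    for name, value in statements:
        if name == "type" and value not in TYPE_VOCABULARY:
            errors.append(f"type: {value!r} is not in the controlled vocabulary")
    return errors


description = [
    ("token", "title"),
    ("label", "Title"),
    ("definition", "A name given to the resource."),
    ("type", "property"),
    ("note", "Notes are repeatable and not restricted here."),
]
print(validate(description))
```

Making this flexible means reading REQUIRED_ONCE and TYPE_VOCABULARY from a registered profile instead of from constants, which is exactly the chicken-and-egg part.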