All posts by Jon

Activity feeds fixed

Part of our server move was prompted by the ever-increasing indexing of our many, many activity feeds.

In addition to the overall history feed on the front page, each vocabulary has a history feed:
Metadata Registry Change History for Element Set: ‘MARC 21 Elements 00X’

Each element or concept also has its own history feed:
Metadata Registry Change History for Element: ‘Biography of Books’ in Element Set: ‘MARC 21 Elements 00X’

Each individual statement has a history feed (yes, we are indeed as crazy as you may suspect we are at this point):
Metadata Registry Change History for Property: ‘label’ of Element: ‘Biography of Books’ in Element Set: ‘MARC 21 Elements 00X’

We believe that it’s important for vocabulary management systems to track changes (what, who, when) right down to the statement level. We have no idea whether there’s any utility in a history feed that fine-grained, but since we track the data anyway it seemed reasonable to offer a feed, even at that level, in RSS 1.0 (RDF), RSS 2.0, and Atom, no less.

This means we have thousands of feeds, and they are being crawled heavily by various search engines, far more often than the resource changes they represent would justify. This generates a lot of bandwidth and processing overhead. We’re working on a number of different solutions. The first one we tried, implementing app-level managed caching, didn’t do the trick, and we spent quite a bit of time fiddling with different configuration options. In tinkering with that, we broke the feeds in a way that was very hard to track down, since they passed our integration tests, worked on our staging server, and then broke on the production server… sometimes.

So they’re fixed now, and we’re exploring other caching options.
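
For the curious, one of the options on the list is caching the feeds at the proxy layer, so crawlers mostly hit a cache instead of the application. Here’s a rough nginx sketch of the idea; the paths, zone name, feed extensions, and upstream are all invented, and this is not our actual config:

# Hypothetical sketch only; these directives live inside the http {} block.
proxy_cache_path /var/cache/nginx/feeds levels=1:2 keys_zone=feeds:10m
                 max_size=1g inactive=60m;

server {
    listen 80;

    # Cache generated history feeds so crawlers mostly hit the cache.
    location ~ \.(rss|atom|rdf)$ {
        proxy_cache       feeds;
        proxy_cache_valid 200 15m;                 # keep good responses 15 minutes
        proxy_cache_use_stale error timeout;       # serve stale copies if the app chokes
        proxy_pass        http://registry_backend; # hypothetical upstream name
    }
}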

Password reset fixed

One of the side effects of our server move was the need to reconfigure our locally hosted transactional email services, which were always a little flaky. If you have tried to reset your password lately, you will no doubt have noticed that the promised email never arrived.

That’s fixed now. We switched from self-hosting to Postmark, which works very well and should be far more stable.

Server move in progress

…from Rackspace to DigitalOcean, from one large-ish server to several smaller ones, from a relatively prehistoric version of Red Hat to the latest Ubuntu 14.04 LTS, from Zend Server+Apache to nginx (OpenResty, actually), and from PHP 5.2 to PHP 5.5. It really wasn’t as painful as it sounds like it should have been. As part of the move, we also made some improvements in error and uptime reporting, integrated into a Slack channel for near-real-time monitoring.

Actually, the move is pretty much complete, although we’re still tinkering a bit with the caching and load-balancing setup, since we couldn’t afford to completely replicate the entire server stack in a staging environment. We apologize for any brief outages you may experience while we do that.

Getting browser/proxy/server caching right is actually taking more time than we anticipated. We’ll let you know here as soon as we turn our attention to other things. In the meantime, if you see something, please say something — we’ve got issues!

Server issues

Normally the Registry just hums along quietly and doesn’t demand too much attention. But the last system update we performed seems to have altered our memory usage pretty dramatically, and we’re quite suddenly having out-of-memory issues and some heavy swapping. We’ve expanded the server capacity twice already as a stopgap while we investigate, but before we move to an even larger server we’re testing some alternative configurations.

In the meantime there may continue to be periodic slowdowns, but you should see some improvement in a few days.

The last thing we want is for you to think we don’t see the problem, or that we’re ignoring it.

Whoops

Until lately we’ve been pretty happy with our hosting provider, DreamHost. But a few months ago, after several during-the-presentation meltdowns of the Registry, we determined that we needed to move to a higher-performing, more reliable server. We could have done the easy thing and moved to a Virtual Private Server at DreamHost. Instead, we set up an entirely fresh server in the Rackspace Cloud and very carefully, with much testing, created a fresh instance of the Registry with greatly expanded data capacity, some updated code, and considerably more speed. So far, so good.

We had a self-imposed deadline of several weeks before the DCMI 2011 Conference in The Hague and completely missed it. This left us with the choice of waiting until after the conference to redirect our domain to the new server, or taking the risky step of switching domains just a few days before the conference. Of course, we didn’t wait. At which point we discovered that we couldn’t simply redirect our main domain to the new server but needed to redirect the subdomains as well, breaking our wiki and blog, which we had a great deal of difficulty restoring while on the road to DCMI.

But everything’s back to normal now, and even updated. We now resume our regular programming.

SPARQL queries

We’re using Benjamin Nowack’s excellent ARC libraries for a tiny bit of our RDF management. It may surprise you to know that we don’t use a triple store as our primary data store, but we do too many things with the data that we think would be cumbersome at best to manage exclusively in a triple store (a subject for another post someday). Still, last year we started nightly imports of the full Registry into the ARC RDF store and enabled a SPARQL endpoint, thinking that it might be useful.

Lately, we’ve heard a few folks wishing for better searching in the Registry, and since we’re actively building an entirely new version of the Registry in Drupal (a subject for another post someday) we’re loath to spend time doing any serious upgrading of the current Registry. But we have SPARQL!

Yesterday, as part of another conversation, a colleague helped me figure out what I think is a fairly useful search query. If you follow the link, you’ll be taken to the Registry’s SPARQL endpoint, which will display inline a list of all of the skos:Concepts in the RDA vocabularies that have no definitions. Well, 250 of them anyway, since that’s the arbitrary limit we’ve set on the endpoint. They’re not hyperlinked (which would be really useful), but it’s still good info.

The SPARQL query used to create the list:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?s
WHERE {
  GRAPH ?g { ?s ?p ?o . }
  OPTIONAL { ?s skos:definition ?Thing . }
  FILTER ( !bound(?Thing) )
  FILTER regex( str(?s), "^http://RDVocab.info/termList" )
}

…can be used to find any missing property (see the OPTIONAL clause), and the regex used in the FILTER can be modified to limit the search to any vocabulary, group of vocabularies, or domain. I’m not enough of a SPARQL expert (meaning I’m completely clueless) to know how to filter by language tag, but it should be possible, if not easy, to find skos:Concepts that have an English definition but no German definition (I look forward to your comments).
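
For the brave, here’s one standard-looking approach that a quick read of the SPARQL spec suggests, combining the lang() function with the same OPTIONAL/!bound trick. It’s untested against our ARC endpoint, so caveat emptor:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?s
WHERE {
  GRAPH ?g { ?s skos:definition ?en . }
  FILTER ( lang(?en) = "en" )
  OPTIONAL { ?s skos:definition ?de . FILTER ( lang(?de) = "de" ) }
  FILTER ( !bound(?de) )
  FILTER regex( str(?s), "^http://RDVocab.info/termList" )
}

If that behaves as intended, it should list concepts that have an English definition but no German one; swap the language tags to go the other way.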

LCSH, SKOS and subfields

This week, Karen Coyle wrote a post about LCSH as linked data: beyond “dash-dash”, which provoked a discussion on the id.loc.gov discussion list.

It seems to me that there are several memes at play in this conversation:

LCSH and SKOS

As Karen points out, LCSH is more than just a simple thesaurus. It’s also a set of instructions for building structured strings in a way that’s highly meaningful for ordering physical cards in a physical catalog. In addition, each string component has specific semantics related to its position in the string, so it’s possible, if everyone knows and agrees on the rules, to parse the string and derive the semantics of each individual component. The result is a pre-coordinated index string.

These stand-alone pre-coordinated strings are perhaps much less meaningful in the context of LOD, but this certainly doesn’t apply to the components. I think what Karen is pointing out is that, while it’s wonderful to have a subset of all of the components that can be used to construct LC Subject Headings published as LOD, there’s enough missing information to reduce the overall value. As I read it, she’s wishing for the missing semantics to be published as part of the LCSH linked data, and hoping that LC doesn’t rest on its well-earned laurels and call it a day.

Structured Strings

Dublin Core calls the rules that define a structured string a "Syntax Encoding Scheme" (SES), and basically that’s what the rules defining the construction of LC Subject Headings seem to be. It’s structurally no different from saying that the string "05/10/09", if interpreted as a date using an encoding scheme/mask of "mm/dd/yy", ‘means’ day 10 of the month of May in the year 2009 in the Gregorian calendar. Fascinatingly, that same ‘date’ can be expressed as a Julian Day of "2454962", but I digress.
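
A quick worked version of that example, in modern PHP with the calendar extension (nothing the Registry actually does, just an illustration):

<?php
// One string, two readings: parse "05/10/09" with the "mm/dd/yy" mask...
$d = DateTime::createFromFormat('!m/d/y', '05/10/09');
echo $d->format('Y-m-d'), "\n";        // 2009-05-10

// ...and the same Gregorian date expressed as a Julian Day number.
echo gregoriantojd(5, 10, 2009), "\n"; // 2454962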

As far as I can tell, no one has figured out a universally accepted (or any) way to define the semantic structure of an SES in a way that can be used by common semantic inference engines, and I don’t think that anyone in this discussion is asking for that. What’s needed is a way to say "Here’s a pre-coordinated string expressed as a skos:prefLabel, it has an identity, and here are its semantic components."

Additional data

So…

"Italy--History--1492-1559--Fiction"

…is expressed in http://id.loc.gov/authorities/sh2008115565#concept as…

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .  
@prefix terms: <http://purl.org/dc/terms/> .  
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://id.loc.gov/authorities/sh2008115565#concept>
    skos:prefLabel "Italy--History--1492-1559--Fiction"@en ; 
    rdf:type skos:Concept ;
    terms:modified "2008-03-15T08:10:27-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ; 
    terms:created "2008-03-14T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ; 
    owl:sameAs <info:lc/authorities/sh2008115565> ; 
    skos:inScheme
        <http://id.loc.gov/authorities#geographicNames> , 
        <http://id.loc.gov/authorities#conceptScheme> ; 
    terms:source "Work cat.: The family, 2001"@en . 

…and has a 151 field expressed in the authority file as…

151 __ |a Italy |x History |y 1492-1559 |v Fiction

…which has the additional minimal semantics of (sketched here with a made-up loc_id: namespace, just for illustration)…

<http://id.loc.gov/authorities/sh2008115565#concept>
    loc_id:type "Geographic Name" ; #note that this is also expressed as a skos:inScheme property
    loc_id:topicalDivision "History" ;
    loc_id:chronologicalSubdivision "1492-1559" ;
    loc_id:formSubdivision "Fiction" ;
    loc_id:geographicName "Italy" .

…and this might also be expressed as…

<http://id.loc.gov/authorities/sh2008115565#concept>
   loc_id:type <http://id.loc.gov/authorities/sh2002011429> ;
   loc_id:topicalDivision <http://id.loc.gov/authorities/sh85061212> ;
   loc_id:formSubdivision <http://id.loc.gov/authorities/sh85048050> ;
   loc_id:geographicName <http://id.loc.gov/authorities/n79021783> ;
   dc:temporal "1492-1559" ;
   dc:spatial <http://sws.geonames.org/3175395/> ;
   dc:spatial <http://id.loc.gov/authorities/n79021783> .

Making sure that those strings in the first example are expressed as resource identifiers is also something that I think Karen is asking for. (By the way, the ability to look up a label by URL at id.loc.gov is really useful.)

I should point out that Ed, Antoine, Clay, and Dan’s DC2008 paper on the conversion of LCSH to SKOS does cover the LCSH-to-SKOS mapping in some detail (see section 2.7), but doesn’t directly address the issue Karen is raising about mapping the explicit semantics of the subfields.

Heck of a job, Phippsy

It’s been a busy summer, but not on the Registry front.

We’re currently working on integrating the ARC library so we can handle RDF a bit more intelligently. This will give us import capability, a SPARQL endpoint, and the ability to express vocabularies in more RDF serializations. We’ve also made some improvements to our URI-building feature, adding support for ‘hash’ namespaces and tokenized identifiers (rather than simply numeric ones). This means that a URI like http://www.w3.org/2008/05/skos#Concept will be built for you properly, instead of your having to edit the current default http://www.w3.org/2008/05/skos/12345 to get what you want. None of this is even on the beta site yet, primarily because we haven’t had time to test it at all, and there are some things we know are still broken.

There’s also now a fairly simple PHP script that accesses the new Registry API to retrieve data remotely. You can see this in action at http://rdvocab.info/roles.rdf; there’s no data actually maintained on rdvocab.info, the data is retrieved from the Registry. We’re not publishing the script yet or documenting the API because, like so many things, they’re not quite finished: the script needs to be even simpler, tested with PHP 4, and less dependent on .htaccess, and the API needs a few more methods and also needs to require a key for some operations.
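
The general shape of the script is just a thin proxy over HTTP. A minimal sketch, with a made-up URL since the real API isn’t documented yet:

<?php
// Fetch RDF from the Registry and re-serve it from this domain.
// The URL below is hypothetical; the real Registry API isn't published yet.
$rdf = file_get_contents('http://example.org/registry/vocabularies/roles.rdf');
header('Content-Type: application/rdf+xml');
echo $rdf;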

Expect to see some of this stuff appear in early September.

The grant to work on the Registry runs out in September, but I’ll keep working on it and hope to have some collaborators. I’ve been pretty poor at creating a welcoming collaborative environment, networking, and promotion, so that may be a vain hope.

There’s a fairly long list of things yet to do, and some of them are major. Application profile management is the biggest, but we also need things like the ability to follow activity on a vocabulary, Twitter-style; more extensive control over notifications; and integrated discussions to help support the vocabulary development features. The ability to import, export, edit, re-import, and have changes tracked throughout the process is also pretty critical. We want very much to integrate the sandbox into the main Registry, at least integrating user registration and making it possible to easily move a vocabulary from the sandbox to the Registry. And there needs to be much more extensive help, better explanations of what’s going on, and a place to report bugs and make suggestions that integrates with Trac.

I’m off messing about in Canada on holiday for the next 2 weeks, so some of the things that I finished up this week will have to wait until I get back before they’re integrated into the site — I hate to potentially break things and then disappear.

What is a Taxonomy

Bob DuCharme is taking a course and is pleased to find a standard that defines taxonomy, quoting from the ANSI/NISO Z39.19 standard, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, and discussing the various classes of controlled vocabulary.

Well worth reading…

What is a taxonomy? – bobdc.blog

I’ve described ontologies to people as being like taxonomies, except that you (or more likely, people in your field) get to make up new, specialized relationships beyond those standardized for thesauri. For example, in legal publishing, a higher court ruling could have the relationship property “cite” to a lower court ruling, with potential values such as “overturns” or “affirms”.
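
In RDF terms, that legal-publishing example might look something like the following sketch (every name in it is invented for illustration):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix law:  <http://example.org/legal#> .

# Specialized relationships refine a generic citation property.
law:overturns rdfs:subPropertyOf law:cites .
law:affirms   rdfs:subPropertyOf law:cites .

<http://example.org/rulings/appeals-2009-42>
    law:overturns <http://example.org/rulings/district-2008-17> .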

Registry Installation Instructions

Jeepers, no posts for 3+ months and then two in one day! The truth is that I hadn’t realized the last post was still sitting in my drafts folder more than a month after I wrote it.

Moving on…

A number of folks have been interested in installing the Registry, especially since we’ve talked before about ‘easy installation’ being one of our design goals.

We’re pleased to announce that we have finally tweaked things to make a reasonably simple install from our Subversion repository possible, and have provided some hopefully simple instructions detailing how to get the Registry up and running. We don’t provide enough configuration options or instructions (yet) to fully customize the interface, so once it’s installed it’ll still look exactly like the Registry, just running on your server instead of ours.

Whenever we update the production server, we’ll tag that code in Subversion and update the link in the instructions (tying a string around my finger to help me remember as we speak), but there won’t be any other ‘release’ announcement unless we do something major.

Whenever we modify the database structure, we’ll provide an SQL script to alter the database with each release. These scripts will always modify the database as it was after the previous release, so if you skip releases you’ll need to run the scripts sequentially. But this will all be on the instructions page.
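
So, for example, someone jumping from release 1.0 straight to 1.2 would run both upgrade scripts in order; the file and database names here are hypothetical:

mysql -u registry -p registry_db < upgrade-1.0-to-1.1.sql
mysql -u registry -p registry_db < upgrade-1.1-to-1.2.sql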

We expect to update the production code quite often over the next few months.