For a few of us associated with libraries and data, myself included, Linked Data has been a topic of interest and evangelism for several years. For instance, I gave this presentation at IFLA 2010.
However, Linked Data and Linked Open Data have now arrived on the library agenda. Last summer, it was great to play a small part in the release of the British National Bibliography as Linked Data by the British Library – openly available via Talis and their Kasabi Platform. Late last year the Library of Congress announced that Linked Data and RDF were on their roadmap, soon followed by the report and plan from Stanford University with Linked Data at its core. More recently still, Europeana have opened up access to a large amount of cultural heritage data, including library data.
Even more recently, I note that OCLC, at their EMEA Regional Council Meeting in Birmingham this week, see Linked Data as an important topic on the library agenda.
The consequence of this rise in interest in library Linked Data is that the community is now exploring and debating how to migrate library records from formats such as MARC into this new RDF. In my opinion there is a great danger here of getting bogged down in the detail of how to represent every scintilla of information from a library record in every Linked Data view that might represent the thing that record describes. This is hardly surprising, as most of those engaged in the debate come from an experience where, if something was not preserved on a physical or virtual record card, it would be lost forever. By concentrating on record/format transformation, I believe that they are using a Linked Data telescope to view their problem, but are not necessarily looking through the correct end of that telescope.
Let me explain what I mean by this. There is massive duplication of information in library catalogues. For example, every library record describing a copy of a book about a certain boy wizard will contain one or more variations of the string of characters “Rowling, J. K.”. For us humans it is fairly easy to infer that all of them represent the same person, as described by each cataloguer. For a computer, they are just strings of characters.
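To make that concrete, here is a trivial Python sketch (the name variants are invented purely for illustration): three strings that a human reads as one person, versus one shared identifier that a machine can count on.

```python
# Variant cataloguer strings for the same person - to a computer,
# these are simply three different values (variants invented for illustration).
names = ["Rowling, J. K.", "Rowling, J.K., 1965-", "J. K. Rowling"]
print(len(set(names)))  # 3 - apparently three "different" creators

# The same records keyed on a shared identifier instead:
viaf_id = "http://viaf.org/viaf/116796842/"
creators = [viaf_id, viaf_id, viaf_id]
print(len(set(creators)))  # 1 - unambiguously one person
```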
OCLC host the Virtual International Authority File (VIAF) project, which draws together these strings of characters and produces a global identifier for each author. Associated with each author, they collect the local-language representations of their name.
One simple step down the Linked Data road would be to replace those strings of characters in those records with the relevant VIAF permalink, or URI – http://viaf.org/viaf/116796842/. One result of this would be that your system could follow that link and return an authoritative naming of that person, with the added benefit of it being available in several languages. A secondary, and more powerful, result is that any process scanning such records can identify exactly which [VIAF-identified] person is the creator, regardless of local language or formatting practices.
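As a rough sketch of what "following that link" could look like, here are a few lines of Python using the rdflib library. It assumes VIAF serves RDF/XML for the permalink via content negotiation, and that name variants appear under label-like properties – both worth verifying against VIAF's actual output before relying on this.

```python
# A minimal sketch using rdflib (https://rdflib.readthedocs.io/).
# Assumes the VIAF permalink dereferences to RDF/XML; check the actual
# vocabulary VIAF uses before building on this.
from rdflib import Graph

g = Graph()
g.parse("http://viaf.org/viaf/116796842/", format="xml")

for subj, pred, obj in g:
    # Print anything that looks like a naming of the person.
    if str(pred).endswith(("prefLabel", "altLabel", "name")):
        print(pred, "->", obj)
```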
Why stop at identifying only creators with globally unique identifiers? Why not use an identifier to represent the combined concept of a text, authored by a person, published by an organisation, in the form of a book – each of those elements having its own unique identifier. If you enabled such a system on the Linked Data web, what would a local library catalogue need to contain? Probably only a local identifier of some sort, with links to local information such as supplier, price, date of purchase, licence conditions, physical location, etc., plus a link to the global description provided by a respected source such as Open Library, Library of Congress, the British Library, OCLC, etc. A very different view of what might constitute a record in a local library.
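Here is a sketch, using rdflib again, of what such a minimal local record might look like. The ex: vocabulary, the item URI and the Open Library work URI are all hypothetical choices for illustration; a real implementation would reuse established vocabularies.

```python
# A sketch of the "very different" local record: a local identifier,
# purely local facts, and a link out to a global description.
# All URIs and the ex: vocabulary are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")
g = Graph()
g.bind("ex", EX)

item = URIRef("http://mylibrary.example.org/item/000123")
g.add((item, EX.describedBy, URIRef("http://openlibrary.org/works/OL82563W")))  # global description
g.add((item, EX.shelfLocation, Literal("Main Library, Floor 2, Bay 14")))
g.add((item, EX.dateOfPurchase, Literal("2011-06-01")))
g.add((item, EX.supplier, Literal("Acme Library Supplies")))

print(g.serialize(format="turtle"))
```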
So far I have looked at this from the library point of view. What about the view from the rest of the world?
I contend that most of those wishing to reference books and journal articles, curated and provided by libraries, would be happiest if they could refer to a global identifier that represents the concept of a particular work. Such consumers would only need a small sub-set of the data assembled by a library for basic display and indexing purposes – title, author. The next question may be: where is there a locally available copy of this book or article that I can access? In the model I describe, where these global identifiers are linked to local information such as loan status, that lookup would be a simple process compared with today's contrived searches against inferred strings of characters.
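Under that model, the "where can I get a copy?" question becomes a simple query against local holdings keyed on the global identifier. A sketch, reusing the hypothetical ex: vocabulary from above and an invented holdings.ttl file:

```python
# A sketch of the holdings lookup: given a global work identifier, find
# local copies and their loan status. Vocabulary, file and data are hypothetical.
from rdflib import Graph

g = Graph()
g.parse("holdings.ttl", format="turtle")  # local holdings linked to global IDs

results = g.query("""
    PREFIX ex: <http://example.org/vocab/>
    SELECT ?item ?status WHERE {
        ?item ex:describedBy <http://openlibrary.org/works/OL82563W> ;
              ex:loanStatus ?status .
    }
""")
for item, status in results:
    print(item, status)
```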
Currently, Google and other search engines have great difficulty in managing the massive number of library catalogue pages that will match a search for a book title. As referred to previously, Google are assembling a graph of related things. In this context the thing is the concept of the book or article, not the thousands of library catalogue pages describing the same thing.
Pulling these thoughts together, and looking down the Linked Data telescope from the non-library end, I envisage a layered approach to accessing library data (a rough code sketch follows the list):
- A simple global identifier, or interlinked identifiers from several respected sources, that represents the concept of a particular thing (book, article, etc.)
- A simple set of high-level descriptive information for each thing – links to author, title, etc. – associated with the identifier. This level of information would be sufficient for many uses on the web and could contain only publicly available information.
- For those wanting more in-depth bibliographic information, those unique identifiers, either directly or via sameAs links, could lead you to more of the rich resources catalogued by libraries around the world, which may or may not sit behind less open licensing or commercial constraints.
- Finally, library holdings/access information would be available, separate from the constraints of the bibliographic information but indexed by those global identifiers.
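To pull the layers together, here is a rough sketch of how a consumer might walk them, from global identifier down to holdings. Every URL and property here is hypothetical, and rdflib is just one convenient tool:

```python
# A sketch of the layered lookup described above; all URIs and
# properties are hypothetical.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

work = URIRef("http://openlibrary.org/works/OL82563W")

# Layers 1 and 2: dereference the identifier for the high-level,
# publicly available description (title, author links, etc.).
summary = Graph()
summary.parse(work)

# Layer 3: follow sameAs links to richer, possibly licensed, descriptions.
for _, _, richer in summary.triples((work, OWL.sameAs, None)):
    print("Richer description available at:", richer)

# Layer 4: holdings, indexed by the global identifier
# (see the SPARQL query sketched earlier).
```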
Getting to such a state will require a couple of changes in the way libraries do things.
Firstly, the rich data collated in current library records should be used to populate a Linked Data model of the things those records describe – not just to reproduce the records we have in another format. An approach I expanded upon in a previous post, Create Data Not Records.
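By way of illustration, here is a rough sketch of that record-to-data lifting, using the pymarc library to read MARC and rdflib to emit triples about the things. The file name, URIs and ex: vocabulary are hypothetical, and a real pipeline would resolve creator strings to identifiers such as VIAF URIs rather than carrying the strings across.

```python
# A sketch of "create data, not records": lift a couple of MARC fields
# into triples about the things, rather than copying the record wholesale.
# File name, URIs and vocabulary are invented for illustration.
from pymarc import MARCReader
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")
g = Graph()
g.bind("ex", EX)

with open("records.mrc", "rb") as fh:  # hypothetical file of MARC records
    for record in MARCReader(fh):
        ctrl = record.get_fields("001")
        if not ctrl:
            continue  # skip records without a control number
        # A real pipeline would mint or look up a global identifier for
        # the work; here we fake a stable local URI from the control number.
        work = URIRef("http://example.org/work/" + ctrl[0].value())

        titles = record.get_fields("245")
        if titles and titles[0]["a"]:
            g.add((work, EX.title, Literal(titles[0]["a"])))

        authors = record.get_fields("100")
        if authors and authors[0]["a"]:
            # Ideally this string would be resolved to a VIAF URI instead.
            g.add((work, EX.creatorName, Literal(authors[0]["a"])))

print(g.serialize(format="turtle"))
```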
Secondly, as such a change would be a massive undertaking, libraries will need to work together to do it. The centralised library data holders have a great opportunity to drive this forward. A few years ago, the distributed, hosted-on-site landscape of library management systems would have prevented such a change from happening. However, with library system software-as-a-service becoming an increasingly viable option for many, it is not the libraries that would have to change, just the suppliers of the systems they use.
Monkey picture from fPat on Flickr