When moving to a new place we bring loads of baggage and stuff from our old house that we feel will be necessary in our new abode. One poke of the head into your attic will clearly demonstrate how much of the necessary stuff was not so necessary after all. Yet you will also find things lurking around the house that may have survived several moves. The moral here is that, although some things are core to what we are and do, it is difficult to predict what we will need in a new situation.
This is also how it is when we move to doing things in a different way – describing our assets in RDF for instance.
I have watched many flounder when they first try to get their head around describing the things they already know in this new Linked Data format, RDF. Just like moving house, we initially grasp for the familiar and that might not always be helpful.
So. You understand your data, you are comfortable with XML, you recognise some familiar vocabularies that someone has published in RDF/XML, this recreation in RDF sounds simple-ish, you take a look at some other data in RDF/XML to see what it should look like, and…… Your brain freezes as you try to work out how you nest one vocabulary within another and how your schema should look.
This is where stepping back from the XML is a good idea. XML is only one encoding/transmission format for RDF; it is a way of encoding RDF so that it can be transmitted from one machine process to another. XML is ugly. XML is hierarchical and therefore introduces lots of compromises when imparting the graph nature of an RDF model. XML brings with it a hierarchical way of thinking about data that constrains your decisions when creating a model for your resources.
I suggest you not only step back from XML, you initially step back from the computer as well.
Get out your white/blackboard or flip-chart and pen, and start drawing some ellipses, rectangles, and arrows. You know your domain, so go ahead and draw a picture of it. Draw an ellipse for each significant thing in your domain – each type of object, concept, or event, etc. Draw a rectangle for each literal (string, date, number). Beware of strings that are really ‘things’ – things that have other attributes, including the string as a name attribute. Draw arrows to show the relationships between your things, and between things and their attributes. Label the arrows to define those relationships – don’t worry about vocabularies yet, use simple labels such as name, manufacturer, creator, publication event, discount, owner, etc. Create identifiers for your things in the form of URIs, so that they will be unique and accessible when you eventually publish them. What you end up with should look a bit like this slide from my recent presentation at the Semantic Tech and Business Conference in Berlin – the full deck is on SlideShare. Note how I have also represented the resources in the model in a machine readable form of RDF – yes you can now turn to the computer again.
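If it helps to see the whiteboard picture written down, here is a rough sketch of one way to capture it in Python before any vocabulary is chosen. The class names, the example.org URIs, and the book/author data are all made up for illustration; they are not taken from the slide.

```python
# A whiteboard model written down as labelled arrows.
# Every URI, label, and value below is a placeholder for illustration.
from typing import NamedTuple, Union


class Thing(NamedTuple):
    """An ellipse on the whiteboard: something identified by a URI."""
    uri: str


class Literal(NamedTuple):
    """A rectangle on the whiteboard: a plain string, date, or number."""
    value: str


class Statement(NamedTuple):
    """A labelled arrow from a thing to another thing or to a literal."""
    subject: Thing
    label: str
    obj: Union[Thing, Literal]


book = Thing("http://example.org/book/1")
author = Thing("http://example.org/person/1")

model = [
    Statement(book, "name", Literal("An Example Book")),
    # The creator is a thing with its own attributes, not just a string.
    Statement(book, "creator", author),
    Statement(author, "name", Literal("A. N. Author")),
]

for s in model:
    target = s.obj.uri if isinstance(s.obj, Thing) else repr(s.obj.value)
    print(f"{s.subject.uri} --{s.label}--> {target}")
```

Keeping the author as a Thing rather than a bare string is the ‘strings that are really things’ point in practice: the author can later gain more arrows of its own without the book's description changing.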
Once back at the computer you can now start referring to generally used vocabularies such as foaf, dcterms, etc. to identify generally recognised labels for relationships – an obvious candidate being foaf:name in this example. Move on to more domain specific vocabularies as you go. When you find a relationship not catered for elsewhere you may have to consider publishing your own to supplement the rest of the world.
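As a small illustration of that step, the sketch below maps whiteboard labels to property URIs. The foaf and dcterms namespace URIs are the published ones; the choice of dcterms:creator for the ‘creator’ label, the example.org fallback namespace, and the function name are my assumptions for the example.

```python
# Map simple whiteboard labels to properties from well-known vocabularies,
# falling back to a vocabulary of your own where nothing suitable exists.
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
OWN = "http://example.org/vocab/"  # hypothetical namespace for your own terms

LABEL_TO_PROPERTY = {
    "name": FOAF + "name",
    "creator": DCTERMS + "creator",
}


def property_uri(label: str) -> str:
    """Return a property URI for a whiteboard label."""
    if label in LABEL_TO_PROPERTY:
        return LABEL_TO_PROPERTY[label]
    # Not catered for elsewhere: mint a term in our own namespace.
    return OWN + label.replace(" ", "_")


for label in ("name", "creator", "publication event"):
    print(label, "->", property_uri(label))
```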
Once you have your model, it is often a simple bit of scripting to take data from your current form (CSV, database, XML record) and produce some simple files of triples in N-Triples format. Then use a tool like Raptor to transform it into good old ugly XML for transfer. Better still, take your N-Triples files and load them into a storage/publishing platform like Kasabi. This was the route the British Library took when they published the British National Bibliography as RDF.
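To give a feel for how little scripting this can be, here is a minimal sketch that turns CSV rows into N-Triples lines using only the Python standard library. The column names (id, name), the example.org base URI, and the sample data are assumptions for the example, and it assumes the id values are already safe to use inside a URI.

```python
# Turn CSV rows into N-Triples, one foaf:name triple per row.
# Column names, base URI, and sample data are illustrative assumptions.
import csv
import io

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def escape_literal(text):
    """Escape the characters that cannot appear raw in an N-Triples string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(rows, base_uri):
    """Yield one N-Triples line for each row's name."""
    for row in rows:
        subject = f"<{base_uri}{row['id']}>"  # assumes id is URI-safe
        name = escape_literal(row["name"])
        yield f'{subject} <{FOAF_NAME}> "{name}" .'


SAMPLE_CSV = 'id,name\n1,Alice\n2,"Bob ""the builder"""\n'

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
lines = list(rows_to_ntriples(reader, "http://example.org/person/"))
for line in lines:
    print(line)
```

In a real run you would read from your own file or database and write the lines out to a .nt file, ready for a converter or for loading into a store.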
Moving Brothers picture from Lawrence OP on Flickr