Separate Clouds

A blog by Ewan Klein

Vocabulary Hacking with SPARQL and UMBEL

In my blog post on finding commodity terms for Trading Consequences, I described how one of the project’s tasks was to carry out Named Entity Recognition of commodities in digitised historical texts. In this post, I want to describe in a bit more technical detail some of the steps we have been taking to carry out this task. Warning: this is not a distilled set of take-away lessons, but more of an in-progress log.

Requirements

We decided initially that we should construct a thesaurus which would use relevant terms from an existing vocabulary and supplement it with more obscure or archaic terms that were discovered in the course of reviewing relevant historical documents. An important design consideration was to include a limited amount of hierarchical structure in order to support querying, both in the database interface and also in the visualisation process. For example, it ought to be possible to summarise the export of limes, apples and oranges under the label Fruit. We also wanted to be able to add other properties to terms, such as noting that both nuts and whales are a SourceOf Oil. Finally, it was important to be able to list multiple alternative forms for the same commodity; for example, rubber might be referred to in several ways, including not just “rubber” but also “India rubber”, “caoutchouc” and “caouchouc”. These factors made SKOS (Simple Knowledge Organization System) an obvious choice of framework for organising the thesaurus.

The following diagram (from SKOS Core Guide 2005) illustrates preferred and alternative lexical labels attached to a concept:

SKOS Preferred and Alternative Lexical Labels

Hierarchical relations between concepts in SKOS are expressed in terms of the skos:broader property (or its inverse, skos:narrower) as seen here (also from SKOS Core Guide 2005):

The skos:broader property

We can read the bottom part of this diagram as saying that the concept Mammals ‘has a broader concept’ Animals.

Although skos:broader is not transitive, SKOS also contains the property skos:broaderTransitive, which is, and this is what we will be using. A concept in SKOS is explicitly intended to be a fuzzier notion than a class of things. Nevertheless, we’ll be using concepts as though they were classes, and consequently, ex:mammals skos:broaderTransitive ex:animals is effectively equivalent to saying that Mammals is a subclass of Animals.
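To make the labelling and hierarchy concrete, here is a minimal sketch of how the rubber and mammals examples might look in Turtle, written to a file from the shell. The ex: namespace, the concept names and the file name thesaurus_sketch.ttl are made up for illustration; they are not taken from the project's actual thesaurus.

```shell
# Illustrative only: the ex: namespace and concept names are hypothetical.
cat > thesaurus_sketch.ttl <<'EOF'
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/commodities#> .

# One concept with one preferred label and several alternative labels.
ex:rubber a skos:Concept ;
    skos:prefLabel "rubber"@en ;
    skos:altLabel  "India rubber"@en , "caoutchouc"@en , "caouchouc"@en .

# Mammals 'has a broader concept' Animals.
ex:animals a skos:Concept ;
    skos:prefLabel "animals"@en .

ex:mammals a skos:Concept ;
    skos:prefLabel "mammals"@en ;
    skos:broaderTransitive ex:animals .
EOF

# Quick sanity check: count the lines that assert a lexical label.
grep -c -E 'skos:(prefLabel|altLabel)' thesaurus_sketch.ttl
```

Each alternative spelling hangs off the same concept, so a gazetteer lookup on any of them leads back to ex:rubber.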

The Base Vocabulary

I looked at existing vocabularies already represented in SKOS which could be taken as a starting point. I decided, without a huge amount of investigation, to take the UMBEL upper ontology as the starting point for our base vocabulary. It seems big enough to provide a good basis, and is small enough (just under 120 MB) to download from GitHub.

After poking around looking at the file structure, it seemed that everything we wanted was contained in umbel_reference_concepts.n3. Note that this is an RDF file, serialised in Notation 3 format (best thought of as Turtle, which is a subset of Notation 3 that has become the de facto alternative to RDF/XML). The data model for RDF is a graph in which ‘subjects’ are related by properties to ‘objects’. While UMBEL uses SKOS concepts and properties, it supplements these with selected properties from the RDF Schema language (RDFS; see http://www.w3.org/TR/rdf-schema/).

Trying to get an overview of what was in UMBEL was an initial challenge. Although I spent a little time looking at SKOS editing tools, nothing stood out as an obvious contender in terms of ease of installation, broad adoption and relevant functionality. However, the Free Edition of TopBraid Composer worked well for browsing umbel_reference_concepts.n3. The following screenshot illustrates the interface.

Screenshot of TopBraid Composer

Pruning UMBEL

In order to construct the base vocabulary, I wanted to extract a subset of the SKOS structure that only dealt with concepts that were relevant to the commodity domain; these seemed to be all subclasses of the concepts Animals, Plants and Natural Substances. This can be carried out using the SPARQL query language. As well as supporting SQL-like queries which return tuples of values, SPARQL has a CONSTRUCT operator which returns an RDF graph. The query shown below will return all subject-predicate-object triples where the subject (shown as the variable ?s) is a subclass of Plant, Animal or NaturalSubstance:
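A query along these lines could be sketched as follows and saved as subgraph.rq. This is a reconstruction under assumptions, not necessarily the exact query used: it assumes that the hierarchy is expressed with rdfs:subClassOf, that the UMBEL reference concepts live in the http://umbel.org/umbel/rc/ namespace (abbreviated here with a locally chosen rc: prefix), and that the query engine supports SPARQL 1.1 property paths and VALUES.

```shell
# Sketch only: the rc: namespace and the use of rdfs:subClassOf are assumptions.
cat > subgraph.rq <<'EOF'
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rc:   <http://umbel.org/umbel/rc/>

# Return every triple whose subject sits somewhere below one of the three roots.
CONSTRUCT { ?s ?p ?o }
WHERE {
  VALUES ?root { rc:Plant rc:Animal rc:NaturalSubstance }
  ?s rdfs:subClassOf+ ?root .   # one or more subclass steps up to a root
  ?s ?p ?o .
}
EOF
```

The + path operator follows the subclass relation through any number of steps, so deeply nested concepts are picked up without listing intermediate classes.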

Of course, to actually execute the query, we need to do some more work. While TopBraid Composer allows you to run SPARQL queries in a GUI pane, the results are saved using the SPARQL Query Results XML Format, which was less convenient for me than CSV. In addition, I wanted to be able to run the queries programmatically rather than via a GUI, and to be able to save them in a version control system. In the past, I’ve had good results using the Jena ARQ library, so I was disposed to try this again. However, in the last couple of years, Jena has migrated from SourceForge to being first an Apache Incubator project and, since April 2012, a top-level Apache project. In the spirit of adventure, rather than just using the ARQ query engine again, I decided to have a crack at running the Fuseki SPARQL server. This turned out to be very simple to install, and to query over HTTP. Assuming that Fuseki is running on the default port 3030 and that the SPARQL query is contained in the file subgraph.rq, this command will execute the query and save the results in subgraph.ttl:
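One way to issue such a request with curl is sketched below. The dataset name umbel in the endpoint URL and the helper name run_construct are assumptions made for illustration; the real dataset path depends on how the server was started.

```shell
# Hypothetical endpoint: "umbel" stands in for whatever dataset name
# the Fuseki server was started with.
ENDPOINT="http://localhost:3030/umbel/query"

# run_construct QUERY_FILE OUTPUT_FILE
# POSTs the query as a form-encoded "query" parameter and asks for Turtle back.
run_construct() {
    curl --silent --fail \
         --header "Accept: text/turtle" \
         --data-urlencode "query@$1" \
         "$ENDPOINT" > "$2"
}

# Usage, once the server is running:
#   run_construct subgraph.rq subgraph.ttl
```

Keeping the query in its own file means it can be committed to version control and re-run unchanged against a different endpoint.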

Finding Lexical Labels in UMBEL

Now that I had extracted the relevant subgraph from UMBEL, I needed another SPARQL query to identify lexical labels; these could then be converted into a gazetteer and incorporated into the text processing pipeline. The following query extracts tuples consisting of the SKOS concept, the preferred and alternative lexical labels, and any broader concepts:
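A label-extraction query of this shape might be sketched as follows. The file name labels.rq, the variable names, and the choice of skos:broaderTransitive for the hierarchy are assumptions; the extracted graph may well express the hierarchy with rdfs:subClassOf instead, in which case that property would be substituted.

```shell
# Sketch only: file name, variable names and hierarchy property are assumptions.
cat > labels.rq <<'EOF'
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# One row per combination of concept, preferred label,
# alternative label and broader concept.
SELECT ?concept ?prefLabel ?altLabel ?broader
WHERE {
  ?concept skos:prefLabel ?prefLabel .
  OPTIONAL { ?concept skos:altLabel ?altLabel }
  OPTIONAL { ?concept skos:broaderTransitive ?broader }
}
ORDER BY ?concept
EOF
```

Because every OPTIONAL match produces its own row, a concept with three alternative labels shows up on at least three rows. Note that this sketch requires a preferred label, so concepts without one would be dropped; wrapping that pattern in OPTIONAL as well would need some other mandatory pattern to bind ?concept.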

The best way to understand this query is to look at the output, the first 10 lines of which are as follows (slightly simplified to use prefixed names in place of full URIs):

There is a lot of redundancy in this format, since each separate item of information requires a separate row in the results. For example, rows 5–7 contain the information that “herbaceous plant” has three alternative labels, namely “herb”, “herbs” and “herbaceous plants”. How many distinct lexical labels are there altogether? We can use the following Unix command line to extract and count all the unique items that occur in the second field position (preferred labels) of a row (and similarly for the other three fields by changing the -f argument of cut):
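The counting step can be sketched with cut, sort and grep. The example below assumes tab-separated results and builds a tiny sample file so that it can be run on its own; the file name sample_results.tsv, the ex: identifiers and the rubber row are invented for illustration, while the herbaceous plant labels follow the example in the text.

```shell
# Build a small tab-separated sample (illustrative data, not real query output).
printf '%s\t%s\t%s\t%s\n' \
    'ex:HerbaceousPlant' 'herbaceous plant' 'herb'              'ex:Plant' \
    'ex:HerbaceousPlant' 'herbaceous plant' 'herbs'             'ex:Plant' \
    'ex:HerbaceousPlant' 'herbaceous plant' 'herbaceous plants' 'ex:Plant' \
    'ex:Rubber'          'rubber'           'caoutchouc'        'ex:NaturalSubstance' \
    > sample_results.tsv

# count_unique FIELD FILE
# Prints the number of distinct non-empty values in the given tab-separated field.
count_unique() {
    cut -f "$1" "$2" | sort -u | grep -c .
}

# Report the count for each of the four fields.
for field in 1 2 3 4; do
    printf 'field %s: %s\n' "$field" "$(count_unique "$field" sample_results.tsv)"
done
```

Using grep -c . rather than wc -l skips empty values, which matters when an OPTIONAL part of the query leaves a field blank. For comma-separated output, cut would need -d, and quoted fields containing commas would call for a proper CSV parser.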

The counts of the different items are as follows:

classes 3,445
preferred labels 3,414
alternative labels 5,904
superclasses 898

Next Steps

As I mentioned at the start of this post, we are supplementing the vocabulary derived from UMBEL with terms derived from nineteenth-century documentary sources. Jim Clifford has been focussing on capturing and transcribing data from annual Customs reports on the quantity and value of goods arriving in Britain each year. Here’s part of a page showing imports of “Ammunition: Shot, Large and Small”:

Fragment of 1898 Customs record

In a follow-up post, I will describe how we are converting this additional set of terms into a SKOS-compatible form, and how we are integrating it with the base vocabulary from UMBEL.
