to dada, or not to dada

A Primer on the Semantic Web and Linked Data

The Semantic Web is a vision to make data on the Web interoperable and understandable to both humans and machines.

Unfortunately, the very idea of the Semantic Web is not easily understandable. RDF, XML, Linked Data, OWL, SPARQL, FOAF and a whole bunch of other abbreviations form a cloud of terms and concepts that makes it hard to get to the core ideas of the Semantic Web.

This is a short introduction to the Semantic Web with the hope of convincing you that its core ideas are surprisingly simple.

We approach the topic from the perspective of a data modeler who has been given the task of creating an information system for touristic data.

The initial motivational section requires basic knowledge of relational databases. If you want to skip ahead, you can start reading at the second section.

Touristic Data

A touristic information system should be able to handle data such as:

  • Hotels, hostels, guesthouses
  • Restaurants
  • Cultural events
  • Points of interest
  • Outdoor activities
  • Real-time data such as weather or road status

A well-proven way of handling such data is with a relational database. We can define tables for each entity type (e.g. Restaurant, Hotel, Hiking Trail).

We realize that there are links between the entities in the data that we care about. For example:

  • A restaurant may be the venue for a cultural event
  • A point of interest may lie on a hiking trail
  • Accessibility to a hotel may depend on the status of a road (e.g. closed because of snow)

These links can be created in a relational database by explicitly defining relationships with foreign keys and possibly intermediate entities such as venues.

The data model quickly becomes very complicated. Furthermore, if in the future we need to add another data type (e.g. sport events), we would have to introduce a new table and a lot of new foreign keys all over our data model to be able to represent new relationships we might care about.

Is there another way of representing and processing such data?

Let's try using a graph.

A simpler data model: A graph

Let us try and model the data as a graph. We use nodes to describe concrete entities and abstract concepts (such as restaurant) and connect them with labeled directed edges.

We can say that "Pizzeria Dada" is a restaurant as follows:

Pizzeria Dada

If the Pizzeria Dada lies on a hiking trail we can extend the graph:

Pizzeria Dada lies on a hiking trail

And if Johnny Cash happens to drop by in town for a concert, we can extend the graph further:

Johnny Cash drops in for a concert

Expressing the data as a graph allows us to make interesting queries. For example, we could be interested in finding "all the concerts that are held along the hiking trail to Piz Blups". The answer to this query can be computed by traversing the graph and detecting certain shapes.
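To make the idea of "traversing the graph and detecting certain shapes" more concrete, here is a minimal sketch in Python. It assumes that the graph is stored as a set of (node, edge label, node) tuples. All node names and edge labels (including "Yodel Night" and "Hotel Central") are made up for illustration and are not taken from the figures above.

```python
# Each edge of the graph is a (source node, edge label, target node) tuple.
# All node names and edge labels are illustrative assumptions.
graph = {
    ("Pizzeria Dada", "is a", "Restaurant"),
    ("Pizzeria Dada", "lies on", "Hiking Trail to Piz Blups"),
    ("Hiking Trail to Piz Blups", "is a", "Hiking Trail"),
    ("Johnny Cash Concert", "is a", "Concert"),
    ("Johnny Cash Concert", "venue", "Pizzeria Dada"),
    ("Yodel Night", "is a", "Concert"),
    ("Yodel Night", "venue", "Hotel Central"),
}


def concerts_along(trail, edges):
    """Return the concerts whose venue lies on the given trail."""
    # Step 1: every node with a "lies on" edge pointing at the trail.
    places_on_trail = {s for (s, p, o) in edges if p == "lies on" and o == trail}
    # Step 2: every node that is declared to be a concert.
    concerts = {s for (s, p, o) in edges if p == "is a" and o == "Concert"}
    # Step 3: concerts with a "venue" edge into one of those places.
    return sorted(
        s for (s, p, o) in edges
        if p == "venue" and s in concerts and o in places_on_trail
    )


print(concerts_along("Hiking Trail to Piz Blups", graph))
# prints ['Johnny Cash Concert']
```

The "shape" we detect here is a concert node that points to a venue, which in turn points to the trail. Adding a new kind of entity (say, sport events) only means adding more edges. The existing edges and the query stay untouched.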

Triples

The basic unit of this data representation is a pair of nodes connected by a labeled edge. This basic unit is called a triple. The node from which the edge goes out is called the subject, the label of the edge is called the predicate, and the node receiving the edge is called the object.

A triple

Our data model is a set of triples.

Note that subjects can have multiple outgoing edges:

A triple

and the subject of one triple can be the object of another triple:

A triple
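One possible way to write this down in code is sketched below. The `Triple` type and the helper functions `outgoing` and `incoming` are hypothetical names chosen for this example, and the node names are again only illustrative.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One labeled edge: subject --predicate--> object."""
    subject: str
    predicate: str
    object: str


# Our data model is simply a set of triples (illustrative names).
data = {
    Triple("Pizzeria Dada", "is a", "Restaurant"),
    Triple("Pizzeria Dada", "lies on", "Hiking Trail"),
    Triple("Johnny Cash Concert", "venue", "Pizzeria Dada"),
}


def outgoing(node, triples):
    """All triples in which the node is the subject."""
    return sorted(t for t in triples if t.subject == node)


def incoming(node, triples):
    """All triples in which the node is the object."""
    return sorted(t for t in triples if t.object == node)


# "Pizzeria Dada" has two outgoing edges ...
print(len(outgoing("Pizzeria Dada", data)))  # prints 2
# ... and is also the object of a triple about the concert.
print(incoming("Pizzeria Dada", data)[0].subject)  # prints Johnny Cash Concert
```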

Naming

Consider the situation where a node "Johnny Cash Live" was added to our graph:

Johnny Cash Live

Unfortunately, both "Johnny Cash Live" and "Johnny Cash Concert" refer to the same concert. The naming of the nodes is ambiguous. We need a naming convention.

For instance we could use Uniform Resource Identifiers (URIs) to name things:

URIs

Think of a URI as an address for a resource, much like a Uniform Resource Locator (URL). URIs are a generalization of URLs, but we won't care about that distinction now.

By naming the node with a URI we've lost the human-readable name. Let's add it back with another triple:

Add name literal

We also use URIs to name abstract concepts and properties:

URI everything

The node "Johnny Cash Live" remains as it is. It does not refer to anything, but is a value that describes something. We call such a node a literal.

We used names starting with http://schema.org and http://www.w3.org/1999/02/22-rdf-syntax-ns. These are names that have been previously agreed upon by a community. As we will see later, there are advantages in using commonly used names.
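The difference between a node that refers to something (a URI) and a node that is just a value (a literal) can be sketched in code as follows. The classes `URI` and `Literal`, the helper `names_of`, and the concert URI under http://example.org are assumptions made for this example. The schema.org and RDF terms are real vocabulary terms, but they are not necessarily the ones shown in the figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    """A node or predicate that refers to something."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain value that describes something."""
    value: str


# Names agreed upon by a community.
RDF_TYPE = URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
SCHEMA_NAME = URI("http://schema.org/name")
SCHEMA_MUSIC_EVENT = URI("http://schema.org/MusicEvent")

# Hypothetical URI that names the concert unambiguously.
concert = URI("http://example.org/events/johnny-cash-live")

triples = {
    (concert, RDF_TYPE, SCHEMA_MUSIC_EVENT),
    (concert, SCHEMA_NAME, Literal("Johnny Cash Live")),
}


def names_of(node, data):
    """Human-readable names attached to a node via schema.org/name."""
    return sorted(
        o.value for (s, p, o) in data
        if s == node and p == SCHEMA_NAME and isinstance(o, Literal)
    )


print(names_of(concert, triples))  # prints ['Johnny Cash Live']
```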

Linked Data

The basic unit of our data model is the triple, consisting of a subject, a predicate and an object. We name things and properties with URIs.

Data model

Recall that URIs are addresses for resources. What if at the address we used as a name for our nodes there is another graph that extends our graph?

For example we can have a graph that has a node http://somewhere-else.org/something:

Remote content

If at the address http://somewhere-else.org/something we can retrieve data that is also in the graph form that we have seen, for example:

Remote content

Then we could combine the graphs:

Remote content

Therefore, by using addresses (URIs) as names for nodes we are able to extend our graph with remote data. We can link our data with other data. This is called Linked Data.
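Combining graphs amounts to taking the union of two sets of triples. The sketch below simulates this without any network access: the dictionary `REMOTE_DATA` is a stand-in for the Web, and `fetch` and `extend` are hypothetical helpers. The URI http://example.org/pizzeria-dada and the triples themselves are assumptions for illustration.

```python
# Our local graph mentions a node that lives somewhere else.
local_graph = {
    (
        "http://example.org/pizzeria-dada",
        "http://schema.org/event",
        "http://somewhere-else.org/something",
    ),
}

# Stand-in for the Web: the triples we would get back when retrieving
# a URI. A real system would make an HTTP request here instead.
REMOTE_DATA = {
    "http://somewhere-else.org/something": {
        (
            "http://somewhere-else.org/something",
            "http://schema.org/name",
            "Johnny Cash Live",
        ),
    },
}


def fetch(uri):
    """Return the triples found at the given address (possibly none)."""
    return REMOTE_DATA.get(uri, set())


def extend(graph):
    """Add the remote triples found at every object of the graph."""
    extended = set(graph)
    for (_subject, _predicate, obj) in graph:
        # Objects with nothing behind them simply contribute no triples.
        extended |= fetch(obj)
    return extended


combined = extend(local_graph)
print(len(combined))  # prints 2
```

Because both graphs use the same URI for the shared node, the union connects them without any extra mapping step.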

Resource Description Framework

Let us recap our data model:

  • We model our data as graphs consisting of triples (subject, predicate, object)
  • We use URIs to name things
  • We can extend our graph by fetching more data from the URIs of things

The idea behind Linked Data is that we can link our data to other remote data sources and use the same tools and processes on the combined data.

The model that we introduced is called Resource Description Framework (RDF). RDF specifies the exact model and form of triples. It is a concrete specification of Linked Data.

RDF does not prescribe a particular serialization (a machine-readable format for writing down the graph). There is an XML serialization for RDF, a more human-readable format called Turtle, and a JSON serialization called JSON-LD. Unfortunately, these serializations are very complicated and hard to implement (especially the XML and JSON ones), which has caused a lot of confusion about RDF in general.
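To get a feeling for what a serialization does, here is a rough sketch that writes a single triple as one line of text, in the style of N-Triples (a line-based W3C format that is not among the ones mentioned above). The function `to_ntriples_line` is a hypothetical helper. It assumes that the URIs are already valid and handles only plain string literals, so it is far from a complete implementation.

```python
def to_ntriples_line(subject, predicate, obj, obj_is_literal=False):
    """Write one triple as a single N-Triples-style line.

    Simplifying assumptions: subject and predicate are valid URIs,
    and a literal is a plain string without language tag or datatype.
    """
    if obj_is_literal:
        # Escape the characters that would break the quoted string.
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj_text = f'"{escaped}"'
    else:
        obj_text = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_text} ."


# The concert URI is a made-up example.
print(to_ntriples_line(
    "http://example.org/events/johnny-cash-live",
    "http://schema.org/name",
    "Johnny Cash Live",
    obj_is_literal=True,
))
# prints <http://example.org/events/johnny-cash-live> <http://schema.org/name> "Johnny Cash Live" .
```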

Let us not worry about serialization for the moment and admire RDF as a simple and powerful data model.

Semantic Web

Using RDF allows us to link remote data sources together and use them in a unified way. This enables interoperability.

But what makes data understandable?

Unambiguous identifiers for things and predicates make the data understandable.

In RDF we use URIs to name things and to name the predicates that describe things. By using the same names to talk about the same thing in different contexts we make things understandable across contexts.

This seems straightforward when naming things. For example, we could use the URI http://example.org/Alice to uniquely identify the person Alice, and Alice would be described in two different data sets with the same URI. In this way, we understand that both data sets describe Alice and that their combination is a more complete description of Alice.

The same also holds for names of predicates. For example, the predicate http://schema.org/address can be used to describe the physical address of an item. If we want to send a letter to some subject we would need to check if the subject is in a triple with the predicate http://schema.org/address and send the letter to the object of that triple. The subject could be a person, an institution, or anything that could have a physical address. We don't need to care. We understand that if the subject has the address predicate, we probably can send a letter to it.
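The letter example can be sketched as follows. The helper functions `postal_addresses` and `can_send_letter`, the subjects under http://example.org and the address strings are all made up for illustration. Only the predicate http://schema.org/address comes from the text above.

```python
ADDRESS = "http://schema.org/address"
NAME = "http://schema.org/name"

# Illustrative data: a person, a restaurant and a mountain.
data = {
    ("http://example.org/Alice", NAME, "Alice"),
    ("http://example.org/Alice", ADDRESS, "1 Main Street, Exampleville"),
    ("http://example.org/PizzeriaDada", ADDRESS, "2 Trail Road, Exampleville"),
    ("http://example.org/PizBlups", NAME, "Piz Blups"),
}


def postal_addresses(subject, triples):
    """Objects of all triples (subject, schema.org/address, object)."""
    return sorted(o for (s, p, o) in triples if s == subject and p == ADDRESS)


def can_send_letter(subject, triples):
    """We can probably send a letter if the subject has an address."""
    return bool(postal_addresses(subject, triples))


print(can_send_letter("http://example.org/Alice", data))         # prints True
print(can_send_letter("http://example.org/PizzeriaDada", data))  # prints True
print(can_send_letter("http://example.org/PizBlups", data))      # prints False
```

Note that the code never asks what kind of thing the subject is. It only looks at the predicate.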

When combining different data sets it is necessary that the same names are used to describe the same things. If one data set uses the predicate http://schema.org/address and another uses http://noschema.com/postalAddress, then we won't be able to understand that they are both talking about the same thing.
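A small sketch shows what goes wrong. The two data sets, the subjects Alice and Bob under http://example.org and the helper `subjects_with` are assumptions for illustration.

```python
SCHEMA_ADDRESS = "http://schema.org/address"
OTHER_ADDRESS = "http://noschema.com/postalAddress"

# Two data sets that mean the same thing but use different names for it.
dataset_a = {("http://example.org/Alice", SCHEMA_ADDRESS, "1 Main Street")}
dataset_b = {("http://example.org/Bob", OTHER_ADDRESS, "2 Side Street")}

combined = dataset_a | dataset_b


def subjects_with(predicate, triples):
    """All subjects that appear in a triple with the given predicate."""
    return sorted(s for (s, p, _o) in triples if p == predicate)


# Bob has an address too, but a query for schema.org/address misses him.
print(subjects_with(SCHEMA_ADDRESS, combined))
# prints ['http://example.org/Alice']
```

The combined graph contains both addresses, but nothing in the data tells us that the two predicates mean the same thing.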

It is very important to agree upon a common vocabulary for naming things. There are several projects that aim to provide vocabularies to describe many things (see links at the bottom of this page).

Let us recap:

  • The Semantic Web is a vision to make data on the Web interoperable and understandable.
  • Linked Data is an idea on how multiple remote data sources can be made interoperable.
  • RDF is a concrete implementation of Linked Data.
  • Meaning and understanding come from using the same names for the same things.

That sounds great! Does RDF/Linked Data solve all my problems?

Probably not. Even if the data model is simple and powerful there are some drawbacks:

  • The tools and technology used to handle RDF/Linked Data are not as widely used - and maybe not as usable - as other stacks (e.g. relational databases).
  • The data model is conceptually different from the more prevalent relational model and requires some mental effort to get used to.
  • Widely used serialization formats of RDF are extremely complicated and difficult to implement (RDF/XML, JSON-LD).
  • Agreeing upon a vocabulary to describe data is hard and requires a lot of social collaboration.

Nevertheless, Linked Data is extremely well suited for distributed systems where data interoperability is a high priority.

In a future post titled The Semantic Social Network I hope to illustrate how the Semantic Web and Linked Data can be used to create structured content in federated and decentralized networks in a crowd-sourced manner.

Further reading

pukkamustard, Tue 19 November 2019