GeoPub: Datalog, RDF and IndexedDB

Thu 14 April 2022

As part of the openEngiadina project we are researching a semantic social network - a system for creating and sharing complex data. Our approach includes:

RDF to support the various types of data.
Content-addressing to make content robustly available.
The ActivityStreams vocabulary to describe social interactions on data.
XMPP as a federated transport protocol.

This approach is being tested in a proof-of-concept Web application: GeoPub. I am very happy to announce a new release of GeoPub (v0.5.0) that uses the query language Datalog and the Web-native database IndexedDB. This is a significant step towards decentralized and local-first applications for local knowledge.

A demo installation of GeoPub is available here (Please note that initial loading of GeoPub can take some time as sample data and vocabularies are loaded. Subsequent reloads should be much faster.)

In this post I'd like to give a brief overview of Datalog, how it is used in GeoPub and with IndexedDB.

Datalog

Datalog is database query language based on logic programming. In a nutshell it allows you to describe entailment rules which allow complex facts to be derived from more simple facts. Such rules form a Datalog program.

For example a program that computes the existence of paths from simple edges:

path(a,b) :- edge(a,b).
path(a,b) :- edge(a,c), path(c,b).

The program consists of two rules (or clauses) that define when path holds. Here path(a,b) holds if there is a direct edge between a and b or if there is an intermediary node c that connects a and b (the second rule). The terms a, b and c are variables that can be bound to any matching values. Given some database that stores the simple edge facts, this Datalog program can compute the transitive closure - i.e. all paths appearing in the graph of nodes connected with edges.

People familiar with Prolog will recognize the syntax of the program above, it is very similar to Prolog. However the semantics - the meaning - of a Datalog program is defined differently than for a Prolog program. Datalog turns out to be much better suited for querying databases. For example queries are guaranteed to terminate. On the other hand Datalog programs are strictly less expressive than Prolog programs. Datalog is not Turing-complete.

For a more in-depth introduction to Datalog I can highly recommend the paper What You Always Wanted to Know About Datalog (And Never Dared to Ask) (1989) or chapters 12 and 13 of Foundations of Databases (the Alice book). I have also given a talk on the topic: Logic Programming and Databases (2021).

Datalog and RDF

Given a database that stores RDF triples as triple(subject, predicate, object) we can query the triples using Datalog.

For example the Datalog predicate note defined by

note(s) :- triple(s, <rdf:type>, <as:Note>).

can be used to query for all objects that have type <as:Note> (placeholder for https://www.w3.org/ns/activitystreams#Note).

We may define other predicates:

content(c) :- note(s), triple(s, <as:content>, c).
withGeo(s) :- triple(s, <geo:lat>, lat), triple(s, <geo:long>, long).

The predicate content returns the content of all notes and withGeo holds if something has a latitude and longitude attached.

The usual way to query RDF is with SPARQL. There are many reasons why I think that Datalog is a much better query language:

Datalog is a much simpler language. If you have been able to follow the examples given above, then you already understand the entire syntax of Datalog.
Datalog allows recursive queries. SPARQL allows a form of recursion via property paths, however this is very limited (see Recursion in SPARQL).
Datalog can be extended with built-in predicates that perform arbitrary computation. This allows unpacking complex literal values for use in specialized indices (e.g. geospatial coordinates or full-text search) or computation of cryptographic functions during query evaluation (e.g. DMC). There are numerous SPARQL extensions that add similar functionalities, but they all require additional syntax and custom implementations.
There is a rich literature on how to optimize Datalog queries and many existing and historic applications on massive amounts of data.
Datalog programs compose naturally. Did you notice that content is defined by using the previously defined note predicate? This almost uncanny composability is a intrinsic feature of logic programming. See for example William Byrd's talk at FOSSDEM 2021 to see how nice this is.

Finally, Datalog can be implemented with reasonable effort. See the stand-alone OCaml implementation that we developed for openEngiadina: ocaml-datalogl.

Type Inference

Another huge advantage of Datalog over SPARQL is the ability to express complex inference rules.

For example RDF Schema defines a vocabulary for RDF classes and properties. This is used in vocabularies such as ActivityStreams that define classes and properties as such:

@prefix as: <https://www.w3.org/ns/activitystreams#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

as:Create a owl:Class ;
  rdfs:label "Create"@en ;
  rdfs:subClassOf as:Activity ;
  rdfs:comment "To Create Something"@en .

as:Like a owl:Class ;
  rdfs:label "Like"@en ;
  rdfs:subClassOf as:Activity ;
  rdfs:comment "To Like Something"@en .

The as:Create and as:Like classes are defined as sub-classes of as:Activity (using the rdfs:subClassOf property).

Real data is usually annotated with the most concrete and specific type:

@prefix ex: <https://example.com/> .

ex:activity1 a as:Create ;
  as:actor ex:actor .

ex:activity2 a as:Like;
  as:actor ex:actor2 .

Here ex:activity1 and ex:activity2 are specified to have class as:Create and as:Like respectively. This is how one usually encounters content on ActivityPub.

Sometimes we might only be interested in finding all activities (class as:Activity) regardless of the specific sub-class. Can we formulate a query that does this? With Datalog, yes:

rhodf(x, <rdf:type>, b) :- triple(a, <rdfs:subClassOf>, b), triple(x, <rdf:type>, a).

The predicate rhodf(x, <rdf:type>, b) (x has type b) holds if a is a sub-class of b and x has type a.

A query rhodf(x, <rdf:type>, <as:Activity>) will return all x that have the type of any sub-class of as:Activity. Exactly what we want.

Such inference rules are an important tool in being able to deal with a wide (and extensible) variety of data. We have implemented inference rules as described in Simple and Efficient Minimal RDFS (2009) in GeoPub.

IndexedDB

IndexedDB is a Web API that allows large amounts of content to be stored at client-side (i.e. in the browser). GeoPub now uses IndexedDB to store content. Datalog evaluation is performed directly on IndexedDB without having to first load the entire database to memory.

Content can be viewed and inspected locally (without any XMPP connectivity). It's still a long way to make GeoPub usable without any internet connectivity, but by using IndexedDB we have made a significant step towards being local-first.

I invite you to inspect the GeoPub database via the development tools in your browser (Firefox: "Storage" > "IndexedDB", Chromium: "Application" > "Storage" > IndexedDB"). By maintaining appropriate indices we are able to perform arbitrary queries efficiently.

Next Steps

This milestone is an important technical proof-of-concept. Much work is required in terms of usability. This will be the focus of the next GeoPub milestone where we hopefully will also be able to showcase some interesting use-cases based on the Valueflows vocabulary.

There is much room for improving the performance of IndexedDB and the Datalog implementation.

The openEngiadina project is slowly winding down and finishing up last milestones as part of the NLnet/NGI0 funding. Feel free to join our chat to stay up to date (via IRC or Matrix).

In the meantime we have secured funding for continuing the development of ERIS - a scheme for robust content-addressing and a spin-off from the openEngiadina project. Stay tuned for updates and feel free to join our mailing list.

Fin

It saddens me deeply to have to say that my fellow openEngiadina developer rustra is still under arrest in Belarus. Resist political repression and support the victims! Please consider donating to the Anarchist Black Cross Belarus or showing your support in any other way.

I look forward to questions and comments via mail (pukkamustard [at] posteo [dot] net). Thank you for your interest and keep safe!

The openEngiadina projects is supported by the NLnet Foundation trough the NGI0 Discovery Fund.

pukkamustard