GeoPub: A Multi-Model Database

Fri 01 July 2022

As part of the openEngiadina project we are researching technology for open and decentralized local knowledge.

Local knowledge comes in a variety of forms and shapes. In order to support this variety we use the RDF data model along with Datalog as a declarative and expressive query laguage (see this previous post).

In this post I'd like to describe the indexing strategies we use that allow efficient querying of local knowledge. In particular, I'd like to highlight how a database can support efficient graph, geo-spatial and full-text search queries at the same time - a multi-modal database. This is an important step towards search and discovery of local knowledge.

Everything described here is implemented in the GeoPub Web application. A demo installation of GeoPub is available here (note that initial loading takes very long).

Without further ado, let's dive into how GeoPub's multi-model database works.

Ordered Key-Value Stores

A very simple form of a database is a key-value store. A key-value store allows values to be persistently stored and accessed with some associated key. Given a key the value can usually be retrieved very efficiently, i.e. in $O (log (n))$ where $n$ is the number of values stored in the database.

Unfortunately, accessing individual values by their keys is not enough to allow interesting database queries.

Assume we have a database with $n$ key-value pairs $(k_{i}, v_{i})$ and some function $f : V \to N$ that maps values to natural numbers. Now we would like to find all the key-value pairs $(k_{i}, v_{i})$ such that $f (v_{i}) = 42$ .

In a simple key-value store there is no way of efficiently finding the subset of key-value pairs that fulfills such a property. We must iterate over all $n$ key-value pairs in the store and evaluate the function on every value individually.

The number of key-value pairs $c$ that we are interested in is usually much smaller than the number of stored key-value pairs $n$ . For example we might be looking for a property that is held by ten entries in a store of millions of entries. In order to be able to query our key-value store efficiently, we really need a way of finding these entries without having to iterate over all entries in the store. Luckily, we don't need to add much to the key-value store. We just need some order.

Order, order!

An ordered key-value store (OKVS) is a persistent storage that allows values to be stored with some associated key. The keys are sorted according to some order and given a range of keys the associated key-value pairs in the range can be accessed efficiently. If there are $c$ key-value pairs in a given range then the complexity of accessing the range is usually $O (log (n) + c)$ . We need to find the first entry in the range ( $O (log (n))$ ) and then iterate over all entries in the range ( $O (c)$ ). This is much better than a simple key-value store.

Note that such a range query can not be implemented efficiently in a simple key-value store. Checking if a key is within a range is a property that must be evaluated on every single stored key-value pair (like our function $f$ ).

But how does this help us query more efficiently?

Pre-Computing Properties

Instead of stroring values associated with the original key, we can instead store them associated with $f (v_{i})$ . The store is now populated with key-values of the form $(f (v_{i}), v_{i})$ .

We can efficiently query for all the values for which $f (v_{i}) = 42$ by querying for the range of keys with $42 \leq k \leq 42$ . In fact we can also query for more interesting ranges: For example all values $v_{i}$ with $13 \leq f (v_{i}) \leq 21$ .

This may seem like a cheap trick, but it is a very powerful way of making values efficiently queryable along some pre-defined properties (in our case the value of the function $f$ ). It is also a very widespread trick. This is what is called an index in database terminology.

The index is a projection of some properties we are interested in to a dimension that can be efficiently queried. In our case the dimension that can be efficiently queried is the intrinsic order of the key-value store. The function $f$ is a projection to this dimension. We will see some more concrete examples that will illustrate this idea much better. Don't worry too much about understanding this in the abstract now.

Of course our toy example is overly simplified and has some draw-backs. For one, we have lost the original keys (the $k_{i}$ ). We also might want to query along some other property (e.g. a function $g$ ). We will be able to solve both with some tricks.

Tuples as Keys

Most existing ordered key-value stores require keys (and values) to be byte sequences and the intrinsic order in which keys are sorted is the lexicographic byte order. That is b00 < b01 and b0 < b00.

We can use other types such as bools, integers, strings or compound types for keys as long as we can encode them to bytes. The only thing we must ensure is that the encoding preserves the order between keys.

We will mostly be using tuples as keys. For example a key could be $(1, 2, 3)$ or $(Lorem Ipsum, true, 32)$ . We use lexicographic order on tuples, e.g. $(1) < (1, 2) < (1, 5) < (42) < (strings are ordered higher than ints, 52)$ .

The fact that an order preserving encoding to byte sequences exists for tuples (and the types we use in tuples) is left as an exercise for the reader. Hint: For tuples try a concatenation of type-length-value encoded elements (e.g. CBOR).

Another basic type that will be useful is symbols. We will use symbols such as rdf or spo in key tuples, e.g.: $(rdf, spo, 1, 42, 98)$ .

Using symbols allows us to create named sub-spaces that can be queried efficiently. For example we can efficiently query for all keys in the $(rdf, spo)$ space that start with the values 1 to 12 by querying for the range of keys with $(rdf, spo, 1) \leq k \leq (rdf, spo, 12)$ . This would return keys such as $(rdf, spo, 1, 42, 98)$ , $(rdf, spo, 1, 59, 2)$ or $(rdf, spo, 7, 8, 5)$ (if present in the store).

Indexing

We are now ready to improve on our initial solutions for efficiently querying values applied to the function $f$ .

To recap: we have a bunch of key-value pairs $(k_{i}, v_{i})$ , a function $f : V \to N$ that maps values to natural numbers and we would like to find key-value pairs for which the applied function takes on some values.

For every key-value pair $(k_{i}, v_{i})$ that we store, we also store an additional entry with key $(f, f (v_{i}))$ and value $k_{i}$ .

In order to efficiently query along the property defined by the function $f$ we query for all keys in the range $(f, lower-bound) \leq k \leq (f, upper-bound)$ . The values returned by this query are the original keys $k_{i}$ . We can use them to retrieve the associated values $v_{i}$ .

The additional entries $(f, f (v_{i}))$ are index entries that point to the original key (also called the primary key). Unlike our initial solution we have not lost any information and can easily add additional indexes for other properties.

This is the basic idea towards building your own custom indices and multi-model databases. We will see examples and variations of this idea when looking at concrete examples below.

Implementation Notes

Ordered-key value stores such as LMDB or LevelDB expose an API as described above and the ideas illustrated above can be implemented pretty directly. In GeoPub we use the web-native database IndexedDB, which provides a slightly different and more high-level API. Nevertheless, the OKVS ideas can be applied and we will give some hints on how while discussing the examples. For a deep-dive check out the GeoPub source-code.

Indexing Strategies

Enough of the abstract. Let's look at some concrete examples of properties and dimensions on which we would like to be able to efficiently query.

RDF Triples

For the openEngiadina project we use the RDF data model to support the various types of data (see this previous post for an introduction to RDF).

In a nutshell: In RDF data is represented as a directed and labeled graph. Nodes are either IRIs (internationalized URIs) or literal values and directed edges are labeled with an URI. The atomic elements in RDF are triples: two nodes and the directed edge between them. Or in RDF parlance: A subject node that is connected with a directed edge, that is labeled with a predicate, to an object node.

We will use the notation $(s, p, o)$ for triples, where $s$ is the subject, $p$ the predicate and $o$ the object. A value that may appear in the subject, predicate or object position of a triple is called a term.

We note that in RDF things and concepts ideally have unique identifiers. Even if they appear in completely different contexts. In other word, the same term can appear in many triples.

Dictionary

Reflecting on the fact that terms are heavily re-used in RDF, we make them primary entities in our database.

For every term $t$ appearing in some RDF graph, we create an unique identifier $id$ and add an entry with key $(dictionary, id)$ and value $t$ .

Triple Indices

We now have stored all the terms that appear in our RDF data. We are still missing the triples. How do we represent triples in our ordered key-value store?

Given some term $s$ we might be interested in all the triples where $s$ appears as subject. To answer this query we could add entries with key $s$ and value $[(p_{1}, o_{1}), (p_{2}, o_{2}), ...]$ such that $(s, p_{i}, o_{i})$ are triples.

This is a complete representation of all triples, i.e. we have stored the RDF graph. However, this does not seem like a very good representation. For every new triple we add to our database we need to modify some stored values, which is usually more expensive than adding new smaller entries. Furthermore, we are not making good use of our ordered key-value store. To query for all objects that appear in triples with some fixed subject and predicate we need to find the entry, unpack the value and filter the stored list. We can do better.

Instead let's add entries with keys $(s, p_{i})$ and values $[o_{1}, o_{2}, ...]$ such that $(s, p_{i}, o_{i})$ are triples. Already better! We can now do a range query to find all triples for a given subject $s$ , i.e. by querying for all keys with $(s) \leq k \leq (s)$ we will get all matching keys and can unpack the objects of the triple.

We can apply this trick once more and for every triple add an entry with key $(s, p, o)$ and an empty value. Now querying for all triples given a fixed subject and predicate is simply a range query. Adding a new triple is also just creating a new entry.

But what if we want to query for all triples with a fixed object? Our $spo$ index is of no use here. We can not do a range query that binds the object as the lexicographical order of the index is wrong. Luckily, we can easily add another index to be able to efficiently perform such queries, e.g. an $osp$ index.

It turns out that to efficiently query RDF triples in any direction we only need three indices: $spo$ , $pos$ and $osp$ . This is a very nice argument in favor of RDF. The simple structure of the data model allows efficient indexing and querying while still being very expressive. The same argument has been made previously by Weiss et. al. (see Hexastore: Sextuple Indexing for Semantic Web Data Management), where they show that six indices are enough to query RDF efficiently. By using ordered key-value stores we can improve on this and only need three indices.

The Index is the Massage

You might have noticed that the dictionary is not really necessary. We could populate our indices with the RDF terms directly. Using a dictionary is a considerable performance improvement and allows us to follow the framework of indices being projections of some values.

The $spo$ , $pos$ and $osp$ indices may be seen as projections of the RDF terms. The projection is just dependent on the RDF graph. The projection, or the index, is really the content we are interested in.

Alternatively, you may see the RDF graph as a set of triples and our indices as a method for querying set membership efficiently. This is probably a more direct way of reasoning, but would not allow us to make cool puns.

GeoPub Implementation Notes

IndexedDB does not allow the manual creation of indices. We need to use a little trick.

We create an ObjectStore triples, where for every triple we add a JavaScript object with three fields corresponding to the indices:

{"spo":[6,8,9],"pos":[8,9,6],"osp":[9,6,8]}

We create IndexedDB indices that index the respective fields and can now query triples efficiently.

Note that the integers are the identifiers for the terms as stored in the dictionary and that IndexedDB orders arrays exactly how we would like tuples to be ordered (see the relevant section of the IndexedDB specification). This allows us to use range queries to find matching triples, as described above.

Geo-Spatial

There are some limits to what we can query for with the RDF indices presented previously. For example some RDF data might have geo-spatial information attached:

<https://example.com/a-geo-note>
  a as:Note ;
  geo:lat "46.7970040956";
  geo:long "10.2982868244";
  as:content "An ActivityStreams note with a geo position." .

We would like to be able to find such notes efficiently by querying for things in the vicinity of a given point.

Note that we can not do this with the RDF indices as described above. The RDF indices only allow us to search for triples that exactly match some pattern. Finding things in the vicinity can not be formulated as a triple pattern that should be matched. We need a specialized index.

Geo-spatial data is two dimensional and we need to be able to query both dimensions (longitude and latitude) at the same time. There exist specialized data-structures for querying such multi-dimensional data. Examples include Quadtrees, R-Trees or k-d trees. Our ordered key-value store, on the other hand, is queryable only in a single dimension. Can we still make use of our okvs for geo-spatial queries?

Z-Order Curve and Geohash

The trick is to project the multi-dimensional data to a single dimension while preserving locality of points. If two points are close together in the higher dimension their projection to the single dimension should also be close together (and vice-versa). We will use a Z-order curve that fulfills these properties (almost).

A Z-order curve partitions a two dimensional space by filling the space with "Z"s. At every iteration the curve is repeated (like a fractal) to more finely fill the space and allow higher resolution. A popular implementation of a Z-order curve is Geohash. We will use Geohash for our geo-spatial index.

Geohash maps an area bounded by latitude and longitude to a sequence of characters. The more characters used the more precise the resolution - the smaller the area. With enough precision the areas become good approximations for points.

For example the position of the ActivityStreams note can be described with the Geohash u0r64r4ur3z4. Whereas the Geohash u0r6 roughly describes the area of the lower Engadin valley (about 20 sqkm). Note that the larger area is a prefix of the more precise Geohash. This is something we can efficiently query for with our ordered-key value store.

Geohash index

For every RDF subject with geo-location we add an entry with key $(geo, Geohash (lat, long))$ and value the identifier of the RDF subject.

To query for subjects in a certain area we compute the Geohash of the area and then query for all entries in the geo index that start with the Geohash of the area. This can be formulated as a range query on an ordered key-value store.

Using Geohash has some limitations: It does not in general hold that two points that are geo-graphically close are close together in the Geohash projection (the opposite does hold). In particular this happens close to borders where Geohash splits the space. Mitigations are known and can be implemented (see the notes in the Wikipedia).

Full-Text Search

Another dimension we would like to be able to query is full-text, i.e. given some natural language term we would like to find all entries that contain the term. This is another example of where exact RDF triple pattern matching does not work.

Full-text search is a fuzzy thing. It would be nice if a query for the term "connected" also returns entries that contain the words "connections", "connected" or "connective". This can be done by performing a linguistic normalization called stemming (see the Xapian documentation for a good introduction to the topic).

In order to perform full-text searches we create an index for all RDF literals that have type string (and langString) that maps the linguistic normalized stems appearing in the literal to the literal identifier (as stored in the dictionary).

When given a query we again perform the linguistic normalization on the query to first get a list of query stems. We then use the index to find all literals that contain the stems appearing in the query.

When querying we want to match literals that contain all the query stems (i.e. query result should be a logical and, not a logical or). This is a join over the literal ids. A neat trick (that we stole from this little gist) is to do this join as a merge join. This is an efficient join algorithm that uses the fact that entries are soreted in the okvs.

More Indices

We have looked at three different indexing strategies (RDF, geo-spatial and full-text search) that can be implemented on ordered key-value stores. There are more indices that can be implemented on ordered key-value stores. Examples include temporal indices, pre-computed cryptographic signature verification (see RDF Signify and DMC), revisions (i.e. versioning) and many more!

Conclusion and Outlook

I hope to have illustrated the ideas behind how we have implemented a multi-modal database for openEngiadina that is capable of handling the variety of data that is local knowledge.

The main ingredient to all this has been ordered key-value stores. I find this a very elegant and powerful abstraction. The ideas are practically accessible as robust and well-tested ordered key-value stores exist (e.g. LMDB, LevelDB or SQLite LSM) that can be used to experiment with the ideas. The hard parts of databases (persistence and transactions) are already solved and we can experiment with indices and how to make concrete queries efficient in relative safety.

Ordered key-value stores are a relatively niche subject. Previous work include SRFI-167 and SRFI-168 by Amirouche Boubekki as well as FoundationDB.

This post is a feeble attempt to provide a framework for thinking about ordered key-value stores. I hope to have sparked your interest and invite you to experiment and formulate your thoughts on the topic. Much more thought and work is needed to polish these ideas.

SPARQL and GeoSPARQL

SPARQL is a query language for RDF databases. The ideas described in this post can be used to implement a SPARQL query engine - they describe lower-level indexing strategies.

I do hope to have motivated the case for application specific indices. Some applications may require geo-spatial indexing, others may requires high-performance indexing of revisions or pre-computed cryptographic functions.There are extensions to SPARQL that allow geo-spatial indexing (GeoSPARQL) and also vendor-specific extensions that allow full-text search. However, I believe that SPARQL is in general not flexible enough to allow such extensions and application specific optimizations.

Datalog

As mentioned in a previous post we use Datalog as query language. Datalog naturally allows custom indexing strategies and the formulation of queries that make use of them. I hope to motivate these ideas in a future post.

Fin

My fellow openEngiadina developer rustra has been sentenced to 5 years in prison as a victim of political repression in Belarus. Read his last words in court and an interview. Please consider donating to the Anarchist Black Cross Belarus.

Many thanks to the countless people who have helped form these ideas and have provided valuable feedback and ideas. A special thanks goes to Amirouche Boubekki who gave me the courage to pursue these ideas with his work on SRFI-167, SRFI-168 and Babelia.

Thank you, dear reader, for your interest and time. If you have thoughts, comments or questions, please feel free to get in contact! We have an IRC room on Libera (#openEngiadina) or you can contact me directly by mail (pukkamustard [at] posteo [dot] net).

The openEngiadina project is supported by the NLnet Foundation trough the NGI0 Discovery Fund.

pukkamustard