Making Sense Out of Datomic, The Revolutionary Non-NoSQL Database

June 16, 2013

I have finally managed to understand one of the most unusual databases of today, Datomic, and would like to share it with you. Thanks to Stuart Halloway and his workshop!

Why? Why?!?

As we shall see shortly, Datomic is very different from traditional RDBMSs as well as the various NoSQL databases. It isn't even a database - it is a database on top of a database. I couldn't wrap my head around that until now. The key to understanding Datomic and its unique design and advantages is actually simple.

The mainstream databases (and languages) have been designed around the following constraints of the 1970s:

memory is expensive
storage is expensive
it is necessary to use dedicated, expensive machines

Datomic is essentially an exploration of what database we would have designed if we didn't have these constraints. What design would we choose given gigabytes of RAM, networks with bandwidth and speed matching or exceeding hard disk access, and the ability to spin up and kill servers on a whim?

But Datomic isn't an academic project. It is pragmatic: it wants to fit into our existing environments and make it easy for us to start using its futuristic capabilities now. And it is not as fresh and green as it might seem. Rich Hickey, the mastermind behind Clojure and Datomic, has reportedly thought about both these projects for years, and the designs have been really well thought through.

The Weird Architecture of Datomic

Datomic is a database on top of another database (or rather storage) - in-memory, a file system, a traditional RDBMS, Amazon DynamoDB.
You do not send your query to the server and get back the result. Instead, you get back all the data you need to execute the query and run the query - and all subsequent queries - locally. Thus, "joins" are pretty cheap and you can do plenty of otherwise impossible things (combine data from multiple databases and local data structures, run any code on them, ...). Each application using Datomic - a "peer" - will have the data it needs, based on its unique needs and usage patterns, close to itself.
All writes go through one component, called the Transactor, which essentially serializes the writes, thus ensuring ACID. It might sound like a bottleneck but it isn't for most practical purposes^[1] given the design and typical application needs. (Reportedly, Datomic could handle all transactions for all credit cards in the world. Listen to the experiences of Room Key with their rather write-heavy load in the Relevance Podcast with Kurt Zimmer (Podcast Episode 033).)
Datomic works quite similarly to a version control system such as Git. It never overwrites data; there are no updates. You only mark the data as no longer valid and add new data, which produces a new version of the database (think of a git hash / svn revision number). You can then query the latest state of the database or the state as of a particular version. (Of course the whole database isn't copied whenever you add a fact to it. Datomic is smart and efficient.)
Datomic is not a single, monolithic server; the storage, transactor, and peers are physically separate pieces.
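
To make the version-control analogy and the single-writer idea more tangible, here is a minimal Python sketch. It is a toy model under my own assumptions, not Datomic's API: the names Fact, ToyStore, transact and as_of are made up for illustration, and a plain lock stands in for the Transactor.

```python
import threading
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Fact:
    entity: str
    attribute: str
    value: Any
    tx: int       # number of the transaction that recorded this fact
    added: bool   # True = asserted, False = marked as no longer valid


class ToyStore:
    """Hypothetical append-only fact log; not the Datomic API."""

    def __init__(self) -> None:
        self._log: list = []
        self._write_lock = threading.Lock()  # stands in for the single Transactor
        self._next_tx = 1

    def transact(self, changes) -> int:
        """Record (entity, attribute, value) triples as one transaction."""
        with self._write_lock:  # writes are applied strictly one at a time
            tx = self._next_tx
            self._next_tx += 1
            for entity, attribute, value in changes:
                current = self.as_of()
                key = (entity, attribute)
                if key in current:
                    # Nothing is overwritten: the old value is retracted instead.
                    self._log.append(Fact(entity, attribute, current[key], tx, False))
                self._log.append(Fact(entity, attribute, value, tx, True))
            return tx

    def as_of(self, tx: Optional[int] = None) -> dict:
        """Replay the log up to `tx` (or to the end) into a plain dict."""
        view = {}
        for fact in tuple(self._log):
            if tx is not None and fact.tx > tx:
                break
            key = (fact.entity, fact.attribute)
            if fact.added:
                view[key] = fact.value
            else:
                view.pop(key, None)
        return view


store = ToyStore()
t1 = store.transact([("jane", "email", "jane@old.example")])
t2 = store.transact([("jane", "email", "jane@new.example")])

latest = store.as_of()      # state after t2
earlier = store.as_of(t1)   # state as it was after t1
```

The second transaction does not touch the first fact; it appends a retraction and a new assertion, so both the latest and the earlier state remain queryable. Replaying the whole log on every read is of course naive and only serves the illustration.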

What has made this possible?

Network access as fast as or faster than disk access => can fetch all the data over the network
Plenty of memory => can store a substantial subset of it on each peer according to its actual needs
Storage is huge and cheap => we can easily store historical data
Experiences with efficient, immutable, "persistent" data structures used in modern FP languages => cheap creation of new "database values"
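
The last point deserves a small illustration. The sketch below shows path copying in a tiny binary search tree, one common way to build a "persistent" data structure: an insert creates a new root and copies only the nodes on the way to the change, while everything else is shared with the old version. This is a generic Python example with made-up names (Node, insert, lookup); the source does not say which structures Datomic uses internally, so please do not read it as Datomic's implementation.

```python
from typing import Any, NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node: Optional[Node], key: int, value: Any) -> Node:
    """Return the root of a new tree; the old tree stays valid and unchanged."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        # Copy this node only; the untouched right subtree is shared.
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        # Copy this node only; the untouched left subtree is shared.
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)


def lookup(node: Optional[Node], key: int) -> Any:
    """Return the value stored under `key`, or None if it is absent."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


v1 = insert(insert(insert(None, 50, "a"), 30, "b"), 70, "c")
v2 = insert(v1, 80, "d")   # a new "database value"; v1 is untouched
```

Here v2 sees the new key while v1 does not, and the left subtree of v2 is the very same object as the left subtree of v1, so producing a new version costs only a handful of new nodes.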

The Unique Value Proposition And Capabilities of Datomic

We have now learned about and hopefully understood the unique design of Datomic. But what does it give us, and what distinguishes it from other databases?

The architecture, together with a few other design decisions, provides the following key characteristics:

Programmability - data, schema, query input/output, transaction metadata are all just elementary data structures that you have fully available at the peer and can thus combine and process in powerful ways unimaginable before
Persistence/accountability - you never lose history, you can annotate transactions with metadata about who/why etc., and you can find out how things were, see how they have been changing, and perform what-if analysis
Elastic scalability - since a lot of the load has been pushed to the peers
Flexibility - no rigid schema, easy to navigate and combine and cache data based on each peer's unique needs, extensibility via data functions
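
A rough Python sketch of what "just elementary data structures" can mean in practice follows. Everything in it is an assumption made for illustration: the tuple layout (entity, attribute, value, tx), the attribute names, and the vip_report function are invented and are not Datomic's data model or query language. It joins facts with transaction metadata and with a purely local set, then repeats the same query against a speculative value for a what-if analysis.

```python
# Facts already available at the peer: (entity, attribute, value, tx).
facts = [
    ("order-1", "order/customer", "cust-7", 100),
    ("order-1", "order/total", 250, 100),
    ("order-2", "order/customer", "cust-9", 101),
    ("order-2", "order/total", 90, 101),
]
# Transaction metadata: who made each transaction.
tx_users = {100: "alice", 101: "bob"}
# Local data that never lived in the database.
vip_customers = {"cust-7"}


def vip_report(facts, tx_users, vip_customers):
    """Join order facts with transaction metadata and a local set."""
    return [
        (order, total, tx_users[tx])
        for order, attr, customer, tx in facts
        if attr == "order/customer" and customer in vip_customers
        for other, attr2, total, _ in facts
        if attr2 == "order/total" and other == order
    ]


report = vip_report(facts, tx_users, vip_customers)

# What-if analysis: build a speculative value next to the real one.
speculative_facts = facts + [
    ("order-3", "order/customer", "cust-7", 102),
    ("order-3", "order/total", 40, 102),
]
speculative_users = {**tx_users, 102: "what-if"}
speculative_report = vip_report(speculative_facts, speculative_users, vip_customers)
```

The speculative list is a separate value, so the original facts are left exactly as they were and nothing is written anywhere.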

Closing Notes

Datomic has goals similar to those of relational databases (especially ACID) and could be used in similar use cases. Performance-wise, if writes are more important than reads, if you need to continuously write a great deal of data every second, or if you have billions of "rows", then you might prefer another solution. Thanks to the design and the recommended architecture for heavily loaded installations, i.e. with memcached in front of the storage, the performance of the storage backend isn't so important (the peers have the data they need locally or get it from memcached), so the backend should be selected based more on its usage-related characteristics.
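
The lookup order described above (local data first, then memcached, then the storage) could look roughly like the following Python sketch. Plain dicts stand in for memcached and for the storage backend, and the TieredReader class with its storage_reads counter is a made-up name for illustration, not part of Datomic. The sketch also assumes that stored chunks are never overwritten, which is why no cache invalidation appears anywhere.

```python
class TieredReader:
    """Hypothetical peer-side reader: local cache, shared cache, then storage."""

    def __init__(self, storage: dict, shared_cache: dict) -> None:
        self.storage = storage            # stands in for the storage backend
        self.shared_cache = shared_cache  # stands in for memcached
        self.local = {}                   # this peer's in-process cache
        self.storage_reads = 0            # how often this peer hit the backend

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.shared_cache:
            value = self.shared_cache[key]
        else:
            self.storage_reads += 1
            value = self.storage[key]
            self.shared_cache[key] = value  # later readers skip the backend
        self.local[key] = value
        return value


storage = {"chunk-1": ("fact-a", "fact-b")}
shared = {}
peer_a = TieredReader(storage, shared)
peer_b = TieredReader(storage, shared)

first = peer_a.get("chunk-1")    # goes all the way to the storage
second = peer_b.get("chunk-1")   # served from the shared cache
third = peer_b.get("chunk-1")    # served from peer_b's local cache
```

Only the very first read reaches the backend, which is the reason its raw speed matters less in such a setup.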

Summary

The design of Datomic - peers fetching data and running queries locally, a single coordinator of writes (the transactor), building on existing databases/storage tools (and keeping all the history) - seemed very strange and perhaps inefficient to me until I realized that the traditional databases are designed around constraints that do not exist anymore. Datomic now makes sense to me and seems like a tool with intriguing capabilities and great potential. I hope you see it the same way now :-).

I have left out some interesting topics such as what data structures can be stored in Datomic and the data model and query model used. To learn about these and more about Datomic, head to Datomic for Five Year Olds and Datomic's home page.

Bonus Links

Data functions for optimistic and pessimistic locking in Datomic (forum answer)
HighScalability.com: VoltDB Decapitates Six SQL Urban Myths and Delivers Internet Scale OLTP in the Process - description of the architecture of VoltDB, which has a few things in common with Datomic (single-threaded writes, "stored procedures" as units of transaction etc.)
VoltDB - Mike Stonebraker's incredibly scalable, SQL, ACID database that also breaks with the constraints of the 70s and leverages huge RAM, single-threaded access etc.

^[1] Harizopoulos, S., Abadi, D. J., Madden, S., & Stonebraker, M. (2008, June). OLTP through the looking glass, and what we found there. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 981-992). ACM. - this paper shows that a traditional RDBMS spends nearly 30% of its time on locking and latching, which could be eliminated with single-threaded access, as is also done in VoltDB. See also the VoltDB whitepaper.

Tags: clojure performance