Graph databases

17 June 2009

There's a new database concept in town devised especially to store a traditional computer science data structure: a graph or a network. This type of database is called a graph database or, sometimes, a key/value database. They've come to the fore recently because of an extremely good open source implementation in Java called Neo4j, released a couple of months ago. There's none that I can find for .NET yet, although I suppose some enterprising C# dev may be converting Neo4j à la NHibernate.

If you've done much database work, you'll know that although relational databases work extremely well for what we may describe as traditional data, data for which we know the attributes ahead of time and that naturally falls into what you might call a tabular form, rows and columns. Customers, orders, order lines and so on so forth. (I know, I know, a gross over-generalization.) In essence because the data has such structure (or because we've been able to impose such a structure on it) modern RDBMSs have optimized the heck out of themselves and out of SQL. We achieved simplicity of abstraction, robustness of implementation, great performance with these database servers, however there's still one area where the story is not quite so simple: scalability (we tend to throw faster hardware at our database server, not necessarily more servers).

Although RDBMSs have given us a lot of benefits and a great abstraction for many years, they've become a jack-of-all-trades type solution rather than being the best at everything. The example I've given so far is scalability: with an RDBMS we add more disks, replace the processor with a faster one, and so on, when the access load starts to bite. Because of the inherent complexity of the whole RDBMS it's really hard to add another server. There's whole decisions about partitioning the database in the most efficient manner and other types of problems. If you are running a large web site or are running many heavily-loaded web services, a single database server will be your bottleneck.

Another problem with relational database systems is the "baking in" of initial design decisions. How many times have you run into the issue of "oops, we forgot this attribute for this data type" and then have to rebuild your schema to accommodate it? Easy enough during development of the first version, but ruddy hard in version 3.0 when the schema is "bedded down" and you are really loath to change it. I'm sure we've all done the "add this special code to the Description field to store this attribute" hack. 

Yet another problem is the fact that, in order to try and mitigate the above issue perhaps, you add many columns to your records and end up with a sparse table, where not every record stores the same data and so columns have little data in them, maybe just values for a few records only.

Finally -- and I'm not trying to nail down the coffin lid here, just pointing out issues with the relational model -- we have what might be called the social media issue or the six degrees of separation problem. More and more of our data forms a network or a graph. In computer science a graph is a set of nodes and edges, the latter being the links between the nodes. A simple binary tree is a graph, a very simple one where the edges only go between a parent and its children. Imagine a graph that describes the interstate system in the US: we have cities as nodes, and then the edges are the various highways. We can attach attributes to the nodes and we can attach attributes to the interstates. Sometimes those attributes will be the same for all nodes (for example, population, area) and for all edges (distance), but there may be other attributes that do not apply to all cities (name of main island would apply to New York, but not Denver). Another example of such a model is Twitter, where the nodes are people on the service and the edges are defined by the "follows" relation. (In fact, here the edge is directed, the link has a direction, an arrow. I may be following you, but it doesn't mean you are following me back.) Persisting this kind of network model is not geared to a relational database, especially when you want to ask questions like "Am I following someone who is following Joe Blow?".

Whether a node or an edge, you are modeling key/value pairs. Each entity in your model has a key (so you can find it again) and a value (which can be a whole list of properties, themselves key/value pairs). 

The various cloud implementations have been pointing the way for a while: if you store database-type data in the cloud, you basically give up a wonderful traditional (academic?) normalized relational schema and go for something more on the wild side. individual tables that you load with data and query the heck out of (Azure uses LINQ, for example). No relations between tables (unless you code that in, in some fashion). Amazon's SimpleDB works in the same fashion: you create a data set, insert a bunch of data (the "records" do not have to have the same attributes or structure), and then query the heck out of it. In both these cloud examples, you do not have to predefine your data structure or worry about indexes or keys; in essence, the items you insert have associated properties that are defined when the item is added and the cloud engine takes care of everything else (including, hurrah, spreading your data across several servers both for scalability and for redundancy). Sometimes, even, the data you store has to be string values only; sometimes you get access to other primitives like ints, booleans, and lists.

A graph database, then, is a special persistence engine for storing key/value data. If you like, it's a persistent hash table or dictionary (shameless plug: I described a persistent hash table in my 2001 book Tomes of Delphi, Algorithms and Data Structures). Because of its association with cloud computing, its API tends to be REST-based (it's a web service after all).  It has a query system that allows you to easily query the data, but it's nowhere near as powerful as SQL. It doesn't require a fixed schema, which can be good in some ways (you can extend the attributes you store easily), but bad in others (optimizing a schema for efficient access is impossible without one). It maps very easily to an object-relational model in your code, in particular there's no need for any kind of object-relational mapping framework.

All is not sunshine and roses, however. In throwing away the RDBMS, you're also throwing away things like data integrity and data robustness, especially integrity enforced by the database schema. As I noted above, by planning a schema, you have a better chance of being able to optimize it (setting up indexes, and so on, based on the application's needs) rather than rely on the graph engine's own auto-optimization code. Also there's no standardization between graph databases yet, unlike, say, SQL.

Despite all that, Neo4j and the various cloud computing services are pointing the way to graph databases becoming more prevalent in certain situations. Things are starting to look very interesting on the persistence front.

(Some extra reading:


Free DevExpress Products - Get Your Copy Today

The following free DevExpress product offers remain available. Should you have any questions about the free offers below, please submit a ticket via the DevExpress Support Center at your convenience. We'll be happy to follow-up.
No Comments

Please login or register to post comments.