Graph databases

17 June 2009

There's a new database concept in town devised especially to store a traditional computer science data structure: a graph or a network. This type of database is called a graph database or, sometimes, a key/value database. They've come to the fore recently because of an extremely good open source implementation in Java called Neo4j, released a couple of months ago. There's none that I can find for .NET yet, although I suppose some enterprising C# dev may be converting Neo4j à la NHibernate.

If you've done much database work, you'll know that although relational databases work extremely well for what we may describe as traditional data, data for which we know the attributes ahead of time and that naturally falls into what you might call a tabular form, rows and columns. Customers, orders, order lines and so on so forth. (I know, I know, a gross over-generalization.) In essence because the data has such structure (or because we've been able to impose such a structure on it) modern RDBMSs have optimized the heck out of themselves and out of SQL. We achieved simplicity of abstraction, robustness of implementation, great performance with these database servers, however there's still one area where the story is not quite so simple: scalability (we tend to throw faster hardware at our database server, not necessarily more servers).

Although RDBMSs have given us a lot of benefits and a great abstraction for many years, they've become a jack-of-all-trades type solution rather than being the best at everything. The example I've given so far is scalability: with an RDBMS we add more disks, replace the processor with a faster one, and so on, when the access load starts to bite. Because of the inherent complexity of the whole RDBMS it's really hard to add another server. There's whole decisions about partitioning the database in the most efficient manner and other types of problems. If you are running a large web site or are running many heavily-loaded web services, a single database server will be your bottleneck.

Another problem with relational database systems is the "baking in" of initial design decisions. How many times have you run into the issue of "oops, we forgot this attribute for this data type" and then have to rebuild your schema to accommodate it? Easy enough during development of the first version, but ruddy hard in version 3.0 when the schema is "bedded down" and you are really loath to change it. I'm sure we've all done the "add this special code to the Description field to store this attribute" hack. 

Yet another problem is the fact that, in order to try and mitigate the above issue perhaps, you add many columns to your records and end up with a sparse table, where not every record stores the same data and so columns have little data in them, maybe just values for a few records only.

Finally -- and I'm not trying to nail down the coffin lid here, just pointing out issues with the relational model -- we have what might be called the social media issue or the six degrees of separation problem. More and more of our data forms a network or a graph. In computer science a graph is a set of nodes and edges, the latter being the links between the nodes. A simple binary tree is a graph, a very simple one where the edges only go between a parent and its children. Imagine a graph that describes the interstate system in the US: we have cities as nodes, and then the edges are the various highways. We can attach attributes to the nodes and we can attach attributes to the interstates. Sometimes those attributes will be the same for all nodes (for example, population, area) and for all edges (distance), but there may be other attributes that do not apply to all cities (name of main island would apply to New York, but not Denver). Another example of such a model is Twitter, where the nodes are people on the service and the edges are defined by the "follows" relation. (In fact, here the edge is directed, the link has a direction, an arrow. I may be following you, but it doesn't mean you are following me back.) Persisting this kind of network model is not geared to a relational database, especially when you want to ask questions like "Am I following someone who is following Joe Blow?".

Whether a node or an edge, you are modeling key/value pairs. Each entity in your model has a key (so you can find it again) and a value (which can be a whole list of properties, themselves key/value pairs). 

The various cloud implementations have been pointing the way for a while: if you store database-type data in the cloud, you basically give up a wonderful traditional (academic?) normalized relational schema and go for something more on the wild side. individual tables that you load with data and query the heck out of (Azure uses LINQ, for example). No relations between tables (unless you code that in, in some fashion). Amazon's SimpleDB works in the same fashion: you create a data set, insert a bunch of data (the "records" do not have to have the same attributes or structure), and then query the heck out of it. In both these cloud examples, you do not have to predefine your data structure or worry about indexes or keys; in essence, the items you insert have associated properties that are defined when the item is added and the cloud engine takes care of everything else (including, hurrah, spreading your data across several servers both for scalability and for redundancy). Sometimes, even, the data you store has to be string values only; sometimes you get access to other primitives like ints, booleans, and lists.

A graph database, then, is a special persistence engine for storing key/value data. If you like, it's a persistent hash table or dictionary (shameless plug: I described a persistent hash table in my 2001 book Tomes of Delphi, Algorithms and Data Structures). Because of its association with cloud computing, its API tends to be REST-based (it's a web service after all).  It has a query system that allows you to easily query the data, but it's nowhere near as powerful as SQL. It doesn't require a fixed schema, which can be good in some ways (you can extend the attributes you store easily), but bad in others (optimizing a schema for efficient access is impossible without one). It maps very easily to an object-relational model in your code, in particular there's no need for any kind of object-relational mapping framework.

All is not sunshine and roses, however. In throwing away the RDBMS, you're also throwing away things like data integrity and data robustness, especially integrity enforced by the database schema. As I noted above, by planning a schema, you have a better chance of being able to optimize it (setting up indexes, and so on, based on the application's needs) rather than rely on the graph engine's own auto-optimization code. Also there's no standardization between graph databases yet, unlike, say, SQL.

Despite all that, Neo4j and the various cloud computing services are pointing the way to graph databases becoming more prevalent in certain situations. Things are starting to look very interesting on the persistence front.

(Some extra reading:

Enjoy.)

5 comment(s)
Emil Eifrem

Hi,

Good post! Just wanted to point out that there's been some discussions about a .NET port of Neo4j:

  www.mail-archive.com/.../msg01107.html

Currently, the best way to use Neo4j in a .NET environment is to wrap the database in a (preferably domain-oriented) REST layer. See for example the discussion in this thread:

  www.mail-archive.com/.../msg01312.html

-EE

17 June, 2009
David Moloney

Julian, please do not confuse the relational model with current SQL implementations.  Modern RDBMS pay lip service to the core principles of the model.

It is important to note that the relational model can express all variations of the network (graph/hierarchy) data model but the inverse is not true.  

There are plenty of reasons that we moved away from IMS (hierarchy) and towards the relational model..... Scalability and performance are NOT reasons for the transition.

In saying that... right tool for the right job...

17 June, 2009
Maurice Thomas

This is strangely coincidental. I've currently got a project where - in memory at least - I have no rigid classes. Just dictionaries for each attribute - so for each object that needs a StudentID column, there's an entry in that kvp dictionary.

I've taken the thought a stage further, having a string lookup core (so that there's only one storage of the word 'Smith', and the Surname column is an Int32).

Weird thing is I thought I was just being odd and trying a new thing - I didn't realise there's a whole school of thought out there suggesting the same architecture.

18 June, 2009
Jason Short

"A graph database, then, is a special persistence engine for storing key/value data"

Persistence is not a storage model.  You can serialize data to disk.  Is that a database?  Not hardly.

What you are really talking about is a graph representation of KV pairs.  Not what I call a database at all.  When all nodes can be any value you are just talking about a flat store.  XML could then be described as a database - ACK!

Is SQL broken?  Yep.  Has been for 30+ years.  But so is SMTP, POP, DNS, etc.  Lots of people have predicted their replacements for years.  Why is IP4 still the standard when IP6 has been around for 15+ years?

Relational data storage is not going anywhere.  Look at the stink Microsoft took when they tried to go to a KV system (ala AmazonDB) for SQL Server in the cloud.  People screamed bloody murder until MS dropped that and went back to relational.

18 June, 2009
Peter Neubauer

Hi there,

Magnus Mårtensson from Jayway just wrote a small PoC app from C# to the newly developed low level REST API to Neo4j: blog.jayway.com/.../neo4j-net-client-over-http-using-rest-and-json

Thought that might be interesting?

/peter

18 April, 2010

Please login or register to post comments.