Blog
Why graph?

Why graph?

The database area is dominated by relational database systems (tables) and text queries since the 1970s. However the issues with the relational database systems are numerous and they even gave rise the the regular SW profession - database engineer. This is because contrary to their name they are very awkward at representing actual relations between data which is always demanded by the real world applications. They typically use foreign keys and/or proxy tables to represent them. Additionally the tables naturally enforce fixed immutable data schema upon the data they store. To change the schema one needs to create a new database with the changed schema and copy the data over (this is called database migration). Such operation is very costly and most database systems fair poorly when there are foreign keys involved (requiring them to be disabled for the migration to happen). As it turns out nowadays no database schema is truly immutable. New and changed requirements happen so often that the database schemas usually need updating (migrating) nearly every time there is an update to the systems using it.

There is no solution to this "schema" issue because it is the inherent feature of representing data in tabular form. It can be only mitigated to some degree but your mileage will vary greatly when using these techniques many of which are considered anti-patterns. Things like indexes, indirection (storing data with varied length), storing blob data, data with internal format unknown to the database itself (e.g. JSON) are all ways to prevent the need for database migration at the cost of efficiency. While there are good reasons for representing data in tabular form (lookup speed and space efficiency) the costs of very often far exceed the benefits. Plus as it turns out it is not even that efficient!

The tables are represented as fixed size records (rows) one after another (this is what makes the schema immutable). This representation is the most efficient when we are reading entire rows at the time (all columns) which is very rarely the case. Most often we want only some of the columns which means we are discarding some (or most) of the row when reading it. This is the same problem the CPU itself has when using memory. It reads is using cache lines. If we happen to read only some of the line the rest is wasted and another line needs to be fetched for the next item(s) (this is called a cache miss). This is why contiguous collections (like a vector) are almost always the most efficient because they minimize the cache misses. Chandler Carruth had numerous talks at CPPCon on this subject demonstrating that by far the biggest performance impact on software are the cache misses (over 50 % and up to 80 % !!!) with everything else being dwarfed in comparison.

Beside trying to optimize the tables the most prominent "solution" are the NoSQL databases. They typically use a different way to store data, often in a "schema-less" to cater to the above use cases - easing database migrations (or eliminating them) and providing more efficient data lookup. They typically choose some combination of key-value representation, document representation or a graph representation to scaffold the data instead of tables. They often trade in ACID properties, use write only - never delete "big tables" and other techniques.

Of NoSQL databases the graph databases stand out in particular because by definition they actually store the relations between the data in the database. How the values are then "attached" to the graph vary but the graph itself serves as an "index" as well as a "map" to be efficiently searched and reason about. The sparse graph (not all nodes are connected to all others) representation is then actually the most flexible and accurate way to store and represent the sparse data (as mentioned nearly or real world data is sparse).

There are two key properties of representing data as a graph that directly relates to the aforementioned issues of schema and data searching. Firstly the graph itself is the schema that can change freely as needed at any time without any restrictions eliminating the schema issue entirely. You do not need to be clairvoyant and agonize over the right database schema. You can do what works now and change your mind later without any issues. Secondly the graph allows accurately representing any actual relations between the data allowing the most efficient native traversal and lookup of data (vaguely resembling traditional indexing) making the lookup constantly efficient regardless of the data set size. Where table performance will deteriorate as it grows the graph will stay constantly efficient if you can traverse only the subset of the nodes via their relations even if the graph itself contained billions of nodes.

This is in a nutshell why the graph database is the best choice for most problem domains and data sets out there and why agdb is the graph database.

Costs

Everything has the cost and graph databases are no exception. Some operations and some data representations may be costlier in them as opposed to table based databases. For example if you had immutable schema that never updates then table based database might a better fit as the representation in form of tables is more storage efficient. or if you always read the whole table or whole rows then once again the table based databases might be more performant. Typically though these are uncommon edge cases unlikely to be found in the real world applications. The data is almost always sparse and diverse in nature, the schema is never truly stable etc. On the other hand most use cases benefit greatly from graph based representation and thus such a database is well worth it despite some (often more theoretical) costs.

Why not an existing graph database?

The following is the list of requirements for an ideal graph database:

  • Free license
  • Faster than table based databases in most common use cases
  • No new language for querying
  • No text based queries
  • Rust and/or C++ driver
  • Resource efficient (storage & memory)

Surprisingly there is no database that would fit the bill. Even the most popular graph databases such as Neo4J or OrientDB fall short on several of these requirements. They do have their own text based language (e.g. Cypher for Neo4J). They lack the drivers for C++/Rust. They are not particularly efficient (being mostly written in Java). Even the recent addition built in Rust - SurrealDb - is using text based SQL queries. Quite incomprehensibly its driver support for Rust itself is not very mature so far and was added only later despite the system being written in Rust. Something which is oddly common in the database world, e.g. RethinkDb, itself a document database, written mostly in C++, has no C++ support but does officially support for example Ruby. Atop of these issues they often do not leverage the graph structure very well (except for Neo4J which does great job at this) still leaning heavily towards tables.