Exploring graph databases and Neo4j
Recently I started learning about graph databases and Neo4j specifically.
In a graph database, you have two main things:
- Nodes (or vertices): the entities or things – could be people, bank accounts, or anything else
- Relationships (or edges): the connections between the things
Nodes and relationships can each have properties: pieces of data about them, such as what type of node or relationship it is, its name, when it was created, and so on. In a relational SQL database, to store relationships you need to add foreign key columns to your tables, and often indexes as well. In other words, you need to think things through up front, get your data into a shape that fits in tables of columns and rows, and define what kinds of things belong to or have many other things. In a graph database, you just start adding nodes, then add relationships from one thing to another. Just interconnected things.
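To make the model concrete, here is a rough in-memory sketch of nodes, relationships, and properties. It is only an illustration of the idea, not how Neo4j stores anything; the class names, the BankAccount node, and the "Owns" relationship with its "since" property are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, e.g. a person or a bank account."""
    id: int
    label: str                                  # what kind of node this is
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from one node to another."""
    start: int                                  # id of the node it leaves
    end: int                                    # id of the node it points at
    type: str
    properties: dict = field(default_factory=dict)


# No schema up front: add nodes, then connect them.
# All of the values below are hypothetical sample data.
john = Node(0, "Person", {"name": "John"})
account = Node(1, "BankAccount", {"balance": 100})
owns = Relationship(john.id, account.id, "Owns", {"since": 2014})

print(owns)
```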
I decided to play around with Neo4j and get a feel for how it works.
I used the instructions on DigitalOcean’s site for installing Neo4j on Ubuntu.
wget -O - http://debian.neo4j.org/neotechnology.gpg.key | apt-key add -
echo 'deb http://debian.neo4j.org/repo stable/' > /etc/apt/sources.list.d/neo4j.list
apt-get update
apt-get install neo4j
Now Neo4j is running. The HTTP API is available on port 7474.
We can send some requests to the API and see how it responds.
Some facts I’ve gathered:
- The API accepts and returns JSON. It may return other formats too; I haven’t checked.
- The Neo4j query language is called Cypher. It’s inspired by SQL, but it’s different. There are more special characters and brackets going on.
- To send a request to the API, you POST JSON to http://localhost:7474/db/data/cypher.
- In Neo4j, node ids are zero-based, so the first node has id 0, rather than the id 1 that is typical in SQL databases.
Every request, when using curl, looks basically like this:
$ curl \
    -H "Accept: application/json; charset=UTF-8" \
    -H "Content-Type: application/json" \
    -X POST \
    http://localhost:7474/db/data/cypher \
    -d '{...some json...}'
That JSON generally follows this format:
{
  "query": "...SOME CYPHER HERE...",
  "params": {
    "key1": "value1",
    "key2": "value2"
  }
}
The values in the params key/value hash get substituted for names in the query. Name substitution is a nice built-in feature.
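The same request can be put together without curl. This is a minimal sketch using only Python's standard library; the helper names build_cypher_request and run_cypher are my own, and the endpoint and headers are simply the ones from this post. Building the request is kept separate from sending it, so the payload can be inspected without a running server.

```python
import json
import urllib.request

# Endpoint taken from the post; adjust host/port for your own install.
CYPHER_URL = "http://localhost:7474/db/data/cypher"


def build_cypher_request(query, params=None, url=CYPHER_URL):
    """Return an unsent POST request carrying a Cypher query and its params."""
    body = json.dumps({"query": query, "params": params or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Accept": "application/json; charset=UTF-8",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_cypher(query, params=None):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_cypher_request(query, params)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Build (but do not send) a request, just to look at what goes over the wire.
req = build_cypher_request(
    "CREATE (n:Person { name : {name} }) RETURN n", {"name": "John"}
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling run_cypher with the same arguments would actually send it, assuming Neo4j is listening on that port.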
The exception to this JSON format, at least in the tutorial I followed, was when creating relationships. For that, you post to a different URL and use a slightly different syntax.
POST http://localhost:7474/db/data/node/0/relationships (URL of the first node’s relationships resource)
Request Body JSON:
{
  "to": "http://localhost:7474/db/data/node/1",
  "type": "Comes Before"
}
Here "to" is the URL of another node, and "type" is the name you've given this type of relationship.
You can also create relationships with the Cypher query language.
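I haven't run the Cypher route myself, so treat the following as a sketch under assumptions rather than a tested recipe. It builds a payload for the Cypher endpoint that looks both nodes up by id and connects them. The query text, the parameter names from_id and to_id, and the helper name are mine. The relationship type is written into the query (in backticks, because it contains a space) since Cypher parameters stand in for values, not for type names.

```python
import json

# Assumed query: match two nodes by internal id, then create a relationship.
# The {from_id} / {to_id} placeholders use the same parameter style as the
# other queries in this post.
QUERY = (
    "MATCH (a), (b) "
    "WHERE id(a) = {from_id} AND id(b) = {to_id} "
    "CREATE (a)-[r:`Comes Before`]->(b) "
    "RETURN r"
)


def relationship_payload(from_id, to_id):
    """Return the JSON body to POST to the Cypher endpoint."""
    return json.dumps(
        {"query": QUERY, "params": {"from_id": from_id, "to_id": to_id}}
    )


print(relationship_payload(0, 1))
```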
Creating a node of type Person with a “name” attribute:
Cypher:
CREATE (n:Person { name : {name} }) RETURN n
Params:
name: "John"
Full on JSON:
{ "query" : "CREATE (n:Person { name : {name} }) RETURN n", "params" : { "name" : "John" } }
The API gives you a lot of data back:
{
  "columns": [ "n" ],
  "data": [
    [
      {
        "outgoing_relationships": "http://localhost:7474/db/data/node/0/relationships/out",
        "labels": "http://localhost:7474/db/data/node/0/labels",
        "data": { "name": "John" },
        "traverse": "http://localhost:7474/db/data/node/0/traverse/{returnType}",
        "all_typed_relationships": "http://localhost:7474/db/data/node/0/relationships/all/{-list|&|types}",
        "self": "http://localhost:7474/db/data/node/0",
        "property": "http://localhost:7474/db/data/node/0/properties/{key}",
        "properties": "http://localhost:7474/db/data/node/0/properties",
        "outgoing_typed_relationships": "http://localhost:7474/db/data/node/0/relationships/out/{-list|&|types}",
        "incoming_relationships": "http://localhost:7474/db/data/node/0/relationships/in",
        "extensions": {},
        "create_relationship": "http://localhost:7474/db/data/node/0/relationships",
        "paged_traverse": "http://localhost:7474/db/data/node/0/paged/traverse/{returnType}{?pageSize,leaseTime}",
        "all_relationships": "http://localhost:7474/db/data/node/0/relationships/all",
        "incoming_typed_relationships": "http://localhost:7474/db/data/node/0/relationships/in/{-list|&|types}",
        "metadata": { "id": 0, "labels": [ "Person" ] }
      }
    ]
  ]
}
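Most of that reply is URLs for follow-up requests. If all you want is the node itself, the interesting parts are "data" (the properties) and "metadata" (id and labels). Here is a small sketch that pulls those out; the sample is a trimmed copy of the reply above, and the function name is made up.

```python
import json

# Trimmed copy of the reply above: only the fields this sketch reads.
reply = json.loads("""
{
  "columns": ["n"],
  "data": [[{
    "self": "http://localhost:7474/db/data/node/0",
    "data": {"name": "John"},
    "metadata": {"id": 0, "labels": ["Person"]}
  }]]
}
""")


def summarize_nodes(reply):
    """Yield (id, labels, properties) for each node in a one-column reply."""
    for row in reply["data"]:
        node = row[0]                 # the single column, "n"
        meta = node["metadata"]
        yield meta["id"], meta["labels"], node["data"]


for node_id, labels, props in summarize_nodes(reply):
    print(node_id, labels, props)
```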
Creating a second person:
{ "query" : "CREATE (n:Person { name : {name} }) RETURN n", "params" : { "name" : "Susan" } }
Create a relationship between them:
Assume the John node has a URL of
http://localhost:7474/db/data/node/0
And the Susan node has a URL of
http://localhost:7474/db/data/node/1
We want to create a relationship between John and Susan. John loves Susan. That’s the relationship. In Neo4j, relationships have a direction. They go from one node to another. They also have a type. In this relationship, the from node is John, the to node is Susan, and the type is “Loves”. Each node has a set of relationships. This set of relationships is exposed as a resource in the REST API:
http://{url-to-node}/relationships
For John’s node, which is node 0, this would be:
http://localhost:7474/db/data/node/0/relationships
Then, to create this relationship:
POST http://localhost:7474/db/data/node/0/relationships
{ "to": "http://localhost:7474/db/data/node/1", "type": "Loves" }
Again, Neo4j returns a lot of data:
{
  "extensions" : { },
  "start" : "http://localhost:7474/db/data/node/0",
  "property" : "http://localhost:7474/db/data/relationship/1/properties/{key}",
  "self" : "http://localhost:7474/db/data/relationship/1",
  "properties" : "http://localhost:7474/db/data/relationship/1/properties",
  "type" : "Loves",
  "end" : "http://localhost:7474/db/data/node/1",
  "metadata" : { "id" : 1, "type" : "Loves" },
  "data" : { }
}
It tells us where the relationship starts, where it ends, any metadata, any extensions, properties, and any other data.
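Here is the John-loves-Susan round trip as a sketch: one helper that works out the URL and body to POST, and one that boils the reply down to a single line. The helper names are mine, the URL pattern is the one from this post, and the sample reply is cut down to the three fields the code reads.

```python
import json

# URL pattern from the post; host and port are whatever your install uses.
NODE_URL = "http://localhost:7474/db/data/node/{id}"


def relationship_request(from_id, to_id, rel_type):
    """Return (url, json_body) for creating a relationship between two nodes."""
    url = NODE_URL.format(id=from_id) + "/relationships"
    body = json.dumps({"to": NODE_URL.format(id=to_id), "type": rel_type})
    return url, body


def describe(reply):
    """Turn a relationship reply into a short 'start -[type]-> end' string."""
    start = reply["start"].rsplit("/", 1)[-1]   # last URL segment is the id
    end = reply["end"].rsplit("/", 1)[-1]
    return "node {} -[{}]-> node {}".format(start, reply["type"], end)


url, body = relationship_request(0, 1, "Loves")
print(url)
print(body)

# Cut-down version of the reply shown above.
sample_reply = {
    "start": "http://localhost:7474/db/data/node/0",
    "end": "http://localhost:7474/db/data/node/1",
    "type": "Loves",
}
print(describe(sample_reply))
```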
Now, let’s do a simple query to figure out who loves whom (JSON wrapper left out, just the Cypher query):
MATCH (n)-[r:Loves]->(m) RETURN n.name AS from, type(r) AS `->`, m.name AS to
And we get back:
{ "columns" : [ "FROM", "->", "to" ], "data" : [ [ "John", "Loves", "Susan" ] ] }
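The reply is a list of column names plus a list of rows, a bit like a SQL result set. A quick way to work with it is to zip each row with the column names. A small sketch, using the reply above as sample input (the function name is mine):

```python
# The reply from the query above, as a Python dict.
reply = {
    "columns": ["FROM", "->", "to"],
    "data": [["John", "Loves", "Susan"]],
}


def rows_as_dicts(reply):
    """Pair each row's values with the column names."""
    return [dict(zip(reply["columns"], row)) for row in reply["data"]]


for row in rows_as_dicts(reply):
    print(row["FROM"], row["->"], row["to"])
```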
Once you’ve got a lot of relationships set up, you can use Cypher to query them quickly. An immediate use case for graph databases is social networks. Say you have all this data about who is friends with whom, and you want some stats on it, like how many of my friends’ friends’ friends like pizza and live in Spain. With a SQL database, this can be a real pain to write: loads of joins and nested queries, and it gets very slow as the network grows. A graph database is optimized for this, so queries that traverse many levels of relationships are very fast, or at least they’re purported to be.
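To show what that kind of question involves, here is a toy version in plain Python: a breadth-first walk over a friendship graph held in a dict, followed by a filter on pizza and Spain. This is not how Neo4j does it internally, and every name and value in the data is invented for the example. It just shows the shape of a three-hops-out query.

```python
# Hypothetical data, made up for illustration only.
friends = {
    "me": ["ana", "bob"],
    "ana": ["me", "carl"],
    "bob": ["me", "dina"],
    "carl": ["ana", "eva"],
    "dina": ["bob", "eva", "finn"],
    "eva": ["carl", "dina"],
    "finn": ["dina"],
}
profile = {
    "eva": {"likes": "pizza", "country": "Spain"},
    "finn": {"likes": "sushi", "country": "Spain"},
}


def people_at_depth(start, depth):
    """Return people whose shortest friendship path from start is depth hops."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for person in frontier:
            for friend in friends.get(person, []):
                if friend not in seen:      # skip anyone reached already
                    seen.add(friend)
                    next_frontier.add(friend)
        frontier = next_frontier
    return frontier


# Friends of my friends' friends who like pizza and live in Spain.
matches = sorted(
    person
    for person in people_at_depth("me", 3)
    if profile.get(person, {}).get("likes") == "pizza"
    and profile.get(person, {}).get("country") == "Spain"
)
print(len(matches), matches)
```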
Although social networks are the first data structures to jump to mind, graph databases seem well suited to many other types of data as well, so long as the items are interrelated.
Another nice thing: graph databases seem better able to represent real-life scenarios, where data doesn’t easily fit in a two-dimensional grid. Have more data? Add nodes. Does that data relate? Add relationships. This includes one-off items, special cases, and things which don’t fit into a traditional two-dimensional grid, a hierarchical tree, or a flat set of key-value pairs. Everything can be interconnected in a giant messy hairball of strange and wonderful relationships, just as it is in real life, and that’s OK: a graph database can easily represent that.