Tutorials

The Benefits of Using a Graph Database Instead of SQL

by
Ivan Despot

Introduction

Graph databases provide a much faster and more intuitive way to model and query your data compared to traditional relational databases. In a previous article, we discussed the main differences between a graph database and a relational database, what kind of use-cases are best suited for each database type, and what are their strengths and weaknesses.

Choosing the right tool for the right job is critical to speed up development, ensure flexibility, and avoid operational complexity as your application evolves. In this tutorial, you will go through the process of building a simple customer order management application to learn how a graph approach compares to a SQL one in terms of data modeling, querying, and development flexibility.

Prerequisites.

To follow along, you will need:

  • An installation of Memgraph DB: a native, in-memory graph database. To install Memgraph DB and set it up, please follow the Docker Installation instructions on the Installation guide.
  • [Optional] An installation of Memgraph Lab: an integrated development environment used to import data, develop, debug and profile database queries and visualize query results.

Modeling Data with SQL

To model your customer order management system using an SQL implementation, you will need three tables: Customer, Purchase and Product. You should also define a structure for them upfront, which in the case of a more complex real-world problem would imply a lot of meetings and discussions with your team to figure out what Customer, Purchase and Product should look like.

Once that’s done, the next step is to connect your tables. Many customers have many purchases, which means an additional many-to-many table. This is not a problem and many ORMs will handle it for you. You will need another many-to-many table for connecting purchases and products. A Purchase containing a Product can have a quantity. Again, not a problem, you just need to expose your many-to-many and add a field.

At this point, you have a schema, you deployed it to your database, learned a bit more about your favorite ORM, cursed a few times because it was not 100% intuitive, and are rolling it out. Great!

SQL schema

Modeling Data with Graph

Modeling your customer order management system using graphs is much faster. First, property graph databases like Memgraph, don’t have a strict schema, so you don’t have to define it upfront. You need to differentiate customers, purchases, and products. Those will be nodes (vertices) in our graph with appropriate labels. You will then need to connect them and add properties on those connections, which is pretty straightforward too.

Can you handle many-to-many? Of course, graph databases support fast arbitrary relationships. Let’s make a layout diagram. It’s not as explicit as the relational schema, but it’s semantically clearer, and in general more intuitive.

Graph schema

Let’s write a few Cypher queries so you can populate your database. Unlike SQL, where you actually had to define the database schema beforehand, in graph databases, the insertion itself defines the structure. We won’t explain the queries in detail, feel free to check out our quick-start on Cypher. Let’s start by creating a Customer and a Product node:

CREATE (customer:Customer {name: "John Doe", id: 0}) 
CREATE (product:Product {brand: "Shooz", name: "X3D Ultra Turbo Pro 3.1", 
                         cost: 399.95, id: 0})
RETURN customer.id, product.id;

Notice how you haven’t added all the properties yet. You don’t need to know the whole data structure upfront, you can modify it as the need arises. If you’re building some sort of a web API in front of your database, the input queries don’t even have to be predefined. Here’s how a creation-query generator would look in Python (slightly different Cypher syntax to facilitate the property map generation):

def make_create_query(id, label, properties):
	"""
	Generates a Cypher query for creating a vertex with the given
	label and properties. Properties map strings (names) to primitive
	values (ints, floats, bools, strings).
	"""
    
	return "CREATE (node:{1} {{{2}}}) " \
       	"SET node.id = {0} RETURN id".format(
    	id, label, ", ".join("{}: {!r}".format(*kv) for kv in properties.items()))

Using such a function, you can easily create vertices with properties extracted from, for example, an HTTP URL. If the structure of your Customer or Product entities changes, you don’t have to modify the DB schema nor the back-end server (OK, you might want to validate the key-value pairs you get). Zero downtime redeployment comes down to switching to a new front-end. Pretty cool!

Now let’s handle creating orders and connecting customers to products. Assuming that you have the info on who the Customer is and what their Purchase is (handled by your app’s front end), the Cypher query would look something like this:

MATCH (customer:Customer {id: 42}),
  	(p1:Product {id: 123}),
  	(p2:Product {id: 654}),
  	(p3:Product {id: 12})
    
CREATE  (customer)-[:PURCHASED]->(purchase:Purchase {id: 0}),
    (purchase)-[:HAS {quantity: 3}]->(p1),
    (purchase)-[:HAS {quantity: 5}]->(p2),
    (purchase)-[:HAS {quantity: 1}]->(p3)

It’s fairly simple. First you MATCH the known Customer and Products and then you create a Purchase that is connected to them. Generating a query dynamically (the number of different products in a purchase varies) also shouldn’t be too hard.

As you can see, the Cypher implementation of your data model is simple and easily extendible. With SQL, there’s definitely more work to be done. There are five tables to define with their complete schemata. If you want to avoid writing raw SQL you also need to know and depend on a third-party tool. Personally, I used a few relational ORMs (SQLAlchemy the most), and while they helped it still wasn’t exactly enjoyable.

Querying Data with Cypher

Once you start getting some data in, you’ll obviously want to get it out. Let’s say you want to get all the products a specific customer purchased. This would be unpleasant to write in SQL because it would require five JOINs. A good ORM tool will probably make it substantially smaller and easier to handle. With Cypher it’s quite easy:

MATCH (:Customer {id: 42})-[:PURCHASED]->(purchase)-[purchase_product:HAS]->(product)
RETURN purchase.year, purchase_product.quantity, product.id

After all, the Cypher language was designed for easily querying connected data. Extending the query to include arbitrary filters can be done by query generation in your app’s backend, similar to creation queries. While not entirely trivial, it’s doable. In all fairness, an SQL ORM should probably make it even easier. This is a common and simple requirement, and it can be handled well.

Now let’s say you want to find the ten top-spending customers during a given time period (for simplicity let’s say during a calendar year). The SQL query would again be cumbersome to write, and an ORM would help. With Cypher it’s fast and intuitive:

MATCH (customer:Customer)-[:PURCHASED]->(purchase {year: 2017})-[purchase_product:HAS]->(product)
WITH customer, sum(purchase_product.quantity * product.price) AS yearly_expenditure
ORDER BY yearly_expenditure DESC
RETURN customer, yearly_expenditure LIMIT 10

Voila! You just wrote it and there are absolutely no JOINs.

Refactoring Your Data Model

Database schemata rarely remain unchanged throughout a system’s lifetime. Let’s say that due to a business expansion the variety of products on offer in your store suddenly greatly increased. You should now store many different product types that have a large number of incompatible attributes (both running shoes and mobile phones are products on offer).

With SQL databases there are several ways of tackling this problem, but none of them are really pleasant. ORMs typically attempt to map OOP class hierarchies to database columns, but the underlying storage limitations make it finicky. And again, the data structures need to be defined upfront, which means a lot of iterative discussing, refactoring, testing, and redeploying. Of course, it all needs to be done with zero downtime. This kind of intervention into a relational database is neither trivial nor fast to execute, even on a toy example. In a real-world scenario it’s point-blank avoided (quickly earning the “technical debt” tag).

With a property graph like Memgraph, it’s fairly simple. Every vertex can have an arbitrary number of labels, so a Product can at the same time also be of a specific type. It can even belong to multiple types, it all just comes down to adding a label. Here’s how it’s done:

MATCH (product:Product) WHERE product.brand in ["Nike", "Adidas"] 
SET product:TrainerShoe

With a few queries, you can label products making them easy to differentiate. The queries are likely to be simple enough to execute on a production database, but you should test on a clone first. Since there is no strict schema, the system can naturally evolve along with the business logic.

There are a number of other refactors you might need to introduce into the system as it grows. You might need to relate products to each other, introduce a concept of suborders, as well as introduce new fields. Most of these will be easier to handle in a graph than in a relational database.

Personally, I believe all of them will be easier to handle. A property graph is extremely flexible, both for data structure and connections. Many concepts are much easier to express naturally than in a relational database. In my opinion, when it comes to evolving a system, a property graph beats relational hands down.

Graph Database Performance

Over time the quantity of active data in your system will grow. The scale can vary wildly, and in all but the largest (think Google) scales, you would expect a modern database to deliver good and predictable performance. So, let’s test what kind of a scale can Memgraph handle. We are running this on a strong laptop (4 core i7 with 16GB RAM).

Creating a dataset

Here are a few Cypher queries to help you set up a mock dataset. First, you’ll need to create some indexes:

CREATE INDEX ON :Customer(id)
CREATE INDEX ON :Product(id)

Next, keeping things simple, let’s create ten thousand customers with the same name:

UNWIND range(0, 9999) AS ignored
CREATE (:Customer {id: counter('Customer.id', 0), name: "John Doe"})

Then, let’s create 10 000 products. Let’s add random prices so you can play with analytic queries:

UNWIND range(0, 9999) AS ignored
CREATE (:Product {id: counter('Product.id', 0), 
                  name: "Awesome product", 
                  price: rand() * 100})

Finally, ler’s generate 100 000 purchases from random customers to random products. To keep it simple, each purchase contains three products with random quantities:

UNWIND range(0, 99999) AS ignored
WITH tointeger(rand() * 10000) AS c_id,
 	tointeger(rand() * 10000) AS p1_id,
 	tointeger(rand() * 10000) AS p2_id,
 	tointeger(rand() * 10000) AS p3_id
MATCH (customer:Customer {id: c_id}),
      (p1:Product {id: p1_id}),
      (p2:Product {id: p2_id}),
      (p3:Product {id: p3_id})
CREATE  (purchase:Purchase {id: counter('Purchase.id', 0),
                            year: 2000 + tointeger(rand() * 18)}),
    (customer)-[:PURCHASED]->(purchase),
    (purchase)-[:HAS {quantity: 1 + tointeger(rand() * 5)}]->(p1),
    (purchase)-[:HAS {quantity: 1 + tointeger(rand() * 5)}]->(p2),
    (purchase)-[:HAS {quantity: 1 + tointeger(rand() * 5)}]->(p3)

After a few seconds, your mock dataset is ready for testing. It contains 120 thousand vertices and 400 thousand edges. Memgraph is using around 350MB of RAM which shouldn’t be a problem.

Let’s see how the query for finding a customer’s purchases performs.

MATCH (:Customer {id: 42})-[:PURCHASED]->(purchase)-[purchase_product:HAS]->(product)
RETURN purchase.year, purchase_product.quantity, product.id

The results (around 30 products) are obtained in around 0.1 milliseconds.

What about that analytical query for getting last year’s top 10 spenders?

MATCH (customer:Customer)-[:PURCHASED]->(purchase {year: 2017})-[purchase_product:HAS]->(product)
WITH customer, sum(purchase_product.quantity * product.price) AS yearly_expenditure
ORDER BY yearly_expenditure DESC
RETURN customer, yearly_expenditure LIMIT 10

This one takes around 106 milliseconds. Not too bad for an analytic query executed on a transactional database. It would also be important to measure performance when there are parallel inserts, but that’s for another time.

Let’s also try a ten times bigger scale. That means 100 thousand customers and 100 thousand products. This is already big business. Let’s connect them with a million purchases (that’s almost a thousand purchases daily, for three years). Generating a new dataset on the given scale takes a bit longer (about a minute). Once set up, Memgraph uses around 2GB of RAM. Getting all the products a customer purchased still takes only a few milliseconds. The analytic query for the top 10 spenders in 2017 takes around 782 milliseconds, which is also reasonable.

Note that all of the queries above execute in a single thread in Memgraph and the test machine was nowhere near at full load. Also, don’t forget that the test machine is a personal laptop running other things in the background.

Conclusion: The Infamous Graph Use-case

In this post, you learned how a graph database can help you model, query, and expand your data in a faster and more intuitive way when compared with a traditional SQL approach.

You often come across the question: What is a use-case for graph databases?

Typically, you start thinking about queries spanning a large number of connections (relationships, edges) or finding the shortest path from point A to point B. I believe that to be a very limiting and simplistic mindset.

What if the question was: Do you need a lightweight, fast, scalable, initially flexible, and refactor-friendly solution?

The answer: Who doesn’t?

Table of Contents
Sign up for our Newsletter

Continue Reading