Back to blog

# Handling large graph datasets

January 30, 2024
Ante Javor

As you start to read this blog post, you probably have an answer to the question “What is a large graph dataset?”. In your mind, it’s probably on the scale of your current production data plus a few years of production expansion of that dataset. That means that different dataset sizes mean different things to people. For example, 1 million nodes and relationships sound like a large dataset to someone. Meanwhile, people have datasets with up to a few hundred billion of nodes and relationships.

So, how to define a large graph dataset? So let's say that a big dataset is so big that if Memgaph is poorly configured or operated, transactions last infinitely long instead of just a few seconds. This blog post will give you a few ideas on how to avoid such pitfalls.

If we talk about specific sizes in general and not related to the blog post, for the sake of understanding, let's assume that this is the scale:

SizeNumber of entities
Extra small (XS)From 0 to around 100k graph entities or less (nodes + edges)
Small (S)From 100k to a million graph entities (nodes + edges)
Medium (M)From 1 million to 10 million graph entities (nodes + edges)
Large (L)From 10 million to 100 million graph entities (nodes + edges)
Extra large (XL)From 100 million to a billion graph entities (nodes + edges)
Extra extra large (XLL)From billion to infinity

The nice thing about this approximation is that the difference is about a factor of 10, which means you can run performance experiments on the extra small dataset and interpolate further. This, of course, assumes that things can be interpolated linearly. Hence, you can test two data points on extra small and small-size datasets, and the rest should follow a similar pattern.

If you make a wrong step on an extra small or a small-size dataset, the operation will take a few minutes. Multiply that by a factor of 10, and your mistake will take more like a few hours on a Medium or Large dataset.

The number of graph entities is not the only challenge. Graph density factor, high node degree, and the number of properties can form another challenge when handling large graphs. But they differ in each dataset and won’t be the focus of this blog.

Now that the basic understanding of scale has been reached, it is time to start with modeling.

On a large graph dataset modeling is quite important and it should happen before trying to import any data because you want to have a clear understanding of how your graph structure should look.

The key part when talking about the scale is the question of how to apply structure to nodes, relationships, and properties. On a bigger scale, it is a trade-off between the memory usage and execution speed.

If your use case is a book database, and each `Book` is a node, a typical question is whether the book genre should be a property of the `Book` node or whether there should be separate nodes for different genres.

Understanding what and how you will query the database will solve this dilemma. Let’s assume the goal is to search for books that have specific genres.

In the case of genre being a node property, you would need to touch all the `Book` nodes in the database or create an index on a very redundant property. As you can assume, there are a lot of thrillers out there, and that index would not work great. This would lead to slow execution time and moderate RAM usage if you use the index.

In the case of the `Genre` as a node, you have much less redundancy for the exact property, and your query can now start from the `Genre` node. As you can assume, there are much fewer genres out there than real books. Hence, the range of searches is much smaller now, and each genre is connected to appropriate books. This leads to having better execution speed but a bit higher memory usage, not because you need extra nodes but because each Book node needs a relationship to the Genre nodes.

Once you understand the basics of modeling a graph, the above-described dilemma is the typical problem that will pop out at some point. There are a few resources written on this topic, such as Memgraph graph-modeling.

Keep in mind that at this point, you should be pretty sure how you will index and set constraints on your graph model.

## Data import

Now that you have a clear picture of the graph model, how will you import the data?

There are two recommended options for importing the data in Memgraph:

There are other ways of importing data, but these two proved to be the most scalable in regard to performance. Below are guidelines and the rough data point for performance and what you can expect from each process. Keep in mind that these numbers are not from a production server.

For the purpose of simulating import, the Graph500 dataset is used. The same data is being used to test the performance of the supercomputers. There are multiple sizes of the Graph500 datasets. If you go to the Network repository, you will see the following table:

SizeNodesRelationships
Graph500 - scale 18 (Medium)174k8M
Graph500 - scale 19 (Medium)335k15M
Graph500 - scale 20 (Medium)646k31M
Graph500 - scale 21 (Medium)1M63M
Graph500 - scale 22 (Large)2M128M
Graph500 - scale 23 (Large)5M259M

There are multiple data sizes that correspond to previously large and medium-scale datasets defined at the beginning of the blog post. In this case, experiments will be on the medium scale 18 because it is easier to operate on a 16GB machine.

Cypher is the query language used by Memgraph and can be leveraged to do quite efficient import into Memgraph. The nice thing about loading data via Cypher commands is that it is the most flexible way and provides the most control in loading your data into Memgraph. Since you have control over the process, the downside is that it requires some work, and also it is quite easy to shoot yourself in the foot. To avoid this, here are a few pointers when importing large datasets.

Your data can be in CSV, TSV, Parque, TXT, or other file format. The format is not important, as you can open it in your backend service and transform it into Cypher.

To use a concrete example, in the Graph500 dataset, there is a list of adjacent nodes. Each pair represents a unique relationship between nodes. No nodes are defined in the dataset. Each unique ID value in the relationship is a node. Here is how the file looks like:

``````2 1
3 1
4 1
6 1
29 1
43 1
44 1
51 1
53 1
87 1``````

For example, the line 1 pair (2, 1) represents the relationship between Node (1) and Node (2). So, for line 1, it is necessary to create two nodes before creating a relationship since the nodes are not present in the database.

In general, it is a good process to create the nodes first, because that allows you to create nodes concurrently and connect them with relationships later.

The simplest form of node creation goes something like this:

``````# Rest of the code was omitted for brevity
for node in node_set
create_nodes.append(f"CREATE (n:Node {{id: {node}}})")

for query in create_nodes:
cursor.execute(query)
conn.commit()``````

In the code above, the details about the driver, file handling, and extraction of the unique values are omitted since they are unimportant, as it will be in the rest of the code samples.

What is important is that for each node in the dataset (174k nodes) , there are 174k separate transactions, creating a single node in each transaction. Each transaction needs to generate a query plan, deltas, etc. There is a lot of overload included. It took 137.27 seconds to finish the import of just 174k nodes. This gives us 1.3k nodes per second, which is very slow. But what we are actually measuring here is the single core transaction performance since the bottleneck is not writing to the database but rather the transactions per second.

If you try to minimize that overhead, you can start by introducing parameters. Now, the query will look like this:

``````# Rest of the code omitted for brevity (opening a file, getting unique nods, and formatting)
for node in node_set
create_nodes.append(("CREATE (n:Node {id: \$id})",  {"id": node}))

for query, argument in create_nodes:
cursor.execute(query, argument)
conn.commit()
``````

With this code, for each of 174k transactions, we are reusing the Memgraph query plan for each transaction. It is not being calculated every time. This lead to an improved speed of around 1.7k nodes per second.

However, the query plan is quite a small overhead compared to running everything as a single transaction. You can run multiple queries as part of a single transaction.

Here is a single transaction that will execute the 174k nodes inserts as a single commit.

``````# Rest of the code omitted for brevity (opening a file, getting unique nods and formatting)
for node in node_set
create_nodes.append(("CREATE (n:Node {id: \$id})",  {"id": node}))

for query, argument in create_nodes:
cursor.execute(query, argument)
conn.commit()
``````

As you can see, the commit is being called after all queries have been executed. This led to big improvement, and now the import speed is 3.4k nodes per second. But you do not need to run 174k queries to create a single node in each query, you could use a single transaction and single query call.

Here is how the query looks like:

``````    query = """
WITH \$batch AS nodes
UNWIND nodes AS node
CREATE (n:Node {id:node.id})
"""

time_start = time.time()
cursor.execute(query, {"batch": create_list})
conn.commit()
time_end = time.time()
return time_end - time_start``````

As you can see from the query, the batch of 174k IDs has been sent to write to Memgraph. Then by using WITH and UNWIND, Memgraph will create a node for each entry. You should always chunk your work into the least number of queries and transactions since it will lead to big improvements on a larger scale. This now runs at the rate of creating 145k nodes per second.

Currently, all performance comes from single-core processing. The next step could be running this in parallel. You could split your dataset into chunks of data and execute query concurrently:

``````        print("Starting processing chunks...")
start = time.time()
with multiprocessing.Pool(10) as pool:
results = pool.starmap(process_chunk, [(query, chunk) for chunk in chunks])
TOTAL_TIME = sum(results)
end = time.time()
print("Processing chunks finished in ", end - start, " seconds")
``````

Each chunk is executed as a separate process. Each process has an open connection to the database, and is running a single transaction and a single query as a batch:

``````    #Rest of proces_chunk omitted for brevity
conn = mgclient.connect(host='127.0.0.1', port=7687)
cursor = conn.cursor()
time_start = time.time()
cursor.execute(query, {"batch": create_list})
conn.commit()
time_end = time.time()
conn.close()
return time_end - time_start``````

This leads to 386k nodes per second. Changing the number of processes or chunk size will change the speed, but that is dependent on the actual underlying dataset and the number of hardware cores. The environment can also play an important role. Here is a summary of the results on the Docker container and native Linux package:

ConfigurationExecution time DockerNodes per secondExecution time nativeNodes per second native
Single thread - A Single transaction and query per node137.27 seconds1.2k127.87 seconds1.3k
Single thread - A Single transaction and query per node with query parameters128.34 seconds1.3k97.38 seconds1.7k
Single thread - A single transaction for all queries, with query per each node and parameters67.52 seconds2.5k50.98 seconds3.4k
Single thread - Single transaction and single query for all nodes with batch parameters1.36 seconds127k1.20 seconds145k
10 threads, 10000k chunks - Single transaction and single query per batch with batch parameters0.52 seconds334k0.45 seconds386k

As you can see, in most cases, the native version of Memgraph is a bit faster than the Docker version of Memgraph. Also, notice how from naive 1.3k nodes per second, the performance increased to 386k nodes per second. It is important to follow this guideline if you plan to import a large dataset in Memgraph.

The same steps mentioned above apply to relationships. But remember that before creating a relationship, nodes that the relationship will connect already need to be created or matched. Here is an example:

`` MATCH (a:Node {id: node.a}), (b:Node {id: node.b}) CREATE (a)-[:RELATIONSHIP]->(b)``

Since you match the nodes first, your relationship import won’t work if you do not index the `:Node(id)` property. This is a must for importing a relationship.

Also, relationships have one limitation. Creating relationhsips connecting the same nodes concurrently will yield a conflicting transaction, so you must retry if a conflict happens, this makes code for inserting edges a bit more complicated.

Here is the sample code to perform a batch insert of the edges on the single thread:

``````def execute_chunk(query, create_list, max_retries=100, initial_wait_time=0.200, backoff_factor=1.1):
conn = mgclient.connect(host='127.0.0.1', port=7687)
cursor = conn.cursor()
time_start = time.time()
for attempt in range(max_retries):
try:
cursor.execute(query, {"batch": create_list})
conn.commit()
break
except Exception as e:
wait_time = initial_wait_time * (backoff_factor ** attempt)
print(f"Commit failed on attempt {attempt+1}. Retrying in {wait_time} seconds...")
time.sleep(wait_time)
else:
print("Max retries reached. Commit failed.")
conn.close()
return time.time() - time_start
time_end = time.time()
conn.close()
return time_end - time_start``````

Running relationships in a single query and transaction via multiple concurrent threads will result in 80k relationships per second for the native version and around 74k for the Docker version of Memgraph. Relationship loading is slower due to possible conflict occurrence, retries, and matching of nodes.

This can be improved if you pay attention to how your data is structured and avoid writing relationships to the same node via multiple threads. In this dataset, Graph500 has nodes with a degree of 50k, which goes into supernode territory, slowing the performance.

If we have a CSV file with node IDs, the query for import looks like this:

``````    LOAD CSV FROM '/usr/lib/memgraph/nodes.csv' WITH HEADER AS row
CREATE (n:Node {id: row.id})
``````

The nodes from the Graph500 file were first converted to a CSV file, and it took 0.7 seconds to load 174k nodes, which is around 250k nodes per second.

The same goes for the relationships - first, they were converted to a CSV file, and imported using the following command:

``````    LOAD CSV FROM '/usr/lib/memgraph/relationships.csv' WITH HEADER AS row
MATCH (source:Node {id: row.source}), (sink:Node {id: row.sink})
CREATE (source)-[:RELATIONSHIP]->(sink)
``````

The relationships were imported at a rate of 75k relationships per second. It is important to note that `LOAD CSV` command also utilizes an index, so it is critical to set the index on `:Node(id)` in this case.

If you want to see more scalable version of Load CSV go to How to Import 1 Million Nodes and Edges per Second into Memgraph.

## Building on top of your graph model

At some point after the import is done, you will probably start optimizing for query performance. That thought process should have happened in the graph modeling part already. If you didn’t miss the graph structure for a mile, and don’t need to get back to the drawing board, indexing may get you out of trouble.

Indexing is pretty notorious. If you index your data properly, you will have fast queries in your database. A lot of information can be found on the topic of indexes in Memgraph.

When the focus is a large graph dataset, there are a few tips that are pretty general but fundamental in the database world:

1. Understand query patterns - understanding what you want from your database will be fundamental in modeling and performance optimization, since that should drive the decision of how to model and how to index. You want to index all the queries that will hit the graph on a larger scale. Compared to relational databases, graph databases are particularly slow if indexes are not properly set. That is why indexes are important for importing relationships.

2. Understand indexes - Understanding the types of indexes you have at your disposal will limit the range of the search that can happen in your query plan, hence leading to lower execution time. Memgraph currently supports the label and label-property index, none of them are applied by default.

3. Avoid over-indexing - Going all in with indexes and applying indexes on every node, property, and label, will impact your write speed performance and increase total memory usage. If you over-index, indexes need to be updated on each write, hence it slows down the write operation. In Memgraph, indexing is the process of loading properties into the skip list that provides fast access, aldo being in-memory, Memgraph also consumes extra RAM to store the index, hence increasing your total cost of ownership.

If you are not satisfied with the performance of some particular query, you should probably inspect that query using the PROFILE command and double-check your query plans for some extra secrets.

Do not forget that keeping a large dataset in a consistent state is a must, and to further reinforce your graph structure, increase consistency, and avoid redundancy, use constraints you see necessary. The advice here is similar to the advice concerning indexes, you do not want to set constraints on things that are not important just for the sake of having them since it can also kill your performance.

## Configuring Memgraph for larger-scale operations

There are a few configuration flags you may want to know about, so you can fine-tune them to fit into your production environment and the scales you need.

If you are going to import a billion nodes via LOAD CSV command, it will probably take a bit more than 10 minutes. You do not want your import to fail due to timeout, so if you need a bigger execution time frame, bump up the default 10 minutes `--query-execution-timeout-sec=600`. The flag defines the 10 minutes allowed to run a single query.

The Memgraph storage engine has a garbage collector that will run periodically, and the interval is defined by the `--storage-gc-cycle-sec=30` configuration flag. Running garbage collector every 30 seconds on a larger scale can impact the server's performance. If you have a lot of volatility in data, maybe consider running it more often so it does not make big deletes that could take some processing capacity form the server, thus slowing down your Memgraph operation. If you have a low volatility in your data, it would be running it less often, since it is just wasting server resources.

By utilizing snapshots and WAL Memgraph is persistent. In case of a failure, you can restart the server, and data will be loaded from the snapshots. By default, Memgraph will keep the three latest snapshots stored on disk. So if your snapshot takes 300GB, you are looking at 1T of disk storage for snapshots. If you have disk space available, consider increasing it: `--storage-snapshot-retention-count=3`, which means you will have more history if something goes sideways.

The snapshot flag should be strategically combined with the ` **--storage-snapshot-interval-sec=300`** that creates a snapshot every 5 minutes. If you have 3 snapshots, each 5 minutes apart, you have 15 minutes of the latest history to undo some changes before they become permanent. On a bigger scale, it is a good idea to increase this period since it can take a few minutes to create a snapshot on a billion nodes and relationships scale.

Having billions of nodes and relationships in your dataset, RAM will take some hits. You will need serious machinery to store all graph entites. If you model your data without having properties on relationships, you should set the `--storage-properties-on-edges` to `false`. That way, you will save a lot of memory on each relationship since they will not be created as full objects but rather as small pointers. This will significantly lower your total cost of ownership.

As the story with the total cost of ownership and RAM comes along, we have on-disk transactional mode. This supports storing the data directly to the disk without using RAM as primary storage, so you do not need an expensive and big instance with a lot of RAM. Of course, this will be magnitudes slower and still require some engineering work, but we are improving it!

## Monitoring and an extra pair of hands

Having a large dataset means having a big machine to manage that supports many services and people. It goes without saying that monitoring for that type of scale is a must. If you are managing a large Memgraph instance, we have monitoring metrics as part of our Enterprise edition.

Also, if you need an extra pair of hands to manage large-scale datasets in Memgraph, let us know on Discord, and we will be happy to help.