How to Import 1 Million Nodes and Edges per Second Into Memgraph

April 4, 2024
Ante Javor

In this blog post, we focus on the complexities of dataset import into databases and introduce the most efficient methods for large-scale graph data import. We’ll emphasize using optimized LOAD CSV commands and concurrent operations within Memgraph's architecture to achieve high-speed data import with minimal resources.

In the upcoming weeks, we’ll dedicate several blog posts to this import best practices series, complementing the Import Best Practices chapter in the Memgraph docs.

Simplify Data Imports

Importing datasets into databases presents its own set of challenges, shaped by various factors such as timing needs, dataset formats, and specific use-case requirements. The path to efficient data import is not one-size-fits-all; rather, it's influenced by these diverse considerations.

Different databases are designed with unique architectures, each favoring certain data import methodologies over others. Memgraph, with its different storage modes, offers versatile options for dataset import, developed to accommodate different scenarios.

It's important to know, however, that not all import methods are equally efficient.

This introduction aims to guide you toward the fastest and most resource-effective strategies for importing large graph datasets into Memgraph.

Untangling the Import Complexity

Our documentation includes a dedicated section on Import Best Practices, packed with insights and strategies for managing data imports effectively. This resource is especially valuable when you're looking to import upwards of a million graph elements, providing a solid foundation for tackling large datasets.

Key Takeaway for Large-Scale Data Imports

For those dealing with large datasets and seeking the quickest import method, the LOAD CSV command is your go-to solution. Memgraph has fine-tuned the LOAD CSV process to offer peak efficiency, especially when your data is in CSV format, ensuring a smooth and rapid import experience.
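For reference, a single LOAD CSV run is just one Cypher query. A minimal sketch, assuming a nodes.csv file with an id column (the path and Node label are illustrative):

LOAD CSV FROM "/usr/lib/memgraph/nodes.csv" WITH HEADER AS row
CREATE (n:Node {id: row.id});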

Prerequisites for High-Speed Data Import into Memgraph

Initial considerations for smaller datasets

  • For datasets with fewer than a few million graph entities, a single-threaded LOAD CSV command is typically sufficient and efficient.

Boosting write speeds for larger datasets

  • For importing millions of graph entities per second, use multiple concurrent LOAD CSV commands in IN_MEMORY_ANALYTICAL mode. This approach requires meeting some straightforward prerequisites.

Hardware requirements

  • Your system should ideally have multiple cores; four or more will do just fine.

Organizing CSV files

  • Divide your CSV files into two main types: nodes and relationships.

  • Further split these files into smaller batches (e.g., nodes_0.csv, nodes_1.csv, etc.), where each file represents a segment of your dataset. This division allows for parallel execution of multiple LOAD CSV commands, leveraging the full potential of your hardware cores.

Determine batch size

  • If your dataset includes 100 million nodes, consider dividing nodes into batches of 5 to 10 million nodes each to prevent files from becoming too large, and to optimize for core count.

  • For a 32-core CPU, split your CSV files into 32 or more batches to maximize CPU utilization. Fewer batches may underutilize the available threads, slowing down the import (one way to split the files is sketched below).
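As an illustration, here is one way to split a large nodes file into evenly sized chunks. This is a minimal sketch, assuming a single header row; the file names and chunk size are arbitrary choices, not anything Memgraph requires:

import csv
from pathlib import Path

def split_csv(source: Path, rows_per_chunk: int, prefix: str = "nodes"):
    # Writes nodes_0.csv, nodes_1.csv, ... each carrying the original header
    with source.open(newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, chunk = None, None, 0
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out:
                    out.close()
                out = (source.parent / f"{prefix}_{chunk}.csv").open("w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunk += 1
            writer.writerow(row)
        if out:
            out.close()

# Example: 100 million nodes in 5-million-row chunks -> 20 files
split_csv(Path("nodes.csv"), rows_per_chunk=5_000_000)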

Storage mode requirement

Make sure Memgraph is set to IN_MEMORY_ANALYTICAL storage mode. This setting is crucial to prevent the interruption of the import process by conflicting transaction errors.
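You can switch modes at runtime with a single query (and verify the current mode with SHOW STORAGE INFO;):

STORAGE MODE IN_MEMORY_ANALYTICAL;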

Maximizing CPU Efficiency During Import

Before you start the import process, make sure you're equipped with multicore hardware, and your CSV files are split according to your dataset's needs. Confirm Memgraph is running in IN_MEMORY_ANALYTICAL mode for optimized import.

If you're using Docker to run Memgraph, transfer the CSV files into the container where Memgraph operates. This makes the files accessible for import.
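For example, assuming the container is named memgraph (adjust the name and paths to your setup):

docker cp nodes_0.csv memgraph:/usr/lib/memgraph/nodes_0.csv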

Based on the paths of the files, you can generate the queries, each using a different file from the total pool of files, or parametrize a single query, where each parameter is a different file.

Example for preparing and transferring files:

import subprocess
from pathlib import Path

# `size` is defined elsewhere in the script and selects the Graph500 dataset scale
target_nodes_directory = Path(__file__).parents[3].joinpath(f"datasets/graph500/{size}/csv_node_chunks")

# Copy each node chunk into the running Memgraph container
for file in target_nodes_directory.glob("*.csv"):
    subprocess.run(["docker", "cp", str(file), f"memgraph:/usr/lib/memgraph/{file.name}"], check=True)

# Build one LOAD CSV query per chunk
queries = []
for file in target_nodes_directory.glob("*.csv"):
    queries.append(f"LOAD CSV FROM '/usr/lib/memgraph/{file.name}' WITH HEADER AS row CREATE (n:Node {{id: row.id}})")
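Relationship chunks follow the same pattern. A sketch of the per-file query, assuming each row carries start_id and end_id columns and an illustrative RELATES edge type:

LOAD CSV FROM '/usr/lib/memgraph/relationships_0.csv' WITH HEADER AS row
MATCH (a:Node {id: row.start_id}), (b:Node {id: row.end_id})
CREATE (a)-[:RELATES]->(b);

Creating an index beforehand with CREATE INDEX ON :Node(id); keeps those MATCH lookups fast.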

Execution Code for Concurrent Execution

Once the files and queries are in place, you can start the concurrent query execution using multiprocessing. Multiple processes are specific to Python (they sidestep the global interpreter lock); in most other programming languages, threads will be sufficient.

LOAD CSV runs concurrently because each query executes in a separate process, and each process opens its own connection to the database.

Keep in mind that the number of processes you can usefully run depends on your hardware. Generally, it should be close to the number of threads on the machine; going above that number will not bring any performance improvement and can even slow down the import process.
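In Python, one way to pick that number instead of hard-coding it:

import multiprocessing

# One process per hardware thread, but never more than there are files to load
num_processes = min(len(queries), multiprocessing.cpu_count())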

Here is the code that spawns ten different processes, each running a separate CSV file via its own session, in parallel:

import multiprocessing

from neo4j import GraphDatabase

HOST_PORT = "bolt://localhost:7687"  # adjust to your deployment

def execute_csv_chunk(query):
    try:
        # Each process opens its own connection to the database
        driver = GraphDatabase.driver(HOST_PORT, auth=("", ""))
        with driver.session() as session:
            session.run(query)
    except Exception as e:
        print("Failed to execute transaction")
        raise e

with multiprocessing.Pool(10) as pool:
    pool.starmap(execute_csv_chunk, [(q,) for q in queries])

# Rest of the code...

Order of Import

It is important to first import all the nodes and only after that import all the relationships. Relationships are created by matching their endpoint nodes, so if you skip this ordering, the dataset won’t be imported properly.
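Putting it together, a minimal sketch of the overall flow, assuming node_queries and relationship_queries were built the same way as queries above:

with multiprocessing.Pool(10) as pool:
    # Nodes first...
    pool.starmap(execute_csv_chunk, [(q,) for q in node_queries])
    # ...then relationships, once every endpoint node exists
    pool.starmap(execute_csv_chunk, [(q,) for q in relationship_queries])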

Performance Expectations

On the multicore hardware described above, importing the Graph500 dataset resulted in 1.2 million node inserts per second and 1.2 million relationship inserts per second.

The same or better can be achieved on your datasets and hardware; of course, there will be some variation between implementations.

Transactional ACID-Supported Solution

Having full ACID support and running transactions while importing billions of nodes and edges puts stress on the system and slows down the import process. Memgraph’s IN_MEMORY_ANALYTICAL mode unlocks concurrent LOAD CSV performance. However, it’s important to understand the implications of that mode: it does not provide full ACID guarantees, but it is safe to use if operated correctly.

If you are looking for full ACID-enabled import, you can run single-threaded LOAD CSV or concurrent Cypher import. For more details on how to do it, check out the Cypher best practices under Memgraph docs.

Conclusion

We’ve covered the technical specifics of importing massive datasets into Memgraph at very high speeds, focusing on the efficiency and optimization of the import process. We also discussed the importance of preprocessing and splitting your CSV files and the use of concurrent operations to achieve high throughput.

If you have any questions or ideas related to this topic, consider joining us and our community on Discord. I’d love to chat about this and help you improve your import process.

Further Reading

If you are interested in the broader aspects of working with large datasets in Memgraph, head over to the Handling large graph datasets blog post, where I take a broader view of dealing with large graph datasets: starting from the conceptual understanding of what constitutes a large dataset and extending into modeling, data import methods, and performance optimization.

In that blog post, I’ve also covered the importance of the planning phase before data import, including graph modeling, the selection of data import methods (Cypher commands vs. LOAD CSV), and the importance of indexing and constraints for query performance and data consistency.
