Import data from Parquet files

Data from Parquet files can be imported with the LOAD PARQUET Cypher clause, either from the local disk or from S3-compatible storage.

LOAD PARQUET Cypher clause

The LOAD PARQUET clause uses a background thread that reads the Parquet file in column batches, assembles them into row batches of 64K rows, and places those batches into a queue. The main thread then pulls each batch from the queue and processes it row by row. For every row, it binds the parsed values to the specified variable and either populates an empty database or appends the new rows to an existing dataset.

LOAD PARQUET clause syntax

The syntax of the LOAD PARQUET clause is:

LOAD PARQUET FROM <parquet-location> ( WITH CONFIG configs=configMap )? AS <variable-name>
  • <parquet-location> is a string that specifies where the Parquet file is located.
    If the path does not start with s3://, it is treated as a local file path. If it does start with s3://, Memgraph retrieves the file from the S3-compatible storage using the provided URI. There are no restrictions on the file’s location within your local file system, as long as the path is valid and the file exists. If you are running Memgraph in Docker, you will need to copy the files from your local directory into the Docker container where Memgraph can access them.

  • <configs> is an optional configuration map through which you can specify the following options: aws_region, aws_access_key, aws_secret_key, and aws_endpoint_url.

    • <aws_region>: The region in which your S3 service is located.
    • <aws_access_key>: The access key used to connect to the S3 service.
    • <aws_secret_key>: The secret key used to connect to the S3 service.
    • <aws_endpoint_url>: An optional parameter that sets the URL of the S3-compatible storage.
  • <variable-name> is a symbolic name for the variable to which the contents of each parsed row will be bound, enabling access to the row contents later in the query. The variable doesn’t have to be used in any subsequent clause.
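
For example, a query reading from S3-compatible storage with an inline configuration map might look like the following sketch (the bucket name, region, and credentials are placeholders):

LOAD PARQUET FROM "s3://my-bucket/people.parquet"
WITH CONFIG configs={aws_region: "eu-west-1", aws_access_key: "<access-key>", aws_secret_key: "<secret-key>"}
AS row
RETURN row LIMIT 10;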

LOAD PARQUET clause specificities

When using the LOAD PARQUET clause, keep the following in mind:

  • Type handling:
    The parser reads each value using its native Parquet type, so you should receive the same data type inside Memgraph. The following types are supported: BOOL, INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, HALF_FLOAT, FLOAT, DOUBLE, STRING, LARGE_STRING, STRING_VIEW, DATE32, DATE64, TIME32, TIME64, TIMESTAMP, DURATION, DECIMAL128, DECIMAL256, BINARY, LARGE_BINARY, FIXED_SIZE_BINARY, LIST, MAP.
    Any unsupported types are automatically stored as strings. Note that UINT64 values are cast to INT64 because Memgraph does not support unsigned 64-bit integers, and the Cypher standard only defines 64-bit signed integers.

  • Authentication parameters:
    Parameters for accessing S3-compatible storage (aws_region, aws_access_key, aws_secret_key, and aws_endpoint_url) can be provided in three ways:

    1. Directly in the LOAD PARQUET query using the WITH CONFIG clause.
    2. Through environment variables: AWS_REGION, AWS_ACCESS_KEY, AWS_SECRET_KEY, and AWS_ENDPOINT_URL.
    3. Through run-time database settings, using SET DATABASE SETTING <key> TO <value>. The corresponding setting keys are aws.access_key, aws.region, aws.secret_key, and aws.endpoint_url.
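
    For example, the region could be set at run time like this (a minimal sketch; the value is a placeholder):

    SET DATABASE SETTING "aws.region" TO "eu-west-1";
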
  • The LOAD PARQUET clause is not a standalone clause, meaning a valid query must contain at least one more clause, for example:

    LOAD PARQUET FROM "/people.parquet" AS row
    CREATE (p:Person) SET p += row;

    Consequently, the following query will throw an exception:

    LOAD PARQUET FROM "/file.parquet" AS row;

    Adding a MATCH or MERGE clause before LOAD PARQUET allows you to match certain entities in the graph before running LOAD PARQUET, optimizing the process because the matched entities do not need to be searched for on every row of the Parquet file.

    However, a MATCH or MERGE clause can precede the LOAD PARQUET clause only if it returns a single row, as in the sketch below. Returning multiple rows before calling the LOAD PARQUET clause will cause a Memgraph runtime error.
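
    For example, the sketch below merges a single node before the import; the MERGE produces exactly one row, so it is a valid prefix to LOAD PARQUET (the :City label, the name column, and the /people.parquet file are illustrative assumptions):

    MERGE (c:City {name: "London"})
    LOAD PARQUET FROM "/people.parquet" AS row
    CREATE (p:Person {name: row.name})-[:LIVES_IN]->(c);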

  • The LOAD PARQUET clause can be used at most once per query, so queries like the one below will throw an exception:

    LOAD PARQUET FROM "/x.parquet" AS x
    LOAD PARQUET FROM "/y.parquet" AS y
    CREATE (n:A {p1 : x, p2 : y});

Increase import speed

You can significantly increase data-import speed when using the LOAD PARQUET clause by taking advantage of indexing, batching, and analytical storage mode.

1. Create indexes

LOAD PARQUET can establish relationships much faster if indexes on nodes or node properties are created after loading the associated nodes:

CREATE INDEX ON :Node(id);

If LOAD PARQUET is merging existing data rather than creating new records, then create the indexes before running the import.
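
For example, a merge-style import would create the index first and then merge on the indexed property for every row. A minimal sketch, assuming a hypothetical /people.parquet file with an id column:

CREATE INDEX ON :Person(id);

LOAD PARQUET FROM "/people.parquet" AS row
MERGE (p:Person {id: row.id})
SET p += row;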

2. Use periodic commits

The USING PERIODIC COMMIT <BATCH_SIZE> construct optimizes memory allocation and can improve import speed by 25–35% based on our benchmarks.

USING PERIODIC COMMIT 1024
LOAD PARQUET FROM "/x.parquet" AS x
CREATE (n:A) SET n += x;

3. Switch to analytical storage mode

Import performance can also improve by switching Memgraph to analytical storage mode, which relaxes ACID guarantees except for manually created snapshots. Once the import is complete, you can switch back to transactional mode to restore full ACID guarantees.

Switch storage modes within a session:

STORAGE MODE IN_MEMORY_{TRANSACTIONAL|ANALYTICAL};
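
For example, a complete import session might look like the following sketch (the file name and label are placeholders):

STORAGE MODE IN_MEMORY_ANALYTICAL;

LOAD PARQUET FROM "/x.parquet" AS x
CREATE (n:A) SET n += x;

STORAGE MODE IN_MEMORY_TRANSACTIONAL;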

4. Run imports in parallel

When using IN_MEMORY_ANALYTICAL mode and storing nodes and relationships in separate Parquet files, you can run multiple concurrent LOAD PARQUET queries to accelerate the import even further.

For best performance:

  1. Split node and relationship data into smaller files.
  2. Run all LOAD PARQUET statements that create nodes first.
  3. Then run all LOAD PARQUET statements that create relationships.
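
For example, with node data split across two files, the two node imports below could run concurrently in separate sessions, and the relationship import would run once both finish. A minimal sketch; the file names, the :Node label, and the id, from_id, and to_id columns are illustrative assumptions:

// Session 1
LOAD PARQUET FROM "/nodes_part1.parquet" AS row
CREATE (n:Node {id: row.id});

// Session 2, running at the same time
LOAD PARQUET FROM "/nodes_part2.parquet" AS row
CREATE (n:Node {id: row.id});

// After both node imports finish, index the join property and import relationships
CREATE INDEX ON :Node(id);

LOAD PARQUET FROM "/relationships.parquet" AS row
MATCH (a:Node {id: row.from_id}), (b:Node {id: row.to_id})
CREATE (a)-[:CONNECTED_TO]->(b);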

Usage example

In this example, we will import multiple Parquet files containing distinct graph objects. The data is split across four files; each file contains nodes of a single label or relationships of a single type.

Parquet files

  • people_nodes.parquet is used to create nodes labeled :Person.
    The file contains the following data:

    id,name,age,city
    100,Daniel,30,London
    101,Alex,15,Paris
    102,Sarah,17,London
    103,Mia,25,Zagreb
    104,Lucy,21,Paris
  • restaurants_nodes.parquet is used to create nodes labeled :Restaurant.
    The file contains the following data:

    id,name,menu
    200,Mc Donalds,Fries;BigMac;McChicken;Apple Pie
    201,KFC,Fried Chicken;Fries;Chicken Bucket
    202,Subway,Ham Sandwich;Turkey Sandwich;Foot-long
    203,Dominos,Pepperoni Pizza;Double Dish Pizza;Cheese filled Crust
  • people_relationships.parquet is used to connect people with the :IS_FRIENDS_WITH relationship.
    The file contains the following data:

    first_person,second_person,met_in
    100,102,2014
    103,101,2021
    102,103,2005
    101,104,2005
    104,100,2018
    101,102,2017
    100,103,2001
  • restaurants_relationships.parquet is used to connect people with restaurants using the :ATE_AT relationship.
    The file contains the following data:

    PERSON_ID,REST_ID,liked
    100,200,true
    103,201,false
    104,200,true
    101,202,false
    101,203,false
    101,200,true
    102,201,true

Import nodes

Each row will be parsed as a map, and its fields can be accessed using the property lookup syntax (e.g., id: row.id). Files can be imported directly from S3 or downloaded and then accessed from the local disk.

The following query loads the file row by row and creates a new node for each row, with properties based on the parsed row values:

LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_nodes.parquet" AS row
CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});

In the same manner, the following query will create new nodes for each restaurant:

LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_nodes.parquet" AS row
CREATE (n:Restaurant {id: row.id, name: row.name, menu: row.menu});

Create indexes

Creating an index on a property used to connect nodes with relationships, in this case, the id property of the :Person nodes, will speed up the import of relationships, especially with large datasets:

CREATE INDEX ON :Person(id);

Import relationships

The following query will create relationships between the people nodes:

LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_relationships.parquet" AS row
MATCH (p1:Person {id: row.first_person})
MATCH (p2:Person {id: row.second_person})
CREATE (p1)-[f:IS_FRIENDS_WITH]->(p2)
SET f.met_in = row.met_in;

The following query will create relationships between people and restaurants where they ate:

LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_relationships.parquet" AS row
MATCH (p1:Person {id: row.PERSON_ID})
MATCH (re:Restaurant {id: row.REST_ID})
CREATE (p1)-[ate:ATE_AT]->(re)
SET ate.liked = toBoolean(row.liked);

Final result

Run the following query to see how the imported data looks as a graph:

MATCH p=()-[]-() RETURN p;