Query Optimization in Memgraph: Best Practices and Common Mistakes
Whether you’re trying to visualize your data or you’re running complex algorithms, querying is key to accessing information from your database. Yet, even experienced professionals can stumble over common pitfalls—from syntax errors to performance issues.
In this blog post, we’ll explore the most common querying challenges and provide practical tips to avoid typical errors. You'll have the tools to refine your querying strategies, optimize your techniques, and unlock valuable insights from your data.
Query Execution
Querying Memgraph using Cypher is a common operation that needs to be properly optimized. There are some tips and tricks for executing queries as fast as possible with the fewest resources.
Profiling for Performance
Memgraph executes queries based on the generated query plan. A query plan is a set of operations that need to be executed, and changing a single operator in a query plan can significantly change data pipeline and query performance.
To lower the data pipeline, you need to optimize the query plan. You can start by indexing your data properly.
Effective Indexing Strategies
Creating a label and label-property index can significantly lower the time needed to find a specific node and reduce the data in the pipeline.
It is important to note that indexing in a database depends on the actual dataset stored there. The general rule is that properties with high cardinality should be used as indexes. The following query creates an index on a property of a particular label:
CREATE INDEX ON :Person(prop);
Besides creating indexes to decrease the data pipeline and improve search, features such as index hinting and analyze graph can help do that automatically.
Query Parameterization
Multiple query plans will be generated for each query, but only a single strategy is optimal. Query planning time can be significant for complex queries because the query planner will try to generate multiple versions of the query plan and choose the optimal one.
After the optimal query plan is generated, it is cached and reused for the same query. To benefit from query plan caching, use query parameters as much as possible.
Here is an example of a query without query parameters:
CREATE (n:Node {id: 123});
CREATE (n:Node {id: 154});
CREATE (n:Node {id: 322});
Here is an example of a query with query parameters:
CREATE (n:Node {id: $id});
CREATE (n:Node {id: $id});
CREATE (n:Node {id: $id});
Memgraph’s query engine will try to cache the query plan without the parameters. Still, to ensure the best performance, use the query parameters whenever you can.
Reduce Roundtrip
When using Memgraph Lab, the total time it takes to process queries is known as the Lab full round trip. This time includes Memgraph's execution time, which covers query parsing, planning, and execution. To avoid returning duplicate nodes and reduce the full round trip time, you can use projection in the return clause.
Let’s show how that works on a couple of examples on the Game of Thrones deaths dataset from Memgraph Lab.
Here is an example query that returns the whole path:
MATCH path=(:Character { name: "Jon Snow" })-[:KILLED *bfs]->(:Character)
RETURN path;
When run in Memgraph Lab, the full roundtrip is 82 ms.
Project to Avoid Duplicates
To reduce roundtrip time, you can use projection to prevent returning duplicate nodes. Here is an example of a query that uses a projection.
MATCH path=(:Character { name: "Jon Snow" })-[:KILLED *bfs]->(:Character)
WITH project(path) AS subgraph
RETURN subgraph;
When run in Memgraph Lab, the full roundtrip is down to 49 ms.
Return a Single Value
Here is a query that returns only the number of paths:
MATCH path=(:Character { name: "Jon Snow" })-[:KILLED *bfs]->(:Character)
RETURN count(path);
When run in Memgraph Lab, the full roundtrip is down to 30 ms.
Similarly, size()
and sum()
functions can be used to speed up the full roundtrip:
MATCH path=(:Character { name: "Jon Snow" })-[:KILLED *bfs]->(:Character)
WITH size(nodes(path)) as nodes, size(relationships(path)) AS relationships
RETURN sum(nodes), sum(relationships);
When run in Memgraph Lab, the full roundtrip is down to 33 ms.
Here is an example of using both project()
and size()
:
MATCH path=(:Character { name: "Jon Snow" })-[:KILLED *bfs]->(:Character)WITH project(path) as subgraph
RETURN size(subgraph.nodes), size(subgraph.edges);
When run in Memgraph Lab, the full roundtrip is down to 31 ms.
Return Only a Property
Here is a query that extracts nodes’ internal IDs from the path:
MATCH path=(:Character { name: "Jon Snow" })-[:KILLED *bfs]->(:Character)
WITH extract(n IN nodes(path) | ID(n)) AS nodes_ids
RETURN nodes_ids;
When run in Memgraph Lab, the full roundtrip is down to 41 ms.
Similarly, the following query returns only name
nodes’ property from the path:
MATCH path=(:Character { name: "Jon Snow" })-[:KILLED *bfs]->(:Character)
WITH extract(n IN nodes(path) | n {.name}) AS nodes_ids
RETURN nodes_ids;
When run in Memgraph Lab, the full roundtrip is down to 31 ms.
These are some tips that can help reduce roundtrip time. These optimization strategies have a larger impact on larger datasets and should be utilized whenever possible to achieve the best performance.
Optimizing Path Traversals
In contrast to other graph databases, Memgraph deep path traversals efficiently handle complex graph queries, as these algorithms have been built into Memgraph’s core. This eliminates the need for the overhead of business logic on the application side. There are four built-in deep path traversal algorithms: depth-first search (DFS), breadth-first search (BFS), weighted shortest path, and all shortest paths.
For example, let’s take the following unoptimized query that traverses the European transportation network. Using the BFS algorithm, we’ll find all of the cities London is connected to with the shortest road path available. In other words, we'll find all shortest paths from the starting node with the name property being London to all of the nodes in the dataset.
MATCH path=(:City {name: 'London'})-[*BFS]->(:City)
RETURN path;
This query returns nodes representing cities London is connected to with the transportation network.
Since we didn’t provide any restrictions and filtering, the algorithm scans and traverses through the entire dataset, which can lead to slower performance on larger-scale datasets. There are a few ways we can optimize the traversal.
Creating Indexes
Creating indexes on relevant properties can drastically speed up traversal queries by shortening the database scanning time.
Filtering by Relationship Type
Unlike other graph databases, Memgraph supports inline filtering, enabling efficient traversal through graph structures. This approach allows for precise control over how relationships are traversed, including filtering by type and the direction of the relationship, avoiding subsequent filtering using the WHERE
clause.
Let’s take the same example from above, but this time we are limiting traversal across roads only, eliminating other types of transportation. In other words, we’ll provide a relationship type filter to the previously used query and limit it only to traverse through the relationship type ROAD
.
MATCH path=(:City {name: 'London'})-[r:ROAD *BFS]->(:City)
RETURN path;
Memgraph eliminates traversing through unnecessary relationships, shortening the execution time.
Filtering by Property Value
Traversal algorithms allow an expression filter that determines if an expansion is allowed over a certain relationship or node property value.
I’ve taken the same example from above, but this time we’ve limited the traversal through the European roads only. We want to apply filters to the relationship property continent
and set the value to exactly Europe
.
MATCH path=(:City {name: 'London'})-[r:ROAD *BFS (r, n | r.continent = 'Europe')]->(:City)
RETURN path;
This way, Memgraph eliminates traversing through unnecessary relationships and property values, shortening the execution time even more.
Constraining Path Length
By constraining the path length, the algorithm won’t do unnecessary scanning and return results after finding results with the maximum number of hops.
The following query will only return the results if the path is equal to or shorter than two hops:
MATCH path=(:City {name: 'London'})-[r:ROAD *BFS ..2 (r, n | r.continent = 'Europe')]->(:City)
RETURN path;
Knowing the schema of your dataset and filtering and limiting the desired results can help you achieve a much more optimized use of the traversal algorithms.
Avoiding Inefficient Queries
Cyphermorphism
If you’re looking for a longer path in your dataset, try to keep the query as compact as possible. When you’re matching a longer path (chain), a query similar to this one:
EXPLAIN MATCH (n)-[r1]-(m)-[r2]-(l)
RETURN *;
The query plan applies the EdgeUniquenessFilter
, meaning that relationships r1
and r2
must be unique. The query engine treats the following triplet syntax (start node
, edge
, end node
) in the same way:
EXPLAIN MATCH (n)-[r1]-(m), (m)-[r2]-(l)
RETURN *;
The EdgeUniquenessFilter
was applied because of graph-matching morphism called cyphermorphism, which ensures that all relationships matched by the same clause must be different.
If the above query was split into two MATCH
clauses:
EXPLAIN MATCH (n)-[r1]-(m)
MATCH (m)-[r2]-(l)
RETURN *;
The query plan for this query does not use the EdgeUniquenessFilter
operator anymore, leading to unexpected results, duplicated data in the data pipeline, and, consequently, poor performance.
The EdgeUniquenessFilter
operator was not applied, because during query parsing, multiple MATCH
, MERGE
, OPTIONAL MATCH
, WITH
and UNION
statements split the query into multiple logical query parts. They are considered independent, so no uniqueness filter is applied.
Workarounds for Fast Filtering
For simple filtering queries where label and label-property indexes are set, those indexes will be used as expected. Here is an example of such a query:
MATCH (p:Person)
WHERE p.age = 20
RETURN p;
For more advanced filtering with boolean operators, profile the query and check if indexes are being adequately used. The following query will not use the label-property index as expected because it uses the OR
boolean operator:
MATCH (p:Person)
WHERE p.age = 20 OR p.age = 25
RETURN p;
Such queries haven’t been optimized to use the union of indexes from its two branches.
As a workaround, use IN
so that the ScanAllByLabelPropertyValue
operator is used:
MATCH (p:Person)
WHERE p.age IN [20, 25]
RETURN p;
A similar thing happens if you try using the range()
function:
MATCH (p:Person)
WHERE p.age IN range(20, 25)
RETURN p;
In this case, only the ScanAllByLabel
operator will be used, and the execution won’t be as fast as expected. To use the ScanAllByLabelPropertyValue
operator and to achieve better execution time, do the following workaround:
UNWIND range(20, 25) AS person_age
MATCH (p:Person)
WHERE p.age = person_age,
RETURN p;
Conclusion
Mastering querying techniques is a step-by-step process that will allow you to handle data more efficiently. This article has explored common pitfalls and provided strategies for overcoming them, including optimizing execution, leveraging indexing, reducing roundtrip times and techniques for optimizing path traversals. By implementing these strategies, you ensure smoother database operations as well as less memory consumption and shorter query execution time.