Indexes
When running queries, you want to get results as soon as possible. In the worst-case scenario, during query execution, all nodes need to be checked to find a match.
This is how the query plan looks when there is no index on the data:
memgraph> EXPLAIN MATCH (n:Person {prop: 1}) RETURN n;
+-----------------------------------+
| QUERY PLAN |
+-----------------------------------+
| " * Produce {n}" |
| " * Filter (n :Person), {n.prop}" |
| " * ScanAllByLabel (n :Person)" |
| " * Once" |
+-----------------------------------+
Notice the ScanAllByLabel
operations. By creating indexes, query execution
can be much faster because indexes partition data with a key. When a query is
executed, the engine first checks if there is an index, and instead of
explicitly checking every node, it can just check the indexed ones, making
retrieving indexed data more efficient.
The following query creates an index on a property of a certain label:
CREATE INDEX ON :Person(prop);
The query plan of a query matching that specific node and property:
memgraph> EXPLAIN MATCH (n:Person {prop: 1}) RETURN n;
+-----------------------------------------------------+
| QUERY PLAN |
+-----------------------------------------------------+
| " * Produce {n}" |
| " * ScanAllByLabelPropertyValue (n :Person {prop})" |
| " * Once" |
+-----------------------------------------------------+
The ScanAllByLabel
operations have been replaced by
ScanAllByLabelPropertyValue
, a more efficient operation.
But there are also some downsides to indexing:
- each index requires extra storage (memory)
- indexes slow down write operations to the database.
It is important to choose the right data to create indexes on, as indexing all of the content will not improve the database speed.
The structures in the index are dynamically updated on modifications or insertions of new nodes, slowing down the write operations.
Indexing won't bring any improvements on properties that are mostly of the same value, as they have no proper distinguishers.
For the same reason, indexing certain data types will not bring any significant performance gain. For example, for properties with boolean values, the time will be cut in half.
Create an index
Indexes are not created automatically.
You can explicitly create indexes on a data with a specific label or
label-property combination using the CREATE INDEX ON
syntax or CREATE EDGE INDEX ON
syntax in the case of edge-type indexes.
Label index
To optimize queries that fetch nodes by label, you need to create a label index:
CREATE INDEX ON :Person;
Creating an index will optimize the following type of queries:
MATCH (n:Person) RETURN n;
Label-property index
To optimize queries that fetch nodes with a certain label and property combination, you need to create a label-property index. For the best performance, create an index on properties containing unique integer values.
Creating a label-property index will not create a label index!
For example, to index nodes that are labeled as :Person
and have a property
named age
:
CREATE INDEX ON :Person(age);
Creating an index will optimize the queries that need to match a specific label and property combination:
MATCH (n :Person {age: 42}) RETURN n;
Index will also optimize queries that filter labels and properties with the
WHERE
clause:
MATCH (n) WHERE n:Person AND n.age = 42 RETURN n;
Be aware that since the filter inside WHERE
can contain any kind of an
expression, the expression can be so complicated that the index doesn't get used.
If there is any suspicion that an index isn't used, we recommend writing
labels and properties inside the MATCH
pattern.
Edge-type index
To optimize queries that fetch only the edges by specific edge-types, you need to create an edge-type index.
Creating an edge-type index requires the --storage-properties-on-edges flag to be set to true!
CREATE EDGE INDEX ON :EDGE_TYPE;
Creating an edge-type index will optimize the following type of queries:
MATCH ()-[r:EDGE_TYPE]->() RETURN r;
If you need to access nodes of found edges, you can use the startNode(r)
and endNode(r)
functions.
Named parameters are not supported for edge-type indexes.
Edge-type property index
To optimize queries that fetch only the edges by specific edge types and properties, you need to create an edge-type property index.
Creating an edge-type property index requires the --storage-properties-on-edges flag to be set to true!
CREATE EDGE INDEX ON :EDGE_TYPE(property_name);
Creating an edge-type property index will optimize the following type of queries:
MATCH ()-[r:EDGE_TYPE {property_name: value}]->() RETURN r;
If you need to access nodes of found edges, you can use the startNode(r)
and endNode(r)
functions.
Named parameters are not supported for edge-type property indexes.
Point index
To run queries such as:
WITH point({x:1, y:1}) AS p
MATCH (node:Label) WHERE point.distance(p, node.property) < 1
RETURN node;
in a performant way, feel free to create a new point index as follows:
CREATE POINT INDEX ON :Label(property);
From Memgraph 2.21, to utilize the point index in queries using the point.distance function, the comparison point needs to exist as a variable first. For example, instead of:
MATCH (node:Label) WHERE point.distance(point({x:1, y:1}), node.property) < 1
RETURN node;
do:
WITH point({x:1, y:1}) AS p
MATCH (node:Label) WHERE point.distance(p, node.property) < 1
RETURN node;
To drop the point index use:
DROP POINT INDEX ON :Label(property);
Analyze graph
When multiple label-property indexes exist, the database can sometimes select a non-optimal index due to the data's distribution.
The ANALYZE GRAPH;
query
calculates the distribution of property values so the database can select a more
optimal label-property index with the smallest average property value size. The
query is run only once after all indexes have been created and data inserted in
the database.
Index hinting
You can also instruct the planner to use specific
index(es) (if possible) by specifying which
index(es) to use at the beginning of the query with USING INDEX
clause.
Overriding planner behavior with index hints should be used with caution, and only by experienced developers and/or database administrators, as a poor index choice may cause queries to perform poorly.
Schema-related procedures
You can modify indexes using the schema.assert()
procedure.
Speed comparison
Below is a comparison of the same query run without an index and with an index. The query without an index took 0.015 seconds to execute, and the query with an index 0.006 seconds.
memgraph> SHOW INDEX INFO;
Empty set (0.001 sec)
memgraph> MATCH (n:Person) WHERE n.name =~ ".*an$" RETURN n.name;
+-------------+
| n.name |
+-------------+
| "Lillian" |
| "Logan" |
| "Susan" |
| "Sebastian" |
+-------------+
4 rows in set (0.021 sec)
memgraph> CREATE INDEX ON :Person(name);
Empty set (0.015 sec)
memgraph> MATCH (n:Person) WHERE n.name =~ ".*an$" RETURN n.name;
+-------------+
| n.name |
+-------------+
| "Lillian" |
| "Logan" |
| "Susan" |
| "Sebastian" |
+-------------+
4 rows in set (0.006 sec)
Show created indexes
To check all the labels and label-property pairs that Memgraph currently indexes, use the following query:
SHOW INDEX INFO;
The query displays a table of all label and label-property indexes presently kept by Memgraph, ordered by index type, label, property and count.
Delete an index
Created indexes can be deleted using the following syntax:
DROP INDEX ON :Label;
DROP INDEX ON :Label(property);
DROP EDGE INDEX ON :EDGE_TYPE;
These queries instruct all active transactions to abort as soon as possible. Once all transactions have finished, the index will be deleted.
Delete all node indexes
The schema.assert()
procedure will not drop edge-type and point indexes.
Our plan is to update it, and you can track the progress on our GitHub (opens in a new tab).
To delete all indexes, use the schema.assert()
procedure with the following parameters:
indices_map
={}
unique_constraints
= map of key-value pairs of all uniqueness constraints in the databaseexistence_constraints
= map of key-value pairs of all existence constraints in the databasedrop_existing
=true
Here is an example of indexes and constraints set in the database:
CREATE CONSTRAINT ON (n:Person) ASSERT EXISTS (n.name);
CREATE CONSTRAINT ON (n:Employee) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Employee) ASSERT n.email IS UNIQUE;
CREATE CONSTRAINT ON (n:Employee) ASSERT n.name, n.surname IS UNIQUE;
CREATE INDEX ON :Student(id);
CREATE INDEX ON :Student;
There are three uniqueness and one existence constraint. Additionally, there are two indexes - one label and one label-property index. To delete all indexes, run:
CALL schema.assert({}, {}, {}, true)
YIELD action, key, keys, label, unique
RETURN action, key, keys, label, unique;
The above query removes all existing indexes because the empty indices_map
indicates that no indexes should be asserted as existing, while the
drop_existing
set to true
specifies that all existing indexes should be
dropped.
Primarily, the assert()
procedure is used to define a schema, but it's also
useful if you need to delete all constraints or delete all node indexes and constraints.
Automatic index creation
Automatic index creation can only be used if the database has IN_MEMORY_TRANSACTIONAL
mode enabled.
Using the
storage-automatic-label-index-creation-enabled
and
storage-automatic-edge-type-index-creation-enabled
flags, it is possible to create label and edge-type indices automatically. Every
time the database encounters a label or edge-type that is currently not indexed,
it will create an index for that construct.
Recovery
Existence and unique constraints, and indexes can be
recovered in parallel. To enable this behavior, set the
storage-parallel-schema-recovery
configuration flag to true
.
Query optimization
Analyze graph
The ANALYZE GRAPH
will check and calculate certain properties of a graph so
that the database can choose a more optimal index or MERGE
transaction.
Before the introduction of the ANALYZE GRAPH
query, the database would choose
an index solely based on the number of indexed nodes. But if the number of nodes
is the only condition, in some cases the database would choose a non-optimal
index. Once the ANALYZE GRAPH
is run, Memgraph analyzes the distribution of
property values and can select a more optimal label-property index, the one with
the smallest average property value size.
The average property value's group size directly represents the database's expected number of hits which can be used to estimate the query's cost. When the average group size is the same, the chi-squared statistic is used to measure how close the distribution of property-value group size is to the uniform distribution. The index with a distribution closest to the uniform distribution is selected.
Upon running the ANALYZE GRAPH
query, Memgraph also check the node degree of
every indexed nodes and calculates the average degree. By having these values,
Memgraph can make a more optimal MERGE
expansion and improve performance. It's
always better to perform a MERGE
by expanding from the node that has a lower
degree than the connecting node.
The ANALYZE GRAPH;
command should be run only once after all indexes have been
created and nodes inserted in the database. In rare situations when one property
is set on many more nodes than another property, choosing an index based on
average group size and uniform distribution would be misleading. That's why the
database always selects the label-property index with >= 10x fewer nodes than
the other label-property index.
Calculate the statistic
Run the following query to calculate the statistics:
ANALYZE GRAPH;
The query will iterate over all label and label-property indexes in the database and calculate the average group size, chi-squared statistic and avg degree for each one, then return the following output:
label | property | num estimation nodes | num groups | avg group size | chi-squared value | avg degree |
---|---|---|---|---|---|---|
index's label | index's property | number of nodes used for estimation | number of distinct values the property contains | average group size of property's values | value of the chi-squared statistic | average degree of the indexed nodes |
Once the necessary information is obtained, Memgraph can choose the optimal
index and MERGE
expansion. If you don't want to run the analysis on all labels,
you can specify which labels to use by adding the labels to the query:
ANALYZE GRAPH ON LABELS :Label1, :Label2;
Delete statistic
Information about the graph is persistent between instance reruns as is recovered as all the other data, using snapshots and WAL files. If you want the database to ignore information about the average group size, the chi-squared statistic and the average degree, the existing statistic can be deleted by running:
ANALYZE GRAPH DELETE STATISTICS;
The results will contain all label-property indexes that were successfully deleted:
label | property |
---|---|
index's label | index's property |
Specific labels can be specified with the construct ON LABELS
:
ANALYZE GRAPH ON LABELS :Label1 DELETE STATISTICS;
Index hinting
When executing a query, Memgraph query planner needs to decide where in the query it should
start matching. To get the optimal match, it checks the MATCH
clause conditions
and finds the index that's likely to be the best choice, if there are multiple indexes
to choose from.
However, the selected index might not always be the best one. Sometimes, there are multiple candidate indexes, and the query planner picks the suboptimal one from a performance point of view.
With index hinting, you can instruct the planner to use specific index(es) (if possible) in the query that follows. Here is the syntax for such query:
USING INDEX :Label, :Label2 ...;
USING INDEX :Label(property) ...;
It is also possible to specify multiple hints separated with comma. In that case, the planner will apply the first hint that is applicable for a given match.
An example of selecting an index with USING INDEX
:
USING INDEX :Person(name)
MATCH (n:Person {name: 'John', gender: 'male'})
RETURN n;
Overriding planner behavior with index hints should be used with caution, and only by experienced developers and/or database administrators, as poor index choice may cause queries to perform poorly.
Underlying implementation
The central part of Memgraph's index data structure is a highly concurrent skip list (opens in a new tab). Skip lists are probabilistic data structures that allow fast search within an ordered sequence of elements. The structure itself is built in layers, where the bottom layer is an ordinary linked list that preserves the order. Each higher level can be imagined as a highway for layers below.
The implementation details behind skip list operations are well documented in
the literature and are out of the scope of this document. Nevertheless, we believe
that it is important for more advanced users to understand the following
implications of this data structure (n
denotes the current number of elements
in a skip list):
- The average insertion time is
O(log(n))
- The average deletion time is
O(log(n))
- The average search time is
O(log(n))
- The average memory consumption is
O(n)
When it comes to label-property indexes, Memgraph stores a list of specific properties that are used in label-property indexes. This list is ordered to make the search faster. All property types can be ordered. First, they are ordered based on the type and then within that type.