
High availability (Enterprise)

⚠️ Memgraph 2.15 introduced high availability as an Enterprise feature. It is enabled only when Memgraph is started with the --experimental-enabled=high-availability flag.

A cluster is considered highly available if, at any point, there is some instance that can respond to a user query. Our high availability relies on replication. The cluster consists of:

  • The MAIN instance on which the user can execute write queries
  • REPLICA instances that can only respond to read queries
  • COORDINATOR instances that manage the cluster state

Cluster management

A typical Memgraph highly available cluster consists of 3 data instances (1 MAIN and 2 REPLICAs) and 3 coordinator instances backed by the Raft protocol. One coordinator instance is the leader, whose job is to make sure there is always one writeable data instance (the MAIN). The other two coordinator instances replicate the changes made by the leader coordinator in their own Raft logs. The operations saved in the Raft log are those related to cluster management. Memgraph doesn't have its own implementation of the Raft protocol; this task is offloaded to the industry-proven library NuRaft.

You can start a coordinator instance by specifying the --coordinator-id and --coordinator-port flags. A coordinator instance only responds to queries related to high availability, so you cannot execute any data-oriented query on it. The coordinator port is used for the Raft protocol, which all coordinators use to ensure the consistency of the cluster's state.

Data instances are distinguished from coordinator instances by the --management-port flag. This port is used for RPC network communication between the coordinator and data instances. By default, a data instance starts as MAIN. All data instances should be started with the flag --replication-restore-state-on-startup=true so that their role (MAIN or REPLICA) is restored on restart. The coordinator will ensure that no data inconsistency can happen during and after an instance's restart. Once all instances are started, the user can start adding data instances to the cluster.

The Raft consensus algorithm ensures that all nodes in a distributed system agree on a single source of truth, even in the presence of failures, by electing a leader to manage a replicated log. It simplifies the management of the replicated log across the cluster, providing a way to achieve consistency and coordination in a fault-tolerant manner.

Bolt+routing

Directly connecting to the MAIN instance isn't preferred in the HA cluster since the MAIN instance changes due to various failures. Because of that, Memgraph also supports bolt+routing so that users can always send write queries to the correct data instance. With this protocol, the client first sends a ROUTE Bolt message to any coordinator instance. The coordinator replies with a routing table containing three entries, specifying from which instances data can be read, to which instance data can be written, and which instances can behave as routers. In the Memgraph HA cluster, the MAIN data instance is the only writeable instance, REPLICAs are readable instances, and COORDINATORs behave as routers. For more details, check the Request message ROUTE documentation.

Users only need to change the scheme they use for connecting to coordinators. This means that instead of using bolt://<main_ip_address>, you should use neo4j://<coordinator_ip_address> to get an active connection to the current MAIN instance in the cluster. You can find examples of how to use bolt+routing in different programming languages here.
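As an illustration, here is a minimal Python sketch using the Neo4j driver (Memgraph speaks the Bolt protocol, so the driver's neo4j:// routing scheme works against a coordinator). The address, port, and empty credentials are assumptions; adjust them to your deployment.

from neo4j import GraphDatabase

# neo4j:// enables bolt+routing; point it at any coordinator's Bolt port.
# 127.0.0.1:7687 is a placeholder for your coordinator's address.
driver = GraphDatabase.driver("neo4j://127.0.0.1:7687", auth=("", ""))

def create_person(tx, name):
    tx.run("CREATE (:Person {name: $name})", name=name)

def count_nodes(tx):
    return tx.run("MATCH (n) RETURN count(n) AS cnt").single()["cnt"]

with driver.session() as session:
    session.execute_write(create_person, "Alice")   # routed to the current MAIN
    print(session.execute_read(count_nodes))        # can be routed to a REPLICA

driver.close()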

User API

Register instance

Registering instances should be done on a single coordinator. The chosen coordinator will become the cluster's leader.

The REGISTER INSTANCE query results in several actions:

  1. The coordinator instance will connect to the data instance on the management_server network address.
  2. The coordinator instance will start pinging the data instance every --instance-health-check-frequency-sec seconds to check its status.
  3. The data instance will be demoted from MAIN to REPLICA.
  4. The data instance will start its replication server on the replication_server network address.

REGISTER INSTANCE instanceName WITH CONFIG {"bolt_server": boltServer, "management_server": managementServer, "replication_server": replicationServer};

This operation will result in writing to the Raft log.

Add coordinator instance

The user can choose any coordinator instance and connect the other two (or more) coordinator instances to the cluster. This can be done before or after registering data instances; the order isn't important.

ADD COORDINATOR coordinatorId WITH CONFIG {"bolt_server": boltServer, "coordinator_server": coordinatorServer}; 

Set instance to MAIN

Once all data instances are registered, one data instance should be promoted to MAIN. This can be achieved by using the following query:

SET INSTANCE instanceName TO MAIN;

This query will register all other instances as REPLICAs to the new MAIN. If one of the instances is unavailable, setting the instance to MAIN will not succeed. If there is already a MAIN instance in the cluster, this query will fail.

This operation will result in writing to the Raft log.

Unregister instance

There are various reasons that could lead to the decision that an instance needs to be removed from the cluster: the hardware can break, network communication can be set up incorrectly, etc. The user can remove an instance from the cluster using the following query:

UNREGISTER INSTANCE instanceName;

At the moment of unregistration, the instance you want to unregister must not be MAIN, because unregistering MAIN could lead to an inconsistent cluster state.

The instance requested to be unregistered will also be unregistered from the current MAIN's REPLICA set.

Show instances

You can check the state of the whole cluster using the SHOW INSTANCES query. The query shows which instances are visible in the cluster, which network ports they use for managing the cluster state, whether they are considered alive from the coordinator's perspective, and their role (MAIN, REPLICA, coordinator, or unknown if not alive). Currently, the health status of coordinators isn't available, so the health field for coordinators is always unknown. This query can be run on both the leader and the followers, with the sole difference that followers cannot know the health status of other instances, since only the leader pings instances for their status.

SHOW INSTANCES;

Setting config for highly-available cluster

There are several flags that you can use for managing the cluster. The flag --management-port is used to distinguish data instances from coordinators. The provided port needs to be unique. Setting this flag creates an RPC server on the data instance capable of responding to the coordinator's RPC messages.

RPC (Remote Procedure Call) is a protocol for executing functions on a remote system. RPC enables direct communication in distributed systems and is crucial for replication and high availability tasks.

The flags --coordinator-id and --coordinator-port need to be unique and specified on coordinator instances. They cause the creation of the Raft server that coordinator instances use for communication. The flag --instance-health-check-frequency-sec specifies how often the leader coordinator should check the status of a replication instance to update its status. The flag --instance-down-timeout-sec gives the user the ability to control how much time should pass before the coordinator starts considering the instance to be down.

Consider the instance to be down only if several consecutive pings fail, because in distributed systems a single ping can fail for many different reasons.

There is an additional flag, --instance-get-uuid-frequency-sec, which sets how often the coordinator should check whether REPLICA instances follow the correct MAIN instance. A REPLICA may die and come back up before the coordinator notices. In that case, the REPLICA will not follow MAIN for up to --instance-get-uuid-frequency-sec seconds. It is advisable to set this flag to a value larger than --instance-down-timeout-sec, since the coordinator first needs to confirm that the instance was actually down and not just affected by a transient networking issue.

Failover

Determining instance's health

Every --instance-health-check-frequency-sec seconds, the coordinator contacts each instance. The instance is not considered to be down unless --instance-down-timeout-sec has passed and the instance hasn't responded to the coordinator in the meantime. The current expectation is to set --instance-health-check-frequency-sec to be less than --instance-down-timeout-sec, and for --instance-down-timeout-sec to be a multiple of --instance-health-check-frequency-sec with coefficient N, where N >= 2. For example, setting --instance-down-timeout-sec=5 and --instance-health-check-frequency-sec=1 results in the coordinator contacting each instance every second, and the instance is considered dead after it fails to respond 5 times (5 seconds / 1 second).
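As a rough sketch of that arithmetic (illustrative Python, not Memgraph code), the number of consecutive missed pings tolerated before an instance is declared down follows directly from the two flags:

# Example flag values from the paragraph above.
instance_health_check_frequency_sec = 1   # --instance-health-check-frequency-sec
instance_down_timeout_sec = 5             # --instance-down-timeout-sec

# The down timeout should be at least twice the check frequency (N >= 2).
missed_pings_before_down = instance_down_timeout_sec // instance_health_check_frequency_sec
assert missed_pings_before_down >= 2
print(missed_pings_before_down)  # 5 -> declared down after 5 consecutive missed pings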

In case a REPLICA doesn't respond to a health check, the leader coordinator will try to contact it again every --instance-health-check-frequency-sec seconds. When the REPLICA instance rejoins the cluster (comes back up), it always rejoins as REPLICA. For the MAIN instance, there are two options. If it is down for less than --instance-down-timeout-sec, it will rejoin as MAIN because it is still considered alive. If it is down for more than --instance-down-timeout-sec, the failover procedure is initiated. Whether MAIN will rejoin as MAIN then depends on the success of the failover procedure. If the failover procedure succeeds, the old MAIN will rejoin as REPLICA. If failover doesn't succeed, MAIN will rejoin as MAIN once it comes back up.
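The rejoin behavior described above can be summarized with a small, purely illustrative Python sketch (the function and its parameters are made up for this example):

def old_main_rejoin_role(down_for_sec, instance_down_timeout_sec, failover_succeeded):
    """Role the old MAIN comes back with, per the rules above (illustrative)."""
    if down_for_sec < instance_down_timeout_sec:
        return "MAIN"        # still considered alive, keeps its role
    return "REPLICA" if failover_succeeded else "MAIN"

print(old_main_rejoin_role(3, 5, failover_succeeded=False))   # MAIN
print(old_main_rejoin_role(30, 5, failover_succeeded=True))   # REPLICA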

Failover procedure - high level description

From the alive REPLICAs, the coordinator chooses a new potential MAIN. This instance is only a potential MAIN, as the failover procedure can still fail due to various factors (networking issues, the promotion to MAIN failing, an alive REPLICA failing to accept an RPC message, etc.). Before promoting a REPLICA to MAIN, the coordinator sends an RPC request to each alive REPLICA to stop listening to the old MAIN. Once each alive REPLICA acknowledges that it stopped listening to the old MAIN, the coordinator sends an RPC request to the potential new MAIN, which is still in the REPLICA state, to promote itself to the MAIN instance, with info about the other REPLICAs to which it will replicate data. Once that request succeeds, the new MAIN can start replicating to the other instances and accepting write queries.

Choosing new MAIN from available REPLICAs

When failover happens, some REPLICAs can also be down. From the list of alive REPLICAs, a new MAIN is chosen. First, the leader coordinator contacts each alive REPLICA to get info about each database's last commit timestamp. If multi-tenancy is enabled, the coordinator gets info on all databases and their last commit timestamps from each instance. Currently, the coordinator chooses the instance that will become the new MAIN by comparing only the latest commit timestamp of the default database for every instance. If multiple instances have the same latest commit timestamp, the instance that was registered earlier is chosen as the new MAIN.
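A minimal Python sketch of this selection rule (illustrative only, not Memgraph's implementation; the tuple layout is an assumption):

# Each alive REPLICA as (registration_order, name, last_commit_timestamp_of_default_db).
alive_replicas = [
    (2, "instance_2", 1540),
    (1, "instance_1", 1540),
    (3, "instance_3", 900),
]

# Highest commit timestamp of the default database wins; ties go to the
# instance that was registered earlier (lower registration order).
new_main = max(alive_replicas, key=lambda r: (r[2], -r[0]))
print(new_main[1])  # instance_1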

Providing atomicity of action

To ensure the atomicity of each action, the leader coordinator wraps each of the following actions with locks:

  • Registering of replica instance
  • Unregistering of replica instance
  • Setting instance to MAIN
  • Failover procedure
  • Force reset of the cluster

This mechanism opens the lock before each action starts and closes it once the action is complete. A closed lock means the action has fully completed. If, due to some failure, the action doesn't fully complete, the cluster enters a force reset in which it is reset to the state held in the Raft log. The source of truth is what is stored in the Raft log across the majority of coordinators.
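The open-lock/close-lock idea can be sketched in a few lines of Python (purely illustrative; the names and the Raft log modeled as a list are assumptions, not Memgraph's code):

def force_reset_cluster(raft_log):
    # Illustrative stand-in: demote all alive instances to REPLICA and
    # rerun failover, based on the state recorded in the Raft log.
    raft_log.append("force_reset")

def run_cluster_action(raft_log, action):
    raft_log.append("lock_opened")      # recorded before the action starts
    try:
        action()                        # e.g. register instance, set MAIN, failover
        raft_log.append("lock_closed")  # recorded only when the action fully completes
    except Exception:
        force_reset_cluster(raft_log)   # an open lock left behind triggers a force reset

log = []
run_cluster_action(log, lambda: None)
print(log)  # ['lock_opened', 'lock_closed']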

Force reset of cluster

The leader coordinator executes a force reset of the cluster if an action isn't fully complete. Failure can happen anywhere; for example, in the case of setting an instance to MAIN, the RPC request to a REPLICA instance to promote itself to MAIN can succeed, but writing to the Raft log that the instance was promoted can fail. A force reset includes demoting every alive instance to REPLICA and executing the failover procedure once again. Such a procedure is needed because the cluster currently doesn't track exactly where the action failed, only whether it fully succeeded. The Raft log is taken as the source of truth at all times. In case the leader coordinator dies while executing the force reset, the next coordinator elected as leader will continue executing the force reset. The action is retried until it succeeds.

If an instance is down at the point of force reset, the leader coordinator writes in the Raft log that the instance needs to be demoted to REPLICA once it comes back up.

If all instances are down at the point of force reset, the action won't succeed as a new MAIN instance can't be chosen.

Old MAIN rejoining to the cluster

Once the old MAIN gets back up, the coordinator sends an RPC request to demote the old MAIN to REPLICA. The coordinator tracks at all times which instance was the last MAIN.

The leader coordinator sends two RPC requests, in the given order, to demote the old MAIN to REPLICA:

  1. Demote MAIN to REPLICA RPC request
  2. A request to store the UUID of the current MAIN, which the old MAIN, now acting as a REPLICA instance, must listen to. If the MAIN gets demoted but the RPC request to store the new MAIN's UUID doesn't succeed, the coordinator will retry both actions until they succeed.

How REPLICA knows which MAIN to listen to

Each REPLICA stores the UUID of the MAIN it listens to. If a network partition happens in which MAIN can talk to a REPLICA but the coordinator can't talk to the MAIN, then from the coordinator's point of view that MAIN is down, while from the REPLICA's point of view the MAIN instance is still alive. The coordinator will start the failover procedure, and the cluster could end up with multiple MAINs, with REPLICAs listening to both. To prevent such an issue, before failover to a new MAIN happens, each REPLICA gets a new UUID that no current MAIN has. The coordinator generates the new UUID, which the new MAIN gets to use only once each alive REPLICA acknowledges that it received it. In the RPC request to promote itself to MAIN, the potential new MAIN instance receives the generated UUID that the REPLICAs are ready to listen to.
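A condensed Python sketch of this UUID hand-off (illustrative only; the Instance class and its methods stand in for Memgraph's RPC calls):

import uuid

class Instance:
    """Toy stand-in for a data instance reachable over RPC."""
    def __init__(self, name):
        self.name = name
        self.main_uuid = None
        self.role = "replica"

    def set_main_uuid(self, value):    # stand-in for the "store new UUID" RPC
        self.main_uuid = value

    def promote_to_main(self, value):  # stand-in for the "promote to MAIN" RPC
        self.main_uuid = value
        self.role = "main"

def failover(alive_replicas, candidate):
    new_main_uuid = uuid.uuid4()              # a UUID no current MAIN has
    for replica in alive_replicas:            # every alive REPLICA must acknowledge
        replica.set_main_uuid(new_main_uuid)  # the new UUID before promotion
    candidate.promote_to_main(new_main_uuid)  # only then promote the candidate

replicas = [Instance("instance_1"), Instance("instance_2")]
failover(replicas, replicas[0])
print(replicas[0].role)  # main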

If a REPLICA was down at some point, the MAIN could have changed in the meantime. When the REPLICA comes back up, it doesn't listen to any MAIN until the coordinator sends it an RPC request to start listening to the MAIN with the given UUID.

Replication concerns

Force sync of data

On failover, the coordinator promotes one of the alive REPLICAs to become the new MAIN. For the promotion to MAIN to succeed, the new MAIN figures out for each REPLICA whether it is behind (has less up-to-date data) or has data that the new MAIN doesn't have. If a REPLICA has data that MAIN doesn't have, that REPLICA is in a diverged-from-MAIN state. If at least one REPLICA is in the diverged-from-MAIN state, failover won't succeed, as MAIN can't replicate data to a diverged-from-MAIN REPLICA.

When choosing a new MAIN in the failover procedure from the list of available REPLICAs, the instance with the latest commit timestamp for the default database is chosen as the new MAIN. If some other instance had more up-to-date data but was down while the failover procedure was choosing a new MAIN, then when that instance rejoins the cluster, the new MAIN sends it a force sync RPC request. The force sync RPC request deletes all current data in all databases on the given instance, after which the instance accepts data from the current MAIN. This way, the cluster always follows the current MAIN.
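A simplified Python sketch of the diverged-from-MAIN idea (illustrative only; modeling each instance's committed history as a list of transaction ids is an assumption made for this example):

def needs_force_sync(main_history, rejoining_history):
    """The rejoining instance diverged if it committed something MAIN never saw."""
    return any(txn not in main_history for txn in rejoining_history)

main_history = ["t1", "t2", "t3"]
diverged     = ["t1", "t2", "t4"]   # committed t4 while cut off from the cluster
behind       = ["t1", "t2"]

print(needs_force_sync(main_history, diverged))  # True  -> wipe and resync from MAIN
print(needs_force_sync(main_history, behind))    # False -> MAIN replicates t3 normally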

Instances' restart

Data instances' restart

Data instances can fail both as MAIN and as REPLICA. When an instance that was a REPLICA comes back, it won't accept updates from any instance until the coordinator tells it which MAIN to listen to. This should happen automatically when the coordinator's ping to the instance passes. When the MAIN instance comes back, any writing to it will be forbidden until a ping from the coordinator passes. When the coordinator realizes the once-MAIN instance is alive, it will choose between enabling writing on the old MAIN or demoting it to REPLICA. The choice depends on whether another MAIN instance was chosen between the old MAIN's failure and its restart.

Since an instance can die and come back up, use --replication-restore-state-on-startup=true to restore the last replication state. This configuration will restore the instance to the state it was last in, and everything should function normally from that point on. If this flag is not set, every instance will restart as MAIN, and that case is not handled. Furthermore, it is important for data instances to have --data-recovery-on-startup=true, which recovers data from the data directory. It also recovers important information for each database on the data instance. This way, MAIN can establish a connection to the REPLICAs and replicate data to the correct databases.

Coordinator instances' restart

If a coordinator instance dies and is restarted, it loses all data from its Raft log. Information about other coordinators and their endpoints is recovered from the high_availability/coordinator directory stored under the top-level --data-directory folder. When the coordinator rejoins, it will communicate with the other coordinators and receive updates from the current leader. It is important that at least one coordinator is alive at all times so that data from the Raft log is not lost.

Handling errors

Distributed systems can fail in various ways. The logic is implemented so that Memgraph instances are resilient to occasional network failures and independent machine failures. That's why there are parameters controlling the frequency of health checks from the coordinator to replication instances and the time needed to conclude that an instance is down.

Example

This example shows how to set up a highly available cluster in Memgraph using three coordinators and three data instances.

1. Start all instances

i.) Start coordinator1:

docker run  --name coord1 -p 7687:7687 -p 7444:7444 memgraph/memgraph-mage --bolt-port=7687 --log-level=TRACE --data-directory=/tmp/mg_data_coord1 --log-file=/tmp/coord1.log --also-log-to-stderr --coordinator-id=1 --coordinator-port=10111 --experimental-enabled=high-availability

ii.) Start coordinator2:

docker run  --name coord2 -p 7688:7688 -p 7445:7444 memgraph/memgraph-mage --bolt-port=7688 --log-level=TRACE --data-directory=/tmp/mg_data_coord2 --log-file=/tmp/coord2.log --also-log-to-stderr --coordinator-id=2 --coordinator-port=10112 --experimental-enabled=high-availability

iii.) Start coordinator3:

docker run  --name coord3 -p 7689:7689 -p 7446:7444 memgraph/memgraph-mage --bolt-port=7689 --log-level=TRACE --data-directory=/tmp/mg_data_coord3 --log-file=/tmp/coord3.log --also-log-to-stderr --coordinator-id=3 --coordinator-port=10113 --experimental-enabled=high-availability

iv.) Start instance1:

docker run  --name instance1 -p 7690:7690 -p 7447:7444 memgraph/memgraph-mage --bolt-port=7690 --log-level=TRACE --data-directory=/tmp/mg_data_instance1 --log-file=/tmp/instance1.log --also-log-to-stderr --replication-restore-state-on-startup=true --management-port=10011 --experimental-enabled=high-availability

v.) Start instance2:

docker run --name instance2 -p 7691:7691 -p 7448:7444 memgraph/memgraph-mage --bolt-port=7691 --log-level=TRACE --data-directory=/tmp/mg_data_instance2 --log-file=/tmp/instance2.log --also-log-to-stderr --replication-restore-state-on-startup=true --management-port=10012 --experimental-enabled=high-availability

vi.) Start instance3:

docker run --name instance3 -p 7692:7692  -p 7449:7444 memgraph/memgraph-mage --bolt-port=7692 --log-level=TRACE --data-directory=/tmp/mg_data_instance3 --log-file=/tmp/instance3.log --also-log-to-stderr --replication-restore-state-on-startup=true --management-port=10013 --experimental-enabled=high-availability

2. Register instances

i.) Connect with any Memgraph client to any coordinator. Here, we choose coordinator 1.

mgconsole --port=7687

ii.) Connect the other two coordinator instances to the cluster.

ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "127.0.0.1:7688", "coordinator_server": "127.0.0.1:10112"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "127.0.0.1:7689", "coordinator_server": "127.0.0.1:10113"};

iii.) Register 3 data instances as part of the cluster:

In Docker deployments where instances are not on the local host, replace 127.0.0.1 with the container's IP address.

REGISTER INSTANCE instance_1 WITH CONFIG {"bolt_server": "127.0.0.1:7690", "management_server": "127.0.0.1:10011", "replication_server": "127.0.0.1:10001"};
REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "127.0.0.1:7691", "management_server": "127.0.0.1:10012", "replication_server": "127.0.0.1:10002"};
REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "127.0.0.1:7692", "management_server": "127.0.0.1:10013", "replication_server": "127.0.0.1:10003"};

iv.) Set instance_3 as MAIN:

SET INSTANCE instance_3 TO MAIN;

v.) Connect to the leader coordinator and check the cluster state with SHOW INSTANCES;

name           raft_socket_address   coordinator_socket_address   health    role
coordinator_1  127.0.0.1:10111       ""                           unknown   coordinator
coordinator_2  127.0.0.1:10112       ""                           unknown   coordinator
coordinator_3  127.0.0.1:10113       ""                           unknown   coordinator
instance_1     ""                    127.0.0.1:10011              up        replica
instance_2     ""                    127.0.0.1:10012              up        replica
instance_3     ""                    127.0.0.1:10013              up        main

3. Check automatic failover

Let's say that the current MAIN instance is down for some reason. After --instance-down-timeout-sec seconds, the coordinator will realize that and automatically promote one of the alive REPLICAs to become the new MAIN. The output of running SHOW INSTANCES on the leader coordinator could then look like:

instance_name  raft_socket_address   coordinator_socket_address   alive     role
coordinator_1  127.0.0.1:10111       ""                           unknown   coordinator
coordinator_2  127.0.0.1:10112       ""                           unknown   coordinator
coordinator_3  127.0.0.1:10113       ""                           unknown   coordinator
instance_1     ""                    127.0.0.1:10011              up        main
instance_2     ""                    127.0.0.1:10012              up        replica
instance_3     ""                    127.0.0.1:10013              down      unknown