
High availability (Enterprise)

⚠️ Memgraph 2.15 introduced high availability as an Enterprise feature. It is enabled only when Memgraph is started with the --experimental-enabled=high-availability flag.

A cluster is considered highly available if, at any point, there is some instance that can respond to a user query. Our high availability relies on replication. The cluster consists of:

  • The MAIN instance on which the user can execute write queries
  • REPLICA instances that can only respond to read queries
  • COORDINATOR instances that manage the cluster state

Cluster management

A typical Memgraph highly available cluster consists of 3 data instances (1 MAIN and 2 REPLICAs) and 3 coordinator instances backed by the Raft protocol. One coordinator instance is the leader, whose job is to make sure there is always one writeable data instance (the MAIN). The other two coordinator instances replicate the changes made by the leader coordinator in their own Raft logs. The operations saved in the Raft log are those related to cluster management. Memgraph doesn't have its own implementation of the Raft protocol; this task is offloaded to the industry-proven library NuRaft.

You can start a coordinator instance by specifying the --coordinator-id and --coordinator-port flags. A coordinator instance only responds to queries related to high availability, so you cannot execute any data-oriented query on it. The coordinator port is used for the Raft protocol, which all coordinators use to ensure the consistency of the cluster's state.

Data instances are distinguished from coordinator instances by the --management-port flag. This port is used for RPC network communication between the coordinator and data instances. By default, a data instance starts as MAIN. All data instances should be started with the flag --replication-restore-state-on-startup=true so that their role (MAIN or REPLICA) is restored on restart. The coordinator will ensure that no data inconsistency can happen during and after an instance's restart. Once all instances are started, the user can start adding data instances to the cluster.

The Raft consensus algorithm ensures that all nodes in a distributed system agree on a single source of truth, even in the presence of failures, by electing a leader to manage a replicated log. It simplifies the management of the replicated log across the cluster, providing a way to achieve consistency and coordination in a fault-tolerant manner.

Bolt+routing

Directly connecting to the MAIN instance isn't preferred in the HA cluster since the MAIN instance changes due to various failures. Because of that, Memgraph also supports bolt+routing so that users can always send write queries to the correct data instance. With this protocol, the client first sends a ROUTE Bolt message to any coordinator instance. The coordinator replies with a routing table containing three entries, specifying from which instances data can be read, to which instance data can be written, and which instances can behave as routers. In the Memgraph HA cluster, the MAIN data instance is the only writeable instance, REPLICAs are readable instances, and COORDINATORs behave as routers. For more details, check the Request message ROUTE documentation.

Users only need to change the scheme they use for connecting to coordinators. This means that instead of using bolt://<main_ip_address>, you should use neo4j://<coordinator_ip_address> to get an active connection to the current MAIN instance in the cluster. You can find examples of how to use bolt+routing in different programming languages here.
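As an illustration, here is a minimal Python sketch using the Neo4j driver (Memgraph speaks the Bolt protocol, so the driver's neo4j:// routing scheme works against a coordinator). The address, port, and empty credentials are assumptions; adjust them to your deployment.

from neo4j import GraphDatabase

# neo4j:// enables bolt+routing; point it at any coordinator's Bolt port.
# 127.0.0.1:7687 is a placeholder for your coordinator's address.
driver = GraphDatabase.driver("neo4j://127.0.0.1:7687", auth=("", ""))

def create_person(tx, name):
    tx.run("CREATE (:Person {name: $name})", name=name)

def count_nodes(tx):
    return tx.run("MATCH (n) RETURN count(n) AS cnt").single()["cnt"]

with driver.session() as session:
    session.execute_write(create_person, "Alice")   # routed to the current MAIN
    print(session.execute_read(count_nodes))        # can be routed to a REPLICA

driver.close()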

User API

Register instance

Registering instances should be done on a single coordinator. The chosen coordinator will become the cluster's leader.

The REGISTER INSTANCE query results in several actions:

  1. The coordinator instance will connect to the data instance on the management_server network address.
  2. The coordinator instance will start pinging the data instance every --instance-health-check-frequency-sec seconds to check its status.
  3. The data instance will be demoted from MAIN to REPLICA.
  4. The data instance will start its replication server on the replication_server network address.

REGISTER INSTANCE instanceName WITH CONFIG {"bolt_server": boltServer, "management_server": managementServer, "replication_server": replicationServer};

This operation will result in writing to the Raft log.

Add coordinator instance

The user can choose any coordinator instance and connect the other two (or more) coordinator instances to the cluster. This can be done before or after registering data instances; the order isn't important.

ADD COORDINATOR coordinatorId WITH CONFIG {"bolt_server": boltServer, "coordinator_server": coordinatorServer}; 

Set instance to MAIN

Once all data instances are registered, one data instance should be promoted to MAIN. This can be achieved by using the following query:

SET INSTANCE instanceName TO MAIN;

This query will register all other instances as REPLICAs to the new MAIN. If one of the instances is unavailable, setting the instance to MAIN will not succeed. If there is already a MAIN instance in the cluster, this query will fail.

This operation will result in writing to the Raft log.

Unregister instance

There are various reasons that could lead to the decision that an instance needs to be removed from the cluster: the hardware can break, network communication can be set up incorrectly, etc. The user can remove an instance from the cluster using the following query:

UNREGISTER INSTANCE instanceName;

At the moment of unregistration, the instance you want to unregister must not be MAIN, because unregistering MAIN could lead to an inconsistent cluster state.

The instance requested to be unregistered will also be unregistered from the current MAIN's REPLICA set.

Show instances

You can check the state of the whole cluster using the SHOW INSTANCES query. The query shows which instances are visible in the cluster, which network ports they use for managing the cluster state, whether they are considered alive from the coordinator's perspective, and their role (MAIN, REPLICA, coordinator, or unknown if not alive). Currently, the health status of coordinators isn't available, so the health field for coordinators is always unknown. This query can be run on both the leader and the followers, with the sole difference that followers cannot know the health status of other instances, since only the leader pings instances for their status.

SHOW INSTANCES;

Setting config for highly-available cluster

There are several flags that you can use for managing the cluster. The flag --management-port is used to distinguish data instances from coordinators. The provided port needs to be unique. Setting this flag creates an RPC server on the data instance capable of responding to the coordinator's RPC messages.

RPC (Remote Procedure Call) is a protocol for executing functions on a remote system. RPC enables direct communication in distributed systems and is crucial for replication and high availability tasks.

The flags --coordinator-id and --coordinator-port need to be unique and specified on coordinator instances. They cause the creation of the Raft server that coordinator instances use for communication. The flag --instance-health-check-frequency-sec specifies how often the leader coordinator should check the status of a replication instance to update its status. The flag --instance-down-timeout-sec gives the user the ability to control how much time should pass before the coordinator starts considering the instance to be down.

Consider the instance to be down only if several consecutive pings fail, because in distributed systems a single ping can fail for many different reasons.

There is an additional flag, --instance-get-uuid-frequency-sec, which sets how often the coordinator should check whether REPLICA instances follow the correct MAIN instance. A REPLICA may die and come back up before the coordinator notices. In that case, the REPLICA will not follow MAIN for up to --instance-get-uuid-frequency-sec seconds. It is advisable to set this flag to a value larger than --instance-down-timeout-sec, since the coordinator first needs to confirm that the instance was actually down and not just affected by a transient networking issue.

Failover

Determining instance's health

Every --instance-health-check-frequency-sec seconds, the coordinator contacts each instance. The instance is not considered to be down unless --instance-down-timeout-sec has passed and the instance hasn't responded to the coordinator in the meantime. The current expectation is to set --instance-health-check-frequency-sec to be less than --instance-down-timeout-sec, and for --instance-down-timeout-sec to be a multiple of --instance-health-check-frequency-sec with coefficient N, where N >= 2. For example, setting --instance-down-timeout-sec=5 and --instance-health-check-frequency-sec=1 results in the coordinator contacting each instance every second, and the instance is considered dead after it fails to respond 5 times (5 seconds / 1 second).
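As a rough sketch of that arithmetic (illustrative Python, not Memgraph code), the number of consecutive missed pings tolerated before an instance is declared down follows directly from the two flags:

# Example flag values from the paragraph above.
instance_health_check_frequency_sec = 1   # --instance-health-check-frequency-sec
instance_down_timeout_sec = 5             # --instance-down-timeout-sec

# The down timeout should be at least twice the check frequency (N >= 2).
missed_pings_before_down = instance_down_timeout_sec // instance_health_check_frequency_sec
assert missed_pings_before_down >= 2
print(missed_pings_before_down)  # 5 -> declared down after 5 consecutive missed pings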

In case a REPLICA doesn't respond to a health check, the leader coordinator will try to contact it again every --instance-health-check-frequency-sec seconds. When the REPLICA instance rejoins the cluster (comes back up), it always rejoins as REPLICA. For the MAIN instance, there are two options. If it is down for less than --instance-down-timeout-sec, it will rejoin as MAIN because it is still considered alive. If it is down for more than --instance-down-timeout-sec, the failover procedure is initiated. Whether MAIN will rejoin as MAIN then depends on the success of the failover procedure. If the failover procedure succeeds, the old MAIN will rejoin as REPLICA. If failover doesn't succeed, MAIN will rejoin as MAIN once it comes back up.
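The rejoin behavior described above can be summarized with a small, purely illustrative Python sketch (the function and its parameters are made up for this example):

def old_main_rejoin_role(down_for_sec, instance_down_timeout_sec, failover_succeeded):
    """Role the old MAIN comes back with, per the rules above (illustrative)."""
    if down_for_sec < instance_down_timeout_sec:
        return "MAIN"        # still considered alive, keeps its role
    return "REPLICA" if failover_succeeded else "MAIN"

print(old_main_rejoin_role(3, 5, failover_succeeded=False))   # MAIN
print(old_main_rejoin_role(30, 5, failover_succeeded=True))   # REPLICA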

Failover procedure - high level description

From the alive REPLICAs, the coordinator chooses a new potential MAIN. This instance is only a potential MAIN, as the failover procedure can still fail due to various factors (networking issues, the promotion to MAIN failing, an alive REPLICA failing to accept an RPC message, etc.). Before promoting a REPLICA to MAIN, the coordinator sends an RPC request to each alive REPLICA to stop listening to the old MAIN. Once each alive REPLICA acknowledges that it stopped listening to the old MAIN, the coordinator sends an RPC request to the potential new MAIN, which is still in the REPLICA state, to promote itself to the MAIN instance, with info about the other REPLICAs to which it will replicate data. Once that request succeeds, the new MAIN can start replicating to the other instances and accepting write queries.

Choosing new MAIN from available REPLICAs

When failover happens, some REPLICAs can also be down. From the list of alive REPLICAs, a new MAIN is chosen. First, the leader coordinator contacts each alive REPLICA to get info about each database's last commit timestamp. If multi-tenancy is enabled, the coordinator gets info on all databases and their last commit timestamps from each instance. Currently, the coordinator chooses the instance that will become the new MAIN by comparing only the latest commit timestamp of the default database for every instance. If multiple instances have the same latest commit timestamp, the instance that was registered earlier is chosen as the new MAIN.
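A minimal Python sketch of this selection rule (illustrative only, not Memgraph's implementation; the tuple layout is an assumption):

# Each alive REPLICA as (registration_order, name, last_commit_timestamp_of_default_db).
alive_replicas = [
    (2, "instance_2", 1540),
    (1, "instance_1", 1540),
    (3, "instance_3", 900),
]

# Highest commit timestamp of the default database wins; ties go to the
# instance that was registered earlier (lower registration order).
new_main = max(alive_replicas, key=lambda r: (r[2], -r[0]))
print(new_main[1])  # instance_1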

Providing atomicity of action

To ensure the atomicity of each action, the leader coordinator wraps each of the following actions with locks:

  • Registering of replica instance
  • Unregistering of replica instance
  • Setting instance to MAIN
  • Failover procedure
  • Force reset of the cluster

This mechanism opens the lock before each action starts and closes it once the action is complete. A closed lock means the action has fully completed. If, due to some failure, the action doesn't fully complete, the cluster enters a force reset in which it is reset to the state held in the Raft log. The source of truth is what is stored in the Raft log across the majority of coordinators.
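The open-lock/close-lock idea can be sketched in a few lines of Python (purely illustrative; the names and the Raft log modeled as a list are assumptions, not Memgraph's code):

def force_reset_cluster(raft_log):
    # Illustrative stand-in: demote all alive instances to REPLICA and
    # rerun failover, based on the state recorded in the Raft log.
    raft_log.append("force_reset")

def run_cluster_action(raft_log, action):
    raft_log.append("lock_opened")      # recorded before the action starts
    try:
        action()                        # e.g. register instance, set MAIN, failover
        raft_log.append("lock_closed")  # recorded only when the action fully completes
    except Exception:
        force_reset_cluster(raft_log)   # an open lock left behind triggers a force reset

log = []
run_cluster_action(log, lambda: None)
print(log)  # ['lock_opened', 'lock_closed']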

Force reset of cluster

The leader coordinator executes a force reset of the cluster if an action isn't fully complete. Failure can happen anywhere; for example, in the case of setting an instance to MAIN, the RPC request to a REPLICA instance to promote itself to MAIN can succeed, but writing to the Raft log that the instance was promoted can fail. A force reset includes demoting every alive instance to REPLICA and executing the failover procedure once again. Such a procedure is needed because the cluster currently doesn't track exactly where the action failed, only whether it fully succeeded. The Raft log is taken as the source of truth at all times. In case the leader coordinator dies while executing the force reset, the next coordinator elected as leader will continue executing the force reset. The action is retried until it succeeds.

If an instance is down at the point of force reset, the leader coordinator writes in the Raft log that the instance needs to be demoted to REPLICA once it comes back up.

If all instances are down at the point of force reset, the action won't succeed as a new MAIN instance can't be chosen.

Old MAIN rejoining to the cluster

Once the old MAIN gets back up, the coordinator sends an RPC request to demote the old MAIN to REPLICA. The coordinator tracks at all times which instance was the last MAIN.

The leader coordinator sends two RPC requests, in the given order, to demote the old MAIN to REPLICA:

  1. Demote MAIN to REPLICA RPC request
  2. A request to store the UUID of the current MAIN, which the old MAIN, now acting as a REPLICA instance, must listen to. If the MAIN gets demoted but the RPC request to store the new MAIN's UUID doesn't succeed, the coordinator will retry both actions until they succeed.

How REPLICA knows which MAIN to listen to

Each REPLICA stores the UUID of the MAIN it listens to. If a network partition happens in which MAIN can talk to a REPLICA but the coordinator can't talk to the MAIN, then from the coordinator's point of view that MAIN is down, while from the REPLICA's point of view the MAIN instance is still alive. The coordinator will start the failover procedure, and the cluster could end up with multiple MAINs, with REPLICAs listening to both. To prevent such an issue, before failover to a new MAIN happens, each REPLICA gets a new UUID that no current MAIN has. The coordinator generates the new UUID, which the new MAIN gets to use only once each alive REPLICA acknowledges that it received it. In the RPC request to promote itself to MAIN, the potential new MAIN instance receives the generated UUID that the REPLICAs are ready to listen to.
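A condensed Python sketch of this UUID hand-off (illustrative only; the Instance class and its methods stand in for Memgraph's RPC calls):

import uuid

class Instance:
    """Toy stand-in for a data instance reachable over RPC."""
    def __init__(self, name):
        self.name = name
        self.main_uuid = None
        self.role = "replica"

    def set_main_uuid(self, value):    # stand-in for the "store new UUID" RPC
        self.main_uuid = value

    def promote_to_main(self, value):  # stand-in for the "promote to MAIN" RPC
        self.main_uuid = value
        self.role = "main"

def failover(alive_replicas, candidate):
    new_main_uuid = uuid.uuid4()              # a UUID no current MAIN has
    for replica in alive_replicas:            # every alive REPLICA must acknowledge
        replica.set_main_uuid(new_main_uuid)  # the new UUID before promotion
    candidate.promote_to_main(new_main_uuid)  # only then promote the candidate

replicas = [Instance("instance_1"), Instance("instance_2")]
failover(replicas, replicas[0])
print(replicas[0].role)  # main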

If a REPLICA was down at some point, the MAIN could have changed in the meantime. When the REPLICA comes back up, it doesn't listen to any MAIN until the coordinator sends it an RPC request to start listening to the MAIN with the given UUID.

Replication concerns

Force sync of data

On failover, the coordinator promotes one of the alive REPLICAs to become the new MAIN. For the promotion to MAIN to succeed, the new MAIN figures out for each REPLICA whether it is behind (has less up-to-date data) or has data that the new MAIN doesn't have. If a REPLICA has data that MAIN doesn't have, that REPLICA is in a diverged-from-MAIN state. If at least one REPLICA is in the diverged-from-MAIN state, failover won't succeed, as MAIN can't replicate data to a diverged-from-MAIN REPLICA.

When choosing a new MAIN in the failover procedure from the list of available REPLICAs, the instance with the latest commit timestamp for the default database is chosen as the new MAIN. If some other instance had more up-to-date data but was down while the failover procedure was choosing a new MAIN, then when that instance rejoins the cluster, the new MAIN sends it a force sync RPC request. The force sync RPC request deletes all current data in all databases on the given instance, after which the instance accepts data from the current MAIN. This way, the cluster always follows the current MAIN.
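A simplified Python sketch of the diverged-from-MAIN idea (illustrative only; modeling each instance's committed history as a list of transaction ids is an assumption made for this example):

def needs_force_sync(main_history, rejoining_history):
    """The rejoining instance diverged if it committed something MAIN never saw."""
    return any(txn not in main_history for txn in rejoining_history)

main_history = ["t1", "t2", "t3"]
diverged     = ["t1", "t2", "t4"]   # committed t4 while cut off from the cluster
behind       = ["t1", "t2"]

print(needs_force_sync(main_history, diverged))  # True  -> wipe and resync from MAIN
print(needs_force_sync(main_history, behind))    # False -> MAIN replicates t3 normally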

Instances' restart

Data instances' restart

Data instances can fail both as MAIN and as REPLICA. When an instance that was a REPLICA comes back, it won't accept updates from any instance until the coordinator tells it which MAIN to listen to. This should happen automatically when the coordinator's ping to the instance passes. When the MAIN instance comes back, any writing to it will be forbidden until a ping from the coordinator passes. When the coordinator realizes the once-MAIN instance is alive, it will choose between enabling writing on the old MAIN or demoting it to REPLICA. The choice depends on whether another MAIN instance was chosen between the old MAIN's failure and its restart.

Since an instance can die and come back up, use --replication-restore-state-on-startup=true to restore the last replication state. This configuration will restore the instance to the state it was last in, and everything should function normally from that point on. If this flag is not set, every instance will restart as MAIN, and that case is not handled. Furthermore, it is important for data instances to have --data-recovery-on-startup=true, which recovers data from the data directory. It also recovers important information for each database on the data instance. This way, MAIN can establish a connection to the REPLICAs and replicate data to the correct databases.

Coordinator instances' restart

If a coordinator instance dies and is restarted, it loses all data from its Raft log. Information about other coordinators and their endpoints is recovered from the high_availability/coordinator directory stored under the top-level --data-directory folder. When the coordinator rejoins, it will communicate with the other coordinators and receive updates from the current leader. It is important that at least one coordinator is alive at all times so that data from the Raft log is not lost.

Handling errors

Distributed systems can fail in various ways. The logic is implemented so that Memgraph instances are resilient to occasional network failures and independent machine failures. That's why there are parameters controlling the frequency of health checks from the coordinator to replication instances and the time needed to conclude that an instance is down.

Example

This example shows how to set up a highly available cluster in Memgraph using three coordinators and three data instances.

1. Start all instances

i.) Start coordinator1:

docker run  --name coord1 -p 7687:7687 -p 7444:7444 memgraph/memgraph-mage --bolt-port=7687 --log-level=TRACE --data-directory=/tmp/mg_data_coord1 --log-file=/tmp/coord1.log --also-log-to-stderr --coordinator-id=1 --coordinator-port=10111 --experimental-enabled=high-availability

ii.) Start coordinator2:

docker run  --name coord2 -p 7688:7688 -p 7445:7444 memgraph/memgraph-mage --bolt-port=7688 --log-level=TRACE --data-directory=/tmp/mg_data_coord2 --log-file=/tmp/coord2.log --also-log-to-stderr --coordinator-id=2 --coordinator-port=10112 --experimental-enabled=high-availability

iii.) Start coordinator3:

docker run  --name coord3 -p 7689:7689 -p 7446:7444 memgraph/memgraph-mage --bolt-port=7689 --log-level=TRACE --data-directory=/tmp/mg_data_coord3 --log-file=/tmp/coord3.log --also-log-to-stderr --coordinator-id=3 --coordinator-port=10113 --experimental-enabled=high-availability

iv.) Start instance1:

docker run  --name instance1 -p 7690:7690 -p 7447:7444 memgraph/memgraph-mage --bolt-port=7690 --log-level=TRACE --data-directory=/tmp/mg_data_instance1 --log-file=/tmp/instance1.log --also-log-to-stderr --replication-restore-state-on-startup=true --management-port=10011 --experimental-enabled=high-availability

v.) Start instance2:

docker run --name instance2 -p 7691:7691 -p 7448:7444 memgraph/memgraph-mage --bolt-port=7691 --log-level=TRACE --data-directory=/tmp/mg_data_instance2 --log-file=/tmp/instance2.log --also-log-to-stderr --replication-restore-state-on-startup=true --management-port=10012 --experimental-enabled=high-availability

vi.) Start instance3:

docker run --name instance3 -p 7692:7692  -p 7449:7444 memgraph/memgraph-mage --bolt-port=7692 --log-level=TRACE --data-directory=/tmp/mg_data_instance3 --log-file=/tmp/instance3.log --also-log-to-stderr --replication-restore-state-on-startup=true --management-port=10013 --experimental-enabled=high-availability

2. Register instances

i.) Connect with any Memgraph client to any coordinator. Here, we choose coordinator 1.

mgconsole --port=7687

ii.) Connect the other two coordinator instances to the cluster.

ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "127.0.0.1:7688", "coordinator_server": "127.0.0.1:10112"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "127.0.0.1:7689", "coordinator_server": "127.0.0.1:10113"};

iii.) Register 3 data instances as part of the cluster:

In Docker deployments where instances are not on the local host, replace 127.0.0.1 with the container's IP address.

REGISTER INSTANCE instance_1 WITH CONFIG {"bolt_server": "127.0.0.1:7690", "management_server": "127.0.0.1:10011", "replication_server": "127.0.0.1:10001"};
REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "127.0.0.1:7691", "management_server": "127.0.0.1:10012", "replication_server": "127.0.0.1:10002"};
REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "127.0.0.1:7692", "management_server": "127.0.0.1:10013", "replication_server": "127.0.0.1:10003"};

iv.) Set instance_3 as MAIN:

SET INSTANCE instance_3 TO MAIN;

v.) Connect to the leader coordinator and check the cluster state with SHOW INSTANCES;

name           raft_socket_address   coordinator_socket_address   health    role
coordinator_1  127.0.0.1:10111       ""                           unknown   coordinator
coordinator_2  127.0.0.1:10112       ""                           unknown   coordinator
coordinator_3  127.0.0.1:10113       ""                           unknown   coordinator
instance_1     ""                    127.0.0.1:10011              up        replica
instance_2     ""                    127.0.0.1:10012              up        replica
instance_3     ""                    127.0.0.1:10013              up        main

3. Check automatic failover

Let's say that the current MAIN instance is down for some reason. After --instance-down-timeout-sec seconds, the coordinator will realize that and automatically promote one of the alive REPLICAs to become the new MAIN. The output of running SHOW INSTANCES on the leader coordinator could then look like:

instance_name  raft_socket_address   coordinator_socket_address   alive     role
coordinator_1  127.0.0.1:10111       ""                           unknown   coordinator
coordinator_2  127.0.0.1:10112       ""                           unknown   coordinator
coordinator_3  127.0.0.1:10113       ""                           unknown   coordinator
instance_1     ""                    127.0.0.1:10011              up        main
instance_2     ""                    127.0.0.1:10012              up        replica
instance_3     ""                    127.0.0.1:10013              down      unknown