High availability

High availability (Enterprise)

⚠️

Memgraph 2.15 introduced a high availability enterprise feature, which is only enabled if used with --experimental-enabled=high-availability flag.

A cluster is considered highly available if, at any point, there is some instance that can respond to a user query. Our high availability relies on replication. The cluster consists of:

  • The MAIN instance on which the user can execute write queries
  • REPLICA instances that can only respond to read queries
  • COORDINATOR instances that manage the cluster state.

Depending on how configuration flags are set, Memgraph can run as a data instance or coordinator instance. The coordinator instance is a new addition to enable the high availability feature and orchestrates data instances to ensure that there is always one MAIN instance in the cluster.

Cluster management

For achieving high availability, Memgraph uses Raft consensus protocol, which is very similar to Paxos in terms of performance and fault-tolerance but with a significant advantage that it is much easier to understand. It's important to say that Raft isn't a Byzantine fault-tolerant algorithm. You can learn more about Raft in the paper In Search of an Understandable Consensus Algorithm (opens in a new tab).

Typical Memgraph's highly available cluster consists of 3 data instances (1 MAIN and 2 REPLICAS) and 3 coordinator instances backed up by Raft protocol. Users can create more than 3 coordinators, but the replication factor (RF) of 3 is a de facto standard in distributed databases.

One coordinator instance is the leader whose job is always to ensure one writeable data instance (MAIN). The other two coordinator instances replicate changes the leader coordinator did in its own Raft log. Operations saved into the Raft log are those that are related to cluster management. Memgraph doesn't have its implementation of the Raft protocol. For this task, Memgraph uses an industry-proven library NuRaft (opens in a new tab).

You can start the coordinator instance by specifying --coordinator-id, --coordinator-port and --management-port flags. Followers ping the leader on the --management-port to get health state of the cluster. The coordinator instance only responds to queries related to high availability, so you cannot execute any data-oriented query on it. The coordinator port is used for the Raft protocol, which all coordinators use to ensure the consistency of the cluster's state. Data instances are distinguished from coordinator instances by specifying only --management-port flag. This port is used for RPC network communication between the coordinator and data instances. When started by default, the data instance is MAIN. The coordinator will ensure that no data inconsistency can happen during and after the instance's restart. Once all instances are started, the user can start adding data instances to the cluster.

The Raft consensus algorithm ensures that all nodes in a distributed system agree on a single source of truth, even in the presence of failures, by electing a leader to manage a replicated log. It simplifies the management of the replicated log across the cluster, providing a way to achieve consistency and coordination in a fault-tolerant manner. Users are advised to use an odd number of instances since Raft, as a consensus algorithm, works by forming a majority in the decision making.

Bolt+routing

Directly connecting to the MAIN instance isn't preferred in the HA cluster since the MAIN instance changes due to various failures. Because of that, users can use bolt+routing so that write queries can always be sent to the correct data instance. This protocol works so that the client first sends a ROUTE bolt message to any coordinator instance. The coordinator replies to the message by returning the routing table with three entries specifying from which instance can be data read, to which instance data can be written and which instances can behave as routers. In the Memgraph HA cluster, the MAIN data instance is the only writeable instance, REPLICAs are readable instances, and COORDINATORs behave as routers. For more details check Request message ROUTE (opens in a new tab) documentation.

Users only need to change the scheme they use for connecting to coordinators. This means instead of using bolt://<main_ip_address>, you should use neo4j://<coordinator_ip_adresss> to get an active connection to the current MAIN instance in the cluster. You can find examples of how to use bolt+routing in different programming languages here (opens in a new tab).

It is important to note that setting up the cluster on one coordinator (registration of data instances and coordinators, setting main) must be done using bolt connection since bolt+routing is only used for routing data-related queries, not coordinator-based queries.

Starting instances

You can start the data and coordinator instances using environment flags or configuration flags. The main difference between data instance and coordinator is that data instances have --management-port, whereas coordinators must have --coordinator-id and --coordinator-port.

Configuration Flags

Data instance

Memgraph data instance must have --replication-restore-state-on-startup=true and --data-recovery-on-startup=true flags. These two flags enable the recovery of all the necessary data and durability configurations for Memgraph to continue the replication process, meaning that the instance can continue receiving updates from MAIN and MAIN can continue accepting write queries.

There are two additional flags, --management-port=<port> and --experimental-enabled=high_availability. These two flags are tied to the high availability feature, enable the coordinator to connect to the data instance, and allow the Memgraph data instance to use the high availability feature.

docker run --name instance1 -p 7687:7687 -p 7444:7444 memgraph/memgraph-mage
--data-recovery-on-startup=true \
--replication-restore-state-on-startup=true \
--management-port=10011 \
--bolt-port=7687 \
--experimental-enabled=high-availability

Coordinator instance

docker run --name coord1 -p 7690:7687 -p 7445:7444 memgraph/memgraph-mage
--coordinator-port=10111 
--bolt-port=7687 
--coordinator-id=1 
--experimental-enabled=high-availability
--coordinator-hostname=localhost
--management-port=10121

Coordinator IDs serve as identifiers, the coordinator port is used for synchronization and log replication between coordinators and management port is used to get health state of cluster from leader coordinator. Coordinator IDs, coordinator ports and management ports must be different for all coordinators.

Configuration option --coordinator-hostname must be set on all coordinator instances. It is used on followers to ping the leader coordinator on the correct IP address and return the health state about the cluster. You can set this configuration flag to the IP address, the fully qualified domain name (FQDN), or even the DNS name. The suggested approach is to use DNS, otherwise, in case the IP address changes, network communication between instances in the cluster will stop working.

In local setup example would be to set --coordinator-hostname for each instance to localhost.

Env flags

There is an additional way to set high availability instances using environment variables. It is important to say that for the following configuration options, you can either use environment variables or configuration flags:

  • experimental enabled
  • bolt port
  • coordinator port
  • coordinator id
  • management port
  • high availability durability
  • path to nuraft log file
  • coordinator hostname

Data instances

Here are the environment variables you need to use to set data instance using only environment variables:

export MEMGRAPH_EXPERIMENTAL_ENABLED=high-availability
export MEMGRAPH_MANAGEMENT_PORT=10011
export MEMGRAPH_BOLT_PORT=7692

When using any of these environment variables, flags --bolt-port, --management-port and --experimental-enabled will be ignored.

Coordinator instances

export MEMGRAPH_EXPERIMENTAL_ENABLED=high-availability
export MEMGRAPH_COORDINATOR_PORT=10111
export MEMGRAPH_COORDINATOR_ID=1
export MEMGRAPH_BOLT_PORT=7692
export MEMGRAPH_HA_DURABILITY=true
export MEMGRAPH_NURAFT_LOG_FILE="<path-to-log-file>"
export MEMGRAPH_COORDINATOR_HOSTNAME="localhost"
export MEMGRAPH_MANAGEMENT_PORT=10121

When using any of these environment variables, flags for --bolt-port, --coordinator-port, --coordinator-id,--coordinator-hostname and --experimental-enabled will be ignored.

There is an additional environment variable you can use to set the path to the file with cypher queries used to start a high availability cluster. Here, you can use queries we define in the next chapter called User API.

export MEMGRAPH_HA_CLUSTER_INIT_QUERIES=<file_path>

After the coordinator instance is started, Memgraph will run queries one by one from this file to set up a high availability cluster.

User API

Register instance

Registering instances should be done on a single coordinator. The chosen coordinator will become the cluster's leader.

Register instance query will result in several actions:

  1. The coordinator instance will connect to the data instance on the management_server network address.
  2. The coordinator instance will start pinging the data instance every --instance-health-check-frequency-sec seconds to check its status.
  3. Data instance will be demoted from MAIN to REPLICA.
  4. Data instance will start the replication server on replication_server.
REGISTER INSTANCE instanceName WITH CONFIG {"bolt_server": boltServer, "management_server": managementServer, "replication_server": replicationServer};

This operation will result in writing to the Raft log.

In case MAIN is already set in the cluster, a replica instance will also be registered on MAIN.

Add coordinator instance

The user can choose any coordinator instance and connect other two (or more) coordinator instances to the cluster. This can be done before or after registering data instances, the order isn't important.

ADD COORDINATOR coordinatorId WITH CONFIG {"bolt_server": boltServer, "coordinator_server": coordinatorServer}; 

We need to inform the leader coordinator of the existence of other follower coordinators. If you are running ADD COORDINATOR Cypher queries on coordinator with --coordinator-id=1, you only need to add other two coordinator instances. For example, when using 3 coordinators you need to use two ADD COORDINATOR queries:

ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "127.0.0.1:7691", "coordinator_server": "127.0.0.1:10112"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "127.0.0.1:7692", "coordinator_server": "127.0.0.1:10113"};

Set instance to MAIN

Once all data instances are registered, one data instance should be promoted to MAIN. This can be achieved by using the following query:

SET INSTANCE instanceName to MAIN;

This query will register all other instances as REPLICAs to the new MAIN. If one of the instances is unavailable, setting the instance to MAIN will not succeed. If there is already a MAIN instance in the cluster, this query will fail.

This operation will result in writing to the Raft log.

Unregister instance

There are various reasons which could lead to the decision that an instance needs to be removed from the cluster. The hardware can be broken, network communication could be set up incorrectly, etc. The user can remove the instance from the cluster using the following query:

UNREGISTER INSTANCE instanceName;

At the moment of registration, the instance that you want to unregister must not be MAIN because unregistering MAIN could lead to an inconsistent cluster state.

The instance requested to be unregistered will also be unregistered from the current MAIN's REPLICA set.

Demote instance

Demoting instances should be done by a leader coordinator.

Demote instance query will result in several actions:

  1. The coordinator instance will demote the instance to REPLICA.
  2. The coordinator instance will continue pinging the data instance every --instance-health-check-frequency-sec seconds to check its status. In this case, the leader coordinator won't choose a new MAIN, but as a user, you should choose one instance to promote it to MAIN using the SET INSTANCE instance TO MAIN query.
DEMOTE INSTANCE instanceName;

This operation will result in writing to the Raft log.

Force reset cluster state

In case the cluster gets stuck there is an option to do the force reset of the cluster. You need to execute a command on the leader coordinator. This command will result in the following actions:

  1. The coordinator instance will demote each alive instance to REPLICA.
  2. From the alive instance it will choose a new MAIN instance.
  3. Instances that are down will be demoted to REPLICAs once they come back up.
FORCE RESET CLUSTER STATE;

This operation will result in writing to the Raft log.

Show instances

You can check the state of the whole cluster using the SHOW INSTANCES query. The query will show which instances are visible in the cluster, which network ports they are using for managing cluster state, whether they are considered alive from the coordinator's perspective, their role (are they MAIN, REPLICA, LEADER, FOLLOWER or unknown if not alive) and the response time from instances to the leader's ping. This query can be run both on the leader and followers. Since only the leader knows the exact state of health state and response time, followers will do the following actions in exact order:

  1. Try contacting the leader to get the state of the cluster, since the leader has all the information. If the leader responds, the follower will return the result as if SHOW INSTANCES was run by the leader.
  2. When the leader doesn't respond, or currently there is no leader, the follower will return all the data instances and coordinators in the cluster but with the health state and last response time set to 0, respectively.
SHOW INSTANCES;

Setting config for highly-available cluster

There are several flags that you can use for managing the cluster. Flag --management-port is used for distinguishing data instances from coordinators. The provided flag needs to be unique. Setting a flag will create an RPC server on data instances capable of responding to the coordinator's RPC messages.

RPC (Remote Procedure Call) is a protocol for executing functions on a remote system. RPC enables direct communication in distributed systems and is crucial for replication and high availability tasks.

Flags --coordinator-id, --coordinator-port and --management-port need to be unique and specified on coordinator instances. They will cause the creation of a Raft server that coordinator instances use for communication. Flag --instance-health-check-frequency-sec specifies how often should leader coordinator should check the status of the replication instance to update its status. Flag --instance-down-timeout-sec gives the user the ability to control how much time should pass before the coordinator starts considering the instance to be down.

Consider the instance to be down only if several consecutive pings fail because a single ping can fail because of a large number of different reasons in distributed systems.

There is an additional flag, --instance-get-uuid-frequency-sec, which sets how often the coordinator should check on REPLICA instances if they follow the correct MAIN instance. REPLICA may die and get back up before the coordinator notices that. In that case, REPLICA will not follow MAIN for --instance-get-uuid-frequency-sec seconds. It is advisable to set the flag to more than --instance-down-timeout-sec, as it is needed first to confirm the instance was down and not that there was some networking issue.

Failover

Determining instance's health

Every --instance-health-check-frequency-sec seconds, the coordinator contacts each instance. The instance is not considered to be down unless --instance-down-timeout-sec has passed and the instance hasn't responded to the coordinator in the meantime. Users must set --instance-health-check-frequency-sec to be less or equal to the --instance-down-timeout-sec but we advise users to set --instance-down-timeout-sec to a multiplier of --instance-health-check-frequency-sec. Set the multiplier coefficient to be N>=2. For example, set --instance-down-timeout-sec=5 and --instance-health-check-frequency-sec=1 which will result in coordinator contacting each instance every 1 second and the instance is considered dead after it doesn't respond 5 times (5 seconds / 1 second).

In case a REPLICA doesn't respond to a health check, the leader coordinator will try to contact it again every --instance-health-check-frequency-sec. When the REPLICA instance rejoins the cluster (comes back up), it always rejoins as REPLICA. For MAIN instance, there are two options. If it is down for less than --instance-down-timeout-sec, it will rejoin as MAIN because it is still considered alive. If it is down for more than --instance-down-timeout-sec, failover procedure is initiated. Whether MAIN will rejoin as MAIN depends on the success of the failvoer procedure. If the failover procedure succeeds, now old MAIN will rejoin as REPLICA. If failover doesn't succeed, MAIN will rejoin as MAIN once it comes back up.

Failover procedure - high level description

From alive REPLICAs coordinator chooses a new potential MAIN. This instance is a potential MAIN as the failover procedure can still fail due to various factors (networking issues, promote to MAIN fails, any alive REPLICA failing to accept an RPC message, etc). Before promoting REPLICA to MAIN, the coordinator sends a RPC request to each alive REPLICA to stop listening to the old MAIN. Once each alive REPLICA acknowledges that it stopped listening to the old MAIN, the coordinator sends an RPC request to the potential new MAIN, which is still in REPLICA state, to promote itself to the MAIN instance with info about other REPLICAs to which it will replicate data. Once that request succeeds, the new MAIN can start replication to the other instances and accept write queries.

Choosing new MAIN from available REPLICAs

When failover is happening some REPLICAs can also be down. From the list of alive REPLICAs, a new MAIN is chosen. First, the leader coordinator contacts each alive REPLICA to get info about each database's last commit timestamp. In the case of enabled multi-tenancy, from each instance coordinator will get info on all databases and their last commit timestamp. Currently, the coordinator chooses an instance to become a new MAIN by comparing the latest commit timestamp only of the default database for every instance. If multiple instances have the same latest commit timestamp, the instance that was registered earlier will be chosen as a new MAIN.

Providing atomicity of action

To ensure the atomicity of each action, the leader coordinator wraps each of the following actions with locks:

  • Registering of replica instance
  • Unregistering of replica instance
  • Setting instance to MAIN
  • Failover procedure
  • Force reset of the cluster

The special mechanism includes opening the lock before each action starts and closing it once it is complete. A closed lock means the action is fully completed. If, due to some failure, action isn't complete fully, the cluster enters a state of force reset where the cluster is reset to the state held in the Raft log. The state of truth is what is stored in the Raft log across most of the coordinators.

Force reset of cluster

The leader coordinator executes a force reset of the cluster if the action isn't fully complete. Failure can happen anywhere, i.e. in the case of setting instance to MAIN, the RPC request to a REPLICA instance to promote itself to MAIN can succeed, but writing to the Raft log that the instance was promoted can fail. Force reset includes demoting every alive instance to REPLICA, and executing the failover procedure once again. Such a procedure is needed as of this moment cluster doesn't track where the action failed exactly, but only whether it fully succeded. Raft log is taken as a source of truth at all times. In case the leader coordinator dies while executing the force reset, the next coordinator which is elected as the leader, will continue executing the force reset. Action is executed until it succeeds.

It is important to note that if an action fails and all instances are down, the leader will attempt to execute a force reset until one instance is promoted to MAIN. Until then, no other actions are allowed on the cluster.

If an instance is down at the point of force reset, the leader coordinator writes in the Raft log that the instance needs to be demoted to REPLICA once it comes back up.

If all instances are down at the point of force reset, the action won't succeed as a new MAIN instance can't be chosen.

Old MAIN rejoining to the cluster

Once the old MAIN gets back up, the coordinator sends an RPC request to demote the old MAIN to REPLICA. The coordinator tracks at all times which instance was the last MAIN.

The leader coordinator sends two RPC requests in the given order to demote old MAIN to REPLICA:

  1. Demote MAIN to REPLICA RPC request
  2. A request to store the UUID of the current MAIN, which the old MAIN, now acting as a REPLICA instance, must listen to. If the MAIN gets demoted, but the RPC request to change the UUID of the new MAIN doesn't succeed, the coordinator will retry both actions until success.

How REPLICA knows which MAIN to listen

Each REPLICA has a UUID of MAIN it listens. If a network partition happens where MAIN can talk to a REPLICA but the coordinator can't talk to the MAIN, from the coordinator's point of view that MAIN is down. From REPLICA's point of view, the MAIN instance is still alive. The coordinator will start the failover procedure, and we can end up with multiple MAINs where REPLICAs can listen to both MAINs. To prevent such an issue, before failover happens to a new MAIN, each REPLICA gets a new UUID that no current MAIN has. The coordinator generates the new UUID, which the new MAIN will get to use once each alive replica acknowledges it received the new UUID. In RPC's request to promote itself to MAIN, the new potential MAIN instance will receive the generated UUID, which REPLICAs are ready to listen to.

If REPLICA was down at one point, MAIN could have changed. When REPLICA gets back up, it doesn't listen to any MAIN until the coordinator sends an RPC request to REPLICA to start listening to MAIN with the given UUID.

Replication concerns

Force sync of data

On failover, the current logic is to choose the first available instance from alive REPLICAs to promote to the new MAIN. For promotion to the MAIN to successfully happen, the new MAIN figures out if REPLICA is behind (has less up-to-date data) or has data that the new MAIN doesn't have. If REPLICA has data that MAIN doesn't have, in that case REPLICA is in a diverged-from-MAIN state. If at least one REPLICA is in diverged-from-MAIN state, failover won't succeed as MAIN can't replicate data to diverged-from-MAIN REPLICA. When choosing a new MAIN in the failover procedure from the list of available REPLICAs, the instance with the latest commit timestamp for the default database is chosen as the new MAIN. In case some other instance had more up-to-date data when the failover procedure was choosing a new MAIN but was down at that point when rejoining the cluster, the new MAIN instance sends a force sync RPC request to such instance. Force sync RPC request deletes all current data on all databases on a given instance and accepts data from the current MAIN. This way cluster will always follow the current MAIN.

Actions on follower coordinators

From follower coordinators you can only execute SHOW INSTANCES. Registration of data instance, unregistration of data instances, demoting instance, setting instance to MAIN and force reseting cluster state are all disabled.

⚠️

Under certain extreme scenarios, the current implementation of HA could lead to having Recovery Point Objective (RPO) != 0 (aka data loss). These are environments with high volume of transactions where data is constantly changed, added, deleted... If you are operating in such scenarios, please open an issue on GitHub (opens in a new tab) as we are eager to expand our support for this kind of workload.

Instances' restart

Data instances' restart

Data instances can fail both as MAIN and as REPLICA. When an instance that was REPLICA comes back, it won't accept updates from any instance until the coordinator updates its responsible peer. This should happen automatically when the coordinator's ping to the instance passes. When the MAIN instance comes back, any writing to the MAIN instance will be forbidden until a ping from the coordinator passes. When the coordinator realizes the once-MAIN instance is alive, it will choose between enabling writing on old MAIN or demoting it to REPLICA. The choice depends on whether there was any other MAIN instance chosen between the old MAIN's failure and restart.

Since the instance can die and get back up, use --replication-state-at-startup=true to restore the last replication state. This configuration will restore the instance to the state it was last, and everything should function normally from that point on. If this flag is not set, every instance will restart as MAIN and that case is not handled. Furthermore, it is important for data instances to have --data-recovery-on-startup=true, which recovers data from the data directory. It also recovers important information for each database on data instance. This way, MAIN can establish a connection to REPLICAs and replicate data to the correct databases.

Coordinator instances restart

In case the coordinator instance dies and it is restarted, it will not lose any data from the RAFT log or RAFT snapshots, since --ha-durability is enabled by default. For more details read about high availability durability in the durability chapter.

Durability

High availability durability is enabled by default. Durability flag controls logs and snapshots. Details about other coordinators are always made durable since without that information, the coordinator can't rejoin the cluster.

Information about logs and snapshots is stored under one RocksDB instance in the high_availability/raft_data/logs directory stored under the top-level --data-directory folder. All the data stored there is recovered in case the coordinator restarts.

Data about other coordinators and their endpoints is recovered from the high_availability/raft_data/network directory stored under the top-level --data-directory folder. When the coordinator rejoins, it will communicate with other coordinators and receive updates from the current leader.

In case durability is disabled, at least one coordinator must be alive at all times so that data from the Raft log is not lost.

First start

On the first start of coordinators, each will store the current version of the logs and network durability store. From that point on, each RAFT log that is sent to the coordinator is also stored on disk. For every new coordinator instance, the server config is updated. Logs are created for each user action and failover action. Snapshots are created every N (N currently being 5) logs.

Restart of coordinator

In case of failure of the coordinator, on the restart, it will read information about other coordinators stored under high_availability/raft_data/network directory.

From the network directory we will recover the server state before the coordinator stopped, including the current term, for whom the coordinator voted, and whether election timer is allowed.

It will also recover the following server config information:

  • other servers, including their endpoints, id, and auxiliary data
  • ID of the previous log
  • ID of the current log
  • additional data needed by nuRaft

If durability is not disabled, the following information will be recovered from a common RocksDB logs instance:

  • current version of logs durability store
  • snapshots found with snapshot_id_ prefix in database:
    • coordinator cluster state - all data instances with their role (MAIN or REPLICA), UUID of MAIN instance which REPLICA is listening to
    • last log idx
    • last log term
    • last cluster config
  • logs found in the interval between the start index and the last log index
    • data - each log holds data on what has changed since the last state
    • term - nuRAFT term
    • log type - nuRAFT log type

Handling of durability errors

If snapshots are not stored, an exception is thrown and left for the nuRAFT library to handle the issue. Logs can be missed and not stored since they are compacted and deleted every two snapshots and will be removed relatively fast. In case the log is missing on recovery, the coordinator will not start. In that case, it is best to start a coordinator without durability and leave other coordinators to send missing data. Users are advised to use the same strategy in case a snapshot is missing on one of the coordinators.

Memgraph throws an error in case of failure to store cluster config, which is updated in the high_availability/raft_data/network folder. If this happens, it will happen only on the first cluster start when coordinators are connecting since coordinators are configured only once at the start of the whole cluster. This is a non-recoverable error since in case the coordinator rejoins the cluster and has the wrong state of other clusters, it can become a leader without being connected to other coordinators.

Recovering from errors

Distributed systems can fail in numerous ways. With the current implementation, Memgraph instances are resilient to occasional network failures and independent machine failures. Byzantine failures aren't handled since the Raft consensus protocol cannot deal with them either.

Recovery Time Objective (RTO) is an often used term for measuring the maximum tolerable length of time that an instance or cluster can be down. Since every highly available Memgraph cluster has two types of instances, we need to analyze the failures of each separately.

Raft is a quorum-based protocol and it needs a majority of instances alive in order to stay functional. Hence, with just one coordinator instance down, RTO is 0 since the cluster stays available. With 2+ coordinator instances down (in a cluster with RF = 3), the RTO depends on the time needed for instances to come back.

Failure of REPLICA data instance isn't very harmful since users can continue writing to MAIN data instance while reading from MAIN or other REPLICAs. The most important thing to analyze is what happens when MAIN gets down. In that case, the leader coordinator uses user-controllable parameters related to the frequency of health checks from the leader to replication instances (--instance-health-check-frequency-sec) and the time needed to realize the instance is down (--instance-down-timeout-sec). After collecting enough evidence, the leader concludes the MAIN is down and performs failover using just a handful of RPC messages (correct time depends on the distance between instances). It is important to mention that the whole failover is performed with a zero data-loss if the newly chosen MAIN (previously REPLICA) had all up-to-date data.

Current deployment assumes the existence of only one datacenter, which automatically means that Memgraph won't be available in the case the whole datacenter goes down. We are actively working on 2 datacenter (2-DC) architecture.

Raft configuration parameters

Several Raft-related parameters are important for the correct functioning of the cluster. The leader coordinator sends a heartbeat message to other coordinators every second to determine their health. This configuration option is connected with leader election timeout which is a randomized value from the interval [2000ms, 4000ms] and which is used by followers to decide when to trigger new election process. Leadership expiration is set to 1800ms so that cluster can never get into situation where multiple leaders exist. These specific values give a cluster the ability to survive occasional network hiccups without triggering leadership changes.

Kubernetes

We support deploying Memgraph HA instances as part of the Kubernetes cluster. You can see example configurations here.

Docker Compose

The following example shows you how to setup Memgraph cluster using docker compose. The cluster will use user-defined bridge network.

License file license.cypher should be in the format:


SET DATABASE SETTING 'organization.name' TO '<YOUR_ORGANIZATION_NAME>';
SET DATABASE SETTING 'enterprise.license' TO '<YOUR_ENTERPRISE_LICENSE>';

You can directly use initialization file HA_register.cypher:


ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "coord2:7691", "coordinator_server": "coord2:10112", "management_server": "coord2:10122"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "coord3:7692", "coordinator_server": "coord3:10113", "management_server": "coord3:10123"};

REGISTER INSTANCE instance_1 WITH CONFIG {"bolt_server": "instance1:7687", "management_server": "instance1:10011", "replication_server": "instance1:10001"};
REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "instance2:7688", "management_server": "instance2:10012", "replication_server": "instance2:10002"};
REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "instance3:7689", "management_server": "instance3:10013", "replication_server": "instance3:10003"};
SET INSTANCE instance_3 TO MAIN;

You can directly use the following docker-compose.yml to start the cluster using docker compose up:

networks:
  memgraph_ha:
    name: memgraph_ha
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: "172.21.0.0/16"

services:
  coord1:
    image: "memgraph/memgraph"
    container_name: coord1
    volumes:
      - ./license.cypher:/tmp/init/license.cypher:ro
      - ./HA_register.cypher:/tmp/init/HA_register.cypher:ro
    environment:
      - MEMGRAPH_HA_CLUSTER_INIT_QUERIES=/tmp/init/HA_register.cypher
    command: [ "--init-file=/tmp/init/license.cypher", "--log-level=TRACE", "--data-directory=/tmp/mg_data_coord1", "--log-file=/tmp/coord1.log", "--also-log-to-stderr", "--coordinator-id=1", "--coordinator-port=10111", "--management-port=10121", "--coordinator-hostname=coord1", "--experimental-enabled=high-availability"]
    networks:
      memgraph_ha:
        ipv4_address: 172.21.0.4

  coord2:
    image: "memgraph/memgraph"
    container_name: coord2
    volumes:
      - ./license.cypher:/tmp/init/license.cypher:ro
    command: [ "--init-file=/tmp/init/license.cypher", "--log-level=TRACE", "--data-directory=/tmp/mg_data_coord2", "--log-file=/tmp/coord2.log", "--also-log-to-stderr", "--coordinator-id=2", "--coordinator-port=10112", "--management-port=10122", "--coordinator-hostname=coord2", "--experimental-enabled=high-availability"]
    networks:
      memgraph_ha:
        ipv4_address: 172.21.0.2

  coord3:
    image: "memgraph/memgraph"
    container_name: coord3
    volumes:
      - ./license.cypher:/tmp/init/license.cypher:ro
    command: [ "--init-file=/tmp/init/license.cypher",  "--log-level=TRACE", "--data-directory=/tmp/mg_data_coord3", "--log-file=/tmp/coord3.log", "--also-log-to-stderr", "--coordinator-id=3", "--coordinator-port=10113", "--management-port=10123", "--coordinator-hostname=coord3", "--experimental-enabled=high-availability"]

    networks:
      memgraph_ha:
        ipv4_address: 172.21.0.3

  instance1:
    image: "memgraph/memgraph"
    container_name: instance1
    volumes:
      - ./license.cypher:/tmp/init/license.cypher:ro
    command: ["--init-file=/tmp/init/license.cypher","--data-recovery-on-startup=true", "--log-level=TRACE", "--data-directory=/tmp/mg_data_instance1", "--log-file=/tmp/instance1.log", "--also-log-to-stderr", "--management-port=10011", "--experimental-enabled=high-availability"]
    networks:
      memgraph_ha:
        ipv4_address: 172.21.0.6

  instance2:
    image: "memgraph/memgraph"
    container_name: instance2
    volumes:
      - ./license.cypher:/tmp/init/license.cypher:ro
    command: ["--init-file=/tmp/init/license.cypher","--data-recovery-on-startup=true", "--log-level=TRACE", "--data-directory=/tmp/mg_data_instance2", "--log-file=/tmp/instance2.log", "--also-log-to-stderr", "--management-port=10012", "--experimental-enabled=high-availability"]
    networks:
      memgraph_ha:
        ipv4_address: 172.21.0.7

  instance3:
    image: "memgraph/memgraph"
    container_name: instance3
    volumes:
      - ./license.cypher:/tmp/init/license.cypher:ro
    command: ["--init-file=/tmp/init/license.cypher","--data-recovery-on-startup=true", "--log-level=TRACE", "--data-directory=/tmp/mg_data_instance3", "--log-file=/tmp/instance3.log", "--also-log-to-stderr", "--management-port=10013", "--experimental-enabled=high-availability"]
    networks:
      memgraph_ha:
        ipv4_address: 172.21.0.8

Cluster can be shut-down using docker compose down.

Manual Docker setup

This example will show how to set up a highly available cluster in Memgraph using three coordinators and 3 data instances.

Start all instances

  1. Start coordinator1:
docker run  --name coord1 -p 7690:7690 -p 7444:7444 memgraph/memgraph-mage --bolt-port=7690 --log-level=TRACE --data-directory=/tmp/mg_data_coord3 --log-file=/tmp/coord1.log --also-log-to-stderr --coordinator-id=1 --coordinator-port=10111 --management-port=10121 --experimental-enabled=high-availability --coordinator-hostname=localhost
  1. Start coordinator2:
docker run  --name coord2 -p 7691:7691 -p 7445:7444 memgraph/memgraph-mage --bolt-port=7691 --log-level=TRACE --data-directory=/tmp/mg_data_coord2 --log-file=/tmp/coord2.log --nuraft-log-file=/tmp/nuraft/coord2.log --also-log-to-stderr --coordinator-id=2 --coordinator-port=10112  --management-port=10122 --experimental-enabled=high-availability --coordinator-hostname=localhost
  1. Start coordinator3:
docker run  --name coord3 -p 7692:7692 -p 7446:7444 memgraph/memgraph-mage --bolt-port=7692 --log-level=TRACE --data-directory=/tmp/mg_data_coord3 --log-file=/tmp/coord3.log --nuraft-log-file=/tmp/nuraft/coord1.log --also-log-to-stderr --coordinator-id=3 --coordinator-port=10113 --management-port=10123 --experimental-enabled=high-availability --coordinator-hostname=localhost
  1. Start instance1:
docker run  --name instance1 -p 7687:7687 -p 7447:7444 memgraph/memgraph-mage --bolt-port=7687 --log-level=TRACE --data-directory=/tmp/mg_data_instance1 --log-file=/tmp/instance1.log --also-log-to-stderr --management-port=10011 --experimental-enabled=high-availability --data-recovery-on-startup=true
  1. Start instance2:
docker run --name instance2 -p 7688:7688 -p 7448:7444 memgraph/memgraph-mage --bolt-port=7688 --log-level=TRACE --data-directory=/tmp/mg_data_instance2 --log-file=/tmp/instance2.log --also-log-to-stderr --management-port=10012 --experimental-enabled=high-availability --data-recovery-on-startup=true
  1. Start instance3:
docker run --name instance3 -p 7689:7689  -p 7449:7444 memgraph/memgraph-mage --bolt-port=7689 --log-level=TRACE --data-directory=/tmp/mg_data_instance3 --log-file=/tmp/instance3.log --also-log-to-stderr --management-port=10013 --experimental-enabled=high-availability --data-recovery-on-startup=true

Register instances

  1. Start communication with any Memgraph client on any coordinator. Here we chose coordinator 1.
mgconsole --port=7690
  1. Connect the other two coordinator instances to the cluster.
ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "localhost:7691", "coordinator_server": "localhost:10112", "management_server": "localhost:10122"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "localhost:7692", "coordinator_server": "localhost:10113", "management_server": "localhost:10123"};
  1. Register 3 data instances as part of the cluster:

Replace <ip_address> with the container's IP address. This is necessary for Docker deployments where instances are not on the local host.

REGISTER INSTANCE instance_1 WITH CONFIG {"bolt_server": "localhost:7687", "management_server": "localhost:10011", "replication_server": "localhost:10001"};
REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "localhost:7688", "management_server": "localhost:10012", "replication_server": "localhost:10002"};
REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "localhost:7689", "management_server": "localhost:10013", "replication_server": "localhost:10003"};
  1. Set instance_3 as MAIN:
SET INSTANCE instance_3 TO MAIN;
  1. Connect to the leader coordinator and check cluster state with SHOW INSTANCES;
namebolt_servercoordinator_servermanagement_serverhealthrolelast_succ_resp_ms
coordinator_1localhost:7690localhost:10111localhost:10121upleader0
coordinator_2localhost:7691localhost:10112localhost:10122upfollower16
coordinator_3localhost:7692localhost:10113localhost:10123upfollower25
instance_1localhost:7687""localhost:10011upreplica39
instance_2localhost:7688""localhost:10012upreplica21
instance_3localhost:7689""localhost:10013upmain91

Check automatic failover

Let's say that the current MAIN instance is down for some reason. After --instance-down-timeout-sec seconds, the coordinator will realize that and automatically promote the first alive REPLICA to become the new MAIN. The output of running SHOW INSTANCES on the leader coordinator could then look like:

namebolt_servercoordinator_servermanagement_serverhealthrolelast_succ_resp_ms
coordinator_1localhost:7690localhost:10111localhost:10121upleader0
coordinator_2localhost:7691localhost:10112localhost:10122upfollower34
coordinator_3localhost:7692localhost:10113localhost:10123upfollower28
instance_1localhost:7687""localhost:10011upmain61
instance_2localhost:7688""localhost:10012upreplica74
instance_3localhost:7689""localhost:10013downunknown71222