High availability (Enterprise)
Memgraph 2.15 introduced a high availability enterprise feature, which is only enabled if used with --experimental-enabled=high-availability
flag.
High availability currently works only in the in-memory transactional storage mode.
A cluster is considered highly available if, at any point, there is some instance that can respond to a user query. Our high availability relies on replication. The cluster consists of:
- The MAIN instance on which the user can execute write queries
- REPLICA instances that can only respond to read queries
- COORDINATOR instances that manage the cluster state.
Depending on how configuration flags are set, Memgraph can run as a data instance or coordinator instance. The coordinator instance is a new addition to enable the high availability feature and orchestrates data instances to ensure that there is always one MAIN instance in the cluster.
Cluster management
For achieving high availability, Memgraph uses Raft consensus protocol, which is very similar to Paxos in terms of performance and fault-tolerance but with a significant advantage that it is much easier to understand. It's important to say that Raft isn't a Byzantine fault-tolerant algorithm. You can learn more about Raft in the paper In Search of an Understandable Consensus Algorithm (opens in a new tab).
Typical Memgraph's highly available cluster consists of 3 data instances (1 MAIN and 2 REPLICAS) and 3 coordinator instances backed up by Raft protocol. Users can create more than 3 coordinators, but the replication factor (RF) of 3 is a de facto standard in distributed databases.
One coordinator instance is the leader whose job is always to ensure one writeable data instance (MAIN). The other two coordinator instances replicate changes the leader coordinator did in its own Raft log. Operations saved into the Raft log are those that are related to cluster management. Memgraph doesn't have its implementation of the Raft protocol. For this task, Memgraph uses an industry-proven library NuRaft (opens in a new tab).
You can start the coordinator instance by specifying --coordinator-id
,
--coordinator-port
and --management-port
flags. Followers ping the leader on the --management-port
to get health state of the cluster. The coordinator instance only responds to
queries related to high availability, so you cannot execute any data-oriented query on it. The coordinator port is used for the Raft protocol, which
all coordinators use to ensure the consistency of the cluster's state. Data instances are distinguished from coordinator instances by
specifying only --management-port
flag. This port is used for RPC network communication between the coordinator and data
instances. When started by default, the data instance is MAIN. The coordinator will ensure that no data inconsistency can happen during and after the instance's
restart. Once all instances are started, the user can start adding data instances to the cluster.
The Raft consensus algorithm ensures that all nodes in a distributed system agree on a single source of truth, even in the presence of failures, by electing a leader to manage a replicated log. It simplifies the management of the replicated log across the cluster, providing a way to achieve consistency and coordination in a fault-tolerant manner. Users are advised to use an odd number of instances since Raft, as a consensus algorithm, works by forming a majority in the decision making.
Bolt+routing
Directly connecting to the MAIN instance isn't preferred in the HA cluster since the MAIN instance changes due to various failures. Because of that, users can use bolt+routing so that write queries can always be sent to the correct data instance. This protocol works so that the client first sends a ROUTE bolt message to any coordinator instance. The coordinator replies to the message by returning the routing table with three entries specifying from which instance can be data read, to which instance data can be written and which instances can behave as routers. In the Memgraph HA cluster, the MAIN data instance is the only writeable instance, REPLICAs are readable instances, and COORDINATORs behave as routers. For more details check Request message ROUTE (opens in a new tab) documentation.
Users only need to change the scheme they use for connecting to coordinators. This means instead of using bolt://<main_ip_address>,
you should
use neo4j://<coordinator_ip_adresss>
to get an active connection to the current MAIN instance in the cluster. You can find examples of how to
use bolt+routing in different programming languages here (opens in a new tab).
It is important to note that setting up the cluster on one coordinator (registration of data instances and coordinators, setting main) must be done using bolt connection since bolt+routing is only used for routing data-related queries, not coordinator-based queries.
Starting instances
You can start the data and coordinator instances using environment flags or configuration flags.
The main difference between data instance and coordinator is that data instances have --management-port
, whereas coordinators must have --coordinator-id
and --coordinator-port
.
Configuration Flags
Data instance
Memgraph data instance must have --replication-restore-state-on-startup=true
and --data-recovery-on-startup=true
flags.
These two flags enable the recovery of all the necessary data and durability configurations for Memgraph to continue the replication process,
meaning that the instance can continue receiving updates from MAIN and MAIN can continue accepting write queries.
There are two additional flags, --management-port=<port>
and --experimental-enabled=high_availability
.
These two flags are tied to the high availability feature, enable the coordinator to connect to the data instance, and allow the Memgraph data instance to use the high availability feature.
docker run --name instance1 -p 7687:7687 -p 7444:7444 memgraph/memgraph-mage
--data-recovery-on-startup=true \
--replication-restore-state-on-startup=true \
--management-port=13011 \
--bolt-port=7687 \
--experimental-enabled=high-availability
Coordinator instance
docker run --name coord1 -p 7690:7687 -p 7445:7444 memgraph/memgraph-mage
--coordinator-port=10111
--bolt-port=7687
--coordinator-id=1
--experimental-enabled=high-availability
--coordinator-hostname=localhost
--management-port=12121
Coordinator IDs serve as identifiers, the coordinator port is used for synchronization and log replication between coordinators and management port is used to get health state of cluster from leader coordinator. Coordinator IDs, coordinator ports and management ports must be different for all coordinators.
Configuration option --coordinator-hostname
must be set on all coordinator instances. It is used on followers to ping the leader coordinator on the correct IP address and return the health state about
the cluster. You can set this configuration flag to the IP address, the fully qualified domain name (FQDN), or even the DNS name. The suggested approach is to use DNS, otherwise, in case the IP address changes, network communication between instances in the cluster will stop working.
In local setup example would be to set --coordinator-hostname
for each instance to localhost
.
Env flags
There is an additional way to set high availability instances using environment variables. It is important to say that for the following configuration options, you can either use environment variables or configuration flags:
- experimental enabled
- bolt port
- coordinator port
- coordinator id
- management port
- high availability durability
- path to nuraft log file
- coordinator hostname
Data instances
Here are the environment variables you need to use to set data instance using only environment variables:
export MEMGRAPH_EXPERIMENTAL_ENABLED=high-availability
export MEMGRAPH_MANAGEMENT_PORT=13011
export MEMGRAPH_BOLT_PORT=7692
When using any of these environment variables, flags --bolt-port
, --management-port
and --experimental-enabled
will be ignored.
Coordinator instances
export MEMGRAPH_EXPERIMENTAL_ENABLED=high-availability
export MEMGRAPH_COORDINATOR_PORT=10111
export MEMGRAPH_COORDINATOR_ID=1
export MEMGRAPH_BOLT_PORT=7692
export MEMGRAPH_HA_DURABILITY=true
export MEMGRAPH_NURAFT_LOG_FILE="<path-to-log-file>"
export MEMGRAPH_COORDINATOR_HOSTNAME="localhost"
export MEMGRAPH_MANAGEMENT_PORT=12121
When using any of these environment variables, flags for --bolt-port
, --coordinator-port
, --coordinator-id
,--coordinator-hostname
and --experimental-enabled
will be ignored.
There is an additional environment variable you can use to set the path to the file with cypher queries used to start a high availability cluster. Here, you can use queries we define in the next chapter called User API.
export MEMGRAPH_HA_CLUSTER_INIT_QUERIES=<file_path>
After the coordinator instance is started, Memgraph will run queries one by one from this file to set up a high availability cluster.
User API
Register instance
Registering instances should be done on a single coordinator. The chosen coordinator will become the cluster's leader.
Register instance query will result in several actions:
- The coordinator instance will connect to the data instance on the
management_server
network address. - The coordinator instance will start pinging the data instance every
--instance-health-check-frequency-sec
seconds to check its status. - Data instance will be demoted from MAIN to REPLICA.
- Data instance will start the replication server on
replication_server
.
REGISTER INSTANCE instanceName WITH CONFIG {"bolt_server": boltServer, "management_server": managementServer, "replication_server": replicationServer};
This operation will result in writing to the Raft log.
In case MAIN is already set in the cluster, a replica instance will also be registered on MAIN.
Add coordinator instance
The user can choose any coordinator instance and connect other two (or more) coordinator instances to the cluster. This can be done before or after registering data instances, the order isn't important.
ADD COORDINATOR coordinatorId WITH CONFIG {"bolt_server": boltServer, "coordinator_server": coordinatorServer};
We need to inform the leader coordinator of the existence of other follower coordinators.
If you are running ADD COORDINATOR
Cypher queries on coordinator with --coordinator-id=1, you only need to add other two coordinator instances. For example, when using 3
coordinators you need to use two ADD COORDINATOR
queries:
ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "127.0.0.1:7691", "coordinator_server": "127.0.0.1:10112", "management_server": "127.0.0.1:12112"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "127.0.0.1:7692", "coordinator_server": "127.0.0.1:10113", "management_server": "127.0.0.1:12113"};
Set instance to MAIN
Once all data instances are registered, one data instance should be promoted to MAIN. This can be achieved by using the following query:
SET INSTANCE instanceName to MAIN;
This query will register all other instances as REPLICAs to the new MAIN. If one of the instances is unavailable, setting the instance to MAIN will not succeed. If there is already a MAIN instance in the cluster, this query will fail.
This operation will result in writing to the Raft log.
Unregister instance
There are various reasons which could lead to the decision that an instance needs to be removed from the cluster. The hardware can be broken, network communication could be set up incorrectly, etc. The user can remove the instance from the cluster using the following query:
UNREGISTER INSTANCE instanceName;
At the moment of registration, the instance that you want to unregister must not be MAIN because unregistering MAIN could lead to an inconsistent cluster state.
The instance requested to be unregistered will also be unregistered from the current MAIN's REPLICA set.
Demote instance
Demoting instances should be done by a leader coordinator.
Demote instance query will result in several actions:
- The coordinator instance will demote the instance to REPLICA.
- The coordinator instance will continue pinging the data instance every
--instance-health-check-frequency-sec
seconds to check its status. In this case, the leader coordinator won't choose a new MAIN, but as a user, you should choose one instance to promote it to MAIN using theSET INSTANCE
instanceTO MAIN
query.
DEMOTE INSTANCE instanceName;
This operation will result in writing to the Raft log.
Force reset cluster state
In case the cluster gets stuck there is an option to do the force reset of the cluster. You need to execute a command on the leader coordinator. This command will result in the following actions:
- The coordinator instance will demote each alive instance to REPLICA.
- From the alive instance it will choose a new MAIN instance.
- Instances that are down will be demoted to REPLICAs once they come back up.
FORCE RESET CLUSTER STATE;
This operation will result in writing to the Raft log.
Show instances
You can check the state of the whole cluster using the SHOW INSTANCES
query. The query will display all the Memgraph servers visible in the cluster. With
each server you can see the following information:
- Network ports they are using for managing cluster state
- Health state of server
- Role - MAIN, REPLICA, LEADER, FOLLOWER or unknown if not alive
- The time passed since the last response time to the leader's health ping
This query can be run on either the leader or followers. Since only the leader knows the exact status of the health state and last response time, followers will execute actions in this exact order:
- Try contacting the leader to get the health state of the cluster, since the leader has all the information.
If the leader responds, the follower will return the result as if the
SHOW INSTANCES
query was run on the leader. - When the leader doesn't respond or currently there is no leader, the follower will return all the Memgraph servers with the health state set to "down".
SHOW INSTANCES;
Setting config for highly-available cluster
There are several flags that you can use for managing the cluster. Flag --management-port
is used for distinguishing data instances
from coordinators. The provided flag needs to be unique. Setting a flag will create an RPC server on data instances capable of
responding to the coordinator's RPC messages.
RPC (Remote Procedure Call) is a protocol for executing functions on a remote system. RPC enables direct communication in distributed systems and is crucial for replication and high availability tasks.
Flags --coordinator-id
, --coordinator-port
and --management-port
need to be unique and specified on coordinator instances. They will cause the creation of a Raft
server that coordinator instances use for communication. Flag --instance-health-check-frequency-sec
specifies how often should leader coordinator
should check the status of the replication instance to update its status. Flag --instance-down-timeout-sec
gives the user the ability to control how much time should
pass before the coordinator starts considering the instance to be down.
Consider the instance to be down only if several consecutive pings fail because a single ping can fail because of a large number of different reasons in distributed systems.
There is an additional flag, --instance-get-uuid-frequency-sec
, which sets
how often the coordinator should check on REPLICA instances if they follow the correct MAIN instance. REPLICA may die and get back up
before the coordinator notices that. In that case, REPLICA will not follow MAIN for --instance-get-uuid-frequency-sec
seconds. It is advisable to set
the flag to more than --instance-down-timeout-sec
, as it is needed first to confirm the instance was down and not that there was some networking issue.
Failover
Determining instance's health
Every --instance-health-check-frequency-sec
seconds, the coordinator contacts each instance.
The instance is not considered to be down unless --instance-down-timeout-sec
has passed and the instance hasn't responded to the coordinator in the meantime. Users must set --instance-health-check-frequency-sec
to be less or equal to the --instance-down-timeout-sec
but we advise users to set --instance-down-timeout-sec
to a multiplier of --instance-health-check-frequency-sec
.
Set the multiplier coefficient to be N>=2. For example, set --instance-down-timeout-sec=5
and --instance-health-check-frequency-sec=1
which will result in coordinator
contacting each instance every 1 second and the instance is considered dead after it doesn't respond 5 times (5 seconds / 1 second).
In case a REPLICA doesn't respond to a health check, the leader coordinator will try to contact it again every --instance-health-check-frequency-sec
. When the REPLICA instance rejoins the cluster (comes back up),
it always rejoins as REPLICA. For MAIN instance, there are two options. If it is down for less than --instance-down-timeout-sec
, it will rejoin as MAIN because it is still considered alive. If it is down for more than
--instance-down-timeout-sec
, failover procedure is initiated. Whether MAIN will rejoin as MAIN depends on the success of the failvoer procedure. If the failover procedure succeeds, now old MAIN
will rejoin as REPLICA. If failover doesn't succeed, MAIN will rejoin as MAIN once it comes back up.
Failover procedure - high level description
From alive REPLICAs coordinator chooses a new potential MAIN. This instance is a potential MAIN as the failover procedure can still fail due to various factors (networking issues, promote to MAIN fails, any alive REPLICA failing to accept an RPC message, etc). Before promoting REPLICA to MAIN, the coordinator sends a RPC request to each alive REPLICA to stop listening to the old MAIN. Once each alive REPLICA acknowledges that it stopped listening to the old MAIN, the coordinator sends an RPC request to the potential new MAIN, which is still in REPLICA state, to promote itself to the MAIN instance with info about other REPLICAs to which it will replicate data. Once that request succeeds, the new MAIN can start replication to the other instances and accept write queries.
Choosing new MAIN from available REPLICAs
When failover is happening some REPLICAs can also be down. From the list of alive REPLICAs, a new MAIN is chosen. First, the leader coordinator contacts each alive REPLICA to get info about each database's last commit timestamp. In the case of enabled multi-tenancy, from each instance coordinator will get info on all databases and their last commit timestamp. Currently, the coordinator chooses an instance to become a new MAIN by comparing the latest commit timestamp only of the default database for every instance. If multiple instances have the same latest commit timestamp, the instance that was registered earlier will be chosen as a new MAIN.
Providing atomicity of action
To ensure the atomicity of each action, the leader coordinator wraps each of the following actions with locks:
- Registering of replica instance
- Unregistering of replica instance
- Setting instance to MAIN
- Failover procedure
- Force reset of the cluster
The special mechanism includes opening the lock before each action starts and closing it once it is complete. A closed lock means the action is fully completed. If, due to some failure, action isn't complete fully, the cluster enters a state of force reset where the cluster is reset to the state held in the Raft log. The state of truth is what is stored in the Raft log across most of the coordinators.
Force reset of cluster
The leader coordinator executes a force reset of the cluster if the action isn't fully complete. Failure can happen anywhere, i.e. in the case of setting instance to MAIN, the RPC request to a REPLICA instance to promote itself to MAIN can succeed, but writing to the Raft log that the instance was promoted can fail. Force reset includes demoting every alive instance to REPLICA, and executing the failover procedure once again. Such a procedure is needed as of this moment cluster doesn't track where the action failed exactly, but only whether it fully succeded. Raft log is taken as a source of truth at all times. In case the leader coordinator dies while executing the force reset, the next coordinator which is elected as the leader, will continue executing the force reset. Action is executed until it succeeds.
It is important to note that if an action fails and all instances are down, the leader will attempt to execute a force reset until one instance is promoted to MAIN. Until then, no other actions are allowed on the cluster.
If an instance is down at the point of force reset, the leader coordinator writes in the Raft log that the instance needs to be demoted to REPLICA once it comes back up.
If all instances are down at the point of force reset, the action won't succeed as a new MAIN instance can't be chosen.
Old MAIN rejoining to the cluster
Once the old MAIN gets back up, the coordinator sends an RPC request to demote the old MAIN to REPLICA. The coordinator tracks at all times which instance was the last MAIN.
The leader coordinator sends two RPC requests in the given order to demote old MAIN to REPLICA:
- Demote MAIN to REPLICA RPC request
- A request to store the UUID of the current MAIN, which the old MAIN, now acting as a REPLICA instance, must listen to. If the MAIN gets demoted, but the RPC request to change the UUID of the new MAIN doesn't succeed, the coordinator will retry both actions until success.
How REPLICA knows which MAIN to listen
Each REPLICA has a UUID of MAIN it listens. If a network partition happens where MAIN can talk to a REPLICA but the coordinator can't talk to the MAIN, from the coordinator's point of view that MAIN is down. From REPLICA's point of view, the MAIN instance is still alive. The coordinator will start the failover procedure, and we can end up with multiple MAINs where REPLICAs can listen to both MAINs. To prevent such an issue, before failover happens to a new MAIN, each REPLICA gets a new UUID that no current MAIN has. The coordinator generates the new UUID, which the new MAIN will get to use once each alive replica acknowledges it received the new UUID. In RPC's request to promote itself to MAIN, the new potential MAIN instance will receive the generated UUID, which REPLICAs are ready to listen to.
If REPLICA was down at one point, MAIN could have changed. When REPLICA gets back up, it doesn't listen to any MAIN until the coordinator sends an RPC request to REPLICA to start listening to MAIN with the given UUID.
Replication concerns
Force sync of data
On failover, the current logic is to choose the first available instance from alive REPLICAs to promote to the new MAIN. For promotion to the MAIN to successfully happen, the new MAIN figures out if REPLICA is behind (has less up-to-date data) or has data that the new MAIN doesn't have. If REPLICA has data that MAIN doesn't have, in that case REPLICA is in a diverged-from-MAIN state. If at least one REPLICA is in diverged-from-MAIN state, failover won't succeed as MAIN can't replicate data to diverged-from-MAIN REPLICA. When choosing a new MAIN in the failover procedure from the list of available REPLICAs, the instance with the latest commit timestamp for the default database is chosen as the new MAIN. In case some other instance had more up-to-date data when the failover procedure was choosing a new MAIN but was down at that point when rejoining the cluster, the new MAIN instance sends a force sync RPC request to such instance. Force sync RPC request deletes all current data on all databases on a given instance and accepts data from the current MAIN. This way cluster will always follow the current MAIN.
Actions on follower coordinators
From follower coordinators you can only execute SHOW INSTANCES
. Registration of data instance, unregistration of data instances, demoting instance, setting instance to MAIN and force reseting cluster state
are all disabled.
Under certain extreme scenarios, the current implementation of HA could lead to having Recovery Point Objective (RPO) != 0 (aka data loss). These are environments with high volume of transactions where data is constantly changed, added, deleted... If you are operating in such scenarios, please open an issue on GitHub (opens in a new tab) as we are eager to expand our support for this kind of workload.
Instances' restart
Data instances' restart
Data instances can fail both as MAIN and as REPLICA. When an instance that was REPLICA comes back, it won't accept updates from any instance until the coordinator updates its responsible peer. This should happen automatically when the coordinator's ping to the instance passes. When the MAIN instance comes back, any writing to the MAIN instance will be forbidden until a ping from the coordinator passes. When the coordinator realizes the once-MAIN instance is alive, it will choose between enabling writing on old MAIN or demoting it to REPLICA. The choice depends on whether there was any other MAIN instance chosen between the old MAIN's failure and restart.
Since the instance can die and get back up, use --replication-state-at-startup=true
to restore the last replication state. This configuration will restore the instance to
the state it was last, and everything should function normally from that point on. If this flag is not set, every instance will restart as MAIN and that case is not handled.
Furthermore, it is important for data instances to have --data-recovery-on-startup=true
, which recovers data from the data directory. It also recovers important information
for each database on data instance. This way, MAIN can establish a connection to REPLICAs and replicate data to the correct databases.
Coordinator instances restart
In case the coordinator instance dies and it is restarted, it will not lose any data from the RAFT log or RAFT snapshots, since --ha-durability
is enabled by default. For more details
read about high availability durability in the durability chapter.
Durability
High availability durability is enabled by default. Durability flag controls logs and snapshots. Details about other coordinators are always made durable since without that information, the coordinator can't rejoin the cluster.
Information about logs and snapshots is stored under one RocksDB instance in the high_availability/raft_data/logs
directory stored under
the top-level --data-directory
folder. All the data stored there is recovered in case the coordinator restarts.
Data about other coordinators and their endpoints is recovered from the high_availability/raft_data/network
directory stored under
the top-level --data-directory
folder. When the coordinator rejoins, it will communicate with other coordinators and receive updates from the current leader.
In case durability is disabled, at least one coordinator must be alive at all times so that data from the Raft log is not lost.
First start
On the first start of coordinators, each will store the current version of the logs
and network
durability store. From that point on,
each RAFT log that is sent to the coordinator is also stored on disk. For every new coordinator instance, the server config is updated. Logs are created
for each user action and failover action. Snapshots are created every N (N currently being 5) logs.
Restart of coordinator
In case of failure of the coordinator, on the restart, it will read information about other coordinators stored under high_availability/raft_data/network
directory.
From the network
directory we will recover the server state before the coordinator stopped, including the current term, for whom the coordinator voted, and whether
election timer is allowed.
It will also recover the following server config information:
- other servers, including their endpoints, id, and auxiliary data
- ID of the previous log
- ID of the current log
- additional data needed by nuRaft
If durability is not disabled, the following information will be recovered from a common RocksDB logs
instance:
- current version of
logs
durability store - snapshots found with
snapshot_id_
prefix in database:- coordinator cluster state - all data instances with their role (MAIN or REPLICA), UUID of MAIN instance which REPLICA is listening to
- last log idx
- last log term
- last cluster config
- logs found in the interval between the start index and the last log index
- data - each log holds data on what has changed since the last state
- term - nuRAFT term
- log type - nuRAFT log type
Handling of durability errors
If snapshots are not stored, an exception is thrown and left for the nuRAFT library to handle the issue. Logs can be missed and not stored since they are compacted and deleted every two snapshots and will be removed relatively fast. In case the log is missing on recovery, the coordinator will not start. In that case, it is best to start a coordinator without durability and leave other coordinators to send missing data. Users are advised to use the same strategy in case a snapshot is missing on one of the coordinators.
Memgraph throws an error in case of failure to store cluster config, which is updated in the high_availability/raft_data/network
folder.
If this happens, it will happen only on the first cluster start when coordinators are connecting since
coordinators are configured only once at the start of the whole cluster. This is a non-recoverable error since in case the coordinator rejoins the cluster and has
the wrong state of other clusters, it can become a leader without being connected to other coordinators.
Recovering from errors
Distributed systems can fail in numerous ways. With the current implementation, Memgraph instances are resilient to occasional network failures and independent machine failures. Byzantine failures aren't handled since the Raft consensus protocol cannot deal with them either.
Recovery Time Objective (RTO) is an often used term for measuring the maximum tolerable length of time that an instance or cluster can be down. Since every highly available Memgraph cluster has two types of instances, we need to analyze the failures of each separately.
Raft is a quorum-based protocol and it needs a majority of instances alive in order to stay functional. Hence, with just one coordinator instance down, RTO is 0 since the cluster stays available. With 2+ coordinator instances down (in a cluster with RF = 3), the RTO depends on the time needed for instances to come back.
Failure of REPLICA data instance isn't very harmful since users can continue writing to MAIN data instance while reading from MAIN or other
REPLICAs. The most important thing to analyze is what happens when MAIN gets down. In that case, the leader coordinator uses
user-controllable parameters related to the frequency of health checks from the leader to replication instances (--instance-health-check-frequency-sec
)
and the time needed to realize the instance is down (--instance-down-timeout-sec
). After collecting enough evidence, the leader concludes the MAIN is down and performs failover using just a handful of RPC
messages (correct time depends on the distance between instances). It is important to mention that the whole failover is performed with a zero data-loss if the newly chosen MAIN (previously REPLICA)
had all up-to-date data.
Current deployment assumes the existence of only one datacenter, which automatically means that Memgraph won't be available in the case the whole datacenter goes down. We are actively working on 2 datacenter (2-DC) architecture.
Raft configuration parameters
Several Raft-related parameters are important for the correct functioning of the cluster. The leader coordinator sends a heartbeat message to other coordinators every second to determine their health. This configuration option is connected with leader election timeout which is a randomized value from the interval [2000ms, 4000ms] and which is used by followers to decide when to trigger new election process. Leadership expiration is set to 1800ms so that cluster can never get into situation where multiple leaders exist. These specific values give a cluster the ability to survive occasional network hiccups without triggering leadership changes.
Kubernetes
We support deploying Memgraph HA instances as part of the Kubernetes cluster. You can see example configurations here.
Docker Compose
The following example shows you how to setup Memgraph cluster using docker compose. The cluster will use user-defined bridge network.
License file license.cypher
should be in the format:
SET DATABASE SETTING 'organization.name' TO '<YOUR_ORGANIZATION_NAME>';
SET DATABASE SETTING 'enterprise.license' TO '<YOUR_ENTERPRISE_LICENSE>';
You can directly use initialization file HA_register.cypher
:
ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "coord2:7691", "coordinator_server": "coord2:10112", "management_server": "coord2:12122"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "coord3:7692", "coordinator_server": "coord3:10113", "management_server": "coord3:12123"};
REGISTER INSTANCE instance_1 WITH CONFIG {"bolt_server": "instance1:7687", "management_server": "instance1:13011", "replication_server": "instance1:10001"};
REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "instance2:7688", "management_server": "instance2:13012", "replication_server": "instance2:10002"};
REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "instance3:7689", "management_server": "instance3:13013", "replication_server": "instance3:10003"};
SET INSTANCE instance_3 TO MAIN;
You can directly use the following docker-compose.yml
to start the cluster using docker compose up
:
volumes:
mg_lib1:
mg_lib2:
mg_lib3:
mg_lib4:
mg_lib5:
mg_lib6:
mg_log1:
mg_log2:
mg_log3:
mg_log4:
mg_log5:
mg_log6:
networks:
memgraph_ha:
name: memgraph_ha
driver: bridge
ipam:
driver: default
config:
- subnet: "172.21.0.0/16"
services:
coord1:
image: "memgraph/memgraph"
container_name: coord1
volumes:
- ./license.cypher:/tmp/init/license.cypher:ro
- ./HA_register.cypher:/tmp/init/HA_register.cypher:ro
- mg_lib1:/var/lib/memgraph
- mg_log1:/var/log/memgraph
environment:
- MEMGRAPH_HA_CLUSTER_INIT_QUERIES=/tmp/init/HA_register.cypher
command: [ "--init-file=/tmp/init/license.cypher", "--log-level=TRACE", "--also-log-to-stderr", "--coordinator-id=1", "--coordinator-port=10111", "--management-port=12121", "--coordinator-hostname=coord1", "--experimental-enabled=high-availability"]
networks:
memgraph_ha:
ipv4_address: 172.21.0.4
coord2:
image: "memgraph/memgraph"
container_name: coord2
volumes:
- ./license.cypher:/tmp/init/license.cypher:ro
- mg_lib2:/var/lib/memgraph
- mg_log2:/var/log/memgraph
command: [ "--init-file=/tmp/init/license.cypher", "--log-level=TRACE", "--also-log-to-stderr", "--coordinator-id=2", "--coordinator-port=10112", "--management-port=12122", "--coordinator-hostname=coord2", "--experimental-enabled=high-availability"]
networks:
memgraph_ha:
ipv4_address: 172.21.0.2
coord3:
image: "memgraph/memgraph"
container_name: coord3
volumes:
- ./license.cypher:/tmp/init/license.cypher:ro
- mg_lib3:/var/lib/memgraph
- mg_log3:/var/log/memgraph
command: [ "--init-file=/tmp/init/license.cypher", "--log-level=TRACE", "--also-log-to-stderr", "--coordinator-id=3", "--coordinator-port=10113", "--management-port=12123", "--coordinator-hostname=coord3", "--experimental-enabled=high-availability"]
networks:
memgraph_ha:
ipv4_address: 172.21.0.3
instance1:
image: "memgraph/memgraph"
container_name: instance1
volumes:
- ./license.cypher:/tmp/init/license.cypher:ro
- mg_lib4:/var/lib/memgraph
- mg_log4:/var/log/memgraph
command: ["--init-file=/tmp/init/license.cypher","--data-recovery-on-startup=true", "--log-level=TRACE", "--also-log-to-stderr", "--management-port=13011", "--experimental-enabled=high-availability"]
networks:
memgraph_ha:
ipv4_address: 172.21.0.6
instance2:
image: "memgraph/memgraph"
container_name: instance2
volumes:
- ./license.cypher:/tmp/init/license.cypher:ro
- mg_lib5:/var/lib/memgraph
- mg_log5:/var/log/memgraph
command: ["--init-file=/tmp/init/license.cypher","--data-recovery-on-startup=true", "--log-level=TRACE", "--also-log-to-stderr", "--management-port=13012", "--experimental-enabled=high-availability"]
networks:
memgraph_ha:
ipv4_address: 172.21.0.7
instance3:
image: "memgraph/memgraph"
container_name: instance3
volumes:
- ./license.cypher:/tmp/init/license.cypher:ro
- mg_lib6:/var/lib/memgraph
- mg_log6:/var/log/memgraph
command: ["--init-file=/tmp/init/license.cypher","--data-recovery-on-startup=true", "--log-level=TRACE", "--also-log-to-stderr", "--management-port=13013", "--experimental-enabled=high-availability"]
networks:
memgraph_ha:
ipv4_address: 172.21.0.8
Cluster can be shut-down using docker compose down
.
Manual Docker setup
This example will show how to set up a highly available cluster in Memgraph using three coordinators and 3 data instances.
Start all instances
- Start coordinator1:
docker run --name coord1 -p 7690:7690 -p 7444:7444 -v mg_lib1:/var/lib/memgraph -v mg_log1:/var/log/memgraph memgraph/memgraph --bolt-port=7690 --log-level=TRACE --also-log-to-stderr --coordinator-id=1 --coordinator-port=10111 --management-port=12121 --experimental-enabled=high-availability --coordinator-hostname=localhost --nuraft-log-file=/var/log/memgraph/nuraft
- Start coordinator2:
docker run --name coord2 -p 7691:7691 -p 7445:7444 -v mg_lib2:/var/lib/memgraph -v mg_log2:/var/log/memgraph memgraph/memgraph --bolt-port=7691 --log-level=TRACE --also-log-to-stderr --coordinator-id=2 --coordinator-port=10112 --management-port=12122 --experimental-enabled=high-availability --coordinator-hostname=localhost --nuraft-log-file=/var/log/memgraph/nuraft
- Start coordinator3:
docker run --name coord3 -p 7692:7692 -p 7446:7444 -v mg_lib3:/var/lib/memgraph -v mg_log3:/var/log/memgraph memgraph/memgraph --bolt-port=7692 --log-level=TRACE --also-log-to-stderr --coordinator-id=3 --coordinator-port=10113 --management-port=12123 --experimental-enabled=high-availability --coordinator-hostname=localhost --nuraft-log-file=/var/log/memgraph/nuraft
- Start instance1:
docker run --name instance1 -p 7687:7687 -p 7447:7444 -v mg_lib4:/var/lib/memgraph -v mg_log4:/var/log/memgraph memgraph/memgraph --bolt-port=7687 --log-level=TRACE --also-log-to-stderr --management-port=13011 --experimental-enabled=high-availability --data-recovery-on-startup=true
- Start instance2:
docker run --name instance2 -p 7688:7688 -p 7448:7444 -v mg_lib5:/var/lib/memgraph -v mg_log5:/var/log/memgraph memgraph/memgraph --bolt-port=7688 --log-level=TRACE --also-log-to-stderr --management-port=13012 --experimental-enabled=high-availability --data-recovery-on-startup=true
- Start instance3:
docker run --name instance3 -p 7689:7689 -p 7449:7444 -v mg_lib6:/var/lib/memgraph -v mg_log6:/var/log/memgraph memgraph/memgraph --bolt-port=7689 --log-level=TRACE --also-log-to-stderr --management-port=13013 --experimental-enabled=high-availability --data-recovery-on-startup=true
Register instances
- Start communication with any Memgraph client on any coordinator. Here we chose coordinator 1.
mgconsole --port=7690
- Connect the other two coordinator instances to the cluster.
ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "localhost:7691", "coordinator_server": "localhost:10112", "management_server": "localhost:12122"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "localhost:7692", "coordinator_server": "localhost:10113", "management_server": "localhost:12123"};
- Register 3 data instances as part of the cluster:
Replace <ip_address>
with the container's IP address. This is necessary for Docker deployments where instances are not on the local host.
REGISTER INSTANCE instance_1 WITH CONFIG {"bolt_server": "localhost:7687", "management_server": "localhost:13011", "replication_server": "localhost:10001"};
REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "localhost:7688", "management_server": "localhost:13012", "replication_server": "localhost:10002"};
REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "localhost:7689", "management_server": "localhost:13013", "replication_server": "localhost:10003"};
- Set instance_3 as MAIN:
SET INSTANCE instance_3 TO MAIN;
- Connect to the leader coordinator and check cluster state with
SHOW INSTANCES
;
name | bolt_server | coordinator_server | management_server | health | role | last_succ_resp_ms |
---|---|---|---|---|---|---|
coordinator_1 | localhost:7690 | localhost:10111 | localhost:12121 | up | leader | 0 |
coordinator_2 | localhost:7691 | localhost:10112 | localhost:12122 | up | follower | 16 |
coordinator_3 | localhost:7692 | localhost:10113 | localhost:12123 | up | follower | 25 |
instance_1 | localhost:7687 | "" | localhost:13011 | up | replica | 39 |
instance_2 | localhost:7688 | "" | localhost:13012 | up | replica | 21 |
instance_3 | localhost:7689 | "" | localhost:13013 | up | main | 91 |
Check automatic failover
Let's say that the current MAIN instance is down for some reason. After --instance-down-timeout-sec
seconds, the coordinator will realize
that and automatically promote the first alive REPLICA to become the new MAIN. The output of running SHOW INSTANCES
on the leader coordinator could then look like:
name | bolt_server | coordinator_server | management_server | health | role | last_succ_resp_ms |
---|---|---|---|---|---|---|
coordinator_1 | localhost:7690 | localhost:10111 | localhost:12121 | up | leader | 0 |
coordinator_2 | localhost:7691 | localhost:10112 | localhost:12122 | up | follower | 34 |
coordinator_3 | localhost:7692 | localhost:10113 | localhost:12123 | up | follower | 28 |
instance_1 | localhost:7687 | "" | localhost:13011 | up | main | 61 |
instance_2 | localhost:7688 | "" | localhost:13012 | up | replica | 74 |
instance_3 | localhost:7689 | "" | localhost:13013 | down | unknown | 71222 |