
Set up HA cluster with K8s Enterprise

💡

Users are advised to first read the guide on how replication works, followed by the guide on how high availability works, and how to query the cluster.

Install Memgraph HA on Kubernetes

To deploy a Memgraph High Availability (HA) cluster on Kubernetes, you must first add the Memgraph Helm repository and then install the HA Helm chart.

Add the Helm repository

Add the Memgraph Helm chart repository to your local Helm setup by running the following command:

helm repo add memgraph https://memgraph.github.io/helm-charts

Make sure to update the repository to fetch the latest Helm charts available:

helm repo update

Install Memgraph HA

Since Memgraph HA requires an Enterprise license, you must provide the license and organization name to the chart through a Kubernetes Secret.

⚠️

Breaking change: Starting with Memgraph HA chart version 1.0.0, the HA chart no longer accepts the license and organization name as plaintext values via env.MEMGRAPH_ENTERPRISE_LICENSE and env.MEMGRAPH_ORGANIZATION_NAME. Both values are now read from a Kubernetes Secret referenced via secretKeyRef, and the secret must exist before you run helm install — the StatefulSets will fail to start otherwise. The previous env.* values have been removed from values.yaml.

Create the secret first, then install the chart:

kubectl create secret generic memgraph-secrets \
  --from-literal=MEMGRAPH_ENTERPRISE_LICENSE=<your-license> \
  --from-literal=MEMGRAPH_ORGANIZATION_NAME=<your-organization-name>

helm install <release-name> memgraph/memgraph-high-availability

Replace <release-name> with a name of your choice for the release. The secret name and keys are configurable via secrets.name, secrets.licenseKey and secrets.organizationKey (defaults: memgraph-secrets, MEMGRAPH_ENTERPRISE_LICENSE, MEMGRAPH_ORGANIZATION_NAME).
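For example, if your license is stored in a Secret with a different name, a minimal values.yaml override might look like this (the secret name below is hypothetical):

secrets:
  name: my-license-secret            # hypothetical pre-existing Secret
  licenseKey: MEMGRAPH_ENTERPRISE_LICENSE
  organizationKey: MEMGRAPH_ORGANIZATION_NAME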

The cluster will be fully connected once installation completes. Note that the install command may take a moment while instances establish connections. If clients connect from outside the cluster, update the Bolt server address on each instance to use its external IP as explained in the section on setting up the cluster.

Tip: Always install a specific chart version. Using the latest tag can lead to unexpected behavior if pods restart and pull newer, incompatible images.
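For example, you can pin the chart version with Helm's --version flag (replace <chart-version> with the version you have tested):

helm install <release-name> memgraph/memgraph-high-availability --version <chart-version>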

Install Memgraph HA with minikube

If you are installing the Memgraph HA chart locally with minikube, we strongly recommend enabling the csi-hostpath-driver addon and using its storage class. Otherwise, you may have problems attaching PVCs to pods.

Enable csi-hostpath-driver

minikube addons disable storage-provisioner
minikube addons disable default-storageclass
minikube addons enable volumesnapshots
minikube addons enable csi-hostpath-driver

Create a StorageClass (save as sc.yaml)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-hostpath-delayed
provisioner: hostpath.csi.k8s.io
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

Apply the StorageClass

kubectl apply -f sc.yaml

Configure the Helm chart

In your values.yaml, set:

storage:
  libStorageClassName: csi-hostpath-delayed

Configure the Helm chart

Override default chart values

You can customize the Memgraph HA Helm chart either inline with --set flags or by using a values.yaml file.

Option 1: Override values inline

helm install <release-name> memgraph/memgraph-high-availability \
  --set <flag1>=<value1>,<flag2>=<value2>,...

Option 2: Use a values file

helm install <release-name> memgraph/memgraph-high-availability \
  -f values.yaml

You can also combine both approaches. Values specified with --set override those in values.yaml.
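For example, the following command reads most values from values.yaml but overrides the image tag inline (the tag is only an illustration):

helm install <release-name> memgraph/memgraph-high-availability \
  -f values.yaml \
  --set image.tag=3.1.0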

Upgrade Helm chart

To upgrade the Helm chart, run:

helm upgrade <release-name> memgraph/memgraph-high-availability --set <flag1>=<value1>,<flag2>=<value2>

Again, it is possible to use both --set and values.yaml to set configuration options.

If you’re using IngressNginx and performing an upgrade, the attached public IP should remain the same. It will only change if the release includes specific updates that modify it—and such changes will be documented.

Uninstall Helm chart

Uninstallation is done with:

helm uninstall <release-name>

Uninstalling the chart does not delete PersistentVolumeClaims (PVCs). Even if the default StorageClass reclaim policy is Delete, data on the underlying PersistentVolumes (PVs) will not be removed automatically when uninstalling the chart.

However, we still recommend configuring the reclaim policy to Retain, as described in the High availability storage section.

Runtime environment & security

Security context

All Memgraph HA instances run as Kubernetes StatefulSet workloads, each with a single pod. Depending on configuration, a pod contains the following containers:

  • The main Memgraph container (for example, memgraph-coordinator on coordinator pods) - runs the Memgraph binary.
  • An optional init container - enabled when sysctlInitContainer.enabled is set.
  • Optional sidecars - user-defined userContainers or the core dump uploader, when enabled.

Memgraph processes run as the non-root memgraph user with no Linux capabilities and no privilege escalation.

High availability storage

Memgraph HA always uses PersistentVolumeClaims (PVCs) to store database files and logs.

  • Default storage size: 1Gi (you will likely need to increase this).
  • Default access mode: ReadWriteOnce (can be set to ReadOnlyMany, ReadWriteMany, or ReadWriteOncePod).
  • PVCs use the cluster’s default StorageClass, unless overridden.

You can explicitly set storage classes using:

  • storage.libStorageClassName - for data volumes
  • storage.logStorageClassName - for log volumes

Most default StorageClasses use a Delete reclaim policy, meaning deleting the PVC deletes the underlying PersistentVolume (PV). We recommend switching to Retain.


After your cluster is running, you can patch all PVs:

#!/bin/bash
PVS=$(kubectl get pv --no-headers -o custom-columns=":metadata.name")
 
for pv in $PVS; do
  echo "Patching PV: $pv"
  kubectl patch pv $pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
done
 
echo "All PVs have been patched."

Kubernetes uses Storage Object in Use Protection, preventing deletion of PVCs while still attached to pods.
Similarly, PVs will remain until their PVCs are fully removed.

If a PVC is stuck terminating, you can remove its finalizers:

kubectl patch pvc PVC_NAME -p '{"metadata":{"finalizers": []}}' --type=merge

Network configuration

All Memgraph HA components communicate with each other internally over ClusterIP Services.

Default ports:

  • Management: 10000
  • Replication (data instances): 20000
  • Coordinator communication: 12000

You can change this configuration by specifying:

ports:
  managementPort: <value>
  replicationPort: <value>
  coordinatorPort: <value>

External network configuration

Memgraph HA uses client-side routing, so DNS resolution happens on the internal ClusterIP network. Because of that, a second type of network access is needed for clients connecting to instances from outside the cluster. Out of the box, the HA chart supports the following Kubernetes resources for setting up external access:

  • IngressNginx - one LoadBalancer for all instances.
  • NodePort - exposes ports on each node (requires public node IPs).
  • LoadBalancer - one LoadBalancer per instance (highest cost).
  • CommonLoadBalancer (coordinators only) - single LB for all coordinators.
  • Gateway API - uses Kubernetes Gateway API resources (Gateway + TCPRoute). Configured under externalAccessConfig.gateway.

For coordinators, there is an additional option of using CommonLoadBalancer. In this scenario, a single load balancer sits in front of all coordinators, saving the cost of two load balancers compared to the LoadBalancer option, since you usually don't need to address a specific coordinator when using Memgraph capabilities. If you do connect to a coordinator directly for some reason (e.g. to run the SHOW INSTANCES query), you can run the SHOW INSTANCE query to see which coordinator you were routed to. The default Bolt port is 7687, and you can change it by setting ports.boltPort.
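As a minimal sketch, enabling the common load balancer for coordinators (and, purely for illustration, per-instance load balancers for data instances) looks like this in values.yaml:

externalAccessConfig:
  coordinator:
    serviceType: "CommonLoadBalancer"
  dataInstance:
    serviceType: "LoadBalancer"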

For more detailed IngressNginx setup, see Use Memgraph HA chart with IngressNginx.

Note, however, that IngressNginx is being retired; one alternative is the Kubernetes Gateway API with controllers such as Envoy Gateway, Istio, Cilium, Traefik, or Kong. The HA chart has native Gateway API support; see Use Memgraph HA chart with Gateway API.

By default, the chart does not expose any external network services.

Per-instance external access annotations

When using LoadBalancer or NodePort external access, you can set annotations globally via externalAccessConfig.dataInstance.annotations and externalAccessConfig.coordinator.annotations. These apply to every external Service of that type.

If you need different annotations per instance — for example, to assign unique DNS hostnames via external-dns — use the externalAccessAnnotations field on individual entries in data[] or coordinators[]. Per-instance annotations are merged with the global annotations, and per-instance values take precedence when the same key appears in both.

externalAccessConfig:
  dataInstance:
    serviceType: "LoadBalancer"
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
 
data:
  - id: "0"
    externalAccessAnnotations:
      external-dns.alpha.kubernetes.io/hostname: "data-0.memgraph.example.com"
  - id: "1"
    externalAccessAnnotations:
      external-dns.alpha.kubernetes.io/hostname: "data-1.memgraph.example.com"

In this example, each data instance’s external Service gets the shared aws-load-balancer-scheme annotation plus its own unique external-dns hostname. Bolt and management ports are not set per-instance — they come from ports.boltPort and ports.managementPort.

Per-instance internal access annotations

Each data instance and coordinator also has an internal ClusterIP Service used for in-cluster communication. You can set per-instance annotations on these internal Services using the internalAccessAnnotations field on individual entries in data[] or coordinators[]. This is useful for integrations or other tooling that consumes annotations on the internal Services.

data:
  - id: "0"
    internalAccessAnnotations:
      mycompany.io/service-mesh: "enabled"
  - id: "1"
    internalAccessAnnotations:
      mycompany.io/service-mesh: "enabled"
 
coordinators:
  - id: "1"
    internalAccessAnnotations:
      mycompany.io/service-mesh: "enabled"

Node affinity

Memgraph HA deploys multiple pods, and you can control pod placement with affinity settings.

Supported strategies:

  • default Attempts to schedule data pods and coordinator pods on nodes that host no other pod with the same role. If no such node exists, the pods are still scheduled on a shared node and the deployment does not fail.

  • unique (affinity.unique = true) Each coordinator and data pod must be placed on a separate node. If not enough nodes exist, the deployment fails. Coordinators are scheduled first; data pods are then scheduled with respect to the nodes already hosting coordinators.

  • parity (affinity.parity = true) Schedules at most one coordinator + one data pod per node. Coordinators schedule first; data pods follow.

  • nodeSelection (affinity.nodeSelection = true) Pods are scheduled onto explicitly labeled nodes using affinity.dataNodeLabelValue and affinity.coordinatorNodeLabelValue. If all labeled nodes are already occupied by pods of the same role, the deployment fails.

When using nodeSelection, ensure that nodes are labeled correctly.
Default role label key: role
Default values: data-node, coordinator-node

Example:

kubectl label nodes <node-name> role=data-node
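With the default label key and values, the corresponding values.yaml section would look like this:

affinity:
  nodeSelection: true
  roleLabelKey: role
  dataNodeLabelValue: data-node
  coordinatorNodeLabelValue: coordinator-node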

A full AKS example is available in the chart repository.

Sysctl options

Use the sysctlInitContainer to configure kernel parameters required for high-memory workloads, such as increasing vm.max_map_count (set through sysctlInitContainer.maxMapCount).
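A minimal values.yaml snippet, using the chart defaults listed in the configuration table:

sysctlInitContainer:
  enabled: true
  maxMapCount: 262144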

Authentication

By default, Memgraph HA starts without authentication enabled.

⚠️

Breaking change: The HA chart no longer creates a Memgraph user from the USER/PASSWORD keys of the memgraph-secrets Secret. The secrets.enabled, secrets.userKey and secrets.passwordKey values have been removed because the previous implementation also applied these env variables to coordinators, which run without auth. The memgraph-secrets Secret is now reserved for the license and organization name.

To configure credentials, connect to a data instance after installation and create users with Cypher, for example:

CREATE USER memgraph IDENTIFIED BY 'memgraph';

Run the same statements on every data instance you want the user to exist on. Coordinators run without authentication and do not need user setup.
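As a sketch, assuming the default pod naming used later in this guide and that mgconsole is available in the Memgraph image, you can open a Cypher session directly inside a data instance pod to run the statement:

kubectl exec -it memgraph-data-0-0 -- mgconsole --host 127.0.0.1 --port 7687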

Setting up the cluster

Although many configuration options exist, especially for networking, the workflow for creating a Memgraph HA cluster follows these steps:

  1. Provision the Kubernetes cluster. Ensure your nodes, storage, and networking are ready.
  2. Label nodes according to your chosen affinity strategy (optional). For example, when using nodeSelection, label nodes as data-node or coordinator-node.
  3. Create the memgraph-secrets Kubernetes secret holding MEMGRAPH_ENTERPRISE_LICENSE and MEMGRAPH_ORGANIZATION_NAME (required — the chart reads these via secretKeyRef).
  4. Install the Memgraph HA Helm chart using helm install. This creates a fully connected cluster.
  5. Install auxiliary components for external access, such as ingress-nginx (optional).
  6. Update Bolt server addresses if clients will connect from outside the cluster (optional).

Update bolt server

This step is required only when:

  • Clients access the database from outside the cluster, and
  • You’re using bolt+routing for client-side routing

Each instance must know its external address for routing to work correctly. Run the following queries on the leader coordinator:

UPDATE CONFIG FOR COORDINATOR 1 WITH CONFIG {"bolt_server": "<bolt-server-coord1>"};
UPDATE CONFIG FOR COORDINATOR 2 WITH CONFIG {"bolt_server": "<bolt-server-coord2>"};
UPDATE CONFIG FOR COORDINATOR 3 WITH CONFIG {"bolt_server": "<bolt-server-coord3>"};
UPDATE CONFIG FOR INSTANCE instance_0 WITH CONFIG {"bolt_server": "<bolt-server-instance0>"};
UPDATE CONFIG FOR INSTANCE instance_1 WITH CONFIG {"bolt_server": "<bolt-server-instance1>"};

Note that only the bolt_server value is provided for each instance. The correct value depends on the type of external access you configured (LoadBalancer IP, Ingress host/port, NodePort, etc.).
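For example, with one LoadBalancer per instance and the default Bolt port, the value is that Service's external address plus port 7687 (the IP below is just a placeholder):

UPDATE CONFIG FOR INSTANCE instance_0 WITH CONFIG {"bolt_server": "203.0.113.10:7687"};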

Refer to the Memgraph HA User API docs for the full set of commands and usage patterns.

Use Memgraph HA chart with Gateway API

The Memgraph HA Helm chart has native support for the Kubernetes Gateway API. When enabled, the chart automatically creates TCPRoute resources for each data and coordinator instance. You can either let the chart create its own Gateway or attach routes to a pre-existing one.

Gateway API is orthogonal to the serviceType external access options (IngressNginx, NodePort, LoadBalancer). The routes point at internal ClusterIP services that always exist, so you can use Gateway API alongside or instead of other external access methods.

Prerequisites

Before enabling Gateway API in the chart, you need:

  1. A Gateway API controller installed in your cluster. Examples include Envoy Gateway, Istio, Cilium, Traefik, and Kong. This guide uses Envoy Gateway as an example:

    helm install eg oci://docker.io/envoyproxy/gateway-helm --version v1.2.4 -n envoy-gateway-system --create-namespace
  2. A GatewayClass resource that references your controller. A GatewayClass is a cluster-scoped resource that defines which controller manages Gateways — each Gateway references a GatewayClass by name. The Helm chart does not create a GatewayClass; you must create one yourself or use one provided by your controller installation. For Envoy Gateway:

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: eg
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
⚠️

You must ensure the GatewayClass exists before enabling the gateway feature in the chart. If you create your own Gateway (Option 1 below), the chart requires gatewayClassName to reference an existing GatewayClass, and will fail with an error if it is not set.

Option 1: Chart-managed Gateway

When you want the chart to create its own Gateway along with TCPRoute resources, set externalAccessConfig.gateway.enabled to true and provide the gatewayClassName:

externalAccessConfig:
  gateway:
    enabled: true
    gatewayClassName: "eg"

The chart will create:

  • A Gateway (gateway.networking.k8s.io/v1) with TCP listeners auto-generated for each data and coordinator instance.
  • A TCPRoute (gateway.networking.k8s.io/v1alpha2) per instance, routing traffic from the Gateway listener to the instance’s Bolt port.

Data instance ports are assigned as dataPortBase + array index (default: 9000, 9001, …) and coordinator ports as coordinatorPortBase + coordinator id (default: 9011, 9012, 9013). You can customize the base ports:

externalAccessConfig:
  gateway:
    enabled: true
    gatewayClassName: "eg"
    dataPortBase: 9000
    coordinatorPortBase: 9010

You can also set annotations and labels on the Gateway resource:

externalAccessConfig:
  gateway:
    enabled: true
    gatewayClassName: "eg"
    annotations:
      example.io/owner: "memgraph"
    labels:
      app: memgraph-ha

To install with a chart-managed Gateway (assuming the memgraph-secrets Secret with the license and organization name already exists, see Install Memgraph HA):

helm install memgraph-ha memgraph/memgraph-high-availability \
  --set externalAccessConfig.gateway.enabled=true \
  --set externalAccessConfig.gateway.gatewayClassName=eg

Option 2: Existing (external) Gateway

When you already have a Gateway resource in your cluster (for example, a shared Gateway serving multiple services including Memgraph Lab), you can have the chart create only TCPRoute resources that attach to it:

externalAccessConfig:
  gateway:
    enabled: true
    existingGatewayName: "memgraph-gateway"

In this mode, the chart skips Gateway creation and only creates TCPRoute resources. The gatewayClassName is not required.

If the existing Gateway is in a different namespace, specify it:

externalAccessConfig:
  gateway:
    enabled: true
    existingGatewayName: "memgraph-gateway"
    existingGatewayNamespace: "gateway-system"

To install with an existing Gateway (assuming the memgraph-secrets Secret with the license and organization name already exists, see Install Memgraph HA):

helm install memgraph-ha memgraph/memgraph-high-availability \
  --set externalAccessConfig.gateway.enabled=true \
  --set externalAccessConfig.gateway.existingGatewayName=memgraph-gateway
⚠️

When using an existing Gateway, ensure it has listeners configured with the correct names and ports that match the TCPRoute sectionName references. The chart expects listener names in the format data-{id}-bolt for data instances and coordinator-{id}-bolt for coordinators. For example, the default HA setup (2 data instances, 3 coordinators) needs these listeners:

  • data-0-bolt on port 9000
  • data-1-bolt on port 9001
  • coordinator-1-bolt on port 9011
  • coordinator-2-bolt on port 9012
  • coordinator-3-bolt on port 9013

A standalone Gateway manifest with these pre-configured listeners is available in the Helm charts repository.
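As a sketch, such a Gateway (here using the eg GatewayClass from the earlier example) defines one TCP listener per instance; the remaining listeners follow the same pattern:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: memgraph-gateway
spec:
  gatewayClassName: eg
  listeners:
    - name: data-0-bolt          # one listener per data instance
      protocol: TCP
      port: 9000
      allowedRoutes:
        kinds:
          - kind: TCPRoute
    - name: coordinator-1-bolt   # one listener per coordinator
      protocol: TCP
      port: 9011
      allowedRoutes:
        kinds:
          - kind: TCPRoute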

TCPRoute API version: TCPRoute uses v1alpha2, which is the latest available API version. It is supported by Envoy Gateway and other major implementations but is not yet GA. Gateway and HTTPRoute are both GA (v1).

Use Memgraph HA chart with IngressNginx

One of the most cost-efficient ways to expose a Memgraph HA cluster is by using IngressNginx. The controller supports TCP routing (including the Bolt protocol), allowing all Memgraph instances to share:

  • a single LoadBalancer, and
  • a single external IP address.

Clients connect to any coordinator or data instance by using different Bolt ports.

To install Memgraph HA with IngressNginx enabled (assuming the memgraph-secrets Secret with the license and organization name already exists, see Install Memgraph HA):

helm install mem-ha-test ./charts/memgraph-high-availability --set \
  affinity.nodeSelection=true,\
  externalAccessConfig.dataInstance.serviceType=IngressNginx,\
  externalAccessConfig.coordinator.serviceType=IngressNginx

When using these settings, the chart will automatically install and configure IngressNginx, including all required TCP routing setup for Memgraph.

Probes

Memgraph HA uses standard Kubernetes startup, readiness, and liveness probes to ensure correct container operation.

  • Startup probe Determines when Memgraph has fully started. It succeeds only after database recovery completes. Liveness and readiness probes do not run until startup succeeds.

  • Readiness probe Indicates when the instance is ready to accept client traffic.

  • Liveness probe Determines when the container should be restarted if it becomes unresponsive.


Default timing

  • On data instances, the startup probe must succeed within 2 hours. If recovery (e.g., from backup) may take longer, increase the timeout.

  • Liveness and readiness probes must succeed at least once every 5 minutes for the pod to be considered healthy.
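If recovery on data instances can take longer than the default window, increase the startup probe's failureThreshold (and, if needed, periodSeconds); a sketch in values.yaml using the keys from the configuration table:

container:
  data:
    startupProbe:
      failureThreshold: 2880   # illustration only; size this from your expected recovery time
      periodSeconds: 10
      timeoutSeconds: 10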


Probe endpoints

  • Coordinators: probed on the NuRaft server
  • Data instances: probed on the Bolt server
⚠️

Breaking change (HA Helm chart 1.0.0): The probe target port is no longer configurable through container.data.{readinessProbe,livenessProbe,startupProbe}.tcpSocket.port or container.coordinators.{readinessProbe,livenessProbe,startupProbe}.tcpSocket.port. Probes are now hard-coded to a tcpSocket check against ports.boltPort for data instances and ports.coordinatorPort for coordinators. Only the probe timings (failureThreshold, timeoutSeconds, periodSeconds) remain configurable. Remove any tcpSocket overrides from your values.yaml and change the bolt or coordinator port via ports.boltPort / ports.coordinatorPort instead.

Debugging

There are several ways to debug a Memgraph HA cluster in production. One of them is to send us logs from all instances when you notice an issue. For that reason, we advise setting the log level to TRACE if possible. Note, however, that the TRACE log level has a performance cost, especially when logging to stderr in addition to files. If performance is a concern, first set commonArgs.data.logging.also_log_to_stderr and commonArgs.coordinators.logging.also_log_to_stderr to false, since logging only to files is cheaper. If the logging overhead is still too high, set commonArgs.{data,coordinators}.logging.log_level to DEBUG (less verbose levels such as INFO or CRITICAL also work) and keep also_log_to_stderr: true. These settings replace the --log-level and --also-log-to-stderr flags, which the chart now appends to instance args automatically; setting them directly in data[].args or coordinators[].args is rejected.
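A values.yaml sketch that reduces logging overhead as described above:

commonArgs:
  data:
    logging:
      log_level: TRACE             # step 2: lower to DEBUG if overhead is still too high
      also_log_to_stderr: false    # step 1: log to files only
  coordinators:
    logging:
      log_level: TRACE
      also_log_to_stderr: false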

By default, the chart provisions a dedicated log PVC for every data and coordinator pod. If you only log to stderr and don’t need a persistent log volume, you can disable the log PVC by setting storage.data.createLogStorageClaim and/or storage.coordinators.createLogStorageClaim to false. When you do this you must also set the corresponding commonArgs.{data,coordinators}.logging.log_file to "" to disable file logging — installing the chart with file logging enabled but no log volume is rejected.
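A values.yaml sketch for stderr-only logging without log PVCs, combining the settings described above:

storage:
  data:
    createLogStorageClaim: false
  coordinators:
    createLogStorageClaim: false
commonArgs:
  data:
    logging:
      log_file: ""                 # required when the log PVC is disabled
  coordinators:
    logging:
      log_file: ""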

If you notice that your Memgraph instances are crashing, you can collect core dumps by setting storage.data.createCoreDumpsClaim and storage.coordinators.createCoreDumpsClaim to true. This triggers the creation of an init container that runs in privileged mode as the root user and sets up everything needed on your nodes to collect core dumps. You can then create a debug pod and attach the PVC containing the core dumps to it in order to extract the dumps off the K8s nodes. An example of such a debug pod is the following YAML file:

apiVersion: v1
kind: Pod
metadata:
  name: debug-coredump
spec:
  containers:
  - name: debug
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: coredumps
      mountPath: /var/core/memgraph
  volumes:
  - name: coredumps
    persistentVolumeClaim:
      claimName: memgraph-data-0-core-dumps-storage-memgraph-data-0-0
  restartPolicy: Never

There is also a possibility of automatically uploading core dumps to S3. To do that, set coreDumpUploader.enabled to true and configure the S3 bucket, AWS region, and credentials secret in the coreDumpUploader section. Note that the createCoreDumpsClaim flag for the relevant role (data/coordinators) must also be set to true, as the uploader sidecar mounts the same PVC used for core dump storage. Core dumps are uploaded to s3://<s3BucketName>/<s3Prefix>/<pod-hostname>/<core-dump-filename>.
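A values.yaml sketch for enabling the uploader on data instances (the bucket name is a placeholder; the credentials Secret uses the chart's default name):

storage:
  data:
    createCoreDumpsClaim: true     # the uploader sidecar mounts this PVC
coreDumpUploader:
  enabled: true
  s3BucketName: <your-bucket>
  s3Prefix: core-dumps
  awsRegion: us-east-1
  secretName: aws-s3-credentials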

Graceful termination

When a pod is stopped (e.g., during upgrades, rescheduling, or scale-down), Kubernetes sends SIGTERM and waits up to terminationGracePeriodSeconds for the container to exit cleanly before forcefully killing it with SIGKILL. The HA chart defaults this value to 30 seconds for both data and coordinator pods, configurable via:

  • container.data.terminationGracePeriodSeconds
  • container.coordinators.terminationGracePeriodSeconds

The 30-second default is sufficient because --storage-snapshot-on-exit is explicitly set to false by default, so no snapshot is written during shutdown.

Using --storage-snapshot-on-exit with HA

If you enable the --storage-snapshot-on-exit flag on data instances, Memgraph will attempt to create a full snapshot of the database during shutdown. Snapshot creation time scales with dataset size and can easily exceed the default grace period on larger deployments.

⚠️

If terminationGracePeriodSeconds is shorter than the time needed to write the on-exit snapshot, Kubernetes will SIGKILL the Memgraph process mid-write, leaving the snapshot incomplete and defeating the purpose of the flag.

When enabling --storage-snapshot-on-exit, set container.data.terminationGracePeriodSeconds to a value that comfortably covers the expected snapshot duration for your dataset. Benchmark the snapshot time on a representative dataset and add a safety margin.
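A values.yaml sketch that enables the flag on data instances and extends the grace period (the 600-second value is only an illustration; size it from your own benchmark):

container:
  data:
    terminationGracePeriodSeconds: 600
data:
  - id: "0"
    args:
      - "--storage-snapshot-on-exit=true"
  - id: "1"
    args:
      - "--storage-snapshot-on-exit=true"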

Monitoring

Memgraph HA integrates with Kubernetes monitoring tools through the Memgraph Prometheus exporter (mg-exporter) and a ServiceMonitor resource that kube-prometheus-stack can scrape.

The kube-prometheus-stack chart should be installed independently of the HA chart with the following command:

helm install kube-prometheus-stack oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack \
  -f kube_prometheus_stack_values.yaml \
  -f kube_prometheus_stack_memgraph_dashboard.yaml \
  --namespace monitoring \
  --create-namespace

kube_prometheus_stack_values.yaml is optional. A template is available in the upstream chart’s repository.

kube_prometheus_stack_memgraph_dashboard.yaml is also optional - it provides a generic dashboard which shows the metrics that Memgraph exports for both standalone and HA deployments. This dashboard file can be downloaded from here.

If you install kube-prometheus-stack in a non-default namespace, you need to allow cross-namespace scraping by adding the following configuration to your kube_prometheus_stack_values.yaml file:

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false

Enable monitoring in the Memgraph HA chart

To enable the Memgraph Prometheus exporter and ServiceMonitor:

prometheus:
  enabled: true
  namespace: monitoring
  memgraphExporter:
    port: 9115
    pullFrequencySeconds: 5
    repository: memgraph/mg-exporter
    tag: 0.2.1
  serviceMonitor:
    kubePrometheusStackReleaseName: kube-prometheus-stack
    interval: 15s

If you set prometheus.enabled to false, the resources from charts/memgraph-high-availability/templates/mg-exporter.yaml will not be installed into the monitoring namespace.

Refer to the configuration table later in the document for details on all parameters.

Uninstall kube-prometheus-stack

helm uninstall kube-prometheus-stack --namespace monitoring

Note: The stack’s CRDs are not deleted automatically and must be removed manually:

kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheusagents.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd scrapeconfigs.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com

Remote metrics and logs

The HA chart supports optional remote observability:

  • vmagentRemote for shipping metrics with Prometheus remote_write
  • vectorRemote sidecars for shipping Memgraph logs to Loki-compatible endpoints

Prerequisites:

  • keep prometheus.enabled: true so mg-exporter is deployed
  • if you only need remote shipping and not local scraping, set prometheus.serviceMonitor.enabled: false to avoid duplicate scraping
  • configure vectorRemote.data and/or vectorRemote.coordinators depending on which pod roles should ship logs
  • when vectorRemote.enabled: true, add --monitoring-port=<vectorRemote.websocketPort> and --monitoring-address=0.0.0.0 to each instance args

Example values.yaml:

prometheus:
  enabled: true
  namespace: monitoring
  serviceMonitor:
    enabled: false
 
vmagentRemote:
  enabled: true
  namespace: monitoring
  remoteWrite:
    url: "https://<prom-remote-write>/api/v1/write"
    # Optional: only set basicAuth when your remote_write endpoint requires basic auth.
    basicAuth:
      secretName: monitoring-basic-auth
      usernameKey: username
      passwordKey: password
  externalLabels:
    cluster_id: "memgraph-testing-cluster-53"
    service_name: "memgraph-ha"
    cluster_env: "self-hosted-large-01"
 
vectorRemote:
  enabled: true
  data: true
  coordinators: true
  websocketPort: 7444
  logsEndpoint: "https://<loki-endpoint>"
  # Optional: only set auth when your endpoint requires basic auth.
  auth:
    secretName: monitoring-basic-auth
    usernameKey: username
    passwordKey: password
  extraLabels:
    cluster_id: "memgraph-testing-cluster-53"
    service_name: "memgraph-ha"
    cluster_env: "self-hosted-large-01"
 
data:
  - id: "0"
    args:
      - "--monitoring-port=7444"
      - "--monitoring-address=0.0.0.0"
  - id: "1"
    args:
      - "--monitoring-port=7444"
      - "--monitoring-address=0.0.0.0"
 
coordinators:
  - id: "1"
    args:
      - "--monitoring-port=7444"
      - "--monitoring-address=0.0.0.0"
  - id: "2"
    args:
      - "--monitoring-port=7444"
      - "--monitoring-address=0.0.0.0"
  - id: "3"
    args:
      - "--monitoring-port=7444"
      - "--monitoring-address=0.0.0.0"

The chart auto-appends --bolt-port, --management-port, --coordinator-port, --coordinator-id, --coordinator-hostname, --data-directory, --log-level, --also-log-to-stderr and --log-file from ports.* and commonArgs.{data,coordinators}.logging.*. Setting any of these in data[].args or coordinators[].args causes helm install to fail with a template error.

Create credentials secret in the namespace where vmagent runs (usually monitoring):

kubectl create secret generic monitoring-basic-auth -n monitoring \
  --from-literal=username='<username>' \
  --from-literal=password='<password>'

For HA Vector sidecars, create the same secret in the Memgraph release namespace as well:

kubectl create secret generic monitoring-basic-auth -n <memgraph-namespace> \
  --from-literal=username='<username>' \
  --from-literal=password='<password>'

Kubernetes infrastructure metrics

vmagentRemote can additionally scrape Kubernetes infrastructure metrics (kube-state-metrics, node-exporter, kubelet) required by kube-prometheus-stack Kubernetes and Node dashboards, and remote-write them to your centralized monitoring cluster.

Enable Kubernetes scraping by extending your existing vmagentRemote values:

vmagentRemote:
  # ... existing fields (enabled, remoteWrite, externalLabels) ...
  kubernetes:
    enabled: true
    kubeStateMetrics:
      enabled: true
      jobName: kube-state-metrics
      targets:
        - kube-prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080
    nodeExporter:
      enabled: true
      jobName: node-exporter
      targets:
        - kube-prometheus-stack-prometheus-node-exporter.monitoring.svc.cluster.local:9100
    kubelet:
      enabled: true
      jobName: kubelet
      metricsPath: /metrics/cadvisor
      apiServerAddress: kubernetes.default.svc:443
      insecureSkipVerify: false

Notes:

  • RBAC and ServiceAccount resources are created only when an enabled scrape job requires Kubernetes API access (for example kubelet.enabled=true or nodeExporter.useKubernetesDiscovery=true).
  • Keep jobName values aligned with dashboard and recording-rule expectations unless you also update those queries.
  • Dashboards that rely on precomputed recording-rule series still require rule evaluation in your monitoring stack.

A ready-to-use example values file is available in the Helm charts repository: examples/remote-monitoring/values-ha-k8s-metrics.yaml.

Configuration options

The following table lists the configurable parameters of the Memgraph HA chart and their default values.

Parameter | Description | Default
image.repository | Memgraph Docker image repository | docker.io/memgraph/memgraph
image.tag | Specific tag for the Memgraph Docker image. Overrides the image tag whose default is the chart version. | 3.1.0
image.pullPolicy | Image pull policy | IfNotPresent
memgraphUserId | The user id that is hardcoded in Memgraph and Mage images | 101
memgraphGroupId | The group id that is hardcoded in Memgraph and Mage images | 103
storage.data.libPVCSize | Size of the lib storage PVC for data instances | 1Gi
storage.data.libStorageAccessMode | Access mode used for lib storage on data instances | ReadWriteOnce
storage.data.libStorageClassName | The name of the storage class used for storing data on data instances | ""
storage.data.createLogStorageClaim | Create a PVC for logs on data instances. When false, commonArgs.data.logging.log_file must be "". | true
storage.data.logPVCSize | Size of the log PVC for data instances | 1Gi
storage.data.logStorageAccessMode | Access mode used for log storage on data instances | ReadWriteOnce
storage.data.logStorageClassName | The name of the storage class used for storing logs on data instances | ""
storage.data.createCoreDumpsClaim | Create a PVC for core dumps on data instances | false
storage.data.coreDumpsStorageClassName | Storage class name for core dumps PVC on data instances | ""
storage.data.coreDumpsStorageSize | Size of the core dumps PVC on data instances | 10Gi
storage.data.coreDumpsMountPath | Mount path for core dumps on data instances | /var/core/memgraph
storage.data.coreDumpsImage.repository | Image repository for the data instance core-dumps init container | docker.io/library/busybox
storage.data.coreDumpsImage.tag | Image tag for the data instance core-dumps init container | latest
storage.data.coreDumpsImage.pullPolicy | Image pull policy for the data instance core-dumps init container | IfNotPresent
storage.data.extraVolumes | Additional volumes to add to data instance pods | []
storage.data.extraVolumeMounts | Additional volume mounts to add to data instance containers | []
storage.coordinators.libPVCSize | Size of the lib storage PVC for coordinators | 1Gi
storage.coordinators.libStorageAccessMode | Access mode used for lib storage on coordinators | ReadWriteOnce
storage.coordinators.libStorageClassName | The name of the storage class used for storing data on coordinators | ""
storage.coordinators.createLogStorageClaim | Create a PVC for logs on coordinators. When false, commonArgs.coordinators.logging.log_file must be "". | true
storage.coordinators.logPVCSize | Size of the log PVC for coordinators | 1Gi
storage.coordinators.logStorageAccessMode | Access mode used for log storage on coordinators | ReadWriteOnce
storage.coordinators.logStorageClassName | The name of the storage class used for storing logs on coordinators | ""
storage.coordinators.createCoreDumpsClaim | Create a PVC for core dumps on coordinators | false
storage.coordinators.coreDumpsStorageClassName | Storage class name for core dumps PVC on coordinators | ""
storage.coordinators.coreDumpsStorageSize | Size of the core dumps PVC on coordinators | 10Gi
storage.coordinators.coreDumpsMountPath | Mount path for core dumps on coordinators | /var/core/memgraph
storage.coordinators.coreDumpsImage.repository | Image repository for the coordinator core-dumps init container | docker.io/library/busybox
storage.coordinators.coreDumpsImage.tag | Image tag for the coordinator core-dumps init container | latest
storage.coordinators.coreDumpsImage.pullPolicy | Image pull policy for the coordinator core-dumps init container | IfNotPresent
storage.coordinators.extraVolumes | Additional volumes to add to coordinator pods | []
storage.coordinators.extraVolumeMounts | Additional volume mounts to add to coordinator containers | []
externalAccessConfig.coordinator.serviceType | IngressNginx, NodePort, CommonLoadBalancer or LoadBalancer. By default, no external service will be created. | ""
externalAccessConfig.coordinator.annotations | Annotations for external services attached to coordinators. | {}
externalAccessConfig.dataInstance.serviceType | IngressNginx, NodePort or LoadBalancer. By default, no external service will be created. | ""
externalAccessConfig.dataInstance.annotations | Annotations for external services attached to data instances. | {}
externalAccessConfig.gateway.enabled | Enable Gateway API external access. | false
externalAccessConfig.gateway.gatewayClassName | Name of a pre-existing GatewayClass. Required when creating a new Gateway. | ""
externalAccessConfig.gateway.existingGatewayName | Name of an existing Gateway to attach routes to. Skips Gateway creation. | ""
externalAccessConfig.gateway.existingGatewayNamespace | Namespace of the existing Gateway. Defaults to the release namespace. | ""
externalAccessConfig.gateway.annotations | Annotations for the Gateway resource. | {}
externalAccessConfig.gateway.labels | Labels for the Gateway resource. | {}
externalAccessConfig.gateway.dataPortBase | Base port for data instance Gateway listeners (dataPortBase + index). | 9000
externalAccessConfig.gateway.coordinatorPortBase | Base port for coordinator Gateway listeners (coordinatorPortBase + id). | 9010
headlessService.enabled | Specifies whether headless services will be used inside the K8s network on all instances. | false
ports.boltPort | Bolt port used on coordinator and data instances. | 7687
ports.managementPort | Management port used on coordinator and data instances. | 10000
ports.replicationPort | Replication port used on data instances. | 20000
ports.coordinatorPort | Coordinator port used on coordinators. | 12000
ports.metricsPort | Metrics port for coordinators and data instances. Opened only if prometheus.enabled is set to true. | 9091
affinity.unique | Schedule pods on different nodes in the cluster | false
affinity.parity | Schedule pods on the same node with at most one coordinator and one data node | false
affinity.nodeSelection | Schedule pods on nodes with specific labels | false
affinity.roleLabelKey | Label key for node selection | role
affinity.dataNodeLabelValue | Label value for data nodes | data-node
affinity.coordinatorNodeLabelValue | Label value for coordinator nodes | coordinator-node
container.data.livenessProbe.failureThreshold | Failure threshold for liveness probe | 20
container.data.livenessProbe.timeoutSeconds | Timeout for liveness probe | 10
container.data.livenessProbe.periodSeconds | Period seconds for liveness probe | 5
container.data.readinessProbe.failureThreshold | Failure threshold for readiness probe | 20
container.data.readinessProbe.timeoutSeconds | Timeout for readiness probe | 10
container.data.readinessProbe.periodSeconds | Period seconds for readiness probe | 5
container.data.startupProbe.failureThreshold | Failure threshold for startup probe | 1440
container.data.startupProbe.timeoutSeconds | Timeout for startup probe | 10
container.data.startupProbe.periodSeconds | Period seconds for startup probe | 10
container.data.terminationGracePeriodSeconds | Grace period for data pod termination. Increase when --storage-snapshot-on-exit is enabled so the snapshot has time to finish. | 30
container.coordinators.livenessProbe.failureThreshold | Failure threshold for liveness probe | 20
container.coordinators.livenessProbe.timeoutSeconds | Timeout for liveness probe | 10
container.coordinators.livenessProbe.periodSeconds | Period seconds for liveness probe | 5
container.coordinators.readinessProbe.failureThreshold | Failure threshold for readiness probe | 20
container.coordinators.readinessProbe.timeoutSeconds | Timeout for readiness probe | 10
container.coordinators.readinessProbe.periodSeconds | Period seconds for readiness probe | 5
container.coordinators.startupProbe.failureThreshold | Failure threshold for startup probe | 20
container.coordinators.startupProbe.timeoutSeconds | Timeout for startup probe | 10
container.coordinators.startupProbe.periodSeconds | Period seconds for startup probe | 10
container.coordinators.terminationGracePeriodSeconds | Grace period for coordinator pod termination | 30
data | Configuration for data instances | See data section
coordinators | Configuration for coordinator instances | See coordinators section
sysctlInitContainer.enabled | Enable the init container to set sysctl parameters | true
sysctlInitContainer.maxMapCount | Value for vm.max_map_count to be set by the init container | 262144
sysctlInitContainer.image.repository | Image repository for the sysctl init container | library/busybox
sysctlInitContainer.image.tag | Image tag for the sysctl init container | latest
sysctlInitContainer.image.pullPolicy | Image pull policy for the sysctl init container | IfNotPresent
secrets.name | Name of the Kubernetes Secret holding the Memgraph Enterprise license and organization name. Must exist before helm install. | memgraph-secrets
secrets.licenseKey | Key in the Secret whose value is exposed as MEMGRAPH_ENTERPRISE_LICENSE to data and coordinator pods. | MEMGRAPH_ENTERPRISE_LICENSE
secrets.organizationKey | Key in the Secret whose value is exposed as MEMGRAPH_ORGANIZATION_NAME to data and coordinator pods. | MEMGRAPH_ORGANIZATION_NAME
resources.coordinators | CPU/Memory resource requests/limits for coordinators. Left empty by default. | {}
resources.data | CPU/Memory resource requests/limits for data instances. Left empty by default. | {}
prometheus.enabled | If set to true, K8s resources representing Memgraph's Prometheus exporter will be deployed. | false
prometheus.namespace | Namespace in which kube-prometheus-stack and Memgraph's Prometheus exporter are installed. When empty, the release namespace is used. | ""
prometheus.memgraphExporter.port | The port on which Memgraph's Prometheus exporter is available. | 9115
prometheus.memgraphExporter.pullFrequencySeconds | How often Memgraph's Prometheus exporter pulls data from Memgraph instances. | 5
prometheus.memgraphExporter.repository | The repository where Memgraph's Prometheus exporter image is available. | docker.io/memgraph/prometheus-exporter
prometheus.memgraphExporter.tag | The tag of Memgraph's Prometheus exporter image. | 0.2.1
prometheus.memgraphExporter.extraVolumes | Additional volumes mounted on the mg-exporter Deployment (e.g. ConfigMaps with custom exporter configs). | []
prometheus.memgraphExporter.extraVolumeMounts | Additional volume mounts for the mg-exporter container. | []
prometheus.serviceMonitor.enabled | If enabled, a ServiceMonitor object will be deployed. | true
prometheus.serviceMonitor.kubePrometheusStackReleaseName | The release name under which the kube-prometheus-stack chart is installed. | kube-prometheus-stack
prometheus.serviceMonitor.interval | How often Prometheus pulls data from Memgraph's Prometheus exporter. | 15s
vmagentRemote.enabled | Deploy a vmagent Deployment that scrapes mg-exporter and remote-writes to a Prometheus-compatible endpoint. | false
vmagentRemote.namespace | Namespace for the vmagent Deployment and its resources. Defaults to prometheus.namespace when empty. | ""
vmagentRemote.image.repository | vmagent image repository. | victoriametrics/vmagent
vmagentRemote.image.tag | vmagent image tag. | v1.139.0
vmagentRemote.image.pullPolicy | vmagent image pull policy. | IfNotPresent
vmagentRemote.remoteWrite.url | Prometheus remote_write URL. Required when vmagentRemote.enabled=true. | ""
vmagentRemote.remoteWrite.basicAuth.secretName | Kubernetes Secret holding basic-auth credentials for remote_write. When empty, basic auth is not configured. | ""
vmagentRemote.remoteWrite.basicAuth.usernameKey | Key in the basic-auth Secret holding the username. | username
vmagentRemote.remoteWrite.basicAuth.passwordKey | Key in the basic-auth Secret holding the password. | password
vmagentRemote.scrapeInterval | Global scrape_interval applied to vmagent scrape jobs. | 15s
vmagentRemote.externalLabels | External labels attached to every scraped sample before remote-write. | {}
vmagentRemote.resources | Resource requests/limits for the vmagent container. | {}
vmagentRemote.httpPort | vmagent local HTTP listen port for metrics/debug (the remote-write target is remoteWrite.url). | 8429
vmagentRemote.kubernetes.enabled | Enable scraping of Kubernetes infrastructure metrics used by kube-prometheus dashboards. | false
vmagentRemote.kubernetes.kubeStateMetrics.enabled | Scrape kube-state-metrics. | true
vmagentRemote.kubernetes.kubeStateMetrics.jobName | Prometheus job label for kube-state-metrics. Keep aligned with dashboard/recording-rule expectations. | kube-state-metrics
vmagentRemote.kubernetes.kubeStateMetrics.targets | Static scrape targets for kube-state-metrics. | [kube-prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080]
vmagentRemote.kubernetes.nodeExporter.enabled | Scrape node-exporter. | true
vmagentRemote.kubernetes.nodeExporter.jobName | Prometheus job label for node-exporter. | node-exporter
vmagentRemote.kubernetes.nodeExporter.useKubernetesDiscovery | Discover node-exporter pods via Kubernetes SD so namespace/pod/node labels are present for recording rules. | false
vmagentRemote.kubernetes.nodeExporter.podMetricsPort | Pod port used by Kubernetes SD to match node-exporter pods. | "9100"
vmagentRemote.kubernetes.nodeExporter.appNameLabel | Expected value of app.kubernetes.io/name on node-exporter pods. | prometheus-node-exporter
vmagentRemote.kubernetes.nodeExporter.appInstanceLabel | Expected value of app.kubernetes.io/instance on node-exporter pods. | kube-prometheus-stack-prometheus-node-exporter
vmagentRemote.kubernetes.nodeExporter.targets | Static fallback targets for node-exporter when useKubernetesDiscovery=false. | [kube-prometheus-stack-prometheus-node-exporter.monitoring.svc.cluster.local:9100]
vmagentRemote.kubernetes.kubelet.enabled | Scrape kubelet metrics via the Kubernetes API server node proxy. | true
vmagentRemote.kubernetes.kubelet.jobName | Prometheus job label for kubelet. Keep as kubelet so kube-prometheus dashboards and rules still match. | kubelet
vmagentRemote.kubernetes.kubelet.metricsPath | Metrics path for the primary kubelet scrape (cAdvisor). | /metrics/cadvisor
vmagentRemote.kubernetes.kubelet.additionalMetricsEnabled | Enable a second kubelet scrape job for /metrics alongside the cAdvisor job. | true
vmagentRemote.kubernetes.kubelet.additionalJobName | Prometheus job label for the additional kubelet scrape. | kubelet-metrics
vmagentRemote.kubernetes.kubelet.additionalMetricsPath | Metrics path for the additional kubelet scrape. | /metrics
vmagentRemote.kubernetes.kubelet.apiServerAddress | Kubernetes API server address used to proxy kubelet scrapes. | kubernetes.default.svc:443
vmagentRemote.kubernetes.kubelet.insecureSkipVerify | Skip TLS verification of the kube-apiserver serving cert when scraping kubelet. | false
labels.coordinators.podLabels | Enables you to set labels on a pod level for coordinators. | {}
labels.coordinators.statefulSetLabels | Enables you to set labels on a stateful set level for coordinators. | {}
labels.coordinators.serviceLabels | Enables you to set labels on a service level for coordinators. | {}
labels.data.podLabels | Enables you to set labels on a pod level for data instances. | {}
labels.data.statefulSetLabels | Enables you to set labels on a stateful set level for data instances. | {}
labels.data.serviceLabels | Enables you to set labels on a service level for data instances. | {}
updateStrategy.type | Update strategy for StatefulSets. Possible values are RollingUpdate and OnDelete. | RollingUpdate
extraEnv.data | Env variables that users can define and are applied to data instances | []
extraEnv.coordinators | Env variables that users can define and are applied to coordinators | []
commonArgs.data.logging.log_level | Log level applied to every data instance via --log-level. Must not be empty. | TRACE
commonArgs.data.logging.also_log_to_stderr | When true, appends --also-log-to-stderr to every data instance. Must be a boolean. | true
commonArgs.data.logging.log_file | Log-file path applied to every data instance via --log-file. Empty disables file logging. | /var/log/memgraph/memgraph.log
commonArgs.coordinators.logging.log_level | Log level applied to every coordinator via --log-level. Must not be empty. | TRACE
commonArgs.coordinators.logging.also_log_to_stderr | When true, appends --also-log-to-stderr to every coordinator. Must be a boolean. | true
commonArgs.coordinators.logging.log_file | Log-file path applied to every coordinator via --log-file. Empty disables file logging. | /var/log/memgraph/memgraph.log
userContainers.data | Additional sidecar containers for data instance pods | []
userContainers.coordinators | Additional sidecar containers for coordinator pods | []
tolerations.data | Tolerations for data instance pods | []
tolerations.coordinators | Tolerations for coordinator pods | []
initContainers.data | Init containers that users can define that will be applied to data instances. | []
initContainers.coordinators | Init containers that users can define that will be applied to coordinators. | []
coreDumpUploader.enabled | Enable the core dump S3 uploader sidecar. Requires storage.<role>.createCoreDumpsClaim to be true. | false
coreDumpUploader.image.repository | Docker image repository for the uploader sidecar | amazon/aws-cli
coreDumpUploader.image.tag | Docker image tag for the uploader sidecar | 2.33.28
coreDumpUploader.image.pullPolicy | Image pull policy for the uploader sidecar | IfNotPresent
coreDumpUploader.s3BucketName | S3 bucket name where core dumps will be uploaded | ""
coreDumpUploader.s3Prefix | S3 key prefix (folder) for uploaded core dumps | core-dumps
coreDumpUploader.awsRegion | AWS region of the S3 bucket | us-east-1
coreDumpUploader.pollIntervalSeconds | How often (in seconds) the sidecar checks for new core dump files | 30
coreDumpUploader.secretName | Name of the K8s Secret containing AWS credentials | aws-s3-credentials
coreDumpUploader.accessKeySecretKey | Key in the K8s Secret for AWS_ACCESS_KEY_ID | AWS_ACCESS_KEY_ID
coreDumpUploader.secretAccessKeySecretKey | Key in the K8s Secret for AWS_SECRET_ACCESS_KEY | AWS_SECRET_ACCESS_KEY

For the data and coordinators sections, each item in the list has the following parameters:

Parameter | Description | Default
id | ID of the instance | 0 for data, 1 for coordinators
internalAccessAnnotations | Per-instance annotations for the internal ClusterIP Service. | {}
externalAccessAnnotations | Per-instance annotations for the external access Service, merged with global annotations. | {}
args | Per-instance Memgraph CLI flags. Append-only; see the note below for flags the chart manages. | ["--storage-snapshot-on-exit=false"] for data, [] for coordinators

The args field accepts any Memgraph CLI flag except the following, which the chart appends automatically and rejects when set per-instance: --bolt-port, --management-port, --coordinator-port, --coordinator-id, --coordinator-hostname, --data-directory, --log-level, --also-log-to-stderr, and --log-file. Configure those through ports.* and commonArgs.{data,coordinators}.logging.* instead.
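For example, a per-instance flag not managed by the chart can be added alongside the default (the query timeout flag below is only an illustration):

data:
  - id: "0"
    args:
      - "--storage-snapshot-on-exit=false"
      - "--query-execution-timeout-sec=600"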

For all available database settings, refer to the configuration settings docs.

In-Service Software Upgrade (ISSU)

Memgraph’s High Availability supports in-service software upgrades (ISSU). This guide explains the process when using HA Helm charts. The procedure is very similar for native deployments.

Some Memgraph versions require additional upgrade steps beyond the standard ISSU procedure. Check the Migrating to v3.9 HA page for version-specific instructions before proceeding.

⚠️

Important: Although the upgrade process is designed to complete successfully, unexpected issues may occur. We strongly recommend backing up the lib directory of all your StatefulSets or native instances, depending on the deployment type.

Prerequisites

If you are using HA Helm charts, set the following configuration before doing any upgrade.

updateStrategy.type: OnDelete

Depending on the infrastructure on which you have your Memgraph cluster, the details will differ a bit, but the backbone is the same.

Prepare a backup of all data from all instances. This ensures you can safely downgrade the cluster to the last stable version you had.

  • For native deployments, tools like cp or rsync are sufficient.

  • For Kubernetes, create a VolumeSnapshotClass with a YAML file similar to this:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: csi-azure-disk-snapclass
    driver: disk.csi.azure.com
    deletionPolicy: Delete

    Apply it:

    kubectl apply -f azure_class.yaml
    • On Google Kubernetes Engine, the default CSI driver is pd.csi.storage.gke.io, so make sure to change the driver field accordingly.
    • On AWS EKS, refer to the AWS snapshot controller docs.

Create snapshots

Now you can create a VolumeSnapshot of the lib directory using the yaml file:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: coord-3-snap # Use a unique name for each instance
  namespace: default
spec:
  volumeSnapshotClassName: csi-azure-disk-snapclass
  source:
    persistentVolumeClaimName: memgraph-coordinator-3-lib-storage-memgraph-coordinator-3-0

Apply it:

kubectl apply -f azure_snapshot.yaml

Repeat for every instance in the cluster.

Update configuration

Next, update the image.tag field in the values.yaml configuration file to the version you want to upgrade the cluster to.

  1. In your values.yaml, update the image version:

    image:
      tag: <new_version>
  2. Apply the upgrade:

    helm upgrade <release> <chart> -f <path_to_values.yaml>

Since updateStrategy.type is set to OnDelete, this step will not restart any pods; it only prepares them to run the new version.

  • For native deployments, ensure the new binary is available.

Upgrade procedure (zero downtime)

Our procedure for achieving zero-downtime upgrades consists of restarting one instance at a time. Memgraph uses primary–secondary replication. To avoid downtime:

  1. Upgrade replicas first.
  2. Upgrade the main instance.
  3. Upgrade coordinator followers, then the leader.

To find out which pod or server currently hosts the main instance and the cluster leader, run:

SHOW INSTANCES;

Upgrade replicas

If you are using K8s, the upgrade is performed by deleting the pod. Start by deleting a replica pod (in this example, the replica runs in the pod memgraph-data-1-0):

kubectl delete pod memgraph-data-1-0

Native deployment: stop the old binary and start the new one.

Before upgrading the next pod, it is important to wait until all pods are ready; otherwise, you may end up with data loss. On K8s you can do that by running:

kubectl wait --for=condition=ready pod --all

For native deployments, manually check that all your instances are alive.

This step should be repeated for all of your replicas in the cluster.

Upgrade the main

Before deleting the main pod, check replication lag to see whether replicas are behind MAIN:

SHOW REPLICATION LAG;

If replicas are behind, the upgrade is prone to data loss. To achieve a zero-downtime upgrade without data loss, either:

  • Use STRICT_SYNC mode (writes will be blocked during upgrade), or
  • Wait until replicas are fully caught up, then pause writes. This way, you can use any replication mode. Read queries should, however, keep working without issues regardless of the replica type you are using.

Upgrade the main pod:

kubectl delete pod memgraph-data-0-0
kubectl wait --for=condition=ready pod --all

Upgrade coordinators

The upgrade of coordinators is done in exactly the same way. Start by upgrading followers and finish with deleting the leader pod:

kubectl delete pod memgraph-coordinator-3-0
kubectl wait --for=condition=ready pod --all
 
kubectl delete pod memgraph-coordinator-2-0
kubectl wait --for=condition=ready pod --all
 
kubectl delete pod memgraph-coordinator-1-0
kubectl wait --for=condition=ready pod --all

Verify upgrade

Your upgrade should now be complete. To check that everything works, run:

SHOW VERSION;

It should show you the new Memgraph version.

Rollback

If an error occurs during the upgrade, or if something doesn't work even after all pods are upgraded (e.g., write queries don't pass), you can safely downgrade the cluster to the previous version using the VolumeSnapshots you took on K8s, or the file backups for native deployments.

  • Kubernetes:

    helm uninstall <release>

    In values.yaml, for all instances set:

    restoreDataFromSnapshot: true

    Make sure to set the correct name of the snapshot that will be used to recover your instances.

  • Native deployments: restore from your file backups.

If you’re doing an upgrade on minikube, make sure that the snapshot resides on the same node on which the StatefulSet is installed. Otherwise, the StatefulSet's attached PersistentVolumeClaim cannot be restored from the VolumeSnapshot.