Set up HA cluster with K8s Enterprise
Users are advised to first read the guide on how replication works, followed by the guides on how high availability works and how to query the cluster.
Install Memgraph HA on Kubernetes
To deploy a Memgraph High Availability (HA) cluster on Kubernetes, you must first add the Memgraph Helm repository and then install the HA Helm chart.
Add the Helm repository
Add the Memgraph Helm chart repository to your local Helm setup by running the following command:
helm repo add memgraph https://memgraph.github.io/helm-charts
Make sure to update the repository to fetch the latest Helm charts available:
helm repo update
Install Memgraph HA
Since Memgraph HA requires an Enterprise
license, you must provide
the license and organization name to the chart through a Kubernetes Secret.
Breaking change: Starting with Memgraph HA chart version 1.0.0, the HA chart no longer accepts
the license and organization name as plaintext values via env.MEMGRAPH_ENTERPRISE_LICENSE
and env.MEMGRAPH_ORGANIZATION_NAME. Both values are now read from a Kubernetes
Secret referenced via secretKeyRef, and the secret must exist before you run
helm install — the StatefulSets will fail to start otherwise. The previous
env.* values have been removed from values.yaml.
Create the secret first, then install the chart:
kubectl create secret generic memgraph-secrets \
--from-literal=MEMGRAPH_ENTERPRISE_LICENSE=<your-license> \
--from-literal=MEMGRAPH_ORGANIZATION_NAME=<your-organization-name>
helm install <release-name> memgraph/memgraph-high-availability
Replace <release-name> with a name of your choice for the release. The
secret name and keys are configurable via secrets.name, secrets.licenseKey
and secrets.organizationKey (defaults: memgraph-secrets,
MEMGRAPH_ENTERPRISE_LICENSE, MEMGRAPH_ORGANIZATION_NAME).
The cluster will be fully connected once installation completes. Note that the install command may take a moment while instances establish connections. If clients connect from outside the cluster, update the Bolt server address on each instance to use its external IP as explained in the section on setting up the cluster.
Using the latest tag can lead to unexpected behavior if pods restart and pull newer,
incompatible images.
Install Memgraph HA with minikube
If you are installing the Memgraph HA chart locally with minikube, we strongly
recommend enabling the csi-hostpath-driver and using its storage class.
Otherwise,
you could run into problems attaching PVCs to pods.
Enable csi-hostpath-driver
minikube addons disable storage-provisioner
minikube addons disable default-storageclass
minikube addons enable volumesnapshots
minikube addons enable csi-hostpath-driver
Create a StorageClass (save as sc.yaml)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: csi-hostpath-delayed
provisioner: hostpath.csi.k8s.io
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
Apply the StorageClass
kubectl apply -f sc.yaml
Configure the Helm chart
In your values.yaml, set:
storage:
libStorageClassName: csi-hostpath-delayed
Configure the Helm chart
Override default chart values
You can customize the Memgraph HA Helm chart either inline with --set flags or
by using a values.yaml file.
Option 1: Override values inline
helm install <release-name> memgraph/memgraph-high-availability \
--set <flag1>=<value1>,<flag2>=<value2>,...
Option 2: Use a values file
helm install <release-name> memgraph/memgraph-high-availability \
-f values.yaml
You can also combine both approaches. Values specified with --set override
those in values.yaml.
Upgrade Helm chart
To upgrade the Helm chart, you can use:
helm upgrade <release-name> memgraph/memgraph-high-availability --set <flag1>=<value1>,<flag2>=<value2>
Again, it is possible to use both --set and values.yaml to set configuration
options.
If you’re using IngressNginx and performing an upgrade, the attached public IP
should remain the same. It will only change if the release includes specific
updates that modify it—and such changes will be documented.
Uninstall Helm chart
Uninstallation is done with:
helm uninstall <release-name>
Uninstalling the chart does not delete PersistentVolumeClaims (PVCs). Even
if the default StorageClass reclaim policy is Delete, data on the underlying
PersistentVolumes (PVs) will not be removed automatically when uninstalling the
chart.
However, we still recommend configuring the reclaim policy to Retain, as
described in the High availability storage
section.
Runtime environment & security
Security context
All Memgraph HA instances run as Kubernetes StatefulSet workloads, each with a
single pod. Depending on configuration, the pod contains two or three
containers:
- memgraph-coordinator - runs the Memgraph binary.
- Optional init container - enabled when
sysctlInitContainer.enabled is set.
Memgraph processes run as the non-root memgraph user with no Linux capabilities and no privilege escalation.
High availability storage
Memgraph HA always uses PersistentVolumeClaims (PVCs) to store database files and logs.
- Default storage size: 1Gi (you will likely need to increase this).
- Default access mode: ReadWriteOnce (can be set to ReadOnlyMany, ReadWriteMany, or ReadWriteOncePod).
- PVCs use the cluster's default StorageClass, unless overridden.
You can explicitly set storage classes using:
- storage.libStorageClassName - for data volumes
- storage.logStorageClassName - for log volumes
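For example, a minimal values.yaml sketch that increases the data-instance storage and pins storage classes, using the per-role keys listed in the configuration table at the end of this page (the sizes and class name are illustrative):
storage:
  data:
    libPVCSize: 10Gi
    libStorageClassName: csi-hostpath-delayed
    logPVCSize: 2Gi
    logStorageClassName: csi-hostpath-delayed
  coordinators:
    libPVCSize: 2Gi
    libStorageClassName: csi-hostpath-delayed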
Most default StorageClasses use a Delete reclaim policy, meaning deleting the
PVC deletes the underlying PersistentVolume (PV). We recommend switching to
Retain.
After your cluster is running, you can patch all PVs:
#!/bin/bash
PVS=$(kubectl get pv --no-headers -o custom-columns=":metadata.name")
for pv in $PVS; do
echo "Patching PV: $pv"
kubectl patch pv $pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
done
echo "All PVs have been patched."
Kubernetes uses Storage Object in Use Protection, preventing deletion of
PVCs while still attached to pods.
Similarly, PVs will remain until their PVCs are fully removed.
If a PVC is stuck terminating, you can remove its finalizers:
kubectl patch pvc PVC_NAME -p '{"metadata":{"finalizers": []}}' --type=merge
Network configuration
All Memgraph HA components communicate with each other internally over the ClusterIP network.
Default ports:
- Management: 10000
- Replication (data instances): 20000
- Coordinator communication: 12000
You can change this configuration by specifying:
ports:
managementPort: <value>
replicationPort: <value>
coordinatorPort: <value>
External network configuration
Memgraph HA uses client-side routing, so DNS resolution happens inside the internal ClusterIP network. Because of that, we need one more type of network for clients accessing instances from outside the cluster. Out of the box, the HA chart supports the following K8s resources for setting up external access:
- IngressNginx - one LoadBalancer for all instances.
- NodePort - exposes ports on each node (requires public node IPs).
- LoadBalancer - one LoadBalancer per instance (highest cost).
- CommonLoadBalancer (coordinators only) - single LB for all coordinators.
- Gateway API - uses Kubernetes Gateway API resources (Gateway + TCPRoute). Configured under
externalAccessConfig.gateway.
For coordinators, there is an additional option of using CommonLoadBalancer.
In this scenario, there is one load balancer sitting in front of coordinators.
You can save the cost of two load balancers compared to LoadBalancer option
since usually you don’t need to distinguish specific coordinators while using
Memgraph capabilities. Note that if you connect to a coordinator directly for some reason (e.g. to run the SHOW INSTANCES query),
you can run the SHOW INSTANCE query to see which coordinator you were routed to.
The default Bolt port is opened on 7687 but you can change it by setting ports.boltPort.
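For example, a minimal values.yaml sketch that puts all coordinators behind a single CommonLoadBalancer, gives each data instance its own LoadBalancer, and keeps the default Bolt port (the values shown are illustrative):
externalAccessConfig:
  coordinator:
    serviceType: "CommonLoadBalancer"
  dataInstance:
    serviceType: "LoadBalancer"
ports:
  boltPort: 7687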
For more detailed IngressNginx setup, see Use Memgraph HA chart with IngressNginx.
Note, however, that IngressNginx is being retired; one alternative is the Kubernetes Gateway API with controllers like Envoy Gateway, Istio, Cilium, Traefik, or Kong. The HA chart has native Gateway API support; see Use Memgraph HA chart with Gateway API.
By default, the chart does not expose any external network services.
Per-instance external access annotations
When using LoadBalancer or NodePort external access, you can set annotations
globally via externalAccessConfig.dataInstance.annotations and
externalAccessConfig.coordinator.annotations. These apply to every external
Service of that type.
If you need different annotations per instance — for example, to assign unique
DNS hostnames via external-dns — use the externalAccessAnnotations field on
individual entries in data[] or coordinators[]. Per-instance annotations are
merged with the global annotations, and per-instance values take precedence
when the same key appears in both.
externalAccessConfig:
dataInstance:
serviceType: "LoadBalancer"
annotations:
service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
data:
- id: "0"
externalAccessAnnotations:
external-dns.alpha.kubernetes.io/hostname: "data-0.memgraph.example.com"
- id: "1"
externalAccessAnnotations:
external-dns.alpha.kubernetes.io/hostname: "data-1.memgraph.example.com"
In this example, each data instance's external Service gets the shared
aws-load-balancer-scheme annotation plus its own unique external-dns
hostname. Bolt and management ports are not set per-instance — they come from
ports.boltPort and ports.managementPort.
Per-instance internal access annotations
Each data instance and coordinator also has an internal ClusterIP Service used
for in-cluster communication. You can set per-instance annotations on these
internal Services using the internalAccessAnnotations field on individual
entries in data[] or coordinators[]. This is useful for integrations or other tooling that consumes annotations on the internal
Services.
data:
- id: "0"
internalAccessAnnotations:
mycompany.io/service-mesh: "enabled"
- id: "1"
internalAccessAnnotations:
mycompany.io/service-mesh: "enabled"
coordinators:
- id: "1"
internalAccessAnnotations:
mycompany.io/service-mesh: "enabled"
Node affinity
Memgraph HA deploys multiple pods, and you can control pod placement with affinity settings.
Supported strategies:
- default - Attempts to schedule the data pods and coordinator pods on nodes where there is no other pod with the same role. If there is no such node, the pods will still be scheduled on the same node, and deployment will not fail.
- unique (affinity.unique = true) - Each coordinator and data pod must be placed on a separate node. If not enough nodes exist, deployment fails. Coordinators are scheduled first; data pods then check which nodes host coordinators and are placed on separate nodes.
- parity (affinity.parity = true) - Schedules at most one coordinator + one data pod per node. Coordinators schedule first; data pods follow.
- nodeSelection (affinity.nodeSelection = true) - Pods are scheduled onto explicitly labeled nodes using affinity.dataNodeLabelValue and affinity.coordinatorNodeLabelValue. If all labeled nodes are already occupied by pods with the same role, the deployment will fail.
When using nodeSelection, ensure that nodes are labeled correctly.
Default role label key: role
Default values: data-node, coordinator-node
Example:
kubectl label nodes <node-name> role=data-node
A full AKS example is available in the chart repository.
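Assuming the nodes are labeled as above, a minimal values.yaml sketch enabling the nodeSelection strategy with the default label key and values looks like this:
affinity:
  nodeSelection: true
  roleLabelKey: role
  dataNodeLabelValue: data-node
  coordinatorNodeLabelValue: coordinator-node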
Sysctl options
Use the sysctlInitContainer to configure kernel parameters required for
high-memory workloads, such as increasing vm.max_map_count.
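A minimal values.yaml sketch that keeps the init container enabled and sets vm.max_map_count explicitly (262144 is the chart default from the configuration table; raise it for large datasets if needed):
sysctlInitContainer:
  enabled: true
  maxMapCount: 262144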
Authentication
By default, Memgraph HA starts without authentication enabled.
Breaking change: The HA chart no longer creates a Memgraph user from the
USER/PASSWORD keys of the memgraph-secrets Secret. The secrets.enabled,
secrets.userKey and secrets.passwordKey values have been removed because
the previous implementation also applied these env variables to coordinators,
which run without auth. The memgraph-secrets Secret is now reserved for the
license and organization name.
To configure credentials, connect to a data instance after installation and create users with Cypher, for example:
CREATE USER memgraph IDENTIFIED BY 'memgraph';
Run the same statements on every data instance you want the user to exist on. Coordinators run without authentication and do not need user setup.
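If you don't have a Bolt client at hand, one way to run the statement is via kubectl exec against a data pod, assuming mgconsole is available in the Memgraph image (pod naming follows the chart's StatefulSets, e.g. memgraph-data-0-0; add -c to pick the Memgraph container if sidecars are enabled):
echo "CREATE USER memgraph IDENTIFIED BY 'memgraph';" | \
  kubectl exec -i memgraph-data-0-0 -- mgconsole --host 127.0.0.1 --port 7687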
Setting up the cluster
Although many configuration options exist, especially for networking, the workflow for creating a Memgraph HA cluster follows these steps:
- Provision the Kubernetes cluster. Ensure your nodes, storage, and networking are ready.
- Label nodes according to your chosen affinity strategy (optional). For example, when using nodeSelection, label nodes as data-node or coordinator-node.
- Create the memgraph-secrets Kubernetes secret holding MEMGRAPH_ENTERPRISE_LICENSE and MEMGRAPH_ORGANIZATION_NAME (required; the chart reads these via secretKeyRef).
- Install the Memgraph HA Helm chart using helm install. This creates a fully connected cluster.
- Install auxiliary components for external access, such as ingress-nginx (optional).
- Update Bolt server addresses if clients will connect from outside the cluster (optional).
Update bolt server
This step is required only when:
- Clients access the database from outside the cluster, and
- You’re using bolt+routing for client-side routing
Each instance must know its external address for routing to work correctly. Run the following queries on the leader coordinator:
UPDATE CONFIG FOR COORDINATOR 1 WITH CONFIG {"bolt_server": "<bolt-server-coord1>"};
UPDATE CONFIG FOR COORDINATOR 2 WITH CONFIG {"bolt_server": "<bolt-server-coord2>"};
UPDATE CONFIG FOR COORDINATOR 3 WITH CONFIG {"bolt_server": "<bolt-server-coord3>"};
UPDATE CONFIG FOR INSTANCE instance_0 WITH CONFIG {"bolt_server": "<bolt-server-instance0>"};
UPDATE CONFIG FOR INSTANCE instance_1 WITH CONFIG {"bolt_server": "<bolt-server-instance1>"};
Note that only the bolt_server values are provided. The correct
value depends on the type of external access you configured (LoadBalancer IP,
Ingress host/port, NodePort, etc.).
Refer to the Memgraph HA User API docs for the full set of commands and usage patterns.
Use Memgraph HA chart with Gateway API
The Memgraph HA Helm chart has native support for the Kubernetes Gateway API. When enabled, the chart automatically creates TCPRoute resources for each data and coordinator instance. You can either let the chart create its own Gateway or attach routes to a pre-existing one.
Gateway API is orthogonal to the serviceType external access options (IngressNginx, NodePort, LoadBalancer). The routes point at internal ClusterIP services that always exist, so you can use Gateway API alongside or instead of other external access methods.
Prerequisites
Before enabling Gateway API in the chart, you need:
- A Gateway API controller installed in your cluster. Examples include Envoy Gateway, Istio, Cilium, Traefik, and Kong. This guide uses Envoy Gateway as an example:
helm install eg oci://docker.io/envoyproxy/gateway-helm --version v1.2.4 -n envoy-gateway-system --create-namespace
- A GatewayClass resource that references your controller. A GatewayClass is a cluster-scoped resource that defines which controller manages Gateways — each Gateway references a GatewayClass by name. The Helm chart does not create a GatewayClass; you must create one yourself or use one provided by your controller installation. For Envoy Gateway:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
You must ensure the GatewayClass exists before enabling the gateway feature in the chart. If you create your own Gateway (Option 1 below), the chart requires gatewayClassName to reference an existing GatewayClass, and will fail with an error if it is not set.
Option 1: Chart-managed Gateway
When you want the chart to create its own Gateway along with TCPRoute resources, set externalAccessConfig.gateway.enabled to true and provide the gatewayClassName:
externalAccessConfig:
gateway:
enabled: true
gatewayClassName: "eg"
The chart will create:
- A Gateway (gateway.networking.k8s.io/v1) with TCP listeners auto-generated for each data and coordinator instance.
- A TCPRoute (gateway.networking.k8s.io/v1alpha2) per instance, routing traffic from the Gateway listener to the instance's Bolt port.
Data instance ports are assigned as dataPortBase + array index (default: 9000, 9001, …) and coordinator ports as coordinatorPortBase + coordinator id (default: 9011, 9012, 9013). You can customize the base ports:
externalAccessConfig:
gateway:
enabled: true
gatewayClassName: "eg"
dataPortBase: 9000
coordinatorPortBase: 9010
You can also set annotations and labels on the Gateway resource:
externalAccessConfig:
gateway:
enabled: true
gatewayClassName: "eg"
annotations:
example.io/owner: "memgraph"
labels:
app: memgraph-ha
To install with a chart-managed Gateway (assuming the memgraph-secrets
Secret with the license and organization name already exists, see Install
Memgraph HA):
helm install memgraph-ha memgraph/memgraph-high-availability \
--set externalAccessConfig.gateway.enabled=true \
--set externalAccessConfig.gateway.gatewayClassName=eg
Option 2: Existing (external) Gateway
When you already have a Gateway resource in your cluster (for example, a shared Gateway serving multiple services including Memgraph Lab), you can have the chart create only TCPRoute resources that attach to it:
externalAccessConfig:
gateway:
enabled: true
existingGatewayName: "memgraph-gateway"
In this mode, the chart skips Gateway creation and only creates TCPRoute resources. The gatewayClassName is not required.
If the existing Gateway is in a different namespace, specify it:
externalAccessConfig:
gateway:
enabled: true
existingGatewayName: "memgraph-gateway"
existingGatewayNamespace: "gateway-system"To install with an existing Gateway (assuming the memgraph-secrets Secret
with the license and organization name already exists, see Install Memgraph
HA):
helm install memgraph-ha memgraph/memgraph-high-availability \
--set externalAccessConfig.gateway.enabled=true \
--set externalAccessConfig.gateway.existingGatewayName=memgraph-gateway
When using an existing Gateway, ensure it has listeners configured with the correct names and ports that match the TCPRoute sectionName references. The chart expects listener names in the format data-{id}-bolt for data instances and coordinator-{id}-bolt for coordinators. For example, the default HA setup (2 data instances, 3 coordinators) needs these listeners:
- data-0-bolt on port 9000
- data-1-bolt on port 9001
- coordinator-1-bolt on port 9011
- coordinator-2-bolt on port 9012
- coordinator-3-bolt on port 9013
A standalone Gateway manifest with these pre-configured listeners is available in the Helm charts repository.
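For reference, a sketch of such a standalone Gateway for the default layout could look like the following (the gatewayClassName eg is carried over from the Envoy Gateway example above; if the chart's TCPRoutes live in a different namespace, you will also need to open allowedRoutes on each listener):
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: memgraph-gateway
spec:
  gatewayClassName: eg
  listeners:
    - name: data-0-bolt
      protocol: TCP
      port: 9000
    - name: data-1-bolt
      protocol: TCP
      port: 9001
    - name: coordinator-1-bolt
      protocol: TCP
      port: 9011
    - name: coordinator-2-bolt
      protocol: TCP
      port: 9012
    - name: coordinator-3-bolt
      protocol: TCP
      port: 9013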
TCPRoute API version: TCPRoute uses v1alpha2, which is the latest available API version. It is supported by Envoy Gateway and other major implementations but is not yet GA. Gateway and HTTPRoute are both GA (v1).
Use Memgraph HA chart with IngressNginx
One of the most cost-efficient ways to expose a Memgraph HA cluster is by using IngressNginx. The controller supports TCP routing (including the Bolt protocol), allowing all Memgraph instances to share:
- a single LoadBalancer, and
- a single external IP address.
Clients connect to any coordinator or data instance by using different Bolt ports.
To install Memgraph HA with IngressNginx enabled (assuming the
memgraph-secrets Secret with the license and organization name already
exists, see Install Memgraph HA):
helm install mem-ha-test ./charts/memgraph-high-availability --set \
affinity.nodeSelection=true,\
externalAccessConfig.dataInstance.serviceType=IngressNginx,\
externalAccessConfig.coordinator.serviceType=IngressNginx
When using these settings, the chart will automatically install and configure IngressNginx, including all required TCP routing setup for Memgraph.
Probes
Memgraph HA uses standard Kubernetes startup, readiness, and liveness probes to ensure correct container operation.
- Startup probe - Determines when Memgraph has fully started. It succeeds only after database recovery completes. Liveness and readiness probes do not run until startup succeeds.
- Readiness probe - Indicates when the instance is ready to accept client traffic.
- Liveness probe - Determines when the container should be restarted if it becomes unresponsive.
Default timing
- On data instances, the startup probe must succeed within 2 hours. If recovery (e.g., from backup) may take longer, increase the timeout (see the sketch below).
- Liveness and readiness probes must succeed at least once every 5 minutes for the pod to be considered healthy.
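For example, if recovery on a data instance can take longer than the default window, you can raise the startup probe's failure threshold; the allowed startup time is roughly failureThreshold x periodSeconds (the numbers below are illustrative):
container:
  data:
    startupProbe:
      failureThreshold: 2880   # 2880 checks x 10s period = 8 hours
      periodSeconds: 10
      timeoutSeconds: 10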
Probe endpoints
- Coordinators: probed on the NuRaft server
- Data instances: probed on the Bolt server
Breaking change (HA Helm chart 1.0.0): The probe target port is no longer configurable through
container.data.{readinessProbe,livenessProbe,startupProbe}.tcpSocket.port or
container.coordinators.{readinessProbe,livenessProbe,startupProbe}.tcpSocket.port.
Probes are now hard-coded to a tcpSocket check against ports.boltPort for data
instances and ports.coordinatorPort for coordinators. Only the probe timings
(failureThreshold, timeoutSeconds, periodSeconds) remain configurable.
Remove any tcpSocket overrides from your values.yaml and change the bolt or
coordinator port via ports.boltPort / ports.coordinatorPort instead.
Debugging
There are different ways in which you can debug Memgraph’s HA cluster in
production. One way is to send us logs from all instances if you notice some
issue. That’s why we advise users to set the log level to TRACE if possible.
Note however that running TRACE log level has some performance costs,
especially when logging to stderr in addition to files. If performance is your
concern, first set commonArgs.data.logging.also_log_to_stderr and
commonArgs.coordinators.logging.also_log_to_stderr to false since logging
to files is cheaper. If you’re still unhappy with the performance overhead of
logging, set commonArgs.{data,coordinators}.logging.log_level to DEBUG
(higher log levels like INFO or CRITICAL are also fine) and keep
also_log_to_stderr: true. These settings replace the --log-level and
--also-log-to-stderr flags that the chart now appends to instance args
automatically — setting them directly in data[].args or
coordinators[].args is rejected.
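For example, the first suggestion above (keep TRACE but stop logging to stderr) corresponds to the following values.yaml sketch:
commonArgs:
  data:
    logging:
      log_level: TRACE
      also_log_to_stderr: false
      log_file: /var/log/memgraph/memgraph.log
  coordinators:
    logging:
      log_level: TRACE
      also_log_to_stderr: false
      log_file: /var/log/memgraph/memgraph.log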
By default, the chart provisions a dedicated log PVC for every data and
coordinator pod. If you only log to stderr and don’t need a persistent log
volume, you can disable the log PVC by setting
storage.data.createLogStorageClaim and/or
storage.coordinators.createLogStorageClaim to false. When you do this you
must also set the corresponding commonArgs.{data,coordinators}.logging.log_file
to "" to disable file logging — installing the chart with file logging
enabled but no log volume is rejected.
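For example, a sketch that drops the log PVCs for both roles and, as required, disables file logging while keeping stderr logging:
storage:
  data:
    createLogStorageClaim: false
  coordinators:
    createLogStorageClaim: false
commonArgs:
  data:
    logging:
      log_file: ""
      also_log_to_stderr: true
  coordinators:
    logging:
      log_file: ""
      also_log_to_stderr: true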
If you notice your application is crashing, you will be able to collect core
dumps by setting storage.data.createCoreDumpsClaim and
storage.coordinators.createCoreDumpsClaim to true. That will trigger the
creation of an init container which will be run in privileged mode as the root user
to set up all the necessary things on your nodes to be able to collect core
dumps. You can then create a debug pod and attach the PVC containing core dumps to
that pod to extract the core dumps off the K8s nodes. An example
of such a debug pod is the following YAML file:
apiVersion: v1
kind: Pod
metadata:
name: debug-coredump
spec:
containers:
- name: debug
image: ubuntu:22.04
command: ["sleep", "infinity"]
volumeMounts:
- name: coredumps
mountPath: /var/core/memgraph
volumes:
- name: coredumps
persistentVolumeClaim:
claimName: memgraph-data-0-core-dumps-storage-memgraph-data-0-0
restartPolicy: Never
There is also a possibility of automatically uploading core dumps to S3. To do that, set coreDumpUploader.enabled to true and configure the S3 bucket,
AWS region, and credentials secret in the coreDumpUploader section. Note that the createCoreDumpsClaim flag for the relevant role (data/coordinators)
must also be set to true, as the uploader sidecar mounts the same PVC used for core dump storage. Core dumps are uploaded to
s3://<s3BucketName>/<s3Prefix>/<pod-hostname>/<core-dump-filename>.
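A values.yaml sketch enabling the uploader for data instances (the bucket name is a placeholder; the Secret name and key names are the chart defaults from the configuration table):
storage:
  data:
    createCoreDumpsClaim: true
coreDumpUploader:
  enabled: true
  s3BucketName: "my-memgraph-core-dumps"   # placeholder bucket name
  s3Prefix: "core-dumps"
  awsRegion: "us-east-1"
  pollIntervalSeconds: 30
  secretName: "aws-s3-credentials"
The referenced AWS credentials Secret can be created like the other Secrets in this guide:
kubectl create secret generic aws-s3-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=<access-key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<secret-access-key>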
Graceful termination
When a pod is stopped (e.g., during upgrades, rescheduling, or scale-down),
Kubernetes sends SIGTERM and waits up to terminationGracePeriodSeconds
for the container to exit cleanly before forcefully killing it with SIGKILL.
The HA chart defaults this value to 30 seconds for both data
and coordinator pods, configurable via:
- container.data.terminationGracePeriodSeconds
- container.coordinators.terminationGracePeriodSeconds
The 30-second default is sufficient because --storage-snapshot-on-exit is explicitly set to false by default.
Using --storage-snapshot-on-exit with HA
If you enable the --storage-snapshot-on-exit
flag on data instances, Memgraph will attempt to create a full snapshot of the
database during shutdown. Snapshot creation time scales with dataset size and
can easily exceed the default grace period on larger deployments.
If terminationGracePeriodSeconds is shorter than the time needed to write the
on-exit snapshot, Kubernetes will SIGKILL the Memgraph process mid-write,
leaving the snapshot incomplete and defeating the purpose of the flag.
When enabling --storage-snapshot-on-exit, set
container.data.terminationGracePeriodSeconds to a value that comfortably
covers the expected snapshot duration for your dataset. Benchmark the snapshot
time on a representative dataset and add a safety margin.
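A sketch of the relevant values when enabling on-exit snapshots, with an illustrative 10-minute grace period (size it to your own measured snapshot time):
container:
  data:
    terminationGracePeriodSeconds: 600
data:
  - id: "0"
    args:
      - "--storage-snapshot-on-exit=true"
  - id: "1"
    args:
      - "--storage-snapshot-on-exit=true"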
Monitoring
Memgraph HA integrates with Kubernetes monitoring tools through:
- The kube-prometheus-stack Helm chart
- Memgraph’s Prometheus exporter
The chart kube-prometheus-stack should be installed independently from HA
chart with the following command:
helm install kube-prometheus-stack oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack \
-f kube_prometheus_stack_values.yaml \
-f kube_prometheus_stack_memgraph_dashboard.yaml \
--namespace monitoring \
--create-namespace
kube_prometheus_stack_values.yaml is optional. A template is available in the
upstream chart’s
repository.
kube_prometheus_stack_memgraph_dashboard.yaml is also optional - it provides a generic dashboard which shows the metrics
that Memgraph exports for both standalone and HA deployments. This dashboard file can be downloaded from
here.
If you install the kube-prometheus-stack in a non-default namespace, allow
cross-namespace scraping. You can allow this by adding the following
configuration to your kube_prometheus_stack_values.yaml file:
prometheus:
prometheusSpec:
serviceMonitorSelectorNilUsesHelmValues: false
Enable monitoring in the Memgraph HA chart
To enable the Memgraph Prometheus exporter and ServiceMonitor:
prometheus:
enabled: true
namespace: monitoring
memgraphExporter:
port: 9115
pullFrequencySeconds: 5
repository: memgraph/mg-exporter
tag: 0.2.1
serviceMonitor:
kubePrometheusStackReleaseName: kube-prometheus-stack
interval: 15s
If you set prometheus.enabled to false, resources from
charts/memgraph-high-availability/templates/mg-exporter.yaml will not be
installed into the monitoring namespace.
Refer to the configuration table later in the document for details on all parameters.
Uninstall kube-prometheus-stack
helm uninstall kube-prometheus-stack --namespace monitoring
Note: The stack’s CRDs are not deleted automatically and must be removed manually:
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheusagents.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd scrapeconfigs.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
Remote metrics and logs
The HA chart supports optional remote observability:
- vmagentRemote for shipping metrics with Prometheus remote_write
- vectorRemote sidecars for shipping Memgraph logs to Loki-compatible endpoints
Prerequisites:
- keep prometheus.enabled: true so mg-exporter is deployed
- if you only need remote shipping and not local scraping, set prometheus.serviceMonitor.enabled: false to avoid duplicate scraping
- configure vectorRemote.data and/or vectorRemote.coordinators depending on which pod roles should ship logs
- when vectorRemote.enabled: true, add --monitoring-port=<vectorRemote.websocketPort> and --monitoring-address=0.0.0.0 to each instance's args
Example values.yaml:
prometheus:
enabled: true
namespace: monitoring
serviceMonitor:
enabled: false
vmagentRemote:
enabled: true
namespace: monitoring
remoteWrite:
url: "https://<prom-remote-write>/api/v1/write"
# Optional: only set basicAuth when your remote_write endpoint requires basic auth.
basicAuth:
secretName: monitoring-basic-auth
usernameKey: username
passwordKey: password
externalLabels:
cluster_id: "memgraph-testing-cluster-53"
service_name: "memgraph-ha"
cluster_env: "self-hosted-large-01"
vectorRemote:
enabled: true
data: true
coordinators: true
websocketPort: 7444
logsEndpoint: "https://<loki-endpoint>"
# Optional: only set auth when your endpoint requires basic auth.
auth:
secretName: monitoring-basic-auth
usernameKey: username
passwordKey: password
extraLabels:
cluster_id: "memgraph-testing-cluster-53"
service_name: "memgraph-ha"
cluster_env: "self-hosted-large-01"
data:
- id: "0"
args:
- "--monitoring-port=7444"
- "--monitoring-address=0.0.0.0"
- id: "1"
args:
- "--monitoring-port=7444"
- "--monitoring-address=0.0.0.0"
coordinators:
- id: "1"
args:
- "--monitoring-port=7444"
- "--monitoring-address=0.0.0.0"
- id: "2"
args:
- "--monitoring-port=7444"
- "--monitoring-address=0.0.0.0"
- id: "3"
args:
- "--monitoring-port=7444"
- "--monitoring-address=0.0.0.0"
The chart auto-appends --bolt-port, --management-port, --coordinator-port,
--coordinator-id, --coordinator-hostname, --data-directory, --log-level,
--also-log-to-stderr and --log-file from ports.* and
commonArgs.{data,coordinators}.logging.*. Setting any of these in
data[].args or coordinators[].args causes helm install to fail with a
template error.
Create credentials secret in the namespace where vmagent runs (usually monitoring):
kubectl create secret generic monitoring-basic-auth -n monitoring \
--from-literal=username='<username>' \
--from-literal=password='<password>'
For HA Vector sidecars, create the same secret in the Memgraph release namespace as well:
kubectl create secret generic monitoring-basic-auth -n <memgraph-namespace> \
--from-literal=username='<username>' \
--from-literal=password='<password>'
Kubernetes infrastructure metrics
vmagentRemote can additionally scrape Kubernetes infrastructure metrics
(kube-state-metrics, node-exporter, kubelet) required by
kube-prometheus-stack Kubernetes and Node dashboards, and remote-write them
to your centralized monitoring cluster.
Enable Kubernetes scraping by extending your existing vmagentRemote values:
vmagentRemote:
# ... existing fields (enabled, remoteWrite, externalLabels) ...
kubernetes:
enabled: true
kubeStateMetrics:
enabled: true
jobName: kube-state-metrics
targets:
- kube-prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080
nodeExporter:
enabled: true
jobName: node-exporter
targets:
- kube-prometheus-stack-prometheus-node-exporter.monitoring.svc.cluster.local:9100
kubelet:
enabled: true
jobName: kubelet
metricsPath: /metrics/cadvisor
apiServerAddress: kubernetes.default.svc:443
insecureSkipVerify: false
Notes:
- RBAC and ServiceAccount resources are created only when an enabled scrape job requires Kubernetes API access (for example kubelet.enabled=true or nodeExporter.useKubernetesDiscovery=true).
- Keep jobName values aligned with dashboard and recording-rule expectations unless you also update those queries.
- Dashboards that rely on precomputed recording-rule series still require rule evaluation in your monitoring stack.
A ready-to-use example values file is available in the Helm charts repository:
examples/remote-monitoring/values-ha-k8s-metrics.yaml.
Configuration options
The following table lists the configurable parameters of the Memgraph HA chart and their default values.
| Parameter | Description | Default |
|---|---|---|
image.repository | Memgraph Docker image repository | docker.io/memgraph/memgraph |
image.tag | Specific tag for the Memgraph Docker image. Overrides the image tag whose default is chart version. | 3.1.0 |
image.pullPolicy | Image pull policy | IfNotPresent |
memgraphUserId | The user id that is hardcoded in Memgraph and Mage images | 101 |
memgraphGroupId | The group id that is hardcoded in Memgraph and Mage images | 103 |
storage.data.libPVCSize | Size of the lib storage PVC for data instances | 1Gi |
storage.data.libStorageAccessMode | Access mode used for lib storage on data instances | ReadWriteOnce |
storage.data.libStorageClassName | The name of the storage class used for storing data on data instances | "" |
storage.data.createLogStorageClaim | Create a PVC for logs on data instances. When false, commonArgs.data.logging.log_file must be "". | true |
storage.data.logPVCSize | Size of the log PVC for data instances | 1Gi |
storage.data.logStorageAccessMode | Access mode used for log storage on data instances | ReadWriteOnce |
storage.data.logStorageClassName | The name of the storage class used for storing logs on data instances | "" |
storage.data.createCoreDumpsClaim | Create a PVC for core dumps on data instances | false |
storage.data.coreDumpsStorageClassName | Storage class name for core dumps PVC on data instances | "" |
storage.data.coreDumpsStorageSize | Size of the core dumps PVC on data instances | 10Gi |
storage.data.coreDumpsMountPath | Mount path for core dumps on data instances | /var/core/memgraph |
storage.data.coreDumpsImage.repository | Image repository for the data instance core-dumps init container. | docker.io/library/busybox |
storage.data.coreDumpsImage.tag | Image tag for the data instance core-dumps init container. | latest |
storage.data.coreDumpsImage.pullPolicy | Image pull policy for the data instance core-dumps init container. | IfNotPresent |
storage.data.extraVolumes | Additional volumes to add to data instance pods | [] |
storage.data.extraVolumeMounts | Additional volume mounts to add to data instance containers | [] |
storage.coordinators.libPVCSize | Size of the lib storage PVC for coordinators | 1Gi |
storage.coordinators.libStorageAccessMode | Access mode used for lib storage on coordinators | ReadWriteOnce |
storage.coordinators.libStorageClassName | The name of the storage class used for storing data on coordinators | "" |
storage.coordinators.createLogStorageClaim | Create a PVC for logs on coordinators. When false, commonArgs.coordinators.logging.log_file must be "". | true |
storage.coordinators.logPVCSize | Size of the log PVC for coordinators | 1Gi |
storage.coordinators.logStorageAccessMode | Access mode used for log storage on coordinators | ReadWriteOnce |
storage.coordinators.logStorageClassName | The name of the storage class used for storing logs on coordinators | "" |
storage.coordinators.createCoreDumpsClaim | Create a PVC for core dumps on coordinators | false |
storage.coordinators.coreDumpsStorageClassName | Storage class name for core dumps PVC on coordinators | "" |
storage.coordinators.coreDumpsStorageSize | Size of the core dumps PVC on coordinators | 10Gi |
storage.coordinators.coreDumpsMountPath | Mount path for core dumps on coordinators | /var/core/memgraph |
storage.coordinators.coreDumpsImage.repository | Image repository for the coordinator core-dumps init container. | docker.io/library/busybox |
storage.coordinators.coreDumpsImage.tag | Image tag for the coordinator core-dumps init container. | latest |
storage.coordinators.coreDumpsImage.pullPolicy | Image pull policy for the coordinator core-dumps init container. | IfNotPresent |
storage.coordinators.extraVolumes | Additional volumes to add to coordinator pods | [] |
storage.coordinators.extraVolumeMounts | Additional volume mounts to add to coordinator containers | [] |
externalAccessConfig.coordinator.serviceType | IngressNginx, NodePort, CommonLoadBalancer or LoadBalancer. By default, no external service will be created. | "" |
externalAccessConfig.coordinator.annotations | Annotations for external services attached to coordinators. | {} |
externalAccessConfig.dataInstance.serviceType | IngressNginx, NodePort or LoadBalancer. By default, no external service will be created. | "" |
externalAccessConfig.dataInstance.annotations | Annotations for external services attached to data instances. | {} |
externalAccessConfig.gateway.enabled | Enable Gateway API external access. | false |
externalAccessConfig.gateway.gatewayClassName | Name of a pre-existing GatewayClass. Required when creating a new Gateway. | "" |
externalAccessConfig.gateway.existingGatewayName | Name of an existing Gateway to attach routes to. Skips Gateway creation. | "" |
externalAccessConfig.gateway.existingGatewayNamespace | Namespace of the existing Gateway. Defaults to release namespace. | "" |
externalAccessConfig.gateway.annotations | Annotations for the Gateway resource. | {} |
externalAccessConfig.gateway.labels | Labels for the Gateway resource. | {} |
externalAccessConfig.gateway.dataPortBase | Base port for data instance Gateway listeners (dataPortBase + index). | 9000 |
externalAccessConfig.gateway.coordinatorPortBase | Base port for coordinator Gateway listeners (coordinatorPortBase + id). | 9010 |
headlessService.enabled | Specifies whether headless services will be used inside K8s network on all instances. | false |
ports.boltPort | Bolt port used on coordinator and data instances. | 7687 |
ports.managementPort | Management port used on coordinator and data instances. | 10000 |
ports.replicationPort | Replication port used on data instances. | 20000 |
ports.coordinatorPort | Coordinator port used on coordinators. | 12000 |
ports.metricsPort | Metrics port for coordinators and data instances. Opened only if prometheus.enabled is set to true. | 9091 |
affinity.unique | Schedule pods on different nodes in the cluster | false |
affinity.parity | Schedule pods on the same node with maximum one coordinator and one data node | false |
affinity.nodeSelection | Schedule pods on nodes with specific labels | false |
affinity.roleLabelKey | Label key for node selection | role |
affinity.dataNodeLabelValue | Label value for data nodes | data-node |
affinity.coordinatorNodeLabelValue | Label value for coordinator nodes | coordinator-node |
container.data.livenessProbe.failureThreshold | Failure threshold for liveness probe | 20 |
container.data.livenessProbe.timeoutSeconds | Timeout for liveness probe | 10 |
container.data.livenessProbe.periodSeconds | Period seconds for liveness probe | 5 |
container.data.readinessProbe.failureThreshold | Failure threshold for readiness probe | 20 |
container.data.readinessProbe.timeoutSeconds | Timeout for readiness probe | 10 |
container.data.readinessProbe.periodSeconds | Period seconds for readiness probe | 5 |
container.data.startupProbe.failureThreshold | Failure threshold for startup probe | 1440 |
container.data.startupProbe.timeoutSeconds | Timeout for probe | 10 |
container.data.startupProbe.periodSeconds | Period seconds for startup probe | 10 |
container.data.terminationGracePeriodSeconds | Grace period for data pod termination. Increase when --storage-snapshot-on-exit is enabled so the snapshot has time to finish. | 30 |
container.coordinators.livenessProbe.failureThreshold | Failure threshold for liveness probe | 20 |
container.coordinators.livenessProbe.timeoutSeconds | Timeout for liveness probe | 10 |
container.coordinators.livenessProbe.periodSeconds | Period seconds for liveness probe | 5 |
container.coordinators.readinessProbe.failureThreshold | Failure threshold for readiness probe | 20 |
container.coordinators.readinessProbe.timeoutSeconds | Timeout for readiness probe | 10 |
container.coordinators.readinessProbe.periodSeconds | Period seconds for readiness probe | 5 |
container.coordinators.startupProbe.failureThreshold | Failure threshold for startup probe | 20 |
container.coordinators.startupProbe.timeoutSeconds | Timeout for probe | 10 |
container.coordinators.startupProbe.periodSeconds | Period seconds for startup probe | 10 |
container.coordinators.terminationGracePeriodSeconds | Grace period for coordinator pod termination. | 30 |
data | Configuration for data instances | See data section |
coordinators | Configuration for coordinator instances | See coordinators section |
sysctlInitContainer.enabled | Enable the init container to set sysctl parameters | true |
sysctlInitContainer.maxMapCount | Value for vm.max_map_count to be set by the init container | 262144 |
sysctlInitContainer.image.repository | Image repository for the sysctl init container | library/busybox |
sysctlInitContainer.image.tag | Image tag for the sysctl init container | latest |
sysctlInitContainer.image.pullPolicy | Image pull policy for the sysctl init container | IfNotPresent |
secrets.name | Name of the Kubernetes Secret holding the Memgraph Enterprise license and organization name. Must exist before helm install. | memgraph-secrets |
secrets.licenseKey | Key in the Secret whose value is exposed as MEMGRAPH_ENTERPRISE_LICENSE to data and coordinator pods. | MEMGRAPH_ENTERPRISE_LICENSE |
secrets.organizationKey | Key in the Secret whose value is exposed as MEMGRAPH_ORGANIZATION_NAME to data and coordinator pods. | MEMGRAPH_ORGANIZATION_NAME |
resources.coordinators | CPU/Memory resource requests/limits for coordinators. Left empty by default. | {} |
resources.data | CPU/Memory resource requests/limits for data instances. Left empty by default. | {} |
prometheus.enabled | If set to true, K8s resources representing Memgraph’s Prometheus exporter will be deployed. | false |
prometheus.namespace | Namespace in which kube-prometheus-stack and Memgraph’s Prometheus exporter are installed. When empty, the release namespace is used. | "" |
prometheus.memgraphExporter.port | The port on which Memgraph’s Prometheus exporter is available. | 9115 |
prometheus.memgraphExporter.pullFrequencySeconds | How often will Memgraph’s Prometheus exporter pull data from Memgraph instances. | 5 |
prometheus.memgraphExporter.repository | The repository where Memgraph’s Prometheus exporter image is available. | docker.io/memgraph/prometheus-exporter |
prometheus.memgraphExporter.tag | The tag of Memgraph’s Prometheus exporter image. | 0.2.1 |
prometheus.memgraphExporter.extraVolumes | Additional volumes mounted on the mg-exporter Deployment (e.g. ConfigMaps with custom exporter configs). | [] |
prometheus.memgraphExporter.extraVolumeMounts | Additional volume mounts for the mg-exporter container. | [] |
prometheus.serviceMonitor.enabled | If enabled, a ServiceMonitor object will be deployed. | true |
prometheus.serviceMonitor.kubePrometheusStackReleaseName | The release name under which kube-prometheus-stack chart is installed. | kube-prometheus-stack |
prometheus.serviceMonitor.interval | How often will Prometheus pull data from Memgraph’s Prometheus exporter. | 15s |
vmagentRemote.enabled | Deploy a vmagent Deployment that scrapes mg-exporter and remote-writes to a Prometheus-compatible endpoint. | false |
vmagentRemote.namespace | Namespace for the vmagent Deployment and its resources. Defaults to prometheus.namespace when empty. | "" |
vmagentRemote.image.repository | vmagent image repository. | victoriametrics/vmagent |
vmagentRemote.image.tag | vmagent image tag. | v1.139.0 |
vmagentRemote.image.pullPolicy | vmagent image pull policy. | IfNotPresent |
vmagentRemote.remoteWrite.url | Prometheus remote_write URL. Required when vmagentRemote.enabled=true. | "" |
vmagentRemote.remoteWrite.basicAuth.secretName | Kubernetes Secret holding basic-auth credentials for remote_write. When empty, basic auth is not configured. | "" |
vmagentRemote.remoteWrite.basicAuth.usernameKey | Key in the basic-auth Secret holding the username. | username |
vmagentRemote.remoteWrite.basicAuth.passwordKey | Key in the basic-auth Secret holding the password. | password |
vmagentRemote.scrapeInterval | Global scrape_interval applied to vmagent scrape jobs. | 15s |
vmagentRemote.externalLabels | External labels attached to every scraped sample before remote-write. | {} |
vmagentRemote.resources | Resource requests/limits for the vmagent container. | {} |
vmagentRemote.httpPort | vmagent local HTTP listen port for metrics/debug (the remote-write target is remoteWrite.url). | 8429 |
vmagentRemote.kubernetes.enabled | Enable scraping of Kubernetes infrastructure metrics used by kube-prometheus dashboards. | false |
vmagentRemote.kubernetes.kubeStateMetrics.enabled | Scrape kube-state-metrics. | true |
vmagentRemote.kubernetes.kubeStateMetrics.jobName | Prometheus job label for kube-state-metrics. Keep aligned with dashboard/recording-rule expectations. | kube-state-metrics |
vmagentRemote.kubernetes.kubeStateMetrics.targets | Static scrape targets for kube-state-metrics. | [kube-prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080] |
vmagentRemote.kubernetes.nodeExporter.enabled | Scrape node-exporter. | true |
vmagentRemote.kubernetes.nodeExporter.jobName | Prometheus job label for node-exporter. | node-exporter |
vmagentRemote.kubernetes.nodeExporter.useKubernetesDiscovery | Discover node-exporter pods via Kubernetes SD so namespace/pod/node labels are present for recording rules. | false |
vmagentRemote.kubernetes.nodeExporter.podMetricsPort | Pod port used by Kubernetes SD to match node-exporter pods. | "9100" |
vmagentRemote.kubernetes.nodeExporter.appNameLabel | Expected value of app.kubernetes.io/name on node-exporter pods. | prometheus-node-exporter |
vmagentRemote.kubernetes.nodeExporter.appInstanceLabel | Expected value of app.kubernetes.io/instance on node-exporter pods. | kube-prometheus-stack-prometheus-node-exporter |
vmagentRemote.kubernetes.nodeExporter.targets | Static fallback targets for node-exporter when useKubernetesDiscovery=false. | [kube-prometheus-stack-prometheus-node-exporter.monitoring.svc.cluster.local:9100] |
vmagentRemote.kubernetes.kubelet.enabled | Scrape kubelet metrics via the Kubernetes API server node proxy. | true |
vmagentRemote.kubernetes.kubelet.jobName | Prometheus job label for kubelet. Keep as kubelet so kube-prometheus dashboards and rules still match. | kubelet |
vmagentRemote.kubernetes.kubelet.metricsPath | Metrics path for the primary kubelet scrape (cAdvisor). | /metrics/cadvisor |
vmagentRemote.kubernetes.kubelet.additionalMetricsEnabled | Enable a second kubelet scrape job for /metrics alongside the cAdvisor job. | true |
vmagentRemote.kubernetes.kubelet.additionalJobName | Prometheus job label for the additional kubelet scrape. | kubelet-metrics |
vmagentRemote.kubernetes.kubelet.additionalMetricsPath | Metrics path for the additional kubelet scrape. | /metrics |
vmagentRemote.kubernetes.kubelet.apiServerAddress | Kubernetes API server address used to proxy kubelet scrapes. | kubernetes.default.svc:443 |
vmagentRemote.kubernetes.kubelet.insecureSkipVerify | Skip TLS verification of the kube-apiserver serving cert when scraping kubelet. | false |
labels.coordinators.podLabels | Enables you to set labels on a pod level for coordinators. | {} |
labels.coordinators.statefulSetLabels | Enables you to set labels on a stateful set level for coordinators. | {} |
labels.coordinators.serviceLabels | Enables you to set labels on a service level for coordinators. | {} |
labels.data.podLabels | Enables you to set labels on a pod level for data instances. | {} |
labels.data.statefulSetLabels | Enables you to set labels on a stateful set level for data instances. | {} |
labels.data.serviceLabels | Enables you to set labels on a service level for data instances. | {} |
updateStrategy.type | Update strategy for StatefulSets. Possible values are RollingUpdate and OnDelete | RollingUpdate |
extraEnv.data | Env variables that users can define and are applied to data instances | [] |
extraEnv.coordinators | Env variables that users can define and are applied to coordinators | [] |
commonArgs.data.logging.log_level | Log level applied to every data instance via --log-level. Must not be empty. | TRACE |
commonArgs.data.logging.also_log_to_stderr | When true, appends --also-log-to-stderr to every data instance. Must be a boolean. | true |
commonArgs.data.logging.log_file | Log-file path applied to every data instance via --log-file. Empty disables file logging. | /var/log/memgraph/memgraph.log |
commonArgs.coordinators.logging.log_level | Log level applied to every coordinator via --log-level. Must not be empty. | TRACE |
commonArgs.coordinators.logging.also_log_to_stderr | When true, appends --also-log-to-stderr to every coordinator. Must be a boolean. | true |
commonArgs.coordinators.logging.log_file | Log-file path applied to every coordinator via --log-file. Empty disables file logging. | /var/log/memgraph/memgraph.log |
userContainers.data | Additional sidecar containers for data instance pods | [] |
userContainers.coordinators | Additional sidecar containers for coordinator pods | [] |
tolerations.data | Tolerations for data instance pods | [] |
tolerations.coordinators | Tolerations for coordinator pods | [] |
initContainers.data | Init containers that users can define that will be applied to data instances. | [] |
initContainers.coordinators | Init containers that users can define that will be applied to coordinators. | [] |
coreDumpUploader.enabled | Enable the core dump S3 uploader sidecar. Requires storage.<role>.createCoreDumpsClaim to be true. | false |
coreDumpUploader.image.repository | Docker image repository for the uploader sidecar | amazon/aws-cli |
coreDumpUploader.image.tag | Docker image tag for the uploader sidecar | 2.33.28 |
coreDumpUploader.image.pullPolicy | Image pull policy for the uploader sidecar | IfNotPresent |
coreDumpUploader.s3BucketName | S3 bucket name where core dumps will be uploaded | "" |
coreDumpUploader.s3Prefix | S3 key prefix (folder) for uploaded core dumps | core-dumps |
coreDumpUploader.awsRegion | AWS region of the S3 bucket | us-east-1 |
coreDumpUploader.pollIntervalSeconds | How often (in seconds) the sidecar checks for new core dump files | 30 |
coreDumpUploader.secretName | Name of the K8s Secret containing AWS credentials | aws-s3-credentials |
coreDumpUploader.accessKeySecretKey | Key in the K8s Secret for AWS_ACCESS_KEY_ID | AWS_ACCESS_KEY_ID |
coreDumpUploader.secretAccessKeySecretKey | Key in the K8s Secret for AWS_SECRET_ACCESS_KEY | AWS_SECRET_ACCESS_KEY |
For the data and coordinators sections, each item in the list has the
following parameters:
| Parameter | Description | Default |
|---|---|---|
id | ID of the instance | 0 for data, 1 for coordinators |
internalAccessAnnotations | Per-instance annotations for the internal ClusterIP Service. | {} |
externalAccessAnnotations | Per-instance annotations for the external access Service, merged with global annotations. | {} |
args | Per-instance Memgraph CLI flags. Append-only — see the note below for flags the chart manages. | ["--storage-snapshot-on-exit=false"] for data, [] for coordinators |
The args field accepts any Memgraph CLI flag except the following, which
the chart appends automatically and rejects when set per-instance:
--bolt-port, --management-port, --coordinator-port, --coordinator-id,
--coordinator-hostname, --data-directory, --log-level,
--also-log-to-stderr, and --log-file. Configure those through ports.*
and commonArgs.{data,coordinators}.logging.* instead.
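For illustration, a sketch of per-instance args that keeps the default on-exit behavior and adds one extra flag; the query timeout flag is just an example of a flag the chart does not manage:
data:
  - id: "0"
    args:
      - "--storage-snapshot-on-exit=false"
      - "--query-execution-timeout-sec=1200"
  - id: "1"
    args:
      - "--storage-snapshot-on-exit=false"
      - "--query-execution-timeout-sec=1200"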
For all available database settings, refer to the configuration settings docs.
In-Service Software Upgrade (ISSU)
Memgraph’s High Availability supports in-service software upgrades (ISSU). This guide explains the process when using HA Helm charts. The procedure is very similar for native deployments.
Some Memgraph versions require additional upgrade steps beyond the standard ISSU procedure. Check the Migrating to v3.9 HA page for version-specific instructions before proceeding.
Important: Although the upgrade process is designed to complete
successfully, unexpected issues may occur. We strongly recommend doing a backup
of your lib directory on all of your StatefulSets or native instances
depending on the deployment type.
Prerequisites
If you are using HA Helm charts, set the following configuration before doing any upgrade.
updateStrategy.type: OnDelete
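In values.yaml form, this prerequisite is (a minimal sketch):
updateStrategy:
  type: OnDelete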
Depending on the infrastructure on which you run your Memgraph cluster, the details will differ a bit, but the backbone is the same.
Prepare a backup of all data from all instances. This ensures you can safely downgrade the cluster to the last stable version you had.
- For native deployments, tools like cp or rsync are sufficient.
- For Kubernetes, create a VolumeSnapshotClass with a yaml file similar to this:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-azure-disk-snapclass
driver: disk.csi.azure.com
deletionPolicy: Delete
Apply it:
kubectl apply -f azure_class.yaml
- On Google Kubernetes Engine, the default CSI driver is pd.csi.storage.gke.io, so make sure to change the driver field accordingly (see the sketch below).
- On AWS EKS, refer to the AWS snapshot controller docs.
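For example, a GKE variant of the VolumeSnapshotClass above would only swap the driver (the class name is illustrative):
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-gce-pd-snapclass
driver: pd.csi.storage.gke.io
deletionPolicy: Delete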
Create snapshots
Now you can create a VolumeSnapshot of the lib directory using the yaml file:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: coord-3-snap # Use a unique name for each instance
namespace: default
spec:
volumeSnapshotClassName: csi-azure-disk-snapclass
source:
persistentVolumeClaimName: memgraph-coordinator-3-lib-storage-memgraph-coordinator-3-0
Apply it:
kubectl apply -f azure_snapshot.yaml
Repeat for every instance in the cluster.
Update configuration
Next, update the image.tag field in the values.yaml configuration file
to the version to which you want to upgrade your cluster.
- In your values.yaml, update the image version:
image:
  tag: <new_version>
- Apply the upgrade:
helm upgrade <release> <chart> -f <path_to_values.yaml>
Since we are using updateStrategy.type=OnDelete, this step will not restart
any pod; it will just prepare the pods for running the new version.
- For native deployments, ensure the new binary is available.
Upgrade procedure (zero downtime)
Our procedure for achieving zero-downtime upgrades consists of restarting one instance at a time. Memgraph uses primary–secondary replication. To avoid downtime:
- Upgrade replicas first.
- Upgrade the main instance.
- Upgrade coordinator followers, then the leader.
In order to find out on which pod/server the current main and the current cluster leader sits, run:
SHOW INSTANCES;
Upgrade replicas
If you are using K8s, the upgrade can be performed by deleting the pod. Start by
deleting the replica pod (in this example replica is running on the pod
memgraph-data-1-0):
kubectl delete pod memgraph-data-1-0
Native deployment: stop the old binary and start the new one.
Before starting the upgrade of the next pod, it is important to wait until all pods are ready. Otherwise, you may end up with data loss. On K8s you can easily achieve that by running:
kubectl wait --for=condition=ready pod --all
For a native deployment, manually check that all your instances are alive.
This step should be repeated for all of your replicas in the cluster.
Upgrade the main
Before deleting the main pod, check replication lag to see whether replicas are behind MAIN:
SHOW REPLICATION LAG;
If replicas are behind, your upgrade will be prone to data loss. In order to achieve a zero-downtime upgrade without any data loss, either:
- Use STRICT_SYNC mode (writes will be blocked during upgrade), or
- Wait until replicas are fully caught up, then pause writes. This way, you can use any replication mode. Read queries should, however, work without any issues regardless of the replica type you are using.
Upgrade the main pod:
kubectl delete pod memgraph-data-0-0
kubectl wait --for=condition=ready pod --all
Upgrade coordinators
The upgrade of coordinators is done in exactly the same way. Start by upgrading followers and finish with deleting the leader pod:
kubectl delete pod memgraph-coordinator-3-0
kubectl wait --for=condition=ready pod --all
kubectl delete pod memgraph-coordinator-2-0
kubectl wait --for=condition=ready pod --all
kubectl delete pod memgraph-coordinator-1-0
kubectl wait --for=condition=ready pod --all
Verify upgrade
Your upgrade should now be finished. To check that everything works, run:
SHOW VERSION;
It should show you the new Memgraph version.
Rollback
If an error happens during the upgrade, or something doesn't work even after
all of your pods have been upgraded (e.g. write queries don't pass), you can
safely downgrade your cluster to the previous version using the
VolumeSnapshots you took on K8s or the file backups for native deployments.
- Kubernetes:
helm uninstall <release>
In values.yaml, for all instances set:
restoreDataFromSnapshot: true
Make sure to set the correct name of the snapshot you will use to recover your instances.
- Native deployments: restore from your file backups.
If you’re doing an upgrade on minikube, it is important to make sure that the
snapshot resides on the same node on which the StatefulSet is installed.
Otherwise, it won’t be able to restore StatefulSet's attached
PersistentVolumeClaim from the VolumeSnapshot.