Embeddings in GraphRAG: How Memgraph Computes and Scales Them

By Marko Budiselic
6 min read · November 11, 2025

Modern Retrieval-Augmented Generation (RAG) systems depend heavily on embeddings. When these embeddings need to capture not just the meaning of text but also the structure of relationships between entities, you enter the world of GraphRAG.

Here, vector representations are informed by both language and graph topology, while graph algorithms enhance initial searches to retrieve richer, more context-aware results.

In this setup, embeddings and graphs work hand in hand. RAG tools like Unstructured2Graph handle the conversion of unstructured text into graph structures, which embeddings then operate on to power retrieval and reasoning.

For machine learning and infrastructure engineers, this means unlocking better context and higher-quality retrieval, but it also introduces challenges in scaling the computation and management of embeddings.

In this article, we’ll explore what embeddings are, why they matter for GraphRAG systems, and how Memgraph handles them efficiently to support scalable, real-time reasoning.

What Are Embeddings?

Embeddings are numerical vector representations of data, such as text, images, or graph nodes, that capture their meaning or relationships in a continuous vector space. Instead of comparing words or entities directly, systems compare their embeddings using measures like cosine similarity to determine how semantically close they are.
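To make the comparison step concrete, here is a minimal Python sketch of cosine similarity between embedding vectors. The vectors are made up for illustration; real embeddings typically have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of magnitudes; values close to 1
    # mean the vectors point in nearly the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings for illustration only.
car = np.array([0.8, 0.1, 0.3, 0.0])
automobile = np.array([0.7, 0.2, 0.4, 0.1])
banana = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(car, automobile))  # high: semantically close
print(cosine_similarity(car, banana))      # low: semantically distant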

This method allows for powerful, flexible search and reasoning across multiple data types. But creating and managing embeddings comes with two key challenges:

  1. How to compute embeddings effectively
  2. How to efficiently compare them to retrieve relevant results

From Manual Feature Engineering to Neural Embeddings

Before neural embeddings became standard, researchers relied on hand-engineered vectorization techniques to represent data. In text processing, approaches like Bag-of-Words (BoW) and TF-IDF (Term Frequency–Inverse Document Frequency) transformed documents into sparse vectors based on word counts or importance scores. While simple, they failed to capture semantic relationships between words.
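As a quick illustration of how sparse these representations are, here is a small sketch using scikit-learn's TfidfVectorizer; the corpus is invented for the example.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car drove down the road",
    "the automobile sped along the highway",
    "the banana was ripe",
]

# Each document becomes a sparse vector of term weights; terms that are rare
# across the corpus receive higher weights than common words like "the".
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(tfidf.shape)                          # (3, vocabulary size)
print(vectorizer.get_feature_names_out())   # one dimension per distinct word

Note that "car" and "automobile" occupy completely separate dimensions here, which leads directly to the limitation described below.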

Later techniques such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) added dimensionality reduction and topic modeling, but still relied on linear assumptions. Classical ML models like Support Vector Machines (SVMs) used these handcrafted feature vectors and required extensive feature engineering.

The problem? These representations treated every word or entity as independent. So, “car” and “automobile” appeared as unrelated as “car” and “banana.”

Neural embeddings changed this entirely. Instead of depending on human-designed features, they automatically learn rich semantic and relational patterns directly from data.

How Neural Embeddings Work

Modern neural networks map input data into high-dimensional vector spaces where similar items cluster close together. For text, transformer-based models like BERT or OpenAI’s embedding models process token sequences and produce dense vectors that summarize context.

For graphs, algorithms such as Node2Vec, DeepWalk, or GraphSAGE learn embeddings by analyzing structural patterns. This way, they can capture how nodes co-occur in random walks or share neighborhoods.
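The random-walk idea can be sketched in a few lines. The following DeepWalk-style example uses networkx and gensim with a toy graph and illustrative hyperparameters; it is not Memgraph's implementation.

import random
import networkx as nx
from gensim.models import Word2Vec

# Toy graph standing in for a knowledge graph.
G = nx.karate_club_graph()

def random_walk(graph, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]

# Generate several walks per node; nodes that co-occur in walks end up close
# together in the embedding space, much like words that co-occur in sentences.
walks = [random_walk(G, node) for node in G.nodes() for _ in range(20)]

model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1)
print(model.wv[str(0)][:5])  # first few dimensions of node 0's embedding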

These learned embeddings form the foundation for retrieval, clustering, link prediction, and reasoning in RAG systems, where both semantic meaning and graph structure drive context retrieval.

The Challenges of Embeddings at Scale

Embeddings are the numerical backbone of retrieval and reasoning in modern AI systems. But using them at scale introduces practical challenges:

  • Computation Cost: Generating embeddings for millions of documents or graph nodes can quickly consume CPU/GPU hours or rack up significant API costs.
  • Storage Requirements: Embeddings tend to be large. Standard embeddings (384–3,072 dimensions) can occupy 1.5–12 KB per item, quickly scaling to terabytes for large datasets, even before indexing overhead (see the back-of-envelope sketch after this list).
  • Freshness: Embeddings are not static. They become outdated as data evolves, with new documents, modified nodes, or changing relationships requiring updates or recomputation.
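Those per-item figures follow directly from storing one 32-bit float per dimension. Here is a back-of-envelope sketch, assuming float32 embeddings and ignoring index overhead:

# Rough storage estimate for float32 embeddings (4 bytes per dimension).
BYTES_PER_FLOAT32 = 4

def embedding_storage_gb(num_items: int, dimensions: int) -> float:
    return num_items * dimensions * BYTES_PER_FLOAT32 / 1024**3

# 384 dimensions -> ~1.5 KB per item, 3,072 dimensions -> ~12 KB per item.
print(embedding_storage_gb(4_000_000, 384))     # ~5.7 GB
print(embedding_storage_gb(100_000_000, 3072))  # ~1.1 TB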

Maintaining high-quality embeddings at scale therefore demands continuous optimization and smart infrastructure design. For GraphRAG pipelines in particular, this is a core infrastructure challenge.

A Real-World Example from Memgraph

If all of this sounds abstract, let’s look at a simple real-world example. In one of our GraphRAG use cases, the Memgraph team computed node sentence embeddings by combining all node properties and encoding them using the all-MiniLM-L6-v2 model.
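A simplified sketch of that idea is shown below, using the sentence-transformers library. The property names and the node_to_sentence helper are illustrative rather than the exact pipeline we used.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def node_to_sentence(node_properties: dict) -> str:
    # Concatenate all property key/value pairs into one sentence-like string.
    return " ".join(f"{key}: {value}" for key, value in node_properties.items())

# Hypothetical node pulled from the graph.
node = {"name": "Ada Lovelace", "occupation": "mathematician", "born": 1815}

embedding = model.encode(node_to_sentence(node))
print(embedding.shape)  # (384,) for all-MiniLM-L6-v2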

When a user submitted a query, we:

  1. Calculated its embedding.
  2. Searched for the most similar nodes in the graph.
  3. Performed relevance expansion and ordering based on similarity and relationships.
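Steps 1 and 2 can be sketched as follows; step 3, the relevance expansion over relationships, runs against the graph itself and is omitted here. The precomputed node_embeddings matrix and node_ids array are assumptions for the example.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Assumed to be precomputed: one embedding row per node, plus a lookup of node ids.
node_embeddings = np.load("node_embeddings.npy")   # shape (N, 384)
node_ids = np.load("node_ids.npy")                 # shape (N,)

# Step 1: embed the user's query.
query_embedding = model.encode("Who worked on the analytical engine?")

# Step 2: rank nodes by cosine similarity to the query embedding.
norms = np.linalg.norm(node_embeddings, axis=1) * np.linalg.norm(query_embedding)
scores = (node_embeddings @ query_embedding) / norms
top_nodes = node_ids[np.argsort(scores)[::-1][:10]]
print(top_nodes)  # candidates handed to step 3 for graph-based expansion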

[Figure: embedding computation in a GraphRAG pipeline with Memgraph]

The main challenge lay in scaling the embedding computation. Encoding 4 million nodes on CPU took about 10 hours. Switching to GPU acceleration allowed batch processing (1,000 embeddings per batch), cutting runtime to 19 minutes. This is an impressive 32× improvement. Depending on the model, dataset size, and batching strategy, GPUs can achieve up to 100× faster performance than CPUs.
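For scale, the encoding itself might look like the sketch below: load the model on the GPU and let encode() process the node sentences in batches. The batch size and the placeholder node_sentences list are illustrative.

from sentence_transformers import SentenceTransformer

# Requires a CUDA-capable GPU; drop device="cuda" to run on CPU instead.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

# Placeholder input; in practice, one concatenated property string per graph node.
node_sentences = [f"node {i}" for i in range(10_000)]

# encode() batches the input internally; larger batches keep the GPU saturated.
embeddings = model.encode(node_sentences, batch_size=1000, show_progress_bar=True)
print(embeddings.shape)  # (10000, 384)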

This demonstrates why efficient, scalable embedding computation is critical for GraphRAG pipelines. Without it, even small experiments can become bottlenecks in production.

Running Embedding Pipelines in Memgraph

To make embedding-powered pipeline deployment straightforward, the Memgraph team created a Docker image that packages the embedding query module with all CUDA dependencies in one container. The base image extends memgraph/memgraph-mage, which already includes the database and the MAGE algorithms library.

To start the container using the right image, here’s an example command:

docker run -d --gpus=all -p 7687:7687 \
    memgraph/memgraph-mage:3.6.2-relwithdebinfo-cuda \
    --schema-info-enabled=True \
    --also-log-to-stderr --log-level=TRACE

To compute embeddings across the entire graph with default parameters:

CALL embeddings.node_sentence() YIELD success;

To see available options, such as running on subsets of nodes or using different models, refer to the Embeddings Query Module documentation.

Optimizing Storage with Advanced Vector Search

Traditionally, databases store primary data and create indexes separately. Memgraph’s Vector Search feature initially followed this pattern. But storing embeddings twice, once as primary data and once in the index, proved inefficient.

To solve this, Memgraph introduced the Advanced Vector Search initiative, which:

  • Eliminates duplication by using indexes as the primary storage for embedding data.
  • Reduces memory footprint and improves query performance.
  • Simplifies maintenance for large-scale embedding storage.

You can explore this feature and ongoing updates via the Memgraph roadmap and Memgraph Docs.

Wrapping Up

Embeddings connect meaning, structure, and context. They serve as the bridge between unstructured information and structured graph intelligence. Memgraph not only supports efficient embedding computation but also provides the infrastructure to manage and scale them for real-world GraphRAG pipelines.

From GPU-accelerated embedding generation to optimized vector storage, Memgraph ensures that embeddings remain fast, compact, and contextually aware, while powering the next generation of retrieval-augmented AI systems.

Try it today:

Explore how embeddings and vector search come together to make GraphRAG workflows more intelligent and production-ready.
