
From Unstructured Data to Entity Graph: 5 Questions to Ask Before You Get Started
Before you start turning unstructured documents into Entity Graphs, it helps to understand why this challenge matters. Studies show that between 80 and 90 percent of enterprise data is unstructured. Yet most organizations still cannot effectively use it. In fact, only 18 percent of organizations in a Deloitte survey reported being able to take advantage of such data.
With most knowledge still trapped in unstructured form, turning it into graph context is a huge opportunity. That is why it helps to step back and ask the right technical questions.
Building a GraphRAG pipeline is not only about entity extraction. The real challenge lies in designing flexible schemas, optimizing embeddings, managing access control, and balancing speed with accuracy, all while staying within your budget.
Our latest Community Call, introducing the Unstructured2Graph tool in the Memgraph AI Toolkit, covered exactly these topics. Developers and data engineers dove deep, asking detailed questions about chunking, embeddings, ontology design, and how to make the pipeline efficient in real production environments.
Here are the five key questions that reveal what it really takes to go from raw text to a live, queryable entity graph in Memgraph.
1. How Should You Approach Schema Design?
You rarely know what your schema should look like until you start extracting entities. Instead of defining everything upfront, begin with a flexible structure that can evolve. This is where Hybrid Graph Modeling (HyGM) comes in. It supports iterative schema refinement, allowing you to update the model as you analyze more documents and prompts.
Many engineers assume an ontology must come first, but that is not always necessary. If you do not yet know what exists in your documents, defining an ontology upfront adds unnecessary complexity. Start by extracting entities using LightRAG and observe what structures emerge naturally in the graph.
As recurring patterns or relationships appear, you can then decide whether building or integrating an ontology adds value for your specific use case.
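The "let the schema emerge" idea can be sketched as a simple tally over extracted triples. The triples below are made-up examples standing in for LightRAG's extraction output, and `emerging_schema` is a hypothetical helper, not part of any Memgraph or LightRAG API:

```python
from collections import Counter

# Hypothetical output of an extraction pass (e.g., LightRAG) over a few chunks:
# each triple is (subject, subject_type, relation, object, object_type).
triples = [
    ("Memgraph", "Product", "SUPPORTS", "vector search", "Feature"),
    ("LightRAG", "Tool", "EXTRACTS", "entities", "Concept"),
    ("Memgraph", "Product", "RUNS_ON", "ARM", "Architecture"),
    ("LightRAG", "Tool", "CALLS", "GPT", "Model"),
]

def emerging_schema(triples):
    """Tally entity labels and relationship types to see which
    structures recur often enough to be worth formalizing."""
    labels, rels = Counter(), Counter()
    for _subj, s_type, rel, _obj, o_type in triples:
        labels[s_type] += 1
        labels[o_type] += 1
        rels[rel] += 1
    return labels, rels

labels, rels = emerging_schema(triples)
print(labels.most_common(3))  # most frequent entity labels so far
print(rels.most_common(3))    # most frequent relationship types
```

Running a tally like this after each ingestion batch shows which labels and relationships are stable and which are one-offs, which is exactly the signal HyGM-style iterative refinement needs.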
2. Where and How Should You Store Embeddings?
Embeddings are essential if you want to perform vector search. They allow you to find related concepts even when exact keywords are missing.
You might wonder whether embeddings should be stored as node properties or as separate linked nodes. The answer: keep them as a property on the node itself.
Right now, Memgraph stores vector data both as node properties and within the index. However, we are working on an advanced vector search feature to be released in December that will store embeddings only inside the index.
This will reduce memory usage by about half and improve performance since vectors can be large, often hundreds or thousands of elements long, especially when encoding video content. Vector search runs most efficiently in memory, so avoiding data duplication is key.
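The "about half" figure follows directly from removing the duplicate copy. A back-of-envelope calculation, assuming one million chunk nodes with 1536-dimensional float32 embeddings (both figures are illustrative assumptions, not Memgraph measurements):

```python
def embedding_memory_mb(num_nodes, dim, bytes_per_float=4, copies=1):
    """Approximate memory used by embeddings alone.
    copies=2 models today's setup (node property + index);
    copies=1 models the planned index-only storage."""
    return num_nodes * dim * bytes_per_float * copies / (1024 ** 2)

# 1M chunk nodes, 1536-dim float32 vectors (assumed sizes):
duplicated = embedding_memory_mb(1_000_000, 1536, copies=2)
index_only = embedding_memory_mb(1_000_000, 1536, copies=1)
print(f"property + index: {duplicated:.0f} MB")  # ~11719 MB
print(f"index only:       {index_only:.0f} MB")  # ~5859 MB
```

At this scale the embeddings alone cost nearly 12 GB when duplicated, which is why keeping a single in-memory copy matters for an in-memory database.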
3. In Most Enterprises, Not Everyone Has Access to All Documents. How Can You Manage Access Control on Your Graph?
Enterprise environments often involve restricted data access. Not every user should see every document or entity. Memgraph handles this through label-based access control (also known as fine-grained access control), which lets you assign access rules to different chunk groups such as chunk A, chunk B, or chunk C.
This label-based method integrates directly with ingestion pipelines. As documents are processed, they are automatically labeled, and with LBAC, users can only see the labels they have been granted. This approach avoids the complexity of maintaining separate filtering systems and ensures compliance with internal data security requirements.
Memgraph is also enhancing its role-based access control (RBAC) to complement label authorization, giving teams even finer control over who can query or modify specific datasets.
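The access model is easy to picture as a set comparison. The sketch below is a toy model of label-based filtering in plain Python, not Memgraph's actual enforcement (which happens inside the database engine, so restricted data never reaches the application):

```python
def visible_chunks(chunks, granted_labels):
    """Toy model of label-based access control: a chunk is visible
    only if the user holds every label attached to it.
    In Memgraph, this check is enforced by the database itself."""
    return [c for c in chunks if c["labels"] <= granted_labels]

chunks = [
    {"id": "chunk-A", "labels": {"Public"}},
    {"id": "chunk-B", "labels": {"Finance"}},
    {"id": "chunk-C", "labels": {"Finance", "Confidential"}},
]

analyst = {"Public", "Finance"}  # labels granted to this user (example)
print([c["id"] for c in visible_chunks(chunks, analyst)])
```

The analyst sees chunk-A and chunk-B but not chunk-C, because the `Confidential` label was never granted. Combining labels like this with RBAC roles gives both coarse (who can query) and fine (what they see) control.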
4. How Do You Contextualize and Extract Entities Correctly?
LightRAG automatically identifies entities at a conceptual level, but context matters. Each chunk represents a meaningful unit of information, and chunk size directly affects extraction quality. The smaller the chunk, the faster the processing, but the less contextual meaning it holds.
Processing a single chunk with LightRAG takes about ten seconds because it involves multiple LLM calls per chunk. To keep pipelines efficient, it is important to find a balance between speed and context depth. For example, using domain-specific chunks (paragraphs or sections) yields more coherent graphs than arbitrary word-based chunking.
If you already have keyword-tagged documents, you can attach those keywords to their corresponding chunks. This preserves semantic relationships across documents while keeping searches efficient.
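Domain-aware chunking is straightforward to implement yourself. A minimal sketch that splits on paragraph boundaries and packs them into size-bounded chunks (the `max_words` budget is an assumed tuning knob, not a LightRAG parameter):

```python
def paragraph_chunks(text, max_words=120):
    """Split on blank lines (paragraph boundaries) and pack paragraphs
    into chunks of at most max_words words, so each chunk stays a
    coherent unit instead of an arbitrary word-count slice.
    A single paragraph longer than max_words is kept whole."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("First paragraph about schemas.\n\n"
       "Second paragraph about embeddings.\n\n"
       + "A long paragraph. " * 80)
for i, chunk in enumerate(paragraph_chunks(doc)):
    print(i, len(chunk.split()), "words")
```

The two short paragraphs land in one chunk while the long one becomes its own, so each LLM call sees a self-contained unit of meaning, which is exactly the property arbitrary word-based chunking loses.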
5. How Do You Balance Accuracy, Speed, and Cost?
Every engineering team faces this trade-off. Large Language Models (LLMs) like GPT are exceptional at entity extraction but expensive to run. Traditional NLP tools like spaCy or sentence-transformers are cheaper but less precise.
The Unstructured2Graph pipeline takes a unique approach by combining LightRAG’s accuracy with Memgraph’s efficiency. It optimizes GPU usage, minimizes data duplication, and handles embeddings directly in the index to reduce processing time.
If you are running fully on-premise, GPU acceleration is highly recommended for computing embeddings. On average, GPUs can process embeddings up to 100 times faster than CPUs. Memgraph is lightweight and supports ARM architectures, making it a practical option for both cloud and local deployments.
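Before committing to a pipeline configuration, it helps to run the numbers. A back-of-envelope estimator, using the ~10 s/chunk latency quoted above; the per-chunk cost and concurrency figures are assumptions you should replace with your own measurements:

```python
def pipeline_estimate(num_chunks, llm_seconds_per_chunk=10.0,
                      llm_cost_per_chunk=0.01, concurrency=8):
    """Back-of-envelope estimate for a LightRAG-style extraction run.
    llm_cost_per_chunk and concurrency are assumed figures, not
    measured ones; the ~10 s/chunk latency matches the number above."""
    wall_clock_hours = num_chunks * llm_seconds_per_chunk / concurrency / 3600
    return {"hours": round(wall_clock_hours, 1),
            "usd": round(num_chunks * llm_cost_per_chunk, 2)}

# 50k chunks with 8 concurrent LLM calls (assumed throughput):
print(pipeline_estimate(50_000))
```

Even a rough estimate like this makes the trade-off concrete: doubling chunk size halves the number of LLM calls (and the bill), at the price of each call carrying more context.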
Wrapping Up
Moving from unstructured data to a connected, queryable graph is an iterative process. From schema evolution and access control to embedding strategies and cost optimization, each of these five questions highlights a critical design decision.
Unstructured2Graph, powered by unstructured.io and LightRAG, provides a practical path to build explainable, high-performance Knowledge Graphs that scale and efficiently support your GraphRAG workflows. Start small, refine continuously, and let the data guide the structure.