Unstructured2Graph
Every company sits on a pile of unstructured documents: reports, PDFs, research papers, policies, or meeting notes. They contain valuable knowledge, but little of it is connected or searchable.
With Unstructured2Graph, part of the Memgraph AI Toolkit, you can turn that unstructured text into a connected knowledge graph that LLMs can query and reason over.
Unstructured2Graph combines two powerful components:
- Unstructured IO - extracts, cleans, and chunks documents of various formats such as PDF, DOCX, or TXT.
- LightRAG - a graph-based reasoning layer that handles prompt engineering and entity extraction automatically, mapping entities and relationships into Memgraph.
Together, they convert raw text into a knowledge graph with nodes, edges, and embeddings ready for retrieval.
Getting started
In this guide, you’ll learn how to use Unstructured2Graph step by step. You’ll quickly go from setting up your project to creating your first entity graph.
Start Memgraph
Start by preparing your workspace and running Memgraph locally using Docker:
docker run -p 7687:7687 -p 7444:7444 --name memgraph memgraph/memgraph-mage

Open your terminal, VS Code, Cursor, or any other development environment you prefer. This is where you’ll run Python scripts connected to your Memgraph instance.
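If you want to confirm the instance is reachable before moving on, a quick Bolt connection check is enough. This is a minimal sketch and assumes you have the neo4j Python driver installed; Memgraph accepts Bolt connections on the port published above, and local instances run without authentication by default:

from neo4j import GraphDatabase

# Memgraph speaks the Bolt protocol on port 7687 (published by the docker run command above)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))
with driver.session() as session:
    status = session.run("RETURN 'Memgraph is up' AS status;").single()["status"]
    print(status)
driver.close()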
You are now ready to start building your graph.
Clone the Memgraph AI Toolkit
Next, clone the AI Toolkit repository, which contains the Unstructured2Graph module:
git clone https://github.com/memgraph/ai-toolkit.git
cd ai-toolkit/unstructured2graph

Install dependencies
Install uv, the package manager used in the AI Toolkit. Detailed installation steps are available in the uv documentation. Once uv is installed, use it to install the module’s dependencies:

# Install dependencies using uv
uv pip install -e .

You can then use uv to run the AI Toolkit packages easily.
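To confirm the installation resolved correctly, you can try importing the module from the project environment. This minimal check uses the same imports that appear in the ingestion example later in this guide:

# Sanity check: these are the same imports used in the ingestion example below
from memgraph_toolbox.api.memgraph import Memgraph
from lightrag_memgraph import MemgraphLightRAGWrapper
from unstructured2graph import from_unstructured, create_index

print("unstructured2graph and its dependencies are importable")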
Configure environment variables
Create a .env file to configure your OpenAI API key for LLM-based entity
extraction:
# Required for LLM-based entity extraction
OPENAI_API_KEY=your_api_key_here

Ingest documents
Start by selecting the documents you want to process. In the code example below, you can see how to load a document from either a local file or a URL.

Unstructured2Graph supports multiple file types through Unstructured.io, including PDF, DOCX, TXT, and HTML. It extracts readable text, removes unwanted elements such as headers or page numbers, and divides the content into structured chunks based on document layout. Each chunk is then ready for LightRAG to perform entity and relationship extraction.
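To make that extraction stage concrete, here is a minimal sketch of partitioning and chunking with the open-source unstructured package on its own. It illustrates what happens under the hood rather than the exact calls Unstructured2Graph makes; the file path is a placeholder, and parsing PDFs may require the relevant unstructured extras:

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition a document into elements (titles, paragraphs, lists, ...);
# partition() picks the appropriate parser based on the file type.
elements = partition(filename="docs/paper.pdf")  # placeholder path

# Group the elements into layout-aware chunks that respect section boundaries
chunks = chunk_by_title(elements)

for chunk in chunks[:3]:
    print(chunk.text[:80])

Each of these chunks corresponds to what will become a Chunk node in the graph.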
Here is a complete example of how to ingest documents and create a knowledge graph:
import asyncio
import logging

from memgraph_toolbox.api.memgraph import Memgraph
from lightrag_memgraph import MemgraphLightRAGWrapper
from unstructured2graph import from_unstructured, create_index, compute_embeddings, create_vector_search_index


async def ingest_documents():
    # Connect to Memgraph and clear existing data
    memgraph = Memgraph()
    memgraph.query("MATCH (n) DETACH DELETE n;")
    create_index(memgraph, "Chunk", "hash")

    # Initialize LightRAG for entity extraction
    lrag = MemgraphLightRAGWrapper()
    await lrag.initialize()

    # Define your document sources
    sources = [
        "docs/paper.pdf",                # local file
        "https://example.com/page.html"  # remote URL
    ]

    # Process documents and extract entities
    await from_unstructured(
        sources=sources,
        memgraph=memgraph,
        lightrag_wrapper=lrag,
        only_chunks=False,  # create chunks and extract entities
        link_chunks=True    # link chunks sequentially with NEXT edges
    )

    await lrag.afinalize()

    # Create embeddings and vector index for semantic search
    compute_embeddings(memgraph, "Chunk")
    create_vector_search_index(memgraph, "Chunk", "embedding")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(ingest_documents())

Here’s what happens step by step:
- Text is extracted, cleaned, and chunked by Unstructured IO.
- Each chunk becomes a Chunk node in Memgraph with properties like hash and text.
- LightRAG performs entity recognition and relationship extraction, creating base nodes.
- Entities are linked to chunks with MENTIONED_IN edges.
- Chunks are connected sequentially with NEXT edges for traversal.
- Embeddings are generated and a vector index is created for semantic search.
After processing, your Memgraph instance will hold a complete, queryable knowledge graph ready for GraphRAG.
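A quick way to verify the result is to count what ingestion created. This sketch reuses the Memgraph client from the ingestion script and relies only on the labels and edge types listed above; adjust it if your schema differs:

from memgraph_toolbox.api.memgraph import Memgraph

memgraph = Memgraph()

# Count the nodes and edges created during ingestion
checks = [
    ("Chunk nodes", "MATCH (c:Chunk) RETURN count(c) AS n;"),
    ("entity (base) nodes", "MATCH (e:base) RETURN count(e) AS n;"),
    ("MENTIONED_IN edges", "MATCH ()-[r:MENTIONED_IN]->() RETURN count(r) AS n;"),
    ("NEXT edges", "MATCH ()-[r:NEXT]->() RETURN count(r) AS n;"),
]
for name, query in checks:
    rows = list(memgraph.query(query))
    print(f"{name}: {rows[0]['n']}")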
Query with GraphRAG
Once your data is ingested, you can perform GraphRAG retrieval directly inside Memgraph with a single query. This combines semantic search with graph traversal to retrieve the most relevant context for your questions.
import os

from memgraph_toolbox.api.memgraph import Memgraph
from openai import OpenAI


def graphrag_query(prompt: str):
    memgraph = Memgraph()

    # Retrieve relevant chunks using vector search + graph traversal
    retrieved_chunks = []
    for row in memgraph.query(
        f"""
        CALL embeddings.text(['{prompt}']) YIELD embeddings, success
        CALL vector_search.search('vs_name', 5, embeddings[0]) YIELD distance, node, similarity
        MATCH (node)-[r*bfs]-(dst:Chunk)
        WITH DISTINCT dst, degree(dst) AS degree ORDER BY degree DESC
        RETURN dst LIMIT 5;
        """
    ):
        if "text" in row["dst"]:
            retrieved_chunks.append(row["dst"]["text"])

    if not retrieved_chunks:
        print("No chunks retrieved.")
        return

    # Send retrieved context to LLM for summarization
    context = "\n\n".join(retrieved_chunks)
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer the question based on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {prompt}"},
        ],
        temperature=0.1,
    )

    answer = completion.choices[0].message.content
    print(f"Question: {prompt}")
    print(f"Answer: {answer}")


if __name__ == "__main__":
    graphrag_query("What are the key findings in the document?")

Here’s what the GraphRAG query does:
- Converts the input prompt into an embedding.
- Searches for the most semantically relevant chunks using vector search.
- Expands context through connected nodes in the graph using BFS traversal.
- Sends the retrieved text to an LLM for summarization or question answering.
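If you also want to surface the extracted entities around the retrieved context, you can walk the MENTIONED_IN edges from the Chunk nodes. This is a small, hedged variation on the query above: the entities_for_chunk helper is hypothetical, it assumes the base label and MENTIONED_IN edge type created during ingestion, and it leaves the edge direction unspecified since that detail isn’t shown above:

from memgraph_toolbox.api.memgraph import Memgraph

def entities_for_chunk(chunk_hash: str):
    # Find entity (base) nodes connected to a chunk, identified by its hash property
    memgraph = Memgraph()
    rows = memgraph.query(
        f"""
        MATCH (c:Chunk {{hash: '{chunk_hash}'}})-[:MENTIONED_IN]-(e:base)
        RETURN e;
        """
    )
    return [row["e"] for row in rows]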
Visualize the graph in Memgraph Lab
Open Memgraph Lab and connect to your local instance. Then, in the Query Execution tab, run:
MATCH (n)-[r]->(m) RETURN n, r, m;

You’ll see:
- Chunk nodes for text sections
- base nodes for extracted entities
- MENTIONED_IN edges linking entities to their source chunks
- NEXT edges connecting sequential chunks
Explore this graph visually to understand how your content has been transformed into a connected network of knowledge. From here, you can keep querying and navigating the graph to uncover additional context, run advanced analyses, or expand your knowledge base as needed.
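As a starting point for that exploration, you can run another query in the same Query Execution tab, for example ranking entities by how many chunks mention them. This query relies only on the base label and MENTIONED_IN edges described earlier and leaves the edge direction unspecified:

MATCH (e:base)-[:MENTIONED_IN]-(c:Chunk)
RETURN e, count(DISTINCT c) AS mentions
ORDER BY mentions DESC
LIMIT 10;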
Try it in Memgraph Cloud
Want to skip local setup? You can also use Unstructured2Graph directly with Memgraph Cloud. Sign up, create a new project, and start building your knowledge graph in minutes.