Building a Movie Similarity Search Engine with Vector Search in Memgraph
In this tutorial, we’ll show how to use Memgraph’s vector search capabilities to find movies based on their plots or descriptions. We’ll use the Wikipedia Movie Plots dataset (available on Kaggle) and generate vector embeddings with the SentenceTransformer Python library.
For more information about the vector search feature, check out Simplify Data Retrieval with Memgraph’s Vector Search and Memgraph docs (vector search).
Step 1: Launch Memgraph with Vector Search Enabled
To enable vector search in Memgraph, use the --experimental-enabled=vector-search flag and the necessary configuration parameters.
Run the following Docker command to start Memgraph with the vector search feature:
docker run -p 7687:7687 -p 7444:7444 memgraph/memgraph:latest --experimental-enabled=vector-search --experimental-config='{"vector-search": {"movies_index": {"label": "Movie","property": "embedding","dimension": 384,"capacity": 100, "metric": "cos"}}}'
Here, we define a vector index named movies_index with the following settings:
- Label: Movie
- Property: embedding
- Dimension: 384
- Capacity: 100
- Metric: Cosine similarity (cos)
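Once Memgraph is up, you can check that the index was created before loading any data. The snippet below uses the same neo4j Python driver as the rest of this tutorial; the vector_search.show_index_info() procedure is the one described in Memgraph’s vector search docs, so verify its name and output against your Memgraph version:
import neo4j

driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))
with driver.session() as session:
    # Lists the configured vector indexes; exact fields depend on your Memgraph version
    for record in session.run("CALL vector_search.show_index_info() YIELD *"):
        print(record)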
Step 2: Load and Preprocess the Dataset
Next, download the dataset and filter it to include only Christopher Nolan’s movies. These will serve as our example dataset.
Load the Dataset
import pandas as pd
import kagglehub
# Download the dataset from Kaggle
dataset_path = kagglehub.dataset_download("jrobischon/wikipedia-movie-plots")
df = pd.read_csv(dataset_path + "/wiki_movie_plots_deduped.csv")
Filter for Christopher Nolan Movies
# Select Christopher Nolan movies
nolan_movies = df[df['Director'] == 'Christopher Nolan']
nolan_movies.reset_index(drop=True, inplace=True)
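Before computing embeddings, it’s worth a quick sanity check that the filter matched what you expect (the exact count depends on the dataset version):
# Quick sanity check on the filtered dataset
print(f"Found {len(nolan_movies)} Christopher Nolan movies")
print(nolan_movies['Title'].tolist())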
Step 3: Compute Vector Embeddings
Use the SentenceTransformer library to generate 384-dimensional vector embeddings for the movie plots.
from sentence_transformers import SentenceTransformer
# Define a function to compute embeddings
def compute_embeddings(texts):
    # Note: this loads the model on every call; cache it if you embed larger datasets
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    return model.encode(texts)
# Generate embeddings for Nolan movies
embeddings = compute_embeddings(nolan_movies['Plot'].values)
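The paraphrase-MiniLM-L6-v2 model produces 384-dimensional vectors, which is why movies_index was created with dimension 384 in Step 1. A quick check:
# Shape should be (number_of_Nolan_movies, 384), matching the index dimension
print(embeddings.shape)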
Step 4: Import Data into Memgraph
Create nodes for each movie in Memgraph, storing the vector embeddings in the embedding property.
import neo4j
# Connect to Memgraph
driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))
# Insert data into Memgraph
with driver.session() as session:
    for index, row in nolan_movies.iterrows():
        # remove quotes from the title to avoid parsing issues
        title = row["Title"].replace('"', '')
        embedding = embeddings[index].tolist()
        embeddings_str = ",".join([str(x) for x in embedding])
        query = f'CREATE (m:Movie {{title: "{title}", year: {row["Release Year"]}, embedding: [{embeddings_str}]}})'
        session.run(query)
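String-formatting values into Cypher works for this small dataset, but passing them as query parameters is safer and removes the need to strip quotes from titles. A minimal sketch of the same insert using driver parameters (an alternative, not what the code above does):
# Alternative: pass values as query parameters instead of formatting strings
insert_query = "CREATE (m:Movie {title: $title, year: $year, embedding: $embedding})"
with driver.session() as session:
    for index, row in nolan_movies.iterrows():
        session.run(
            insert_query,
            title=row["Title"],
            year=int(row["Release Year"]),
            embedding=embeddings[index].tolist(),
        )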
Step 5: Query the Vector Index
Create a Python function to query the movies_index and find the most similar movies based on a given plot description.
def find_movie(plot):
    embeddings = compute_embeddings([plot])
    embeddings_str = ",".join([str(x) for x in embeddings[0]])
    with driver.session() as session:
        # Return the top 3 most similar movies from movies_index
        query = f"CALL vector_search.search('movies_index', 3, [{embeddings_str}]) YIELD node, similarity RETURN node.title, similarity"
        result = session.run(query)
        for record in result:
            print(record)
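As with the inserts, the query vector can be passed as a parameter rather than formatted into the query string, and the results returned instead of printed. A sketch under the same assumptions (parameterized procedure arguments are standard Cypher, but double-check against your Memgraph version):
def find_movie_parameterized(plot, limit=3):
    # Pass the query vector and result limit as parameters
    query_vector = compute_embeddings([plot])[0].tolist()
    with driver.session() as session:
        result = session.run(
            "CALL vector_search.search('movies_index', $limit, $query_vector) "
            "YIELD node, similarity RETURN node.title AS title, similarity",
            limit=limit,
            query_vector=query_vector,
        )
        return [(record["title"], record["similarity"]) for record in result]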
Step 6: Try It Out
Example 1: Finding Inception
plot = "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O."
find_movie(plot)
Expected Results:
<Record node.title='Inception' similarity=0.5250678062438965>
<Record node.title='Interstellar' similarity=0.2907602787017822>
<Record node.title='The Dark Knight' similarity=0.2784501910209656>
Example 2: Finding Memento
plot = "An insurance investigator suffers from anterograde amnesia, leaving him unable to form new memories, and uses notes and tattoos to track down his wife's killer."
find_movie(plot)
Expected Results:
<Record node.title='Memento' similarity=0.37598633766174316>
<Record node.title='Insomnia' similarity=0.26347029209136963>
<Record node.title='Inception' similarity=0.23714923858642578>
Step 7: Expand the Dataset
Add more movies to the dataset to further experiment and observe how the search results evolve. You can also explore different similarity metrics, such as Jaccard, Hamming, or Pearson, to fine-tune the results for your use case.
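Switching metrics only requires changing the index configuration at startup. For example, the command below swaps the metric to Pearson; the exact metric identifiers (e.g. whether it is spelled "pearson") depend on your Memgraph version, so verify them in the vector search docs:
docker run -p 7687:7687 -p 7444:7444 memgraph/memgraph:latest --experimental-enabled=vector-search --experimental-config='{"vector-search": {"movies_index": {"label": "Movie", "property": "embedding", "dimension": 384, "capacity": 100, "metric": "pearson"}}}'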
Conclusion
By combining vector search and graph databases, Memgraph makes it easy to perform semantic searches with relational context. This example shows how powerful and flexible vector search can be for tasks like movie recommendation or content discovery.
If you’re looking to build GraphRAG with Memgraph, check out other GenAI-related features on Memgraph docs (GraphRAG with Memgraph).