node2vec_online

The node2vec_online algorithm learns and updates temporal node embeddings on the fly for tracking and measuring node similarity over time in graph streams. The algorithm creates similar embeddings for two nodes (e.g. v and u) if there is an option to reach one node from the other across edges that appeared recently. In other words, the embedding of a node v should be more similar to the embedding of node u if we can reach u by taking steps backward to node v across edges that appeared before the previous one. These steps backward from one node to the other form a temporal walk. It is temporal since it depends on when the edge appeared in the graph.

To make two nodes more similar and to create these temporal walks, the Node2Vec Online algorithm uses the StreamWalk updater and Word2Vec learner.

StreamWalk updater is a machine for sampling temporal walks. A sampling of the walk is done in a backward fashion because we look only at the incoming edges of the node. Since one node can have multiple incoming edges, when sampling a walk, StreamWalk updater uses probabilities to determine which incoming edge of the node it will take next, and that way leading to a new node. These probabilities are computed after the edge arrives and before temporal walk sampling. Probability represents a sum over all temporal walks z ending in node v using edges arriving no later than the latest one of already sampled ones in the temporal walk. When the algorithm decides which edge to take next for temporal walk creation, it uses these computed weights (probabilities). Every time a new edge appears in the graph, these probabilities are updated just for two nodes of a new edge.

After walks sampling, Word2Vec learner uses these prepared temporal walks to make node embeddings more similar using the gensim Word2Vec module. These sampled walks are given as sentences to the gensim Word2Vec module, which then optimizes for the similarity of the node embeddings in the walk with stochastic gradient descent using a skip-gram model or continuous-bag-of-words (CBOW).

Embeddings capture the graph topology, relationships between nodes, and further relevant information. How the embeddings capture this inherent information of the graph is not fixed.

Capturing information in networks often shuttles between two kinds of similarities: homophily and structural equivalence. Under the homophily hypothesis, nodes that are highly interconnected and belong to similar network clusters or communities should be embedded closely together. In contrast, under the structural equivalence hypothesis, nodes that have similar structural roles in networks should be embedded closely together (e.g., nodes that act as hubs of their corresponding communities).

Currently, our implementation captures for homophily - nodes that are highly interconnected and belong to similar network clusters or communities.

Check out the Node embeddings in dynamic graphs by Ferenc Béres, Róbert Pálovics, Domokos Miklós Kelen and András A. Benczúr.

TraitValue
Module typemodule
ImplementationPython
Graph directiondirected
Edge weightsunweighted
Parallelismsequential

Too slow?

If this algorithm implementation is too slow for your use case, open an issue on Memgraph’s GitHub repository and request a rewrite to C++!

Procedures

The basic node classification workflow is as follows:

  1. Load data to Memgraph.
  2. Set updater parameters by calling the set_streamwalk_updater() procedure.
  3. Set learner parameters by calling the set_streamwalk_learner() procedure.
  4. Call the update() procedure to set the node embeddings.
  5. Inspect the results by calling the get() procedures.
  6. If you want to reset the parameters and start again, call the restart() procedure.

set_streamwalk_updater()

Use the procedure to sample the temporal random walks in the graph. Once the set_word2vec_learner() parameters are set, you first need to call the reset() procedure to update the parameters.

Input:

  • subgraph: Graph (OPTIONAL) ➡ A specific subgraph, which is an object of type Graph returned by the project() function, on which the algorithm is run. If subgraph is not specified, the algorithm is computed on the entire graph by default.

  • half_life: integer ➡ Half-life [seconds], used in the temporal walk probability calculation.

  • max_length: integer ➡ The maximum length of the sampled temporal random walks.

  • beta: float ➡ The damping factor for long paths.

  • cutoff: integer ➡ A temporal cutoff in seconds to exclude very distant past.

  • sampled_walks: integer ➡ The number of sampled walks for each edge update.

  • full_walks: boolean ➡ An option to return every node of the sampled walk for representation learning (full_walks=True) or only the endpoints of the walk (full_walks=False).

Output:

  • message: string ➡ Whether parameters are set or they need to be reset.

Usage:

To start sampling the temporal random walks, use the following query:

CALL node2vec_online.set_streamwalk_updater(7200, 3, 0.9, 604800, 4, False) 
YIELD message 
RETURN message;

set_word2vec_learner()

Use the procedure to calculate embeddings. To recalculate embeddings with different parameters, you first need to call the reset() procedure to update the parameters.

Input:

  • subgraph: Graph (OPTIONAL) ➡ A specific subgraph, which is an object of type Graph returned by the project() function, on which the algorithm is run. If subgraph is not specified, the algorithm is computed on the entire graph by default.

  • embedding_dimension: integer ➡ The number of dimensions in the representation of the embedding vector.

  • learning_rate: float ➡ The learning rate.

  • skip_gram: boolean (default=True) ➡ Whether to use skip-gram model (True) or continuous-bag-of-words (CBOW) (False).

  • negative_rate: integer ➡ Negative rate for Gensim Word2Vec model.

  • threads: integer ➡ Maximum number of threads for parallelization.

Output:

  • message: string ➡ Whether parameters are set or they need to be reset.

Usage:

To set the learner parameter values, use the following query:

CALL node2vec_online.set_word2vec_learner(128, 0.01, True, 10, 1)
YIELD message
RETURN message;

update()

Use the udpate() procedure to (re)calculate embeddings.

Input:

  • subgraph: Graph (OPTIONAL) ➡ A specific subgraph, which is an object of type Graph returned by the project() function, on which the algorithm is run. If subgraph is not specified, the algorithm is computed on the entire graph by default.

  • edges: mgp.List[mgp.Edge] ➡ A list of relationships added to the graph. Embeddings will be (re)calculated only for the starting and ending nodes of those new relationships.

Usage:

You can use the update() procedure in two ways.

You can create a trigger, so every time a relationships is added to graph, the trigger calls a procedure and makes an update:

CREATE TRIGGER trigger ON --> CREATE BEFORE COMMIT
EXECUTE CALL node2vec_online.update(createdEdges)
RETURN *;

Or, you can filter out certain relationships and call the algorithm with those relationships:

MATCH (n)-[e]->(m)
WITH COLLECT(e) as edges
CALL node2vec_online.update(edges)
RETURN edges;

get()

The get() procedure returns the node embeddings.

Input:

  • subgraph: Graph (OPTIONAL) ➡ A specific subgraph, which is an object of type Graph returned by the project() function, on which the algorithm is run. If subgraph is not specified, the algorithm is computed on the entire graph by default.

Output:

  • node: mgp.Vertex ➡ The node in the graph for which embedding exists.
  • embedding: mgp.List[mgp.Number] ➡ Embedding for the given node.

Usage:

To return the calculated node embeddings, use the following query:

CALL node2vec_online.get()
YIELD node, embedding
RETURN node, embedding;

reset()

Use the procedure to reset the updater and learner parameters.

Output:

  • message: string ➡ Message that parameters are ready to be set again.

Usage:

To reset the parameters, use the following query:

CALL node2vec_online.reset()
YIELD message
RETURN message;

help()

The procedure will return procedures available in the query module, their parameters and descriptions.

Output:

  • name: string ➡ The name of available functions.
  • value: string ➡ Documentation for every function.

Usage:

To get help, use the following query:

CALL node2vec_online.help()
YIELD name, value RETURN name, value;

Example

Database state

The database contains the following data:

Input graph

Created with the following Cypher queries:

MERGE (n:Node {id: 1}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 2}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 10}) MERGE (m:Node {id: 5}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 5}) MERGE (m:Node {id: 2}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 9}) MERGE (m:Node {id: 7}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 7}) MERGE (m:Node {id: 3}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 3}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 9}) MERGE (m:Node {id: 8}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 8}) MERGE (m:Node {id: 4}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 4}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);

Set the paramaters

To set the parameters for the updater and the learner, use the following query:

CALL node2vec_online.set_streamwalk_updater(7200, 2, 0.9, 604800, 2, True) 
YIELD message RETURN message;
CALL node2vec_online.set_word2vec_learner(2, 0.01, True, 1, 1) 
YIELD message RETURN message;

Set a trigger

In order to recalculate node embeddings when new relationships are added to the graph, create a trigger:

CREATE TRIGGER trigger ON --> CREATE BEFORE COMMIT
EXECUTE CALL node2vec_online.update(createdEdges) 
RETURN *;

Add new relationships

Use the following queries to add new relationships between nodes:

MERGE (n:Node {id: 1}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 2}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 10}) MERGE (m:Node {id: 5}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 5}) MERGE (m:Node {id: 2}) CREATE (n)-[:RELATION]->(m);
 
MERGE (n:Node {id: 9}) MERGE (m:Node {id: 7}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 7}) MERGE (m:Node {id: 3}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 3}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);
 
MERGE (n:Node {id: 9}) MERGE (m:Node {id: 8}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 8}) MERGE (m:Node {id: 4}) CREATE (n)-[:RELATION]->(m);
MERGE (n:Node {id: 4}) MERGE (m:Node {id: 6}) CREATE (n)-[:RELATION]->(m);

Calculate embeddings

Use the following query to get the calculated embeddings:

CALL node2vec_online.get() 
YIELD node, embedding
RETURN node, embedding
ORDER BY node.id;

Results:

+-------------------------+-------------------------+
| node                    | embedding               |
+-------------------------+-------------------------+
| (:Node {id: 1})         | [0.255167, 0.450464]    |
| (:Node {id: 2})         | [-0.465147, -0.35584]   |
| (:Node {id: 3})         | [-0.243008, -0.0908009] |
| (:Node {id: 4})         | [-0.414261, -0.472441]  |
| (:Node {id: 5})         | [-0.250771, -0.188169]  |
| (:Node {id: 6})         | [-0.0268114, 0.0118215] |
| (:Node {id: 7})         | [-0.226831, 0.327703]   |
| (:Node {id: 8})         | [0.143829, 0.0495937]   |
| (:Node {id: 9})         | [0.369025, -0.0766736]  |
| (:Node {id: 10})        | [0.322944, 0.448649]    |
+-------------------------+-------------------------+