
node2vec_online


Abstract#

The node2vec_online algorithm learns and updates temporal node embeddings on the fly, making it possible to track and measure node similarity over time in graph streams. The algorithm creates similar embeddings for two nodes (e.g. v and u) if one can be reached from the other across edges that appeared recently. More precisely, the embedding of node v should be more similar to the embedding of node u if u can be reached from v by walking backward across edges, where each edge taken appeared earlier than the previous one. Such a sequence of backward steps forms a temporal walk. It is temporal because it depends on when each edge appeared in the graph.

To create these temporal walks and make the embeddings of connected nodes more similar, the node2vec_online algorithm combines two components: the StreamWalk updater and the Word2Vec learner.

The StreamWalk updater samples temporal walks. Sampling proceeds backward: from the current node, only incoming edges are considered. Since a node can have multiple incoming edges, the updater chooses which incoming edge to take next, and thus which node to move to, according to precomputed probabilities. These probabilities are computed when an edge arrives, before temporal walk sampling. Each probability is derived from a weight that sums over all temporal walks z ending in node v that use only edges arriving no later than the latest already sampled edge of the walk. When the algorithm decides which edge to take next during walk creation, it uses these precomputed weights (probabilities). Every time a new edge appears in the graph, the weights are updated only for the two endpoints of that edge.
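As an illustration, the backward sampling step can be sketched in plain Python. This is a simplification under stated assumptions, not the module's actual implementation (the real updater maintains the walk-count weights incrementally): here an edge's weight simply decays exponentially with its age according to a half-life, and `beta` damps longer walks.

```python
import random

def sample_temporal_walk(in_edges, start, start_time, half_life=7200.0,
                         beta=0.9, max_length=3, rng=random.Random(0)):
    """Sample one backward temporal walk starting from `start`.

    `in_edges` maps a node to its incoming edges as (source, timestamp)
    pairs. Each step picks an incoming edge that is older than the
    current one, weighted by an exponential time decay (`half_life`)
    and damped by `beta` for longer walks. Illustrative sketch only.
    """
    walk = [start]
    t = start_time
    for step in range(max_length - 1):
        # Only edges strictly older than the last one keep the walk temporal.
        candidates = [(u, ts) for (u, ts) in in_edges.get(walk[-1], [])
                      if ts < t]
        if not candidates:
            break
        # weight = beta^step * 2^(-age / half_life): recent edges dominate.
        weights = [beta ** step * 2.0 ** (-(t - ts) / half_life)
                   for (_, ts) in candidates]
        u, ts = rng.choices(candidates, weights=weights)[0]
        walk.append(u)
        t = ts
    return walk
```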

After the walks are sampled, the Word2Vec learner uses them to make node embeddings more similar via the gensim Word2Vec module. The sampled walks are passed as sentences to gensim Word2Vec, which then optimizes the similarity of node embeddings within a walk using stochastic gradient descent with either a skip-gram or a continuous-bag-of-words (CBOW) model.
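To see what the learner consumes, the sketch below (an illustration, not gensim's internals) turns one sampled walk into the (center, context) pairs that a skip-gram model trains on when the walks are fed in as sentences:

```python
def skipgram_pairs(walk, window=2):
    """Turn one sampled walk into (center, context) training pairs.

    This mirrors the data a skip-gram Word2Vec model derives from a
    sentence: every node is paired with its neighbors within `window`
    positions. Illustrative helper only.
    """
    pairs = []
    for i, center in enumerate(walk):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs
```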

Embeddings capture the graph topology, relationships between nodes, and further relevant information. How the embeddings capture this inherent information of the graph is not fixed.

Capturing information in networks often shuttles between two kinds of similarities: homophily and structural equivalence. Under the homophily hypothesis, nodes that are highly interconnected and belong to similar network clusters or communities should be embedded closely together. In contrast, under the structural equivalence hypothesis, nodes that have similar structural roles in networks should be embedded closely together (e.g., nodes that act as hubs of their corresponding communities).

Currently, our implementation captures homophily: nodes that are highly interconnected and belong to similar network clusters or communities receive similar embeddings.

[1] Node embeddings in dynamic graphs, Ferenc Béres, Róbert Pálovics, Domokos Miklós Kelen and András A. Benczúr

| Trait           | Value      |
| --------------- | ---------- |
| Module type     | module     |
| Implementation  | Python     |
| Graph direction | directed   |
| Edge weights    | unweighted |
| Parallelism     | sequential |

Procedures#

set_streamwalk_updater(half_life, max_length, beta, cutoff, sampled_walks, full_walks)#

Input:#

  • half_life: int ➡ Half-life in seconds, used in the temporal walk probability calculation
  • max_length: int ➡ Maximum length of the sampled temporal random walks
  • beta: float ➡ Damping factor for long paths
  • cutoff: int ➡ Temporal cutoff in seconds to exclude very distant past
  • sampled_walks: int ➡ Number of sampled walks for each edge update
  • full_walks: bool ➡ Return every node of the sampled walk for representation learning (full_walks=True) or only the endpoints of the walk (full_walks=False)

Output:#

  • message: str ➡ Whether parameters are set or they need to be reset

Usage:#

CALL node2vec_online.set_streamwalk_updater(7200, 3, 0.9, 604800, 4, False);
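With the values above (half_life=7200, cutoff=604800), the time decay that drives the walk probabilities can be pictured as follows. The exact formula inside the module may differ; this is just the standard half-life decay, shown for intuition:

```python
HALF_LIFE = 7200   # 2 hours, as in the call above
CUTOFF = 604800    # 1 week, as in the call above

def edge_weight(age_seconds):
    """Conceptual time decay: an edge's influence halves every
    HALF_LIFE seconds and is dropped entirely past CUTOFF.
    Illustrative formula, not the module's exact code."""
    if age_seconds > CUTOFF:
        return 0.0
    return 2.0 ** (-age_seconds / HALF_LIFE)
```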

set_word2vec_learner(embedding_dimension, learning_rate, skip_gram, negative_rate, threads)#

Input:#

  • embedding_dimension: int ➡ Number of dimensions in the representation of the embedding vector
  • learning_rate: float ➡ Learning rate
  • skip_gram: bool ➡ Whether to use the skip-gram model (True) or continuous-bag-of-words (CBOW) (False)
  • negative_rate: int ➡ Negative rate for Gensim Word2Vec model
  • threads: int ➡ Maximum number of threads for parallelization

Output:#

  • message: str ➡ Whether parameters are set or they need to be reset

Usage:#

CALL node2vec_online.set_word2vec_learner(128, 0.01, True, 10, 1);

get()#

Output:#

  • node: mgp.Vertex ➡ Node in the graph for which embedding exists
  • embedding: mgp.List[mgp.Number] ➡ Embedding for the given node

Usage:#

CALL node2vec_online.get();
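The returned embeddings are plain numeric lists, so node similarity can be measured client-side, for example with cosine similarity (a generic helper, not part of the module):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. two
    `embedding` lists returned by node2vec_online.get()."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```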

update(edges)#

Input:#

  • edges: mgp.List[mgp.Edge] ➡ List of edges added to the graph. node2vec_online calculates embeddings only for the nodes of these edges.

Usage:#

There are a few options here. The first one is to create a trigger, so that every time an edge is added to the graph, the trigger calls the procedure and performs an update:

CREATE TRIGGER trigger ON --> CREATE BEFORE COMMIT
EXECUTE CALL node2vec_online.update(createdEdges) YIELD *;

The second option is to add all the edges and then call the algorithm with those edges:

MATCH (n)-[e]->(m)
WITH COLLECT(e) AS edges
CALL node2vec_online.update(edges) YIELD *
WITH 1 AS x
RETURN x;

reset()#

Output:#

  • message: str ➡ Message that parameters are ready to be set again

Usage:#

CALL node2vec_online.reset();

help()#

Output:#

  • name: str ➡ Name of available functions
  • value: str ➡ Documentation for every function

Usage:#

CALL node2vec_online.help();

Example#