kmeans
The k-means algorithm clusters given data by trying to separate samples in n
groups of equal variance by minimizing the criterion known as
within-the-cluster sum-of-squares.
Trait | Value |
---|---|
Module type | module |
Implementation | Python |
Graph direction | directed/undirected |
Edge weights | weighted/unweighted |
Parallelism | sequential |
Too slow?
If this algorithm implementation is too slow for your use case, contact us and request a rewrite to C++ !
Procedures
You can execute this algorithm on graph projections, subgraphs or portions of the graph.
get_clusters(n_clusters, embedding_property, init, n_init, max_iter, tol, algorithm, random_state)
For each node, this procedure returns what cluster it belongs to.
Input:
n_clusters : int
➡ Number of clusters to be formed.embedding_property : str (default: "embedding")
➡ Node property where embeddings are stored.init : str (default: "k-means")
➡ Initialization method. Ifk-means++
is selected, initial cluster centroids are sampled per an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence and is theoretically proven to beO(logk)
-optimal. Ifrandom
,n_clusters
observations (rows) are randomly chosen for the initial centroids.n_init : int (default: 10)
➡ Number of times the k-means algorithm will be run with different centroid seeds.max_iter : int (default: 10)
➡ Length of sampling walks.tol : float (default: 1e-4)
➡ Relative tolerance of the Frobenius norm of the difference of cluster centers across consecutive iterations. Used in determining convergence.algorithm : str (default: "auto")
➡ Options arelloyd
,elkan
,auto
,full
. Description here (opens in a new tab).random_state : int (default: 1998)
➡ Random seed for the algorithm.
Output:
node: mgp.Vertex
➡ Graph node.cluster_id: mgp.Number
➡ Cluster ID of the above node.
Usage:
CALL kmeans.get_clusters(2, "embedding", "k-means++", 10, 10, 0.0001, "auto", 1) YIELD node, cluster_id
RETURN node.id as node_id, cluster_id
ORDER BY node_id ASC;
set_clusters( n_clusters, embedding_property, cluster_property, init, n_init, max_iter, tol, algorithm, random_state)
Procedure sets for each node to which cluster it belongs to by writing cluster id to cluster_property
.
Input:
n_clusters : int
➡ Number of clusters to be formed.embedding_property : str (default: "embedding")
➡ Node property where embeddings are stored.cluster_property: str (default: "cluster_id")
➡ Node property wherecluster_id
will be stored.init : str (default: "k-means")
➡ Initialization method. Ifk-means++
is selected, initial cluster centroids are sampled per an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence and is theoretically proven to beO(logk)
-optimal. Ifrandom
,n_clusters
observations (nodes) are randomly chosen for the initial centroids.n_init : int (default: 10)
➡ Number of times the k-means algorithm will be run with different centroid seeds.max_iter : int (default: 10)
➡ Length of sampling walks.tol : float (default: 1e-4)
➡ Relative tolerance of the Frobenius norm of the difference of cluster centers across consecutive iterations. Used in determining convergence.algorithm : str (default: "auto")
➡ Options arelloyd
,elkan
,auto
,full
. Description here (opens in a new tab).random_state : int (default: 1998)
➡ Random seed for the algorithm.
Output:
node: mgp.Vertex
➡ Graph node.cluster_id: mgp.Number
➡ Cluster ID of the above node.
Usage:
CALL kmeans.set_clusters(2, "embedding", "cluster_id", "k-means++", 10, 10, 0.0001, "auto", 1) YIELD node, cluster_id
RETURN node.id as node_id, cluster_id
ORDER BY node_id ASC;
Example
Step 1
Step 2
CREATE (:Node {id:0, embedding: [0.90678340196609497, 0.74690568447113037, -0.65984714031219482]});
...
MATCH (a:Node {id: 9}) MATCH (b:Node {id: 8}) MERGE (a)-[:RELATION]->(b);
Step 3
CALL kmeans.get_clusters(2, "embedding", "k-means++", 10, 10, 0.0001, "auto", 1) YIELD node, cluster_id
RETURN node.id as node_id, cluster_id
ORDER BY node_id ASC;
Step 4
+-------------------------+-------------------------+
| node_id | cluster_id |
+-------------------------+-------------------------+
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 4 | 1 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
+-------------------------+-------------------------+