kmeans

The k-means algorithm clusters given data by trying to separate samples in n groups of equal variance by minimizing the criterion known as within-the-cluster sum-of-squares.

docs-source (opens in a new tab)

TraitValue
Module typemodule
ImplementationPython
Graph directiondirected/undirected
Edge weightsweighted/unweighted
Parallelismsequential

Too slow?

If this algorithm implementation is too slow for your use case, contact us and request a rewrite to C++ !

Procedures

get_clusters(n_clusters, embedding_property, init, n_init, max_iter, tol, algorithm, random_state)

For each node, this procedure returns what cluster it belongs to.

Input:

  • n_clusters : int ➡ Number of clusters to be formed.
  • embedding_property : str (default: "embedding") ➡ Node property where embeddings are stored.
  • init : str (default: "k-means") ➡ Initialization method. If k-means++ is selected, initial cluster centroids are sampled per an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence and is theoretically proven to be O(logk)-optimal. If random, n_clusters observations (rows) are randomly chosen for the initial centroids.
  • n_init : int (default: 10) ➡ Number of times the k-means algorithm will be run with different centroid seeds.
  • max_iter : int (default: 10) ➡ Length of sampling walks.
  • tol : float (default: 1e-4) ➡ Relative tolerance of the Frobenius norm of the difference of cluster centers across consecutive iterations. Used in determining convergence.
  • algorithm : str (default: "auto") ➡ Options are lloyd, elkan, auto, full. Description here (opens in a new tab).
  • random_state : int (default: 1998) ➡ Random seed for the algorithm.

Output:

  • node: mgp.Vertex ➡ Graph node.
  • cluster_id: mgp.Number ➡ Cluster ID of the above node.

Usage:

 CALL kmeans.get_clusters(2, "embedding", "k-means++", 10, 10, 0.0001, "auto", 1) YIELD node, cluster_id
 RETURN node.id as node_id, cluster_id
   ORDER BY node_id ASC;

set_clusters( n_clusters, embedding_property, cluster_property, init, n_init, max_iter, tol, algorithm, random_state)

Procedure sets for each node to which cluster it belongs to by writing cluster id to cluster_property.

Input:

  • n_clusters : int ➡ Number of clusters to be formed.
  • embedding_property : str (default: "embedding") ➡ Node property where embeddings are stored.
  • cluster_property: str (default: "cluster_id") ➡ Node property where cluster_id will be stored.
  • init : str (default: "k-means") ➡ Initialization method. If k-means++ is selected, initial cluster centroids are sampled per an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence and is theoretically proven to be O(logk)-optimal. If random, n_clusters observations (nodes) are randomly chosen for the initial centroids.
  • n_init : int (default: 10) ➡ Number of times the k-means algorithm will be run with different centroid seeds.
  • max_iter : int (default: 10) ➡ Length of sampling walks.
  • tol : float (default: 1e-4) ➡ Relative tolerance of the Frobenius norm of the difference of cluster centers across consecutive iterations. Used in determining convergence.
  • algorithm : str (default: "auto") ➡ Options are lloyd, elkan, auto, full. Description here (opens in a new tab).
  • random_state : int (default: 1998) ➡ Random seed for the algorithm.

Output:

  • node: mgp.Vertex ➡ Graph node.
  • cluster_id: mgp.Number ➡ Cluster ID of the above node.

Usage:

 CALL kmeans.set_clusters(2, "embedding", "cluster_id", "k-means++", 10, 10, 0.0001, "auto", 1) YIELD node, cluster_id
 RETURN node.id as node_id, cluster_id
   ORDER BY node_id ASC;

Example

Step 1

Step 2

CREATE (:Node {id:0, embedding: [0.90678340196609497, 0.74690568447113037, -0.65984714031219482]});
...
MATCH (a:Node {id: 9}) MATCH (b:Node {id: 8}) MERGE (a)-[:RELATION]->(b);

Step 3

CALL kmeans.get_clusters(2, "embedding", "k-means++", 10, 10, 0.0001, "auto", 1) YIELD node, cluster_id
RETURN node.id as node_id, cluster_id
ORDER BY node_id ASC;

Step 4

+-------------------------+-------------------------+
| node_id                 | cluster_id              |
+-------------------------+-------------------------+
| 0                       | 1                       |
| 1                       | 1                       |
| 2                       | 1                       |
| 4                       | 1                       |
| 5                       | 0                       |
| 6                       | 0                       |
| 7                       | 0                       |
| 8                       | 0                       |
| 9                       | 0                       |
+-------------------------+-------------------------+