Using Graph Algorithms to Enhance Machine Learning for Cyber Threat Detection

By Maya Natarajan

6 min readMarch 28, 2024

Let’s dive deeper into how the hidden patterns in data relationships, often ignored by traditional machine learning, are now being used to improve accuracy and predict cyber threats, as shown in groundbreaking research by *Sikha et al. from the University of West Florida.

How Graph Algorithms Are Changing Traditional Machine Learning

Relationships are highly predictive but traditional machine learning (ML) processes throw away this valuable data type when transforming data for ML. This is where graph algorithms play a big role. Graph algorithms and embeddings capture and transform the shape of data to increase accuracy in machine learning.

A recent article by Sikha et al. from the University of West Florida’s cyber simulation department shows how Memgraph’s graph algorithms enhance machine learning. Specifically, how graph structure and graph-related features representing relationships can effectively indicate the nature of attacks to identify potential cyber threats.

For this study, Sikha et al. used the Memgraph database and advanced graph extensions called MAGE to analyze, visualize, and detect cyber threats within the UWF-ZeekData22 dataset. To create the graph network in Memgraph, the dataset needed to be preprocessed, where log data was transformed to representative graph nodes and edges. Once preprocessed, data was ingested into Memgraph, queried, and visualized.

Memgraph Graph Algorithms for Enhanced Machine Learning

Graphs are complex structures whose topology can vary depending on the number of nodes, edges, and whether they are directed or not. In this study, each node symbolized an individual IP address, while an edge denoted the connection or relationship between two nodes, namely, a source IP address and a destination IP address, representing the utilized attack tactic.

Essentially, the graph illustrated an attack approach from an originating computer (the attacker) to a target system (the victim). The presence of source and target nodes imply direction, and so the edges are "directed," and the graph is a "directed graph."

The degree of a node in a graph indicates the number of edges incident to that node. In the case of a directed graph, the degree can be categorized into two distinct measures: the in-degree (reflecting incoming edges) and the out-degree (depicting outgoing edges). This allowed the team to use the following seven Memgraph graph algorithms:

PageRank
Degree centrality
Bridges
Weakly connected components
Node and edge cardinality
Path length
Node classification

Let’s go into more detail.

PageRank

PageRank is an algorithm that is used to measure importance. In the article by Sikha et al., PageRank was used to calculate the probability that a random cyber attacker would attack a particular machine by randomly choosing machines on the network.

Memgraph has natively implemented PageRank in C++ and it was used to characterize tactics into 3 types: Benign (None), Reconnaissance, and Discovery. PageRank indicated that the IP address 143.88.2.12:22 in UWF-ZeekData22 was the most likely to be attacked using the Discovery Tactic.

Degree Centrality

Degree Centrality shows the connectedness of nodes based on the tactics being used.

Two measurements were studied for the Reconnaissance Tactic. In-degree centrality where degree refers to the number of incoming edges and out-degree centrality where degree is the number of outgoing edges. In this dataset, there were no significant differences between the Reconnaissance and Benign (or None) Tactics. However, the connectedness of the Reconnaissance nodes was significantly greater than the Benign (None) nodes indicating that an attack was highly possible using the Reconnaissance Tactic.

Bridges

A bridge is a single edge that connects subgraphs together. If removed or deleted, it would result in two subgraphs losing connection. Bridges provide more information about the graph structure. In this study, the Reconnaissance Tactic was heavily connected between nodes.

Weakly Connected Components

Given a directed graph, a weakly connected component (WCC) is a subgraph of the original graph where all vertices are connected to each other by some path, ignoring the direction of edges. WCC can be used to detect how many disparate graphs exist by tactic. The study showed that the Reconnaissance Tactic had many disparate components, whereas the Benign (None) Tactic had very few despite both tactics having a similar number of connections.

Node and Edge Cardinality

Node cardinality refers to the total number of nodes in a graph, while edge cardinality denotes the total number of connections between those nodes. The UWF-ZeekData22 resulted in a graph with 262,963 nodes and 18,562,438 directed edges (edges from source to target address).

Path Length

Path Length identifies the number of hops away from a source node. The results showed that a source node of the Reconnaissance Tactic, with address “143.88.2.10:41562” connected to intermediate address “143.88.7.12:3”, then to address “143.88.2.10:3”, and then to many destination addresses, whereas a source node of Benign (None) Tactic had path lengths of 2.

Node Classification

In this study, graph neural network models were generated for the UWF-ZeekData22 dataset to perform node classification—that is, to label a node as a source or destination node for the correct tactic under the MITRE ATT&CK framework.

Node Classification was used to predict the connection between IP addresses and ports and distinguish attacks from non-attacks in the MITRE framework of UWF-ZeekData22. It categorizes a node as either the source or destination of an attack tactic, facilitating the identification of the IP address–port combination responsible for initiating various attack tactics (thus, enabling multi-classification).

For node classification in graph analysis, two main steps are involved:

Preprocessing
Feature selection

Preprocessing involves labeling each node as either the source or destination of an attack, helping to distinguish the role of each node in the network. For feature selection, attributes like the number of connections (in-degree and out-degree) and importance (PageRank) of nodes based on their attack tactics are considered crucial. Graph Neural Network (GNN) models then use these features to classify combinations of IP addresses and ports as either attackers, targets, or benign entities.

The efficacy of these models varies by the specific algorithm or layer type used (GATJK, GAT, GATv2, “GraphSAGE”) and shows that graph structure and graph-related features representing relationships can effectively indicate the nature of attacks in cyber security settings.

Conclusion

Securing corporate networks against cyber threats is of key importance, especially given the constraints on manpower and financial resources within many organizations. With cyberattacks occurring at an alarming frequency, there's an urgent need for efficient strategies to support resource-constrained organizations. This study shows that graph-based machine learning offers a potent means of identifying potential threats.