Dynamic PageRank on Streaming Data
PageRank is one of those iconic algorithms that have changed the world of technology forever. You probably know Google's story - after designing a brilliant algorithm that is able to measure the influence of a webpage simply by looking at who has a link toward the targeted webpage, Google has changed the course of history within a web-browsing niche.
Where can PageRank be used?
PageRank can be used as an influence measurement in many different applications when working with highly interconnected data. For example, one can wonder who is an influential individual in the connected data of social network interactions. Others might want to investigate transactions made by a particular individual in some financial data warehouses. Having an influence measurement, in that case, can lead toward finding who is collaborating with a suspicious number of entities, which could be interpreted as fraudulent behavior. Finally, an extremely intersected internet network may require influence measurement to pinpoint the specific place where a data transfer bottleneck could occur.
All these examples have one thing in common: they can be solved/investigated with proper influence measurement information. Organizing data as connected entities within a graph opens up the possibility of using graph algorithms that are specifically designed for such problems. This is what the world's largest companies have been doing for years. However, the real world demands solving problems that are further ahead in the process of reaching the goal.
Large and constantly evolving data quickly makes common approaches obscure. While going through the following sections, think about your data, question yourself whether it is large or not, and whether it evolves with time or it's static. If you have large and dynamic data, this article can help you dive into building an influence measurement that constantly updates whenever new data is available.
How to use PageRank on dynamic data?
An obvious and trivial solution to the previous problem would be: whenever a new data point or connection comes into our platform, simply recalculate the influence measurement with the PageRank algorithm. This is very simple, and it works. However, the story changes over time when your business becomes more popular. You begin to have more work and therefore datapoints keep coming at a faster pace, accumulated data keeps rising in size and your algorithm fails to deliver valuable information efficiently.
To prevent that from happening, the next section will describe the method for incrementally updating PageRank measurements whenever a new data point hits the platform.
Imagine a case where data is ingested via streaming, and you demand an almost real-time answer to what is the influence of some entity.
Incremental PageRank on streaming data
I am going to describe the algorithm with the most simple and yet genius idea of how to update influence/ranking when new information is added into the system. The work from Twitter employees described in "Fast Incremental and Personalized PageRank" by Bahmani et al. can be simply put like this:
- Approximatively calculate the influence measurement on your current, static data.
- When a new connection/data point comes into the system, recalculate measurement only on data that is affected by the update.
The first one is trivial. There are numerous techniques to empower the PageRank algorithm on static data. One of them is approximative.
Intuitively, a PageRank is the probability to end up in a certain node starting from a randomly picked one. Implementing that exact idea is going to lead toward the approximative solution to the problem. If we sample R different random walks of average length 1/ε, where ε notes the probability of finishing a random walk, we have enough information to approximate centrality/influence.
Now, for a node v, let's introduce variable Xᵥ that denotes the number of times that v is stored in some random walk. If n is a number of total nodes, by the definition that was mentioned previously, PageRank can be approximated with the next formula:
When saying, "recalculate measurement only on data which is affected by the update" one wonders what specific data points should be updated once new data hits the system. We can think about it like this: after creating a connection, or a new data point, we are going to trigger an update on PageRank values for each node.
Calculation of parameters depends on previously calculated walks, therefore we need to store them beforehand. After we recalculate routes, we can easily derive new PageRank values from the formula above. Walks that need to be changed are affected by new graph entities. If a connection is added, all walks that are going through the connection's nodes need to be changed (the image below shows how walk #3 updates once a new edge arrives). As mentioned, after these changes, the calculation of PageRank is trivial by reusing the stored information and formula above.
The idea of this article is to describe how a smart algorithm can spare a large number of computations and deliver valuable insights into data faster than before. With the current progress and rise in streaming data, such insight can solve troubles for a huge amount of companies.
Our team of engineers is currently tackling the problem of graph analytics algorithms on real-time data. If you want to discuss how to apply online/streaming algorithms on connected data, feel free to join our Discord server and message us.
MAGE shares his wisdom on a Twitter channel. Get to know him better by following him 🐦
Last but not least, check out MAGE and don't hesitate to give a star ⭐ or contribute with new ideas.