Memgraph Powers Sayari's Billion-Node Graph for Global Risk Analysis
In our Memgraph webinar, James Conkling, Senior VP of Product Engineering at Sayari, discussed how his company uses Memgraph's technology. Sayari maintains a huge knowledge graph that helps promote transparency in global corporate and trade networks. The graph contains over 2 billion entities and 7.5 billion relationships drawn from 586 data sources across more than 200 countries.
This blog post recaps the main topics of the webinar. It highlights how Sayari uses Memgraph for risk analysis and the challenges of managing a large dataset. It also includes a Q&A session where James answers technical questions about working with such extensive data.
Watch the full webinar recording, "Memgraph at Scale: Analyzing Company Ownership & Supply Networks with a 2 Billion-Node Graph."
In the meantime, here are the key talking points from the webinar.
Talking Point 1: Introduction to Sayari
Sayari is developing a large-scale knowledge graph to enhance corporate and trade transparency while addressing critical issues such as money laundering and financial fraud. The graph includes 2 billion entities and 7.5 billion relationships from 586 diverse global sources.
Talking Point 2: Challenges in Handling Large-scale Graph Data
Sayari runs Memgraph in its in-memory analytical mode to manage this extensive graph of roughly 2 billion nodes. This mode supports the live analytical queries an interactive application needs and serves real-time, read-only queries from end users.
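For readers who want to try that configuration, here is a minimal sketch (not taken from the webinar) of switching a Memgraph instance into analytical storage mode over Bolt, assuming a local instance on the default port with no authentication:

```python
from neo4j import GraphDatabase  # Memgraph speaks Bolt, so the Neo4j Python driver works

# Assumption: a local Memgraph instance on the default Bolt port with no auth.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

with driver.session() as session:
    # Switch from the default transactional mode to the analytical mode
    # discussed in the webinar (lower memory overhead at the cost of ACID guarantees).
    session.run("STORAGE MODE IN_MEMORY_ANALYTICAL")
    # Confirm which storage mode is active.
    print(session.run("SHOW STORAGE INFO").data())

driver.close()
```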
Talking Point 3: Data Management and Infrastructure
Sayari manages a complex ontology with 12 entity types, 38 relationship types, and numerous attribute types, which requires sophisticated data parsing and integration. James highlighted the challenge of 'super nodes' that can have up to 50,000 relationships, whose variable and often unpredictable cost affects query performance. Sayari uses advanced query optimization techniques to keep queries efficient even when they touch such nodes.
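The webinar didn't show code for this, but the idea of enforcing permissible relationships at the application level (James returns to it in the Q&A below) can be sketched roughly as follows; the entity labels and relationship types here are invented for illustration and cover only a tiny slice of a real ontology:

```python
# Hypothetical, simplified slice of an ontology: which relationship types are
# permitted between which entity labels. Sayari's real ontology covers 12
# entity types and 38 relationship types.
ALLOWED_RELATIONSHIPS = {
    ("Company", "Person"): {"HAS_OFFICER", "HAS_SHAREHOLDER"},
    ("Person", "Person"): {"FAMILY_OF"},
    ("Company", "Company"): {"OWNS", "SHIPS_TO"},
}

def validate_edge(src_label: str, rel_type: str, dst_label: str) -> None:
    """Reject relationships the ontology does not permit before they are written."""
    allowed = ALLOWED_RELATIONSHIPS.get((src_label, dst_label), set())
    if rel_type not in allowed:
        raise ValueError(f"{rel_type} is not allowed between {src_label} and {dst_label}")

validate_edge("Company", "HAS_OFFICER", "Person")  # fine
try:
    validate_edge("Company", "FAMILY_OF", "Company")  # companies can't be family
except ValueError as err:
    print(err)
```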
Talking Point 4: Technical Insights on Data Management
James discussed Sayari's journey through different graph databases, from Neo4j to Memgraph, highlighting why Memgraph's capabilities better suited their needs, particularly for efficiently handling large volumes of data in memory.
Serving live queries directly from Memgraph has proven to scale to workloads as large as Sayari's, and performance metrics such as average query response time remain robust even with the entire graph held in memory.
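As a rough illustration of serving read-only traffic from a pool of identical Memgraph instances (a setup James describes in the Q&A below), here is a hedged sketch using the Bolt-compatible Neo4j Python driver; the hostnames, labels, and query are invented, and in production a load balancer would sit in front of the pool rather than the client rotating through it:

```python
from itertools import cycle
from neo4j import GraphDatabase

# Hypothetical pool of identical read-only Memgraph replicas.
REPLICA_URIS = [
    "bolt://memgraph-1:7687",
    "bolt://memgraph-2:7687",
    "bolt://memgraph-3:7687",
]
drivers = cycle([GraphDatabase.driver(uri, auth=("", "")) for uri in REPLICA_URIS])

def read_query(query: str, **params):
    """Naive client-side round-robin over read-only replicas."""
    driver = next(drivers)
    with driver.session() as session:
        return session.run(query, **params).data()

# Example read-only lookup; label and property names are illustrative.
rows = read_query("MATCH (c:Company {name: $name}) RETURN c LIMIT 10", name="Acme Ltd")
print(rows)
```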
Talking Point 5: Future Developments
James touched on recent developments in Memgraph that enhance its functionality, such as the ability to limit the number of edges traversed in a query, which helps keep work over extensive data predictable.
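As a sketch of what bounding a traversal can look like, the query below uses standard Cypher depth bounds and LIMIT, which Memgraph supports; the newer per-query edge limit James mentions has its own directive whose exact syntax depends on the Memgraph version, so it is only referenced in a comment. Labels and properties are illustrative:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

# Bound both the expansion depth and the number of returned paths so a super
# node with tens of thousands of relationships cannot blow up the query.
# (Recent Memgraph releases can also cap the total number of edges traversed
# per query; see the Memgraph docs for the exact directive and syntax.)
BOUNDED_QUERY = """
MATCH path = (c:Company {id: $id})-[:OWNS*..3]->(:Company)
RETURN path
LIMIT 100
"""

with driver.session() as session:
    paths = session.run(BOUNDED_QUERY, id="some-company-id").data()
    print(f"Returned {len(paths)} ownership paths")

driver.close()
```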
Sayari has taken Memgraph beyond typical use cases, integrating it with other systems like Cassandra for different workloads and leveraging Memgraph as an index to optimize deep, complex queries.
Talking Point 6: Strategic Data Ingestion and Pipeline Management
Sayari performs bulk data loads into Memgraph, trading data freshness (updates arrive with some latency) for significantly higher throughput, which makes large-scale updates feasible. Database updates are handled through a comprehensive rebuild process: the data enrichment and resolution pipelines are re-run before updates are pushed across all databases.
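The webinar doesn't walk through loader code, but as a rough idea of what a bulk load can look like, Memgraph's LOAD CSV clause imports whole files in a single query; the file path, columns, and label below are invented for illustration, and a load like this would run against an offline instance that is swapped into production only after the rebuild completes:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

# Hypothetical CSV of entities produced by the upstream pipeline,
# with columns: id, name, country.
BULK_LOAD = """
LOAD CSV FROM '/import/companies.csv' WITH HEADER AS row
CREATE (:Company {id: row.id, name: row.name, country: row.country})
"""

with driver.session() as session:
    session.run(BULK_LOAD)

driver.close()
```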
Q&A
1. Have you considered data replication and read-only instances in analytical mode?
- James: Yes, absolutely. We are running a high-availability cluster, but we're not even using the native capabilities, because once you have a read-only setup, you can just deploy a bunch of independent instances and put a load balancer in front of them. If it's a read-only instance, there's no need to have a Raft protocol or whatever it is running in the background to sync updates. So, yes, we run read-only clusters.
2. As far as I'm aware, ontologies are predominantly used by the RDF data model, not labeled property graphs, which Memgraph is based on. How did you adapt or simplify your ontology, which is usually atomic and contains first-order logical statements for labeled property graphs?
- James: Initially, we explored RDF databases using SPARQL but shifted to implementing our ontology within Memgraph's labeled property graphs. Although ontologies are most associated with the semantic web, we apply the same abstract concepts outside that setting, without reasoning engines or OWL statements. Our ontology is simplified and isn't stored the way it would be in a semantic system; it defines the permissible relationships between entity types and is enforced at the application level. This ensures no inappropriate relationships are formed, such as linking two companies as family, and adapts the ontology approach to our needs within Memgraph's framework.
3. Can you describe your data ingestion pipeline? For example, do you stream writes or perform a bulk load?
- James: Yes, we use a bulk load method, which is crucial for keeping our databases in read-only mode. This involves writing to the database offline before making it available and then switching all our services and APIs to the updated instance. Our entire upstream data processing, which generates our knowledge graph, runs on Spark, a highly scalable framework. Bulk loading trades the low latency of streaming updates for much higher throughput, allowing us to rewrite vast amounts of data at once. We perform these comprehensive data rewrites every two weeks, although some systems may update daily. This achieves high throughput but introduces considerable write latency, anywhere from a day up to two weeks, during which we process billions of entries.
4. How do you manage updates with a ten-hour load time? I would imagine you couldn't update the database in production. In analytical mode, do you rebuild from scratch and swap out?
- James: Exactly. Yep. That's the whole point. We're rerunning our resolution pipeline, which takes several hours. We're re-enriching parts of our data. We're ingesting new data. Then we write everything to all of our databases end to end, which takes something like 24 hours for a full rebuild. That's the full build we only do every two weeks, and then we have smaller mini-batch writes that go out up to once a day.
5. How do you sync updates across all of the replicas?
- James: We use a straightforward bulk write process across multiple machines, ensuring consistent data without the complexity of real-time synchronization. Our setup is primarily read-only.
6. Which tool did you use to visualize this large data set? How did you manage transferring large amounts of data?
- James: We've developed an open-source WebGL library called Trellis to visualize large graphs, which you can find under Sayari Trellis on NPM. It's one of the few open-source WebGL network renderers available, a space where we believe more development is needed compared to traditional tools like D3 that render to SVG and canvas. With Trellis, we primarily handle large data volumes by running breadth-first searches that return only the most relevant paths—often just tens to a few thousand nodes—out of millions, to avoid overwhelming users and minimize latency. This approach ensures efficient data handling and visualization without sacrificing performance.
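For readers curious what such a path query might look like, here is a hedged sketch using Memgraph's built-in BFS expansion; the labels, properties, depth bound, and path limit are illustrative rather than Sayari's actual query:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

# Breadth-first search between two entities, capped in depth and path count so
# the front end (e.g. a Trellis visualization) receives only the most relevant
# slice of the graph instead of millions of nodes.
BFS_QUERY = """
MATCH path = (a:Company {id: $source_id})-[*BFS ..4]-(b:Company {id: $target_id})
RETURN path
LIMIT 50
"""

with driver.session() as session:
    paths = session.run(BFS_QUERY, source_id="company-a", target_id="company-b").data()
    print(f"Returned {len(paths)} paths")

driver.close()
```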
Conclusion
That’s how Sayari uses Memgraph to efficiently handle and analyze extensive graph data. Moving forward, Sayari plans to continue enhancing their data processing capabilities and exploring new technological advancements to improve the scalability and performance of their systems.
Further Reading:
- Sayari customer story: Real-time Data Processing for Relationship Mapping: Network Risk Analysis
- Memgraph as a Graph Analytics Engine
- Memgraph Storage Modes Explained
- How to Import 1 Million Nodes and Edges per Second Into Memgraph
- Handling large graph datasets
- Query Optimization in Memgraph: Best Practices and Common Mistakes