This article was written by Meituan NLP team members Gao Chen and Zhao Dengchang and is available on the official Nebula Graph forum at discuss.nebula-graph.com.cn/t/topic/137…
1. Introduction
In recent years, deep learning and knowledge graph technologies have developed rapidly. Compared with the "black box" of deep learning, knowledge graphs are highly interpretable and are widely used in search and recommendation, intelligent assistants, financial risk control, and other scenarios. By mining the associations in its accumulated massive business data in combination with application scenarios, Meituan has gradually built knowledge graphs in nearly ten domains, including a food graph, a tourism graph, and a commodity graph, and has applied them in multiple business scenarios to make local life services more intelligent.
To store and retrieve graph data efficiently, we chose a graph database as the storage engine: compared with a traditional relational database, it has a clear performance advantage for multi-hop queries. There are currently dozens of well-known graph database products in the industry, and selecting one that can meet Meituan's actual business needs is the foundation of building our graph storage and graph learning platform. Based on the current business situation, we set the following basic selection criteria:
- Open-source project that is business-friendly
  - Having control over the source code ensures data security and service availability.
- Cluster mode with horizontally scalable storage and computing
  - Meituan's knowledge graph business data can exceed 100 billion vertices and edges, with throughput reaching tens of thousands of QPS; a single-node deployment cannot meet these storage requirements.
- Ability to serve OLTP scenarios, with millisecond-level multi-hop query capability
  - In Meituan's search scenario, to guarantee the user's search experience, the timeout of each stage in the pipeline is strictly limited, and query response times above one second are unacceptable.
- Capability for batch data import
  - Graph data is stored in data warehouses such as Hive, so there must be a way to quickly import it into the graph store to ensure the timeliness of the service.
We surveyed the top 30 graph database products on DB-Engines and found that most of the well-known graph databases only offer a single-node open-source version and cannot scale horizontally to meet the storage requirements of large-scale graph data, for example Neo4j, ArangoDB, Virtuoso, TigerGraph, and RedisGraph. After research and comparison, NebulaGraph (built by a team formerly at Alibaba), Dgraph (built by a team formerly at Google), and HugeGraph (built by a team formerly at Baidu) entered the final round of evaluation.
2. Test summary
2.1 Hardware Configuration
- Database instances: Docker containers running on different physical machines.
- Single-instance resources: 32 cores, 64 GB memory, 1 TB SSD storage (Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz)
- Number of instances: 3
2.2 Deployment Scheme
- Nebula v1.0.1
Metad manages cluster metadata, Graphd executes queries, and Storaged stores the sharded data. The storage backend is RocksDB.
| Instance 1 | Instance 2 | Instance 3 |
| --- | --- | --- |
| Metad | Metad | Metad |
| Graphd | Graphd | Graphd |
| Storaged[RocksDB] | Storaged[RocksDB] | Storaged[RocksDB] |
- Dgraph v20.07.0
Zero manages cluster metadata, and Alpha handles queries and storage. The storage backend is Dgraph's own implementation (BadgerDB).
| Instance 1 | Instance 2 | Instance 3 |
| --- | --- | --- |
| Zero | Zero | Zero |
| Alpha | Alpha | Alpha |
- HugeGraph v0.10.4
HugeServer manages cluster metadata and executes queries. HugeGraph supports a RocksDB backend, but does not support cluster deployment on RocksDB, so HBase is used as the storage backend.
| Instance 1 | Instance 2 | Instance 3 |
| --- | --- | --- |
| HugeServer[HBase] | HugeServer[HBase] | HugeServer[HBase] |
| JournalNode | JournalNode | JournalNode |
| DataNode | DataNode | DataNode |
| NodeManager | NodeManager | NodeManager |
| RegionServer | RegionServer | RegionServer |
| ZooKeeper | ZooKeeper | ZooKeeper |
| NameNode | NameNode[Backup] | – |
| – | ResourceManager | ResourceManager[Backup] |
| HBase Master | HBase Master[Backup] | – |
3. Test data set
- Social graph data set: github.com/ldbc011
- Generate parameters: branch=stable, version=0.3.3, scale=1000
- Entities: 4 categories of entities, total 2.6 billion
- Relationships: 19 types of relationships, 17.7 billion
- Data format: CSV
- Size after GZip compression: 194 GB
4. Test results
4.1 Importing Data in Batches
4.1.1 Test Instructions
The batch import pipeline is: CSV files in the Hive data warehouse → intermediate files in a format supported by the graph database → graph database. The specific import method for each graph database is as follows:
- Nebula: run a Spark job to generate RocksDB's underlying SST storage files from the warehouse data, then perform an SST Ingest operation to load the data.
- Dgraph: run a Spark job to generate RDF triple files from the warehouse data (see the sketch after this list), then run the bulk loader to directly generate the persistent files for each node.
- HugeGraph: supports importing CSV files from the warehouse directly, so the intermediate-file step is not needed; data is inserted in batches with the loader tool.
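To make the warehouse-to-intermediate-file step concrete, here is a minimal PySpark sketch of generating Dgraph-style RDF triples from a Hive table. The table and column names (ldbc.person, person_id, first_name, last_name) are hypothetical placeholders, not the exact schema used in this test, and the real job covers all entity and relationship types.

```python
# Minimal PySpark sketch of the "warehouse -> intermediate file" step for Dgraph.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warehouse-to-rdf")
         .enableHiveSupport()
         .getOrCreate())

persons = spark.sql("SELECT person_id, first_name, last_name FROM ldbc.person")

def to_rdf(row):
    # One triple per line, terminated by " ." as required by the RDF/N-Quad format.
    uid = f"_:p{row.person_id}"
    return "\n".join([
        f'{uid} <first_name> "{row.first_name}" .',
        f'{uid} <last_name> "{row.last_name}" .',
    ])

# Dgraph's bulk loader consumes the generated *.rdf files in a later step.
persons.rdd.map(to_rdf).saveAsTextFile("/tmp/ldbc_person_rdf")
```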
4.1.2 Test results
4.1.3 Data analysis
- Nebula: data is partitioned by hashing the vertex ID (primary key), so it is distributed evenly across the storage nodes. It had the fastest import speed and the best storage amplification ratio.
- Dgraph: the import command for the original 194 GB of data was run on a machine with 392 GB of memory, and the process exited with an OOM error after 8.7 hours, so the full data set could not be imported. Dgraph partitions data by predicate, so all edges of the same relationship type can only be stored on a single data node, which causes severe storage and compute skew (a toy illustration of the two partitioning strategies follows this list).
- HugeGraph: after the import command was run, the original 194 GB of data filled up the 1,000 GB disk of a single node, so the import failed and the full data set could not be imported. It had the worst storage amplification ratio, and its data skew is severe.
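The skew described above can be illustrated with a small standalone script (not code from either database): hashing edges by vertex ID spreads them roughly evenly across storage nodes, while partitioning by predicate sends every edge of a hot relationship such as person_knows_person to a single node.

```python
# Toy illustration of the two partitioning strategies; not actual Nebula/Dgraph code.
import collections
import random

NODES = 3
# 100,000 edges of one dominant relationship, as in an LDBC-style data set.
edges = [(random.randrange(10**6), "person_knows_person", random.randrange(10**6))
         for _ in range(100_000)]

# Nebula-style: partition by hash of the source vertex ID -> roughly 1/3 per node.
by_vertex = collections.Counter(src % NODES for src, _, _ in edges)

# Dgraph-style: partition by predicate -> every edge of this relationship on one node.
by_predicate = collections.Counter(hash(pred) % NODES for _, pred, _ in edges)

print("hash by vertex id :", dict(by_vertex))
print("hash by predicate :", dict(by_predicate))
```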
4.2 Real-time Data Writing
4.2.1 Test Instructions
- Insert vertices and edges into the graph database to test its real-time write and concurrency capabilities.
- Response time: send a fixed set of 50,000 records as write requests at a fixed QPS, ending when all have been sent. Measure the Avg, P99, and P999 latency from the moment the client sends a request to the moment it receives the response.
- Maximum throughput: send write requests from a fixed set of 1,000,000 records at increasing QPS, cycling through the records; take the peak QPS of successful requests within one minute as the maximum throughput. (A minimal sketch of this load generator appears after the statement examples below.)
- Insert a vertex
- Nebula
INSERT VERTEX t_rich_node (creation_date, first_name, last_name, gender, birthday, location_ip, browser_used) VALUES ${mid}:('2012-07-18T01:16:17.119+0000', 'Rodrigo', 'Silva', 'female', '1984-10-11', '84.194.222.86', 'Firefox')
- Dgraph
{ set { <${mid}> <creation_date> "2012-07-18T01:16:17.119+0000" . <${mid}> <first_name> "Rodrigo" . <${mid}> <last_name> "Silva" . <${mid}> <gender> "female" . <${mid}> <birthday> "1984-10-11" . <${mid}> <location_ip> "84.194.222.86" . <${mid}> <browser_used> "Firefox" . } }
- HugeGraph
g.addVertex(T.label, "t_rich_node", T.id, ${mid}, "creation_date", "2012-07-18T01:16:17.119+0000", "first_name", "Rodrigo", "last_name", "Silva", "gender", "female", "birthday", "1984-10-11", "location_ip", "84.194.222.86", "browser_used", "Firefox")
- Insert an edge
- Nebula
INSERT EDGE t_edge () VALUES ${mid1}->${mid2}:();
- Dgraph
{ set { <${mid1}> <link> <${mid2}> . } }
- HugeGraph
g.V(${mid1}).as('src').V(${mid2}).addE('t_edge').from('src')
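For reference, below is a minimal sketch of the fixed-QPS response-time measurement described in 4.2.1. The send_write() function is a hypothetical placeholder for each database's client call, not part of any vendor SDK; the maximum-throughput test repeats the same loop at increasing QPS.

```python
# Minimal sketch of the fixed-QPS response-time test described above.
import threading
import time

import numpy as np

latencies = []
lock = threading.Lock()

def send_write(statement):
    # Placeholder: issue `statement` with the database's client and wait for the reply.
    pass

def issue(statement):
    start = time.perf_counter()
    send_write(statement)
    with lock:
        latencies.append(time.perf_counter() - start)

def run(statements, qps):
    """Send the prepared INSERT statements at a fixed QPS, then report latencies."""
    interval = 1.0 / qps
    workers = []
    for stmt in statements:            # e.g. 50,000 prepared statements
        t = threading.Thread(target=issue, args=(stmt,))
        t.start()
        workers.append(t)
        time.sleep(interval)           # pace requests to hold the target QPS
    for t in workers:
        t.join()
    print("avg=%.1f ms  p99=%.1f ms  p999=%.1f ms" % (
        np.mean(latencies) * 1000,
        np.percentile(latencies, 99) * 1000,
        np.percentile(latencies, 99.9) * 1000,
    ))
```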
4.2.2 Test results
- Real time write
4.2.3 Data analysis
- Nebula: as analyzed in section 4.1.3, Nebula's write requests are spread across multiple storage nodes, giving it noticeably better response times and throughput.
- Dgraph: as analyzed in section 4.1.3, the same relationship type can only be stored on one data node, resulting in poor throughput.
- HugeGraph: because its storage backend is HBase, its real-time concurrent read and write capability is lower than that of RocksDB (Nebula) and BadgerDB (Dgraph), so it has the worst performance.
4.3 Data Query
4.3.1 Test Instructions
- Three common query patterns are used to test the read performance of the graph database: an N-hop query returning IDs, an N-hop query returning attributes, and a mutual-friend query. (A sketch of how the ${n} placeholder is expanded into each query language appears at the end of this subsection.)
- Response time: send a fixed set of 50,000 queries as read requests at a fixed QPS, ending when all have been sent. Measure the Avg, P99, and P999 latency from the moment the client sends a request to the moment it receives the response.
- If no result is returned within 60 seconds, the query is counted as a timeout.
- Maximum throughput: send read requests from a fixed set of 1,000,000 queries at increasing QPS, cycling through the queries; take the peak QPS of successful requests within one minute as the maximum throughput.
- Cache configuration: all the graph databases under test have a read cache, which is enabled by default. The service is restarted before each test to clear the cache.
- N-hop query returning IDs
- Nebula
GO ${n} STEPS FROM ${mid} OVER person_knows_person
- Dgraph
{ q(func: uid(${mid})) { uid person_knows_person { uid } } } (nest the person_knows_person block to a depth of ${n})
- HugeGraph
g.V(${mid}).out().id() (chain ${n} out() steps for an ${n}-hop query)
- N-hop query returning attributes
- Nebula
GO ${n} STEPS FROM ${mid} OVER person_knows_person YIELD person_knows_person.creation_date, $$.person.first_name, $$.person.last_name, $$.person.gender, $$.person.birthday, $$.person.location_ip, $$.person.browser_used
- Dgraph
{ q(func: uid(${mid})) { uid first_name last_name gender birthday location_ip browser_used person_knows_person { uid first_name last_name gender birthday location_ip browser_used } } } (nest the person_knows_person block to a depth of ${n})
- HugeGraph
g.V(${mid}).out() (chain ${n} out() steps for an ${n}-hop query)
- Mutual friend query statement
- Nebula
GO FROM ${mid1} OVER person_knows_person INTERSECT GO FROM ${mid2} OVER person_knows_person
- Dgraph
{ var(func: uid(${mid1})) { person_knows_person { M1 as uid } } var(func: uid(${mid2})) { person_knows_person { M2 as uid } } in_common(func: uid(M1)) @filter(uid(M2)) { uid } }
- HugeGraph
g.V(${mid1}).out().id().aggregate('x').V(${mid2}).out().id().where(within('x')).dedup()
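The ${n} placeholder in the statements above is expanded by the test driver before a query is sent: Nebula takes the hop count directly in the GO ... STEPS clause, Dgraph expresses it as the nesting depth of the person_knows_person block, and Gremlin as the length of the out() chain. A minimal sketch of this expansion is shown below; the function names are our own illustration, not part of any SDK.

```python
# Minimal sketch of expanding the ${n} placeholder in the n-hop queries above.
def nebula_n_hop(mid, n):
    # Nebula takes the hop count directly as the STEPS argument.
    return f"GO {n} STEPS FROM {mid} OVER person_knows_person"

def dgraph_n_hop(mid, n):
    # Dgraph expresses the hop count as the nesting depth of the relationship block.
    body = "uid"
    for _ in range(n):
        body = f"person_knows_person {{ {body} }}"
    return f"{{ q(func: uid({mid})) {{ uid {body} }} }}"

def gremlin_n_hop(mid, n):
    # Gremlin expresses the hop count as the length of the out() chain.
    return f"g.V({mid})" + ".out()" * n + ".id()"

print(dgraph_n_hop("0x1", 2))
# { q(func: uid(0x1)) { uid person_knows_person { person_knows_person { uid } } } }
```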
4.3.2 Test results
- N-hop query returning IDs
- N-hop query returning attributes
The average size of the attributes of a single returned node is 200 bytes.
- Mutual friend query
Maximum throughput was not tested for this item.
4.3.3 Data analysis
- In the "response time" experiment for the 1-hop query returning IDs, both Nebula and Dgraph need only a single out-edge lookup. Because of Dgraph's storage model, all edges of the same relationship sit on one node, so a 1-hop query requires no network communication, whereas Nebula's entities are distributed across multiple nodes; as a result, Dgraph's response time in this experiment was slightly better than Nebula's.
- In the "maximum throughput" experiment for the 1-hop query returning IDs, the CPU load of the Dgraph cluster fell almost entirely on the single node storing the relationship, leaving overall cluster CPU utilization low, and its maximum throughput was only 11% of Nebula's.
- In the "response time" experiment for the 2-hop query returning IDs, for the same reason, Dgraph was already approaching its cluster load limit at QPS=100, and its response time was significantly slower than Nebula's, about 3.9 times as long.
- In the experiment on the 1-hop query returning attributes, Nebula stores all attributes of an entity as a single data structure on one node, so it needs only Y lookups (Y being the number of out-edges). Dgraph treats every attribute of an entity as an out-edge, and these are distributed across different nodes, so it needs an additional X × Y out-edge lookups (X attributes times Y out-edges); for example, returning 7 attributes for 100 one-hop neighbors costs roughly 700 extra lookups. Its query performance is therefore lower than Nebula's, and the same reasoning applies to multi-hop queries.
- In the mutual-friend experiment, the query is essentially equivalent to two 1-hop queries returning IDs, so the results are close to those above and are not discussed in detail.
- Because HugeGraph's storage backend is HBase, whose real-time concurrent read and write capability is lower than that of RocksDB (Nebula) and BadgerDB (Dgraph), it lagged behind Nebula and Dgraph in every experiment.
5. Conclusion
Nebula's batch import availability, import speed, real-time write performance, and multi-hop query performance were all superior to those of the competing graph databases tested, so we chose Nebula as our graph storage engine.
6. Reference materials
- NebulaGraph Benchmark: discuss.nebula-graph.com.cn/t/topic/782
- NebulaGraph Benchmark (WeChat team): discuss.nebula-graph.com.cn/t/topic/101…
- Dgraph Benchmark: dgraph.io/blog/tags/b…
- HugeGraph Benchmark: hugegraph.github.io/hugegraph-d…
- TigerGraph Benchmark: www.tigergraph.com/benchmark/
- RedisGraph Benchmark: redislabs.com/blog/new-re…
This performance test was written by Gao Chen and Zhao Dengchang of the Meituan NLP team. If you have any questions about this article, you are welcome to discuss them with the authors in the original post: discuss.nebula-graph.com.cn/t/topic/137…