Abstract: In this article, 360’s Peng Zhou details the performance improvements that came with moving the business from JanusGraph to Nebula Graph, increasing business performance by at least 20 times with less than a third of the machine resources used in the previous JanusGraph configuration.
The author of this article is Zhou Peng, a development engineer in the 360 Data Department.
The migration background
We originally used the stand-alone AgensGraph for graph data, but later migrated to the distributed graph database JanusGraph because of the performance limits of a single machine. For details, please refer to my previous article, JanusGraph Migration Journey of Ten Billion Graph Data. However, as the data volume and query traffic grew, a new problem appeared: a single query took more than 10 seconds in some business scenarios, and a query with complex logic took 2 to 3 s, which seriously affected the performance of the entire business flow and the development of related services.
The high per-query latency is rooted in JanusGraph's architecture: its storage relies on an external backend that JanusGraph cannot control well. Our production environment uses an HBase cluster, so queries cannot be pushed down to the storage layer for processing; data can only be pulled from HBase into JanusGraph Server's memory and filtered there.
For example, suppose you look for users older than 50 among a vertex's one-hop neighbors, and the one-hop neighborhood contains 1,000 people of whom only 2 are older than 50. Since a JanusGraph query cannot send the property filter on the associated vertices to HBase together with the one-hop traversal, we have to issue concurrent requests to HBase to fetch the attributes of all 1,000 people, filter them in JanusGraph Server's memory, and finally return the two users that meet the condition to the client.
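With Nebula Graph, the same filter travels with the traversal instead of being applied afterwards in the graph layer. Here is a minimal nGQL sketch of such a query; the person tag, its name and age properties, the relation edge type, and the hashed start ID are illustrative assumptions, not our actual schema:

# One-hop traversal with a property filter on the destination vertices.
# The WHERE condition on destination vertex properties is handled close to
# the data, so only the matching neighbors (2 users instead of 1,000) are
# returned to the graph layer.
GO FROM hash("user_10001") OVER relation
WHERE $$.person.age > 50
YIELD $$.person.name AS name, $$.person.age AS age;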
The problem is that a lot of disk I/O and network I/O is wasted, and most of the data returned by the query is never used in the subsequent processing. In the production environment, the HBase cluster consists of 19 servers equipped with SSDs. The following figure shows their network I/O and disk I/O usage.
For the same business scenario, Nebula Graph needs only 6 SSD servers with the same configuration. Its disk I/O and network I/O are shown below:
Nebula Graph performs far better while using only about 30 percent of the machine resources of the previous HBase cluster. With Nebula Graph, the business scenarios that took 2 to 3 s to query now return in about 100 ms, and the scenarios that took 10 to 20 s now return in about 2 s. The average query time is around 500 ms, a performance improvement of at least 20 times 🙂
If you're still using JanusGraph, you should immediately forward this article to your manager and start migrating to Nebula Graph 👏
Historical Data Migration
Nebula Graph provides Spark Writer, a Spark-based import tool, so the data migration went smoothly even though we have a lot of data: 2 billion vertices and 20 billion edges. One experience worth sharing: importing in asynchronous mode with the Spark import tool caused a lot of errors, and switching the import to synchronous writes made the problem go away. Another lesson concerns Spark itself: when the imported data volume is large, the number of partitions needs to be set accordingly large. We set 80,000 partitions. If the number of partitions is too small, the data volume of a single partition will be too large, which may cause the Spark task to fail with OOM.
Query tuning
We are currently running Nebula Graph 1.0 in production. For vertex IDs in the production environment we use the hash function, because importing with UUID is slow and UUID will no longer be officially supported.
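For reference, here is a minimal sketch of how hashed IDs are used in nGQL; the person tag, its properties, and the business key are illustrative assumptions:

# Generate an int64 vertex ID from a business key with the built-in hash()
INSERT VERTEX person(name, age) VALUES hash("user_10001"):("user_10001", 52);
# The same hash() expression can be used wherever a vertex ID is expected
FETCH PROP ON person hash("user_10001");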
The main tuning configurations in our production environment are as follows, primarily for the Nebula storage service (nebula-storaged):
# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096
# The default block cache size used in BlockBasedTable. The unit is MB.
# Our production servers have 128 GB of memory.
--rocksdb_block_cache=44024
############################ rocksdb Options ############################
--rocksdb_disable_wal=true
# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}
# rocksdb ColumnFamilyOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"disable_auto_compactions":"false","write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"8192"}
--max_handlers_per_req=10
--heartbeat_interval_secs=10
--raft_rpc_timeout_ms=5000
--raft_heartbeat_interval_secs=10
--wal_ttl=14400
--max_batch_size=512
# The following parameters reduce memory usage
--enable_partitioned_index_filter=true
--max_edge_returned_per_vertex=10000
Linux machine tuning mainly consists of disabling the swap service, because swapping to disk hurts query performance. As for compaction tuning, our production environment keeps minor compaction on and turns major compaction off. The main reason for turning off major compaction is that it consumes a lot of disk I/O and is hard to throttle even by limiting the number of threads (--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}); the Nebula Graph team has official plans to improve this.
Finally, a special mention for the max_edge_returned_per_vertex parameter, which shows that the Nebula Graph team are veterans of the graph database industry. Our previous graph queries were plagued by super nodes: in the online environment, a query that hit a super node associated with millions of records could crash JanusGraph's HBase cluster, which happened several times in our production environment. Limiting the Gremlin statements at the application layer did not solve the problem well, but Nebula Graph's max_edge_returned_per_vertex parameter alone deserves five stars: the data is filtered directly at the bottom storage layer, so the production environment is no longer troubled by super nodes.
This article was originally published on the Nebula Graph forum. If you have any questions while reading it, please go to the Nebula Graph forum and discuss with the author. The original post is at discuss.nebula-graph.com.cn/t/topic/117…