This article was first published on NebulaGraph's official account, NebulaGraphCommunity. Follow NebulaGraphCommunity for more graph database practices from large tech companies.
【Author Introduction】
- Qi Mingyu: Kuaishou Security, mobile security group, mainly responsible for building the Kuaishou security intelligence platform
- Ni Wen: Kuaishou Data Platform, distributed storage group, mainly responsible for building the Kuaishou graph database
- Yao Jingyi: Kuaishou Data Platform, distributed storage group, mainly responsible for building the Kuaishou graph database
【Company Profile】
Kuaishou is a leading global content community and social platform. Its mission is to help people discover what they need and show what they are good at through short videos, continuously improving everyone's unique sense of happiness.
1. Why a Graph Database Is Needed
Traditional relational databases perform poorly on complex relationship queries: as data volume and association depth grow, they can no longer return results within an acceptable time.
Therefore, to better express the connections between data, enterprises need a database technology that stores relationships as first-class entities and supports a flexible, extensible data model, which is exactly what a graph database provides.
Compared with a traditional relational database, a graph database has two main advantages:
First, graph databases express the relationships between data well.
As the graph model above shows, the goal of a graph database is to present these relationships intuitively based on a graph model. Because the model directly expresses relationships between things, a graph database is naturally interpretable.
Second, graph databases are good at handling relationship queries over the data:
- High performance: a traditional relational database relies mainly on JOIN operations to handle relational data. As data volume and association depth grow, multi-table joins and foreign-key constraints introduce large overhead and serious performance problems. A graph database adopts a graph-model data structure at the storage layer, which makes data query and analysis much faster.
- Flexible: a graph database has a very flexible data model. Users can adjust the graph schema at any time as the business changes, arbitrarily adding or removing vertices and edges and expanding or shrinking the model; such frequent schema changes are well supported by a graph database.
- Agile: the graph model is very intuitive and supports a test-driven development style, allowing functional and performance testing on each build. This fits today's agile development practices and also helps improve production and delivery efficiency.
Based on these two advantages, graph databases are in great demand in fields such as financial anti-fraud, criminal investigation, social networks, knowledge graphs, data lineage, IT assets and operations, and threat intelligence.
Kuaishou security intelligence integrates security data across the whole chain, covering mobile, PC web, cloud, alliance, and mini programs, and ultimately forms a unified set of basic security capabilities to empower the company's business.
Because security intelligence is characterized by diverse data entities, complex associations, and rich data labels, a graph database is the most suitable choice for it.
2. Why Nebula Graph
After gathering requirements and doing preliminary research, Kuaishou security intelligence selected Nebula Graph as the graph database for its production environment.
2.1 Requirement Collection
For graph database selection, the main requirements fall into two areas, data writing and data querying:
- Data writing: offline + online
  - Daily offline batch import must be supported; around ten billion new records are written each day, and each day's relationship data must be written within hours
  - Real-time writes must also be supported: Flink consumes data from Kafka, applies the business logic, and writes directly into the graph database, sustaining on the order of 100,000 (10W) QPS
- Data querying: online, real-time queries at millisecond latency, sustaining around 50,000 (5W) QPS
  - Filtering and querying on vertex and edge properties
  - Multi-hop relationship queries
  - Some basic graph-analysis capability
  - Graph shortest-path queries
To sum up, the graph database selected for this big-data architecture mainly needs to provide three basic capabilities: real-time and offline data writing, basic graph queries, and simple OLAP analysis on top of the graph database. Its positioning is an online, high-concurrency, low-latency OLTP graph query service, plus simple OLAP graph query capability.
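To make these query patterns concrete, here is a minimal nGQL sketch. The vertex IDs, edge names (device_account, account_user), and the weight property are hypothetical illustrations, not the production schema:

```ngql
# Property filtering on edges reached in one hop (hypothetical schema)
GO FROM hash("device_001") OVER device_account
  WHERE device_account.weight >= 0.5
  YIELD device_account._dst AS account, device_account.weight AS weight;

# Multi-hop relationship query, chaining hops with pipes and truncating each hop
GO FROM hash("device_001") OVER device_account YIELD device_account._dst AS acc | LIMIT 100
  | GO FROM $-.acc OVER account_user YIELD account_user._dst AS user | LIMIT 100;

# Basic graph analysis: shortest path between two entities
FIND SHORTEST PATH FROM hash("device_001") TO hash("user_007") OVER * UPTO 5 STEPS;
```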
2.2 Selection
Based on the deterministic requirements above, we mainly considered the following points when selecting a graph database:
- The database must support a large enough data volume, because enterprise graph data often reaches tens or even hundreds of billions of records
- The cluster must scale linearly, since machines need to be added online while the production environment keeps serving traffic
- Query performance must stay at millisecond level to meet online service requirements, and must not degrade as the graph grows
- It must integrate easily with big-data platforms such as HDFS and Spark, so a graph computing platform can be built on top of it
2.3 Nebula Graph’s characteristics
- High performance: Provides reads and writes in milliseconds
- Scalable: can be horizontally scaled to support large-scale graph storage
- Engine architecture: Storage and computing separated
- Graph data model: vertices and edges, with support for properties on both
- Query language: nGQL, a SQL-like query language that is easy to learn and use and can express complex business requirements
- Provides rich and mature data import and export tools
- Nebula Graph is an open source graph database product with an active presence in the open source community
- Nebula Graph shows a dramatic query-performance improvement over JanusGraph and HugeGraph
Because of these characteristics, and because Nebula Graph matched our background and requirements, it was chosen as the graph database for our production environment.
3. Graph Data Modeling of Security Intelligence
As shown in the figure below, from an intelligence point of view, the difficulty of layered security attack-and-defense increases gradually from bottom to top:
Previously, attack-and-defense work on each plane was handled separately. With a graph database, the levels can now be tied together through relationships between real IDs, forming a three-dimensional, layered network. Through this network, the enterprise can quickly grasp the whole picture of an attacker's methods, cheating tools, and gang characteristics.
Therefore, graph-structured modeling of security data turns the original flat identification into three-dimensional, networked identification, helping the enterprise identify attacks and risks more clearly and accurately.
3.1 Basic Graph Structure
The main purpose of graph modeling for security intelligence is to judge the risk of any dimension not only from the state and attributes of that dimension itself, but by extending the individual into a network: through the graph-structured data, the risk of that dimension can be observed three-dimensionally, both across layers (heterogeneous graphs) and within the same layer (homogeneous graphs).
Take device risk as an example. Around a device there are four layers: the network layer, the device layer, the account layer, and the natural-person layer, each represented by its representative entity ID. With the graph database, the risk of a device can be understood at this three-dimensional level, which is very helpful for risk identification.
As shown in the figure above, this is the basic graph structure for security intelligence, which forms a knowledge graph built on security intelligence.
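A minimal nGQL schema sketch of this layered modeling, assuming hypothetical tag and edge names for the network, device, account, and natural-person layers (the real production schema is richer); the time and weight properties discussed in the next two subsections would be added to these edges:

```ngql
# One tag per layer, each identified by its representative entity ID (hypothetical names)
CREATE TAG ip(ip string);
CREATE TAG device(device_id string);
CREATE TAG account(account_id string);
CREATE TAG user(real_id string);

# Edges connect adjacent layers, turning a flat view of one dimension into a layered network
CREATE EDGE device_ip();
CREATE EDGE account_device();
CREATE EDGE user_account();
```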
3.2 Dynamic graph structure
On top of the basic graph structure, we also have to consider that every relationship is time-bound: an association that exists in period A does not necessarily exist in period B. We therefore want the security-intelligence graph database to truly reflect the objective reality of relationships across different periods.
This means the data needs to be presented with different graph structures depending on the queried time window, which is what we call the dynamic graph structure.
One question in designing the dynamic graph structure is: for a given queried time window, which edges should be returned?
As shown in the figure above, this edge should be returned when the queried time window is B, C, or D, and should not be returned when the window is A or E.
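A sketch of how this can be expressed in a query, assuming the edge carries create_time and update_time properties marking when the relationship was first and last observed (the same pattern the production queries in section 4.4.1 use): only edges whose lifetime overlaps the queried window are returned.

```ngql
# t_start / t_end are placeholders for the queried time window, filled in by the application.
# An edge is returned only if [create_time, update_time] overlaps [t_start, t_end].
GO FROM hash("device_001") OVER account_device REVERSELY
  WHERE account_device.update_time >= t_start AND account_device.create_time <= t_end
  YIELD account_device._dst AS account,
        account_device.create_time AS first_seen,
        account_device.update_time AS last_seen;
```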
3.3 Weighted Graph Structure
When dealing with black and gray production or deliberate abuse, a single device often has many accounts on it: some are the offenders' own commonly used accounts, and some are accounts they bought specifically for illegal live streaming. To support action by the police or the legal team, we need to distinguish precisely which accounts are the offenders' own everyday accounts and which were merely purchased for abuse.
This is where the weight of the edge between an account and a device comes in: if the account is commonly used on the device, the relationship between them is strong and the edge weight is high; if the account is only used for abuse or illegal live streaming, the relationship is relatively weak and the weight is correspondingly lower.
Therefore, in addition to the time dimension, we also add a weight dimension to the edge properties.
To sum up, the graph model finally adopted for security intelligence is a weighted, dynamic, time-windowed graph structure.
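Putting it together, a hedged sketch of that final model: an edge carrying both the time window and a relationship weight, and a query that keeps only strong relationships inside the window (the names and the 0.8 threshold are illustrative):

```ngql
# Hypothetical final edge definition: time window plus relationship weight
CREATE EDGE account_device(create_time int, update_time int, weight double);

# Commonly used (strongly related) accounts of a device within the window [t_start, t_end]
GO FROM hash("device_001") OVER account_device REVERSELY
  WHERE account_device.update_time >= t_start
    AND account_device.create_time <= t_end
    AND account_device.weight >= 0.8
  YIELD account_device._dst AS account, account_device.weight AS weight;
```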
4. Security Intelligence Service Architecture and Optimization Based on the Graph Database
The overall security intelligence service architecture is shown as follows:
Overall architecture diagram of security intelligence service
The software architecture of the graph-database-based intelligence query platform is as follows:
Software architecture diagram of the intelligence query platform
Note: AccessProxy handles access from the office network to the IDC, and KNGX handles direct calls within the IDC
4.1 Offline Data Write Optimization
For the constructed relationship data, billions of records are updated every day. Making sure these billions of records can be written within hours, without data loss and with anomalies detected in time, is a very challenging task.
The optimization here mainly covers retry on failure, dirty-data detection, and alerting on import failures.
During data import, batch writes can fail for many reasons, such as dirty data, server jitter, the database process hanging, or writing too fast. By using the synchronous client API together with a multi-level retry mechanism and a failure-exit strategy, we solved the write failures and partially successful batches caused by server jitter and restarts.
4.2 HA Guarantee and Switchover mechanism for Two Clusters
In the graph database layer, Kuaishou deploys two clusters, one online and one offline, with data synchronously double-written to both. The online cluster serves online RPC traffic, and the offline cluster serves case analysis and web queries.
At the same time, cluster status monitoring is integrated with dynamic configuration delivery: when one cluster becomes slow or fails, a dynamic configuration change automatically switches traffic over, so upper-layer services are unaware of the failure.
4.3 Cluster stability construction
The open source version of Nebula Graph has been thoroughly investigated, maintained, and improved by the data architecture team.
A Nebula cluster uses a storage-compute separation architecture and is divided into Meta, Graph, and Storage roles, responsible for metadata management, computation, and storage respectively.
Nebula architecture diagram
Nebula's storage layer is the foundation of the graph database engine and supports multiple storage backends. We chose the classic mode, with the C++ implementation of RocksDB as the underlying key-value store and the Raft algorithm for consistency, which allows the cluster to scale out dynamically.
Storage layer architecture diagram
We did a lot of testing, code improvement, and parameter tuning on the storage layer, including optimizing the Raft heartbeat logic, improving leader election and log-offset logic, and tuning Raft parameters to shorten single-cluster recovery time. Combined with improvements to the client retry mechanism, the Nebula engine went from failures being directly visible to users to recovering from failures at the millisecond level.
In terms of the monitoring and alarm system, we have built monitoring on multiple levels of the cluster, and the overall monitoring architecture is shown in the figure below:
Cluster monitoring architecture diagram
Including the following aspects:
- Hardware level: CPU usage, disk util, memory, network, etc.
- Availability of the meta, storage, and graph service interfaces, plus the online status of each role and the distribution of partition leaders across the cluster
- Overall cluster availability, monitored and evaluated from the user's perspective
- Metrics collected and monitored for the meta, storage, RocksDB, and graph components of every role in the cluster
- Slow Query monitoring
4.4 Super-Node Query Optimization
Since vertex degrees in real-world graphs usually follow a power-law distribution, a traversal that hits a super node (out-degree in the millions or tens of millions) causes an obviously slow query at the database level. Keeping online query latency stable and avoiding extreme tail latencies is a problem we had to solve.
The engineering solution to the super-node traversal problem is to reduce the query scale to a level the business can accept. The specific methods are:
- Truncating each hop with limit in the query
- Sampling edges at a certain ratio during the query
Specific optimization strategies are described below:
4.4.1 Limit Truncation Optimization
[Prerequisites]
The business layer can accept limit truncation at each hop, as in the following two queries:
```ngql
go from hash('x.x.x.x') over GID_IP REVERSELY
  where (GID_IP.update_time >= xxx and GID_IP.create_time <= xxx)
  yield GID_IP.create_time as create_time, GID_IP.update_time as update_time,
        $^.IP.ip as ip, $$.GID.gid as gid | limit 100   # the intermediate query results are truncated

go from hash('x.x.x.x') over GID_IP REVERSELY
  where GID_IP.update_time >= xxx and GID_IP.create_time <= xxx
  yield GID_IP._dst as dst | limit 100
  | go from $-.dst ..... | limit 100
```
[Before optimization]
For the second query, before optimization the storage layer iterates over all destination vertices and only the graph layer truncates to limit N before returning to the client, so the heavy iteration cannot be avoided.
Status quo: Nebula supports the cluster-level storage parameter max_edge_per_vertex, but the limit cannot be specified flexibly per query statement, and an exact per-statement limit cannot be achieved for multi-hop traversals over vertices with very different degrees.
[Optimization Ideas]
A one-hop go traversal executes in two steps:
- Step 1: scan the out-edges of srcVertex to obtain the IDs of all destVertex
- Step 2: fetch the property values of all destVertex
Each hop of a multi-hop go traversal then falls into one of two cases:
- Case 1: only Step 1 is executed
- Case 2: both Step 1 and Step 2 are executed
Step 2 is the expensive part (fetching each destVertex's properties is a RocksDB iterator lookup, about 500 us on a cache miss), so for vertices with high out-degree it is critical to apply the limit truncation before Step 2. In addition, pushing the limit down into the Step 1 storage scan also brings large benefits for super nodes.
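To make the two cases concrete, here is a sketch reusing the GID_IP / GID names from the queries above: the first statement only needs the destination IDs produced by Step 1, while the second also reads a destination-vertex property through $$ and therefore triggers the expensive Step 2, which is exactly where early limit truncation pays off on a super node.

```ngql
# Case 1: only Step 1 - the hop yields just edge endpoints (_dst), no vertex property reads
GO FROM hash("x.x.x.x") OVER GID_IP REVERSELY
  YIELD GID_IP._dst AS dst | LIMIT 100;

# Case 2: Step 1 + Step 2 - $$.GID.gid fetches a property of every destination vertex,
# so truncating before this step avoids huge numbers of RocksDB point lookups
GO FROM hash("x.x.x.x") OVER GID_IP REVERSELY
  YIELD GID_IP._dst AS dst, $$.GID.gid AS gid | LIMIT 100;
```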
Here we summarize the conditions under which “limit truncation optimization” can be performed and its benefits:
(Notation: n is the vertex out-degree, N is the limit value, scan denotes the cost of scanning edges, and get denotes the cost of fetching vertex properties.)
[Test Results]
For Case 1 and Case 2 above, limit truncation optimization can be applied and the benefit is obvious; the security business queries belong to Case 2. The tests were run in the scenario where the RocksDB cache is not hit.
The test results above show that, after our optimization, super-node query latency in the graph is excellent.
4.4.2 Edge sampling optimization
For scenarios where limit truncation alone is not enough, we adopt edge sampling. Building on the Nebula community's native support, at the storage process level, for a maximum number of returned edges and for edge sampling per vertex, we extended it to support the following:
- After sampling is enabled on storage, the scan can be limited to max_iter_edge_for_sample edges instead of scanning all edges (the default behavior)
- The graph layer supports per-hop degree sampling in go traversals
- Whether sampling is enabled (enable_reservoir_sampling) and the maximum number of edges returned per vertex (max_edge_returned_per_vertex) both support session-level parameters, on both storage and graph
With these capabilities, the business can flexibly adjust the query sampling ratio and control the traversal scale, keeping online service latency smooth.
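A sketch of what the corresponding storage-side configuration might look like, using the flag names described above (the values are illustrative, and in our setup they can also be overridden at the session level):

```
# nebula-storaged flags (sketch; values are illustrative)
--enable_reservoir_sampling=true        # turn on reservoir sampling during edge scans
--max_edge_returned_per_vertex=10000    # cap the edges returned for any single vertex
--max_iter_edge_for_sample=10000        # scan at most this many edges when sampling
```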
4.5 Query Client Modification and Optimization
The open source Nebula Graph comes with its own set of clients, and we made some modifications and optimizations to fit them into Kuaishou's services. They mainly solve the following two problems:
- Connection pooling: the official Nebula Graph client exposes a low-level interface in which every query has to create, initialize, use, and close a connection. Creating and closing connections at high query frequency seriously affects system performance and stability. In practice we re-encapsulated the official client with connection pooling and added monitoring of every phase of the connection life cycle, achieving connection reuse and sharing and improving service stability.
- Automatic failover: by monitoring exceptions in every phase of connection establishment, initialization, query, and destruction, together with periodic health checks, we detect faulty nodes in a cluster in real time and fail over automatically. If an entire cluster becomes unavailable, traffic is switched to the standby cluster, reducing the potential impact of cluster failures on online availability.
4.6 Visualization and download of query results
For queries with a fixed relationship pattern (hard-coded nGQL), the front end renders a customized visualization from the returned results, as shown in the following figure:
The front end uses ECharts graph charts here, and we also did some optimization on loading and displaying the graph-structured data.
Problem 1: the chart needs to show the details of each node, but the ECharts graph chart only displays simple values.
Solution: modify the original code to add a click event to each node, popping up a modal box that shows more details.
Problem 2: after a click event fires, the chart keeps re-laying out for a long time, making it impossible to tell which node was clicked.
Solution: record each node's window position during the first render and pin every node's position once a click event is triggered.
Problem 3: when the chart has many nodes, it becomes crowded.
Solution: enable mouse-wheel zooming and drag roaming.
For flexible relationship queries (arbitrary nGQL), we deployed Nebula Graph Studio for visualization, as shown below:
5. Practice of the Graph Database in Security Intelligence
Based on the graph database architecture and optimizations above, we provide two access modes, web query and RPC query, which mainly support the following Kuaishou businesses:
- Kuaishou security traceability, offline crackdowns, and black/gray production analysis
- Business risk control and anti-cheating
For example, group-controlled (device-farm) devices behave very differently from normal devices in the graph data:
Identification of group-controlled devices:
6. Summary and Outlook
- Stability: cluster HA, with real-time synchronization and automatic access switchover between clusters across availability zones, to guarantee a 99.99% SLA
- Performance: explore storage solutions based on RPC and new AEP hardware, and optimize query execution plans
- Connecting graph computing and graph query: build an integrated platform for graph computing, graph learning, and graph query
- Real-time adjudication: real-time relationship writes and comprehensive real-time risk adjudication
7. Acknowledgements
Thanks to Nebula Graph for the support provided to Kuaishou.
Want to discuss graph database technology? To join the Nebula exchange group, fill in your Nebula card and the Nebula assistant will add you to the group.