Preface
Many open source graph databases are now available, and various vendors have also built their own graph databases (mostly closed source) to meet their own customization requirements. How should you choose among them? This article compares HugeGraph, JanusGraph, Nebula, ByteGraph, and EasyGraph.
To paraphrase an expert in the field:
- The key to graph query lies in the efficiency of visualization and real-time correlation analysis.
- The core role of graph computing is performance acceleration in global correlation computing.
- Graph learning is the most closely tied to current business needs and has the most visible effect.
- Graphs should be adopted only after a business bottleneck has been hit.
- Graph products should focus on business needs and user experience rather than on graph technology itself.
- Most current graph databases are not qualified databases and do not meet the definition of a database.
- Many of the "pitfalls" attributed to graphs are caused by a few impetuous people in the industry, not by graphs themselves.
First, look at the production application scenarios
- Fraud detection:
- Scam-call feature extraction: for example, the caller is not within the callee's three-hop social neighborhood, and the calls are rejected in large numbers. A graph database can screen for such features so that scam numbers are identified quickly.
- Bank card or Alipay transfers: in my personal experience, Alipay shows a risk warning when you transfer money to a stranger. For example, if the two parties are not within a k-hop circle of each other, the transfer can be intercepted in time.
- Bonus-hunter ("wool party") identification: when an app releases promotions, organized groups always turn up to exploit them. A graph database can screen the relationships among accounts, mobile phones, and IP addresses. If one device has more than one account logged in, those accounts are judged to be bonus hunters, which cuts the company's losses in time.
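The two fraud checks above can be sketched in a few lines. This is a minimal in-memory illustration, not how any of the compared databases implements it: the function names `within_k_hops` and `shared_devices`, the toy `contacts` graph, and the `logins` list are all hypothetical, and a real system would run the equivalent query inside the graph database.

```python
from collections import defaultdict, deque

def within_k_hops(graph, src, dst, k):
    """Return True if dst is reachable from src in at most k hops (BFS)."""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False

def shared_devices(logins, max_accounts=1):
    """Return devices on which more than max_accounts distinct accounts logged in."""
    accounts_by_device = defaultdict(set)
    for account, device in logins:
        accounts_by_device[device].add(account)
    return {device: accounts
            for device, accounts in accounts_by_device.items()
            if len(accounts) > max_accounts}

# Hypothetical toy data: a contact graph and (account, device) login records.
contacts = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
    "mallory": [],
}
logins = [("acct1", "dev-A"), ("acct2", "dev-A"), ("acct3", "dev-B")]

print(within_k_hops(contacts, "alice", "dave", 2))     # stranger at k = 2
print(within_k_hops(contacts, "alice", "dave", 3))     # inside the circle at k = 3
print(shared_devices(logins))
```

A transfer or call between two parties for which `within_k_hops` is false would be a candidate for a warning, and every device returned by `shared_devices` is a candidate bonus-hunter cluster. The thresholds are business decisions, not something the sketch can settle.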
2. Representing the relationship between companies and their legal representatives/shareholders
One person can own many companies, one company can be held by many people, and several people can hold shares jointly, among other relations. To check a person together with his or her related companies, risks, and shareholdings, we can query the graph. The Chinese service Tianyancha ("Tianyan search") is built on this scenario.
At present, banks lag behind in identifying complex relationships such as shadow groups, multi-layer cross-shareholding among group customers, and nested equity. As enterprises become more group-based, family-controlled, and diversified, single enterprises build commercial empires through capital operations. Equity inside these capital systems is opaque; hidden shareholders and proxy shareholding are common; and the major shareholder, controlling shareholder, actual controller, persons acting in concert, and ultimate beneficiary are often unclear. Relationships between enterprises and shareholders, and among enterprises themselves, are increasingly complex. A credit-risk event at a single customer may spread risk to the whole group of related customers.
Companies, executives, and affiliated companies form a complex network. A graph computing engine can take data from the National Enterprise Credit Information Publicity System, traverse group members, shareholders, and the equity structure among affiliated enterprises, and determine whether there is cross-shareholding or control by the same shareholders or senior executives. Identifying these hidden relationships helps uncover clues to violations such as disguising related-party transactions as unrelated ones and transferring benefits through related-party transactions.
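Cross-shareholding shows up in the graph as a cycle of "holds shares in" edges. The following is a minimal sketch of that idea using a depth-first search. The `holds` dictionary and the entity names are made up for illustration, and a production graph engine would use its own traversal primitives rather than this recursive function.

```python
def find_cycle(holds):
    """Return one shareholding cycle as a list of entities, or None if there is none.

    `holds` maps an entity to the list of companies it holds shares in.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited / on the current path / finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in holds.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: the path closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(holds):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# Hypothetical data: PersonX -> CoA -> CoB -> CoC -> CoA (a cross-shareholding loop).
holds = {
    "PersonX": ["CoA"],
    "CoA": ["CoB"],
    "CoB": ["CoC"],
    "CoC": ["CoA"],
}
print(find_cycle(holds))
```

The same traversal, run from a person instead of looking for a cycle, yields every company that person reaches through layers of holdings, which is the starting point for actual-controller analysis.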
3. Representing automobile supply chain relationships
Background: producing a car takes thousands of parts. Different models share some parts and differ in others, some parts have only one supplier, and inventory levels vary from part to part.
The questions are: when the manufacturer receives an order, can all of the thousands of parts needed for that model cover it? Even if this order can be met, are the production requirements of other models that use the same parts still met? How do we find the supply of bottleneck parts (those without which the whole workshop stops)? Are some parts suppliers at risk? Do we need to add backup suppliers?
This is the graph scenario seen from the raw-materials side. Treat cars and parts as entities and the required parts as relations between them: car -> needs -> part, with one edge for each such relationship. With other attributes on the edges, we can directly express how many car models need a given part and in what quantity. In a traditional database, collecting this data requires joining a large number of parts tables and model tables.
There is also the need for price calculation. For example, a price rise in some bulk commodities raises the price of certain parts, and the manufacturer must recalculate the profit margin to arrive at a final price. How many car models are affected by the price fluctuation of this batch of parts? With traditional statistics, we would work through the dependencies and part prices of each model one by one, which takes a lot of manual effort. In the graph model, we only need to enter the new part price, and the price of every dependent model that preserves the profit margin can then be calculated. This greatly improves efficiency.
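The car -> needs -> part model above can be sketched with plain dictionaries to show which questions become simple lookups. Everything here (model names, part names, quantities, stock levels, suppliers, and the helper functions) is invented for illustration. It is a sketch of the data model, not of any particular graph database's API.

```python
from collections import defaultdict

# Hypothetical "needs" edges: (car model, part, quantity per car).
needs = [
    ("ModelA", "engine-E1", 1),
    ("ModelA", "wheel-W1", 4),
    ("ModelB", "engine-E1", 1),
    ("ModelB", "wheel-W2", 4),
]
stock = {"engine-E1": 10, "wheel-W1": 30, "wheel-W2": 100}
suppliers = {"engine-E1": ["S1"], "wheel-W1": ["S2", "S3"], "wheel-W2": ["S3"]}

def build_index(edges):
    """Index the edges in both directions: model -> parts and part -> models."""
    parts_of = defaultdict(dict)
    models_of = defaultdict(dict)
    for model, part, qty in edges:
        parts_of[model][part] = qty
        models_of[part][model] = qty
    return parts_of, models_of

def shortages(parts_of, stock, model, units):
    """Parts whose stock cannot cover `units` cars of `model`, with the shortfall."""
    return {part: qty * units - stock.get(part, 0)
            for part, qty in parts_of[model].items()
            if qty * units > stock.get(part, 0)}

def single_source_parts(suppliers):
    """Parts with exactly one supplier: candidates for a backup supplier."""
    return sorted(part for part, names in suppliers.items() if len(names) == 1)

def cost_impact(models_of, part, price_delta):
    """Extra cost per car for every model that depends on `part`."""
    return {model: qty * price_delta for model, qty in models_of[part].items()}

parts_of, models_of = build_index(needs)
print(shortages(parts_of, stock, "ModelA", 10))
print(single_source_parts(suppliers))
print(cost_impact(models_of, "engine-E1", 500))
```

The reverse index `models_of` is what replaces the multi-table join: "which models are affected by this part" is a one-hop lookup from the part vertex.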
- Replacing large numbers of table joins/transformations
In data warehouse development, cross-table joins generate many intermediate tables that serve no other purpose. A graph database avoids these tables and represents the target data directly in a relationship model.
- Knowledge graphs: building an industry knowledge graph
For example, you can look up the movies on which an actor and a director have worked together, along with information about those works, and run similar queries about the relationship between any two entities.
- Loan flow monitoring/post-loan real-time monitoring and early warning
The flow of credit funds has always been a focus of supervision. Credit funds illegally entering the stock market, the real estate market, and similar areas are a key target of regulatory inspection, and regulators require banks to monitor where credit funds actually go. The entity relationships built from bank account numbers, transfer amounts, company names, and other data are called a "capital flow knowledge graph".
The graph starts at any account that receives a bank loan and ends at the final account reached when tracing the flow of the loan funds. In the example, the loan funds pass through 43 layers of accounts.
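Tracing loan funds amounts to enumerating transfer paths from the lending account to accounts of interest. The sketch below does this over a toy transfer graph. The account names, the `flagged` set (standing in for, say, securities accounts), and the `max_depth` limit are assumptions for illustration only.

```python
def trace_to_flagged(transfers, start, flagged, max_depth=50):
    """Yield every simple transfer path from `start` to an account in `flagged`.

    `transfers` maps an account to a list of (receiving account, amount) pairs.
    """
    path = [start]

    def walk(node):
        if node in flagged and node != start:
            yield list(path)          # copy, because `path` keeps changing
            return
        if len(path) > max_depth:
            return                    # stop runaway traversals
        for nxt, _amount in transfers.get(node, ()):
            if nxt not in path:       # keep paths simple (no repeated account)
                path.append(nxt)
                yield from walk(nxt)
                path.pop()

    yield from walk(start)

# Hypothetical transfers starting from a loan account.
transfers = {
    "loan-001": [("acc-A", 100), ("acc-B", 50)],
    "acc-A": [("acc-C", 90)],
    "acc-B": [("acc-A", 10)],
    "acc-C": [("stock-9", 90)],
}
flagged = {"stock-9"}

for p in trace_to_flagged(transfers, "loan-001", flagged):
    print(len(p) - 1, "layers:", " -> ".join(p))
```

The number of layers a path crosses is simply its length minus one, which is how a figure such as "43 layers of accounts" would be measured.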
- Housing recommendation, customer maintenance, influence classification
Nodes: broker, house, client
Relationships: browse, follow, view, etc.
For example, when a user frequently browses, follows, or inquires about a house, the agent who maintains that house (A1) invites the agent who maintains that user (A2) to take the customer to see it. Beike Zhaofang ("shell housing search") currently uses a graph database for this scenario, and can also use it to identify fake listings, fake customers, fake showings, off-the-books deals, and so on.
2. Graph storage structure
Generally speaking, graph database storage on the market falls into the following three categories:
- One KV pair per edge (hugegraph/nebula)
  - Simple to implement
  - Low write amplification; suited to write-heavy workloads
  - One-hop neighbor queries are implemented with a KV scan, and performance degrades in some scenarios, such as reading all edges of a vertex
- One KV pair stores all edges of a source vertex (janusgraph/titan/easygraph)
  - Relatively simple to implement
  - High write amplification: adding or updating a single edge requires rewriting the whole row. A read takes one lookup but must fetch all edges at once, so this layout suits read-heavy workloads that read all edges; any other read pattern suffers read amplification
- Multiple KV pairs, organized in a structure such as a B-tree, store all edges of a source vertex (bytegraph)
  - Complicated to implement
  - Read and write amplification can be balanced flexibly through configuration
  - Solves the super vertex problem
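To make the difference between the first two layouts concrete, here is a toy comparison. It is a sketch under simplifying assumptions, not the actual storage code of any of these systems: a sorted Python list stands in for an ordered KV store, the class names are hypothetical, and the `edges_written` counter is a crude proxy for write amplification.

```python
import bisect

class EdgePerKV:
    """Layout 1: one KV pair per edge; neighbors come from a prefix scan."""

    def __init__(self):
        self.keys = []          # sorted (src, label, dst) keys, like an ordered KV store
        self.values = {}        # key -> edge properties
        self.edges_written = 0

    def put_edge(self, src, label, dst, props):
        key = (src, label, dst)
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = props
        self.edges_written += 1             # only this one small KV is written

    def neighbors(self, src, label):
        start = bisect.bisect_left(self.keys, (src, label, ""))
        out = []
        for key in self.keys[start:]:       # scan while the prefix still matches
            if key[:2] != (src, label):
                break
            out.append(key[2])
        return out

class AdjacencyPerKV:
    """Layout 2: one KV pair holds every edge of a source vertex."""

    def __init__(self):
        self.rows = {}          # src -> {(label, dst): props}
        self.edges_written = 0

    def put_edge(self, src, label, dst, props):
        row = dict(self.rows.get(src, {}))  # read the whole row
        row[(label, dst)] = props
        self.rows[src] = row                # write the whole row back
        self.edges_written += len(row)

    def neighbors(self, src, label):
        row = self.rows.get(src, {})        # one lookup returns every edge
        return sorted(dst for (lbl, dst) in row if lbl == label)

per_edge, per_vertex = EdgePerKV(), AdjacencyPerKV()
for dst in ["a", "b", "c", "d"]:
    per_edge.put_edge("v1", "likes", dst, {})
    per_vertex.put_edge("v1", "likes", dst, {})
print(per_edge.edges_written, per_vertex.edges_written)
```

Inserting n edges on one vertex writes n edges in the first layout but 1 + 2 + ... + n in the second, which is why the second layout struggles with frequent updates and with super vertices. The third layout splits the big row into pages to bound that cost.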
1.1 easygraph
- Vertex and edge data can be stored separately
- The edge list lets you quickly find neighboring vertices and edge information
- Indexes allow fast vertex/edge lookup by field value
- Snapshot isolation is supported to ensure data consistency in real time
1.2 janusgraph
1.2.1 Point-edge Storage
JanusGraph stores each vertex as an adjacency list: the vertex's properties come first, followed by its incident edges, sorted by sort key.
See the analysis for details
1.3 hugegraph
1.4 bytegraph
- ByteGraph's storage structure is similar to JanusGraph's, but it combines an adjacency list with a B-tree. For an ordinary vertex, the adjacency-list layout is used, so read and write amplification is not noticeable. A super vertex is handled specially: there are two kinds of pages, one storing meta information and one storing edge data, which together form a B-tree.
  - All edges of a super vertex are split evenly by size into multiple edge pages, ordered by part.
  - In the MetaPage, the key is vid + edgeLabel and the value is the partId (for example, vid1 + likes -> part1, ...).
  - An EdgePage is stored like an ordinary vertex, except that the key is the partId; the values are presumably sorted lexicographically. This amounts to a multi-level mapping, which allows pages to be adjusted or split dynamically.
- B-tree structure details
  - All edges of the same type from one source vertex form one storage unit (a KV pair)
  - First-level storage (vertex out-degree below the threshold):
    - Key: source vertex ID + vertex type + edge type
    - Value: all edges of that type from the source vertex, aggregated
  - Second-level storage (vertex out-degree above the threshold):
    - All edges are divided evenly into edge pages, and each edge page is assigned a PartKey
    - All the PartKeys together make up the meta data, stored as the value of the meta page: (vertex, edge type) -> (PartKey1, PartKey2, …)
    - Edge page: PartKey -> (edge1, edge2, …)
    - The meta page and the edge pages as a whole form a B-tree. Maintaining page version numbers enables lock-free concurrent insert, delete, update, and lookup, as well as subtree split and merge
  - Dynamic conversion between the two levels of storage
- Distributed cluster storage
  - Global data is divided into multiple shards using one of several selectable graph partitioning algorithms
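The two-level storage described for ByteGraph can be illustrated with a small sketch. It only models the key/value shapes and the first-level to second-level conversion. The class name, `threshold`, `page_size`, and the `partN` naming are hypothetical; edges are reduced to integer destination IDs; and version-based lock-free concurrency, conversion back to the first level, and page merging are all left out.

```python
import bisect
import itertools

class TwoLevelEdgeStore:
    """Toy model: small vertices keep one value, large ones get a meta page plus edge pages."""

    def __init__(self, threshold=4, page_size=2):
        self.kv = {}                          # stands in for the underlying KV store
        self.threshold = threshold            # max out-degree kept at the first level
        self.page_size = page_size            # max edges per edge page
        self._part_ids = itertools.count(1)

    def _new_page(self, edges):
        part_key = f"part{next(self._part_ids)}"
        self.kv[part_key] = edges             # edge page: PartKey -> sorted edges
        return part_key

    def add_edge(self, vid, vtype, etype, dst):
        key = (vid, vtype, etype)
        kind, payload = self.kv.get(key, ("edges", []))
        if kind == "edges":                   # first level: all edges in one value
            edges = sorted(set(payload) | {dst})
            if len(edges) <= self.threshold:
                self.kv[key] = ("edges", edges)
                return
            # Out-degree exceeded the threshold: split evenly into edge pages.
            parts = [self._new_page(edges[i:i + self.page_size])
                     for i in range(0, len(edges), self.page_size)]
            self.kv[key] = ("meta", parts)    # meta page: key -> [PartKey1, PartKey2, ...]
            return
        parts = payload                       # second level: find the page via the meta page
        firsts = [self.kv[p][0] for p in parts]
        idx = max(bisect.bisect_right(firsts, dst) - 1, 0)
        page = self.kv[parts[idx]]
        if dst in page:
            return
        bisect.insort(page, dst)
        if len(page) > self.page_size:        # split an over-full page in two
            half = len(page) // 2
            self.kv[parts[idx]] = page[:half]
            parts.insert(idx + 1, self._new_page(page[half:]))

    def edges(self, vid, vtype, etype):
        kind, payload = self.kv.get((vid, vtype, etype), ("edges", []))
        if kind == "edges":
            return list(payload)
        return [e for part_key in payload for e in self.kv[part_key]]

store = TwoLevelEdgeStore()
for dst in [1, 2, 3, 4, 5, 0]:
    store.add_edge("vid1", "user", "likes", dst)
print(store.kv[("vid1", "user", "likes")])
print(store.edges("vid1", "user", "likes"))
```

The point of the indirection is that an insert on a super vertex rewrites one small edge page (and occasionally the meta page) instead of the whole adjacency list.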
1.5 nebula graph
- Vertex format
- Edge format
- Property format
Type: 1 byte, indicating the key type. The current types are data, index, and system.
Part ID: 3 bytes, indicating the partition the data belongs to. During partition redistribution, this field lets an entire partition be scanned by prefix.
Vertex ID: n bytes, the ID of a vertex.
Tag ID: 4 bytes, the tag associated with the vertex.
Edge Type: 4 bytes, the type of the edge. A value greater than 0 indicates an outgoing edge; a value less than 0 indicates an incoming edge.
Rank: 8 bytes, used to handle multiple edges of the same type. Users can set this field as they need, for example to store a transaction time, a transaction serial number, or a sort weight.
PlaceHolder: 1 byte, reserved for TOSS (transaction on storage side). It indicates whether both the outgoing and the incoming copy of an edge have been fully inserted.
(Physically, vertices and edges are no longer stored together as they were in version 1.0. With the types separated, a prefix scan on VertexType + VertexID quickly retrieves all tags of a vertex.)
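The field list above can be turned into a byte-level sketch with Python's `struct` module. Several details are assumptions made only for this illustration and are not stated in the text: the order of the fields inside the edge key (including a trailing destination vertex ID), a fixed 8-byte vertex ID, little-endian encoding, and the numeric type codes. Nebula's real encoding may differ.

```python
import struct

VID_LEN = 8             # assumption: fixed-length vertex IDs, padded with zero bytes
KEY_TYPE_VERTEX = 1     # hypothetical type codes, for illustration only
KEY_TYPE_EDGE = 2

def _pad(vid):
    if len(vid) > VID_LEN:
        raise ValueError("vertex ID too long")
    return vid.ljust(VID_LEN, b"\0")

def _prefix(key_type, part_id, vid):
    """Type (1 byte) + Part ID (3 bytes) + Vertex ID (n bytes)."""
    return struct.pack("<B", key_type) + part_id.to_bytes(3, "little") + _pad(vid)

def vertex_key(part_id, vid, tag_id):
    """Prefix + Tag ID (4 bytes)."""
    return _prefix(KEY_TYPE_VERTEX, part_id, vid) + struct.pack("<i", tag_id)

def edge_key(part_id, src, edge_type, rank, dst, placeholder=1):
    """Prefix + Edge Type (4) + Rank (8) + destination Vertex ID (n) + PlaceHolder (1)."""
    return (_prefix(KEY_TYPE_EDGE, part_id, src)
            + struct.pack("<iq", edge_type, rank)
            + _pad(dst)
            + struct.pack("<B", placeholder))

keys = sorted([
    vertex_key(1, b"v1", 10),
    vertex_key(1, b"v1", 11),
    vertex_key(1, b"v2", 10),
    edge_key(1, b"v1", 5, 0, b"v2"),     # outgoing edge: edge type > 0
    edge_key(1, b"v2", -5, 0, b"v1"),    # incoming edge: edge type < 0
])

# Prefix scan: every tag of vertex v1 in partition 1 shares the same key prefix.
v1_prefix = _prefix(KEY_TYPE_VERTEX, 1, b"v1")
v1_tags = [k for k in keys if k.startswith(v1_prefix)]
print(len(vertex_key(1, b"v1", 10)), len(edge_key(1, b"v1", 5, 0, b"v2")), len(v1_tags))
```

Because the partition ID and vertex ID sit at the front of every key, both "all data of a partition" and "all tags of a vertex" become range scans over a common prefix in the ordered KV store.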