The author | west lake data intelligence research institute senior r&d engineer promise

The world is a complicated place, and there are always many connections between things. With the development of modern business society, the relationship between things is more and more complicated, and the traditional relationship storage can no longer meet our business needs. As a vane for the future development of relationship exploration, “graph” can help people to recognize things more intuitively, excavate the mystery between data, and open up a new world for the embodiment of data value.

As a professional data intelligence listed company, Getuan has also carried out rich practice in graph application analysis. This paper will describe the common business scenarios of Graph, the application cases and optimization measures of Neo4j in Jitui, and on this basis, innovatively put forward the unique HA (High Availability) scheme of Neo4j community edition.

01 figure

The famous konigsberg Seven Bridge problem opened a new chapter in the diagram. In 1736, Leonhard Euler abstracted the problem mathematically and expressed it with a two-dimensional matrix, laying the foundation of graph theory.

Image from neo4j.com

What is a graph?

The definition of a graph states that G consists of two sets, denoted as G=<V,E>. Where V is a non-empty finite set of vertices, E is a finite set of edges, and edges are unordered pairs or ordered pairs of vertices. To better understand the graph, we can look at the following example.

Image from neo4j.com

The figure shows 3,200 airports and 60,000 air routes. By definition of the corresponding graph, each airport is a set of V and the airline is a set of E.

How are graphs stored?

• Adjacency matrix of graph

• Adjacency list for graph

• A cross-linked list of digraphs stores representations

• Adjacency multiple list storage representation of undirected graphs

What are the ways to traverse a graph?

• Depth-first traversal (DFS)

• Breadth first traversal (BFS)

Figure database

GraphDatabase is not a database that stores images, but a database that supports storing and querying data in graph data structures. Graph database is an online database management system that handles the create, read, update, and delete (CRUD) operations of graph data models.

Image courtesy Amazon.com

Figure storage

Some graph databases use native graph stores, which are optimized and designed specifically for storing and managing graphs. Not all graph databases use native graph storage, and some graph databases serialize graph data and store it in relational, object-oriented, or other general-purpose data stores.

Graph processing engine

Native graph processing (also known as index-less adjacency) is considered the most efficient way to process graph data because the connected nodes can physically point to each other in the database. Non-native graph processing uses other methods to handle CRUD operations.

Figure common service scenarios

Image courtesy Amazon.com

The figure above shows the application of graph in social network, recommendation system, knowledge graph, fraud detection, life science, network operation and maintenance and other business scenarios. Recommendation system is a typical application of graph, for example, when we want to search for interesting movies, ticket software will help us recommend similar movies according to the type and style we often watch.

Figure Database ranking

According to DB-Engines, the graph database popularity rankings for 2020 clearly show that Neo4j has taken over half of graph databases.

02 Neo4j Graph database

Neo4j is a high-performance NOSQL graphical database that stores structured data on the network rather than in tables. It is an embedded, disk-based Java persistence engine with full transactional features, and can also be considered a high-performance graph engine with all the features of a full-fledged database.

Neo4j lets programmers work with an object-oriented, flexible network structure rather than rigid, static tables — but they can still enjoy all the benefits of a fully transactional, enterprise-class database.

Image from neo4j.com

The platform ecology from storage to analysis of Neo4j graph database is shown in the figure.

Neo4j Community Edition (4.0.6) storage analysis

Image from neo4j.com

Node storage (15 bytes)

The node storage file is neostore.nodestore.db, a fixed-length record file, and each record represents a node.

• inUse (1 byte)

– [x x x x,_ _ _ _] 1-4 indicates the higher four bits of nextPropId

– [_ _ _ _, x x x _] 5-7 Indicates the higher three bits of nextRelId

– [_ _ _ _, _ _ _ x] 8 Indicates whether the node is available

• nextRelId (4 bytes) : specifies the low level of the ID of the first edge associated with the node. RelId has 35 bits plus the 3 bits in inUse

• nextPropId (4 bytes) : The low level of the ID of the first attribute associated with the node, plus the 4 bits in inUse, making the propId 36 bits

• Labels (5 bytes) : stores label information. The first four bytes are low bytes and the last one byte is high byte

– [1 x x x,_ _ _ _] : The first bit (1) indicates the inlining, 2-4 indicates the number of labels, and the remaining 36 bits are evenly divided to each label by division. The remaining 36 bits represent the ID of the label

– [0 _ _ _,_ _ _ _] : The first bit (0) indicates non-inline, and the lower 36 bits indicate the ID of the first dynamic label record

• Extra (1 byte) : [_ _ _ _, _ _ x] Uses only the last bit to mark whether the node is a super node

The length of an edge ID is 35 bits, that is, the maximum number of edges is about 32 billion. If the length of an attribute ID is 36 bits, the maximum number of attributes is about 64 billion.

Relationship storage (34 bytes)

Edge storage file for neostore. Relationshipstore. Db, fixed-length records file, each record represents an edge

• inUse (1 byte)

– [x x x x,_ _ _ _] 1-4 indicates the higher four bits of nextPropId

– [_ _ _ _, x x x _] 5-7 indicates the high three bits of firstNodeId

– [_ _ _ _, _ _ _ x] 8 Indicates whether the node is available

• firstNode (4 bytes) : specifies the low level of the firstNode id associated with the edge

• secondNode (4 bytes) : the low level of the secondNode id associated with the edge, plus the 3 bits in relationshipType mid, the nodeId has 35 bits

• relationshipType (4 bytes) :

– Mid (2 bytes) :

• [_ x x x, _ _ _ _, _ _ _ _, _ _ _ _] : 2-4 said secondNodeId high three

• [_ _ _ _, x x x _, _ _ _ _, _ _ _ _] : 5-7 represents the high 3 bits of First prevRelId

• [_ _ _ _, _ _ _ x, x x _ _, _ _ _ _] : 8-10 indicates the higher three bits of firstNextRelId

• [_ _ _ _, _ _ _ _, _ _ x x, x _ _ _ _] : 11-13 indicates the high 3 bits of secondPrevRelId

• [_ _ _ _, _ _ _ _, _ _ _ _, _ x x x] : 14-16 indicates the higher 3 bits of secondNextRelId

– Type (2 bytes) : indicates the ID of the edge type

• firstPrevRelId (4 bytes) : The low level of the previous id associated with the first node, plus the 3 bits in relationshipType mid, relId has 35 bits

• firstNextRelId (4 bytes) : specifies the low level of the last id associated with the first node. Add the 3 bits in relationshipType mid. RelId has 35 bits

• secondPrevRelId (4 bytes) : The low level of the previous id associated with the second node, plus the 3 bits in relationshipType mid, relId has 35 bits

• secondNextRelId (4 bytes) : The lower level of the last id associated with the second node, plus the 3 bits in relationshipType mid, relId has 35 bits

•nextPropId (4 bytes) : The low level of the ID of the first property of the edge association, plus the 4 bits in inUse, making the propId 36 bits

• firstInChainMarkers (1 byte) :

– [_ _ _ _, _ _ x _] : indicates whether the seventh bit is the first edge of the first node

– [_ _ _ _, _ _ _ x] : indicates whether the eighth bit is the first edge of the second node

From the storage structure, we can see that the edge stores the IDS of nodes at both ends and the two bidirectional linked lists, as shown in the figure:

Image from neo4j.com

Neo4j Large data read and write optimization

Read optimization

When we encounter a task with many relationships, if we want to query all data under this task, match(n) is required to scan all nodes, which will be very inefficient. In order to solve this problem, we still adopt the method of space for time and introduce the auxiliary label, which can extract all the data under the specified task efficiently and optimize the efficiency of the original 8 seconds query to less than 100 milliseconds.

Write optimization

The traditional data import method of Neo4j is to create Node first, then create Relation, and then analyze the cypher (SQL) execution plan. In this process, creating the Node and Relation steps takes little time, but the match operation takes a lot of time.

In order to solve the time-consuming problem caused by the need for the match operation in the process of creating Relation, we first create Node and Relation (the starting point and ending point of Relation have common attributes with the created Node), and then merge nodes through a specific attribute of Node. In addition, the space for time scheme is adopted to write data in real time, and finally the writing efficiency of Node and Relation is reduced from 40 minutes to 3 minutes.

Tips

Space for time and time for space are two ways to solve the performance balance problem. Choosing a reasonable balance point between time and space can greatly improve the quality of service of our software.

High Availability (HA) scheme exploration of Neo4j Community edition

After comparative analysis, we finally choose to use Neo4j graph database. The Enterprise edition of Neo4j has more features but is very expensive, so we are targeting the Community edition of Neo4j. However, Neo4j community edition does not support high availability and data cannot be backed up, which is not conducive to its use in the business environment. Therefore, it is necessary to design an HA scheme to solve the single point of failure. The following is an exploration of the high availability of Neo4j Community Edition.

1. Neo4j deplores master nodes and slave nodes. Master nodes can be read and written, while slave nodes can only be read. 2. Synchronize the data directory of the master node of Neo4j to the data directory of the slave in real time through lsyncd. 3. Mount the slave data directory to a specific directory on the master node. 4. Periodically restart the slave node (Neo4j community version needs to restart the data file for loading synchronization) to allow a certain data delay between the slave node and the master node. 5. When the master node is faulty, the driver connects to an available slave. The Read-only slave provides query.

03 summary

Graph database has the characteristics of high correlation and can handle complex relational data quickly, which can effectively help enterprises gain deeper insight and more competitive advantages. More and more companies in the industry began to research the field of layout graph database and develop their own graph database system. As a data intelligent listed company with massive data precipitation, INDIVIDUAL Push continues to innovate in the field of graph database practice, and continues to polish its own graph mining and analysis technology at the same time, but also looks forward to colliding more graph database practice with the industry.

References:

Db-engines.com/en/ranking/…

neo4j.com/docs/

www.tutorialspoint.com/neo4j/index…

www.6aiq.com/article/158…

aws.amazon.com/

www.tony-bro.com/posts/26243…