The previous articles have covered a lot of graph database basics:

Design improvement points of Titan from extended line query capability analysis graph database

Learn graph database series together: Graph database overview

Graph database series: graph database and traditional database comparative analysis

The Titan project, the open source graph database, has ceased to be updated. JanusGraph has resurrected the Titan project, and so far, JanusGraph is not far from Titan’s core mechanics. JanusGraph/Titan has the following key designs:

1. Support large-scale graph data storage. Titan graph database is built on distributed cluster, and the data storage capacity is proportional to the number of cluster nodes

2. Support elastic and linear expansion, high availability, high fault tolerance

3. Support Gremlin graph query language

4. Support graph data analysis using Hadoop computing framework

5. Support external indexes: ElasticSearch, Solr, Lucene

6. Supports multiple storage engines: Cassandra, HBase, Berkeley DB and InMemory modes;

7. Based on Apache License 2.0

This article uses HBase as Storage Backend to introduce JanusGraph’s internal data Storage structure in detail.

Basic concepts of JanusGraph

Before introducing the data store structure for relational data, let’s look at the basic concepts of JanusGraph.

Like most graph databases, JanusGraph is modeled using property graphs. Based on the property graph model, JanusGraph has the following basic concepts:

Vertex Label: The type of node used to represent real-world entity types such as “people” and “cars”. In the JanusGraph, each node has one and only one Vertex Label. When Vertex Label is not explicitly specified, the default Vertex Label is used.

Vertex: Nodes/vertices used to represent physical objects in the real world

Edge Label: Type of Edge, used to indicate the relationship type in the real world, such as “call relationship”, “transfer relationship”, “Weibo attention relationship”, etc.

Edge: an Edge used to indicate a specific connection. The edges of the JanusGraph are unidirectional. If a bidirectional edge is required, it is formed by two unidirectional edges in opposite directions. JanusGraph does not have undirected edges.

Property Key: The type of the Property, such as name, age, time, etc. The Property Key has Cardinality. Cardinality has SINGLE, LIST, and SET options. These options are used to indicate whether a Property Key is allowed to have only one value, multiple repeatable values, or multiple non-repeatable values, respectively.

Property: a key-value structure used to represent specific additional information. Key is the Property Key, and Value is the specific Value.

Example of property graph

Zhang SAN and Li Si are colleagues. They have been colleagues since 2017, which can be expressed in the attribute graph:

Figure cutting

As a distributed graph database, JanusGraph needs to shard data and store it on multiple machines.

There are two typical graph cutting methods: one is point cutting, the other is edge cutting.

Vertex-Cut

When cutting by Vertex, the cutting line passes through Vertex, not Edge. Each Edge Edge is saved only once, and each Edge appears on only one machine. Vertex with more neighbors is distributed to different machines.

Edge-cut

When cutting by Edge, the cutting line only passes through the Edge connecting vertex, and each vertex is saved only once. The cut Edge is saved to multiple machines.

The sharding method used by JanusGraph is Edge cutting, and for each Edge, it will be cut. After cutting, the edge is stored once on the starting Vertex and once on the destination Vertex. In this way, JanusGraph can quickly find the opposite Vertex, either from the starting Vertex or from the destination Vertex.

note

1. Different storage modes of Super nodes are not discussed here.

2. JanusGraph allows Edge Label to be defined as a unidirectional aware Edge. For an edge that is unidirectional aware, only the starting vertex is used to query the edge, not the destination vertex. Therefore, in this case, the destination node does not store the vertex, but only the start node.

storage

Or the above example as an example, after cutting according to the edge, it becomes the following result:

In JanusGraph, it’s point-centered and it stores data by cutting edges. The node ID is used as the HBase Rowkey, and each attribute and edge on the node are independent cells of the Rowkey. That is, each attribute and edge is an independent KCV structure (key-column-value).

As shown below:

note

For simplicity, we do not consider how to record the Vertex Label that the node belongs to.

Detailed storage format

Vertex storage format

1. The Vertex ID is stored in HBase as a Rowkey. The Vertex ID contains 64 bits.

2. Vertex ID consists of partition ID, count, and ID padding.

3. The highest five bits are partition IDS. Partition is a concept abstracted by JanusGraph. When Storage Backend is HBase, JanusGraph automatically calculates and configures split keys for each HBase Region based on the number of partitions, so that partitions are evenly mapped to multiple HBase regions. Data is evenly distributed to multiple devices in Storage Backend by evenly allocating partition ids.

4. The count part in the middle is the serial number, where the highest bit is fixed as 0.

5. The last bits are the ID padding, denoting the type of Vertex. The exact number of bits varies depending on the Vertex type. The most commonly used generic Vertex, with a value of ‘000’.

The suffixes of each Vertex type are divided as follows:

Property storage format

A Property Key can have one or more Property values associated with it, so JanusGraph uses Cardinality to describe this Property Key. Cardinality comes in SINGLE,LIST, and SET types:

  • SINGLE indicates that a Property Key corresponds to only one Value, which is the most common scenario.

  • LIST indicates that a Property Key can correspond to multiple values, and multiple values can have duplicate values.

  • SET indicates that a Property Key can correspond to multiple values. Multiple values cannot have duplicate values.

JanusGraph’s Property has a slightly different data store structure in different Cardinality. The overall principle is that, depending on Cardinality, the column name itself determines a unique attribute.

Storage structure when Cardinality is SINGLE

The HBase column name stores only the ID and direction of the Property Key. The Property Value and Property ID(the unique ID JanusGraph assigns to each Property) are stored in the Value of the Cell. In addition, if the Property has Remaining Properties, it also puts them in the Value. The Remaining Properties are not used. In some special scenarios, the Remaining properties record additional information (such as the definition of Edge Labe, which stores metadata) for the Property.

PropertyKeyID and direction The detailed structure of the entire structure is described later; The value of the attribute takes up one or more fields, in a format described later; The ID of the attribute and the Remaining attribute have the same format, which is described later.

Candinality is the storage structure of a LIST

The structure of each part is similar to that of Cardinality SINGLE, except that the ID of the attribute is placed in the column name instead of the Value.

Candinality is the SET storage structure

The structure of each part is similar to that of Cardinality SINGLE, except that the Value of the attribute is placed in the column name instead of Value.

Property Key ID and direction store structure

Each Property Key also has its own unique ID. To identify which Property Key each attribute belongs to, JanusGraph records the ID of the Property Key corresponding to each attribute. When JanusGraph is stored, it assembles the Property Key ID and DirectionID(subsystem type/hidden/general not hidden) together in multiple bytes.

Property’s ID and property value store structure

The value of a Property has a different serializer depending on the data type defined in the Property Key. For example, String has StringSerializer and Integer has IntegerSerializer. IntegerSerializer and attribute ID are formatted in the same way.

The ID of the property is an unsigned integer that JanusGraph formats into multiple bytes. The highest bit of each byte is used to indicate that it is the last byte. 0 means not the last byte, 1 means the last byte.

Side storage format

JanusGraph’s Edge Label also has a MULTIPLICITY concept. MULTIPLICITY has multiple and other values, of which MULTI is the most common.

MULTIPLICITY is the storage structure of MULTIPLICITY

The storage format of edges is similar to that of attributes. Property Key and Edge Label are abstracted into Relation Type and adopt the same data structure. SortKey is a special attribute, and JanusGraph allows you to specify one or more of these attributes as SortKey when defining an Edge Label. For the edge’s Sort Key attribute, JanusGraph stores it after the Relation Type ID and in front of all other fields when it stores it. In this way, multiple edges of the same type on a node are guaranteed to be stored sorted by the Sort Key attribute. This can help query performance when a node has a large number of edges.

 

MULTIPLICITY is a storage structure when MULTIPLICITY is not MULTI and multiple edges exist in this direction

MULTIPLICITY is a storage structure when MULTIPLICITY is not MULTI and there is only one edge in this direction

The above department of the author according to their own understanding of the source code collated, it is inevitable that there are inaccurate or incomplete places. Welcome correction and exchange.

About “NoSQL Ramblings”

NoSQL mainly refers to some distributed non-relational data storage technology, which is actually a very broad definition, can be said to cover all aspects of distributed system technology. As artificial intelligence, Internet of Things, big data, cloud computing and blockchain continue to spread, NoSQL technology will play an increasingly valuable role.

Please long click the qr code below to follow us

More NoSQL technology sharing, stay tuned!