Nebula Graph: an open-source distributed graph database. As the only online graph database capable of storing super-large-scale graphs with trillions of vertices and edges with properties, Nebula Graph not only meets the low-latency requirements of millisecond-level queries in high-concurrency scenarios, but also ensures high service availability and data security.

This article introduces Nebula Graph's data model and system architecture design. Nebula Graph: a distributed, scalable, and fast graph database, now open source. GitHub: github.com/vesoft-inc/…
Directed Property Graph
Nebula Graph is modeled as an easy-to-understand directed property graph; that is, a graph is logically composed of two kinds of graph elements: vertices and edges.
Vertex
In Nebula Graph, a vertex is composed of tags, which represent the vertex's types, and a set of properties belonging to each tag. A vertex must have at least one type (tag) and may have multiple types. Each tag has a corresponding set of properties, called its schema.

As shown in the figure above, there are two tags of vertices: player and team. The player schema has three properties: ID (vid), Name (string), and Age (int); the team schema has two properties: ID (vid) and Name (string).

Like MySQL, Nebula Graph is a strong-schema database: property names and data types are determined before data is written.
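To illustrate what a strong schema means in practice, here is a minimal sketch in Go (the tag names come from the figure; the `Schema` type and `Validate` helper are hypothetical illustrations, not Nebula Graph's actual implementation): property names and types are declared up front, and a write that does not match the declaration is rejected.

```go
package main

import "fmt"

// Schema declares the property names and types of a tag up front,
// as a strong-schema store requires.
type Schema map[string]string // property name -> declared type

// Validate rejects a write whose properties do not match the schema.
func (s Schema) Validate(props map[string]interface{}) error {
	for name, value := range props {
		declared, ok := s[name]
		if !ok {
			return fmt.Errorf("property %q is not declared in the schema", name)
		}
		switch value.(type) {
		case string:
			if declared != "string" {
				return fmt.Errorf("property %q must be %s, got string", name, declared)
			}
		case int:
			if declared != "int" {
				return fmt.Errorf("property %q must be %s, got int", name, declared)
			}
		}
	}
	return nil
}

func main() {
	player := Schema{"name": "string", "age": "int"}
	// Matches the declared schema: accepted (prints <nil>).
	fmt.Println(player.Validate(map[string]interface{}{"name": "Tim", "age": 42}))
	// Type mismatch: rejected before the data is written.
	fmt.Println(player.Validate(map[string]interface{}{"age": "forty-two"}))
}
```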
Edge
In Nebula Graph, an edge is composed of an edge type and edge properties. Edges in Nebula Graph are directed; a directed edge indicates an association from one vertex (the source, src) to another vertex (the destination, dst). Each edge has exactly one edge type (EdgeType), and each EdgeType defines the corresponding schema for the edge properties.

Going back to the example above, there are two types of edges in the figure: one is the like edge between players, with a property likeness (double); the other is the serve edge from a player to a team, with two properties, start_year (int) and end_year (int).

It should be noted that there can be multiple edges, of the same or different types, between the same source vertex and destination vertex.
Graph Partition
In a super-large-scale relationship network, the number of vertices can reach tens of billions and the number of edges can reach a trillion; even storing only the vertices and edges far exceeds the capacity of a single general-purpose server. Therefore, a method is needed to split the graph elements and store them in different logical partitions. Nebula Graph adopts edge-cut partitioning; the default sharding strategy is hashing, and the number of partitions is set statically and cannot be changed.
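A minimal sketch of hash sharding in Go (the FNV hash and the modulo scheme are illustrative assumptions, not Nebula Graph's exact hash function):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionOf maps a vertex ID to one of numPartitions logical partitions.
// The partition count is fixed when the graph space is created.
func partitionOf(vid string, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(vid))
	return h.Sum32()%numPartitions + 1 // partitions numbered from 1
}

func main() {
	const numPartitions = 100 // set statically; cannot be changed later
	for _, vid := range []string{"player100", "player101", "team200"} {
		fmt.Printf("%s -> partition %d\n", vid, partitionOf(vid, numPartitions))
	}
}
```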
Data Model
In Nebula Graph, each vertex is modeled as a key-value pair and stored on the corresponding partition according to the hash of its vertex ID (VID for short).

A logical edge is modeled in Nebula Graph as two independent key-value pairs, called the out-key and the in-key. The out-key is stored on the same partition as the edge's source vertex, and the in-key is stored on the same partition as the edge's destination vertex.
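A minimal sketch in Go of how one logical edge becomes two key-value pairs co-located with its endpoints (the key layout and names here are hypothetical illustrations, not Nebula Graph's actual encoding):

```go
package main

import "fmt"

// Edge is one logical directed edge with an edge type.
type Edge struct {
	Src, Dst, EdgeType string
}

// kv is a simplified key-value record tagged with the partition it lands on.
type kv struct {
	Partition uint32
	Key       string
}

// encode turns one logical edge into an out-key stored with the source
// vertex and an in-key stored with the destination vertex.
func encode(e Edge, partitionOf func(string) uint32) (outKey, inKey kv) {
	outKey = kv{partitionOf(e.Src), fmt.Sprintf("out:%s:%s:%s", e.Src, e.EdgeType, e.Dst)}
	inKey = kv{partitionOf(e.Dst), fmt.Sprintf("in:%s:%s:%s", e.Dst, e.EdgeType, e.Src)}
	return
}

func main() {
	toy := func(vid string) uint32 { // toy partitioner for the example
		return uint32(len(vid)%3) + 1
	}
	out, in := encode(Edge{Src: "player100", Dst: "team200", EdgeType: "serve"}, toy)
	fmt.Printf("out-key %q on partition %d\n", out.Key, out.Partition)
	fmt.Printf("in-key  %q on partition %d\n", in.Key, in.Partition)
}
```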
Detailed design of the data model will be covered in a future series of articles.
System Architecture
Nebula Graph consists of four main functional modules: the storage layer, the metadata service, the computing layer, and the client.
Storage Layer (Storage)
The storage layer's counterpart in Nebula Graph is nebula-storaged, whose core is a distributed key-value store that uses the Raft protocol to manage log replication. The main storage engines currently supported are RocksDB and HBase. The Raft protocol maintains data consistency through a leader/follower model. On top of that, Nebula Storage adds the following features and optimizations:
- Parallel Raft: replicas of the same partition ID on multiple machines form one Raft group, and multiple Raft groups operate concurrently (a minimal sketch follows this list).
- Write Path & Batch: Raft-protocol synchronization between machines relies on sequential log IDs, which keeps throughput low. Higher throughput is achieved through batching and out-of-order commits.
- Learner: a learner is based on asynchronous replication. When a new machine is added to the cluster, it is first marked as a learner and asynchronously pulls data from the leader/followers. Once the learner catches up with the leader, it is marked as a follower and participates in the Raft protocol.
- Load-balance: partitions served by machines under heavy access pressure are migrated to colder machines for better load balancing.
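To make the Parallel Raft idea concrete, here is a minimal Go sketch (the types and names are hypothetical, and real Raft replication is elided): each partition ID maps to its own consensus group, so writes to different partitions proceed concurrently instead of serializing through one log.

```go
package main

import (
	"fmt"
	"sync"
)

// raftGroup stands in for one consensus group: the replicas of a single
// partition ID spread across several machines.
type raftGroup struct {
	partID int
	peers  []string // machines holding this partition's replicas
	mu     sync.Mutex
	log    []string
}

// Append records one entry within this group only; other groups are
// untouched, so writes to different partitions do not serialize.
func (g *raftGroup) Append(entry string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.log = append(g.log, entry)
}

func main() {
	// Multi-raft: one group per partition ID.
	groups := map[int]*raftGroup{
		1: {partID: 1, peers: []string{"host-a", "host-b", "host-c"}},
		2: {partID: 2, peers: []string{"host-b", "host-c", "host-d"}},
	}
	var wg sync.WaitGroup
	for id, g := range groups {
		wg.Add(1)
		go func(id int, g *raftGroup) { // groups commit concurrently
			defer wg.Done()
			g.Append(fmt.Sprintf("write to partition %d", id))
		}(id, g)
	}
	wg.Wait()
	for id, g := range groups {
		fmt.Printf("partition %d on %v has %d entries\n", id, g.peers, len(g.log))
	}
}
```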
Metadata Service Layer (MetaService)
The MetaService layer's counterpart is nebula-metad, which provides the following functions:

- User management: Nebula Graph's built-in user roles are GodUser, Admin, User, and Guest, each with different operation permissions.
- Cluster configuration management: supports bringing new servers online and taking them offline.
- Graph space management: supports creating and dropping graph spaces and modifying graph space configuration (such as the number of Raft replicas).
- Schema management: Nebula Graph is designed with strong schemas.
  - The type of each field of tag and edge properties is recorded via MetaService; the supported types are int, double, timestamp, list, etc.
  - Multi-version management: supports adding, modifying, and deleting schemas while recording their version numbers.
  - TTL management: supports automatic data deletion and space reclamation by marking an expiration time (time-to-live); a sketch follows this list.
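A minimal sketch of the TTL idea in Go (the field names are illustrative assumptions): a record is considered expired, and becomes eligible for reclamation, once its write time plus its time-to-live falls before now.

```go
package main

import (
	"fmt"
	"time"
)

// record carries the write timestamp and the time-to-live that together
// determine when it may be deleted automatically.
type record struct {
	key       string
	writtenAt time.Time
	ttl       time.Duration
}

// expired reports whether the record's lifetime has elapsed; expired
// records can be skipped on read and reclaimed in the background.
func (r record) expired(now time.Time) bool {
	return now.After(r.writtenAt.Add(r.ttl))
}

func main() {
	now := time.Now()
	records := []record{
		{"player100", now.Add(-2 * time.Hour), time.Hour},    // expired
		{"player101", now.Add(-30 * time.Minute), time.Hour}, // still live
	}
	for _, r := range records {
		fmt.Printf("%s expired=%v\n", r.key, r.expired(now))
	}
}
```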
The MetaService layer is a stateful service; just like the Storage layer, it persists its state in a KVStore.
Query Engine & Query Language (nGQL)
The computing layer's counterpart process is nebula-graphd, which consists of fully peer, stateless, unrelated compute nodes that do not communicate with one another. The main function of the Query Engine layer is to parse the nGQL text sent by the client, generate an execution plan through the Lexer and Parser, and hand the plan to the execution engine after optimization. The execution engine fetches the schemas of vertices and edges from the MetaService, and fetches the data of vertices and edges through the storage engine layer. The key optimizations of the Query Engine layer are:

- Asynchronous and concurrent execution: because I/O and network operations have long latencies, asynchronous and concurrent execution is required. In addition, to prevent a single long-running query from affecting subsequent queries, the Query Engine sets up a separate resource pool for each query to guarantee quality of service (QoS).
- Computation pushdown: to prevent the storage layer from consuming precious bandwidth by sending too much data back to the computing layer, condition filters such as the `where` operator are pushed down to the storage-layer nodes along with the query (see the sketch after this list).
- Execution plan optimization: although execution plan optimization for SQL in relational databases has a long history, there is little industry research on graph query language optimization. Nebula Graph explores execution plan optimization for graph queries, including execution plan caching and the concurrent execution of context-free statements.
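A minimal sketch of the pushdown idea in Go (the scan API and row layout are hypothetical illustrations): the predicate is evaluated on the storage node where the data lives, so only matching rows cross the network to the compute layer.

```go
package main

import "fmt"

// row is a simplified edge record held by a storage node.
type row struct {
	Dst       string
	StartYear int
}

// scanWithFilter evaluates the predicate where the data lives and returns
// only the matching rows, instead of shipping everything back to the
// compute layer and filtering there.
func scanWithFilter(data []row, pred func(row) bool) []row {
	var out []row
	for _, r := range data {
		if pred(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	stored := []row{
		{"team200", 2008},
		{"team201", 2015},
		{"team202", 2019},
	}
	// Equivalent of pushing a filter like `where start_year > 2010`
	// down to the storage node.
	for _, r := range scanWithFilter(stored, func(r row) bool { return r.StartYear > 2010 }) {
		fmt.Printf("dst=%s start_year=%d\n", r.Dst, r.StartYear)
	}
}
```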
Client API & Console
Nebula Graph provides C++, Java, and Golang clients that communicate with the server over RPC, using facebook-thrift as the communication protocol. Nebula Graph can also be accessed through a console on Linux; web access is currently under development.
Nebula Graph: an open-source distributed graph database.
GitHub: github.com/vesoft-inc/…
Website: nebula-graph.io/cn/posts/
Weibo: weibo.com/nebulagraph