NebulaGraphCommunity first published by NebulaGraph, Follow NebulaGraphCommunity to take a look at technical practice in the dachang library.
preface
In the previous Query Engine source code parsing, we introduced the major changes and general structure of Query Engine and 1.0 in 2.0:
Query Engine parses the statement, builds it into an abstract syntax tree, validates it in the abstract syntax tree, and generates an execution plan. This article continues what goes on behind Query Engine with the new subgraph algorithm module in 2.0, and focuses on the execution plan generation process to improve your understanding of the source code.
Definition of subgraph
A subgraph is a graph whose node set and edge set are subsets of node set and edge set of a graph respectively. Intuitively, it is to start from the starting point specified by the user and expand step by step along the specified edge until the number of steps set by the user is reached, and then return all the point sets and edge sets encountered in the process of expansion.
Syntax of subgraphs
GET SUBGRAPH [<step_count> STEPS] FROM {<vid>, <vid>... } [IN <edge_type>, <edge_type>...] [OUT <edge_type>, <edge_type>...] [BOTH <edge_type>, <edge_type>...]Copy the code
- Step_count: Specify the number of hops from the start point, return from 0 to
step_count
Subgraph of the jump. Must be a non-negative integer. The default value is 1 - Vid: specifies the start point ID
- Edge_type: Specifies the edge type. You can use
IN
,OUT
和BOTH
To specify the direction of the edge type at the starting point. The default isBOTH
Implementation of subgraph
When the Query Engine receives the GET SUBGRAPH command, The Parser module (implemented by Flex and Bison) extracts the desired content from the query according to written rules (the get_subgraph_sentence rule in Parser.YY) and generates an abstract syntax tree, as shown below:
In the Validate phase, the generated abstract syntax tree is verified to verify the validity of the user’s input (see The Query Engine article). After the validation, the contents of the syntax tree are extracted and an execution plan is generated. So how is this execution plan generated? Different database vendors may generate different execution plans for the same function, but the principles are the same. It depends on what operators they have and how they interact with the query layer and the storage layer. Because each of our queries ends up fetching data from the storage tier. In Nebula Graph, the Query Engine and the storage tier interact via RPC (FBTHRIFT) (interfaces are defined in the Interface directory in the Common warehouse). There are two key interfaces, getNeighbors and getProps, to look at.
GetNeighbors Defines fbthrift in the following format:
struct GetNeighborsRequest {
1: common.GraphSpaceID space_id,
2: list<binary> column_names,
3: map<common.PartitionID, list<common.Row>>
(cpp.template = "std::unordered_map") parts,
4: TraverseSpec traverse_spec
}
Copy the code
A detailed definition of each variable in this structure can be found at github.com/vesoft-inc/… There are detailed notes in it.
The main function is that the Query Engine passes in the starting point and the edge type to be expanded according to the defined structure. Then the storage layer finds the starting point and assembes the attributes of the point and the edge attributes of the point’s outgoing edges into a table and returns it to the Query Engine. The format of the returned table is github.com/vesoft-inc/… GetNeighborsResponse, and then in The Query Engine we can extract what we want from this table. For example, in the Basketba dataset, when the starting point is Tim Duncan, Manu Ginobili expands in both directions along the like edge. Want to get $^.[player.name](http://player.name/), like._dst, $$.[player.name](http://player.name/) and like. Likeness The data returned is roughly as follows:
Table 1
Because it’s a two-way extension the + like in column 4 is the outgoing edge, and the – like in column 5 is the incoming edge.
In the Nebula Graph’s storage layer, edges are stored with the starting point, so the getNeighbor interface is used to get all the properties of the starting point and the output edge, but to get the properties of the destination point during the expansion, you need to use the getProps interface. Of course, if I just want to fetch a point or an edge, I need to call this interface. You can find out for yourself github.com/vesoft-inc/… Here is the definition of getPropRequest to deepen your understanding.
The execution plan
With the interface definition above, we are ready to execute the plan. The first operators we need are start, getNeighbor, Subgraph, loop, and datacollect.
- Start operator: it is equivalent to the leaf node in the execution plan and does nothing. The purpose is to tell the scheduler that there is no operator to rely on afterwards, or that it can be understood as a termination condition in a recursive algorithm.
- Loop operator: Equivalent to the while syntax in C, this operator has three members depend, condition and loopBody, depend in multiple statements and
PIPE
Condition is equivalent to a termination condition. LoopBody is equivalent to the body of the loop in while. - Subgraph operator: is responsible for putting the getNeighbor operator results in
_dst
The (destination point) property is extracted and then the destination points that have already been visited are filtered out (to avoid repeated fetching of data from the storage tier) and then used as input for the next expansion of the getNeighbor operator. - Datacollect operator: The datacollect operator is responsible for finally assembling the point and edge attributes acquired during expansion into vertex and edge types.
The detailed information of each operator, can refer to the source github.com/vesoft-inc/… . Figure 1 illustrates how we construct the subgraph
Figure 1
Extend the situation one step further
It is easy to get information about all points and edges in just one step, starting from point A along the like edge. All you need is getNeighbor and dataCollect. The execution plan is as follows:
Expand the situation of multiple steps
A one-step scenario is a special case of a multi-step scenario. So you can merge a one-step scenario into a multi-step scenario. When starting from point A and extending three steps along the like edge, according to the existing operator, the destination points can be extracted after getNeighbor expansion, and then these destination points can be used as the starting point to call the getNeighbor interface again. This loop is done twice (the stop condition of the loop operator is set to the current number of steps), so the execution plan is as follows:
Input and output
In general, the input of each operator is the output of the dependent operator. In this case, the input and output of each operator can be intuitively determined according to the dependency of the execution plan. However, in some cases, such as subgraph, in multi-step scenes, each input of getNeighbor operator should be the target point of the last expansion edge, that is, the output of subgraph operator, so the output of subgraph operator should be the input of getNeighbor operator. In this case, it is inconsistent with the execution plan dependence shown in the figure above. In this case, it is necessary to set the input and output of each operator. Query Engine 2.0 showed that the input and output of each operator are stored in a hash table, where value is a vector. The ResultMap is shown in the following table:
- The start point is stored in ResultMap[“StartVid”]
- The input of getNeighbor operator is ResultMap[“StartVid”], and the output is stored in ResultMap[“GN_1”].
- The input of the subgraph operator is ResultMap[“GN_1”] and the output is stored in ResultMap[“StartVid”].
- The loop operator produces no data and is used as a logical loop, so there is no need to set input and output
- The input to the dataCollect operator is ResultMap[“GN_1”], and the output is stored in ResultMap[“DATACOLLECT_2”].
The getNeighbor operator places each result at the end of the vector in ResultMap[“GN_1”], and the subgraph takes the value from the end of the vector in ResultMap[“GN_1”]. After calculation, the starting point to be expanded next time is stored in ResultMap[“StartVid”].
After the first step is expanded, the ResultMap results are as follows:
For display purposes, the result of GetNeighbor writes only the attributes of _dst, and actually takes all the attributes of the edge and all the attributes of the starting point, similar to table 1.
The subgraph operator takes input from “GN_1”, extracts the _dst attribute, and puts the result into “StartVid”. After expanding the second step, the ResultMap results are as follows:
After the third step is expanded, the ResultMap results are as follows:
Finally, the dataCollect operator extracts all the point and edge sets encountered during the expansion from the ResultMap[“GN_1”] and returns the final result to the user.
The instance
To see the structure of executing a plan in Nebula Graph for an example of a subgraph, open Nebula – Console, switch space to Basketball, Enter EXPLAIN format=”dot” GET SUBGRAPH 2 STEPS FROM ‘Tim Duncan’ IN like, serve, at which time nebula- Console will generate data IN dot format. Then go to Graphviz Online and paste the generated DOT data onto it, and you’ll see the following structure:
The Start_0 operator is a dependency of the Loop operator depend and does nothing because there are no multi-statements or PIPE statements. If you have a problem using a submap or another Nebula, feel free to talk to us in the discussion forum: discuss.nebula-graph.com.cn/
Want to share graph database technology with other big companies? NUC 2021 Is waiting for you to communicate: NUC 2021 registration portal