This article is based on the talk given by Wang Lei at the 2018 Gdevops Global Agile Operations Summit, Beijing Station.
(The full slides of Wang Lei's talk are available via the "here" link.)
Speaker introduction
Wang Lei is a data architect in the Science and Technology Department of China Everbright Bank. He previously worked in technical consulting at IBM Global Consulting Services and has more than ten years of experience in data research and consulting. He is currently responsible for the day-to-day architecture management, key system architecture design, and internal research and development of the bank-wide data systems, and has a strong interest in distributed databases, Hadoop, and other infrastructure.
Hello, everyone. Today I would like to share with you the selection, analysis, and practice of graph databases in the banking industry. Since this topic is relatively niche, let me start by explaining how I came to be talking about graph databases.
As Shen Jian said this morning, when we study a new technology, we do it not because the technology is new or because it shows off our skill, but because it serves us and our business.
At China Everbright Bank we ran into a challenge: in certain business scenarios, the data we face is highly connected. For banks, the flow of funds is a very important scenario, and so are the relationships between people. In typical business steps such as loan issuance or audit, we care about what relationships exist between people, what relationships exist between enterprises, and how funds flow between accounts; specific business decisions follow from that.
When faced with these problems, we found that the biggest weakness of traditional relational databases is that, with highly connected data, their performance cannot meet our requirements, or the queries cannot run at all. We wanted to see whether graph databases could help, so we did some research in this area, which I am sharing with you today.
Basically, we will talk about it in three parts:
- The concept of graphs;
- Technical analysis of graph databases;
- Graph database practice at Everbright Bank.
First, the concept of graphs
Let's first look at where graphs come from:
1. Graphs in the real world
By graph we do not mean a picture or chart; we mean highly connected structured data. In fact, the use of this structure can be considered separately from the development of IT technology.
Here is a relationship chart of the characters in the TV drama In the Name of the People:
Many people have probably used diagrams like this to untangle the relationships in a novel or drama they are following.
The nice thing about this graph is that it gives you the whole web of character relationships at once. Even if you haven't seen the show, you can see which characters have the most connections and are probably the main characters, which ones are peripheral, and where conflicts between characters are likely to arise. That is much more efficient and direct than reading the text and reconstructing the relationships in your head.
In real life we use graphs in many ways: social networks, transportation networks, communication networks, the flow of money, even the subway map. Organizing and expressing information as a graph makes it easier for us to take in and understand.
The diagrams mentioned above are fairly primitive; the character relationships in a drama, for example, can be drawn by someone with little background in IT or mathematics. But to handle this at the technical level, we need a strict formal definition, which brings in a little mathematics: graphs in the mathematical world.
2. Graphs in the mathematical world
You have probably heard of graph theory, a specialized field that studies graph-related problems. We will not go that deep here; we only need a few basic concepts to make the later material easier to follow.
First, the edges of a graph can be directed edges, undirected edges, or parallel edges. In practice, data is often organized as a multigraph, that is, a graph that may contain both parallel edges and loops.
If the edges between nodes are directed, it is a directed graph; if they are undirected, it is an undirected graph. In engineering practice, most of the graphs we encounter are directed multigraphs.
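To make the idea of a directed multigraph concrete, here is a minimal Python sketch. The class name, the method names, and the account identifiers are all hypothetical and chosen only for illustration; real graph databases store edges very differently.

```python
from collections import defaultdict


class DirectedMultigraph:
    """Directed graph that allows parallel edges and self-loops."""

    def __init__(self):
        # source node -> list of (target node, edge id); every edge keeps its own id
        self._out = defaultdict(list)
        self._next_edge_id = 0

    def add_edge(self, source, target):
        edge_id = self._next_edge_id
        self._next_edge_id += 1
        self._out[source].append((target, edge_id))
        return edge_id

    def edges_between(self, source, target):
        """Return the ids of all edges going from source to target."""
        return [eid for tgt, eid in self._out.get(source, []) if tgt == target]


graph = DirectedMultigraph()
graph.add_edge("account_a", "account_b")  # first transfer
graph.add_edge("account_a", "account_b")  # parallel edge: a second transfer
graph.add_edge("account_b", "account_a")  # direction matters
graph.add_edge("account_a", "account_a")  # self-loop

print(graph.edges_between("account_a", "account_b"))  # [0, 1]
```

Two transfers between the same pair of accounts stay as two distinct edges, which is exactly what a simple graph cannot express.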
When we come back to industry, the basic mathematical concepts still apply, but industry does not use them piecemeal. It refines them into a set of models, much as we use ER diagrams to describe how to model data for a relational database; ER itself is a refinement of relational database modeling.
Unfortunately, graph databases are still at an early stage and have not yet settled on a standard description language, so here we use a somewhat less rigorous description to give everyone a general impression.
3. Graphs in industry
More specifically, the graphs used in industry come in three forms:
RDF, property graphs, and hypergraphs. A hypergraph, in which one edge can connect multiple vertices, is the more general form, but it is used less often. We'll focus on the first two:
The first, RDF, has been around for quite a while, and a quick search will turn up plenty of material. Its distinguishing feature is that it comes from academic research, and its main target is the Semantic Web, an ambitious project to connect all the knowledge on the Internet.
Later developments include the now-popular knowledge graph, which Google has taken further. The RDF world can be confusing because it grew out of academia and carries many complex concepts, so we cover only the simple ones here. Its basic organizational unit is the triple of subject, predicate, and object, the same notions as subject, predicate, and object in Chinese grammar.
The second, the property graph, addresses how to efficiently query, use, and store highly connected data. Nodes (also called vertices) and edges are its main building blocks. The most important feature of a property graph is that both its edges and its nodes can carry attributes. This is a big departure from RDF, which cannot have attributes.
With properties the graph becomes a more descriptive structure, a little like defining a struct. To make a property graph more precise, a further concept is introduced, the label, so the model is sometimes described as a labeled property graph, which is more convenient in practical use.
Here is an example of three diagrams representing exactly the same semantics from left to right:
First, the third one is the simplest. It is the property graph, which represents only the relationships between three people. Each person is a node. Each edge carries two things: what the relationship between the two people is, and when the relationship was formed.
Let’s look at the other two graphs, and they look different:
- The leftmost diagram is the RDF form described earlier. RDF has no attributes, so all information has to appear as vertices and edges. The diagram therefore contains more than the three green points: the corresponding city and the corresponding relationships are expressed as additional nodes and edges, which makes it somewhat more complex.
- The middle form is a compromise: RDF with attributes added on the edges, which is much simpler than the one on the left.
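The difference between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names (`alice`, `bob`, `KNOWS`, `since`, `stmt_1`) are hypothetical, and the flattening shown is one reification-style way of expressing an edge attribute with plain triples, not the exact layout of the diagrams in the talk.

```python
# Property graph: the edge itself carries attributes.
property_edge = {
    "source": "alice",
    "target": "bob",
    "label": "KNOWS",
    "properties": {"since": 2010},
}


def to_triples(edge, statement_id):
    """Flatten one property-graph edge into (subject, predicate, object) triples.

    Plain triples cannot hold attributes, so the relationship is turned into
    an extra node (statement_id) and every attribute hangs off that node.
    """
    triples = [
        (statement_id, "subject", edge["source"]),
        (statement_id, "predicate", edge["label"]),
        (statement_id, "object", edge["target"]),
    ]
    for name, value in edge["properties"].items():
        triples.append((statement_id, name, value))
    return triples


for triple in to_triples(property_edge, "stmt_1"):
    print(triple)
```

One edge with one attribute becomes four triples and one extra node, which is why the triple-only diagram looks more crowded than the property graph.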
4. Knowledge graph and graph storage technology
Here we want to clear up an easily confused concept, the knowledge graph. Knowledge graphs have been a hot topic in the last two years, but a knowledge graph is a different concept from graph storage and graph visualization. More precisely, graph storage is one of the key technologies behind a knowledge graph. A knowledge graph also involves more complex content, including natural language processing and machine learning, and much of its input comes from unstructured data. The graph storage technology we are talking about today deals more with structured data.
Second, technical analysis of graph databases
Let’s start with some technical topics.
1. “High growth” graph database
The following chart compares the trends of the various database categories:
This chart is often used to illustrate the rise of graph databases. The category with the highest growth rate is graph databases, which pull well ahead of all the others. This reflects a healthy trend for graph databases.
But when we started looking into this in 2017 and surveyed the market, we found very few applications and very few solutions. The situation may be somewhat better abroad, but my personal impression is that it is still mostly at the discussion stage. I even think that over the last couple of years, as the enthusiasm around social networking has cooled, people have been talking about it less. Even so, the chart still represents the future potential of graph databases.
2. The value of graph databases
So why do we use graph databases? Here is a more systematic look at four important benefits:
- Most importantly, query performance is excellent, because a graph database avoids the repeated joins of a relational database. Its storage technology is completely different.
- Data modeling is more flexible. The underlying storage of a graph database is schema-less, which places fewer constraints on structure.
- It is easy to understand. The way graph data is presented is very close to the way it is organized, so it is easier to grasp. With a relational model there is a fairly large gap between what the business people describe and what is finally implemented; with a graph that gap is much smaller.
- With the support of graph algorithms, we are no longer limited to analyzing data by human intuition and can process far larger volumes of data, and faster.
Of course, these four benefits are our own understanding, and not every graph database delivers all of them. As we introduce the different types of graph database, you will get a feel for this.
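The first benefit, avoiding repeated joins, can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the `transfers` data and the function names are hypothetical, the "join" version models an unindexed edge table that is scanned once per hop, and real relational engines would use indexes and a query optimizer.

```python
from collections import defaultdict

# Hypothetical transfer records: (from_account, to_account)
transfers = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]


def reach_by_joins(edge_table, start, hops):
    """Relational style: one self-join of the edge table per hop.

    Every hop scans all rows, so the cost grows with the table size.
    """
    frontier = {start}
    for _ in range(hops):
        frontier = {dst for src, dst in edge_table if src in frontier}
    return frontier


def build_adjacency(edge_table):
    """Graph style: store each node's outgoing neighbours next to the node."""
    adjacency = defaultdict(list)
    for src, dst in edge_table:
        adjacency[src].append(dst)
    return adjacency


def reach_by_traversal(adjacency, start, hops):
    """Each hop only touches the neighbours of the current frontier."""
    frontier = {start}
    for _ in range(hops):
        frontier = {dst for node in frontier for dst in adjacency.get(node, [])}
    return frontier


adjacency = build_adjacency(transfers)
print(sorted(reach_by_joins(transfers, "a", 2)))      # ['c', 'd']
print(sorted(reach_by_traversal(adjacency, "a", 2)))  # ['c', 'd']
```

Both functions return the same accounts; the difference is that the traversal version never looks at edges that are unrelated to the current frontier.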
3. Graph database products
When looking at graph databases, we want to find the most authoritative products. A natural place to start is the DB-Engines ranking. It lists many products, but not many of them are really practical choices:
Let's mention a few important ones:
One is FlockDB, an open source project from Twitter. When graph databases came up, my first thought was that they are closely related to today's popular social networks, so surely the social networking companies would be exploring this area too?
Unfortunately, these leading companies have not released any really usable open source projects. FlockDB itself has had no technical updates for years and is basically dead. One possible reason graph databases lag behind is that, unlike other areas of big data, they have not been driven forward by the leading vendors.
The other is the top-ranked Neo4j. Neo4j is essentially the leading company in this field. If you search for anything related to graph databases, whether on Google or Baidu, more than half of the articles you find are written by people associated with Neo4j. The books available on the market are also mostly about how to use Neo4j.
4. Definition and classification of graph databases
Here is a classification of graph databases from a European consulting firm:
It divides graph databases into four categories:
- Operational
- Text
- Analytical
- Mixed
We pay the most attention to the operational category. The nested circles at the lower left represent the score: the closer a product is to the center, the higher its score.
We have highlighted several important products that are closely related to our technology selection. One is Titan. Titan was acquired in 2015 or 2016; after the acquisition the Titan project ended and the work moved into a closed-source product. Later a fork called JanusGraph appeared. JanusGraph has close ties to IBM, which also offers graph databases on its cloud services, including the IBM Graph shown in this chart, which is actually JanusGraph.
Let’s take a closer look at what a graph database looks like architecturally.
JanusGraph architecture
Let's take JanusGraph as the example, and Titan along with it. Titan and JanusGraph share the same storage design: neither has its own storage engine, and data is stored in one of several pluggable external storage backends. JanusGraph supports more backend storage products and provides an indexing scheme on top of the backend storage, so it is essentially an open design.
This is the structure of the product:
You can see that it is more of a middleware structure. If you look at the distributed databases that are popular today, whether MySQL-based or any of the other databases discussed today, there is a tendency to separate storage from compute.
JanusGraph does the same thing, except that the storage layer it chooses is third-party rather than built in-house. The advantage of such a structure is that, once storage is separated out, scaling the storage horizontally is not a problem at all.
But in this architecture the front-end computing node is somewhat limited, because computation is still single-point. Single-point does not mean there is only one machine at the front; the front end can be scaled horizontally, but each node handles the task it receives on its own, and the task cannot be split. Another issue is that the indexing schemes it provides are not particularly efficient.
Master-slave architecture
Let’s take a look at the architecture of Neo4j:
Neo4j does everything itself, so it calls its approach native storage. The benefit of native storage is obvious: because the storage design is integrated, it can be optimized to the fullest, so single-node performance has a significant advantage.
JanusGraph is not native storage and relies on third parties, while Neo4j handles everything on its own. Neo4j is open source, but the open source version does not support clustering, although a single node does have strong processing power. The commercial version provides a cluster with synchronization from the master node to the slave nodes. This means the entire store is simply replicated, so capacity is still bounded by what the master can hold. That is a real disadvantage: it cannot store too much data.
The following is a breakdown of some of the most important aspects of the graph database architecture:
Query language diversity
From the outside in, the first layer of interaction is what the query language looks like:
With relational databases the language is SQL, so whatever you do, you interact through SQL. Graph databases are different, perhaps because they are at a stage of development where a hundred flowers are blooming and everyone builds their own interface. The oldest is probably SPARQL, the first one listed here, but it is more academic.
Gremlin – Turing-complete language
The second, Gremlin, is supported by many databases, and there are other extensions as well:
Gremlin supports both declarative and imperative styles. It claims to be a Turing-complete language, which follows from its imperative style and the control over details that this allows. The diagram at the bottom right shows the flexibility of a graph query language: starting from one blue node to find other blue nodes involves both forward and reverse searches. If you try to express such a structural query in SQL, it becomes very complex.
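The forward-then-reverse pattern described above can be sketched in plain Python, without any graph database. Everything here is an assumption made for illustration: the `holdings` data, the node names, and the `co_holders` function are hypothetical, and the sketch only mimics the idea of following outgoing edges and then incoming edges (roughly what Gremlin's out and in steps express), not Gremlin itself.

```python
from collections import defaultdict

# Hypothetical edges: (person, company) means "person holds shares in company"
holdings = [("p1", "c1"), ("p2", "c1"), ("p2", "c2"), ("p3", "c2")]

out_index = defaultdict(set)  # person -> companies (forward direction)
in_index = defaultdict(set)   # company -> persons (reverse direction)
for person, company in holdings:
    out_index[person].add(company)
    in_index[company].add(person)


def co_holders(person):
    """Forward search to the companies, then reverse search back to people."""
    result = set()
    for company in out_index.get(person, ()):
        result |= in_index[company]
    result.discard(person)  # the starting node is not its own co-holder
    return result


print(sorted(co_holders("p2")))  # ['p1', 'p3']
```

Because both directions of every edge are indexed, walking against the edge direction costs the same as walking along it, which is the part that is awkward to express as SQL joins.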
I won’t say much about the rest, as shown below:
The storage approaches fall into four categories: relational database storage, native stand-alone storage, distributed storage, and native distributed storage.
The first category is graph support built on a relational database, such as SQL Server 2017. It supports some graph database features, but the back end is still relational storage, which means it lacks some flexibility because the data is still stored in tables. A similar example is this pg-based (PostgreSQL-based) graph database, which is relatively flexible:
The following figure shows the structure of Neo4j; for reasons of time I won't go through it:
Here is JanusGraph, with its data stored in HBase:
Another key point is that the very definition of a graph database is highly controversial. One view, put forward by Neo4j, is that a graph database must be built on index-free adjacency. Other vendors disagree and believe that other schemes can also achieve good performance, as shown in the following figure:
Another point is the impact of the storage engine, which can improve performance considerably. Some products introduce a newer design, WiscKey, into the storage engine, which differs from the traditional LSM-tree approach used by HBase and RocksDB.
Let's talk about algorithms. We cannot expect people to inspect such large amounts of data by eye. A drama has a few dozen characters at most, but we face hundreds of millions or even billions of records, and at that scale algorithms give much better results.
A typical algorithm is PageRank:
In the picture, node 4 has the most connections, while node 6 has the second-highest PageRank value even though it has few connections. What does this mean? In a criminal network, for example, node 6 is more likely to be the head of the organization, while node 4 is likely to be the strategist or executor who keeps in contact with the other members of the gang. In this way the algorithm can be mapped onto practical problems.
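The effect described above, a node with few connections still ranking second, can be reproduced with a minimal power-iteration PageRank in Python. This is a sketch under stated assumptions: the small graph and the node names (`hub`, `boss`, `m1` to `m4`) are hypothetical and are not the graph from the figure, and the damping factor of 0.85 is the commonly used default rather than a value given in the talk.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Four members report to a hub; the hub's only outgoing link goes to the boss.
edges = [("m1", "hub"), ("m2", "hub"), ("m3", "hub"), ("m4", "hub"),
         ("hub", "boss"), ("boss", "hub")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

The hub ranks first because it has the most incoming links, while the boss has a single incoming link yet ranks second, because that one link comes from the highest-ranked node.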
Another is community detection. One of the simpler algorithms is label propagation, which addresses how to divide a graph into subgraphs so that the scope of analysis can be narrowed:
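A minimal label propagation sketch in Python follows. It rests on several assumptions made only for illustration: the six-node graph is hypothetical, nodes are updated in sorted order, and ties are broken by taking the largest label so that the result is reproducible. Practical implementations usually visit nodes in random order and break ties randomly.

```python
from collections import Counter


def label_propagation(adjacency, max_rounds=20):
    """Each node repeatedly adopts the most frequent label among its neighbours."""
    labels = {node: node for node in adjacency}  # every node starts as its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[neighbour] for neighbour in adjacency[node])
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts.values())
            # Deterministic tie-break (an assumption of this sketch): largest label wins.
            new_label = max(label for label, count in counts.items() if count == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Two triangles (1-2-3 and 4-5-6) joined by a single bridge edge 3-4.
adjacency = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}
print(label_propagation(adjacency))
# {1: 3, 2: 3, 3: 3, 4: 6, 5: 6, 6: 6}
```

The two triangles end up with different labels, so later analysis can be restricted to one subgraph at a time.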
5. Background and trends
Commercial products have sprung up
Next, let's talk about the situation in China and abroad. Graph databases are not yet as mature as one would hope, but 2017 was a remarkable year in which many graph databases emerged:
Several big vendors, including Amazon, Huawei, and Alibaba, are promoting their own graph database products, and the overall graph database architecture is also evolving.
GitHub stars of open source graph databases
We took a look at the number of stars of the open source graph databases on GitHub:
You can see that Dgraph and ArangoDB, which we talked about earlier, are trending well. Cayley was developed by a former Google employee, but it is not a Google project. It started out very high, but its growth slowed later.
Technology development trends
Looking at the development of all these graph database products, it can basically be divided into three stages. Some graph database vendors define these stages themselves as 1.0, 2.0, and 3.0. There is no authoritative definition, so we have made our own subdivision according to our understanding:
At first, products were stand-alone for both storage and computing. Then they became able to do parallel computing, but in the style of systems built for offline computation, which cannot support interactive use. Some products have now moved to 3.0, which means both computing and storage are scalable, and some can do true parallel computing: the task received by one node can be split up so that multiple nodes collaborate on the computation.
Third, graph database practice at Everbright Bank
1. Application scenarios in the banking industry
Here are some typical scenarios we have worked on at Everbright:
Take the credit approval process as an example. Credit approval is a risk control step, so when handling a corporate loan we look at which companies hold shares in you and which companies you hold shares in. Given such a complex ownership structure, restrictions such as credit limits may need to be applied, because risk is transmitted along these relationships.
Other scenarios are anti-money laundering and family recusal. Family recusal is really an internal audit matter: it means making sure there is no connection between the enterprise applying for a loan and the employee who is reviewing that loan.
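The family recusal check can be sketched as a path search between an employee and an enterprise. The Python example below is a minimal illustration under stated assumptions: the relationship data, the node names, the `max_hops` limit, and the choice of breadth-first search are all hypothetical and are not a description of the bank's actual system.

```python
from collections import defaultdict, deque

# Hypothetical relationships, treated as undirected for the connection check.
relations = [
    ("employee_1", "person_7"),   # family relationship
    ("person_7", "company_x"),    # shareholder of the borrowing enterprise
    ("employee_2", "person_9"),   # unrelated to company_x
]

adjacency = defaultdict(set)
for left, right in relations:
    adjacency[left].add(right)
    adjacency[right].add(left)


def find_connection(adjacency, start, goal, max_hops=4):
    """Breadth-first search; returns the shortest path or None within max_hops."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(find_connection(adjacency, "employee_1", "company_x"))
# ['employee_1', 'person_7', 'company_x']
print(find_connection(adjacency, "employee_2", "company_x"))
# None
```

A non-empty path means the reviewer should be recused; the hop limit keeps the search from wandering across the whole graph when no close connection exists.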
Audit analysis business (simplified) logical data model
Let’s briefly draw a picture of some of the elements that will appear in it:
This is really only an illustration. In a real scenario the data volume of the graph database is quite high, because the graph becomes very large if every edge represents a transaction. This is different from RDF products, whose data volumes are relatively small.
For example, Freebase, which was acquired by Google, was used to build a knowledge graph. Its maximum volume was a little over 3 billion, but with actual business data that number is easily exceeded. The challenges faced by graph databases serving business workloads are therefore different from those faced by traditional RDF storage.
2. System architecture
This is a simple architectural level representation:
That is the main content of my talk. If you are interested, feel free to leave a message and get in touch with me!
Q & A
Q: If our order of magnitude is up to billions of edges and billions of nodes, we may be more concerned about performance when calculating shortest paths. Do you have any suggestions on graph database selection? We have seen Dgraph before but have not actually tested it. Do you think Dgraph meets our requirements of data volume and performance?
A: As for billions of nodes and edges, my personal feeling is based on the data volumes we looked at before. We talked with Neo4j at the time, since Neo4j was relatively mature, and Neo4j's official position was that it could handle this volume. I mean the stand-alone version, because a few billion is not a lot of data.
Another consideration in choosing a product is not scale but ecosystem: you need to be sure it can be supported both by your own capabilities and by the capabilities of the vendor. The biggest problem with Neo4j is that it has no real presence in China. If you have enough ability to handle it yourselves, that is fine. If you do not, there is no company in China that can actually provide support, which is a potential risk.