This article is reprinted from the public account: TigerGraph Vega Star.
Graph databases have attracted a great deal of attention because of their advantages over relational databases. However, with Amazon, Microsoft, Oracle, IBM, and other technology companies rushing into the space, evaluating products from different vendors has become more challenging when selecting a graph database for a project.
Today, I'll share my assessment and insights on how to evaluate graph databases, based on benchmarking several of them.
Data loading capability
If you plan to use a graph database to solve real-world problems, this is the first important criterion for distinguishing a good graph database from a poor one. I recommend trying a public dataset with more than 1 billion edges and 50-100 million vertices. Things to test include loading language and API support, loading speed (the load should complete within 1 hour at most), and support for incremental and batch loading. If this minimal load succeeds and the results are satisfactory, the next thing to try is a graph with more than 10 vertex types and edge types, each with different attributes. Can you load it easily? Does the database support complex attribute types such as Map, Set, and List? How does it handle loading JSON files? All of these are practical obstacles when working with real-world graph data.
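The loading-speed check described above can be scripted as a small timing harness. The following is a minimal sketch, not any vendor's tooling: `load_batch` is a hypothetical callable standing in for whatever bulk-load API the database under test exposes, and `in_memory_loader` is a stand-in used only so that the sketch runs on its own. The one hard number it encodes is arithmetic: loading 1 billion edges within 1 hour requires a sustained rate of roughly 278,000 edges per second.

```python
import time

ONE_HOUR_SECONDS = 3600
TARGET_EDGES = 1_000_000_000


def required_edges_per_second(total_edges=TARGET_EDGES,
                              budget_seconds=ONE_HOUR_SECONDS):
    """Sustained rate needed to load total_edges within budget_seconds."""
    return total_edges / budget_seconds


def benchmark_load(edges, load_batch, batch_size=10_000):
    """Feed edges to load_batch in batches; return (loaded, seconds, rate).

    load_batch is a hypothetical stand-in for the bulk-load call of the
    database being evaluated.
    """
    loaded = 0
    batch = []
    started = time.perf_counter()
    for edge in edges:
        batch.append(edge)
        if len(batch) == batch_size:
            load_batch(batch)
            loaded += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        load_batch(batch)
        loaded += len(batch)
    elapsed = time.perf_counter() - started
    rate = loaded / elapsed if elapsed > 0 else float("inf")
    return loaded, elapsed, rate


def in_memory_loader():
    """Toy loader that appends to a list, used only to exercise the harness."""
    store = []

    def load_batch(batch):
        store.extend(batch)

    return store, load_batch


if __name__ == "__main__":
    store, load_batch = in_memory_loader()
    synthetic_edges = ((i, i + 1) for i in range(25_000))
    loaded, elapsed, rate = benchmark_load(synthetic_edges, load_batch)
    print(f"loaded {loaded} edges in {elapsed:.4f}s ({rate:,.0f} edges/s)")
    print(f"needed for 1B edges in 1 hour: {required_edges_per_second():,.0f} edges/s")
```

In a real evaluation, the toy loader would be replaced by the vendor's loading client, and the same harness could be run once for a batch load and once for incremental loads to compare the two.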
Real-time updates mean that the database can be updated while queries are being processed. Updates include inserting or deleting a vertex or edge, updating the attributes of existing vertices or edges, and so on. A graph database provides concurrency control so that different operations can be interleaved while the end result remains consistent, as if all operations had been executed one after another. Note that a graph computing platform is different from a graph database. Most HDFS-based graph platforms (such as Giraph and GraphX) do not support real-time updates due to HDFS design limitations.
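To make the idea of interleaved updates and queries concrete, here is a minimal sketch that uses a single global lock as the simplest possible form of concurrency control. This is an illustrative assumption only; it does not describe how any of the products named here implement concurrency, and the class and function names (`ConcurrentGraph`, `run_demo`) are hypothetical.

```python
import threading


class ConcurrentGraph:
    """Toy graph whose updates and reads are serialized by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._adjacency = {}

    def add_edge(self, src, dst):
        with self._lock:
            self._adjacency.setdefault(src, set()).add(dst)
            self._adjacency.setdefault(dst, set())

    def neighbor_count(self, vertex):
        with self._lock:
            return len(self._adjacency.get(vertex, ()))


def run_demo(writers=4, edges_per_writer=250, reads=100):
    """Run writers and one reader concurrently; return (final_count, observed)."""
    graph = ConcurrentGraph()
    observed = []

    def write(worker_id):
        for i in range(edges_per_writer):
            graph.add_edge("hub", f"v{worker_id}_{i}")

    def read():
        for _ in range(reads):
            observed.append(graph.neighbor_count("hub"))

    threads = [threading.Thread(target=write, args=(w,)) for w in range(writers)]
    threads.append(threading.Thread(target=read))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return graph.neighbor_count("hub"), observed


if __name__ == "__main__":
    final_count, observed = run_demo()
    print("final neighbor count:", final_count)
    print("reader saw counts from", min(observed), "to", max(observed))
```

The reader runs while the writers are still inserting, yet every count it sees corresponds to some serial order of the inserts, and the final count equals the total number of edges added. A production system would use much finer-grained control than one lock, but the consistency property being tested is the same.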
Many graph databases are non-native, which means they store graph data on disk as relational tables, RDF triples, or key-value pairs, and provide middle-tier APIs in memory to simulate graph traversal. Native graph databases store graph data directly as a graph model, that is, as vertices and edges. The most popular of these are TigerGraph and Neo4j. The advantage of a native graph storage format is that it inherits the benefits of the graph model for free: the graph model is a natural index, so each query touches only the relevant data behind that index. The difference between TigerGraph and Neo4j is that TigerGraph can scale out with an MPP (massively parallel processing) architecture, whereas Neo4j stores edges in doubly linked lists and is not built on an MPP architecture, so it cannot scale out.
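The "natural index" point can be illustrated with a toy comparison. The sketch below is a deliberate simplification under stated assumptions: the "non-native" side is modeled as an unindexed edge table that must be scanned, whereas real relational or key-value backends would normally add indexes; the "native" side is modeled as a plain adjacency list. All names (`EDGE_TABLE`, `neighbors_by_scan`, and so on) are hypothetical. Each function also returns how many records it examined.

```python
from collections import defaultdict

# Toy edge table: (source, destination) rows.
EDGE_TABLE = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "dave"),
    ("eve", "frank"),
]


def neighbors_by_scan(edge_table, vertex):
    """Unindexed table model: examine every row to find one vertex's edges."""
    examined = 0
    result = []
    for src, dst in edge_table:
        examined += 1
        if src == vertex:
            result.append(dst)
    return result, examined


def build_adjacency(edge_table):
    """Graph model: each vertex points directly at its own out-edges."""
    adjacency = defaultdict(list)
    for src, dst in edge_table:
        adjacency[src].append(dst)
    return adjacency


def neighbors_by_adjacency(adjacency, vertex):
    """Adjacency model: touch only the edges stored with the vertex."""
    result = list(adjacency.get(vertex, ()))
    return result, len(result)


if __name__ == "__main__":
    adjacency = build_adjacency(EDGE_TABLE)
    print(neighbors_by_scan(EDGE_TABLE, "alice"))
    print(neighbors_by_adjacency(adjacency, "alice"))
```

Both lookups return the same neighbors, but the scan examines every row of the table while the adjacency lookup examines only the edges attached to the queried vertex. The gap widens as the graph grows and as traversals chain several such lookups together.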
A graph query language can be Turing-complete, meaning that users can write any algorithm in it to extract valuable information from the data. Currently there is no dominant graph query language, and the languages on the market differ widely in expressiveness. TigerGraph's GSQL is a user-friendly, highly expressive query language, closer in style to PL/SQL, in which users can write any graph algorithm declaratively. Another declarative language is Neo4j's Cypher, but it is difficult to write finely controlled graph analytics algorithms in it. A third is Gremlin, which is highly expressive but has a steep learning curve. Users should read the manual for each language and try a variety of queries to assess each language's expressiveness, learning curve, design philosophy, and so on. Relational database experts can read the guide "Eight prerequisites for graph database query languages", which describes how to explore and evaluate graph query languages.
Some graph databases can scale horizontally in storage, meaning that if you double the number of machines, you double the storage capacity, but the computation speed does not double. Some graph databases can also scale horizontally in computation; to do so, they must have an MPP architecture. Vertical scaling means that on a single machine, the graph computing engine can exploit multiple cores for parallel computing and speed up as more cores are added. The important thing to remember when evaluating this criterion is never to forget loading capability. If doubling the number of machines doubles or even triples the loading time, the product cannot be considered a qualified scale-out database. TigerGraph's MPP architecture scales both horizontally and vertically. Amazon Neptune cannot scale horizontally; it can only increase throughput by adding replicas of the original data. JanusGraph and ArangoDB can scale storage horizontally, but cannot scale computation horizontally within a single query.
Does it support OLTP, OLAP, or HTAP? Some graph platforms are used purely for offline, large-scale processing (the OLAP type), such as PageRank, gradient descent, and weakly connected components. Some graph databases support point-traversal queries (such as returning the 3-step neighbors of a given person), which are more OLTP-style. A good gauge is whether the vendor talks about QPS (queries per second); if so, the product is likely to support OLTP. Some products support both, which is known as HTAP. Try simple weakly connected components and PageRank algorithms and see whether they complete on a billion-edge graph.
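As a reference point for the OLAP-style test, the sketch below finds weakly connected subgraphs (components) on a tiny graph using a union-find structure. It is a single-machine illustration with hypothetical names, not the implementation used by any product discussed here; its purpose is only to show what result the database's own algorithm should reproduce at scale.

```python
def weakly_connected_components(vertices, edges):
    """Group vertices into components, ignoring edge direction."""
    parent = {vertex: vertex for vertex in vertices}

    def find(vertex):
        # Walk to the root, halving the path along the way.
        while parent[vertex] != vertex:
            parent[vertex] = parent[parent[vertex]]
            vertex = parent[vertex]
        return vertex

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst  # merge the two components

    components = {}
    for vertex in vertices:
        components.setdefault(find(vertex), set()).add(vertex)
    return list(components.values())


if __name__ == "__main__":
    vertices = ["a", "b", "c", "d", "e"]
    edges = [("a", "b"), ("c", "b"), ("d", "e")]
    for component in weakly_connected_components(vertices, edges):
        print(sorted(component))
```

Because direction is ignored, the edges a→b and c→b place a, b, and c in one component even though no directed path links a and c. Running the equivalent built-in algorithm on a billion-edge graph is the real test.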
Whether multiple graphs are supported
This is an enterprise security and concurrency requirement: different departments want to access different graphs and subgraphs, whether exclusive or shared, on the same server or cluster at the same time. At the time of publication, the only vendor that natively supports multiple graphs is TigerGraph. Neo4j lets you mount different graph files for use, but not use multiple graphs simultaneously on the same server. ArangoDB has the concept of a graph view.
Whether to model first
Some graph database vendors claim that they do not require pre-modeling. Others require a schema to be defined up front, as traditional relational databases do, and support changing the schema online. In my opinion, up-front modeling is very important for real-time enterprise applications, because defining the model before importing data forces the developer or architect to spend design time thinking ahead about what data to load and how to organize the graph. Separating metadata from data instances is a well-known technique for achieving better performance in database systems. Unfortunately, the current graph database market overemphasizes the explosion in the number of element types and therefore promotes schema-less graphs as a way to meet this challenge, which actually leads to significant performance degradation during query processing.
As a database builder, I personally disagree with one-size-fits-all solutions. I follow the technology itself and believe that the best performance and user experience come from the best and most up-to-date specialized databases. In the current market, too many vendors adapt their offerings to new trends by adding new interfaces on top of existing cores that were originally designed and architected for another market. This practice makes it difficult to deliver the best product in the data management software sector, and it creates friction for enterprises trying to adopt the right technology where it is needed.
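The separation of metadata from data instances can be sketched in a few lines. In this hypothetical example, the schema (attribute names and types) is stored once, and each vertex instance is stored as a compact tuple of values validated against it. The names `SCHEMA` and `encode_vertex` are illustrative assumptions and do not correspond to any product's API.

```python
# Metadata: declared once, shared by every instance of the type.
SCHEMA = {
    "Person": {"name": str, "age": int},
    "Company": {"name": str},
}


def encode_vertex(schema, vertex_type, attributes):
    """Validate attributes against the schema and return a compact tuple.

    Attribute names live only in the schema, so each stored instance
    carries values alone, in the schema's declared order.
    """
    if vertex_type not in schema:
        raise ValueError(f"unknown vertex type: {vertex_type}")
    expected = schema[vertex_type]
    if set(attributes) != set(expected):
        raise ValueError(f"attributes must be exactly {sorted(expected)}")
    for name, expected_type in expected.items():
        if not isinstance(attributes[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")
    return tuple(attributes[name] for name in expected)


if __name__ == "__main__":
    print(encode_vertex(SCHEMA, "Person", {"age": 30, "name": "Ann"}))
```

With a schema, the engine knows the layout of every `Person` in advance and does not have to inspect each record to discover its attributes at query time, which is one reason schema-first designs can be faster. A schema-less store would instead have to carry the attribute names with every instance.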
Real time, real time, real time: important things are worth saying three times
In most graph databases today, query response time slows down significantly at three steps (hops). A simple way to determine whether a product is a top-tier graph database is to test its 3-step path neighbor count query. For a graph with a billion edges, given an input vertex, can the graph database find all vertices k steps away from the input vertex and return their total number? This simple test is a good way to tell a good graph database from a bad one. In our upcoming benchmark, I will reveal the true performance of the top five graph databases; only one of them passes this simple test, and all the others fail. The reason is that, starting from the seed vertex, the number of neighbors grows exponentially with each additional step away from it. One criterion that most graph database vendors avoid talking about is the performance of deep link analysis queries (more than 3 steps).
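The k-step neighbor count query can be sketched as a breadth-first expansion. The version below makes one assumption about the definition: it counts distinct vertices whose shortest distance from the seed is exactly k. Other benchmarks may count every vertex within k steps instead. The function names and the synthetic tree are hypothetical and exist only to show the exponential growth.

```python
from collections import defaultdict


def k_hop_neighbor_count(adjacency, seed, k):
    """Count distinct vertices at shortest distance exactly k from seed."""
    visited = {seed}
    frontier = {seed}
    for _ in range(k):
        next_frontier = set()
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
        if not frontier:  # nothing further is reachable
            break
    return len(frontier)


def build_tree(branching, depth):
    """Synthetic directed tree rooted at vertex 0, for illustration only."""
    adjacency = defaultdict(list)
    next_id = 1
    level = [0]
    for _ in range(depth):
        new_level = []
        for parent in level:
            for _ in range(branching):
                adjacency[parent].append(next_id)
                new_level.append(next_id)
                next_id += 1
        level = new_level
    return adjacency


if __name__ == "__main__":
    tree = build_tree(branching=3, depth=4)
    for k in range(1, 5):
        print(k, k_hop_neighbor_count(tree, 0, k))
```

With a branching factor of 3, the counts at steps 1 through 4 are 3, 9, 27, and 81. Real graphs often have average degrees in the tens or hundreds, so the frontier at three steps can already reach a large share of a billion-edge graph, which is why this query separates products so sharply.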
Whether built-in graph visualization is supported
A nice advantage of using graph databases is object-oriented thinking. Unlike traditional relational databases, where everything from modeling to query results is in a table format, graph databases map real-world objects cleanly to vertices and edges. If the returned results are visualized as vertices and edges, the query results can be easily understood by the human eye. We’ll share some examples later.
In short, graph databases are in many respects the latest hot, newly emerging technology. Buyers and practitioners are advised to weigh these criteria carefully before selecting a graph database.
About “NoSQL Ramblings”
NoSQL mainly refers to distributed, non-relational data storage technologies. This is a very broad definition that can be said to cover all aspects of distributed systems technology. As artificial intelligence, the Internet of Things, big data, cloud computing, and blockchain continue to spread, NoSQL technology will play an increasingly valuable role.