Graph theory
Excerpt from the book Learning Neo4j, chapters 2, 3, and 4
This is the second article in the series; the first one is linked here.
The basis of graph database query efficiency
Given a specific starting point (an anchor node), the graph database will be an excellent tool to spin around that node and figure out whether there are matching patterns connected to it. Non-matching patterns will be ignored, and matching patterns that are not connected to the starting node will not even be considered.
This, actually, is one of the key performance characteristics of a graph database: as soon as you “grab” a starting node, the database will only explore the vicinity of that starting node and will be completely oblivious to anything that is not connected to the starting node. The key performance characteristic that follows from this is that query performance is very independent of the data set size, because in most graphs everything is not connected to everything. By the same token, as we will see later, performance will be much more dependent on the size of the result set, and this will also be one of the key things to keep in mind when putting together your persistence architecture.
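To make this concrete, here is a minimal Cypher sketch against a hypothetical library graph (the reader name, labels, and relationship types are invented for illustration): the query anchors on a single reader and only ever explores that reader’s vicinity, no matter how large the rest of the graph is.

```cypher
// Anchor on one reader, then pattern-match only in that reader's neighborhood.
// Everything not connected to 'Alice' is never touched by this query.
MATCH (r:Reader {name: 'Alice'})-[:HAS_READ]->(b:Book)<-[:WROTE]-(a:Author)
RETURN b.title, a.name;
```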
Path finding queries
Another type of query that is extremely well suited for graph databases are queries where you would be looking to find out how different data elements are related to each other. In other words, finding the paths between different nodes on your graph. The problem with such queries in other database management systems is that you would actually have to understand the structure of the potential paths extremely well. You would have to be able to tell the database how to “jump” from table to table, so to speak.
In a graph database, you can still do that, but typically you won’t. You just tell the database to apply a graph algorithm to a starting point and an endpoint and be done with it. It’s up to the database to figure out if and how these data elements would be connected to each other and return the result as a path expression for you to use in your system. The fact that you are able to delegate this to the database is extremely useful, and often leads to unexpected and valuable insights.
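For example, in Cypher you can delegate the path finding entirely to the database with shortestPath. A hedged sketch, assuming a simple Person graph and capping the search at five hops:

```cypher
// Let the database figure out if and how Alice and Bob are connected,
// instead of spelling out the table-to-table "jumps" ourselves.
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
MATCH p = shortestPath((a)-[*..5]-(b))
RETURN p;
```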
Obviously, the query categories mentioned above are just that: categories. You would have to apply them to any of the fields of research that we discussed in Chapter 1, Graphs and Graph Theory — an Introduction, to really reap the benefits. We will come back to this in later chapters.
When not to use a graph database
The following are the kinds of operations for which I would personally not recommend using a graph database like Neo4j, or at least not in isolation.
Large, set-oriented queries
If you are trying to put together large lists of things (effectively, sets) that do not require a lot of joining, or if you want to perform a lot of aggregation (summing, counting, averaging, and so on) on these sets, then the performance of the graph database compared to other database management systems will not be as favorable. A graph database will certainly be able to perform these operations, but the performance advantage will be smaller, or perhaps even negative. Set-oriented databases such as relational database management systems will most likely perform just as well, or even better.
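To illustrate the kind of query meant here, consider a scan-and-aggregate over a whole set of (hypothetical) order nodes. The query is perfectly expressible in Cypher, but a relational system will typically serve this full-scan aggregation pattern at least as well:

```cypher
// Set-oriented aggregation over every Order node: no anchor node,
// no traversal advantage, so the graph database has little edge here.
MATCH (o:Order)
RETURN o.country, count(o) AS orders, avg(o.total) AS averageTotal;
```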
Graph global operations
As we discussed earlier, graph theory has done a lot of fascinating work on the analysis and understanding of graphs in their entirety. Finding clusters of nodes, discovering unknown patterns of relationships between nodes, and defining the centrality and/or betweenness of specific graph components are extremely interesting and wonderful concepts, but they are very different from the concepts that graph databases excel at. These concepts look at the graph in its entirety, and we refer to them as graph global operations. While graph databases are extremely powerful at answering “graph local” questions, there is an entire category of graph tools (often referred to as graph processing engines or graph compute engines) that address these graph global problems.
Many of these tools serve an extremely specific purpose, and even use specific hardware and software (usually using lots of memory and CPU horsepower) to achieve their tasks, and typically are part of a very different side of the IT architecture. Graph processing is typically done in batches, in the background, over the course of several hours/days/weeks and would typically not be well placed between a web request and a web response. It’s a very different kind of ball game.
Simple, aggregate-oriented queries
We mentioned that graphs and graph database management systems are great for complex queries — things that would make your relational system choke. As a consequence, simple queries, where write patterns and read patterns align with the aggregates that we are trying to store, are typically served quite inefficiently in a graph, and would be more efficiently handled by an aggregate-oriented key-value or document store. If complexity is low, the advantage of using a graph database system will be lower too.
Support for ACID transactions
Consistency: This means that only consistent, or “valid”, data will be allowed into the database. In relational terminology, this often means that the schema of the database has to be applied and maintained at all times. The main consistency requirement in Neo4j is actually that graph relationships must have a start node and an end node; relationships cannot be dangling. Aside from this, the consistency rules in Neo4j are obviously much looser, as Neo4j implements the concept of an “optional” schema.
The optional schema of Neo4j is really interesting: the idea is that it is incredibly useful to have a schema-free database when you are still at the beginning of your development cycles. As you are refining your knowledge about the domain and its requirements, your data model will simply grow with you, free of any requirement to pre-impose a schema on your iterations. However, as you move closer to production, schema — and therefore consistency — can be really useful. At that point, system administrators and business owners alike will want to have more checks and balances around data quality, and the C in ACID will become more important. Neo4j fully supports both approaches, which is tremendously useful in today’s agile development methodologies.
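As a sketch of what “adding schema later” looks like in practice: a uniqueness constraint on a hypothetical Book label, written in the Neo4j 2.x/3.x constraint syntax of the book’s era (newer versions express the same thing with CREATE CONSTRAINT ... FOR ... REQUIRE):

```cypher
// Add a consistency rule late in the development cycle: from now on,
// no two Book nodes may share the same isbn property.
CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE;
```

The point is that the constraint can be introduced at any moment in the life of the database, provided the existing data already satisfies it; nothing forces you to define it up front.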
The four fundamental data constructs
- Nodes: These are typically used to store entity information. In the preceding example, these are the individual books, readers, and authors that are present in the library data model.
- Relationships: These are used to connect nodes to one another explicitly and therefore provide a means of structuring your entities. They are the equivalent of an explicitly stored, and therefore pre-calculated, join-like operation in a relational database management system. As we have seen in the previous chapters, joins are no longer a query-time operation — they are as simple as the traversal of a relationship connecting two nodes. Relationships always have a type, a start node, an end node, and a direction. They can be self-referencing/looping, but can never be dangling (missing a start or end node).
- Properties: Both nodes and relationships are containers for properties, which are effectively name/value pairs. In the case of the nodes, this is very intuitive. Just like a record in the relational database world has one or more fields or attributes, so can the node have one or more properties. Less intuitive is the fact that relationships can have properties too.
- Labels: This is a fundamental data model construct that was added to Neo4j with version 2.0 at the end of 2013. Labels are a means to quickly and efficiently create subgraphs. By assigning labels to nodes, Neo4j makes the data model of most users a lot simpler. There is no longer a need to work with a type property on the nodes, or a need to connect nodes to definition nodes that provide meta-information about the graph. Neo4j now does this out of the box — and this is a huge asset, now and for the future. At the time of writing this book, labels are primarily used for indexing and some limited schema constraints. However, in the future, it is likely that the structural understanding that labels provide about the data stored in the graph will be used for other purposes, such as additional schema, security, and graph sharding/distribution — and perhaps others. A small Cypher sketch of all four constructs follows this list.
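Here is the promised sketch showing all four constructs in a single Cypher statement; the labels, relationship type, and property names are illustrative, not taken from the book:

```cypher
// Two labeled nodes (each with a property), connected by a typed, directed
// relationship that itself carries a property.
CREATE (:Reader {name: 'Alice'})-[:HAS_READ {rating: 5}]->(:Book {title: 'Learning Neo4j'});
```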
What we know — ER Diagrams and Relational schemas
In a relational system, we have been taught to start out modeling with an Entity-Relationship diagram. We usually find that, from such a domain description, we can:
- Extract the entities by looking at the nouns of the description
- Extract the properties by looking at the adjectives of the description
- Extract the relationships by looking at the verbs of the description
These are, of course, generic guidelines that will need to be tried and tested on every domain individually to make sure that it is an appropriate fit. However, for now, let’s look at the following diagram:
As you can see from the preceding figure, ER diagrams have the advantage of at least attempting to capture the business domain in a real-world model. However, they suffer from quite a few disadvantages too. Despite being visually very similar to graph visualizations, ER diagrams immediately demonstrate the shortcomings of the relational model to capture a rich domain. Although they allow relationships to be named (something that graph databases fully embrace, but relational stores do not), ER diagrams allow only single, undirected, named but otherwise unqualified relationships between entities. In this respect, the relational model is a poor fit for real-world domains where relationships between entities are numerous, semantically rich, and diverse. The labeled property graph, as we have seen previously, allows for a much richer description of the domain, specifically with regard to the relationships between the entities, which will be multiple, directed, and qualified through properties.
A graph model — A simple, high-fidelity model of reality
In the following figure, you will find the graph model and the relational model side by side:
On the right-hand side of the image, you will see the three tables in the relational model:
- A customers table with a number of customer records
- An accounts table with a number of accounts of these customers
- A typical join table that links customers to accounts
What is important here is the implication of this construction: every single time we want to find the accounts of a customer, we need to perform the following:
- Look up the customer by their key in the customer table.
- Join the customer using this key to their accounts.
- Look up the customer’s accounts in the accounts table using the account keys that we found in the previous step.
Contrast this with the left-hand side of the figure, and you will see that the model is much simpler. We find the following elements:
- A node for the customer.
- Three nodes for the accounts.
- Three relationships linking the accounts to the customer.
Finding the accounts of the customer is as simple as performing the following:
- Finding the customer through an index lookup on the key that we specify in advance
- Traversing out from this customer node using the owns relationship to the accounts that we need
In the preceding example, we are performing only a single join operation over two tables. This operation will become exponentially more expensive in a relational database management system as the number of join operations increases and logarithmically more expensive as the datasets involved in the join operation become larger and larger. Calculating the Cartesian product of two sets (which is what relational databases need to do in order to perform the join) becomes more and more computationally complex as the tables grow larger.
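The graph-side equivalent of the three relational steps is a single traversal. A minimal Cypher sketch, assuming a Customer label, a customerId key, and an OWNS relationship type matching the owns relationship in the figure:

```cypher
// Step 1: index lookup on the customer key.
// Step 2: traverse the OWNS relationships out to the accounts. Done.
MATCH (c:Customer {customerId: 42})-[:OWNS]->(a:Account)
RETURN a;
```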
So here’s an example of an accident model (this is just one possible design; it is not standardized):
Supplementary note:
A key concept to be kept in mind when aligning relationships with use cases is the naming strategy for your relationship types. The general recommendation is to use as few generic names such as HAS_A or IS_PART_OF as possible, but to be more specific in these naming efforts.
In other words, a generic relationship type carries very little information; relationship types should be as specific as possible.
To continue the accident example:
First, find the accident by its number, for example, 123456.
Then find the people who are associated with it. A person is an abstraction (either a police officer dealing with the accident, or the person responsible for the accident, and so on). Because relationships can have properties, all you need is a relationship property with a clear description of why the two are related. A graph database design like this can structurally reproduce the E-R diagram perfectly.
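A hedged Cypher sketch of this accident model; every label, relationship type, and name below is invented for illustration, and the specific relationship types follow the naming advice above:

```cypher
// Build the accident and the people around it.
CREATE (acc:Accident {number: '123456'})
CREATE (:Person {name: 'P. Officer'})-[:HANDLED]->(acc)
CREATE (:Person {name: 'R. Party'})-[:RESPONSIBLE_FOR]->(acc);

// Later: find everyone associated with accident 123456, and why.
MATCH (p:Person)-[rel]->(acc:Accident {number: '123456'})
RETURN p.name, type(rel);
```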
Design for query-ability
Like with any database management system, but perhaps even more so for a graph database management system such as Neo4j, your queries will drive your model. What we mean by this is that, exactly as with any type of database that you may have used in the past or are still using today, you will need to make specific design decisions based on specific trade-offs. Therefore, it follows that there is no one perfect way to model in a graph database such as Neo4j. It will all depend on the questions that you want to ask of the data, and this will drive your design and your model.
Therefore, my number one and undoubtedly the most important best practice for graph database modeling is to start with your user story. What does the user of the system that you are designing want to know from the data? An example of such a story could be something like this:
“As an employee, I want to know who in the company I work for has similar skills to me, so that we can exchange knowledge.”
This story tells me a little bit about the entities that I need to include in the model and about the connections that should exist between these entities. Different domain descriptions will add similar or different entities and connections, and will gradually complete your model.
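One possible Cypher translation of that user story, assuming a hypothetical Person/Skill model with HAS_SKILL relationships:

```cypher
// Who shares skills with Alice, and which skills do we have in common?
MATCH (me:Person {name: 'Alice'})-[:HAS_SKILL]->(s:Skill)<-[:HAS_SKILL]-(colleague:Person)
WHERE colleague <> me
RETURN colleague.name AS colleague, collect(s.name) AS sharedSkills;
```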
Granulate nodes
The typical graph modeling pattern that we will discuss in this section will be called the granulate pattern. This means that in graph database modeling, we will tend to have much more fine-grained data models with a higher level of “granularity” than we would be used to having in a relational model.
In the relational world, the reality of this process is that we create smaller and smaller table structures until we reach the third normal form. This is a convention that the IT industry seems to have agreed on: a database is considered to have been normalized as soon as it achieves the third normal form. Visit en.wikipedia.org/wiki/Database_normalization#Normal_forms for more details.
As we discussed before, this model can be quite expensive, as it effectively introduces the need for join tables and join operations at query time. Database administrators tend to denormalize the data for this very reason, which introduces data duplication — another very tricky problem to manage.
In graph database modeling, however, normalization is much cheaper for the simple reason that these infamous join operations are much easier to perform. This is why we see a clear tendency in graph models to create “thin” nodes and relationships, that is, nodes and relationships with few properties on them. These nodes and relationships are very granular and have been “granulated”.
Use in-graph indexes when appropriate
Related to this pattern (granulate nodes) is a typical question that we get asked, and ask ourselves, in every modeling session: should I keep this as a property, or should the property become its own node? For example, should we model the alcohol percentage of a beer as a property on the beer brand? The following diagram shows the model with the alcohol percentage as a property:
The alternative would be to split the alcohol percentage off as a different kind of node. The following diagram illustrates this:
Which one of these models is right? I would say both and neither. The real fundamental point here is that we should be looking at our queries to determine which version is appropriate (both variants are sketched in Cypher after the following list). In general, I would argue that:
- If we don’t need to evaluate the alcohol percentage during the course of a graph traversal, we are probably better off keeping it as a property of the end node of the traversal. After all, we keep our model a bit simpler when doing this, and everyone appreciates simplicity.
- If we need to evaluate the alcohol percentage of a particular (set of) beer brands during the course of our graph traversal, then splitting it off into its own node category is probably a good idea. Traversing through a node is often easier and faster than evaluating properties for each and every path.
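Here are both variants, sketched in Cypher; the labels, relationship type, and the Duvel example are illustrative:

```cypher
// Variant 1: the alcohol percentage as a plain property on the beer brand.
CREATE (:BeerBrand {name: 'Duvel', alcoholPercentage: 8.5});

// Variant 2: the alcohol percentage split off into its own node, so that
// traversals can pass through it as a waypoint.
CREATE (:BeerBrand {name: 'Duvel'})-[:HAS_ALCOHOL_PERCENTAGE]->(:AlcoholPercentage {value: 8.5});
```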
As we will see in the next paragraph, many people actually take this approach a step further by working with in-graph indexes.
Taking the granulate pattern even further and knowing that most indexing technologies actually use graphs/trees under the hood anyway, we can apply this pattern to create natural indexes for our data models, inside the graph. This can be very useful for specific types of query patterns: range queries, time series, proximity searches, and so on.
Looking at our previous little beer model, the alcohol percentages could be a prime candidate for these in-graph indexes. The idea here is that we connect all of the alcohol percentages to one another and create a linked list of alcohol percentages that we could follow upward or downward, for example, to find beer brands with similar alcohol percentages within a certain range. The model is displayed in the following diagram:
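A minimal sketch of that linked list, assuming AlcoholPercentage nodes as in the previous example and an invented NEXT relationship type:

```cypher
// Chain the percentage nodes into an ordered linked list.
MATCH (p6:AlcoholPercentage {value: 6.0}),
      (p7:AlcoholPercentage {value: 7.0}),
      (p85:AlcoholPercentage {value: 8.5})
CREATE (p6)-[:NEXT]->(p7)-[:NEXT]->(p85);

// Range query: walk the list upward from 6.0 (at most two steps) to find
// beer brands with similar alcohol percentages.
MATCH (:AlcoholPercentage {value: 6.0})-[:NEXT*1..2]->(p)<-[:HAS_ALCOHOL_PERCENTAGE]-(b:BeerBrand)
RETURN b.name, p.value;
```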
These types of index structures are very often used to deal with time data in a graph structure. In this case, we create more of a time tree instead of a time line and connect our graph data to this tree instead of putting timestamps as properties on every node or relationship. The following diagram contains an example of such a tree structure:
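A minimal time-tree sketch; the year/month/day labels and relationship types are invented for illustration:

```cypher
// A year -> month -> day hierarchy, with an event attached to a day node
// instead of carrying a timestamp property of its own.
CREATE (y:Year {value: 2014})-[:HAS_MONTH]->(m:Month {value: 6})-[:HAS_DAY]->(d:Day {value: 15})
CREATE (:Event {name: 'Tasting session'})-[:OCCURRED_ON]->(d);
```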
All of the patterns in the preceding diagrams are common modeling patterns that have been used successfully in projects. Use them wisely, and always remember that it’s all about the query patterns. Knowing the questions that you want to ask of the data will massively impact your model design, and chances are that you will need and want to iterate over that model multiple times to get it to a stable state. Fortunately, graph database management systems such as Neo4j deal with this kind of variation quite elegantly and allow you to do exactly this when appropriate.
Graph database modeling pitfalls
As with any type of database modeling, a graph database will also have some pitfalls that we would want to try and avoid. This section will by no means attempt to give you an exhaustive list, but should give you a feel for the types of practices that can lead to poor modeling decisions.
Using “rich” properties
As it is actually a best practice to have a very granular model in graph database systems, the opposite of this can often be an antipattern. Using properties on a node that are overly rich (as shown in the following figure) is a problem that can be much better solved by leveraging the model: splitting the properties off into separate nodes and connecting them using a relationship:
Look at the following diagram for a potential solution to this antipattern:
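As a hedged sketch of that refactoring, with an invented Address node and LIVES_AT relationship replacing a single rich address property:

```cypher
// Instead of one overly rich address string on the person, split the value
// into its own connected node with granular properties.
MATCH (p:Person {name: 'Alice'})
CREATE (p)-[:LIVES_AT]->(:Address {street: '1 Main Street', city: 'Springfield'});
```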
Nodes representing multiple concepts
Another antipattern that we have seen being introduced a number of times is that different concerns or concepts are not separated appropriately in a model. In the model represented in the following figure, we are mingling together the country concept, the language concept, and the currency concept in one node structure:
This should be a red flag, as it will inevitably present us with problems or limitations at query time. It will be far wiser to split off the concepts into separate country, language, and currency node structures, each with their own characteristics and properties. This is what we are trying to convey in the following corrected figure (as follows):
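A sketch of the corrected model in Cypher; the labels, relationship types, and the Belgium example are invented for illustration:

```cypher
// Country, language, and currency each get their own node structure,
// connected through explicit, specific relationships.
CREATE (c:Country {name: 'Belgium'})
CREATE (c)-[:HAS_OFFICIAL_LANGUAGE]->(:Language {name: 'Dutch'})
CREATE (c)-[:USES_CURRENCY]->(:Currency {code: 'EUR'});
```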
Unconnected graphs
The dense node pattern
We discussed before how graph queries work in a graph database system. The basic principle was that of a graph local query: starting at a (set of) well-defined starting point(s) and then crawling out from there. The traversal speed is typically very fast and only dependent on the number of parts of the graph that the query will touch or traverse through. Typically, traversal speeds are also very constant with growing dataset sizes, simply because in a normal graph model, everything is not connected to everything. The query will only evaluate the parts of the graph that are connected to the starting points, and therefore, the traversal speeds will be flat.
A very interesting problem then occurs in datasets where some parts of the graph are all connected to the same node. This node, also referred to as a dense node or a supernode, becomes a real problem for graph traversals because the graph database management system will have to evaluate all of the connected relationships to that node in order to determine what the next step will be in the graph traversal. Supernodes can be a real issue in graph database models and should be avoided in your data model when setting up your Neo4j instances.
Different strategies exist to deal with this density problem, and Neo Technology is making some important changes to Neo4j to make this problem easier to deal with, but the fact of the matter is that on most occasions we will want to avoid hitting this technical risk by adapting our model. The following diagram highlights one potential strategy that you can use from the music artist/fan graph domain:
As you can see from the preceding diagram, the strategy essentially consists of “fanning out” the dense relationships to Lady Gaga and Madonna. Instead of having direct connections from millions of fans to these two popular artists, we create a fan-like structure that connects fans to metanodes, interconnects the metanodes, and then finally connects the top of the metanode hierarchy to the actual artist. The recommendation then becomes that every metanode should have approximately 100 DENSE_LIKES relationships to connect fans to artists and that you can very quickly traverse these relationships to figure out whether there is a variable-length path to connect fans to artists.
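A hedged sketch of that fan-out structure; DENSE_LIKES is the relationship type named above, while the labels and the two-level metanode chain are illustrative:

```cypher
// Fans attach to metanodes; metanodes chain up toward the artist, so no
// single node ever has to carry millions of direct relationships.
CREATE (artist:Artist {name: 'Lady Gaga'})
CREATE (m1:MetaNode)-[:DENSE_LIKES]->(artist)
CREATE (m2:MetaNode)-[:DENSE_LIKES]->(m1)
CREATE (:Fan {name: 'Fan 1'})-[:DENSE_LIKES]->(m2);

// Any fan still reaches the artist through a variable-length path.
MATCH (f:Fan)-[:DENSE_LIKES*]->(a:Artist {name: 'Lady Gaga'})
RETURN f.name;
```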