Graph theory overview
Excerpt from the book Learning Neo4j
The origin of graph theory: Euler used a mathematical model to solve the Seven Bridges problem
What are graphs? To explain this, I think it is useful to put a little historical context around the concept. Graphs are actually quite old as a concept. They were invented, or at least first described, in an academic paper by the well-known Swiss mathematician Leonhard Euler. He was trying to solve an age-old problem that we now know as the Seven Bridges of Königsberg. The problem at hand was pretty simple to understand.
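For readers who have not met the problem: in its standard formulation, Königsberg had four land masses joined by seven bridges, and the question was whether a walk could cross every bridge exactly once. Euler's observation was that such a walk can only exist when zero or two land masses have an odd number of bridges. The sketch below checks that condition; the labels A to D and the function names are illustrative, not taken from the book.

```python
from collections import Counter

# The seven bridges, each written as a pair of land masses.
# A = central island, B = north bank, C = south bank, D = eastern island.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]

def odd_degree_nodes(edges):
    """Return the nodes that are touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)

def has_euler_walk(edges):
    """True if a walk using every edge exactly once can exist.

    Assumes the graph is connected; only the degree condition is checked.
    """
    return len(odd_degree_nodes(edges)) in (0, 2)

print(odd_degree_nodes(BRIDGES))  # → ['A', 'B', 'C', 'D']
print(has_euler_walk(BRIDGES))    # → False
```

All four land masses have an odd number of bridges, so no such walk exists.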
The application of graph theory
Graph theory is used in the social sciences
For the longest time, people have understood that the way humans interact with one another is actually very easy to describe as a network. People interact with people every day. People influence one another every day. People exchange ideas every day. As they do, these interactions cause ripple effects through the social environment that they inhabit. Modeling these interactions as a graph has been of primary importance to better understand global demographics, political movements, and, last but not least, the commercial adoption of certain products by certain groups. With the advent of online social networks, this graph-based approach to social understanding has taken a whole new direction. Companies such as Google, Facebook, Twitter, LinkedIn (see the following diagram featuring a visualization of my LinkedIn network), and many others have undertaken very specific efforts to include graph-based systems in the way they target their customers and users, and in doing so, they have changed many of our daily lives quite fundamentally.
Graph theory is used in biology
We sometimes say it in marketing taglines: “Graphs Are Everywhere”. When we do so, we are actually describing reality in a very real and fascinating way. Also, in this field, researchers have known for quite some time that biological components (proteins, molecules, genes, and so on) and their interactions can accurately be modeled and described by means of a graph structure, and doing so yields many practical advantages. In metabolic pathways (see the following diagram for the human metabolic system), for example, graphs can help us to understand how the different parts of the human body interact with each other. In metaproteomics, researchers analyze how different kinds of proteins interact with one another and are used in order to better steer chemical and biological production processes.
Graph theory is used in computer science
Some of the earliest computers were built with graphs in mind. Graph Compute Engines solved scheduling problems for railroads as early as the late 19th century, and the usage of graphs in computer science has only accelerated since then. In today’s applications, the use cases vary from chip design, network management, recommendation systems, and UML modeling to algorithm generation and dependency analysis. The following is an example of such a UML diagram:
Graph theory deals with flow problems
Another really interesting field of graph theory applications is flow problems, also known as maximum flow problems. In essence, this field is part of a larger field of optimization problems, which is trying to establish the best possible path across a flow network. Flow networks are a type of graph in which the nodes/vertices of the graph are connected by relationships/edges that specify the capacity of that particular relationship. Examples can be found in fields such as telecom networks, gas networks, airline networks, package delivery networks, and many others, where graph-based models are then used in combination with complex algorithms.
These algorithms are then used to identify the calculated optimal path, find bottlenecks, plan maintenance activities, conduct long-term capacity planning, and many other operations.
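To make the idea of a flow network concrete, here is a minimal sketch of one well-known maximum flow algorithm, Edmonds-Karp, which repeatedly pushes flow along the shortest path that still has spare capacity. The book does not name a specific algorithm, so this choice, the function name `max_flow`, and the small example network are all assumptions made for illustration.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow (assumes source != sink).

    capacity: dict mapping node -> {neighbor: capacity of that edge}.
    """
    # Residual capacities, including zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, neighbors in capacity.items():
        for v, cap in neighbors.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck on the path, then update the residual graph.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Hypothetical gas network: capacities are made-up units per hour.
network = {
    "plant": {"hub1": 10, "hub2": 5},
    "hub1": {"hub2": 15, "city": 5},
    "hub2": {"city": 10},
}
print(max_flow(network, "plant", "city"))  # → 15
```

In this example the two pipes leaving the plant are the bottleneck: together they carry at most 15 units, whatever the capacity further downstream.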
Graph theory is used for routing problems
The original problem that Euler set out to solve in 18th century Königsberg was in fact a route planning/pathfinding problem. Today, many graph applications leverage the extraordinary capability of graphs and graph algorithms to calculate, as opposed to finding by trial and error, the optimal route between two nodes on a network. In the following diagram, you will find a simple route planning example as a graph:
A very simple example comes from the domain of logistics. When trying to plan the best way to get a package from one city to another, one will need the following:
- A list of all routes available between the cities
- The optimal route among those available, which depends on various parameters in the network, such as capacity, distance, cost, CO2 exhaust, speed, and so on
This type of operation is a very nice use case for graph algorithms. There are a couple of very well-known algorithms that we can briefly highlight:
- The Dijkstra algorithm: This is one of the best-known algorithms to calculate the shortest weighted path between two points in a graph, using the properties of the edges as weights or costs of that link.
- The A* (A-star) algorithm: This is a variation on Dijkstra’s original idea that uses heuristics to predict more efficiently which paths are worth exploring. As A* explores potential graph paths, it holds a sorted priority queue of alternate path segments along the way, since it calculates the “past path” cost and the “future path” cost of the different options that are possible during the route exploration.
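The two algorithms above can be sketched with a single function. With a heuristic that always returns zero, the search below behaves as Dijkstra’s algorithm. Passing an estimate of the remaining ("future path") cost that never overestimates turns it into A*. The function name, the graph layout, and the example weights are assumptions made for illustration.

```python
import heapq

def shortest_path(graph, start, goal, heuristic=lambda node: 0):
    """Return (cost, path) from start to goal, or (inf, []) if unreachable.

    graph: dict mapping node -> {neighbor: non-negative edge weight}.
    heuristic: estimated remaining cost to the goal; must never overestimate.
    """
    best = {start: 0}        # cheapest known "past path" cost per node
    parent = {start: None}
    # Queue entries: (past cost + estimated future cost, past cost, node).
    queue = [(heuristic(start), 0, start)]
    while queue:
        _, cost, node = heapq.heappop(queue)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return cost, path[::-1]
        if cost > best[node]:
            continue  # stale queue entry; a cheaper route was found already
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(
                    queue, (new_cost + heuristic(neighbor), new_cost, neighbor)
                )
    return float("inf"), []

# Hypothetical route network; weights could be distance, cost, or time.
routes = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 5},
    "C": {"B": 1, "D": 8},
    "D": {},
}
print(shortest_path(routes, "A", "D"))  # → (8, ['A', 'C', 'B', 'D'])

# A* with made-up estimates of the remaining cost to "D".
estimate = {"A": 7, "B": 5, "C": 6, "D": 0}
print(shortest_path(routes, "A", "D", heuristic=estimate.get))  # same result
```

Both calls return the same route. The heuristic only changes the order in which candidate paths are explored, which matters on much larger graphs.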
Graph theory is used for web search
Older search tools did keyword matching on web pages, but Google revolutionized this by no longer focusing on keywords alone, but by doing link analysis on the hyperlinks between different web pages. PageRank, like many of the other algorithms that Google uses today, assumes that more important web pages, which should appear higher in your search results, will have more incoming links from other pages, and therefore it is able to score these pages by analyzing the graph of links to the web page. History has shown us the importance of PageRank. Not only has Google, Inc. built quite an empire on top of this graph algorithm, but its principles have also been applied to other fields such as cancer research and chemical reactions.
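The link-analysis idea can be illustrated with a simplified PageRank computed by repeated iteration. This is a textbook-style sketch, not Google's production algorithm. The damping factor of 0.85, the iteration count, and the tiny example web are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links: dict mapping page -> list of pages it links to.
    Returns a dict mapping page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical four-page site: every page links back to "home".
web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home"],
    "contact": ["home"],
}
scores = pagerank(web)
print(max(scores, key=scores.get))  # → home
```

The page with the most incoming links ends up with the highest score, while "contact", which nothing links to, keeps only the baseline.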
Three phases of database management systems
We can establish the following three major phases in the half century that database management systems have been under development:
- Navigational databases
- Relational databases
- NoSQL databases
Navigational databases
A navigational database is a type of database characterized by the fact that objects (or records) in it are found primarily by following references from other objects. Traditionally, navigational interfaces are procedural, though one could characterize some modern systems like XPath as being simultaneously navigational and declarative. Navigational access is traditionally associated with the network model and hierarchical model of database interfaces, and some have even acquired set-oriented features. Navigational techniques use “pointers” and “paths” to navigate among data records. This is in contrast to the relational model (implemented in relational databases), which strives to use “declarative” or logic programming techniques in which you ask the system for what you want instead of how to navigate to it.
These diagrams were the starting points for database management systems that used either networks or hierarchies as the basic structure for their data. Both the network databases and the hierarchical database systems were built on the premise that data elements would be linked together by pointers.
A closed chain of records in a navigational database model (such as CODASYL), with next pointers, prior pointers, and direct pointers provided by the keys in the various records.
A record set, the basic structure of navigational (e.g., CODASYL) database models. A set consists of one parent record (also known as the “owner”) and n child records (also known as member records).
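The two captions above describe a ring of records held together by pointers. A minimal sketch of that structure follows, assuming a hypothetical `Record` class with `next`, `prior`, and `owner` pointers. It is an illustration of the idea, not the API of any actual CODASYL system.

```python
class Record:
    """A record in a hypothetical CODASYL-style set (names are illustrative)."""

    def __init__(self, key, data):
        self.key = key
        self.data = data
        self.next = self    # next record in the closed chain
        self.prior = self   # previous record in the closed chain
        self.owner = self   # direct pointer to the owner (parent) record

def connect(owner, member):
    """Insert a member record at the end of the owner's chain."""
    last = owner.prior
    last.next = member
    member.prior = last
    member.next = owner
    owner.prior = member
    member.owner = owner

def members(owner):
    """Navigate imperatively: follow next pointers until back at the owner."""
    current = owner.next
    while current is not owner:
        yield current
        current = current.next

department = Record("D1", "Sales")
for key, name in [("E1", "Ann"), ("E2", "Bob"), ("E3", "Cy")]:
    connect(department, Record(key, name))

names = [record.data for record in members(department)]
print(names)  # → ['Ann', 'Bob', 'Cy']
```

Note that the caller has to spell out how to walk the chain. This is the imperative style that the relational model later replaced with declarative queries.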
Navigational databases eventually gave way to a new generation of databases, the Relational Database Management Systems. Many reasons have been attributed to this shift, some technical and some commercial, but the main two reasons that seem to enjoy agreement across the industry are:
- CODASYL is widely regarded as something that can only be worked with or understood by absolute experts, as we partly experienced in 1999, when the Y2K problem required many CODASYL experts to work overtime to migrate their systems into the new millennium.
- The lack of a declarative query mechanism: Navigational database management systems inherently provide a very imperative approach to finding data. The user would have to tell the database what to do instead of just being able to ask a question and having the database provide the answer.
Relational databases
I think it is safe to say that Relational Database Management Systems have served our industry extremely well in the past 30 years, and will probably continue to do so for a very long time to come. However, they also came with a couple of issues, which are interesting to point out as they will (again) set the stage for another generation of database management systems:
- Relational Database Systems suffer at scale. As the sets or tables of the relational systems grow longer, the query response times of the relational database systems generally get worse. Much worse. For most use cases, this was and is not necessarily a problem, but, as we all know, size does matter, and this deficiency certainly does harm the relational model.
- Relational Databases are quite “anti-relational”. As the domains of our applications, and therefore the relational models that represent those domains, become more complex, relational systems really start to become very difficult to work with. More specifically, join operations, where users ask queries of the database that pull data from a number of different sets/tables, are extremely complicated and resource intensive for the database management system. There is a true limit to the number of join operations that such a system can effectively perform, before the join bombs go off and the system becomes very unresponsive.
- Relational databases impose a schema even before we put any data into the database, even when that schema turns out to be too rigid. Many of us work in domains where it is very difficult to apply a single database schema to all the elements of the domain that we are working with. Increasingly, we are seeing the need for a flexible type of schema that would cater to a more iterative, more agile way of developing software.
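To illustrate the point about join operations, here is a small sketch using Python's built-in `sqlite3` module. The `person` and `knows` tables and their contents are made up for this example. Finding friends of friends already takes three joins, and each additional hop adds another.

```python
import sqlite3

# Hypothetical schema: people and a "knows" link table between them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE knows (from_id INTEGER, to_id INTEGER);
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Dee');
    INSERT INTO knows VALUES (1, 2), (2, 3), (3, 4);
""")

# Two hops away from Ann: the link table is joined once per hop,
# plus a final join to turn the resulting id back into a name.
rows = conn.execute("""
    SELECT p3.name
    FROM person p1
    JOIN knows k1 ON k1.from_id = p1.id
    JOIN knows k2 ON k2.from_id = k1.to_id
    JOIN person p3 ON p3.id = k2.to_id
    WHERE p1.name = 'Ann'
""").fetchall()

friends_of_friends = [name for (name,) in rows]
print(friends_of_friends)  # → ['Cy']
conn.close()
```

On four rows this is instant. The concern raised above is what happens to queries of this shape as the tables grow and the number of hops increases.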
NoSQL databases
We can basically categorize NoSQL databases into four different categories:
- Key-Value stores (Redis)
- Column-Family stores (HBase)
- Document stores (MongoDB)
- Graph databases (Neo4j)
A comparison of the four classifications
Excerpt from the book The Definitive Guide to Neo4j
Key-Value stores
Column-Family stores
Document stores
Relational crossroads
On one side of the crossroads are the aggregate stores. These are the Key-Value-, Column-Family-, and Document-oriented databases, as they all share a number of characteristics:
Graph databases
- Directed graphs: The links between nodes (also known as the relationships) have a direction.
- Multirelational graphs: There can be multiple relationships between the same two nodes.
- Key-value properties: Nodes and relationships store key-value pairs as their properties.
Let’s investigate this model in a bit more detail. When looking closer at this, we find the following interesting aspects of this model:
- There is no fixed schema. The database, in and of itself, does not impose that you have to have a schema, although most software professionals will agree that having some kind of schema as you move closer to production is probably not a bad idea.
- Partly because of the schema-less nature of the database, it seems to be a very nice fit for dealing with semi-structured data. If one node or relationship has more or fewer properties, we do not have to alter the design for this; we can just deal with that difference in structure automatically and work with it in exactly the same way.
- Nodes and node properties seem to be quite easy to understand. In relational terms, one can easily compare nodes with records in a table. It’s as if the property graph contains lots and lots of single-row tables, that is, the nodes of the graph. Nodes will have properties just like records/rows in a table will have fields/columns.
- Relationships are a bit different. They always have a start point and an end point, and therefore a direction. They cannot be dangling, but they can be self-referencing (the same node as start point and end point). But the real power lies in the fact that:
- Relationships are explicit: They are not inferred by some kind of constraint or established at query time through a join operation. They are equal citizens in the database; they have the same expressive power as the nodes representing the entities in the database.
- Relationships can have properties too: They can have values associated with them that can specify the length, capacity, or any other characteristic of that relationship. This is terribly important, and very different from anything we know from the relational world.
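The aspects listed above can be summarized in a small in-memory sketch of a property graph. The class and method names here are hypothetical and are not Neo4j's API. The sketch only shows nodes and relationships that carry key-value properties, with directed, explicit, and possibly multiple relationships between the same two nodes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # No fixed schema: each node carries whatever properties it needs.
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Node          # a relationship always has a start point...
    end: Node            # ...and an end point, so it can never dangle
    rel_type: str
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    def __init__(self):
        self.nodes = []
        self.relationships = []

    def add_node(self, **properties):
        node = Node(properties)
        self.nodes.append(node)
        return node

    def relate(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node, rel_type=None):
        """Relationships that start at this node, optionally filtered by type."""
        return [
            r for r in self.relationships
            if r.start is node and (rel_type is None or r.rel_type == rel_type)
        ]

graph = PropertyGraph()
depot_a = graph.add_node(name="Depot A")
depot_b = graph.add_node(name="Depot B", manager="Ann")  # extra property, no schema change

# Two relationships between the same two nodes, each with its own properties.
graph.relate(depot_a, depot_b, "ROAD", distance_km=40)
graph.relate(depot_a, depot_b, "RAIL", distance_km=55)
# A self-referencing relationship is allowed as well.
graph.relate(depot_a, depot_a, "SERVICES")

print(len(graph.outgoing(depot_a)))                               # → 3
print(graph.outgoing(depot_a, "ROAD")[0].properties["distance_km"])  # → 40
```

Because relationships are stored as first-class objects, following them is a direct lookup rather than something reconstructed at query time through a join.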
This is the first article in the Neo4j series. Links in the series:
Neo4j learning: graph theory explanation