How to store and retrieve large-scale atlas data efficiently?

Abstract: This paper briefly introduces the storage and retrieval of knowledge graph.

This article is shared by JuTzungKuei from storage and Retrieval of Knowledge Graph in Huawei cloud community.

1, an overview of the

Background: With the development and popularization of the Internet, an interconnected world is taking shape. At the same time, data has exploded exponentially, and we are in a new era of raging digital floods.

How much data do we produce every day? According to statistics, every day:

Sending 500 million tweets to a blog;
294 billion emails sent;
Five billion searches are made online every day around the world;
A connected car generates four terabytes of data;
Facebook generates 4 petabytes of data a day, containing 350 million photos and 100 million hours of video.

As more and more knowledge is accumulated, the common knowledge graph is formed in the form of triple data.

DBpedia has nearly 80 million triples;
YAGO has over 120 million triples;
Wikidata has nearly 410 million triples;
Freebase has more than 3 billion triples;
The Chinese Encyclopedia has about 140 million triples.

So, how can we efficiently store and retrieve large-scale atlas data?

Knowledge graph is a directed graph structure that describes entities, events or concepts in the real world and their relationships. Nodes in the directed graph represent entities, events or concepts, and edges in the graph represent relationships between adjacent nodes.

The figure shows a partial schematic of the knowledge graph about Andy Lau. Red text represents concepts, rectangular boxes represent entities, blue text represents attributes, ellipses represent attribute values, and orange text represents relationships.

Concept: character, country, film, etc
Entities: Andy Lau, Carol Chu, China, A World without Thieves, etc
Attributes: height, weight, gender, capital, abbreviation, release time, Douban rating, etc
Relationships: wife, daughter, nationality, leading actor, etc

2. Storage of knowledge graph

Knowledge in knowledge graph is represented by RDF structure, and its basic constituent unit is fact.

Each fact is a triplet: < subject S, predicate P, object O>, where:

Subject S: Can be an entity, an event, a concept
Predicate P: can be relation, attribute
Object O: Can be an entity, an event, a concept, or a common value

A list of triples of knowledge representation in the knowledge graph is shown below.

<S, P, O>

“Andy Lau, birthday, September 27, 1961”

< Andy Lau, Blood Type, AB >

Andy Lau, Wife and Carol Chu

“Andy Lau, daughter, Sheung Wai Lau”

Andy Lau, Nationality, China

Beijing, Capital, China

. .

In order to query and manage knowledge graph data efficiently, it is necessary to organize the data reasonably on storage media. According to different storage methods, standard knowledge storage methods can be divided into storage based on table structure and storage based on graph structure.

2.1 table structure-based storage

Table structure-based storage uses two-dimensional data tables to store the data in the knowledge graph. According to different design principles, the knowledge graph can have different table structures, which can be divided into five categories: triplet table, attribute table, horizontal table, vertical table and full index.

2.1.1 Triplet table

Facts in the knowledge graph are triples one by one. A simple and direct storage method is to design a table to store all facts in the knowledge graph, that is, to build a table with three columns in the relational database, and the pattern of the table is: < subject, predicate, object >. Each triplet in the knowledge graph is stored as a row record in the triplet table.

This storage method is simple and straightforward and easy to understand. However, storing the entire knowledge graph in a single table will result in a large scale of a single table, and there will be very large overhead in complex queries or additions, deletions, changes, and searches.

Solution representatives: RDF database system 3Store and Virtuoso

2.1.2 property sheet

Attribute table, also known as type table, that is, a table is constructed for each type, and instances of the same type are placed in the same table. Each column of the table represents an attribute of the class entity, and each row stores an instance of the class entity.

Although this storage mode overcomes the shortage of triplet table, it also causes new problems. A large number of data fields are repeated, and some data attribute values have empty values, which will cause redundant storage.

Solution representative: RDF triplet library Jena

figure

countries

The movie

2.1.3 level table

Each row record in the level table stores all the predicates and objects of a subject in a knowledge graph. In fact, the level table is equivalent to the adjacency list of the knowledge graph. The number of columns in the level table is the number of different predicates in the knowledge graph, and the number of rows is the number of different subjects in the knowledge graph.

In the real knowledge graph, the number of different predicates may be tens of thousands, exceeding the upper limit of the database. There are lots of null values.

Solution representative: DLDB, an early RDF database system

2.1.4 vertical table

Vertical table is a kind of to triple the predicate of the division of the dimension method, divided into several zhang will RDF knowledge map according to predicate contains only the subject and object two columns of the table, the table of the total number is the number of different predicate the knowledge map, that is, build a table for each type of predicate, in the table to store knowledge map by the predicate in the connection of subject and object value.

This method uses joins between different tables instead of self-join to avoid self-join operation. But it does not support query operations where the predicate is a variable very well.

Solution representative: SW-store

gender

starring

The capital,

2.1.5 full index

Full index, also known as six-fold index, is an optimization technology proposed for the characteristics of the data and operation of the knowledge graph, which uses the characteristics of the triplet of the knowledge graph to build the index. Enumerate the various permutations of the subject, predicate, and object in the triplet, and index them one by one. There are six permutations of subject, predicate and object. These indexes correspond to all kinds of possibilities of triplet pattern with variables in knowledge graph operation, which is a typical “space for time” strategy.

This method not only alleviates the problem of single table self-join of triplet table, but also speeds up the query efficiency of atlas. But it also increases the cost of updating and maintenance.

Schema representatives: RDF-3X, Hexastore

Six tables: SPO, SOP, PSO, POS, OSP, OPS

2.2. Graph-based storage

Graph structure – based storage is to store the data in knowledge graph by graph. Considering entities as nodes and relations as labeled edges, the data of the knowledge graph naturally satisfies the structure of the graph model. The graph structure-based storage can directly and accurately reflect the internal structure of the knowledge graph. There are mainly two graph storage modes: adjacency list and adjacency matrix. The corresponding database is a graph database and the data model is an attribute graph.

2.2.1 Adjacency list

The so-called adjacency list is a list corresponding to each node (entity) in the knowledge graph, in which information related to the entity is stored. When using graph structure to manage knowledge graph data, a key problem is how to effectively prune query operations in index candidate space based on graph structure.

2.2.2 Adjacency matrix

The so-called adjacency matrix is to maintain multiple matrices of N x N dimensions in the computer, where N is the number of nodes in the knowledge graph. Each matrix corresponds to a predicate, in which each row or column corresponds to a node in the knowledge graph. If the ith row j is listed as 1 in the matrix corresponding to predicate P, it indicates that there is an edge with predicate P from the ith node to the JTH node in the knowledge graph.

Three-dimensional matrix M: | S | | | | O | x P x, says the number of subject-predicate bin respectively, if the < S, P, O > exists in knowledge map, then M [I] [j] [k] = 1, otherwise it is set to 0.

2.2.3 Figure database

Graph database is based on graph theory, which represents and stores data through nodes, edges and attributes. Specifically, graph database is based on directed graph, in which node, edge and attribute are the core concepts of graph database.

Node: Represents objects such as entities and events.
Edge: A directed line connecting nodes in a graph, used to show the relationship between different nodes.
Properties: Describes the properties of a node or edge.
Common graph databases: Neo4J, JanusGraph, OrientDB, etc.

3. Retrieval of knowledge graph

The knowledge of knowledge graph is actually stored by database systems, most of which provide users with interface to access data through formal query language.

3.1 SQL

Structured Query Language A Structured Query Language used to manage relational databases.

Four kinds of operation

Add: insert into table name (1, 2,…) Values (1, 2…)
Delete: delete from table name where condition
Select * from table_name where table_name = 1 where table_name = 1
Select * from (select 1, 2,… From table name where condition

3.2 the SPARQL

SPARQL is a query language and data acquisition protocol developed by W3C for RDF data, which is widely supported by graph databases.

Three operations:

Add: INSERT data triplet data
Delete: delete data Triplet data
Modification: none, combination of addition and deletion
Select variable 1, variable 2… Where figure pattern

select ? x, ? Y where {the world without thieves starring? X. Infernal Affairs? x . ? X’s birthday? y . }

Copy the code

3.3 Gremlim

Gremlin is a graph traversal language used in Apache Tinkerpop framework. Gremlin can be used to query graph data, modify graph, partial traversal and attribute filtering, etc.

Three kinds of operation

Add: G.a ddV (‘ characters’). The property (id, ‘007’). The property (‘ birthday ‘, ‘on June 22, 1962), g.a ddE (‘ husband’). The property (‘ XXX ‘, ‘yyy’).from(g.V(‘001’)).to(g.V(‘002’))
Delete: g.V (‘ 007 ‘). The drop ()
Check: g.V (.) hasLabel (‘ characters’), g.E (). The label (), g.V () valueMap ()

3.4 Cypher

Cypher is a descriptive graph query language that allows graphs to store expressive and efficient queries without having to write graph structure traversal code. Is a widely used declarative graph database query language.