On July 24, Ma Hongxu, an original developer of the Greenplum kernel, presented the fourth installment of the live broadcast series "Greenplum Kernel Revealed: The B-tree Index". The video has been uploaded to the Greenplum Chinese community channel on Bilibili; click here to watch. This article summarizes the essentials of the talk; comments and discussion are welcome.

Indexes are an important component of databases, and the B-tree is the most common index data structure as well as the default index type in Greenplum. Today I will give you a detailed introduction to B-tree indexes. This article covers B-tree basics, the B-tree storage structure, its operation algorithms, concurrency control, and the related system tables.

The basics of B trees

First of all, a little background on B-trees. Most of you learned about B-trees in a college data structures course. A common question is: why do databases so often use B-trees to implement their indexes? Is the B-tree index in Greenplum exactly the same as the one introduced in college courses, with exactly the same data structures and algorithms? Keep these questions in mind as you read this article and look for the answers.

01 Index

Before we start on B-tree indexes, let me briefly introduce indexes in general. The definitions you see on the Internet are often overly complex, so here is a simple one: an index is a data structure that speeds up a common operation. To make this easy to grasp, consider a dictionary. When we want to look up a word, we could read the dictionary page by page from beginning to end, which would obviously be very slow; if we instead use the index in the dictionary's appendix, the lookup becomes much faster.

Indexes include the B-tree indexes described in detail in this article, as well as hash indexes and inverted indexes. Hash indexes are quite common. Consider a very simple program that runs a data query and stores the result in a hash table; on the next access it first checks the hash table to see whether the data is already there, and if so, there is no need to run the query again. That hash table is effectively a hash index.
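
As a minimal illustration of this idea in database form (the table and column names here are made up, and depending on your Greenplum version explicit hash indexes may not be enabled), a hash index can be declared directly:

demo=# create table users (id integer, name text) distributed by (id);
demo=# create index users_name_hash_idx on users using hash (name);
-- a hash index supports only equality lookups (=), unlike a B-tree,
-- which also supports range predicates (<, <=, >=, >)
demo=# select * from users where name = 'alice';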

Inverted indexes are commonly used in full-text search. When we search with Baidu or Google, full-text search is at work, and the index behind it is an inverted index.

02 B tree

Having introduced indexes, let's take a look at B-trees. It is important to note that B-trees are actually a large family, so if you are writing technical articles or blog posts, be clear about which kind of B-tree you are referring to. B-trees can be subdivided into multiple subcategories. Compared with what we learned in the university data structures course, one structure from that course is easily confused with the B-tree: the balanced binary tree. Balanced binary trees and B-trees are both balanced, but a binary tree stores only one key per node, whereas a B-tree stores a large number of keys per node, so the tree does not grow too tall.

For details, see the typical B-tree in the figure above. Its nodes store many keys, arranged in order, such as 1, 2, 5, 7, 9, 12, 16, 18, 21 in the figure. Each key points to the target data.

The B+ tree is the most common subcategory of B-tree. The following figure shows a typical B+ tree. The defining characteristic of a B+ tree is that the leaf layer stores all the keys, which in turn point to the target data, such as 1, 2, 5, 9, 12, 18, 21 in the figure. Some keys are stored again in the internal nodes, but without data pointers. The leaf node layer also forms a linked list for forward traversal.

Here we can answer the first question posed at the beginning: why are B-trees so often used as the index structure of databases? Strictly speaking, it is a B+ tree. The B+ tree suits a database index because its main goal is to reduce disk I/O: each node corresponds to one page on disk, and accessing a node costs one disk I/O. We therefore want the tree to be very flat, i.e., of very small height, because the height of the tree is the number of disk I/Os per lookup.

Why a B+ tree specifically? Since B+ trees store no data or data pointers in their internal nodes, each node can hold more keys than a plain B-tree node can. With more keys per node, the tree becomes very flat and very low in height, requiring fewer disk I/Os. That is why B+ trees are usually chosen.
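
A rough calculation (the fanout figure is illustrative, not Greenplum's exact number) makes this concrete. If each 8 KB page holds about f = 500 keys, a tree of height h can address roughly f^h keys:

f^{h} = 500^{3} = 1.25 \times 10^{8} \quad (h = 3)

So any row among roughly 125 million can be reached in about three page reads.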

Another reason is that we often need range scans. In the figure above, to find the data in the range 2 to 9, once we have located 2 we can follow the leaf level to the right to find 5 and 9, because each leaf page holds a pointer to its right neighbor. There is no need to start again from the root; we simply move right.

B+ trees are also the default index type in Greenplum. Greenplum is based on PostgreSQL, with many improvements on top of it; since PostgreSQL is a traditional database, it also uses B+ trees. In the following example, it takes about 20 seconds to find an id equal to 10,000 in a large table. Then we create an index; note that no index type is specified, so the default B-tree is used. Running the query again takes about 200 milliseconds, a 100-fold improvement. As you can see, using indexes can greatly improve query performance.

demo=# select * from big where id=10000;
...
Time: 19490.566 ms

demo=# create index on big(id);
CREATE INDEX
demo=# analyze big;
ANALYZE

demo=# select * from big where id=10000;
...
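
To confirm the index is really being used, you can check the query plan (a sketch: the exact plan text varies by version, and big_id_idx is the auto-generated default index name):

demo=# explain select * from big where id=10000;
-- expect the plan to use big_id_idx (an Index Scan or Bitmap Heap Scan)
-- instead of the Seq Scan chosen before the index existed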

03 Blink tree

So is the B-tree in Greenplum just this B+ tree? The answer is yes and no. The B+ tree variant used in Greenplum is called a Blink tree.

Before introducing the Blink tree in detail, I should add that there are two aspects of the B-tree (or B+ tree) we covered in college courses that have not yet been mentioned. The first is that the B-tree needs to support crash recovery. When a database goes down, you do not want your data to become unrecoverable; if the database suddenly restarted and the index came back broken, it would be a huge headache. Changes to node pages and to the tree structure (such as page splits) are therefore recorded in the WAL (redo log), so the index can be recovered by replaying WAL. For more on recovery, check out the other talks in our kernel live series.

The second aspect is that b-trees need to provide a high level of concurrency control, because the database in a production environment needs to serve a large number of concurrent accesses.

For these two reasons, Greenplum uses the Blink tree. Blink trees come from the 1981 paper Efficient Locking for Concurrent Operations on B-Trees by Lehman and Yao. The implementation in Greenplum follows this paper with some improvements, adding right sibling pointers and high keys to the basic B+ tree structure. The right sibling pointer allows traversal to move rightward within a level; besides the leaf layer, the internal node layers also have right sibling pointers, as in the middle layer of the figure below. The high key is an upper bound on the keys in the current node, i.e., the maximum key in the subtree rooted at the current node.

Now that we have covered the basics of B-trees, in the rest of this article we will not subdivide B-tree types: "B-tree" in the context of Greenplum refers specifically to this variant, the Blink tree. Let's now look at the storage structure of the B-tree.

Storage structure

01 Physical Storage

Let's start with index storage at a coarse granularity. Indexes in Greenplum are secondary indexes: physically they are independent files, separate from the table's data files. Because table data is distributed across segments, each segment stores the index entries for its own local data.

If you want to see where the index files live, you can use the SQL query below to first find the file name. On my machine, running the find command in the Greenplum data directory with the file name 16524 turns up two files, because my test environment has 2 segments. Try it yourself if you are interested.

demo=# select relname, relfilenode, gp_segment_id from gp_dist_random('pg_class') where relname = 'student_id_idx';
    relname     | relfilenode | gp_segment_id
----------------+-------------+---------------
 student_id_idx |       16524 |             1
 student_id_idx |       16524 |             0
(2 rows)

$ find . -name 16524 | xargs ls -l
-rw------- 1 interma staff 1212416 Mar 17 10:44 ./gpseg0/base/16384/16524
-rw------- 1 interma staff 1179648 Mar 17 10:44 ./gpseg1/base/16384/16524

02 Logical Structure

Next, the logical structure of the index. Take a look at the figure below: the lower part is the data portion of the heap table; our focus is the index data above it.

From top to bottom: the Meta page stores meta information for the entire tree, including pointers to the root and to the "fast root". As data is inserted into and deleted from a B-tree, an upper node may end up with only a single pointer down to the next level. A search would then still descend from the root through layers that each offer only one way down; there is no need to pass through them at all. Searches can instead start directly from the fast root, which acts as an accelerator. The pink entries in each node are the high keys.

In addition, the paper only requires a right sibling pointer for the Blink tree to work, but Greenplum's pointers are bi-directional, which also makes reverse traversal convenient.

That's the tree as a whole; now let's look at an individual node, in the bottom-left of the figure. A single node contains n-1 keys and n pointers, that is, one more pointer than keys, plus a high key.

Some readers may ask: is there any version information in the index? The answer is no. This means that, in principle, every version of a data tuple in Greenplum gets a corresponding index tuple. To reduce this index space overhead, Greenplum borrows HOT (Heap Only Tuple) technology, which is not covered further here; if you are interested, search online for more information.

03 Physical structure of index nodes

Each index node corresponds to a physical page, and the page structure is basically the same as that of a heap table, with a few differences. A heap page stores data tuples, while an index page stores no data: the contents of an index tuple are, logically, a key and a pointer.

It is worth noting that line pointer 0 points to the first index tuple, which holds the high key; the normal index tuples follow it. The special space stores page-level meta information, including the sibling pointers, the page type, and so on.

04 Pageinspect sample

Having introduced the logical and physical structure, you may wonder: how can we see these contents more directly? There is an extension of Greenplum/PostgreSQL for exactly this, called pageinspect (https://www.postgresql.org/docs/9.5/pageinspect.html), which lets you view the contents of pages. Besides viewing data tuples, i.e., heap table contents, it provides many functions for viewing index contents. The following three functions are the most commonly used for B-tree indexes:

  • bt_metap returns information about a B-tree index’s metapage

  • bt_page_stats returns summary information about single pages of B-tree indexes

  • bt_page_items returns detailed information about all of the items on a B-tree index page

The first, bt_metap, is used to look at the B-tree's meta page. bt_page_stats shows a page's special-space metadata, while bt_page_items returns information about the index tuples on a page.

First, some setup, shown below: create a test table, insert 10,000 rows in random order, create an index on column a, and install the pageinspect extension. Because Greenplum is an MPP database, the master holds no data, so to run the inspection queries you need to connect directly to a segment using utility mode.

demo=# create table test_index (a integer, b text) distributed by (a);
demo=# insert into test_index(a,b) select s.id, chr((32+random()*94)::integer) from generate_series(1,10000) as s(id) order by random();
demo=# create index on test_index(a);
demo=# create extension pageinspect;

$ PGOPTIONS='-c gp_session_role=utility' psql -p 6000 demo

Let’s look at the meta page first. From the data in the query, we can see:

  1. This index has a root level of 1 (leaf node level of 0), which means the height of the tree is 2.

  2. The fast root node and the root node are the same node (block number is the same, both are 3).

If you have looked at the Greenplum code, you will notice that this information corresponds to the BTMetaPageData structure in the code.
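
Here is a sketch of the call itself, run in the utility-mode segment session from above (the index name test_index_a_idx is the auto-generated default, and the output values in the comments are illustrative but consistent with the description above):

demo=# select * from bt_metap('test_index_a_idx');
-- e.g. magic=340322, version=2, root=3, level=1, fastroot=3, fastlevel=1
-- level=1 means the root is one level above the leaves (tree height 2),
-- and root = fastroot = block 3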

Next, let's look at what an internal node looks like. bt_page_stats corresponds to the special space in the page, i.e., the BTPageOpaqueData structure in the code. Among its fields, btpo indicates the level number; the root node here is at level 1. bt_page_items shows the index tuples on the page; note that the first index tuple of an internal node has an empty key. As mentioned above, a node has n-1 keys but n pointers, and since there is one key fewer, by convention it is the first key that is left blank.
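
A sketch of inspecting the root page (block 3, per the meta page above; the observations are summarized in the comments):

demo=# select * from bt_page_stats('test_index_a_idx', 3);
-- type='r' (root), btpo=1 (level 1), live_items = the number of downlinks
demo=# select itemoffset, ctid, itemlen, data from bt_page_items('test_index_a_idx', 3);
-- itemoffset 1 has an empty data field: the leftmost downlink carries no key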

Then we look at the leaf nodes. The leaf node types are all 'l', indicating a leaf page, and btpo equals 0, meaning level 0. Total live_items: 1473*3+597=5016 index tuples, with 4984 index tuples on the other segment, which matches the table's 10,000 rows in total. The first three leaf pages are filled to about (1-3264/32768)=90%, which is the default fill factor for B-tree leaf nodes (internal nodes use 70%). If a page were filled completely, it would split very easily, so some space is deliberately left free, making splits less frequent and improving performance. The btpo_prev and btpo_next fields give the left-to-right order of the leaf pages: 1 <-> 2 <-> 4 <-> 5.

After the page meta information, let's look at the page's index tuples, the leaf items. The first entry of the leaf node's items has ctid=(1,628). As shown in the earlier B-tree logical-structure figure, the first key stored on each page is the high key; the keys actually stored in the leaf start from the second item. The high key here is also the first normal key on the second page (this is no coincidence; the details of B-tree construction are covered later). The second key of the first page is 2, which is also the minimum value in the B-tree index.
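
And a sketch for a leaf page (block 1 here; the values noted in the comments are illustrative):

demo=# select * from bt_page_stats('test_index_a_idx', 1);
-- type='l' (leaf), btpo=0 (level 0); btpo_prev/btpo_next link the sibling leaves
demo=# select itemoffset, ctid, data from bt_page_items('test_index_a_idx', 1) limit 3;
-- itemoffset 1 is the high key; the real leaf keys (with their heap ctids)
-- start at itemoffset 2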

Operating algorithm

Having introduced the B-tree structure, let's look at the B-tree's operation algorithms. We will not walk through them at the code level; instead we focus on the algorithms for insert, delete, update, and search. And don't worry about concurrency here: concurrency control is covered in the next section.

01 Build

Creating a B-tree is triggered by CREATE INDEX, which builds a B-tree index from the data already in the table. The algorithm has two stages. The first stage is sorting: order the tuples in the table. Compare this with simply inserting the keys one by one, where every insertion requires a descent from root to leaf, which is costly; sorting first is far more efficient. Sorting first also allows unique-index violations to be detected up front.

The second stage traverses the sorted data tuples and builds the whole B-tree bottom-up. When the current node's page fills up (in practice some free space is left, per the fill factor), a high key is generated for it, a new right sibling node is started, and a key is inserted into the parent. Finally the B-tree is completed from bottom to top and the meta information is filled in.

02 Insert

The insert algorithm is triggered by the INSERT statement: first the data tuple is inserted, then the index tuple is inserted as follows:

Start by looking down from the root node, and the goal is to locate the leaf node to insert the index tuple into. Record the path from the root node to the leaf node for subsequent reverse insertion into the parent node. Then locate the exact offset to be inserted in the leaf node and insert the index tuple into the leaf node. If the leaf node is full, the node needs to be split and a new key value inserted into the parent, recursively up the lookup path (potentially causing the parent node to continue splitting).

03 Search and Delete

Now for the search and delete algorithms. The most common way to use an index is the plain index scan, which closely mirrors the descent in our insert algorithm: follow a path down to the leaf node, then locate the offset within the leaf, and the tuple is found.

The index scan ultimately yields a pointer through which the data is accessed. But following such a pointer is a random I/O, i.e., a random access. So while index lookups are fast, they perform poorly compared with sequential I/O, and with a lot of random I/O, performance is hard to improve. Hence the bitmap index scan: it first collects the matching addresses, sorts them, and turns random I/O into sequential I/O as much as possible. See this article for details on bitmap index scans.
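
A sketch of how this choice shows up in a plan (the plan shape is typical, not verbatim output):

demo=# explain select * from big where id between 1000 and 200000;
--   Bitmap Heap Scan on big
--     Recheck Cond: ((id >= 1000) AND (id <= 200000))
--     ->  Bitmap Index Scan on big_id_idx
--           Index Cond: ((id >= 1000) AND (id <= 200000))
-- the bitmap step sorts the matching heap addresses first, so the heap
-- is read mostly sequentially instead of one random page per match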

When a data tuple is deleted, you might wonder whether its index tuple is deleted too. The answer is no. Index deletion is not triggered by the DELETE statement; rather, when an index scan finds a dead tuple, the corresponding index tuple is only "marked deleted". The real cleanup happens in VACUUM, which is also a two-stage algorithm: first remove the marked index tuples, then delete empty pages and adjust the tree structure (from bottom to top).
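
A minimal sketch of this lifecycle, reusing the test_index table from earlier:

demo=# delete from test_index where a <= 5000;
-- the heap tuples are now dead, but their index tuples still exist
demo=# select * from test_index where a = 100;
-- an index scan that encounters dead heap tuples may mark the
-- corresponding index tuples as deleted ("mark delete")
demo=# vacuum test_index;
-- stage 1: physically remove the marked index tuples
-- stage 2: delete empty pages and adjust the tree from bottom to top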

Concurrency control

In real-world database systems, concurrency control must be considered because a database cannot be used by just one person, which is why Greenplum chose Blink trees. Before introducing concurrency control algorithms, let’s take a look at the problems with a naive algorithm to help you understand why Blink trees are used in Greenplum.

The idea of the naive concurrency-control algorithm is very simple: take a read lock when reading a node and a write lock when writing one; the only special handling is node splits. Only one node is locked at a time: as the search descends, the lower node is locked and then the upper node is unlocked.

Insertion is a little more complicated: every lock taken is a write lock. A new concept appears here: a node is called a safe node if inserting a new index tuple into it cannot trigger a split, that is, the node is not full. During the descent, write locks are taken on the nodes along the path. Why write locks? As mentioned in the insert algorithm, a split may propagate a key back up into the parent, so the parent must be locked.

In most cases, the locks are released only after the key has been inserted into the leaf node, and then the parents' locks are released. The safe node enables a simple optimization: if a node is not full, inserting into it cannot split it, so the tree structure above it cannot change, and the write locks on the ancestors above it can be released early. In the extreme case, however, the whole path stays locked. This leads to two problems. First, since every descent must take locks near the root, lock conflicts are high at the top of the tree. Second, most of the locks taken during the descent turn out to be unnecessary and are released almost immediately.

Let's look at how Greenplum solves these problems. The concurrency-control algorithm of the Blink tree in Greenplum introduces a moveright operation, which uses the high key and the right sibling pointer to detect in time that a node has been split: if it was split, the key being searched for may now live in the right sibling. Concretely, every time a node is visited, the search key is compared with the high key; if the search key is greater, the search moves right to the sibling node and tries again.

Why is this operation needed? The example below comes from the Blink-tree paper, so it is quite classic. With the moveright operation, the split scenario shown below is handled correctly. Inserts also take only read locks during the descent, which eases lock contention.

The Blink tree concurrency control algorithm is as follows:

Blink tree concurrency control algorithm – Search

  • Starting at the root node, the read lock is added layer by layer

  • After each layer is moved down, the moveright operation is called to check whether the node is split

  • When a move down or move right operation is performed, the lock is released first and then a new lock is applied.

  • Finally, the leaf node is reached, the read lock is added, and the content is read

Blink tree concurrency control algorithm — Insert

  • The initial phase is the same as Search: descend layer by layer with read locks and moveright, so concurrency is better

  • After the layer by layer descent reaches the leaf node, the read lock needs to be upgraded to a write lock

  • If nodes need to be split

  • Create a new right sibling page, add a write lock, and then attach it to the B-tree

  • Then recursively insert the key upward: take a write lock on the parent (bottom-up), insert the key into the parent, then release the lower node's write lock

The Blink tree solves two problems of the naive algorithm

  • Lock conflicts are high near the root => both reads and inserts take only read locks during the descent

  • Most locks taken during the descent would be released immediately anyway => only read locks are taken on the way down, and write locks are applied bottom-up only where actually needed

Let’s see if a deadlock occurs:

  • In Insert, write locks are requested from bottom up and left to right, without deadlocks

  • Search may acquire locks in a different order than Insert (the former top-down, the latter bottom-up), but Search releases each lock before acquiring the next, holding at most one at a time, so no deadlock occurs

Index related system tables

Finally, let's introduce the index-related system tables. When using indexes you often deal with these system tables, so a brief introduction will help you understand what they are and the role they play. The type system in Greenplum can be customized and extended, which raises a question: how can the same B-tree algorithm support different data types?

You may already have guessed the answer: the problem is very similar to generic algorithms in C++. The B-tree algorithm exposes extension points, or interfaces. For a B-tree there are seven of them, all essentially custom functions (a catalog query sketching them follows the list):

  • Strategy operators, five of them (<, <=, =, >=, >)

  • Support functions, two of them (chiefly the comparison function)
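
You can list these interfaces from the standard catalogs; here is a sketch using the built-in int4 B-tree operator class as a concrete example:

demo=# select amopstrategy, amopopr::regoperator from pg_amop
       where amopmethod = (select oid from pg_am where amname = 'btree')
         and amoplefttype = 'int4'::regtype and amoprighttype = 'int4'::regtype
       order by amopstrategy;
-- strategies 1..5 map to the operators <, <=, =, >=, >
demo=# select amprocnum, amproc from pg_amproc
       where amprocfamily = (select opcfamily from pg_opclass
                             where opcname = 'int4_ops'
                               and opcmethod = (select oid from pg_am where amname = 'btree'))
         and amproclefttype = 'int4'::regtype and amprocrighttype = 'int4'::regtype;
-- support function 1 is the comparison function (btint4cmp);
-- function 2, if present, provides sort support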

The ultimate goal of these system tables is to make it possible to add index support for custom types. If you expand all the related system tables, as in the figure below, you can see there are quite a few; I won't go into each of them here. Two concepts are worth highlighting (see the query after the list):

  • Operator class: the set of operator and function interfaces for a single type

  • Operator family: groups operator classes of similar types, providing function interfaces across those types
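
To see the class/family relationship concretely, here is a sketch (integer_ops is the built-in B-tree family covering the integer types):

demo=# select opcname, opcintype::regtype from pg_opclass
       where opcfamily = (select opcfamily from pg_opclass
                          where opcname = 'int4_ops'
                            and opcmethod = (select oid from pg_am where amname = 'btree'));
-- expect int2_ops, int4_ops, int8_ops: one class per type, grouped into
-- the shared integer_ops family so cross-type comparisons (e.g. an int4
-- column against an int8 constant) can also use the index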

Conclusion

The basic implementation of a B-tree is not complicated, but the many optimizations are broad and deep, touching every aspect of the system. The B-tree index implementation in Greenplum mainly follows the Blink-tree paper, which, although old, is a true classic and still profoundly influences subsequent B-tree concurrency-control designs. If you are interested, I highly recommend reading the source code of Greenplum's B-tree implementation (in src/backend/access/nbtree) to learn how a practical database system handles all the details.