Liu Daiming, a literary young man (What, still a "young" man? Yes, you read that right: by the World Health Organization (2020) definition of youth quoted on Baidu Encyclopedia, youth runs from 18 to 65). He now works as a web server technical expert in the search technology department at Qihoo 360. This article is the author's original work; I wish you happy reading.

Preface

Recently I took some time to read the boltDB source code. The code base is small (about 4000 lines), yet it supports transactions and has a very clear structure; being stable and already archived, it is an excellent choice for studying database internals. Many well-known open source projects use it too, such as etcd and InfluxDB.

This article records what I learned from reading the source code, as a reminder.

Introduction

BoltDB is a key/value database written in Go. Its design derives from LMDB (Lightning Memory-Mapped Database), and it persists everything to a single file. The file is mapped into memory with mmap, so reads save one IO; writes do not go through mmap but use seek + write via the write system call instead. More on that later.

Indexing is done with a B+ tree.

Data is written in blocks (pages). The page size depends on the operating system and is 4K on most systems.

Transaction support: multiple read transactions can execute concurrently, but write transactions are serialized.

The working process

A simple usage example follows:


	path := "./bolt.db"
	db, err := bolt.Open(path, 0666.nil)
	iferr ! =nil {
		fmt.Println("open db failed:", path, err)
		return
	}
	defer db.Close()

	tx, err := db.Begin(true)
	iferr ! =nil {
		fmt.Println("begin trans failed:", err)
		return
	}
	defer tx.Rollback()

	bucket, err := tx.CreateBucketIfNotExists([]byte("nums"))
	iferr ! =nil {
		fmt.Println("create bucket failed")
		return
	}
	kv := []byte("128")
	
	bucket.Put(kv, kv)

	val := bucket.Get(kv)
	fmt.Println(val)

	tx.Commit()


It can be seen that the working process is:

1. Open the database file

2. Start a transaction

3. Create a bucket (a collection of key/value pairs; buckets can be nested)

4. Put a k/v pair into the bucket

5. Commit the transaction

Each such sequence of operations forms a transaction, either read-only or read-write. Every create, read, update, and delete on data goes through a bucket.
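Besides driving transactions by hand with Begin/Commit as above, bolt's public API also offers the Update and View wrappers, which run a function inside a read-write or read-only transaction and handle commit and rollback automatically. Continuing the example above:


	err = db.Update(func(tx *bolt.Tx) error {
		// Runs in a read-write transaction; a nil return commits it.
		b, err := tx.CreateBucketIfNotExists([]byte("nums"))
		if err != nil {
			return err
		}
		return b.Put([]byte("128"), []byte("128"))
	})
	if err != nil {
		fmt.Println("update failed:", err)
	}

	err = db.View(func(tx *bolt.Tx) error {
		// Runs in a read-only transaction; writes are not allowed here.
		fmt.Println(string(tx.Bucket([]byte("nums")).Get([]byte("128"))))
		return nil
	})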

Data structures

All of boltDB's modifications are performed in memory; the actual disk write happens only when the transaction finally commits.

Hence there are two families of data structures, in-memory ones and on-file ones. On-file byte arrays are converted into the corresponding structs with the unsafe package.
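For illustration, here is a minimal, self-contained sketch of that unsafe conversion pattern. The struct mirrors the page header described below; this is not bolt's exact code:


package main

import (
	"fmt"
	"unsafe"
)

type pgid uint64

// page mirrors the fixed-size header at the start of every page.
type page struct {
	id       pgid
	flags    uint16
	count    uint16
	overflow uint32
}

func main() {
	// buf stands in for one page of an mmap'd file.
	buf := make([]byte, 4096)

	// Reinterpret the first bytes of the buffer as a page header,
	// with no copying -- the same trick bolt plays on its mmap'd data.
	p := (*page)(unsafe.Pointer(&buf[0]))
	p.id = 42
	p.flags = 0x02

	fmt.Println(p.id, p.flags, p.count) // 42 2 0
}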

In-memory structures

node

Represents a tree node in memory


type node struct {
	bucket     *Bucket	
	isLeaf     bool		// Used to distinguish tree branches from leaf nodes
	unbalanced bool
	spilled    bool
	key        []byte	// The smallest key in this node
	pgid       pgid
	parent     *node
	children   nodes
	inodes     inodes	// Where the actual data is stored in node
}


inode

An inode stores one piece of k/v data. Branch nodes and leaf nodes share this data structure.


type inode struct {
	flags uint32	// Leaf nodes only. Identifies the content: bucket or normal data
	pgid  pgid		// Branch nodes only. Page id of the child node
	key   []byte	// Shared by branch and leaf nodes. The key
	value []byte	// Leaf nodes only. Stores normal data or a bucket
}


On-disk structures

page


type page struct {
	id       pgid		// page id
	flags    uint16		// Page type: metadata, freelist, branch, or leaf
	count    uint16		// Number of elements stored
	overflow uint32		// Number of overflow pages (when data spans more than one page)
	ptr      uintptr	// Marks where the stored data starts (meaningful only in memory, not on disk)
}


Every page.ptr points to a data region that is preceded by the page header; below we abbreviate the header as PH (PageHeader).

(Figure: the on-file layout of a page, i.e. PH followed by the data structure that page.ptr points to.)

Metadata (meta)

The metadata is stored in two copies, at pageID 0 and pageID 1. These pageIDs are fixed and never change. What meta does:

1. Saves basic database information, such as the root node, version number, and freelist

2. If one copy of the metadata is corrupted, the other can repair it, keeping the database available.

3. Ensures consistency: before each read/write transaction, the meta with the larger txid is selected for transaction initialization, and the latest meta is what gets copied for MVCC.


type meta struct {
	magic    uint32		// Identifies a db file generated by boltdb
	version  uint32		// Version number
	pageSize uint32		// Page size, per the system, usually 4k
	flags    uint32		// Marks this page as metadata
	root     bucket		// Root bucket; its pageID starts at 3 in a fresh db
	freelist pgid		// Freelist pageID; starts at 2 in a fresh db
	pgid     pgid		// The next pageID to assign
	txid     txid		// The next transaction ID to assign
	checksum uint64		// Used to verify meta integrity
}

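To make the checksum and the "pick the newer valid meta" rule concrete, here is a compact, self-contained sketch. The field hashing mirrors the fnv64a approach bolt uses, but the code is illustrative rather than bolt's own (root is reduced to a plain pgid):


package main

import (
	"errors"
	"fmt"
	"hash/fnv"
	"unsafe"
)

type pgid uint64
type txid uint64

// meta is abbreviated: root is reduced to a pgid for this sketch.
type meta struct {
	magic    uint32
	version  uint32
	pageSize uint32
	flags    uint32
	root     pgid
	freelist pgid
	pgid     pgid
	txid     txid
	checksum uint64
}

const magic uint32 = 0xED0CDAED // bolt's magic number

var errInvalid = errors.New("invalid meta page")

// sum64 hashes every byte of the struct up to the checksum field.
func (m *meta) sum64() uint64 {
	h := fnv.New64a()
	h.Write((*[unsafe.Offsetof(meta{}.checksum)]byte)(unsafe.Pointer(m))[:])
	return h.Sum64()
}

func (m *meta) validate() error {
	if m.magic != magic || m.checksum != m.sum64() {
		return errInvalid
	}
	return nil
}

// pick prefers the meta with the larger txid and falls back to the
// other copy when the newer one fails validation.
func pick(a, b *meta) (*meta, error) {
	hi, lo := a, b
	if b.txid > a.txid {
		hi, lo = b, a
	}
	if hi.validate() == nil {
		return hi, nil
	}
	if lo.validate() == nil {
		return lo, nil
	}
	return nil, errInvalid
}

func main() {
	m0 := &meta{magic: magic, txid: 6}
	m0.checksum = m0.sum64()
	m1 := &meta{magic: magic, txid: 7} // newer, but checksum left unset
	m, _ := pick(m0, m1)
	fmt.Println(m.txid) // 6: the corrupt newer copy is rejected
}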

Freelist

BoltDB uses MVCC (multi-version concurrency control): adds, deletes, and updates write their data to newly allocated pages, and the pages holding the previous version are not deleted right away. As the transaction version advances, pages held by old versions can be released and re-allocated to new write transactions. The freelist is the structure that tracks which pages are free and which are about to become free.

This also shows why a boltDB database file never shrinks when data is deleted: freed space is simply reused through the freelist.

Freelist data structure:


// freelist represents a list of all pages that are available for allocation.
// It also tracks pages that have been freed but are still in use by open transactions.
type freelist struct {
	ids     []pgid          // all free and available free page ids.
	pending map[txid][]pgid // mapping of soon-to-be free page ids by tx.
	cache   map[pgid]bool   // fast lookup of all free and pending page ids.
}


When flushing to disk, only page ids are stored: pending and ids are merged and written together, and page.ptr points to this list of ids. If the number of ids exceeds 0xFFFF (the maximum value of page.count), the first element that ptr points to holds the actual length of the list.
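A self-contained sketch of that encoding, simplified from bolt's freelist.write (maxAllocSize matches bolt's 64-bit build, and the merge here just re-sorts):


package main

import (
	"fmt"
	"sort"
	"unsafe"
)

type pgid uint64
type txid uint64

const maxAllocSize = 0x7FFFFFFF // as in bolt's amd64 build

type page struct {
	id       pgid
	flags    uint16
	count    uint16
	overflow uint32
	ptr      uintptr
}

type freelist struct {
	ids     []pgid
	pending map[txid][]pgid
}

// copyall merges ids with every pending list, sorted ascending.
func (f *freelist) copyall() []pgid {
	all := append([]pgid(nil), f.ids...)
	for _, list := range f.pending {
		all = append(all, list...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i] < all[j] })
	return all
}

// write lays the merged list out right after the page header. When the
// list is longer than 0xFFFF, count saturates and the real length goes
// into the first slot of the data area.
func (f *freelist) write(p *page) {
	ids := f.copyall()
	data := (*[maxAllocSize]pgid)(unsafe.Pointer(&p.ptr))
	if len(ids) < 0xFFFF {
		p.count = uint16(len(ids))
		copy(data[:len(ids)], ids)
	} else {
		p.count = 0xFFFF
		data[0] = pgid(len(ids)) // real length lives in the first slot
		copy(data[1:len(ids)+1], ids)
	}
}

func main() {
	buf := make([]byte, 4096) // one page worth of memory
	p := (*page)(unsafe.Pointer(&buf[0]))
	f := &freelist{ids: []pgid{4, 9}, pending: map[txid][]pgid{7: {5}}}
	f.write(p)
	fmt.Println(p.count) // 3
}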

pending holds ids of freed pages that have not yet been released; ids holds pages that are already available. So why can pending be merged with ids on disk?

Before getting to the reason, some background:

1. The freelist belongs to the whole DB (it is shared by all write transactions). Only write transactions modify it, so no locking is needed (why? As mentioned earlier, write transactions execute serially, never concurrently).

2. Each transaction gets a version number: a read transaction gets txid = db.meta.txid, a write transaction gets txid = db.meta.txid + 1.

3. Each transaction copies the metadata, which holds the database's basic information.

Here’s why:

1. When a write transaction commits, the data it modified has been rewritten to newly allocated pages, and the pages it replaced go into pending rather than being released to ids at completion. The reason: if they went straight into ids, a long-running read transaction could still be reading them while a subsequent write transaction reuses them for new data, producing dirty reads.

2. When a write transaction commits successfully, the freelist itself is written to a new page, and that page holds the transaction's ids + pending. At that point there are two situations for read transactions:

1. A read transaction has not yet completed (it may have started before the write transaction, or before the write transaction committed)

2. A new read transaction starts after the write transaction succeeded

3. In the first case, the pages recorded in db.freelist.pending still exist, so all the data the read transaction needs remains present and is never overwritten for its entire duration, guaranteeing isolation.

4. In the second case, the new read transaction reads the metadata that records the freelist's new pageID, so it directly sees the new data. This is multi-version control.

5. When a new write transaction starts, it finds the minimum txid among open transactions, then moves every pageID in freelist.pending belonging to a txid smaller than that minimum into freelist.ids (no read transaction can still be using those pending pages, so they can be released; a sketch of this step follows the list).

6. And if the database is closed and reopened, no transactions are in flight at all, so pending and ids can simply be merged, which is exactly what the on-disk format does.
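A minimal sketch of that release step, reusing the freelist types from the sketch above (bolt's real freelist.release does a proper merge sort; re-sorting keeps the idea visible):


// release moves all pages pending under transactions with id <= tid
// into the available ids list, keeping ids sorted.
func (f *freelist) release(tid txid) {
	var freed []pgid
	for t, list := range f.pending {
		if t <= tid {
			freed = append(freed, list...)
			delete(f.pending, t)
		}
	}
	f.ids = append(f.ids, freed...)
	sort.Slice(f.ids, func(i, j int) bool { return f.ids[i] < f.ids[j] })
}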

Branches


type branchPageElement struct {
	pos   uint32	// Position of key
	ksize uint32	// Size of key
	pgid  pgid		// Point to the underlying pageID
}


The structure of the branches in the file is as follows:

The key is located through pos and read as the ksize bytes from there (key = data[pos : pos+ksize]). The next layer (a branch or leaf page) is reached through pgid.

Leaves


type leafPageElement struct {
	flags uint32	// Content identifier, which can be either plain data or bucket
	pos   uint32	// Position of key
	ksize uint32	// Size of key
	vsize uint32	// The size of value
}


The structure of the leaves in the file is as follows:

The key is located through pos and read as data[pos : pos+ksize]; the value follows it as data[pos+ksize : pos+ksize+vsize].

The leaf contents can be buckets (sub-buckets) or ordinary data.

For a bucket, flags is bucketLeafFlag = 0x01; for normal data, flags = 0.
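To illustrate, here is roughly how those offsets become byte slices, in the style of bolt's accessor methods on leafPageElement (a sketch; it assumes the unsafe import and the same maxAllocSize bound used in the freelist sketch above):


// key returns a slice over the key bytes, which live pos bytes past
// the element itself.
func (n *leafPageElement) key() []byte {
	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
	return buf[n.pos : n.pos+n.ksize]
}

// value returns a slice over the value bytes, which follow the key.
func (n *leafPageElement) value() []byte {
	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
	return buf[n.pos+n.ksize : n.pos+n.ksize+n.vsize]
}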

Converting between page and node

Page to node: converted on demand with the node.read(p *page) function.

Node to page: converted with the node.write(p *page) function.

Since modifications happen in memory, every insert first goes into a node and is only later saved to the file through a series of operations.

node.put(oldKey, newKey, value []byte, pgid pgid, flags uint32)

As mentioned above, there are only two kinds of insert, inserting a bucket and inserting normal data:

c.node().put(key, key, value, 0, bucketLeafFlag)

c.node().put(key, key, value, 0, 0)

Before data is flushed to disk, inserts are not assigned a pgid (pgid = 0); only in-memory data is affected, not the file.
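A sketch of what put does internally, modeled on bolt's node.put: binary-search the sorted inodes, then insert or replace (imports: bytes, sort; the pgid parameter is renamed id here to avoid shadowing the type):


// put inserts key/value into the node's sorted inode list, replacing
// the entry in place if oldKey already exists.
func (n *node) put(oldKey, newKey, value []byte, id pgid, flags uint32) {
	// Find the insertion index for oldKey via binary search.
	index := sort.Search(len(n.inodes), func(i int) bool {
		return bytes.Compare(n.inodes[i].key, oldKey) != -1
	})

	// Shift entries right if this is an insert rather than a replace.
	exact := index < len(n.inodes) && bytes.Equal(n.inodes[index].key, oldKey)
	if !exact {
		n.inodes = append(n.inodes, inode{})
		copy(n.inodes[index+1:], n.inodes[index:])
	}

	in := &n.inodes[index]
	in.flags = flags
	in.key = newKey
	in.value = value
	in.pgid = id
}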

bucket

In BoltDB, a bucket is an important concept: it is a collection of key/value pairs. A bucket acts as a namespace; each bucket is a complete B+ tree, and buckets can nest other buckets. Adding, deleting, updating, and querying data are all carried out against a bucket.

The structure is as follows:


// Bucket represents a collection of key/value pairs inside the database.
type Bucket struct {
	*bucket
	tx       *Tx                // the associated transaction
	buckets  map[string]*Bucket // subbucket cache
	page     *page              // inline page reference
	rootNode *node              // materialized node for the root page.
	nodes    map[pgid]*node     // node cache

	// Sets the threshold for filling nodes when they split. By default,
	// the bucket will fill to 50% but it can be useful to increase this
	// amount if you know that your write workloads are mostly append-only.
	//
	// This is non-persisted across transactions so it must be set in every Tx.
	FillPercent float64
}

// bucket represents the on-file representation of a bucket.
// This is stored as the "value" of a bucket key. If the bucket is small enough,
// then its root page can be stored inline in the "value", after the bucket
// header. In the case of inline buckets, the "root" will be 0.
type bucket struct {
	root     pgid   // page id of the bucket's root-level page
	sequence uint64 // monotonically incrementing, used by NextSequence()
}


Each field is also explained in detail in the code.

Each bucket can be persisted to a file in one of two ways:

1. It occupies one or more entire pages

2. The bucket and its data are stored inside a leaf (an inline bucket)

The structure of case 1 in the file is as follows:

root is the starting pageID of the bucket; subsequent queries for elements in the bucket binary-search downward from that pageID. This form occupies at least a full page.

The structure of case 2 in the file is as follows:

The bucket and its data are stored as a leafPageElement. Assuming Leaf1 is an inline bucket, its value contains BucketHeader + PageHeader + PageData (the bucket's k/v collection).

It differs from a normal bucket in that no separate pageID is allocated: its root is 0, and the inline page's id is also 0.

Inline-buckets exist to reduce storage space usage.

In fact a newly created bucket is an inline bucket by default; after a series of k/v operations in memory, boltDB checks whether it still qualifies as an inline bucket. The function is bucket.inlineable().
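A sketch of that check, modeled on bolt's bucket.inlineable(): the bucket must consist of a single leaf node, contain no sub-buckets, and fit within a quarter page (the header and element sizes are illustrative constants, not bolt's exact values):


const (
	pageHeaderSize      = 16 // illustrative size of the page header
	leafPageElementSize = 16 // illustrative size of one leaf element
	bucketLeafFlag      = 0x01
	pageSize            = 4096
)

// inlineable reports whether the bucket can live inside its parent's
// leaf value instead of owning pages of its own.
func (b *Bucket) inlineable() bool {
	n := b.rootNode
	if n == nil || !n.isLeaf {
		return false // must be a single leaf node
	}
	size := pageHeaderSize
	for _, in := range n.inodes {
		size += leafPageElementSize + len(in.key) + len(in.value)
		if in.flags&bucketLeafFlag != 0 {
			return false // nested sub-buckets are never inlined
		}
		if size > pageSize/4 {
			return false // must fit in a quarter of a page
		}
	}
	return true
}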

Whole picture

Now let's look at a complete B+ tree for one bucket.

Note: for the sake of the drawing, the following abbreviations are used:

PH: PageHeader

LE: leafPageElement

BH: bucketHeader

BE: branchPageElement

K: key

V: value

The in-memory whole picture can be described the same way with node + inodes, so it is not repeated here.

Lookup

With all of that in mind, let's look at how lookup works.

Both the contents stored in inodes and the elements in pages (leafPageElement + branchPageElement) keep their keys in sorted order, so lookup is a binary search.

Data structures used in lookups:


// Cursor represents an iterator that can traverse over all key/value pairs in a bucket in sorted order.
// Cursors see nested buckets with value == nil.
// Cursors can be obtained from a transaction and are valid as long as the transaction is open.
//
// Keys and values returned from the cursor are only valid for the life of the transaction.
//
// Changing data while traversing with a cursor may cause it to be invalidated
// and return unexpected keys and/or values. You must reposition your cursor
// after mutating data.
type Cursor struct {
	bucket *Bucket
	stack  []elemRef
}

// elemRef represents a reference to an element on a given page/node.
type elemRef struct {
	page  *page
	node  *node
	index int
}


Each step of the search records an elemRef (the page + node + the index of the hit key) and pushes it onto the stack.

Because a B+ tree stores values in its leaves, the top of the stack (the last element) ends up being a leaf node, on which the actual operation (add, delete, update) is then performed.

The search prefers memory: it looks in the in-memory node first, and only then falls back to the page, caching what it reads into memory.
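That preference is captured by a small helper, sketched here after bolt's Bucket.pageNode (the inline-bucket special case is omitted; tx.page is assumed to resolve a pgid inside the mmap'd region):


// pageNode returns the in-memory node for id if one has been
// materialized, otherwise the read-only mmap'd page.
func (b *Bucket) pageNode(id pgid) (*page, *node) {
	if n, ok := b.nodes[id]; ok {
		return nil, n // prefer the cached, possibly dirty, node
	}
	return b.tx.page(id), nil // fall back to the mmap'd page
}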

B+ tree

Persistence happens during the commit phase of a read-write transaction.

As mentioned above, all changes are made in memory. Inserts and deletes can break the B+ tree invariants, so the tree structure must be adjusted in memory before persisting.

There are two operations involved:

1. Rebalance: deleting nodes and their data may leave pages under-filled, so those nodes need to be checked and merged.

2. Spill: inserts may leave pages over-filled, so those nodes need to be split.

rebalance

Within a Bucket, b.nodes and b.buckets cache everything that has been touched, so rebalance runs over the nodes held in those two fields.

For each possibly unbalanced node (a node is flagged unbalanced when data is deleted from it), check whether its fill rate satisfies both of the following conditions:

1. Its data size is greater than 1/4 of pagesize.

2. If it is a leaf node, it holds more than 1 key; if it is a branch node, more than 2 keys.

If either condition fails, the node is merged via node.rebalance().
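A sketch of that check (names are illustrative, not bolt's exact code; size() is assumed to return the node's serialized size in bytes):


// underfilled reports whether a node has dropped below the merge
// threshold: a quarter of a page, or too few keys.
func (n *node) underfilled(pageSize int) bool {
	return n.size() <= pageSize/4 || len(n.inodes) <= n.minKeys()
}

// minKeys is 1 for leaf nodes and 2 for branch nodes.
func (n *node) minKeys() int {
	if n.isLeaf {
		return 1
	}
	return 2
}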

The merging process is as follows:

1. If the node is the root, is a branch node, and contains only one entry, its child is promoted to root (the branch node is removed).

2. Nodes that contain no data are deleted.

3. Merging prefers the left sibling in the same layer; the first node has no left sibling, so it merges with its right sibling instead.

4. A merge may leave the parent violating the conditions (see the picture above: a parent's key is the smallest, leftmost, key of its child, and after children merge one of those keys disappears), so the check recurses up through the parents.

spill

The b.spill() function splits the nodes in a Bucket.

A node is split into multiple nodes only when both of the following hold:

1. Its inodes hold more than 4 entries.

2. Its inodes' data size is larger than pagesize.

This avoids the big-key problem: the inodes may satisfy condition 1, yet one key's value alone may exceed pagesize, in which case splitting would achieve nothing and is skipped.

Besides splitting, spill also allocates a pageID for each node produced by the split.

split

All nodes involved in all buckets are examined and split: first each sub-bucket is checked to see whether it is an inline bucket, then every node in each non-inline bucket is examined and split.

Split process:

1. Start at the root node of each bucket and recurse from the top down to the leaf nodes.

2. Examine each node and split it into multiple nodes according to the fill rate.

3. Allocate a pageID for each node after splitting. If the node already had a pageID, the old one is put into freelist.pending (the freelist records both the currently free pageIDs and the pageIDs about to be freed).

4. Splitting may give the parent node too many keys, so the parent may need to be split as well, recursively. (A sketch of the split-point computation follows.)
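A sketch of the split-point computation, following the idea behind bolt's node.splitIndex (header and element sizes are passed in or illustrative; fill corresponds to the FillPercent field on the Bucket struct above):


// splitIndex finds where to cut a node's inodes so that the first
// chunk fills roughly fill*pageSize bytes, while both sides keep a
// minimum number of keys.
func splitIndex(inodes []inode, pageSize, elementSize int, fill float64) int {
	const minKeysPerPage = 2
	const headerSize = 16 // illustrative page-header size
	threshold := int(float64(pageSize) * fill)
	sz := headerSize
	for i, in := range inodes {
		elsize := elementSize + len(in.key) + len(in.value)
		// Cut only once both sides would retain enough keys.
		if i >= minKeysPerPage && i < len(inodes)-minKeysPerPage && sz+elsize > threshold {
			return i
		}
		sz += elsize
	}
	return len(inodes) // no cut needed
}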

Allocating pageIDs

Allocation searches the freelist for free pageIDs that fit the required space: some data needs a single page, some (big keys) need several, and multiple pages must be contiguous.

If the allocation extends beyond the current file size, the mmap mapping must be resized (the mmap'd size is an integer multiple of pagesize, so it can be larger than the file itself).

Note also that pageID allocation is a linear scan until a fitting run is found. As data volume grows, the ids in the freelist become fragmented, and finding a large contiguous run becomes a bottleneck. The etcd-maintained fork has since addressed this (by merging runs and classifying them by size).
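A sketch of that linear scan, modeled on bolt's freelist.allocate: walk the sorted ids looking for n consecutive pages, cut them out, and return the first id (0 means no fit, so new pages must come from growing the file):


// allocate returns the starting id of a run of n contiguous free
// pages, removing the run from ids, or 0 if no such run exists.
func (f *freelist) allocate(n int) pgid {
	var initial, previd pgid
	for i, id := range f.ids {
		if previd == 0 || id-previd != 1 {
			initial = id // the run broke; restart counting here
		}
		if (id-initial)+1 == pgid(n) {
			// Found n contiguous ids ending at index i: cut them out.
			f.ids = append(f.ids[:i-n+1], f.ids[i+1:]...)
			return initial
		}
		previd = id
	}
	return 0
}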

Writing data to disk

As mentioned above, reads go through mmap to save one IO, while writes go to the file directly. The write path at commit:

1. Grow the file to its final size (through the file descriptor).

2. Write each dirty page at its offset.

3. Call syscall.Fdatasync to make sure the data reaches the disk.

Setting the file size up front and then using fdatasync (rather than fsync) saves one IO, because fdatasync does not need to flush the file-size metadata.
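A sketch of that commit-time write path (pageBytes is a hypothetical helper exposing a page's serialized bytes; os.File.WriteAt combines the seek and the write, and Sync stands in for the fdatasync that bolt issues where the OS provides it):


// writePages writes each dirty page at pgid * pageSize and then
// syncs the file so the data is durable before meta is written.
func writePages(f *os.File, pages []*page, pageSize int) error {
	for _, p := range pages {
		offset := int64(p.id) * int64(pageSize)
		size := (int(p.overflow) + 1) * pageSize // overflow pages are contiguous
		if _, err := f.WriteAt(pageBytes(p, size), offset); err != nil {
			return err
		}
	}
	return f.Sync()
}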

Transactions

BoltDB supports two kinds of transactions: read-only and read-write.

Multiple read transactions may run alongside a single write transaction.

Let’s see how it supports ACID.

Atomicity

A write transaction copies the metadata when it starts, and all changes are made in memory without affecting the database file.

The B+ tree data and freelist are written first, and the metadata last; the commit counts as successful only once the metadata write succeeds. If any write step fails, the pageIDs the transaction allocated in the freelist are simply discarded and the freelist is reloaded.

Consistency

A write transaction succeeds only when the metadata is successfully written, so even a crash or power failure cannot compromise the consistency of the database.

Isolation

Write transactions execute serially and read transactions execute concurrently, coordinated by MVCC multi-version concurrency control. Each transaction is assigned a transaction version number, the txid.

Read transaction: txid = db.meta.txid; write transaction: txid = db.meta.txid + 1.

During the lifetime of a read transaction, it is not affected by a write transaction:

1. Writes happen in memory before persistence.

2. After persistence, modified data lives in newly allocated pages, leaving the previous pages untouched; the replaced pages sit in pending so that read transactions with lower txids can still see their data.

3. Each read transaction takes the latest db.meta.txid when it starts.

Durability

BoltDB keeps two copies of the metadata so it can recover after a failure or restart. Metadata selection includes a verification (checksum) step; if one copy is corrupted, the other is used to restore it, and only the freelist needs to be rebuilt.

References

boltdb/bolt: https://github.com/boltdb/bolt

BoltDB source code analysis