preface

Why watch BoltDB? The reason is that when learning ETCD, the underlying storage and MVCC transaction control is handed over to BoltDB. Before also learn Mysql, but because of its underlying source complexity, at the same time see BoltDB is pure Go write, you can put b+ tree and MVCC implementation review.

Let’s start our exploration!

import (
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main(a) {
	db, err := bolt.Open("my.db".0600, &bolt.Options{Timeout: 10 * time.Second})
	iferr ! =nil {
		log.Fatal(err)
	}
	defer db.Close()
}
Copy the code

The above is the simplest code to get started with BOLtDB in ETCD.

When getting go get github.com/boltdb/bolt, you cannot get down.

So change to ETCD maintenance boltDB. But the basic API functionality remains the same.

In addition: after etCD get, a BBOLt command line will be downloaded at the same time, which is convenient for analyzing the underlying structure of BoltDB.

Storage structure

After executing the above code, a my.db file is generated in the same directory. This is a boltDB database instance: contains all information about the current database: metadata, index, data

Bbolt: bbolt: bbolt: bbolt: bbolt: bbolt: bbolt

❯ bbolt -h Bolt is a tool for inspecting Bolt databases. Usage: Bolt command [arguments] The commands are: bench run synthetic benchmark against bolt buckets print a list of buckets check verifies integrity of bolt database compact copies a bolt database, compacting it in the process dump print a hexadecimal dump of a single page get print the value of a key in a bucket info print basic info keys print a list of keys in a bucket help print this screen page print one or more pages in human  readable format pages print list of pages with their types page-item print the key and value of a page item. stats iterate over all pages and generate usage stats Use "bolt [command] -h" for more information about a command.Copy the code

Bblot Pages… :

❯ bbolt pages -h
usage: bolt pages PATH

Pages prints a table of pages with their type (meta, leaf, branch, freelist).
Leaf and branch pages will show a key count in the "items" column while the
freelist will show the number of free pages in the "items" column.

The "overflow" column shows the number of blocks that the page spills over
into. Normally there is no overflow but large keys and values can cause
a single page to take up multiple blocks.
Copy the code

First, a database instance is organized in pages when stored. Each page size is determined by the PAGE_SIZE of the current file system, which is the smallest unit of data activity on disk and in memory:

❯ getconf PAGE_SIZE. 4096
# 4096 = 4k
Copy the code

Page

Page distribution for database instance initialization:

❯ bbolt pages my.db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0            
1        meta       0            
2        freelist   0            
3        leaf       0
Copy the code

What is the page above? The assignments and distribution for each type of page are explained below.

Page Structure

No matter what kind of page, the structure is composed of header + data body.

Github.com/boltdb/bolt…

type pgid uint64

// page header {id, flags, count, overflow}
type page struct {
	id       pgid
	flags    uint16
	count    uint16
	overflow uint32
	ptr      uintptr
}
Copy the code
  • id: page id
  • flags: Distinguish between different types of pages. Some of the abovemeta/freelist page
  • count: The amount of data stored on the page. Only when the page isbranch/leaf pageData is stored
  • overflow: A single page is not enough to store data. This value stores the number of pages that overflow
  • ptr: points to the end of the page header data, where the page data startsptr

The bbolt pages -h command explains overflow very clearly. Let me translate here:

The Overflow column shows the number of blocks that the page overflowed. Overflows usually do not occur, but large keys and values can cause multiple blocks to overflow a page.

Meta Page

Meta Page Stores database instance metadata.

Github.com/boltdb/bolt…

type meta struct {
	magic    uint32
	version  uint32
	pageSize uint32
	flags    uint32
	root     bucket
	freelist pgid
	pgid     pgid
	txid     txid
	checksum uint64
}
Copy the code
  • magicMagic number, used to identifyboltdb db file
  • version:boltdbversion
  • pageSize:db pageThe size,os.Getpagesize()You can get
  • flags: I don’t know what it is
  • root:boltdbData organization is based onb+ treeForm, while the whole treeroot nodeI’ll keep it right here
  • freelist: Data is deleted when thesepageWill be savedfreelistIn theids
  • pgid: The next one to be allocatedpgidIt can be understood as the present max page num, that is, the currentpageThe number of 】
  • txid: The next one will be allocatedtransaction id, related to transaction implementation
  • checksum: Verification code, used for verificationmetapageComplete or not

Github.com/boltdb/bolt…

// meta returns a pointer to the metadata section of the page.
func (p *page) meta(a) *meta {
	return (*meta)(unsafe.Pointer(&p.ptr))
}
Copy the code

Addressing unsafe, the PTR is a pointer to Page data

freelist page

A free linked list page is a contiguous group of pages (one or more) in a DB file that holds a list of ids of pages that were freed due to modification operations during DB usage.

Github.com/boltdb/bolt…

// freelist represents a list of all pages that are available for allocation.
// It also tracks pages that have been freed but are still in use by open transactions.
type freelist struct {
	ids     []pgid          // all free and available free page ids.
	pending map[txid][]pgid // mapping of soon-to-be free page ids by tx.
	cache   map[pgid]bool   // fast lookup of all free and pending page ids.
}
Copy the code
  • ids: Saves the current free pageid array
  • pending: Saves the corresponding of the transaction operationpage id.key: trxid; value: []pgidIs released after the transaction completes. It is the intermediate product in the transaction operation
  • cache: Mark onepgidAvailable, storedkey: pgid; Value: The flag is currently idle and available for allocation

Boltdb manages page allocation via freelist Page.

Freelist records which pages are free, which is involved in deleting and allocating new data:

  • Delete a large amount of data, corresponding topageWill release – >pgidStored in thefreelistids;
  • Write data directly from the free pageidsPlease apply.

In the example above, IDS is pre-ordered. This design is also convenient for freelists to assign new pages.

branch page

The root field in meta Page is the root node for the entire database instance data organization B + Tree.

What about the other nodes?

  • index node -> branch page
  • leaf node -> leaf page

Github.com/boltdb/bolt…

// branchPageElement retrieves the branch node by index
func (p *page) branchPageElement(index uint16) *branchPageElement {
	return& ((* [0x7FFFFFF]branchPageElement)(unsafe.Pointer(&p.ptr)))[index]
}

// branchPageElements retrieves a list of branch nodes.
func (p *page) branchPageElements(a) []branchPageElement {
	if p.count == 0 {
		return nil
	}
	return((* [0x7FFFFFF]branchPageElement)(unsafe.Pointer(&p.ptr)))[:]
}

// branchPageElement represents a node on a branch page.
type branchPageElement struct {
	pos   uint32
	ksize uint32
	pgid  pgid
}

// key returns a byte slice of the node key.
func (n *branchPageElement) key(a) []byte {
	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
	return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos]))[:n.ksize]
}
Copy the code

Part of the code related to branchPage is extracted. The following characteristics can be seen from the above:

  1. branch pageThe data part of: stores each one that starts from this branchindex nodeMetadata information
  2. index dataTake space for time, divided into:
    • branchPageElement: Each of the storage branchesbranch nodemetadata
    • key: Real data [can just find the next onenodeThe key of 】
  3. branchPageElementParameter Meaning:
    • pos:page dataPartial offset, and you can find the correspondingkey
    • ksize: Size of the key
    • pgid: the next onebranch pageThe page id
  4. fromkey()The function shows that:
    • fromelementposPosition start, reade.kszieThe offset.
    • Note: get the current firstelePointer addressBecause the entire data segment is a contiguous arraySo we can passele.pos, add the offset, and find the correspondingkey
  5. branchPageElement() Also from thepage[]branchPageElementRead the corresponding index. To showbranch pageThe storage is a continuousbranchPageElement.

However, in the ETCD version, key() is rewritten as: reflect.sliceheader

// key returns a byte slice of the node key.
func (n *branchPageElement) key(a) []byte {
	return* (* []byte)(unsafe.Pointer(&reflect.SliceHeader{
		Data: uintptr(unsafe.Pointer(n)) + uintptr(n.pos),
		Len:  int(n.ksize),
		Cap:  int(n.ksize),
	}))
}
Copy the code

This is zero-copy converted to []byte:

  1. First get the address to which the structure pointer points, + corresponding offset => get data
  2. key sliceLen and Cap are both thereeleksize

The above code is also in page to node, which reads data from disk into memory. Of course the important one is read() :

// read initializes the node from a page.
func (n *node) read(p *page){ n.pgid = p.id n.isLeaf = ((p.flags & leafPageFlag) ! =0)
	n.inodes = make(inodes, int(p.count))
  
	for i := 0; i < int(p.count); i++ {
		inode := &n.inodes[i]
		if n.isLeaf {
			elem := p.leafPageElement(uint16(i))
			inode.flags = elem.flags
			inode.key = elem.key()
			inode.value = elem.value()
		} else {
			elem := p.branchPageElement(uint16(i))
			inode.pgid = elem.pgid
			inode.key = elem.key()
		}
		_assert(len(inode.key) > 0."read: zero-length inode key")}... }Copy the code

Read () shows the layout of the Branch page as a whole:

  1. Gets the pGID from the page, indicating which page the current node is from
  2. []inodesLength ofpage.count, which is how many nodes there are currentlyThe child node
  3. isLeafpage.flagsDecide [also decide laterifAssignment of branches
  4. byp.countThe number of loops each oneelementAnd corresponding ton.inodesEach assignment of

This maps the disk’s B + Tree into memory. This section will be further explored later.

leaf page

We can also see the assignment to the Leaf Node field from read() above:

Github1s.com/boltdb/bolt…

type leafPageElement struct {
	flags uint32
	pos   uint32
	ksize uint32
	vsize uint32
}
Copy the code
  • flagsDistinguish:subbucketandOrdinary leaf node【 aboutbucketWe’ll talk about that later.
  • pos: Is the same as that on the Branch page and corresponds to the distanceleafPageElementThe offset
  • ksize/vsize:key/valueStorage size

Where value is positioned in read() :

// read initializes the node from a page.
func (n *node) read(p *page){...for i := 0; i < int(p.count); i++ {
		inode := &n.inodes[i]
		if n.isLeaf {
			elem := p.leafPageElement(uint16(i))
			inode.flags = elem.flags
			inode.key = elem.key()
			inode.value = elem.value() // It is this line...
		} else{... } _assert(len(inode.key) > 0."read: zero-length inode key")}... }// value returns a byte slice of the node value.
func (n *leafPageElement) value(a) []byte {
	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
	return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos+n.ksize]))[:n.vsize:n.vsize]
}
Copy the code

Key: &element + pos == &key

Value: &leafPageElement + pos + ksize == &val

conclusion

This paper understands the structure of BOLtDB on disk from the overall page layout. The next article will look at boltDB in memory operation and design ideas.