preface
Why watch BoltDB? The reason is that when learning ETCD, the underlying storage and MVCC transaction control is handed over to BoltDB. Before also learn Mysql, but because of its underlying source complexity, at the same time see BoltDB is pure Go write, you can put b+ tree and MVCC implementation review.
Let’s start our exploration!
import (
"log"
"time"
bolt "go.etcd.io/bbolt"
)
func main(a) {
db, err := bolt.Open("my.db".0600, &bolt.Options{Timeout: 10 * time.Second})
iferr ! =nil {
log.Fatal(err)
}
defer db.Close()
}
Copy the code
The above is the simplest code to get started with BOLtDB in ETCD.
When getting go get github.com/boltdb/bolt, you cannot get down.
So change to ETCD maintenance boltDB. But the basic API functionality remains the same.
In addition: after etCD get, a BBOLt command line will be downloaded at the same time, which is convenient for analyzing the underlying structure of BoltDB.
Storage structure
After executing the above code, a my.db file is generated in the same directory. This is a boltDB database instance: contains all information about the current database: metadata, index, data
Bbolt: bbolt: bbolt: bbolt: bbolt: bbolt: bbolt
❯ bbolt -h Bolt is a tool for inspecting Bolt databases. Usage: Bolt command [arguments] The commands are: bench run synthetic benchmark against bolt buckets print a list of buckets check verifies integrity of bolt database compact copies a bolt database, compacting it in the process dump print a hexadecimal dump of a single page get print the value of a key in a bucket info print basic info keys print a list of keys in a bucket help print this screen page print one or more pages in human readable format pages print list of pages with their types page-item print the key and value of a page item. stats iterate over all pages and generate usage stats Use "bolt [command] -h" for more information about a command.Copy the code
Bblot Pages… :
❯ bbolt pages -h
usage: bolt pages PATH
Pages prints a table of pages with their type (meta, leaf, branch, freelist).
Leaf and branch pages will show a key count in the "items" column while the
freelist will show the number of free pages in the "items" column.
The "overflow" column shows the number of blocks that the page spills over
into. Normally there is no overflow but large keys and values can cause
a single page to take up multiple blocks.
Copy the code
First, a database instance is organized in pages when stored. Each page size is determined by the PAGE_SIZE of the current file system, which is the smallest unit of data activity on disk and in memory:
❯ getconf PAGE_SIZE. 4096 # 4096 = 4k Copy the code
Page
Page distribution for database instance initialization:
❯ bbolt pages my.db
ID TYPE ITEMS OVRFLW
======== ========== ====== ======
0 meta 0
1 meta 0
2 freelist 0
3 leaf 0
Copy the code
What is the page above? The assignments and distribution for each type of page are explained below.
Page Structure
No matter what kind of page, the structure is composed of header + data body.
Github.com/boltdb/bolt…
type pgid uint64
// page header {id, flags, count, overflow}
type page struct {
id pgid
flags uint16
count uint16
overflow uint32
ptr uintptr
}
Copy the code
id
: page idflags
: Distinguish between different types of pages. Some of the abovemeta/freelist page
count
: The amount of data stored on the page. Only when the page isbranch/leaf page
Data is storedoverflow
: A single page is not enough to store data. This value stores the number of pages that overflowptr
: points to the end of the page header data, where the page data startsptr
】
The bbolt pages -h command explains overflow very clearly. Let me translate here:
The Overflow column shows the number of blocks that the page overflowed. Overflows usually do not occur, but large keys and values can cause multiple blocks to overflow a page.
Meta Page
Meta Page Stores database instance metadata.
Github.com/boltdb/bolt…
type meta struct {
magic uint32
version uint32
pageSize uint32
flags uint32
root bucket
freelist pgid
pgid pgid
txid txid
checksum uint64
}
Copy the code
magic
Magic number, used to identifyboltdb db file
version
:boltdb
versionpageSize
:db page
The size,os.Getpagesize()
You can getflags
: I don’t know what it isroot
:boltdb
Data organization is based onb+ tree
Form, while the whole treeroot node
I’ll keep it right herefreelist
: Data is deleted when thesepage
Will be savedfreelist
In theids
中pgid
: The next one to be allocatedpgid
It can be understood as the presentmax page num
, that is, the currentpage
The number of 】txid
: The next one will be allocatedtransaction id
, related to transaction implementationchecksum
: Verification code, used for verificationmetapage
Complete or not
Github.com/boltdb/bolt…
// meta returns a pointer to the metadata section of the page.
func (p *page) meta(a) *meta {
return (*meta)(unsafe.Pointer(&p.ptr))
}
Copy the code
Addressing unsafe, the PTR is a pointer to Page data
freelist page
A free linked list page is a contiguous group of pages (one or more) in a DB file that holds a list of ids of pages that were freed due to modification operations during DB usage.
Github.com/boltdb/bolt…
// freelist represents a list of all pages that are available for allocation.
// It also tracks pages that have been freed but are still in use by open transactions.
type freelist struct {
ids []pgid // all free and available free page ids.
pending map[txid][]pgid // mapping of soon-to-be free page ids by tx.
cache map[pgid]bool // fast lookup of all free and pending page ids.
}
Copy the code
ids
: Saves the current free pageid array
pending
: Saves the corresponding of the transaction operationpage id
.key: trxid; value: []pgid
Is released after the transaction completes. It is the intermediate product in the transaction operationcache
: Mark onepgid
Available, storedkey: pgid; Value: The flag is currently idle and available for allocation
Boltdb manages page allocation via freelist Page.
Freelist records which pages are free, which is involved in deleting and allocating new data:
- Delete a large amount of data, corresponding to
page
Will release – >pgid
Stored in thefreelist
的ids
; - Write data directly from the free page
ids
Please apply.
In the example above, IDS is pre-ordered. This design is also convenient for freelists to assign new pages.
branch page
The root field in meta Page is the root node for the entire database instance data organization B + Tree.
What about the other nodes?
index node
->branch page
leaf node
->leaf page
Github.com/boltdb/bolt…
// branchPageElement retrieves the branch node by index
func (p *page) branchPageElement(index uint16) *branchPageElement {
return& ((* [0x7FFFFFF]branchPageElement)(unsafe.Pointer(&p.ptr)))[index]
}
// branchPageElements retrieves a list of branch nodes.
func (p *page) branchPageElements(a) []branchPageElement {
if p.count == 0 {
return nil
}
return((* [0x7FFFFFF]branchPageElement)(unsafe.Pointer(&p.ptr)))[:]
}
// branchPageElement represents a node on a branch page.
type branchPageElement struct {
pos uint32
ksize uint32
pgid pgid
}
// key returns a byte slice of the node key.
func (n *branchPageElement) key(a) []byte {
buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos]))[:n.ksize]
}
Copy the code
Part of the code related to branchPage is extracted. The following characteristics can be seen from the above:
branch page
The data part of: stores each one that starts from this branchindex node
Metadata informationindex data
Take space for time, divided into:branchPageElement
: Each of the storage branchesbranch node
metadatakey
: Real data [can just find the next onenode
The key of 】
branchPageElement
Parameter Meaning:pos
:page data
Partial offset, and you can find the correspondingkey
ksize
: Size of the keypgid
: the next onebranch page
The page id
- from
key()
The function shows that:- from
element
的pos
Position start, reade.kszie
The offset. - Note: get the current first
ele
的Pointer address
,Because the entire data segment is a contiguous arraySo we can passele.pos
, add the offset, and find the correspondingkey
- from
branchPageElement()
Also from thepage
的[]branchPageElement
Read the corresponding index. To showbranch page
The storage is a continuousbranchPageElement
.
However, in the ETCD version, key() is rewritten as: reflect.sliceheader
// key returns a byte slice of the node key. func (n *branchPageElement) key(a) []byte { return* (* []byte)(unsafe.Pointer(&reflect.SliceHeader{ Data: uintptr(unsafe.Pointer(n)) + uintptr(n.pos), Len: int(n.ksize), Cap: int(n.ksize), })) } Copy the code
This is zero-copy converted to []byte:
- First get the address to which the structure pointer points, + corresponding offset => get data
key slice
Len and Cap are both thereele
的ksize
The above code is also in page to node, which reads data from disk into memory. Of course the important one is read() :
// read initializes the node from a page.
func (n *node) read(p *page){ n.pgid = p.id n.isLeaf = ((p.flags & leafPageFlag) ! =0)
n.inodes = make(inodes, int(p.count))
for i := 0; i < int(p.count); i++ {
inode := &n.inodes[i]
if n.isLeaf {
elem := p.leafPageElement(uint16(i))
inode.flags = elem.flags
inode.key = elem.key()
inode.value = elem.value()
} else {
elem := p.branchPageElement(uint16(i))
inode.pgid = elem.pgid
inode.key = elem.key()
}
_assert(len(inode.key) > 0."read: zero-length inode key")}... }Copy the code
Read () shows the layout of the Branch page as a whole:
- Gets the pGID from the page, indicating which page the current node is from
[]inodes
Length ofpage.count
, which is how many nodes there are currentlyThe child nodeisLeaf
由page.flags
Decide [also decide laterif
Assignment of branches- by
p.count
The number of loops each oneelement
And corresponding ton.inodes
Each assignment of
This maps the disk’s B + Tree into memory. This section will be further explored later.
leaf page
We can also see the assignment to the Leaf Node field from read() above:
Github1s.com/boltdb/bolt…
type leafPageElement struct {
flags uint32
pos uint32
ksize uint32
vsize uint32
}
Copy the code
flags
Distinguish:subbucket
andOrdinary leaf node
【 aboutbucket
We’ll talk about that later.pos
: Is the same as that on the Branch page and corresponds to the distanceleafPageElement
The offsetksize/vsize
:key/value
Storage size
Where value is positioned in read() :
// read initializes the node from a page.
func (n *node) read(p *page){...for i := 0; i < int(p.count); i++ {
inode := &n.inodes[i]
if n.isLeaf {
elem := p.leafPageElement(uint16(i))
inode.flags = elem.flags
inode.key = elem.key()
inode.value = elem.value() // It is this line...
} else{... } _assert(len(inode.key) > 0."read: zero-length inode key")}... }// value returns a byte slice of the node value.
func (n *leafPageElement) value(a) []byte {
buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos+n.ksize]))[:n.vsize:n.vsize]
}
Copy the code
Key: &element + pos == &key
Value: &leafPageElement + pos + ksize == &val
conclusion
This paper understands the structure of BOLtDB on disk from the overall page layout. The next article will look at boltDB in memory operation and design ideas.