Preface:
By chance, I came into contact with the H2 database and learned that it supports memory, file mode, and can run in embedded mode, server mode, and cluster mode. H2 database is small but complete, support transactions, MVCC and so on. I recall that when I was reading, I always wanted to understand the operation principle of a database, and how to understand the process of an SQL from interpretation to execution. I had to put it on hold until I saw this H2 and decided to start with it. In this article, I have a little introduction to MVStore, the storage engine of the H2 database (see the website for more information about MVStore). Level is not enough, you are welcome to correct.
I. Test method use cases
Oracle stores data in tablespaces, segments, extents, blocks, and data files. MySQL InnoDB storage management is similar, but MySQL adds a concept of shared and independent table Spaces.
Similar to Mysql and Oracle, H2’s storage structure is actually divided into layers: file, chunk and page. The basic unit of storage is block(generally 4k bytes). Flie contains multiple chunks. Each time a new version is written, there is one chunk. A chunk consists of multiple blocks. H2 uses a copy on write btree method. Each time the btree is modified, a new root page will be written in this chunk. Each transaction only needs to save its own root page.
Mysql InnoDB storage structure
Both top-down and bottom-up approaches are useful when reading H2 source code. When I first saw the source code, I didn’t know how to start, run, and execute SQL. After having a general understanding, debug part of the code to understand how each module is organized and operated from bottom to top. Plus, UT is a good thing. The H2 source code contains a lot of unit testing code. There are many unit tests in the test file, and each test method is relatively fine-grained, from the method name to know what function is tested. Let me start with one of the methods called testExample() and walk through the logic of MVStore. Interested in reading partners can also start from this test method to reduce detours.
Unit testing in H2 source code
Unit testing in H2 source code
TestExample () method
private void testExample(a) {
String fileName = getBaseDir() + "/" + getTestName();
FileUtils.delete(fileName);
// open the store (in-memory if fileName is null)
try (MVStore s = MVStore.open(fileName)) {
// create/get the map named "data"
MVMap<Integer, String> map = s.openMap("data");
// add and read some data
map.put(1."Hello World");
// System.out.println(map.get(1));
}
try (MVStore s = MVStore.open(fileName)) {
MVMap<Integer, String> map = s.openMap("data");
assertEquals("Hello World", map.get(1)); }}Copy the code
2. Introduction to the format of storage files
The H2 database data is stored in a single file. The file contains two headers and a series of chunks for security purposes. Each header takes up a block, and each block is 4096 bytes. Each chunk has at least one block, but usually 200 blocks or more. Data is stored in chunks in the form of log structured storage. There is one chunk for each version.
H2 stores files like this:
[ file header 1 ] [ file header 2 ] [ chunk ] [ chunk ] ... [ chunk ]
Copy the code
Each chunk contains several pages of the B+ tree. In the following example, there will be two chunks.
MVStore s = MVStore.open(fileName);
MVMap<Integer, String> map = s.openMap("data");
for (int i = 0; i < 400; i++) {
map.put(i, "Hello");
}
s.commit();
for (int i = 0; i < 100; i++) {
map.put(i, "Hi");
}
s.commit();
s.close();
Copy the code
The Chunk 1:
-Page 1: indicates the root node, pointing to two child nodes Page 2 and Page 3
-Page 2: leaf node, containing 140 elements (keys 0 to 139)
-Page 3: leaf node, containing 260 elements (keys 140~ 399)
The Chunk 2:
-Page 4: indicates the root node, pointing to two child nodes Page 5 and Page 3
-Page 5: leaf, containing 140 elements (keys 0 to 139)
This means that each chunk contains changes to each version: the new version of the page and its root node. As in the example here, Page2 gets a new Page when it is changed, and then pages from its parent node all the way to the root node get a new version of Page.
For the first time the commit
The second commit
File header
There are two headers in the storage file, both of which have the same content. When headers are updated, there is a “partial failure” in which one of the headers is destroyed. So the purpose of the second header is to update only when both headers have been successfully updated (called “in-place update”). The file header looks like this:
H:2,block:2,blockSize:1000,chunk:7,created:1441235ef73,format:1,version:7,fletcher:3044e6cc
Copy the code
Data is in the form of key-value pairs, where values are hexadecimal numbers. The form is as follows:
-
H: “H:2” refers to the H2 database
-
Block: the location where one of the latest chunks starts (it does not need to be the latest chuncks) —-
-
BlockSize: File size (in bytes). Currently, the value is 1000 in hexadecimal or 4096 in decimal. Exactly the same size as the disk sector.
-
Chunk: indicates the CHUNK ID. Usually the same as the version number. However, the chunk ID may be rolled back to 0, but the version number will not.
-
Created: file creation time (from 1970 to the current number of milliseconds)
-
Format: Indicates the file format number. The current value is 1
-
Version: indicates the version number of Chunk
-
Fletcher: Fletcher tests and
When the file is opened, two file headers are read in and checked and used for validation. If both headers match, the header with the newer version number will be used. The latest version of the chunk is known, and the rest of the metadata is known from the chunk. If the header file does not contain the chunk ID, block, or version number. The latest chunk will start with the last chunk in the file
4. Chunk format
Each version will have chunk. Each chunk contains a head, in this version the modified page, and a tail. The page contains the actual data for the map. Pages in a chunk are stored right after the header. The size of a chunk is a multiple of the size of a block. The tail is stored in the last 128 bytes of chunk.
[ header ] [ page ] [ page ] ... [ page ] [ footer ]
Copy the code
A footer can be used to verify that a chunk has been written in its entirety. Each write operation has a chunk and can find the starting position of the last chunk in the file. The head and tail of chunk contain the following data:
chunk:1,block:2,len:1,map:6,max:1c0,next:3,pages:2,root:4000004f8c,time:1fc,version:1
chunk:1,block:2,version:1,fletcher:aed9a4f6
Copy the code
- The chunk: the chunk id
- Block: the location of the first block in a chunk (multiplied by the size of the block to get the location in the file)
- Len: The size of chunk is expressed by the number of blocks
- Map: the ID of the latest map. This ID is incremented when a map is created
- Max: Sum of the sizes of all the largest pages
- Next: The predicted starting position of the next chunk
- Pages: The number of pages in the chunk
- Root: the location of the root page of metadata.
- Time: time when chunk is written. In milliseconds after the file was written
- Version: indicates the version number of the chunk
- Fletcher: Checksum
Chunks are never updated in place, and each chunk contains the pages that were modified in that version (there is one page for each version), as well as the parent pages of those pages, all the way to the root page. If elements in a map are modified, deleted, or added, the corresponding pages are copied, modified, and stored in the next chunk, and the live pages in the old chunk are reduced. This mechanism, called copy-on-write, works in a similar way to the Btrfs file system. Chunks without live Pages are marked as free, so Chunks can be reused. Since not all chunks are of the same size, there are chunks that are free in front of a chunk. In a free (stackoverflow.com/questions/1…
The latest chunk is determined when opening the H2 persistent file: the file header contains the location of the most recent chunk, which is not necessarily the latest chunk. This is done to reduce the number of header updates. When a file is opened, the header and the end of the latest chunk are read in. If the chunk that was first read in has a pointer to next, the chunk that the next pointer points to is also read in. Until the latest chunk is read in.
This paragraph is a bit confusing, you can directly read the source code implementation, later through the source code to see how to read into the latest version of chunk. One of the file headers looks like this
As mentioned earlier, the first two blocks are used to hold the headers that store files
When the file header is opened, the chunk header and tail are read first. The block and chunkId parameters are in the file head.
Read the tail of Chunk
The head and tail of chunk are checked
The procedure for opening a vstore is as follows:
MVStore s = MVStore.open(fileName);
Copy the code
- Read the header first.
- After reading the first header, go to the corresponding position in the file according to the block field in the header (block * bolck_size). Read the corresponding Chunk header
Here are examples of file headers and chunk headers
The file header
chunk
// The block argument is passed in from the block field in the file header. The expected parameter is also the chunk field in the file header
private Chunk readChunkHeaderAndFooter(long block, int expectedId) {
Chunk header = readChunkHeaderOptionally(block, expectedId);
if(header ! =null) {
Chunk footer = readChunkFooter(block + header.len);
if (footer == null|| footer.id ! = expectedId || footer.block ! = header.block) {return null; }}return header;
}
//
private Chunk readChunkHeaderOptionally(long block, int expectedId) {
Chunk chunk = readChunkHeaderOptionally(block);
return chunk == null|| chunk.id ! = expectedId ?null : chunk;
}
private Chunk readChunkHeaderOptionally(long block) {
try {
Chunk chunk = readChunkHeader(block);
returnchunk.block ! = block ?null : chunk;
} catch (Exception ignore) {
return null; }}private Chunk readChunkHeader(long block) {
long p = block * BLOCK_SIZE;
ByteBuffer buff = fileStore.readFully(p, Chunk.MAX_HEADER_LENGTH);
return Chunk.readChunkHeader(buff, p);
}
static Chunk readChunkHeader(ByteBuffer buff, long start) {
int pos = buff.position();
byte[] data = new byte[Math.min(buff.remaining(), MAX_HEADER_LENGTH)];
buff.get(data);
try {
for (int i = 0; i < data.length; i++) {
if (data[i] == '\n') {
// set the position to the start of the first page
buff.position(pos + i + 1);
String s = new String(data, 0, i, StandardCharsets.ISO_8859_1).trim();
returnfromString(s); }}}catch (Exception e) {
// there could be various reasons
throw DataUtils.newMVStoreException(
DataUtils.ERROR_FILE_CORRUPT,
"File corrupt reading chunk at position {0}", start, e);
}
throw DataUtils.newMVStoreException(
DataUtils.ERROR_FILE_CORRUPT,
"File corrupt reading chunk at position {0}", start);
}
Copy the code
-
Read the tail of chunk to check
private Chunk readChunkHeaderAndFooter(long block, int expectedId) { Chunk header = readChunkHeaderOptionally(block, expectedId); if (header ! = null) { Chunk footer = readChunkFooter(block + header.len); if (footer == null || footer.id ! = expectedId || footer.block ! = header.block) { return null; } } return header; }Copy the code
-
Read the second file header. If the version field is larger than the version of chunk read from the first file header, proceed to read chunk as in the previous step
So let’s go back
When the file is opened, the header and the end of the latest chunk will be read in. If the chunk that was first read in has a pointer to next, the chunk that the next pointer points to is also read in. Until the latest chunk is read in.
See below, in the while loop, the current chunk (the next chunk to which next refers) is read in, and if the version of the next chunk is less than or equal to the current version, it terminates. Thus, the newest version of the newest refers to chunk.
- After identifying the newest chunk, or chunk, it’s down here.
if (assumeCleanShutdown) {
// quickly check latest 20 chunks referenced in meta table
Queue<Chunk> chunksToVerify = new PriorityQueue<>(20, Collections.reverseOrder(chunkComparator));
try {
setLastChunk(newest);
// load the chunk metadata: although meta's root page resides in the lastChunk,
// traversing meta map might recursively load another chunk(s)
Cursor<String, String> cursor = layout.cursor(DataUtils.META_CHUNK);
while (cursor.hasNext() && cursor.next().startsWith(DataUtils.META_CHUNK)) {
Chunk c = Chunk.fromString(cursor.getValue());
assert c.version <= currentVersion;
// might be there already, due to meta traversal
// see readPage() ... getChunkIfFound()
chunks.putIfAbsent(c.id, c);
chunksToVerify.offer(c);
if (chunksToVerify.size() == 20) {
chunksToVerify.poll();
}
}
Chunk c;
while(assumeCleanShutdown && (c = chunksToVerify.poll()) ! =null) { Chunk test = readChunkHeaderAndFooter(c.block, c.id); assumeCleanShutdown = test ! =null;
if(assumeCleanShutdown) { validChunksByLocation.put(test.block, test); }}}catch(MVStoreException ignored) {
assumeCleanShutdown = false; }}Copy the code
Quickly check latest 20 chunks referenced in meta table
You can see it’s to read the latest 20 chunks. A priority queue is used to sort chunks. But the chunkComparator variable is sorted in reverse order. The chunkComparator is defined as follows:
Comparator<Chunk> chunkComparator = (one, two) -> {
int result = Long.compare(two.version, one.version);
if (result == 0) {
// out of two copies of the same chunk we prefer the one
// close to the beginning of file (presumably later version)
result = Long.compare(one.block, two.block);
}
return result;
};
Copy the code
The first step is to compare the versions of the two chunks, and whichever version is bigger is ranked first. If the version is the same, the size of the block is compared, and the smaller block is ranked first. Thus, the flashback is that whoever has a smaller version is ranked first, and whoever has a larger block is ranked first.
Note the method:
setLastChunk(newest);
Copy the code
The latest chunk identified in the previous step will be used to initialize the LAYOUT MVMap. The details will not be introduced here, but the structure of the Page will be easier to understand later
After going through setLastChunk, the Layout MVMap will have a structure similar to the following
Let’s start with the following step:
Cursor<String, String> cursor = layout.cursor(DataUtils.META_CHUNK);
while (cursor.hasNext() && cursor.next().startsWith(DataUtils.META_CHUNK)) {
Chunk c = Chunk.fromString(cursor.getValue());
assert c.version <= currentVersion;
// might be there already, due to meta traversal
// see readPage() ... getChunkIfFound()
chunks.putIfAbsent(c.id, c);
chunksToVerify.offer(c);
if (chunksToVerify.size() == 20) { chunksToVerify.poll(); }}Copy the code
DataUtils.META_CHUNK is the string constant chunk. The while loop is used to read chunks from the layout and put them in chunks and chunksToVerify. And keep the number of priority queues to a maximum of 19.
The cycle follows:
Chunks are read from the priority queue in turn and placed in the map validChunksByLocation
while(assumeCleanShutdown && (c = chunksToVerify.poll()) ! =null) { Chunk test = readChunkHeaderAndFooter(c.block, c.id); assumeCleanShutdown = test ! =null;
if(assumeCleanShutdown) { validChunksByLocation.put(test.block, test); }}Copy the code
6. The last step is some rational work
fileStore.clear();
// build the free space list
for (Chunk c : chunks.values()) {
if (c.isSaved()) {
long start = c.block * BLOCK_SIZE;
int length = c.len * BLOCK_SIZE;
fileStore.markUsed(start, length);
}
if (!c.isLive()) {
deadChunks.offer(c);
}
}
Copy the code