
Author: Xiao Kai | Source: Alibaba Technology official account

Preface

The LSM-tree is the underlying storage implementation of many NoSQL database engines, such as LevelDB and HBase. Following the design ideas for LSM-tree databases in "Designing Data-Intensive Applications", this article walks through a complete mini database, with about 500 lines of core code, so that theory and practice together give a better understanding of how a database works.

Part One: SSTable

A previous article implemented a database based on a hash index. Its limitation is that the entire hash table must fit in memory, and range queries are inefficient.

In the log of a hash-index database, keys are stored in the order they are written, and a later occurrence of a key takes precedence over any earlier one, so the order of entries in the log does not matter. The advantage is that writing is easy (just append); the problem is that, with no control over duplicate keys, storage space is wasted and the initial loading time grows.

Now simply change the requirements for writing the log: keys must be written in sorted order, and a key may appear only once in a log. A log of this kind is called an SSTable (Sorted String Table), and it has the following advantages over hash-indexed logs:

1) Merging multiple log files is simpler and more efficient.

Because each log is sorted, we can use a file merge-sort algorithm: read the input files concurrently, compare the first key of each file, and copy the smallest to the output file in order. If a key appears in several files, keep only the value from the newest log and discard the older ones. A sketch of such a merge follows.
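
As a concrete sketch (the article does not show TinyKvStore's actual merge code, so all names here are illustrative), a multi-way merge over key-sorted inputs can be written with a priority queue, with the entry from the newest file winning on duplicate keys:

```java
import java.util.*;

/**
 * Minimal k-way merge sketch: each input iterates key-sorted (key, value)
 * entries; inputs.get(0) is the newest log file. On equal keys, the entry
 * from the newest (lowest-index) file is kept and older duplicates dropped.
 */
final class SortedLogMerger {

    static final class Cursor {
        final Iterator<Map.Entry<String, String>> it;
        final int sourceIndex; // lower index = newer file
        Map.Entry<String, String> current;

        Cursor(Iterator<Map.Entry<String, String>> it, int sourceIndex) {
            this.it = it;
            this.sourceIndex = sourceIndex;
            this.current = it.hasNext() ? it.next() : null;
        }
    }

    static List<Map.Entry<String, String>> merge(List<Iterator<Map.Entry<String, String>>> inputs) {
        // Order by key, breaking ties by source index so the newest entry is polled first
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.current.getKey())
                          .thenComparingInt(c -> c.sourceIndex));
        for (int i = 0; i < inputs.size(); i++) {
            Cursor c = new Cursor(inputs.get(i), i);
            if (c.current != null) {
                heap.add(c);
            }
        }
        List<Map.Entry<String, String>> out = new ArrayList<>();
        String lastKey = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            // Emit only the first (newest) occurrence of each key
            if (!c.current.getKey().equals(lastKey)) {
                out.add(c.current);
                lastKey = c.current.getKey();
            }
            c.current = c.it.hasNext() ? c.it.next() : null;
            if (c.current != null) {
                heap.add(c); // re-insert the cursor with its next entry
            }
        }
        return out;
    }
}
```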

2) When querying a key, you do not need to keep an index of every key in memory.

For example, suppose we need to find the key handiwork and its location is not recorded in memory. Because the SSTable is sorted, we know that if handiwork exists, it must lie somewhere between handbag and handsome, so we only need to scan the log between handbag and handsome. The benefits are threefold (a lookup sketch follows the list):

  • Only a sparse index needs to be kept in memory, reducing the size of the in-memory index.
  • Queries do not need to read the entire log, reducing file I/O.
  • Range queries are supported.
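
A minimal sketch of that sparse-index lookup using Java's TreeMap (the map, field names, and offsets here are illustrative; the article's actual Position class appears later in the code analysis):

```java
import java.util.*;

/** Sketch of a sparse-index lookup: only every Nth key is indexed,
 *  together with the byte offset of the segment that starts at that key. */
final class SparseIndexLookup {

    // first key of a segment -> byte offset of that segment in the file
    static TreeMap<String, Long> sparseIndex = new TreeMap<>();

    /** Returns the [startOffset, endOffset) range of the log that could
     *  contain the key, or null if the key precedes the first indexed key. */
    static long[] segmentRange(String key, long fileLength) {
        Map.Entry<String, Long> floor = sparseIndex.floorEntry(key);   // last indexed key <= key
        if (floor == null) {
            return null; // key sorts before everything in the file
        }
        Map.Entry<String, Long> higher = sparseIndex.higherEntry(key); // first indexed key > key
        long end = (higher != null) ? higher.getValue() : fileLength;
        return new long[]{floor.getValue(), end};
    }

    public static void main(String[] args) {
        sparseIndex.put("handbag", 0L);
        sparseIndex.put("handsome", 4096L);
        // "handiwork" sorts between the two indexed keys, so scan bytes [0, 4096)
        System.out.println(Arrays.toString(segmentRange("handiwork", 8192L)));
    }
}
```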

Part Two: Building and Maintaining the SSTable

We know that keys arrive in arbitrary order as writes come in. How can we ensure that the keys in an SSTable are sorted? A simple and convenient way is to first save writes into an in-memory red-black tree, which keeps its keys ordered, and later write the tree out to a log file.
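
In Java, that in-memory red-black tree is exactly what TreeMap provides. A minimal illustration of why flushing it produces a sorted file:

```java
import java.util.TreeMap;

public class MemTableOrderDemo {
    public static void main(String[] args) {
        TreeMap<String, String> memTable = new TreeMap<>();
        // Writes arrive in arbitrary order...
        memTable.put("handsome", "v1");
        memTable.put("handbag", "v2");
        memTable.put("handiwork", "v3");
        // ...but iteration is always key-ordered, so writing the tree
        // to disk front-to-back yields a sorted SSTable for free.
        memTable.forEach((k, v) -> System.out.println(k + " = " + v));
        // prints: handbag, handiwork, handsome
    }
}
```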

The basic working process of the storage engine is as follows:

  • When a write arrives, it is first added to an in-memory red-black tree, called the memory table.

  • When the memory table grows beyond a certain threshold, it is written to disk as an SSTable file. Because the tree is sorted, this is a sequential write.

    • To avoid losing data if the database crashes before the memory table is written to a file, each write can also be appended to a separate write-ahead log (WAL) as it is saved to the memory table; even if the database crashes, it can recover from the WAL. This log is written like the hash-index log and does not need to be sorted, because it exists only to recover data.
  • When a read request is processed, first look for the key in the memory table, then search the SSTable logs from newest to oldest until the data is found or there is nothing left to search.

  • Background processes periodically merge and compact the logs, discarding values that have been overwritten or deleted.

The algorithm above is the LSM-tree (Log-Structured Merge-Tree). Storage engines built on this principle of merging and compacting files are usually called LSM storage engines; this is the basic principle behind HBase and LevelDB.

Part Three: Implementing an LSM-based Database

Now that we understand the LSM-tree algorithm, there are many design problems to consider in a concrete implementation. I will pick out some key design decisions for analysis.

1 Memory table storage structure

What should the values of the memory table store: the raw data directly, or write commands (including set and rm)? This is our first design problem. Let's hold off on a judgment and look at the next question first.

Once the memory table reaches a certain size, it is written to a log file for persistence. This process is easy if we simply block writes during it. But what if we want the memory table to keep serving reads and writes while it is being written to a file?

One solution: when persisting memory table A, switch the current memory-table pointer to a new memory-table instance B. After the switch we must ensure that A is read-only and only B is writable; otherwise we cannot guarantee that persisting A is an atomic operation.

  • Get request: query B first, then A, then the SSTables.
  • Set request: write directly to B.
  • Rm request: suppose key1 was set before the switch, so its entry exists only in A, not in B. If the memory table stored raw data, the rm could not be processed, because A is read-only; the delete would fail. The problem is solvable if the memory table stores commands instead: we write the rm command into B, and key1 is then logically deleted as far as B is concerned.

Therefore, if we persist the memory table with writes blocked, values can store raw data directly; but if we want the table to stay readable and writable during persistence, values must store commands. For high performance we need reads and writes during persistence, so values must store commands. HBase is designed the same way. A sketch of this representation follows.
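
For illustration, here is a minimal sketch of what "storing commands" looks like as a data structure; TinyKvStore's real SetCommand and RmCommand classes may differ in detail (for example, extra fields for JSON serialization):

```java
/** Minimal sketch of command objects stored in the memory table. */
interface Command {
    String getKey();
}

final class SetCommand implements Command {
    private final String key;
    private final String value;

    SetCommand(String key, String value) {
        this.key = key;
        this.value = value;
    }

    public String getKey()   { return key; }
    public String getValue() { return value; }
}

/** A deletion is just another command (a "tombstone"): appending it to the
 *  writable memory table logically deletes a key that lives only in the
 *  read-only memory table or in older SSTables. */
final class RmCommand implements Command {
    private final String key;

    RmCommand(String key) { this.key = key; }

    public String getKey() { return key; }
}
```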

In addition, if the memory table exceeds the threshold while a previous persistence is still running, it must wait for that persistence to finish before starting its own. In other words, memory-table persistence can only happen serially.

2 File format of the SSTable

To achieve efficient file reading, we need to design the file format well.

Here is my SSTable log format (a rough schematic follows the list):

  • Data area: stores the written commands, divided into segments of a certain size so that they can be read one segment at a time.
  • Sparse index area: stores the position of each data segment in the file. When an SSTable is read, only the sparse index is loaded into memory; during a query, the relevant segment is loaded according to the sparse index.
  • File index area: stores the locations of the areas above.
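
Under the description above (and matching the meta fields that writeToFile saves in the code analysis below), the resulting file layout looks roughly like this:

```
+---------------------------- data area -------------------------------+
| segment 1 (JSON commands) | segment 2 | ... | segment N               |
+------------------------ sparse index area ---------------------------+
| JSON map: first key of each segment -> (start offset, length)        |
+--------------------- file index (meta info) area --------------------+
| partSize | dataStart | dataLen | indexStart | indexLen | version     |
+-----------------------------------------------------------------------+
```

The areas are written front to back; the fixed-size meta block at the end is read first, back to front, which is how the loader finds the sparse index.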

This log format is a mini implementation. Compared with HBase's log format it makes the working principle easier to understand, and I write the file as JSON for readability. A production implementation puts efficiency first and compresses data to save storage.

Part Four: Code Implementation Analysis

The code I wrote is available as TinyKvStore. Let's look at the key parts. There is a fair amount of code and it is somewhat tedious, so if you only care about the principles you can skip this section; if you want to know how the code works, read on.

1 SSTable

Memory table persistence

Persisting a memory table to an SSTable means writing the memory table's data to a file in the log format described earlier. What the SSTable stores are the data commands, including set and rm: as long as we know the latest command for a key, we know that key's state in the database.

```java
/**
 * Build the SsTable from the memory table.
 * @param index
 */
private void initFromIndex(TreeMap<String, Command> index) {
    try {
        JSONObject partData = new JSONObject(true);
        tableMetaInfo.setDataStart(tableFile.getFilePointer());
        for (Command command : index.values()) {
            // Handle the set command
            if (command instanceof SetCommand) {
                SetCommand set = (SetCommand) command;
                partData.put(set.getKey(), set);
            }
            // Handle the rm command
            if (command instanceof RmCommand) {
                RmCommand rm = (RmCommand) command;
                partData.put(rm.getKey(), rm);
            }
            // Once a segment is full, write the data part to the file
            if (partData.size() >= tableMetaInfo.getPartSize()) {
                writeDataPart(partData);
            }
        }
        // Write whatever remains in the last, possibly partial segment
        if (partData.size() > 0) {
            writeDataPart(partData);
        }
        long dataPartLen = tableFile.getFilePointer() - tableMetaInfo.getDataStart();
        tableMetaInfo.setDataLen(dataPartLen);
        // Save the sparse index
        byte[] indexBytes = JSONObject.toJSONString(sparseIndex).getBytes(StandardCharsets.UTF_8);
        tableMetaInfo.setIndexStart(tableFile.getFilePointer());
        tableFile.write(indexBytes);
        tableMetaInfo.setIndexLen(indexBytes.length);
        LoggerUtil.debug(LOGGER, "[SsTable][initFromIndex][sparseIndex]: {}", sparseIndex);
        // Save the file meta info
        tableMetaInfo.writeToFile(tableFile);
        LoggerUtil.info(LOGGER, "[SsTable][initFromIndex]: {},{}", filePath, tableMetaInfo);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
```

The write format is designed around how it will be read back, mainly for ease of reading. For example, tableMetaInfo is written front to back and read back to front. This is also why version is written last: it is read first, which makes it possible to upgrade the log format later. Tricks like these are hard to appreciate without trying them.

```java
/**
 * Write the meta info to the end of the file.
 * @param file
 */
public void writeToFile(RandomAccessFile file) {
    try {
        file.writeLong(partSize);
        file.writeLong(dataStart);
        file.writeLong(dataLen);
        file.writeLong(indexStart);
        file.writeLong(indexLen);
        file.writeLong(version);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

/**
 * Read the meta info from the file, in the reverse of the order it was written.
 * @param file
 * @return
 */
public static TableMetaInfo readFromFile(RandomAccessFile file) {
    try {
        TableMetaInfo tableMetaInfo = new TableMetaInfo();
        long fileLen = file.length();
        file.seek(fileLen - 8);
        tableMetaInfo.setVersion(file.readLong());
        file.seek(fileLen - 8 * 2);
        tableMetaInfo.setIndexLen(file.readLong());
        file.seek(fileLen - 8 * 3);
        tableMetaInfo.setIndexStart(file.readLong());
        file.seek(fileLen - 8 * 4);
        tableMetaInfo.setDataLen(file.readLong());
        file.seek(fileLen - 8 * 5);
        tableMetaInfo.setDataStart(file.readLong());
        file.seek(fileLen - 8 * 6);
        tableMetaInfo.setPartSize(file.readLong());
        return tableMetaInfo;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
```

Loading an SSTable from a file

When loading an SSTable from a file, only the sparse index needs to be loaded, which saves memory. The data area and everything else can simply be read on demand at query time.

```java
/**
 * Restore the SsTable from a file: load only the meta info and the sparse index.
 */
private void restoreFromFile() {
    try {
        // First read the meta info from the end of the file
        TableMetaInfo tableMetaInfo = TableMetaInfo.readFromFile(tableFile);
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][tableMetaInfo]: {}", tableMetaInfo);
        // Read the sparse index
        byte[] indexBytes = new byte[(int) tableMetaInfo.getIndexLen()];
        tableFile.seek(tableMetaInfo.getIndexStart());
        tableFile.read(indexBytes);
        String indexStr = new String(indexBytes, StandardCharsets.UTF_8);
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][indexStr]: {}", indexStr);
        sparseIndex = JSONObject.parseObject(indexStr,
                new TypeReference<TreeMap<String, Position>>() {
                });
        this.tableMetaInfo = tableMetaInfo;
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][sparseIndex]: {}", sparseIndex);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
```

SSTable query

To query data from an SSTable, first find in the sparse index the interval the key falls into. If such an interval exists, read the interval's data according to the positions recorded in the index and search it: return the data if it is there, otherwise return null.

```java
/**
 * Query a key from this SsTable.
 * @param key
 * @return the command for the key, or null if the key is not in this table
 */
public Command query(String key) {
    try {
        LinkedList<Position> sparseKeyPositionList = new LinkedList<>();
        Position lastSmallPosition = null;
        Position firstBigPosition = null;
        // From the sparse index, select the last position <= key and the first position > key
        for (String k : sparseIndex.keySet()) {
            if (k.compareTo(key) <= 0) {
                lastSmallPosition = sparseIndex.get(k);
            } else {
                firstBigPosition = sparseIndex.get(k);
                break;
            }
        }
        if (lastSmallPosition != null) {
            sparseKeyPositionList.add(lastSmallPosition);
        }
        if (firstBigPosition != null) {
            sparseKeyPositionList.add(firstBigPosition);
        }
        if (sparseKeyPositionList.size() == 0) {
            return null;
        }
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][sparseKeyPositionList]: {}", sparseKeyPositionList);
        Position firstKeyPosition = sparseKeyPositionList.getFirst();
        Position lastKeyPosition = sparseKeyPositionList.getLast();
        long start = 0;
        long len = 0;
        start = firstKeyPosition.getStart();
        if (firstKeyPosition.equals(lastKeyPosition)) {
            len = firstKeyPosition.getLen();
        } else {
            len = lastKeyPosition.getStart() + lastKeyPosition.getLen() - start;
        }
        // If the key exists it must be in this interval, so read only the interval to reduce I/O
        byte[] dataPart = new byte[(int) len];
        tableFile.seek(start);
        tableFile.read(dataPart);
        int pStart = 0;
        // Scan each segment in the interval for the key
        for (Position position : sparseKeyPositionList) {
            JSONObject dataPartJson = JSONObject.parseObject(new String(dataPart, pStart, (int) position.getLen()));
            LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][dataPartJson]: {}", dataPartJson);
            if (dataPartJson.containsKey(key)) {
                JSONObject value = dataPartJson.getJSONObject(key);
                return ConvertUtil.jsonToCommand(value);
            }
            pStart += (int) position.getLen();
        }
        return null;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
```

2 LsmKvStore

Initial loading

  • dataDir: the data directory where the log data lives; previously persisted data must be read from this directory at startup.
  • storeThreshold: the persistence threshold; once a memory table exceeds it, the table must be persisted.
  • partSize: the data-segment threshold of the SSTable.
  • indexLock: the read-write lock for the memory table.
  • ssTables: an ordered list of SSTables, sorted from newest to oldest.
  • wal: the sequentially written log that stores the memory table's data for recovery.

The startup process is simple: load the configuration, initialize the fields, and restore data into the memory table if necessary.

```java
/**
 * Initialize the store.
 * @param dataDir data directory
 * @param storeThreshold persistence threshold
 * @param partSize data-segment size
 */
public LsmKvStore(String dataDir, int storeThreshold, int partSize) {
    try {
        this.dataDir = dataDir;
        this.storeThreshold = storeThreshold;
        this.partSize = partSize;
        this.indexLock = new ReentrantReadWriteLock();
        File dir = new File(dataDir);
        File[] files = dir.listFiles();
        ssTables = new LinkedList<>();
        index = new TreeMap<>();
        // If the directory is empty there is no SsTable to load
        if (files == null || files.length == 0) {
            walFile = new File(dataDir + WAL);
            wal = new RandomAccessFile(walFile, RW_MODE);
            return;
        }
        // Load the SsTables from newest to oldest
        TreeMap<Long, SsTable> ssTableTreeMap = new TreeMap<>(Comparator.reverseOrder());
        for (File file : files) {
            String fileName = file.getName();
            // Restore data from the temporary WAL left by an interrupted persistence
            if (file.isFile() && fileName.equals(WAL_TMP)) {
                restoreFromWal(new RandomAccessFile(file, RW_MODE));
            }
            // Load an SsTable
            if (file.isFile() && fileName.endsWith(TABLE)) {
                int dotIndex = fileName.indexOf(".");
                Long time = Long.parseLong(fileName.substring(0, dotIndex));
                ssTableTreeMap.put(time, SsTable.createFromFile(file.getAbsolutePath()));
            } else if (file.isFile() && fileName.equals(WAL)) {
                // Load the WAL and restore the memory table from it
                walFile = file;
                wal = new RandomAccessFile(file, RW_MODE);
                restoreFromWal(wal);
            }
        }
        ssTables.addAll(ssTableTreeMap.values());
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
```

Write operation

A write first takes the write lock, then saves the data to the memory table and the WAL, and finally checks whether the threshold is exceeded, persisting the table if so. For simplicity I do the persistence serially rather than in a thread pool, which does not affect the overall logic. The code for set and rm is similar, so I will not repeat it here.

```java
@Override
public void set(String key, String value) {
    try {
        SetCommand command = new SetCommand(key, value);
        byte[] commandBytes = JSONObject.toJSONBytes(command);
        indexLock.writeLock().lock();
        // Save the command to the WAL first: length, then body
        wal.writeInt(commandBytes.length);
        wal.write(commandBytes);
        index.put(key, command);
        // If the memory table exceeds the threshold, persist it
        if (index.size() > storeThreshold) {
            switchIndex();
            storeToSsTable();
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.writeLock().unlock();
    }
}
```

Memory table persistence process

Switching a memory table and its WAL: lock the memory table, create a new memory table and WAL, stash the old memory table and WAL aside, then release the lock. After that, new writes go to the new memory table and the old one becomes read-only.

Performing the persistence: write the old memory table to a new SSTable in key order, then discard the stashed memory table and delete the temporary WAL.

```java
/**
 * Switch the memory table: freeze the current one and create a new memory table and WAL.
 */
private void switchIndex() {
    try {
        indexLock.writeLock().lock();
        // Freeze the current memory table
        immutableIndex = index;
        index = new TreeMap<>();
        wal.close();
        // Rename the current WAL to the temporary WAL
        File tmpWal = new File(dataDir + WAL_TMP);
        if (tmpWal.exists()) {
            if (!tmpWal.delete()) {
                throw new RuntimeException("Failed to delete file: walTmp");
            }
        }
        if (!walFile.renameTo(tmpWal)) {
            throw new RuntimeException("Failed to rename file: walTmp");
        }
        walFile = new File(dataDir + WAL);
        wal = new RandomAccessFile(walFile, RW_MODE);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.writeLock().unlock();
    }
}

/**
 * Persist the frozen memory table to a new SsTable.
 */
private void storeToSsTable() {
    try {
        // Name the SsTable by timestamp so newer tables sort after older ones
        SsTable ssTable = SsTable.createFromIndex(dataDir + System.currentTimeMillis() + TABLE, partSize, immutableIndex);
        ssTables.addFirst(ssTable);
        // Persistence is done: drop the frozen memory table and delete the temporary WAL
        immutableIndex = null;
        File tmpWal = new File(dataDir + WAL_TMP);
        if (tmpWal.exists()) {
            if (!tmpWal.delete()) {
                throw new RuntimeException("Failed to delete file: walTmp");
            }
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
```

Query operation

The query proceeds exactly as the algorithm describes:

  • First look in the current memory table; if the key is not there and an immutable memory table exists, look there next.

  • If the key is in neither memory table, query the SSTables, from the newest to the oldest.

```java
@Override
public String get(String key) {
    try {
        indexLock.readLock().lock();
        // First look in the memory table
        Command command = index.get(key);
        // Then in the immutable memory table, if one exists
        if (command == null && immutableIndex != null) {
            command = immutableIndex.get(key);
        }
        if (command == null) {
            // Not in the memory tables: try the SsTables, from newest to oldest
            for (SsTable ssTable : ssTables) {
                command = ssTable.query(key);
                if (command != null) {
                    break;
                }
            }
        }
        if (command instanceof SetCommand) {
            return ((SetCommand) command).getValue();
        }
        if (command instanceof RmCommand) {
            return null;
        }
        // The key does not exist
        return null;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.readLock().unlock();
    }
}
```

Conclusion

True knowledge comes from the unity of knowing and doing. Without implementing a database ourselves, it is hard to understand the whys: why the log format is designed this way, why the database stores data operations rather than the data itself, and so on.

The database in this article is functionally simple and leaves many places to optimize, for example making persistence asynchronous, compacting log files, and filtering queries through a Bloom filter first. Interested readers can dig further.

Reference: Designing Data-Intensive Applications


This article is original content from Alibaba Cloud and may not be reproduced without permission.