The purpose of this article is to help more people understand rosedb. I will implement a simple K-V storage engine from scratch, supporting PUT, GET, and DELETE operations, which you can think of as a simplified version of rosedb. Let's call it minidb (mini rosedb).
Whether you are a Go beginner, want to level up your Go skills, or are simply interested in K-V storage, you can try implementing it yourself. I believe it will help you a lot.
When it comes to storage, the core problem to solve is how to store data and how to retrieve it. In the world of computing, this problem takes many forms.
A computer has both memory and disk. Memory is volatile: all data stored in it is lost on power failure. So if you want a system to remain usable after a crash and restart, you have to store data on a non-volatile medium, most commonly a disk.
Therefore, for a standalone K-V store, we need to design how data is laid out in memory and how it is laid out on disk.
Of course, many excellent predecessors have explored this problem, and a classic summary exists: data storage models mainly fall into two types, the B+ tree and the LSM tree.
These two models are not the focus of this article, so I will only introduce them briefly.
B+ tree
The B+ tree evolved from the binary search tree. By increasing the number of children per node, it reduces the tree height, which maps well onto disk pages and minimizes disk I/O operations.
The B+ tree provides stable performance. When data is written or updated, the system searches for the location on disk and performs an in-place update. Note that this involves random I/O, and a large number of inserts or deletes may trigger page splits and merges.
LSM tree
The LSM tree (Log-Structured Merge tree) is not a specific tree data structure but a data storage model, built on the fact that sequential I/O is much faster than random I/O.
Unlike the B+ tree, in an LSM tree data inserts, updates, and deletions are recorded in a log and appended to a disk file, so that all operations are sequential I/O.
The LSM tree is suitable for write-heavy, read-light scenarios.
Having looked at these two basic storage models, you should now have a basic understanding of how data is accessed. minidb is based on a much simpler storage structure that is broadly similar to the LSM tree.
Instead of going straight into the concepts of the model, I'll walk through the PUT, GET, and DELETE flows in minidb to give you an intuition for this simple storage model.
PUT
Suppose we need to store one piece of data, namely a key and a value. First, to prevent data loss, we encapsulate the key and value into a record (called an Entry here) and append it to the disk file. The contents of an Entry include the key, the value, the key size, the value size, and the write time.
Therefore, the disk file structure is very simple: it is a collection of multiple entries.
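For illustration, here is a minimal sketch of what such an Entry might look like in Go. The field names, the fixed header layout, and the Mark field (which records the operation type, PUT or DEL, as used by the code later in this article) are assumptions for this sketch; the write time mentioned above is omitted for brevity:

type Entry struct {
    Key       []byte
    Value     []byte
    KeySize   uint32
    ValueSize uint32
    Mark      uint16 // operation type: PUT or DEL (an assumption for this sketch)
}

// entryHeaderSize assumes a fixed header: KeySize (4) + ValueSize (4) + Mark (2)
const entryHeaderSize = 10

// NewEntry builds an Entry from a key, a value, and an operation mark
func NewEntry(key, value []byte, mark uint16) *Entry {
    return &Entry{
        Key:       key,
        Value:     value,
        KeySize:   uint32(len(key)),
        ValueSize: uint32(len(value)),
        Mark:      mark,
    }
}

// GetSize returns the number of bytes the Entry occupies on disk,
// which is how the code later in this article advances its read offset
func (e *Entry) GetSize() int64 {
    return int64(entryHeaderSize) + int64(e.KeySize) + int64(e.ValueSize)
}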
After the disk is updated, we update the in-memory index, for which a simple data structure such as a hash table suffices. The hash table maps each key to the position of its Entry in the disk file, which makes lookups easy.
Thus, in minidb, storing data takes two steps: append a record to the disk file, then update the in-memory index.
GET
First, the in-memory hash table is used to find the index information for the key, which contains the position of the value in the disk file. Then the value is read directly from disk at that position.
DEL
Then there is the deletion operation. Instead of locating and removing the original record, the deletion is also encapsulated as an Entry and appended to the disk file, with the Entry's type marked as deletion.
Then the key's index information is removed from the in-memory hash table, and the deletion is complete.
As you can see, insert, query, and delete each take only two steps: an in-memory index update and a disk file record append. So minidb's write performance is very stable regardless of data size.
Merge
Last but not least: since records are only ever appended, the disk file keeps growing. Moreover, the same key may have multiple entries in the file (recall that updating or deleting a key also appends a record), so the data file contains redundant entries.
For a simple example, if you set the value of key A to 10, then to 20, and then to 30, the disk file will contain three records for A.
The latest value of A is 30, so the first two records are already invalid.
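In terms of the Put method shown later in this article, the scenario looks like this (a sketch; db is an opened minidb instance):

// Each Put appends a new record; the earlier ones are never touched
db.Put([]byte("A"), []byte("10")) // file: [A=10]
db.Put([]byte("A"), []byte("20")) // file: [A=10][A=20]
db.Put([]byte("A"), []byte("30")) // file: [A=10][A=20][A=30], only the last record is valid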
In this case, the data file needs to be merged periodically to clear out invalid Entry data. This process is commonly called merge.
Merge reads every Entry in the original data file, writes the valid entries into a new temporary file, and deletes the original data file. The temporary file then becomes the new data file.
This is minidb's underlying data storage model. It is called Bitcask, and rosedb uses the same model. It is essentially an LSM-like model whose core idea is to use sequential I/O to improve write performance, but it is much simpler to implement than an LSM tree.
Having introduced the underlying storage model, we can move on to the code. The complete implementation is on my GitHub:
Github.com/roseduan/mi… .
This article excerpts some of the key code.
First, opening the database needs to load the data file, then iterate over the Entry data in the file to rebuild the in-memory index. The key part of the code is as follows:
func Open(dirPath string) (*MiniDB, error) {
    // If the database directory does not exist, create it
    if _, err := os.Stat(dirPath); os.IsNotExist(err) {
        if err := os.MkdirAll(dirPath, os.ModePerm); err != nil {
            return nil, err
        }
    }

    // Load the data file
    dbFile, err := NewDBFile(dirPath)
    if err != nil {
        return nil, err
    }

    db := &MiniDB{
        dbFile:  dbFile,
        indexes: make(map[string]int64),
        dirPath: dirPath,
    }

    // Load the index
    db.loadIndexesFromFile(dbFile)
    return db, nil
}
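The loadIndexesFromFile step is not excerpted in this article. Here is a minimal sketch of what it might do, assuming the DBFile.Read method, the Mark field, and the DEL constant used elsewhere in this article: scan the file entry by entry, point each key's index at its latest entry, and drop keys whose latest entry is a deletion.

func (db *MiniDB) loadIndexesFromFile(dbFile *DBFile) {
    if dbFile == nil {
        return
    }
    var offset int64
    for {
        e, err := dbFile.Read(offset)
        if err != nil {
            // io.EOF means the whole file has been read
            break
        }
        // Point the index at the latest entry for this key
        db.indexes[string(e.Key)] = offset
        if e.Mark == DEL {
            // A deletion record invalidates the key
            delete(db.indexes, string(e.Key))
        }
        offset += e.GetSize()
    }
}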
The PUT method first writes a record to disk and then updates the in-memory index, as described above:
func (db *MiniDB) Put(key []byte, value []byte) (err error) {
    offset := db.dbFile.Offset
    // Encapsulate the data into an Entry
    entry := NewEntry(key, value, PUT)
    // Append it to the data file
    err = db.dbFile.Write(entry)

    // Update the in-memory index
    db.indexes[string(key)] = offset
    return
}
The GET method first retrieves the index information from memory and checks whether the key exists. If it does not, it returns directly; if it does, it reads the value from disk:
func (db *MiniDB) Get(key []byte) (val []byte, err error) {
    // Retrieve the index information from memory
    offset, ok := db.indexes[string(key)]
    // The key does not exist
    if !ok {
        return
    }

    // Read the data from disk
    var e *Entry
    e, err = db.dbFile.Read(offset)
    if err != nil && err != io.EOF {
        return
    }
    if e != nil {
        val = e.Value
    }
    return
}
The DEL method is similar to PUT, except that the data is encapsulated into an Entry marked as DEL before being written to the file:
func (db *MiniDB) Del(key []byte) (err error) {
    // Retrieve the index information from memory
    _, ok := db.indexes[string(key)]
    // If the key does not exist, ignore it
    if !ok {
        return
    }

    // Encapsulate the deletion into an Entry and write it
    e := NewEntry(key, nil, DEL)
    err = db.dbFile.Write(e)
    if err != nil {
        return
    }

    // Delete the key from the in-memory index
    delete(db.indexes, string(key))
    return
}
Finally, there is the important merge operation on the data file. The process is as described above; the key code is as follows:
func (db *MiniDB) Merge() error {
    var (
        validEntries []*Entry
        offset       int64
    )

    // Read the entries from the original data file
    for {
        e, err := db.dbFile.Read(offset)
        if err != nil {
            if err == io.EOF {
                break
            }
            return err
        }
        // The in-memory index holds the latest state, so an entry whose
        // offset matches the index is valid and is kept
        if off, ok := db.indexes[string(e.Key)]; ok && off == offset {
            validEntries = append(validEntries, e)
        }
        offset += e.GetSize()
    }

    if len(validEntries) > 0 {
        // Create a temporary file
        mergeDBFile, err := NewMergeDBFile(db.dirPath)
        if err != nil {
            return err
        }
        defer os.Remove(mergeDBFile.File.Name())

        // Rewrite the valid entries
        for _, entry := range validEntries {
            writeOff := mergeDBFile.Offset
            err := mergeDBFile.Write(entry)
            if err != nil {
                return err
            }

            // Update the index
            db.indexes[string(entry.Key)] = writeOff
        }

        // Delete the old data file
        os.Remove(db.dbFile.File.Name())
        // Rename the temporary file to the new data file
        os.Rename(mergeDBFile.File.Name(), db.dirPath+string(os.PathSeparator)+FileName)

        db.dbFile = mergeDBFile
    }
    return nil
}
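To tie the pieces together, here is a minimal usage sketch of the API walked through above. The import path is an assumption based on the repository name, and error handling is elided for brevity:

package main

import (
    "fmt"

    "github.com/roseduan/minidb" // assumed import path
)

func main() {
    db, err := minidb.Open("/tmp/minidb")
    if err != nil {
        panic(err)
    }

    // PUT: append a record to disk, then update the in-memory index
    _ = db.Put([]byte("name"), []byte("minidb"))

    // GET: look up the offset in memory, then read the value from disk
    val, _ := db.Get([]byte("name"))
    fmt.Println(string(val)) // minidb

    // DEL: append a deletion record and remove the key from the index
    _ = db.Del([]byte("name"))
}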
Apart from the test files, minidb's core code is only about 300 lines, yet it contains the main ideas of the Bitcask storage model and the underlying foundation of rosedb.
Once you understand minidb, you have almost mastered the Bitcask storage model, and with a little more time you will be able to find your way around rosedb as well.
Going further, if you are interested in K-V storage, you can study more of the relevant material. Although Bitcask is simple and easy to understand, it also has many problems. rosedb has optimized it along the way in practice, but many problems remain.
Some people may wonder: since Bitcask is such a simple model, is it just a toy, or is it actually used in production environments? The answer is yes.
Bitcask originated as the underlying storage model of Riak, a distributed K-V store that also ranks high among NoSQL databases.
The distributed K-V storage used by Douban is also based on the Bitcask model, with many optimizations on top. Currently there are not many K-V stores based purely on the Bitcask model, so feel free to read rosedb's code and put forward your own suggestions to improve the project together.
Finally, here are the links to the relevant projects:
minidb: github.com/roseduan/mi…
rosedb: github.com/roseduan/ro…
References:
riak.com/assets/bitc…
medium.com/@arpitbhaya…