Apache BookKeeper is an enterprise-class storage system originally developed by Yahoo Research. It was incubated as a sub-project of Apache ZooKeeper in 2011 and became an Apache top project in January 2015.

Originally a write-ahead log (WAL) system, BookKeeper has evolved over the years to provide more features such as high availability and multiple copies of NameNode for Hadoop Distributed File System (HDFS). Provides storage services for messaging systems over Pulsar and cross-machine replication for multiple data centers. Github.com/apache/puls…

Usage scenarios

An initial use case for BookKeeper is to save the Edit log for HDFS NameNode, as shown below:

ZKFC is a ZooKeeper client that monitors and manages NameNode status. There is one ZKFC running on each NameNode machine. It has three main functions:

• Health check •ZooKeeper session management • Election: When an Active NameNode in the cluster breaks down, ZooKeeper automatically selects a node as the new Active NameNode.

BookKeeper records the NameNode Edit Log (the Edit log stores operation logs of the file system). All changes made to the NameNode are recorded in BookKeeper. In this case, when the Active NameNode breaks down, BookKeeper will use the saved Edit log to go to the Standby NameNode for playback, and then switch to the active NameNode.

BookKeeper has the following features:

• Consistency: Because the Edit Log stores HDFS metadata, the consistency requirement is high. • Low latency: low latency is required to avoid data loss. • High throughput: High throughput is required to support more NameNode nodes

The node equivalence

The data structure saved in Bookie is shown below:

Writer Writes entries to ledgers of multiple Bookie nodes concurrently when writing data. This is similar to how a file system first opens a file when writing data, and creates file metadata if the file does not exist.

Ledger is the segment in Pulsar.

When writer writes data, it first opens a new Ledger.

OpenLedger (Number of nodes in the group, number of data backup, number of waiting nodes)Copy the code

For example, (5,3,2) indicates that there are five Bookie nodes in the group, and data needs to be written to three nodes. If two nodes return success, the data is written successfully.

In this way, the data of these three nodes is exactly the same, and the relationship is equal. There is no master/slave relationship.

Read and write data

BookKeeper data read and write:

The writer writes bookie in roundrobin mode. For example, in the preceding figure, the first data is written to Bookie1, Bookie2, and Bookie3, and the second data is written to Bookie2, Bookie3, and Bookie4. The third piece of data is written to Bookie3, Bookie4, and Bookie5, and the fourth piece of data is written to Bookie4, Bookie5, and Bookie1.

When we open a Ledger, we pass in the number of Bookies, so when we write each entry, we use the entry ID modulo to the number of Bookies to determine which bookies to write to. For example, if the third message is modulo 3 from 5, it is written to Bookie3, Bookie4, and Bookie5.

In this way, Ledger data is written to each Bookie node in a polling way. The data of each Bookie node is balanced, and the disk bandwidth and network card bandwidth of each Bookie node can be fully utilized.

Read the high availability

Reader Can read any data in multiple data copies. BookKeeper sets a read timeout, and if the read times out, sends a more speculative read request to another Bookie node.

Write a high availability

If a bookie node (such as Bookie5) fails and cannot be written, BookKeeper does the following:

• Record the entry ID that failed • Encapsulate the data from the failed node • Close the current Ledger and reopen a new Ledger which will reselect the Bookie node, 1, 2, 3, 4, 6. • If Bookie5 is restored, it will no longer provide the write service, only the read service. • If not, restore the bookie5 data from the other node backup to the new node. This process requires the Ledger ID and 5 module to determine whether the data fell to Bookie5. The data recovery process does not affect the Reader, because the other two copies of data can continue to serve.

I/O model

The I/O model for BookKeeper is shown in the following figure, which is a single Bookie data flow:

The whole process is as follows:

• The data written by the Writer reaches the Journal first. The Journal groups the data and flushes the data to the Journal disk in the same order as the data written by the Writer.

Writer Writes to Journal Disk in real time.

• Data from the Journal Disk is written to the Memory table for data collation. Data from the same topic is collated together. • Flush the data on the disk. Index Disk Specifies the Index of the entry, which corresponds to the offset of the entry in Logger Disks.

Reading and writing separation

Data is read from the Memory Cache. If the data does not exist, data is read from the Index Disk and Logger Disk. Write data is driven to the Journal Disk in real time, thus achieving read/write isolation.

Strong consistency

Data can be flushed to Journal Disk in real time, ensuring strong data consistency.

Flexible SLA

In scenarios that require high write performance, the Journal Disk performance can be enhanced, and in scenarios that require high read performance, the Ledger Disk and Index Disk performance can be enhanced.

Use in Pulsar

The architecture of Pulsar is as follows:

An ACK is returned to the Producer after the message generated by the Producer is sent to the disk in real time.

After the Consumer consumes the message, the offset saved in Cursor is also modified and recorded to BookKeeper. This ensures the consistency of the Cursor.

reading

• Why Apache BookKeeper – Part 2• Why Apache BookKeeper – Part 1