Storage mode

Most common relational databases, such as MySQL and Oracle, use row storage, while HBase uses column storage.

Row storage and column storage

What is row storage and what is column storage? Suppose we have a table like this

Row storage keeps all the values of a row together.

Column storage keeps all the values of a column together.

At the underlying storage level, row storage writes one row after another, so the values of a single row sit next to each other. Column storage writes each column's values together; the data of one column and the data of another column are not adjacent but kept separately from each other.
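
To make the difference concrete, here is a minimal, self-contained Java sketch (not HBase code; the table, names, and values are made up) that keeps the same three-column table once in a row-oriented layout and once in a column-oriented layout:

import java.util.Arrays;
import java.util.List;

public class StorageLayoutDemo {
    public static void main(String[] args) {
        // Row storage: all values of one row sit next to each other.
        List<Object[]> rowStore = Arrays.asList(
                new Object[]{1, "Alice", 30},
                new Object[]{2, "Bob", 25});

        // Column storage: all values of one column sit next to each other,
        // and different columns are kept in separate containers.
        int[]    idColumn   = {1, 2};
        String[] nameColumn = {"Alice", "Bob"};
        int[]    ageColumn  = {30, 25};

        // Reading one whole row is cheap in row storage ...
        System.out.println(Arrays.toString(rowStore.get(0)));
        // ... while scanning a single column is cheap in column storage.
        System.out.println(Arrays.stream(ageColumn).average().orElse(0));
    }
}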

Advantages and disadvantages

Row storage advantages

  • High random read efficiency
  • Excellent transaction support

Disadvantages

  • A large number of indexes need to be maintained, so storage costs are high
  • Does not scale linearly

Column storage advantages

  • When a column contains repeated data it compresses well, so storage costs are low
  • When several columns are queried, the columns are stored separately and can be read in parallel, improving query efficiency

Disadvantages

  • Transaction support is poor

Application scenarios

Row storage

  • Tables need to be joined and queried together
  • Online transaction processing is required
  • Query demand is greater than storage (write) demand

Column storage

  • Only a few columns are usually accessed
  • Data compression and linear scaling are required
  • Storage (write) demand is greater than read demand

In summary, row storage suits OLTP (Online Transaction Processing) scenarios and column storage suits OLAP (Online Analytical Processing) scenarios

Note

The description of row storage and column storage above is from a broad perspective. It does not mean that HBase's internal design is exactly as described: row storage and column storage are two storage ideas rather than concrete designs, and different products implement them in their own ways.

HBase column family storage

HBase is better described as a column-family database than a column-oriented database

Column family

In official HBase terminology this grouping is called a column family; some people also translate it as "column cluster"

What is a column family? We actually use the idea all the time in daily work. Take a staff file table as an example (shown in the figure): "basic information" and "job information" are column families, while name, sex, and age are columns belonging to the "basic information" family.

Note that in practice the columns of a table are often designed with more than two levels; in the staff table above, "basic information" might be further divided into personal information and family information. HBase, however, supports only two levels: column family and column.

An HBase table consists of multiple column families, and each column family can contain multiple columns. Column families are part of the table schema, but columns are not.

Column families must be defined when the table is created, and their number should be kept small: a few dozen at most in theory, and usually far fewer in practice. Column family names must consist of printable characters, which is a notable difference from the naming rules for column names or values.
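
As a sketch of what this looks like programmatically, the following example creates a table with two column families using the HBase 2.x Java client (the table name staff and the family names basic_info and job_info are illustrative, not from the original text):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateStaffTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Column families are fixed at table-creation time; columns are not.
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("staff"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("basic_info"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("job_info"))
                    .build());
        }
    }
}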

Why HBase uses column families

A row is composed of columns, and columns are grouped into column families. This helps to draw semantic or topical boundaries in the data, and it also allows certain features, such as compression or keeping data in memory, to be applied per family. All the columns of a column family are stored in the same underlying storage file.

Column family setting

HBase access control, disk usage, and memory usage statistics are all performed at the column family level. Column families are therefore usually designed by grouping related or frequently accessed columns together, to make data access easier.

HBase column

HBase columns belong to a column family. Column names are prefixed with column family names. Columns can be declared when a table is created or dynamically added after the table is created.

Unlike column families, which must be created before they can be used, columns are much more flexible and can simply be specified when data is inserted.
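
A hedged sketch with the HBase 2.x Java client: the put below writes to a column (nickname) that was never declared anywhere; only the column family basic_info has to exist (the table, row key, and column names are illustrative):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnPut {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("staff"))) {
            Put put = new Put(Bytes.toBytes("emp-001"));   // row key
            // The family "basic_info" must already exist, but the columns
            // "name" and "nickname" do not have to be declared anywhere.
            put.addColumn(Bytes.toBytes("basic_info"), Bytes.toBytes("name"), Bytes.toBytes("Zhang San"));
            put.addColumn(Bytes.toBytes("basic_info"), Bytes.toBytes("nickname"), Bytes.toBytes("Xiao Zhang"));
            table.put(put);
        }
    }
}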

HBase table

Table = RowKey + Family + column + Timestamp + Value

An HBase table is made up of multiple column families, and each column family can have several columns. Each row in an HBase table has a RowKey that uniquely identifies it, similar to a primary key in a relational database. Each value also carries a Timestamp, which doubles as a version number: HBase supports multiple versions, and when multi-version support is enabled, the old value of a column is kept when the column changes.

In other words, besides the current data, HBase also lets you query the historical versions of a column.

HBase storage structure

So far HBase has been described as column(-family) storage from the point of view of its external behavior. Looking at its internal storage, HBase is better classified as a key-value database:

(Table,RowKey,Family,column,Timestamp) -> Value

When data is stored, the key of each value consists of the table name, row key, column family name, column name, and timestamp (version number).
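
The following sketch (HBase 2.x Java client, illustrative table and row key) reads a row back and prints each cell's coordinates, which correspond exactly to the key components listed above:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PrintCellKeys {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("staff"))) {
            Result result = table.get(new Get(Bytes.toBytes("emp-001")));
            for (Cell cell : result.rawCells()) {
                // Every stored value is addressed by (row key, family, qualifier, timestamp).
                System.out.printf("(%s, %s, %s, %d) -> %s%n",
                        Bytes.toString(CellUtil.cloneRow(cell)),
                        Bytes.toString(CellUtil.cloneFamily(cell)),
                        Bytes.toString(CellUtil.cloneQualifier(cell)),
                        cell.getTimestamp(),
                        Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}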

Data versions of HBase columns

HBase columns can keep multiple versions of their data. Older HBase releases kept 3 versions by default; current releases default to 1 (see the VERSIONS attribute in the create statement below).

Again, take the employee file table as an example

The first insert writes the name and age fields. At the bottom layer these are two key-value pairs, and each key carries a Timestamp (version number), here t1 and t2

The second write adds the salary field, which gets a new timestamp t3

The third write updates the age field, which gets a new timestamp t4

There are now multiple versions of this row's data, but they all correspond to the same RowKey

t1 -> Zhang San, t2 -> 22, t4 -> 23, t3 -> 1000. When the row is queried, the latest version of each column is returned, so the result is: Zhang San, 23, 1000
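
Assuming the column family is configured to keep several versions, the sketch below (HBase 2.x Java client; table, family, and row key are illustrative) asks for up to three versions of the age column and prints one line per stored version, newest first:

import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadAgeHistory {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("staff"))) {
            Get get = new Get(Bytes.toBytes("emp-001"));
            get.addColumn(Bytes.toBytes("basic_info"), Bytes.toBytes("age"));
            get.readVersions(3);   // HBase 2.x; the 1.x client uses setMaxVersions(3)
            Result result = table.get(get);
            // Cells come back newest first, one per stored version of the column.
            List<Cell> versions = result.getColumnCells(Bytes.toBytes("basic_info"), Bytes.toBytes("age"));
            for (Cell cell : versions) {
                System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}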

Data storage prototype

HBase maintains data using the following data structures

SortedMap<
    RowKey,List<
        SortedMap<
            Column,List<value,Timestamp>
        >
    >
>

The first SortedMap sorts rowkeys and the second SortedMap sorts columns

Key and value pairs are used to store data in HBase.

In fact these are just representations of HBase data at different levels of abstraction, just as any data on a hard disk is eventually turned into sequences of 0s and 1s. So HBase can be viewed either as a key-value store or as a hierarchical (nested map) structure.
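
Here is a tiny, self-contained Java toy model of that nested-map view (it is not HBase code, just TreeMaps standing in for the SortedMaps above; the row key and column names are made up):

import java.util.TreeMap;

public class NestedMapModel {
    // row key -> (column -> (timestamp -> value)), every level kept sorted,
    // mirroring the SortedMap description above.
    static TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

    static void put(String rowKey, String column, long ts, String value) {
        table.computeIfAbsent(rowKey, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(ts, value);
    }

    static String getLatest(String rowKey, String column) {
        return table.get(rowKey).get(column).lastEntry().getValue();  // newest timestamp wins
    }

    public static void main(String[] args) {
        put("emp-001", "basic_info:age", 1L, "22");
        put("emp-001", "basic_info:age", 4L, "23");
        System.out.println(getLatest("emp-001", "basic_info:age"));   // prints 23
    }
}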

HBase table

HBase create table statement

create 'Namespace: table name',
{
NAME => 'Column family name 1',
VERSIONS => 'Number of versions',
EVICT_BLOCKS_ON_CLOSE => 'false',
NEW_VERSION_BEHAVIOR => 'false',
KEEP_DELETED_CELLS => 'FALSE',
CACHE_DATA_ON_WRITE => 'false',
DATA_BLOCK_ENCODING => 'NONE',
TTL => 'Time to live, in seconds, FOREVER means never expired.',
MIN_VERSIONS => '0',
REPLICATION_SCOPE => 'Copy range',
BLOOMFILTER => 'ROW',
CACHE_INDEX_ON_WRITE => 'false',
IN_MEMORY => 'false',
CACHE_BLOOMS_ON_WRITE => 'false',
PREFETCH_BLOCKS_ON_OPEN => 'false',
COMPRESSION => 'Data compression algorithm. SNAPPY is commonly used in hbase.',
BLOCKCACHE => 'true',
BLOCKSIZE => '65536'
},
{
NAME => 'Column family name 2',
VERSIONS => '1',
EVICT_BLOCKS_ON_CLOSE => 'false',
NEW_VERSION_BEHAVIOR => 'false',
KEEP_DELETED_CELLS => 'FALSE',
CACHE_DATA_ON_WRITE => 'false',
DATA_BLOCK_ENCODING => 'NONE',
TTL => 'FOREVER',
MIN_VERSIONS => '0',
REPLICATION_SCOPE => '0',
BLOOMFILTER => 'ROW',
CACHE_INDEX_ON_WRITE => 'false',
IN_MEMORY => 'false',
CACHE_BLOOMS_ON_WRITE => 'false',
PREFETCH_BLOCKS_ON_OPEN => 'false',
COMPRESSION => 'NONE',
BLOCKCACHE => 'true',
BLOCKSIZE => '65536'
}

Tables can also be created using the following terse statement

create 'Namespace:table name', 'column family 1', 'column family 2'

This creates a table in which the column family attributes take the following default values

{
NAME => 'Column family name (no default)',
VERSIONS => '1',
EVICT_BLOCKS_ON_CLOSE => 'false',
NEW_VERSION_BEHAVIOR => 'false',
KEEP_DELETED_CELLS => 'FALSE',
CACHE_DATA_ON_WRITE => 'false',
DATA_BLOCK_ENCODING => 'NONE',
TTL => 'FOREVER',
MIN_VERSIONS => '0',
REPLICATION_SCOPE => '0',
BLOOMFILTER => 'ROW',
CACHE_INDEX_ON_WRITE => 'false',
IN_MEMORY => 'false',
CACHE_BLOOMS_ON_WRITE => 'false',
PREFETCH_BLOCKS_ON_OPEN => 'false',
COMPRESSION => 'NONE',
BLOCKCACHE => 'true',
BLOCKSIZE => '65536'
}

HBase data compression

When creating a table, you can set the COMPRESSION attribute on a column family to enable data compression for that family. Compression is configured per column family, and it trades CPU time for I/O speed and storage space.

The COMPRESSION attribute has the following enumeration values

  • ‘NONE’ : data compression is not enabled
  • ‘GZ’ : GZIP. Its compression ratio is higher than Snappy and LZO, but compression and decompression consume more CPU. Recommended for infrequently accessed cold data
  • ‘LZO’ : a lower compression ratio than GZIP but less CPU usage. Recommended for frequently accessed hot data. Before Google released Snappy in 2011, LZO was the default recommendation
  • ‘SNAPPY’ : a compression ratio similar to LZO but with better performance. Recommended for frequently accessed hot data and currently the default recommendation
  • ‘LZ4’ : a compression ratio similar to LZO, but faster compression and decompression

In general you need to balance smaller file sizes against faster compression/decompression; in most cases, enabling Snappy is a good default choice.
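
As a sketch, the same COMPRESSION and VERSIONS settings can be applied per column family from the HBase 2.x Java client (illustrative table and family names; the Snappy codec must be available on the servers):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateCompressedTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("staff"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("basic_info"))
                    .setCompressionType(Compression.Algorithm.SNAPPY)  // enable Snappy for this family
                    .setMaxVersions(3)                                 // keep up to 3 versions
                    .build())
                .build());
        }
    }
}

This is equivalent to the COMPRESSION => 'SNAPPY' attribute in the shell create statement shown earlier.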

Directory structure of HBase data storage

When setting up an HBase database, you need to specify a data store path

hbase-site.xml

<property>
	<name>hbase.rootdir</name>
    <value>/hbase</value>
</property>

/hbase is the data storage directory. HBase creates a series of directories under it

  • /.tmp: when a table is created or deleted, it is moved to the /.tmp directory while the operation is performed. It holds data structures that are being modified temporarily
  • /WALs: the write-ahead log directory, containing the WAL files managed by HLog instances
  • /archive: stores archives and snapshots of tables. After a split or merge completes, HBase moves the HFile files here and then deletes the originals; this is handled by a scheduled task on the master
  • /corrupt: stores damaged log files. It is normally empty; if it is not, something has gone wrong in HBase
  • /data: the core data directory of HBase, holding the data of system tables and user tables
  • /hbase.id: stores the unique identifier of this HBase instance in the cluster, written after HBase starts
  • /hbase.version: the version of the cluster file format, i.e. the HFile format version
  • /oldWALs: once the data in the log files under /WALs has been persisted to store files, the logs are moved here before being deleted

If we create a table called test in the default namespace, the directory structure looks like this

  • data
    • default (the namespace)
      • test
        • .tabledesc: stores the table metadata
          • .tableinfo.0000000001: the table metadata file
        • .tmp: temporary directory, holding temporary data while the table metadata changes
        • ae7qd82jdaoeciq34fq6i3rc9jf38j3b (encoded region name)
          • .regioninfo
          • cf (one directory per column family)
            • 0b3fie93834d93j1a93814892j7a23f (a store file)

HBase architecture

As shown in the figure above, HBase consists of these components

  • Region Server: stores and manages HBase data
  • ZooKeeper: stores the location of the meta table
  • Master: manages the Region Servers and keeps the meta table and the meta information in ZooKeeper up to date
  • Client: looks up the meta table through ZooKeeper to find the Region Server address and port that serve the requested data

HBase meta information table

The HBase meta information table is called the Meta Table. It is itself an HBase table, so it also has a row key and column families. The Meta Table records information about every region in the system. A region is an important HBase concept: it is the basic unit for storing user data, and it is described in detail later.

The location of the Meta Table is kept in ZooKeeper. A client first asks ZooKeeper where the Meta Table is, then reads the Meta Table to find the region that holds the requested data, and finally accesses that region.

The meta table structure is as follows

The Meta Table has a single column family, info, which contains four columns

  • RegionInfo: information about the current region, such as its startKey, endKey, and name
  • Server: the address and port of the Region Server hosting the region
  • ServerStartCode: the start time of that Region Server
  • SeqnumDuringOpen: the sequence number recorded when the region was opened

The row key is worth a closer look; its format is TableName,StartKey,Timestamp.EncodedName

  • StartKey: the first RowKey stored in this region of the table. If it is empty, this is the first region of the table; if both StartKey and EndKey are empty, the table has only one region
  • Timestamp: the time the region was created
  • EncodedName: the MD5 hex value of the TableName,StartKey,Timestamp string
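
The following standalone Java sketch only illustrates the row key format described above (the table name, start key, and timestamp are made up, and HBase's own RegionInfo code differs in detail):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class MetaRowKeyDemo {
    public static void main(String[] args) throws Exception {
        String tableName = "staff";            // illustrative
        String startKey  = "emp-001";          // illustrative
        long   timestamp = 1700000000000L;     // region creation time (illustrative)

        String prefix = tableName + "," + startKey + "," + timestamp;
        // EncodedName: MD5 hex of the "TableName,StartKey,Timestamp" string
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(prefix.getBytes(StandardCharsets.UTF_8));
        StringBuilder encodedName = new StringBuilder();
        for (byte b : digest) {
            encodedName.append(String.format("%02x", b));
        }
        System.out.println(prefix + "." + encodedName);
    }
}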

When a region changes (for example, a Region Server crashes, or a region splits or merges), the HBase Master updates the Meta Table.

The Meta Table is the top-level index of HBase tables.

LSM storage ideas in HBase

LSM

A Log-Structured Merge-Tree (LSM-Tree) is not a single concrete data structure; it generally consists of two or more data structures, each corresponding to a different storage medium.

For example, consider an LSM with two data structures, C0 and C1, where C0 is stored in memory media and C1 is stored in disk media. When data is inserted, it is temporarily stored in C0 to obtain better storage performance. When the data in C0 reaches the preset threshold, all data is persisted to C1.

In this two-tier LSM, one data structure is stored in memory and the other data structure is stored on disk. Of course, LSM does not have to be designed as two layers, but can also be designed as three layers. For example, C0, C1, and C2 are stored in memory, hard disk, and hard disk respectively, where C2 is an aggregated archive of C1.

LSM is more of a data structure design idea that can be adjusted freely according to specific circumstances.
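
A minimal Java toy illustrating the two-tier idea, assuming nothing about HBase itself: a sorted in-memory map plays the role of C0 and is flushed to an immutable sorted file (C1) whenever it reaches a small, made-up threshold:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TinyLsm {
    private static final int FLUSH_THRESHOLD = 4;                       // C0 size limit (illustrative)
    private final TreeMap<String, String> memTable = new TreeMap<>();   // C0: in memory, sorted
    private final List<Path> segments = new ArrayList<>();              // C1: sorted files on disk

    void put(String key, String value) throws IOException {
        memTable.put(key, value);
        if (memTable.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    private void flush() throws IOException {
        // Persist the whole sorted memTable as one immutable segment file.
        Path segment = Files.createTempFile("tiny-lsm-", ".seg");
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : memTable.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(segment, lines);
        segments.add(segment);
        memTable.clear();
    }

    public static void main(String[] args) throws IOException {
        TinyLsm lsm = new TinyLsm();
        for (int i = 0; i < 10; i++) {
            lsm.put("row-" + i, "value-" + i);
        }
        System.out.println("segments flushed to disk: " + lsm.segments.size());
    }
}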

LSM storage implementation in HBase

HBase uses a three-tier LSM storage structure

Layer 1: Memory and logging

To improve random write performance, user data is not written straight to disk when it reaches the Region Server. Instead it is written both to a log and to memory; the log prevents data loss if the memory contents are lost, for example on power failure.

Layer 2: Hard disks

When the amount of data in the memory reaches the threshold, the asynchronous thread writes the data to the disk to form a storeFile file

Layer 3: Hard disks

With continuous refresh, more and more small files are stored in the hard disk, which is not conducive to management and random query. When the time is right, start the thread to merge the small files to form a large file.

HBase Storage Module

RegionServer contains multiple regions and one HLog

  • Region: also called HRegion, the smallest unit of user data storage. It corresponds to a slice of a table's rows. A table may be split into several slices by row key and stored in different regions on different RegionServers to balance the load.
  • HLog: HLog files live on disk and implement the WAL (write-ahead log). User data arriving at a RegionServer is first written to the HLog for high availability; if the service crashes, the previous state can be restored by replaying the log.

A RegionServer contains multiple regions, and a region contains multiple stores. Each store corresponds to a column family of the table: a region has as many stores as the table has column families.

A store consists of a MemStore and multiple storeFiles

  • MemStore: an in-memory data structure. After user data has been written to the HLog, it enters the region and is first written to the MemStore of the corresponding store
  • StoreFile/HFile: a storeFile is written out when the data in a MemStore reaches the threshold; it is the file that holds the table data. StoreFiles are ultimately written in HFile format and stored in HDFS

HLog and MemStore together form the first layer of HBase's LSM (Log-Structured Merge) tree, ensuring that data can be written quickly and reliably.

StoreFile and HFile form the second layer of the LSM tree, providing durable, persistent storage for that data.

HBase region

A Region is the minimum unit of HBase distributed storage and load balancing. A Regionserver service has multiple regions. Each Region corresponds to part of the row data of a table (the data of a table may be divided into multiple segments and stored in different regions in different RegionServers).

When data is continuously inserted and reaches the Region threshold, the Region is divided into two regions horizontally. When a Region Server fails or the master triggers the load balancing policy, regions may be moved from one Region Server to another

The process of splitting regions by the Region Server is as follows

  • Take the region offline and split it
  • Add the information about the newly split regions to the Meta Table
  • Finally, bring the split regions online

Each Region has three pieces of information that identify it

  • TableName: indicates the name of a table
  • StartRowKey: Indicates the RowKey to start with
  • CreateTime: time when a Region is created and the earliest data is inserted. In other words, regions are created on demand

Region characteristics

Region is the minimum unit of HBase distributed storage and load balancing, but not the minimum unit of storage.

If there are too many regions, performance suffers from the management overhead; if there are too few, queries cannot be parallelized well and the load is not spread out. As a rule, the number of regions should not be smaller than the number of nodes in the cluster.

The region split process is invisible to the Master, because the Master does not take part in it.

HBase HFile

HBase saves data in the HDFS as HFile files

HFile and other components

The relationship between store, memStore, storeFile, and HFile is as follows

A region may have multiple stores, each representing one column family. Each store contains one memStore, an in-memory data structure that holds the data users have recently written or modified. A store may also have multiple storeFiles, which are file-system-level data structures.

Stores are managed by regions and each holds the data of one column family: a region has as many stores as its table has column families. HBase decides whether to split a region based on store size.

The default memStore flush threshold is 64 MB (128 MB in newer HBase versions). When a memStore fills up, HBase writes its data out to a storeFile, and storeFiles are stored in HFile format.

HFile (StoreFile) files

StoreFiles are stored in HDFS in HFile format. HFile is a Hadoop binary file format; StoreFile is in fact a lightweight wrapper around HFile, i.e. the file underlying a StoreFile is an HFile.

The HFile is the basic storage unit of HBase and the actual carrier of user data. At the bottom it is a binary file in HDFS that stores data in key-value format.

The structure of an HFile is as follows; it consists of four parts

  • Scanned Block Section: the blocks that are read during a sequential scan of the HFile; this is where user data is stored
  • Non-scanned Block Section: blocks that are not read during a sequential scan, mainly metadata
  • Load-on-open Section: metadata about the HFile that is loaded into memory when the Region Server opens the file
  • Trailer: records the HFile addressing information, the offsets of the other sections, and the version; HBase currently uses HFile version 2

Data Block

User Data is stored in the Scanned Block Section area of the HFile file and specifically in the Data Block of the Scanned Block Section.

Data Block stores Data in key-value format

As shown in the figure above, a Data Block contains multiple key-values, and each key-value consists of four parts

  • Key Length: Indicates the Length of the Key
  • Value Length: Indicates the Length of Value
  • Key: A key consists of multiple pieces of information, which will be discussed separately later
  • Value: indicates the data of a column in a column family of a row in a table

The key of key-value in a Data Block consists of multiple parts

  • Row Length: Indicates the Length of the Row key
  • Row: indicates the RowKey in each Row of the HBase table
  • ColumnFamilyLength: Indicates the length of the column family name
  • ColumnFamily: ColumnFamily name
  • ColumnQualifierLength: Column identifier length
  • ColumnQualifier: Indicates the column identifier, namely, the column name
  • Timestamp: The Timestamp of the value, which can also be understood as the version number of the value
  • KeyType: the type of the key-value, used to mark the data. During a scan, key-values carrying a delete marker are skipped
    • Put: adds data
    • Delete: deletes a specific version of a column (a single cell)
    • DeleteColumn: deletes all versions of a column
    • DeleteFamily: deletes an entire column family

HBase Compaction

As mentioned in the previous section, HBase uses a three-layer LSM: the first layer stores data in the MemStore (memory) together with the HLog, and the second layer flushes the MemStore into StoreFiles. HBase Compaction implements the third layer, merging many small StoreFiles into one large file.

A compaction merges storeFiles belonging to one store of one region on a RegionServer at a time; a single compaction never merges files across two stores.

The HBase Compaction reads key-values from the merging file, resorts the data, and writes it to a new HFile

Why design an HBase Compaction

As the system runs, more and more files accumulate. Too many files are hard to maintain and hurt query efficiency, because a read may have to open many files and opening and closing files consumes system resources. The files therefore need to be merged.

HBase Compaction clears three types of data during consolidation

  • Data whose KeyType is a delete marker
  • Data whose TTL has expired
  • Data whose version has expired: for example, if a table keeps at most 3 versions, inserting a fourth version of a value does not delete the oldest version immediately; it is removed at compaction time

Two kinds of HBase Compaction

There are two types of HBase Compaction: Minor Compaction and major Compaction.

A minor compaction merges several adjacent storeFiles into one larger storeFile

A major compaction merges all the storeFiles of a store into one large storeFile. Because it runs for a long time and consumes a lot of system resources, it has a significant impact on normal operation, so in production environments automatic major compaction is usually disabled and triggered manually during off-peak hours.
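
A manual major compaction can be requested from the HBase 2.x Java client as sketched below (illustrative table name); the same thing can be done with the major_compact command in the HBase shell:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TriggerMajorCompaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Ask the cluster to major-compact the table; the request is asynchronous,
            // and the actual merge is carried out by the Region Servers.
            admin.majorCompact(TableName.valueOf("staff"));
        }
    }
}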

Compaction trigger conditions

Compaction can be triggered by a number of factors; the three most common are:

  • MemStore flush: every time a MemStore is flushed to an HFile, HBase checks the number of files in the current store; if it exceeds the threshold, a compaction is triggered

  • Periodic compaction check: a background thread checks periodically. It first looks at the number of files in the store and triggers a compaction if the threshold is exceeded. It also checks whether the oldest HFile in the current store was last updated before a point in time called MCTime; if so, a major compaction is triggered

MCTime is a jittered value; by default it floats in the range [7 - 7 * 0.2, 7 + 7 * 0.2] days, where the 7 days come from the hbase.hregion.majorcompaction configuration item and 0.2 from hbase.hregion.majorcompaction.jitter. So by default a major compaction happens roughly every 7 days; to disable automatic major compaction, set hbase.hregion.majorcompaction to 0.

  • Manual trigger: a manually triggered compaction is executed as a major compaction

HBase WAL (write-ahead log)

The HBase WAL is similar to MySQL's binlog: it records all changes to the data, so that if the service crashes, the log can be replayed to restore the state before the crash.

When data is written to a Region Server, the whole operation is considered failed if the WAL write fails.

As shown above, when the HBase Client executes a command such as put or delete, the operation must first be recorded in the log before it is applied to the Region.

HBase WAL provides disaster recovery for HBase data, improving database reliability. In addition, it synchronizes log files to implement remote backup

HLog

HLog is the implementation class of the HBase WAL. A Region Server has only one HLog instance. When a region is initialized, the HLog instance is passed to the region's constructor, so the region can use it for logging.

For performance reasons, the older client API offered a setWriteToWAL(boolean) method on put, delete, and increment operations; setting it to false disables the WAL for that operation (newer clients use setDurability instead).
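
In current clients the per-operation switch is Mutation.setDurability rather than setWriteToWAL; a sketch with the HBase 2.x Java client (illustrative table, row key, and values):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalPut {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("staff"))) {
            Put put = new Put(Bytes.toBytes("emp-001"));
            put.addColumn(Bytes.toBytes("basic_info"), Bytes.toBytes("age"), Bytes.toBytes("23"));
            // Trade durability for write speed: this mutation bypasses the WAL,
            // so it can be lost if the Region Server crashes before the MemStore is flushed.
            put.setDurability(Durability.SKIP_WAL);
            table.put(put);
        }
    }
}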

The WAL is stored as a Hadoop SequenceFile, with the log records kept as key-value pairs. The value part is the data of the user operation (such as a put or delete), and the key part is an HLogKey, which consists of the following

  • Region: Region to which data belongs
  • TableName: indicates the table to which data belongs
  • SequenceNumber: indicates the serial number
  • WriteTime: indicates the log WriteTime

HLogSyncer

HLogSyncer is a synchronous log writing class. By default, HBase WAL writes logs directly to the file system. However, to improve performance, HBase WAL temporarily writes logs to the memory, which requires an asynchronous thread to flush log data from memory to the file system. HLogSyncer is responsible for this.

There are two strategies for HLogSyncer

  • Timed flush: flush at a configured time interval
  • Memory threshold: flush to the file system when the buffered data reaches a threshold

HLogRoller

The log rolling interval can be set in the configuration file; the default is one hour, i.e. a new log file is started every hour.

The HLogRoller background thread is responsible for organizing these log files and has the following functions

  • Roll the log at the configured interval, starting a new file so that no single file grows too large
  • Compare the HLog SequenceNumbers with the sequence numbers that have already been persisted, and delete old log files that are no longer needed

HBase data access process

Writing data

The client

The HBase Client processes for storing data (Put and Delete) are as follows

  1. The HBase Client asks ZooKeeper for the location of the HBase Meta Table (meta information table) and reads the Meta Table to obtain the Region Server address
  2. The Region Server service is located using that address
  3. The client obtains the meta information (such as table name and startRowKey) of each region on that Region Server
  4. When the user submits a Put or Delete request, the HBase Client can buffer it locally: with autoFlush enabled (the default) each request is sent immediately, while with autoFlush disabled requests accumulate in the local buffer until a threshold (2 MB by default) is reached and are then submitted to the HBase server asynchronously in batches (see the sketch below)
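
With the HBase 2.x Java client, this client-side buffering is done through a BufferedMutator; a sketch follows (illustrative table name; the 2 MB buffer size is chosen to match the default threshold mentioned above):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWrites {
    public static void main(String[] args) throws Exception {
        BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("staff"))
                .writeBufferSize(2 * 1024 * 1024);   // flush to the server once ~2 MB is buffered
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             BufferedMutator mutator = conn.getBufferedMutator(params)) {
            for (int i = 0; i < 10_000; i++) {
                Put put = new Put(Bytes.toBytes("emp-" + i));
                put.addColumn(Bytes.toBytes("basic_info"), Bytes.toBytes("age"), Bytes.toBytes("30"));
                mutator.mutate(put);                 // buffered locally, sent in batches
            }
            mutator.flush();                         // push anything still buffered
        }
    }
}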

The server side

The HBase Server processes for storing data (such as Put and Delete) are as follows

  1. When the data reaches a region on the Region Server, a RowLock is obtained and a shared lock is created. HBase uses the RowLock to guarantee atomicity of operations on a single row
  2. The data is written to the HLog; note that the HLog is not flushed to disk while the RowLock is held. HBase uses the HLog to implement the WAL
  3. After the log is written, the data is written, per column family, to the MemStore of the corresponding store. Writing the log before the cache means that even if the service crashes, the data can be recovered from the log
  4. The RowLock and shared lock are released, and the HLog is flushed to disk. Releasing the RowLock before flushing shortens the time the lock is held and improves write performance
  5. If the HLog fails to flush to disk, the operation is rolled back and the data written to the MemStores is removed
  6. When the data in a MemStore exceeds the threshold (64 MB by default), an asynchronous thread flushes it to disk, producing storeFiles
  7. When the number of storeFiles reaches the threshold, a compaction is triggered, merging multiple storeFiles into one and removing deleted, TTL-expired, and version-expired data
  8. After several compactions, when a storeFile grows beyond the threshold, a split occurs: the current region is divided into two regions and the Master is notified
  9. The Master takes the original region offline and assigns the two new regions to different Region Servers according to the load-balancing policy

Reading data

The client

The client-side read (Scan) flow is the same as the write flow, except that Scan requests are not placed in the local buffer.
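
For reference, a client-side Scan with the HBase 2.x Java client looks like the sketch below (illustrative table, family, and row key range); it is this request that drives the server-side scanners described next:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanStaff {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("staff"))) {
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("emp-000"))    // scan a row key range
                    .withStopRow(Bytes.toBytes("emp-100"))
                    .addFamily(Bytes.toBytes("basic_info"));   // restrict to one column family
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()) + " has "
                            + result.rawCells().length + " cells");
                }
            }
        }
    }
}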

The server side

The HBase server processes a read (Scan) as follows

  1. When the Scan request reaches the Region Server, a Region Scanner is built on the target region according to the Scan parameters
  2. The Region Scanner builds one Store Scanner per column family, each responsible for retrieving that column family's data (the number of Store Scanners is determined by the number of column families)
  3. Each Store Scanner builds a Store File Scanner for every HFile in the current store to do the actual file retrieval, and also builds a Mem Store Scanner to retrieve the data still in the MemStore
  4. The key-value data found by the Mem Store Scanner and Store File Scanners is assembled and returned to the client

HBase therefore reads data through three layers of scanners

  • Layer 1: Region Scanner
  • Layer 2: Store Scanner
  • Layer 3: Mem Store Scanner, Store File Scanner

A Region Scanner consists of multiple Store scanners (column families), and a Store Scanner consists of multiple Store File scanners and one Mem Store Scanner