The HBase Doesn’t Sleep Book is an HBase technical book that will not put you to sleep, and it is very good. To reinforce my memory, I organized the important parts of the book into reading notes for later reference, and I hope they help readers who are just starting to learn HBase.

Contents

  • Chapter 1 – Getting to know HBase
  • Chapter 2 – Get HBase Running
  • Chapter 3 – HBase Basic Operations
  • Chapter 4 – Getting started with the Client API
  • Chapter 5 – HBase Internal Exploration
  • Chapter 6 – Advanced usage of the client API
  • Chapter 7 – Client API management capabilities
  • Chapter 8 – Faster
  • Chapter 9 – When HBase Meets MapReduce

1. Basic Concepts

1. CAP theory

CAP stands for Consistency, Availability, and Partition tolerance:

  • Consistency: data is updated consistently, and every change is synchronized to all nodes.
  • Availability: every request gets a response, with good response performance.
  • Partition tolerance: the system stays reliable even when the network between nodes fails.

Any distributed system can satisfy at most two of these three properties at once, never all three. Instead of wasting energy trying to design a perfect distributed system that satisfies all three, architects should make trade-offs.

2. NoSQL

Many people think NoSQL means “Not SQL”, but it is actually short for “Not Only SQL”. In contrast to relational databases, NoSQL databases are non-relational and do not guarantee strict transactions; they can even be rather loose about them.

Some databases guarantee only eventual consistency: information is not synchronized immediately but takes a while to become consistent. For example, you post an article, and some of your friends see it right away, while others wait five minutes for it to appear.

There is a delay, but who cares about a few minutes of delay for an entertaining Web 2.0 site? If you were using a traditional relational database, the site would have gone down long ago.

Some databases keep working even when some of the machines go down, because they store multiple copies of the same data in several places, echoing the old adage: don’t put all your eggs in one basket.

3. Column and row storage

Column-based storage is defined in contrast to the row-based storage of traditional relational databases. Simply put, the two differ in how they organize a table's data.

There are two ways to lay out a table in a storage system, and most of us are used to row storage. Row storage puts the values of each row into contiguous physical locations, much like traditional records in a file system. Column storage instead puts the values of each column into contiguous physical locations.

[Figure: graphical illustration of the two storage methods]

A database system that uses row storage is called a row database, and one that uses column storage is called a column database.

One of the main advantages of column storage is that it can significantly reduce system I/O, especially when querying massive amounts of data, because a query reads only the columns it needs.
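To make the I/O argument concrete, here is a minimal sketch of the two layouts using plain Python lists. The table contents and the helper names are made up for illustration; real databases work with disk blocks rather than lists, so treat this only as a picture of the idea.

```python
# Toy table: three rows with three fields each (made-up data).
rows = [
    {"id": 1, "name": "Ann", "age": 31},
    {"id": 2, "name": "Bob", "age": 27},
    {"id": 3, "name": "Cid", "age": 45},
]
fields = ["id", "name", "age"]

# Row storage: the values of one row sit next to each other.
row_layout = [row[field] for row in rows for field in fields]

# Column storage: the values of one column sit next to each other.
column_layout = {field: [row[field] for row in rows] for field in fields}


def values_read_row_store(field):
    """Scanning one field in the row layout walks past every value of every row."""
    return len(row_layout)


def values_read_column_store(field):
    """Scanning one field in the column layout touches only that column."""
    return len(column_layout[field])


# Averaging one column only needs that column's values.
average_age = sum(column_layout["age"]) / len(column_layout["age"])
```

With three fields, the column layout reads a third of the values for a single-field query; the gap grows with the number of fields in the table.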

2. When to use HBase?

HBase storage is built on Hadoop, which implements a distributed file system (HDFS). HDFS is highly fault tolerant, is designed to be deployed on inexpensive hardware, and provides high-throughput access to application data, which suits applications with large data sets. Being built on Hadoop means that HBase is inherently scalable and offers high throughput.

HBase uses a key/value storage model, so query performance hardly deteriorates as the amount of data grows.

HBase is a column database. If a table has a large number of fields, you can put some of the fields on one part of the cluster and the rest on another part to spread the load.

However, the cost of such a complex storage structure and distributed design is that HBase is not fast even when it stores only a small amount of data. So I often tell people that HBase is not fast; it just does not get noticeably slower when the amount of data is large.

Scenarios where HBase is not suitable:

  • The main requirement is data analysis, such as making reports.
  • A single table holds no more than 10 million rows.

Scenarios where HBase is suitable:

  • A single table holds more than 10 million rows, and concurrency is quite high.
  • Data analysis requirements are weak, or the analysis does not need to be very flexible or real-time.

3. HBase deployment architecture

1. Deployment architecture

HBase has two types of servers: the Master and the RegionServer. Generally, an HBase cluster has one Master and multiple RegionServers.

The Master maintains table structure information, while the actual data is served by the RegionServers. The table data that a RegionServer manages is stored directly in Hadoop's HDFS.

HBase has an unusual property: clients obtain data directly from the RegionServers, without going through the Master. So when the Master fails, you can still query data; you only lose the table management capability.

RegionServers depend on the ZooKeeper service, and HBase cannot work without ZooKeeper. ZooKeeper manages information about all HBase RegionServers, including which RegionServer each data segment is stored on. Each time a client connects to HBase, it first communicates with ZooKeeper to find out which RegionServer it needs, and then connects to that RegionServer.
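The lookup flow described here can be sketched as a two-step toy model: ask a directory which server covers a rowkey, then read from that server. Everything in the sketch is hypothetical (the host names, the start keys, the data), and it simplifies heavily: the real lookup involves more steps and client-side caching, which are left out.

```python
from bisect import bisect_right

# Hypothetical directory: the first rowkey of each data segment and the
# RegionServer that hosts it. "" means "from the very beginning".
REGION_START_KEYS = ["", "h", "p"]  # must stay sorted
REGION_HOSTS = ["rs1.example", "rs2.example", "rs3.example"]

# Hypothetical data held by each RegionServer.
REGION_SERVER_DATA = {
    "rs1.example": {"apple": "red"},
    "rs2.example": {"kiwi": "green"},
    "rs3.example": {"plum": "purple"},
}


def locate_region_server(rowkey):
    """Step 1: ask the directory which RegionServer covers this rowkey."""
    # The covering segment is the last one whose start key is <= rowkey.
    index = bisect_right(REGION_START_KEYS, rowkey) - 1
    return REGION_HOSTS[index]


def get(rowkey):
    """Step 2: read directly from that RegionServer; the Master is not involved."""
    host = locate_region_server(rowkey)
    return REGION_SERVER_DATA[host].get(rowkey)
```

Note that neither function mentions the Master at all, which is the point of the paragraph about Master failures.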

2. Region

A Region is a collection of data. A table in HBase generally has one or more regions. A Region has the following features:

  • A Region cannot span servers. One RegionServer hosts one or more Regions.
  • When the amount of data is small, one Region is enough to store all of it. When the amount of data grows large, HBase splits the Region.
  • When HBase performs load balancing, a Region may be moved from one RegionServer to another.
  • Regions are built on HDFS. All of a Region's data access operations are implemented by calling the HDFS client interface.
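The splitting behaviour in the list above can be sketched with a toy table that cuts a Region in two once it grows past a limit. The class names and the row-count threshold are hypothetical stand-ins chosen to keep the example short; this is not how HBase itself decides when to split.

```python
class ToyRegion:
    """A Region holds the rows whose rowkeys are >= start_key."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key
        self.rows = dict(rows or {})


class ToyTable:
    MAX_ROWS_PER_REGION = 4  # hypothetical threshold, for illustration only

    def __init__(self):
        # A new table starts with a single Region covering every rowkey.
        self.regions = [ToyRegion("")]

    def put(self, rowkey, value):
        # The covering Region is the last one whose start key is <= rowkey.
        region = [r for r in self.regions if r.start_key <= rowkey][-1]
        region.rows[rowkey] = value
        if len(region.rows) > self.MAX_ROWS_PER_REGION:
            self._split(region)

    def _split(self, region):
        """Cut a Region in two at its middle rowkey."""
        keys = sorted(region.rows)
        middle_key = keys[len(keys) // 2]
        upper_rows = {k: v for k, v in region.rows.items() if k >= middle_key}
        region.rows = {k: v for k, v in region.rows.items() if k < middle_key}
        # Insert the new Region right after the old one to keep the list sorted.
        position = self.regions.index(region) + 1
        self.regions.insert(position, ToyRegion(middle_key, upper_rows))


table = ToyTable()
for key in "abcde":
    table.put(key, key.upper())
# The fifth row pushed the only Region over the limit, so it was split at "c".
```

After the split, each half could be moved to a different RegionServer, which is what makes load balancing possible.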

3. RegionServer

A RegionServer is a container that holds Regions; it is a service running on a server. Typically, only one RegionServer service is installed on each server. After the client obtains the address of the RegionServer from ZooKeeper, it reads data directly from that RegionServer.

4. Master

In HBase, the Master acts more like a handyman than a leader. After obtaining the address of the RegionServer from ZooKeeper, the client reads data directly from that RegionServer. In fact, not only reads but also inserts, deletes, and other data operations are performed directly on the RegionServer, without going through the Master.

The Master is only responsible for coordinating operations such as creating tables, deleting tables, moving Regions, and merging Regions. What these operations have in common is that they span multiple RegionServers, so HBase assigns them to the Master.

The advantage of this structure is that it greatly reduces the cluster's dependence on the Master. A cluster generally has only one or two Master nodes, so if it depended heavily on them, the Master would become a single point of failure. In HBase, even if the Master is down, the cluster keeps running normally and can still store and delete data.
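As a rough illustration of this division of labour, the toy cluster below routes table management through a Master flag while data operations never check it. The class, its methods, and the exception are all hypothetical names for this sketch, not part of any HBase API.

```python
class MasterUnavailable(Exception):
    """Raised when a table management operation needs a Master that is down."""


class ToyCluster:
    def __init__(self):
        self.master_up = True
        # table name -> {rowkey: value}; stands in for the RegionServers.
        self.tables = {}

    # Management operation: coordinated by the Master.
    def create_table(self, name):
        if not self.master_up:
            raise MasterUnavailable("cannot create table: Master is down")
        self.tables[name] = {}

    # Data operations: served by the RegionServers, no Master check.
    def put(self, table, rowkey, value):
        self.tables[table][rowkey] = value

    def get(self, table, rowkey):
        return self.tables[table].get(rowkey)

    def delete(self, table, rowkey):
        self.tables[table].pop(rowkey, None)


cluster = ToyCluster()
cluster.create_table("users")
cluster.master_up = False          # simulate a Master failure
cluster.put("users", "u1", "Ann")  # storing data still works
```

With the Master flag off, `put`, `get`, and `delete` keep working on existing tables, while `create_table` raises `MasterUnavailable`.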

4. HBase storage architecture

1. Storage architecture

The most basic unit of storage is the column, and one or more columns form a row. Traditional databases align columns strictly: if this row has three columns A, B, and C, the next row must also have the three columns A, B, and C. In HBase, this row may have three columns A, B, and C while the next row has four columns A, E, F, and G.

In HBase, the columns of different rows can be completely different. The data of one row and the data of another row can be stored on different machines, and even the columns of the same row can be stored on different machines. Each row has a unique rowkey that identifies it. Each column can have multiple versions, whose values are stored in cells, and several columns can be grouped into a column family.

2. Rowkey

A rowkey is a unique string specified by the user. The rowkey determines where the row is stored. HBase cannot sort rows by a column; the system always sorts rows by rowkey in lexicographic (dictionary) order. The rowkey is the only thing that determines the storage order of rows.

If you insert data with a rowkey that already exists, the existing row is updated. The previous values are not discarded; they are kept as older versions of the cell, and you need to pass the version parameter to retrieve them. A column can store multiple versions, each in its own cell, and the cell is the smallest unit of data storage.
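Dictionary order has a consequence that is easy to overlook: numeric IDs used as plain strings do not sort numerically. The sketch below shows the effect with Python's string sorting, which is also lexicographic. The zero-padding fix is a common workaround that I am adding for illustration, not something taken from the book, and the width of 5 is an arbitrary choice.

```python
numeric_ids = [1, 2, 10, 21, 100]

# Naive rowkeys: plain decimal strings, compared character by character.
naive_keys = sorted(str(n) for n in numeric_ids)

# Zero-padded rowkeys: fixed width, so dictionary order matches numeric order.
padded_keys = sorted(f"{n:05d}" for n in numeric_ids)

print(naive_keys)   # ['1', '10', '100', '2', '21']
print(padded_keys)  # ['00001', '00002', '00010', '00021', '00100']
```

Because the storage order cannot be changed later by sorting on a column, it pays to think about the rowkey format before loading data.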

3. Column family

Columns are grouped into column families, which are defined when the table is created. Many of a table's attributes, such as expiration time, block cache, and compression, are defined on a column family rather than on the table or on a column.

Different column families in the same table can have completely different attribute configurations, but all columns in the same column family share the same attributes, because attributes are defined on the column family. A table without a column family makes no sense, because a column cannot exist without a column family.

A column name has the form column family:column, for example brother:age or brother:name. The purpose of a column family is this: HBase tries to keep the columns of the same column family on the same machine. If you want several columns to be stored together, put them in the same column family.

How many column families are appropriate for a table? The official advice is to keep the number as small as possible (one is generally enough). Too many column families can significantly degrade performance, and they also make bugs more likely.

4. Cell

A column can store multiple versions of a value, and each version is stored in its own cell. Versions are distinguished by version numbers. The full expression that uniquely identifies one value is therefore rowkey:column family:column:version. The version number can be omitted; if you leave it out, HBase returns the latest version by default.

A Region is a collection of rows. The rows of a Region are sorted in lexicographic order of their rowkeys.

5. Summary
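The addressing scheme can be sketched as a nested dictionary keyed by rowkey, column, and version. This is a toy model, not the HBase client API: the class name is made up, and the simple counter used as the version number is an assumption for readability (HBase uses timestamps as versions by default).

```python
import itertools


class ToyVersionedTable:
    def __init__(self):
        # (rowkey, "family:column") -> {version: value}
        self._cells = {}
        # Hypothetical version source: 1, 2, 3, ...
        self._clock = itertools.count(1)

    def put(self, rowkey, column, value):
        """Writing to an existing rowkey and column adds a new version."""
        version = next(self._clock)
        self._cells.setdefault((rowkey, column), {})[version] = value
        return version

    def get(self, rowkey, column, version=None):
        """Without a version, return the value with the highest version number."""
        versions = self._cells.get((rowkey, column), {})
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)


table = ToyVersionedTable()
first = table.put("row1", "brother:age", "20")
second = table.put("row1", "brother:age", "21")  # same rowkey: an update
```

Here `table.get("row1", "brother:age")` returns the newer value, while passing `version=first` retrieves the older one that the update did not discard.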

1. Function comparison between HBase and relational databases

2. Review of main concepts

  • Master
  • RegionServer
  • HDFS
  • Client
  • ZooKeeper
  • Region
  • Row
  • Rowkey
  • Column Family
  • Column
  • Cell

If these terms bring to mind how they relate to one another, congratulations: you have stepped through the door of HBase.

3. Reference documents

  • The HBase Doesn’t Sleep Book
  • Understand column and row databases in one minute
  • Blog.51cto.com/flyfish225/…
