- The previous article was fairly simple, but the concepts it covered are important and worth understanding and remembering. This article looks at the internal structure of HBase; you need this background before moving on to programming against HBase
Introduction to HBase
- Hadoop was implemented and refined from the ideas in papers published by Google. It is a distributed system that provides high availability, file replication, and related features. It runs on ordinary hardware, offers storage and backup for very large files, and scales well in both capacity and throughput
- HBase is built on Hadoop, so it inherits all of the properties above. HBase uses a Key/Value storage model, so query performance does not degrade as the amount of data grows. In addition, HBase stores data by column and spreads the load across different machines. That distribution adds latency, namely network transmission delay and the time needed to assemble and present the data, so HBase is not especially fast on small data sets, but it does not slow down noticeably on large ones
- Data analysis is a weak point of HBase. In general, consider HBase when tables are large, concurrency is high, and analysis requirements are light
Deployment architecture
- There are many examples of building an HBase cluster on the Internet. I have built a five-node high-availability HBase cluster myself, and I will upload my configuration files for reference
- Before building a cluster, we need to know what each part of HBase does. Otherwise we end up following articles step by step without understanding what we are doing
- HBase is deployed as Master servers and RegionServer servers. HA can be configured for the Master, that is, one active node and one standby node. When the active node fails, the standby node takes over (if the administrator is taken out, the deputy administrator steps in). RegionServers serve the data, which is stored directly in Hadoop's HDFS
- Zookeeper plays an important role in HBase. What does it do? Think of it as a few teachers (the Zookeeper cluster). When a class is formed, the teachers hold an election to choose a class monitor (the active node). If the monitor later does something wrong (the active node goes down or hits an error), the teachers choose a new monitor so the class still has a leader. In the same way, Zookeeper selects the active node among N servers, and when the active node fails it selects another node to replace it. Zookeeper is responsible for this election, which keeps the cluster highly available
- What is the relationship between Zookeeper and HBase? When you read data from HBase, the client connects directly to a RegionServer. If the Master node is down, you can still get the data you need, but you cannot create a new table
- Isn't the Master like Hadoop's NameNode, handing out the addresses of the storage servers? No. The information about which RegionServer hosts which data segment is found through Zookeeper (in current HBase versions, Zookeeper stores the location of the hbase:meta table, and that table maps Regions to RegionServers). Each time a client connects to HBase, it first communicates with Zookeeper to find out which RegionServer it needs (which server hosts the data it wants) and then connects to that server. This is why Zookeeper is so important to HBase
- This is the overall HBase architecture. It may be incomplete, because the client does not interact only with Zookeeper, but it reflects my current understanding
- The client obtains the RegionServer address from Zookeeper and then reads data directly from the RegionServer. All data operations, including inserts and deletes, are also performed directly against the RegionServer. A table is unique within a database and can span RegionServers; HBase keeps this table information on the Master. What is the advantage of this structure? If HA is not configured and the Master dies, the cluster can still insert, delete, update, and query data normally; it just cannot change the database structure
- RegionServer: a container that holds Regions. After the client obtains the RegionServer address from Zookeeper, it reads data directly from the RegionServer
- Region: a table in HBase consists of one or more Regions, that is, a table is divided into multiple Regions. A Region is therefore a subset of the data in a table
- A Region cannot span servers. The data of a Region is served by exactly one RegionServer and can only be read through that server
- If a Region grows too large, HBase splits it
- Regions are stored on HDFS. All data access for a Region goes through the HDFS client interface
- A Region can be moved from one RegionServer to another
- This section describes HBase components and HBase storage architecture
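The division of labour described in this section can be sketched as a toy model in Python. This is not the HBase client API: `ToyRegionServer`, `ToyMaster`, `ToyClient`, and the `directory` dictionary (standing in for the lookup this article attributes to Zookeeper) are hypothetical names, and each table is assumed to have a single Region to keep the sketch short.

```python
class ToyRegionServer:
    """Hypothetical stand-in for a RegionServer: keeps rows in memory."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, rowkey, value):
        self.rows[rowkey] = value

    def get(self, rowkey):
        return self.rows.get(rowkey)


class ToyMaster:
    """Hypothetical stand-in for the Master: only changes table structure."""

    def __init__(self, directory):
        self.alive = True
        self.directory = directory

    def create_table(self, table, server):
        if not self.alive:
            raise RuntimeError("Master is down: table structure cannot change")
        self.directory[table] = server


class ToyClient:
    """Looks up the server in the directory, then talks to it directly."""

    def __init__(self, directory):
        self.directory = directory

    def put(self, table, rowkey, value):
        self.directory[table].put(rowkey, value)

    def get(self, table, rowkey):
        return self.directory[table].get(rowkey)


directory = {}                      # assumed stand-in for the Zookeeper lookup
master = ToyMaster(directory)
master.create_table("users", ToyRegionServer("rs1"))

client = ToyClient(directory)
client.put("users", "row1", "alice")

master.alive = False                # simulate a Master failure without HA
value_after_failure = client.get("users", "row1")

try:
    master.create_table("orders", ToyRegionServer("rs2"))
    schema_change_ok = True
except RuntimeError:
    schema_change_ok = False
```

In this sketch the read after the simulated Master failure still succeeds because the client never routes data operations through the Master, while the attempt to create a second table is rejected.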
Storage architecture
- The previous article briefly introduced row storage and column storage. Unlike an RDBMS, HBase is a column-oriented database, so its basic storage unit is the column, and one or more columns form a row. Different rows can be stored on different machines, and even the columns of one row can be stored apart, since each column family is kept in its own files
- Each row has a rowkey that uniquely identifies it. Each column can have multiple versions, which are stored in a cell. Several columns can also be grouped into a column family
- In other words, a cell is the collection of the versions of one column
Row key
- Rows are sorted by rowkey, so the rowkey determines where a row is stored in the table. Rowkeys are compared byte by byte from left to right. For example, the rowkeys row1, row2, row11 are stored in the order row1, row11, row2
- If an inserted rowkey matches an existing one, the new data replaces the current value and the previous value is kept as an older version rather than lost (up to the number of versions the column family is configured to keep). You can retrieve it by its timestamp version. The versions of one column in one row together make up a cell
- The row key can be any array of bytes
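The byte-by-byte ordering can be reproduced with plain Python `bytes`, which compare lexicographically in the same left-to-right way. This is only an illustration of the ordering rule, not HBase code; the zero-padded variant is an assumed workaround that is not part of the original article.

```python
# Rowkeys as raw bytes, in insertion order.
rowkeys = [b"row1", b"row2", b"row11"]

# bytes objects compare byte by byte from left to right,
# so b"row11" sorts before b"row2" (0x31 < 0x32 at the fifth byte).
sorted_rowkeys = sorted(rowkeys)

# Assumed workaround: zero-pad the numeric part so that the
# byte order matches the numeric order.
padded_rowkeys = sorted(f"row{n:03d}".encode() for n in (1, 2, 11))

print(sorted_rowkeys)
print(padded_rowkeys)
```

The first list comes out as row1, row11, row2, while the padded keys come out in numeric order.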
Column family
- Columns can be grouped into a column family
- When creating a table you do not need to specify its columns, but you must specify its column families. Columns can be added and removed freely, whereas column families should not change much once defined, because many table properties are set at the column family level. When a property is set on a column family, every column in that family shares it. A column cannot exist without a column family, which is why column families must be specified when the table is created
- The purpose of a column family is as follows: HBase tries to store the columns of the same column family together. Therefore, if you want several columns to be stored together, put them in the same column family
- Because columns belong to column families, we must name the column family when looking up a column, in the form columnfamily:column
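The rules above (column families fixed at table creation, columns added freely, columns addressed as columnfamily:column) can be sketched with a small in-memory class. `ToyTable` and its methods are hypothetical and only mimic the behaviour described here; they are not the HBase API.

```python
class ToyTable:
    """Hypothetical table: families are fixed at creation, columns are free-form."""

    def __init__(self, name, column_families):
        if not column_families:
            raise ValueError("at least one column family is required")
        self.name = name
        self.families = set(column_families)
        self.rows = {}  # rowkey -> {"columnfamily:column": value}

    def put(self, rowkey, column, value):
        family, sep, _qualifier = column.partition(":")
        if not sep:
            raise ValueError("column must be written as columnfamily:column")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(rowkey, {})[column] = value

    def get(self, rowkey, column):
        return self.rows.get(rowkey, {}).get(column)


table = ToyTable("users", ["info"])
table.put("row1", "info:name", "alice")
table.put("row1", "info:email", "alice@example.com")  # new column, no schema change

try:
    table.put("row1", "stats:logins", 3)  # family "stats" was never declared
    unknown_family_accepted = True
except KeyError:
    unknown_family_accepted = False
```

Adding `info:email` needs no schema change because the `info` family already exists, whereas writing to the undeclared `stats` family is rejected.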
Cell
- As noted earlier, the versions of a value make up a cell, so columnfamily:column alone does not identify a single value. You also need the timestamp within the cell, that is, columnfamily:column:timestamp, to pin down one version of a column. Adding the rowkey, the complete coordinates of a value are
rowkey:columnfamily:column:timestamp
- The version number defaults to the current timestamp, or you can specify it yourself
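The rowkey:columnfamily:column:timestamp coordinates can be illustrated with a minimal sketch. `ToyCellStore` is a hypothetical in-memory class, not the HBase API; the default timestamp of the current time in milliseconds is modelled on HBase's default behaviour.

```python
import time


class ToyCellStore:
    """Hypothetical store addressed by rowkey:columnfamily:column:timestamp."""

    def __init__(self):
        self.cells = {}  # (rowkey, family, column) -> {timestamp: value}

    def put(self, rowkey, family, column, value, timestamp=None):
        if timestamp is None:
            timestamp = int(time.time() * 1000)  # default: current time in ms
        self.cells.setdefault((rowkey, family, column), {})[timestamp] = value
        return timestamp

    def get(self, rowkey, family, column, timestamp=None):
        versions = self.cells.get((rowkey, family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # no timestamp given: newest version
        return versions.get(timestamp)


store = ToyCellStore()
store.put("row1", "info", "name", "alice", timestamp=100)
store.put("row1", "info", "name", "alicia", timestamp=200)

latest = store.get("row1", "info", "name")                # newest version
older = store.get("row1", "info", "name", timestamp=100)  # explicit version
```

Without a timestamp the lookup falls back to the newest version; with one, it returns exactly that version or nothing.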
Relationship between Region and row
- A Region is a collection of rows. Rows in a Region are sorted in lexicographic (dictionary) order of their rowkeys
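Because rows are kept in rowkey order, finding the Region that holds a given rowkey amounts to a search over sorted key ranges. The sketch below assumes each Region covers the range from its start key (inclusive) up to the next Region's start key (exclusive); the start keys, Region names, and `find_region` function are made up for illustration.

```python
from bisect import bisect_right

# Assumed layout: sorted start keys, one per Region.
# The first Region starts at the empty key and so covers the smallest rowkeys.
region_start_keys = [b"", b"row3", b"row6"]
region_names = ["region-a", "region-b", "region-c"]


def find_region(rowkey):
    """Return the name of the Region whose key range contains rowkey."""
    # bisect_right finds the first start key greater than rowkey;
    # the Region just before that position owns the rowkey.
    index = bisect_right(region_start_keys, rowkey) - 1
    return region_names[index]


lookups = {key: find_region(key) for key in (b"row1", b"row11", b"row3", b"row9")}
print(lookups)
```

Note that `row11` lands in the same Region as `row1`, because the comparison is byte by byte rather than numeric.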
- As for the configuration files I used to build the cluster, my configuration is fairly simple. I hope it helps; you can download it directly from the attachment
- Alternatively, download hbase-conf from the web disk