This is the 10th day of my participation in the August More Text Challenge

The Cornerstone of Big Data Storage: HBase

A Complete Guide to HBase


@[toc]

1. HBase

1. What is HBase?

The HBase official website is hbase.apache.org/. Its logo is a beautiful orca.

The official description of HBase is simple and straightforward:

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

Apache HBase is a database built on Hadoop: a distributed, scalable big data storage engine. Its most notable features are:

1. HBase supports very large data sets: billions of rows by millions of columns. Data at this scale is enough to overwhelm any of the storage engines we studied in the J2EE stage.
2. HBase supports random, real-time reads and writes over large amounts of data. Even with massive data sets, reads and writes can complete at the millisecond level.
3. HBase has been deeply integrated with Hadoop from the beginning. HBase persists its files on Hadoop and inherits the powerful scalability Hadoop provides; Hadoop can build huge clusters out of cheap PCs. HBase is also deeply integrated with Hadoop's MapReduce computing framework and is actively integrating with Spark. This lets HBase fit easily into the whole big data ecosystem.
4. HBase data is strongly consistent. From the perspective of the CAP theorem, HBase is a CP system. This design frees programmers from worrying about dirty reads, phantom reads, and other eventual-consistency issues.
5. Most importantly, the HBase framework is efficient. HBase has an active open source community, and its performance has been proven in many large commercial products; Facebook's entire messaging infrastructure, for example, is built on HBase.

It can be said that it is the incomparable performance provided by Hadoop and HBase that supports the whole technical system of big data. This is like having an operating system that can store data through files and memory before various applications can be developed.

When it comes to Hadoop+HBase, we have to mention the three papers published by Google. For a long time, Internet companies focused on building services, and their business data rarely reached a truly massive scale. Only a search-engine giant like Google had done in-depth research on processing massive data.

In 2003, Google published the Google File System (GFS) paper, describing an extensible distributed file system for large, distributed, data-intensive applications. It runs on cheap, commodity hardware and provides fault tolerance. In essence, files are split into chunks and stored redundantly on clusters of commodity machines.

Following this, in 2004, Google published the MapReduce paper, which described a distributed computing method for big data. The main idea is to decompose a task, process the pieces simultaneously on many computing nodes with modest processing power, and then combine the results to complete the overall computation.

Then in 2006, Google published the BigTable paper, describing a sparse, multi-dimensional sorted map for managing structured data, built on top of GFS and MapReduce.

It was these three papers that set off the big data craze for open source software. People developed HDFS file storage based on GFS. The MapReduce computing framework has also become the standard for massive data processing. HDFS and MapReduce are combined to form Hadoop. BigTable has inspired countless NoSQL databases. HBase inherits the traditional BigTable concept. Therefore, Hadoop+HBase simulates the three cornerstones of Google’s massive web page processing, and they become the cornerstones of open source big data processing.

2. HBase data structure

HBase can be used as a database, but in order to handle massive amounts of data, the way it stores data is very different from the traditional relational databases we are familiar with. Although it has logical structures such as tables and columns, it stores data as K-V key-value pairs.

Vertically, each table in HBase consists of a Rowkey and several column families. The Rowkey is the unique identifier of each row of data, and you must ensure the Rowkey is unique when managing data. HBase still organizes data into columns, but each column belongs to a column family. All rows in the same table share the same set of column families, while the columns inside a column family can differ from row to row, so a table can scale out to a very large number of columns. HBase recommends defining no more than three column families per table; additional data should be expanded as columns instead.

Horizontally, HBase records are divided into regions and stored on different RegionServers. Backups are also kept across different RegionServers, and regions provide the basis for fault recovery.

Finally, although HBase data is ultimately stored as files on HDFS, those files are not simple text files but binary files optimized and compressed by HBase, so they cannot be viewed directly.

3. HBase infrastructure

The HBase infrastructure is as follows:

Among them,

  • Client: provides interfaces for accessing HBase and maintains caches to speed up access to HBase.
  • RegionServer: handles users' read and write requests and does the real work. It stores data as StoreFiles in different HDFS directories.
  • HMaster: maintains cluster metadata, monitors the service status of RegionServers, and uses ZooKeeper to expose cluster server information to clients.

4. HBase application scenarios

Remember the Hive framework we covered earlier? Comparing it with Hive helps to understand HBase's application scenarios.

Hive provides SQL-based query statistics for massive data. However, Hive does not store data, and all data operations are performed on files in HDFS. Therefore, the optimization of data query operations is limited. Hive also cannot manage data directly and relies on MapReduce to manage data, resulting in high latency. So Hive is usually only suitable for some OLAP scenarios and is usually used in combination with other components.

The difference between HBase and Hive is obvious. HBase also stores its data on the Hadoop Distributed File System (HDFS), but the data it stores is optimized and indexed by HBase itself. Therefore HBase stores data efficiently, provides far better performance than storing raw files on HDFS directly, and serves as the cornerstone of big data storage. HBase manages data with a key-value model reminiscent of Redis, so data can be added, deleted, or modified in milliseconds, and it also provides a complete client API, which makes it usable like a traditional database and suitable for most OLTP scenarios. However, its disadvantage is also obvious: data organized this way is inherently unsuitable for large-scale statistics, so in many OLAP scenarios other components such as Spark and Hive are still needed to provide large-scale data analysis.

In the actual practice part, we need to understand not only the HBase product, but also how it works with other big data components.

2. HBase installation

1. Experimental environment and pre-software

Experimental environment: three servers with CentOS 7 pre-installed, named hadoop01, hadoop02, and hadoop03. The IP addresses are not important because the cluster is configured by machine name. It is important to configure time synchronization on all three servers. Later, the HBase HMaster will be installed on hadoop01, so you need to configure passwordless SSH login from hadoop01 to all three machines.

To configure passwordless login, generate a key on hadoop01 with ssh-keygen, then distribute it to the other machines with ssh-copy-id hadoop02 (and likewise for the others). After that, running ssh hadoop02 on hadoop01 no longer asks for a password.

Pre-installed software: JDK 8; Hadoop 3.2.2; Spark 3.1.1.

Optional software: install the MySQL service on hadoop01, and install a Spark 3.1.1 cluster on the three machines. Install Hive 3.1.2 on hadoop02 and deploy the Hive service Hiveserver2. (This software is not required, but will be used when integrating with HBase.)

2. Install Zookeeper

HBase relies on Zookeeper to coordinate data in clusters. Zookeeper is built-in in HBase release packages. However, an additional Zookeeper cluster is usually set up during cluster construction to reduce direct dependency between components.

Download Zookeeper 3.5.8 from the Zookeeper official website (the version should not be earlier than 3.4.x): apache-zookeeper-3.5.8-bin.tar.gz. After uploading the file to the server, decompress it to /app/zookeeper, then go to ZooKeeper's conf directory.

cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
# Modify the ZooKeeper data directory
dataDir=/app/zookeeper/data
# Add the cluster nodes
server.1=hadoop01:2888:3888
server.2=hadoop02:2888:3888
server.3=hadoop03:2888:3888

After the modification is complete, distribute the ZooKeeper directory to the three servers in the cluster. Then create a myid file in the ZooKeeper data directory /app/zookeeper/data on each server. The file contains the node ID of the current server in ZooKeeper: 1 on hadoop01, 2 on hadoop02, and 3 on hadoop03.

After the configuration, run bin/zkServer.sh start on each node to start the ZooKeeper cluster. After startup, you can see a QuorumPeerMain process with the jps command, indicating that the cluster has started successfully.

3. Download HBase

Go to hbase.apache.org/downloads.h… to download HBase. Note that the HBase version must be compatible with the Hadoop version. Here is the Hadoop compatibility table currently posted on the official website:

We will use the latest version, HBase 2.4.4, this time. (Note that the latest version is usually not recommended for production.)

4. Set up the HBase cluster

Download the HBase deployment package: hbase-2.4.4-bin.tar.gz, and upload it to the hadoop01 server.

1 Decompress the package and move it to the /app/hbase directory.

tar -zxvf hbase-2.4.4-bin.tar.gz
mv hbase-2.4.4 /app/hbase

Go to the hbase conf configuration directory and modify the configuration file.

2 Modify hbase-env.sh. Add the following configuration at the end of the file:

# Specify the JDK
export JAVA_HOME=/app/java/jdk1.8.0_212
# Specify whether to use the HBase built-in ZK. The default is true.
export HBASE_MANAGES_ZK=false

3 Modify hbase-site.xml and add the following configuration

<configuration>
	<property> 
		<name>hbase.rootdir</name> 
		<value>hdfs://hadoop01:8020/hbase</value> 
	</property>
	<property> 
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	</property>
	<property>
		<name>hbase.master.port</name>
		<value>16000</value>
	</property>
	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
	</property>
	<property> 
		<name>hbase.zookeeper.property.dataDir</name>
		<value>/app/zookeeper/data</value>
	</property>
</configuration>

hbase.rootdir configures the HBase root directory in HDFS. This directory does not need to be created manually; in fact, it is best to make sure it does not already exist in HDFS.

4 Configure the RegionServers file to list all nodes in the cluster.

hadoop01
hadoop02
hadoop03

5 Synchronize the Hadoop configuration files. To keep the file contents consistent, symbolic links are used.

ln -s /app/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml /app/hbase/hbase-2.4.4/conf/
ln -s /app/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml /app/hbase/hbase-2.4.4/conf/

6 Distribute hbase to other servers

scp -r /app/hbase/ root@hadoop02:/app
scp -r /app/hbase/ root@hadoop03:/app

7 Configure HBase environment variables. Add the HBase installation directory as an environment variable and add its bin directory to the PATH.

vi ~/.bash_profile
# Add
export HBASE_HOME=/app/hbase/hbase-2.4.4
# and append $HBASE_HOME/bin to the PATH variable

8 Start HBase

bin/start-hbase.sh


After the startup is complete, you can use the jps command to see that the HMaster process is running on the current machine and the HRegionServer process is running on all three machines.

You can view the cluster status on the HBase management page: http://hadoop01:16010.

Note:

1. Clock synchronization must be configured on the three servers; otherwise HBase fails to start and throws a ClockOutOfSyncException.
2. The official documentation mentions that HBase is built on Hadoop, so HBase bundles Hadoop jar packages. Ideally the bundled version should match the actual Hadoop deployment. For example, HBase 2.4.4 bundles Hadoop 2.10.0 jars; for a real deployment it is best to replace them with the jars from your Hadoop cluster. In our learning environment the mismatch caused no problems, so we did not replace them, but be careful in production.
3. After the deployment is complete, you can see the HBase metadata directory /hbase in HDFS.

3. HBase basic operations

You can download an important document about HBase usage from hbase.apache.org/apache_hbas… It is the best material for learning HBase, but it is entirely in English and extremely long, so it takes some effort to work through.

1> Basic instructions

1. HBase client:

# View HBase command help
[root@192-168-65-174 hbase-2.4.4]# bin/hbase --help
# Start the HBase command line client
[root@192-168-65-174 hbase-2.4.4]# bin/hbase shell
# Query help
hbase:001:0> help
# Query existing tables
hbase:002:0> list

2. Basic table operations

# Create table user with the column family basicinfo
hbase:003:0> create 'user','basicinfo'
# Insert the first row
hbase:004:0> put 'user','1001','basicinfo:name','roy'
hbase:005:0> put 'user','1001','basicinfo:age',18
hbase:006:0> put 'user','1001','basicinfo:salary',10000
# Insert the second row
hbase:007:0> put 'user','1002','basicinfo:name','sophia'
hbase:008:0> put 'user','1002','basicinfo:sex','female'
# Insert the third row
hbase:009:0> put 'user','1003','basicinfo:name','yula'
hbase:010:0> put 'user','1003','basicinfo:school','phz school'

3. Data operation

# Query a single row
hbase:011:0> get 'user','1001'
# Query a single column
hbase:012:0> get 'user','1001','basicinfo:name'
# Query multiple versions of a column
hbase:031:0> get 'user','1001',{COLUMN => 'basicinfo:name',VERSIONS=>3}
# Query multiple rows
hbase:008:0> scan 'user'
hbase:009:0> scan 'user',{STARTROW => '1001',STOPROW => '1002'}

1. HBase can only query data by Rowkey, and the Rowkey is specified directly by the client. Therefore, Rowkey design is very important when using HBase, and the Rowkey must carry important business information.
2. In the scan command, STARTROW and STOPROW must be uppercase. The query range is left-closed and right-open: the start row is included and the stop row is excluded.

Use help 'get' or help 'scan' to see more query options, such as filtering data.

This result is very important: it lists all attributes of a column family.

hbase:012:0> desc 'user'
Table user is ENABLED
user
COLUMN FAMILIES DESCRIPTION
{NAME => 'basicinfo', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

# Change the number of versions kept by the column family
hbase:019:0> alter 'user',{NAME => 'basicinfo',VERSIONS => 3}
hbase:020:0> desc 'user'
COLUMN FAMILIES DESCRIPTION
{NAME => 'basicinfo', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '3', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

# Delete a column
hbase:021:0> delete 'user','1002','basicinfo:sex'
# Delete a whole row
hbase:022:0> deleteall 'user','1003'
# Delete all table data
hbase:023:0> truncate 'user'
# Delete the table (it must be disabled first)
hbase:024:0> disable 'user'
hbase:025:0> drop 'user'

In the desc output, you can see that HBase tables have many important attributes, which correspond to the ways HBase maintains data at the low level. For example, the COMPRESSION attribute specifies the compression format of HBase data. It is NONE by default and can be changed to the JDK built-in GZ or the Hadoop-integrated LZ4, among others; the other attributes can also be configured. See Appendix D of the official documentation for details.
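For reference, the same column family attributes can also be changed from the Java client API introduced later in this article. The following is a minimal sketch (assuming a Connection obtained as in the Java API section below, and the user/basicinfo table created above); it only illustrates the API, not a tuning recommendation.

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public void enableGzCompression(Connection connection) throws IOException {
    try (Admin admin = connection.getAdmin()) {
        // Rebuild the 'basicinfo' column family descriptor with GZ compression enabled.
        // Note: attributes not set here fall back to their defaults.
        ColumnFamilyDescriptor cfd = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("basicinfo"))
                .setCompressionType(Compression.Algorithm.GZ)
                .build();
        // Apply the change to the existing 'user' table
        admin.modifyColumnFamily(TableName.valueOf("user"), cfd);
    }
}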

2> HBase data structure

Based on the above experiments, we can understand the basic data structure of HBase

RowKey:

Rowkey is the unique primary key used to retrieve records, similar to key in Redis. Table data in HBase can be queried only by Rowkey. There are only three methods for accessing HBase data:

1. Use the get command to access the data for a single Rowkey.
2. Use the scan command to scan the whole table (the default).
3. Use the scan command with a Rowkey range for range queries.

The Rowkey can be any string with a maximum length of 64KB; in practice it is usually much shorter. Inside HBase, the Rowkey is stored as a byte array, and rows are stored in lexicographical order of the Rowkey.

In practice, it is important to design the Rowkey, often including important columns that are read frequently. And take full account of the sort storage feature, let some related data together as much as possible. For example, if we create a user table and often query by user ID, the Rowkey must contain the user ID field. If the user ID is placed at the beginning of the Rowkey, the data will be stored in order by the user ID, and Rowkey queries will be faster.
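For illustration, here is a minimal sketch of such a Rowkey (the helper name and format are made up for this example): it leads with a zero-padded user ID so all rows of one user sort together, and appends a reversed timestamp so that user's newest rows come first.

// Hypothetical helper: lead with a zero-padded user ID so rows of one user sort together,
// and append a reversed timestamp so that user's newest rows come first.
public static String userRowkey(long userId, long eventTimeMillis) {
    // 9999999999999 (13 nines) minus the epoch millis keeps a fixed 13-digit width
    // and makes larger (newer) timestamps sort earlier.
    return String.format("%010d_%013d", userId, 9999999999999L - eventTimeMillis);
}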

Column Family and Column:

Columns in HBase belong to a column family. Only column families are defined up front; columns can be added freely within a column family. To access a column, you must prefix it with the column family name joined by a colon. Column data is untyped and stored as raw bytes. Do not define too many column families in the same table.

Physically, all columns in a column family are stored together. HBase indexes and stores data at the column family level. Therefore, it is recommended that all columns in a column family have similar data structure and size, to improve HBase's data management efficiency.

Versions:

In HBase, a storage unit is uniquely identified by {row,column,version} data, which is called a cell. In HBase, there may be many cells with the same row and column but different versions. Using the PUT directive multiple times to specify the same row and column produces multiple versions of the same data.

By default, HBase uses the write timestamp as the version; the timestamp is what you see when searching data with the scan command. HBase stores versions in descending order of this timestamp, so reads return the latest data by default. The client can also specify the version when writing, and that version does not have to be strictly increasing.

If a piece of data has multiple versions, HBase returns only the latest version of the cell by default. For older versions, HBase provides a version reclamation mechanism that deletes them at some point.

For example, the following directive can specify how many versions to store

# Specify how many versions to keep
hbase:013:0> alter 'user',NAME => 'basicinfo',VERSIONS => 5
# Specify the minimum number of versions to keep
hbase:014:0> alter 'user',NAME => 'basicinfo',MIN_VERSIONS => 2
# Query multiple versions of a column
hbase:015:0> get 'user','1001',{COLUMN => 'basicinfo:name',VERSIONS => 3}
# Query raw data, including historical versions and delete markers
hbase:016:0> scan 'user',{RAW => true, VERSIONS => 10}

When using put and delete, you can also specify the version. See the help command for details.

In addition, when scan returns batch data, HBase returns a sorted result: cells are ordered by rowkey, then column family, then column, and then timestamp (descending). You can also request the rows in reverse order during a scan.
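The same version and ordering options are available from the Java client API. A minimal sketch (assuming a Table obtained from a Connection as in the Java API section below, and the user table created earlier):

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public void versionAndReverseDemo(Table table) throws IOException {
    // Read up to 3 versions of basicinfo:name for rowkey 1001
    Get get = new Get(Bytes.toBytes("1001"));
    get.addColumn(Bytes.toBytes("basicinfo"), Bytes.toBytes("name"));
    get.readVersions(3);
    Result result = table.get(get);
    for (Cell cell : result.getColumnCells(Bytes.toBytes("basicinfo"), Bytes.toBytes("name"))) {
        System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
    }

    // Scan the table with rows returned in reverse rowkey order
    Scan scan = new Scan();
    scan.setReversed(true);
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
    }
}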

Namespace

When creating a table, you can also specify the namespace to which the table belongs. For example,

create_namespace 'my_ns'
create 'my_ns:my_table','fam'
alter_namespace 'my_ns',{METHOD => 'set','PROPERTY_NAME' => 'PROPERTY_VALUE'}

list_namespace

drop_namespace 'my_ns'



In HBase, each namespace corresponds to a folder under the /hbase/data directory in HDFS. Tables in different namespaces are stored in isolation from each other.

HBase creates two namespaces by default. One is hbase, which stores HBase internal tables. The other is default, the default namespace; tables created without specifying a namespace go here.
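The same namespace operations are available through the Java Admin API. A minimal sketch (assuming a Connection obtained as in the Java API section below; the namespace name is just an example):

import java.io.IOException;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public void namespaceDemo(Connection connection) throws IOException {
    try (Admin admin = connection.getAdmin()) {
        // Create a namespace
        admin.createNamespace(NamespaceDescriptor.create("my_ns").build());
        // List all namespaces
        for (NamespaceDescriptor ns : admin.listNamespaceDescriptors()) {
            System.out.println(ns.getName());
        }
        // A namespace can only be deleted when it no longer contains tables
        // admin.deleteNamespace("my_ns");
    }
}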

4. HBase Principles

1. HBase file read and write framework

After you have a certain understanding of HBase, you can go back to the HBase architecture diagram to have a deeper understanding of the overall HBase structure.

1, StoreFile

A StoreFile is a physical file that stores data. StoreFiles are stored in HDFS in the HFile format. Each Store has one or more StoreFiles, and the data in each StoreFile is ordered. For example, the files of the user table live under the /hbase/data/default/user directory in HDFS.

2, MemStore

The write cache. Data in an HFile must be ordered, so data is first stored in the MemStore, sorted there, and written to an HFile when the flush condition is reached. Each flush creates a new HFile.

3, WAL (HLog)

Data can be written to an HFile only after being sorted in the MemStore, but data held only in memory can easily be lost. To solve this problem, data is first written to a file called the Write-Ahead Log (WAL, also called the HLog) before being written to the MemStore. When the system fails, the data can be rebuilt from this log file.
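On the client side, the WAL behavior of an individual write can be controlled through the Durability setting of a Put. A minimal sketch (assuming a Table for the user table created earlier); SKIP_WAL is shown only to illustrate the trade-off, not as a recommendation:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public void writeWithWal(Table table) throws IOException {
    Put put = new Put(Bytes.toBytes("1001"));
    put.addColumn(Bytes.toBytes("basicinfo"), Bytes.toBytes("name"), Bytes.toBytes("roy"));
    // SYNC_WAL: the edit is written to the WAL before the call returns (the usual behavior)
    put.setDurability(Durability.SYNC_WAL);
    // Durability.SKIP_WAL would be faster, but the edit lives only in the MemStore until a flush,
    // so it can be lost if the RegionServer crashes
    table.put(put);
}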

2. HBase data writing process

The HBase data writing process is as follows:

1. The Client locates the target region (via the hbase:meta table) and sends a write request to the HRegionServer.
2. The HRegionServer writes the data to the WAL.
3. The HRegionServer writes the data to the MemStore and returns success to the client.

The meta table in this figure is a table maintained by HBase itself. You can view it with the scan 'hbase:meta' command. It records all Region information in the system. The location of this table is maintained in ZooKeeper, and the info:server column holds the machine and port of the RegionServer serving each Region.
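From the Java client you can look up the same region-to-server mapping that the meta table provides, through the RegionLocator interface. A minimal sketch (assuming a Connection obtained as in the Java API section below, and the user table):

import java.io.IOException;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;

public void printRegionLocations(Connection connection) throws IOException {
    try (RegionLocator locator = connection.getRegionLocator(TableName.valueOf("user"))) {
        for (HRegionLocation location : locator.getAllRegionLocations()) {
            // Region name plus the RegionServer (host and port) that currently serves it
            System.out.println(location.getRegion().getRegionNameAsString()
                    + " -> " + location.getServerName());
        }
    }
}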

Data written by the client first goes into the MemStore. HBase later writes the MemStore data into StoreFiles. This process can also be triggered manually from the client with the flush command.

Scenarios that trigger a MemStore flush:

  • MemStore level restriction

    hbase.hregion.memstore.flush.size, 128MB by default. When a single MemStore in a Region reaches this limit, a flush is triggered and a new HFile is generated.

  • Region level restrictions:

    hbase.hregion.memstore.block.multiplier, 4 by default. When the total size of all MemStores in a Region reaches the upper limit, writes are blocked and a MemStore flush is triggered. The upper limit is hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier.

  • RegionServer Level limitation

    All MemStores on a RegionServer share a low watermark threshold and a high watermark threshold. When the total size of all MemStores reaches the low watermark, flushes are forced, processing MemStores from largest to smallest.

    When the total MemStore size reaches the high watermark, all writes are blocked and flushes are forced until the total MemStore size drops back below the low watermark.

    Two parameters are involved:

    hbase.regionserver.global.memstore.size is the RegionServer high watermark threshold. The default is None, which means 40% (0.4) of the JVM heap size. The old name of this parameter is hbase.regionserver.global.memstore.upperLimit.

    hbase.regionserver.global.memstore.size.lower.limit is the RegionServer low watermark, expressed as a percentage of the high watermark. The default is None, which means 95% (0.95) of the high watermark. The old name is hbase.regionserver.global.memstore.lowerLimit.

  • WAL Level Restrictions

    hbase.regionserver.maxlogs is the upper limit on the number of WAL files. When the number of WAL files on a RegionServer reaches this value, the system selects the regions corresponding to the oldest HLog files and flushes them. At that point a warning appears in the log: Too many WALs; count=…

  • Refresh MemStore regularly

    hbase.regionserver.optionalcacheflushinterval, default 3600000 (in milliseconds, i.e. 1 hour). This is the interval at which HBase periodically flushes the MemStore. In production this parameter is often set to 0 to disable automatic periodic flushes and protect service performance.

  • Manually invoke flush execution

In the HBase flush mechanism, writes are blocked only when the RegionServer reaches the high watermark, which directly affects the business. The other levels do not block writes, but they generally still have some impact on performance. Therefore, many production systems customize the MemStore flush strategy according to the business, such as triggering manual flushes regularly during off-peak hours, as sketched below.
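A manual flush can be triggered not only with the shell's flush command but also from the Java Admin API, which makes it easy to run from a scheduled job. A minimal sketch (assuming a Connection obtained as in the Java API section below; the table name is just an example):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public void flushTable(Connection connection) throws IOException {
    try (Admin admin = connection.getAdmin()) {
        // Flush all MemStores of the 'user' table into StoreFiles
        admin.flush(TableName.valueOf("user"));
    }
}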

3. HBase data reading process

The HBase data reading process is as follows:

1. The Client accesses ZooKeeper to find the location of the hbase:meta table, then reads meta. The meta table stores the region information of user tables.
2. Based on the namespace, table name, and Rowkey, the client locates the target region and the RegionServer that serves it in the meta table, and caches this location.
3. The client sends the read request to that RegionServer.
4. The RegionServer searches for the target data in both the MemStore and the StoreFiles, merges the versions, and returns the latest data to the client.
5. When StoreFiles are read, a BlockCache is used as a read cache to improve read efficiency: data blocks read from StoreFiles are cached in the BlockCache before being returned to the client.

Generally speaking, an HBase write completes once the data is in memory, while a read has to go through files. Therefore, writes are faster than reads in HBase.

4. HBase file compression process

A simple conclusion follows from the HBase read and write process: HBase writes are faster than reads. But if an interviewer asks why, this simple description will not be convincing. You need to dig deeper into the HBase internals to find the answer.

4.1 HBase Underlying LSM tree

Each Region in HBase stores a series of searchable key-value mappings, and the keys are organized with a Log-Structured Merge Tree (LSM tree). The LSM tree can be seen as an extension of the B-tree, and many NoSQL databases use LSM trees to store data.

The basic idea of an LSM tree is to keep incremental changes to the data in memory and write them to disk in batches when memory reaches a size limit. Write performance improves greatly, but reads become slightly more expensive, because the data on disk must be merged with the most recent changes in memory. The LSM tree uses this mechanism to split one large tree into N small trees. Data is first written to the small trees in memory; as they grow, they are flushed to disk, and the trees on disk are periodically merged asynchronously into one large tree to optimize read performance.

In HBase, the small trees are written to memory first. To avoid losing in-memory data, the writes are also persisted to disk (HBase's "disk" being files on HDFS). This corresponds to HBase's MemStore and HLog. The MemStore is the mutable in-memory store that records the most recent writes (put operations). When the MemStore tree reaches a certain size, it is flushed and becomes a StoreFile in HDFS. As mentioned earlier, HBase keeps every modification as a separate version, so multiple versions of the same key exist across the MemStore and the StoreFiles. This outdated data is redundant, and the many small trees are merged at that point to improve read performance.

In HBase 2.0, an important mechanism called In-Memory Compaction was introduced to optimize the in-memory tree of the LSM. It is an important feature of HBase 2.0: by introducing an LSM structure in memory to reduce redundant data, it lowers the flush frequency and the amount of data flushed to disk. See the official blog for details: blogs.apache.org/hbase/entry…

In HBase 2.0, you can modify hbase-site.xml to configure the memory compaction mode globally:

<property>
     <name>hbase.hregion.compacting.memstore.type</name>
     <value><none|basic|eager|adaptive></value>
</property>


Or you can configure it for a single column family:

alter 'user',{NAME=>'basicinfo',IN_MEMORY_COMPACTION=>'<none|basic|eager|adaptive>'}


There are four memory compaction modes to choose from; the default is BASIC. The EAGER mode additionally filters duplicate data in memory, so compared with BASIC it has extra in-memory filtering overhead, but the amount of data flushed to disk is smaller. EAGER is better suited to scenarios where a lot of data becomes obsolete quickly, such as MQ or shopping carts. ADAPTIVE is an experimental option whose basic function is to automatically decide whether EAGER mode should be enabled.

4.2 HBase File Compression Process

HBase uses the LSM tree to turn random I/O at the application level into sequential disk I/O, greatly improving write performance. However, the LSM tree also has a significant impact on read performance. Overall, compared with MySQL's B+ tree structure, HBase has higher write performance and lower read performance. In addition, the LSM structure is append-only: files are never modified, only added, which also suits Hadoop's file organization. And since HBase is designed for terabytes of data, this mechanism is almost a necessity.

Read operations are also optimized in HBase. For example, to speed up searching the in-memory data, HBase builds Bloom filters that quickly rule out data and reduce memory lookups. This corresponds to the BLOOMFILTER attribute seen in the desc output.
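The BLOOMFILTER attribute can also be set from the Java API when defining a column family. A minimal sketch (the column family name is the one used earlier; ROWCOL is shown only as an illustration: it filters on rowkey plus column at the cost of a larger filter):

import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

// Build a column family descriptor whose Bloom filter covers rowkey + column qualifier
ColumnFamilyDescriptor cfd = ColumnFamilyDescriptorBuilder
        .newBuilder(Bytes.toBytes("basicinfo"))
        .setBloomFilterType(BloomType.ROWCOL)   // the default is ROW; NONE disables the filter
        .build();
// Pass cfd to admin.createTable(...) or admin.modifyColumnFamily(...) as in the other examples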

In HBase, the flush command writes the data in memory to HDFS. At this point data is only flushed, not filtered; each flush writes a new file to HDFS.

In addition, the compact and major-compact directives are used to merge files.

hbase:016:0> help 'compact'
Compact all regions in passed table or pass a region row
to compact an individual region. You can also compact a single column
family within a region.
You can also set compact type, "NORMAL" or "MOB", and default is "NORMAL"
Examples:
Compact all regions in a table:
hbase> compact 'ns1:t1'
hbase> compact 't1'
Compact an entire region:
hbase> compact 'r1'
Compact only a column family within a region:
hbase> compact 'r1', 'c1'
Compact a column family within a table:
hbase> compact 't1', 'c1'
Compact table with type "MOB"
hbase> compact 't1', nil, 'MOB'
Compact a column family using "MOB" type within a table
hbase> compact 't1', 'c1', 'MOB'

hbase:015:0> help 'major_compact'
Run major compaction on passed table or pass a region row
to major compact an individual region. To compact a single
column family within a region specify the region name
followed by the column family name.
Examples:
Compact all regions in a table:
hbase> major_compact 't1'
hbase> major_compact 'ns1:t1'
Compact an entire region:
hbase> major_compact 'r1'
Compact a single column family within a region:
hbase> major_compact 'r1', 'c1'
Compact a single column family within a table:
hbase> major_compact 't1', 'c1'
Compact table with type "MOB"
hbase> major_compact 't1', nil, 'MOB'
Compact a column family using "MOB" type within a table
hbase> major_compact 't1', 'c1', 'MOB'


The Compact directive merges adjacent portions of small files into large files, and it does not delete obsolete data, so the performance cost is not significant. The Major-compact directive merges all storeFile files into one large file, at which point it deletes outdated data and consumes a lot of machine performance.

When a delete command is issued to delete a column, HBase does not delete the data directly. Instead, it adds a delete marker to the data so that the client can no longer read the value of that column. In the flush phase, older in-memory versions of the column are dropped, but the delete marker is kept so that historical versions cannot be queried by the client. In the compact phase, because the data has not yet been fully merged, the delete marker is still kept to guarantee that clients never see the historical versions. Only in the major-compact phase, when all data is merged into a single StoreFile, are the historical versions removed together with the delete marker.

The different versions of the data, including historical versions and delete markers, can be viewed with the following command.

hbase> scan 't1', {RAW => true, VERSIONS => 10}

HBase periodically triggers compact and major-compact based on the size and number of HFiles. For example, hbase.hregion.majorcompaction configures the interval of automatic major compaction.

However, it is recommended to set this parameter to 0 in production environments, disabling automatic major compaction, and to trigger major-compact manually on a schedule instead. File compaction involves a lot of merging and deleting of data, which can affect online performance, so a scheduled script is usually used to run major compaction at night when the business is not busy, for example with a small admin program like the sketch below.
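A scheduled major compaction can be issued from the Java Admin API as well (or from a cron job calling the shell). A minimal sketch (assuming a Connection obtained as in the Java API section below; the table name is an example). The call only requests the compaction, which then runs asynchronously on the RegionServers:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.CompactionState;
import org.apache.hadoop.hbase.client.Connection;

public void majorCompact(Connection connection) throws IOException {
    try (Admin admin = connection.getAdmin()) {
        // Request a major compaction of the 'user' table (asynchronous)
        admin.majorCompact(TableName.valueOf("user"));
        // Optionally poll the compaction state afterwards
        CompactionState state = admin.getCompactionState(TableName.valueOf("user"));
        System.out.println("Compaction state: " + state);
    }
}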

If a StoreFile grows too large, HBase has another mechanism, region split, which divides the region into several appropriately sized regions and assigns them to different RegionServers. In general, if HBase data operations are frequent, you can see the underlying files changing constantly.

5. HBase client

HBase supports various client APIs. As shown in the official documentation, HBase supports Rest, Thrift, C/C++, Scala, and Jython clients. Jython is an interesting pure-Java implementation of Python that is compatible with both Java and Python.

This time we will focus on the Rest and Java apis of HBase.

1, the Rest API

Basic Services Directive:

# Start the REST service; logs are written to $HBASE_LOGS_DIR
bin/hbase-daemon.sh start rest -p <port>
# Stop the REST service
bin/hbase-daemon.sh stop rest

Note that the HBase REST service listens on port 8080 by default. This port is heavily used and prone to conflicts; Spark, for example, also uses it by default. It is therefore better to specify another port, such as 9090.

HBase provides a rich set of REST endpoints for managing HBase. Some of them are listed below; see the official documentation for the complete list.

REST address, HTTP method, and meaning:

  • GET /status/cluster: view the cluster status
  • GET /namespaces: list the HBase namespaces; append a namespace to view its details
  • GET /namespaces/{namespace}/tables: list all tables under the namespace
  • POST /namespaces/{namespace}: create a namespace
  • DELETE /namespaces/{namespace}: delete the namespace (it must be empty before it can be deleted)
  • GET /{table}/schema: view the table structure; POST updates the table structure, PUT creates or updates a table, DELETE drops the table
  • GET /{table}/regions: view the regions of the table
  • PUT /{table}/{row_key}: write a row to the table
  • DELETE /{table}/{row_key}/{column:qualifier}/{version_id}: delete a historical version from the table
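As a quick usage example, the following sketch calls the cluster status endpoint with plain JDK 8 HttpURLConnection; it assumes the REST server was started on hadoop01 with -p 9090 as suggested above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RestStatusDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://hadoop01:9090/status/cluster");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ask the REST server for a JSON response instead of the default plain text
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}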

2, Java API

As a NoSQL database, HBase keeps its client design simple. To build an HBase client, import the following Maven dependencies:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.4.4</version>
</dependency>

Then you can invoke the APIs provided by HBase to operate on data. Keep a few key objects in mind: all operations on HBase are based on a Connection object, table structure management goes through the HBaseAdmin object, and table data is manipulated through the Table object. There is no need to memorize the rest of the code; look it up when you need it. Just note that the HBase 2.x source code contains many deprecated classes. They still work in the current version, but many are marked for removal in 3.x, so try not to use the deprecated classes.

1. Obtain the HBase connection

public Connection getConnection() throws IOException {
    final Configuration configuration = HBaseConfiguration.create();
    // Required parameters
    configuration.set("hbase.zookeeper.quorum", "hadoop01,hadoop02,hadoop03");
    configuration.set("hbase.zookeeper.property.clientPort", "2181");
    // Other optional parameters
    configuration.set("hbase.client.scanner.timeout.period", "10000");
    configuration.set("hbase.hconnection.threads.max", "0");
    configuration.set("hbase.hconnection.threads.core", "0");
    configuration.set("hbase.rpc.timeout", "60000");
    return ConnectionFactory.createConnection(configuration);
}

2. Check whether the table exists

// Check whether the table exists. The default namespace is used. Other namespaces need to be passed {namespace}:tableName
    public Boolean isTableExist(Connection connection,String tableName) throws IOException {
        final HBaseAdmin admin = (HBaseAdmin) connection.getAdmin();
        return admin.tableExists(TableName.valueOf(tableName));
    }


3. Create a table

    public void createTable(Connection connection, String tableName, String... columnFamilys) throws IOException {
        final HBaseAdmin admin = (HBaseAdmin) connection.getAdmin();
        if (columnFamilys.length <= 0) {
            System.out.println("Please set the column family information");
        } else if (admin.tableExists(TableName.valueOf(tableName))) {
            System.out.println("Table " + tableName + " already exists.");
        } else {
            // New API in 2.0.0. Many APIs such as HTableDescriptor are deprecated in 2.0.0 and expected to be removed after 3.0.
            TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TableName.valueOf(tableName));
            List<ColumnFamilyDescriptor> families = new ArrayList<>();
            for (String columnFamily : columnFamilys) {
                final ColumnFamilyDescriptorBuilder cfdBuilder = ColumnFamilyDescriptorBuilder.newBuilder(ColumnFamilyDescriptorBuilder.of(columnFamily));
                final ColumnFamilyDescriptor cfd = cfdBuilder.setMaxVersions(10).setInMemory(true).setBlocksize(8 * 1024)
                        .setScope(HConstants.REPLICATION_SCOPE_LOCAL).build();
                families.add(cfd);
            }
            final TableDescriptor tableDescriptor = builder.setColumnFamilies(families).build();
            admin.createTable(tableDescriptor);
        }
    }

4, drop table

    public void dropTable(Connection connection, String tableName) throws IOException {
        final HBaseAdmin admin = (HBaseAdmin) connection.getAdmin();
        if (admin.tableExists(TableName.valueOf(tableName))) {
            admin.disableTable(TableName.valueOf(tableName));
            admin.deleteTable(TableName.valueOf(tableName));
            System.out.println("Table " + tableName + " deleted successfully");
        } else {
            System.out.println("Table " + tableName + " does not exist");
        }
    }

5, insert data into table

public void addRowData(Connection connection,String tableName,String rowkey, String columnFamily,String column,String value) throws IOException {
        Table table = connection.getTable(TableName.valueOf(tableName));
        // Build the Put directive, rowkey must be passed
        Put put = new Put(Bytes.toBytes(rowkey));
        // Specify the version timestamp
// put.addColumn(Bytes.toBytes(columnFamily),Bytes.toBytes(column),System.currentTimeMillis(),Bytes.toBytes(value));
        put.addColumn(Bytes.toBytes(columnFamily),Bytes.toBytes(column),Bytes.toBytes(value));
        // You can also pass a List<Put> to insert multiple columns of data at once
        table.put(put);
        // Close the table after the operation to release resources
        table.close();
    }


6. Delete data based on rowKey

    public void deleteMultiRow(Connection connection, String tableName, String... rows) throws IOException {
        Table table = connection.getTable(TableName.valueOf(tableName));
        List<Delete> deletes = new ArrayList<>();
        for (String row : rows) {
            Delete delete = new Delete(Bytes.toBytes(row));
            deletes.add(delete);
        }
        if (deletes.size() > 0) {
            table.delete(deletes);
        }
        table.close();
        System.out.println("Table data deleted successfully");
    }

7. Get Gets a row of data

public void getRowData(Connection connection,String tableName,String row) throws IOException {
        Table table = connection.getTable(TableName.valueOf(tableName));
        Get get = new Get(Bytes.toBytes(row));
        final Result result = table.get(get);
        for (Cell cell : result.rawCells()) {
            System.out.println("rowkey: " + Bytes.toString(CellUtil.cloneRow(cell)));
            System.out.println("column family: " + Bytes.toString(CellUtil.cloneFamily(cell)));
            System.out.println("column: " + Bytes.toString(CellUtil.cloneQualifier(cell)));
            System.out.println("value: " + Bytes.toString(CellUtil.cloneValue(cell)));
            System.out.println("version timestamp: " + cell.getTimestamp());
            System.out.println("==========================");
        }
        table.close();
    }


8. Scan obtains all data

public void scanRowData(Connection connection,String tableName,String startRow,String stopRow) throws IOException{
        Table table = connection.getTable(TableName.valueOf(tableName));
        Scan scan = new Scan();
        if (null != startRow && !"".equals(startRow)) {
            scan.withStartRow(Bytes.toBytes(startRow));
        }
        if (null != stopRow && !"".equals(stopRow)) {
            // Include the stop row. By default a scan is left-closed and right-open (the stop row is excluded).
            scan.withStopRow(Bytes.toBytes(stopRow), true);
        }
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (Cell cell : result.rawCells()) {
                System.out.println("rowkey: " + Bytes.toString(CellUtil.cloneRow(cell)));
                System.out.println("column family: " + Bytes.toString(CellUtil.cloneFamily(cell)));
                System.out.println("column: " + Bytes.toString(CellUtil.cloneQualifier(cell)));
                System.out.println("value: " + Bytes.toString(CellUtil.cloneValue(cell)));
                System.out.println("version timestamp: " + cell.getTimestamp());
                System.out.println("==========================");
            }
            System.out.println("-----------------------");
        }
        table.close();
    }


9. Main Test method

public static void main(String[] args) throws IOException {
        BaseDemo baseDemo = new BaseDemo();
        final Connection connection = baseDemo.getConnection();
        String tableName = "my_ns:Student";
        String basicCF="basicinfo";
        String studyCF="studyinfo";
        // Create the table
// baseDemo.createTable(connection,tableName,new String[]{"basicinfo","studyinfo"});
        // Drop the table
// baseDemo.dropTable(connection,tableName);
        // Insert data
// baseDemo.addRowData(connection,tableName,"1001",basicCF,"age","10");
// baseDemo.addRowData(connection,tableName,"1001",basicCF,"name","roy");
// baseDemo.addRowData(connection,tableName,"1001",studyCF,"grade","100");
// baseDemo.addRowData(connection,tableName,"1002",basicCF,"age","11");
// baseDemo.addRowData(connection,tableName,"1002",basicCF,"name","yula");
// baseDemo.addRowData(connection,tableName,"1002",studyCF,"grade","99");
// baseDemo.addRowData(connection,tableName,"1003",basicCF,"age","12");
// baseDemo.addRowData(connection,tableName,"1003",basicCF,"name","sophia");
// baseDemo.addRowData(connection,tableName,"1003",studyCF,"grade","120");
        // Delete data
// baseDemo.deleteMultiRow(connection,tableName,new String[]{"1003"});
        // Read a row of data
        baseDemo.getRowData(connection,tableName,"1001");
        // Search for data
        baseDemo.scanRowData(connection,tableName,null,null);
        connection.close();
    }


6. HBase optimization

1. Column family design

When designing an HBase table structure, the official documentation recommends that a table contain only one column family; in real projects, do not exceed three. Vertically, HBase organizes its files by column family, and each column family corresponds to a separate directory in HDFS. Too many column families means too many small files, which increases the cost of frequently merging and splitting them. Horizontally, HBase splits a table into multiple regions when its data volume is large. With too many column families, data skew can easily occur: some hotspot regions carry too much data and too many reads and writes while other regions have very little, wasting performance. Therefore, it is recommended that each table use a single column family so that HBase can manage the data files more efficiently. If multiple column families are required, try to ensure that they grow at a similar rate.

2. Pre-partitioning and Rowkey design

By default, HBase decides on its own when to split a large StoreFile into regions, but this automatic splitting is unpredictable and may cause data skew. Therefore, pre-partitioning is generally used in practice: you maintain the region split rules manually and configure the rowkey range of each region. New data whose rowkey falls into a region's range is stored and maintained by that region. In this way the partitions are planned in advance, which improves HBase performance.

1. Manually divide partitions

create 'staff','info',SPLITS=>['1000','2000','3000','4000']


This partitioning method partitions on the first four characters of the Rowkey. For example, rowkey 152033333 will be placed in the 1000~2000 partition.

2. Partition by file

Create a splits.txt file with the following contents:

0
1
2
3
4
5
6
7
8
9


Then execute the building statement

create 'staff2','info',SPLITS_FILE=>'splits.txt'


This partitioning method splits the rowkey into 10 partitions based on the first number of the rowkey.

3. Generate hexadecimal sequence pre-partitioning

create 'staff3','info',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}


Of these methods, the second method is generally the most used, and the third method is rarely used.

4. Partitions can also be specified in the Java API.

See the earlier table-creation demo: the admin.createTable() method has two overloads that accept split information. Their signatures are shown below, followed by a small usage sketch.

   * @param desc table descriptor for table
   * @param startKey beginning of key range
   * @param endKey end of key range
   * @param numRegions the total number of regions to create
void createTable(TableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions) throws IOException;

 * @param desc table descriptor for table
   * @param splitKeys array of split keys for the initial regions of the table
 default void createTable(TableDescriptor desc, byte[][] splitKeys) throws IOException {

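A minimal usage sketch of the second overload (the staff4 table name and the split points are assumptions for this example, mirroring the shell example above):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public void createPreSplitTable(Connection connection) throws IOException {
    try (Admin admin = connection.getAdmin()) {
        TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("staff4"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                .build();
        // Four split keys produce five regions, just like SPLITS=>['1000','2000','3000','4000']
        byte[][] splitKeys = {
                Bytes.toBytes("1000"),
                Bytes.toBytes("2000"),
                Bytes.toBytes("3000"),
                Bytes.toBytes("4000")
        };
        admin.createTable(desc, splitKeys);
    }
}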

After the partition is complete, you can view the table partition information on the HBase management page.

Region planning is based on the number of nodes in a cluster. Generally, the number of regions on each node should not exceed three.

3. Rowkey design

Rowkey is the only way to query HBase table data. Therefore, Rowkey must contain key query attributes in the future. For example, if a user table needs to be queried by userName and userPass in the future, the rowkey should take these two fields as the main elements and calculate a unique sequence number as the Rowkey.

On the other hand, once a piece of data has a Rowkey, the partition it is stored in depends on which region's range the Rowkey falls into. The main purpose of Rowkey design is to distribute data evenly across all regions, which prevents data skew to a certain extent. However, business rowkeys are often not evenly distributed according to the pre-partition rules, so the rowkey needs to be hashed so that otherwise-regular data spreads evenly across the regions. For example, if we pre-split the staff table into five partitions but the user IDs at the front of the rowkeys mostly fall in the 1000~2000 range, the data skew becomes very obvious. In general, there are several common hashing approaches.

1. Hash

Apply a hash algorithm such as MD5 or AES to the original Rowkey computed from the key business information, and use the result (or a short prefix of it) as part of the Rowkey; see the sketch after this list.

2. String reversal

The entire original Rowkey is flipped in character order. For example, 13579 is reversed to become 97531.

3. String concatenation

Add a fixed or computed string in front of the original Rowkey to control which partition it goes to. For example, the original rowkey becomes 0000{rowkey} after concatenation.
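Below is a minimal sketch of the hash approach from item 1 (the helper name and the two-character prefix length are illustrative assumptions, not part of the original):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: prefix the rowkey with 2 hex characters of its MD5 digest,
// spreading consecutive business keys across regions pre-split on that prefix.
public static String saltedRowkey(String businessKey) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5").digest(businessKey.getBytes(StandardCharsets.UTF_8));
    String prefix = String.format("%02x", digest[0] & 0xff);
    // The original business key is appended so the row is still readable
    return prefix + "_" + businessKey;
}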

4. Integrate Hive with HBase

Hive is a data warehouse software commonly used for big data. It can convert SQL statements into MapReduce computing tasks and directly index data files on HDFS. In this way, the query and analysis task of big data can be carried out in the way of SQL.

HBase and Hive complement each other in many aspects and are naturally suited to integration:
1. Both HBase and Hive store data on HDFS.
2. Hive is very slow because it relies on MapReduce computation, while data reads and writes in HBase are fast.
3. HBase can only retrieve data by rowkey, which is limiting for business scenarios, while Hive can flexibly retrieve data with SQL.

Therefore, Hive and HBase are used together in many projects. HBase provides efficient READ/write performance to provide fast OLTP data services, and Hive provides fast and comprehensive SQL functions to provide OLAP data statistics services.

The basic integration between Hive and HBase is to create a table in Hive and map it to a table in HBase. In Hive, operations on the associated table are changed to indexes data in the HBase table.

Procedure for integrating Hive and HBase:

1. Compatibility between Hive and HBase versions

That’s an important premise. Versioning compatibility between components is a major headache when building big data projects. CDH is often used to uniformly resolve version compatibility of individual components. We are now all using the open source version of Apache components, so we need to deal with the version compatibility issues ourselves.

We are using Hive 3.1.2, which integrates with HBase by default. The HBase version it bundles is 2.0.0-alpha4, while the HBase version we use is 2.4.4. Testing shows that this HBase version integrates with Hive successfully. If the versions conflicted, the following experiment would not succeed; in that case you would need to download the Hive source code and update its HBase version.

1. To check the HBase version bundled with Hive, look at the hbase-*.jar packages under $HIVE_HOME/lib.

2. To recompile Hive, download the Hive source code, change the HBase version in Hive's Maven dependencies, recompile with Maven, and replace the hbase-related jar packages in the Hive deployment with the new ones.

2. Configure HBase environment variables

When Hive was introduced earlier, an error log appeared each time the Hive client was started:

which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/es/node-v8.1.0-linux-x64/bin:/app/hadoop/hadoop-3.2.2/bin:/app/hadoop/hadoop-3.2.2/sbin:/app/java/jdk1.8.0_212/bin:/root/bin)

This is because Hive checks whether HBase is installed during startup. To fix it, configure the HBase environment variables on the node where Hive is installed.

vi ~/.bash_profile
# Add
export HBASE_HOME=/app/hbase/hbase-2.4.4
# and add $HBASE_HOME/bin to the PATH environment variable

If you start Hive again, the error log is gone, and more logs are printed on the console while Hive executes.

3. Add the following configuration to hive-site.xml

<property>
  <name>hive.zookeeper.quorum</name>
  <value>hadoop01,hadoop02,hadoop03</value>
</property>
<property>
  <name>hive.zookeeper.client.port</name>
  <value>2181</value>
</property>


This adds the ZooKeeper cluster used by HBase so that Hive can locate regions and read data from HDFS just like an HBase client.

4. Create a mapping table in Hive

CREATE TABLE hive_hbase_emp_table(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table");

hbase.table.name in TBLPROPERTIES specifies the HBase table to map to. This table does not need to exist in HBase beforehand; Hive creates it in HBase when creating the Hive table.

hbase.columns.mapping in SERDEPROPERTIES specifies the field-by-field mapping between hive_hbase_emp_table in Hive and hbase_emp_table in HBase. For example, in the configuration above, :key is the fixed Rowkey of the HBase table and corresponds to empno, the first field in the Hive table; info:ename corresponds to the ename field in Hive; and so on for each field in order.

5. Insert test data into Hive

Next, try inserting data into Hive's hive_hbase_emp_table with an INSERT statement, and then check the data in the HBase table.

Note, however, that you cannot use LOAD DATA to load data directly into this Hive table; data imported that way cannot be correctly mapped to HBase. Instead, use a Hive intermediate table and import the data with insert into XXX select.

We created an EMP table when introducing Hive. In fact, this test table structure is consistent with the EMP table structure.

CREATE TABLE emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int)
row format delimited fields terminated by ',';

Then load the test data we prepared in a text file into this table:

hive> load data local inpath '/root/emp' into table emp;


Next, insert data into the mapping table through the intermediate table

hive> insert into table hive_hbase_emp_table select * from emp;


6. Check the test results

If no error is reported, the imported data can be viewed in both Hive and HBase. You can also run more complex SQL in Hive, for example: select deptno, sum(sal) from emp group by deptno;

Then we can look at HDFS to see how the data is organized. On HDFS, Hive created the directory /user/hive/warehouse/hive_hbase_emp_table for the table, but there are no data files in this directory.

Now go to HBase's HDFS directory. HBase has also created the directory /hbase/data/default/hbase_emp_table for the table. Inside it there is a region directory named with a random string, for example /hbase/data/default/hbase_emp_table/c4440f96cf62f5394722fd28bef9a734, which holds the HBase data for the table. You will notice that the data files in this directory are still empty. This is because the table data is still in HBase's memory: the data mapped from Hive has not yet been flushed from the MemStore. Only after running flush 'hbase_emp_table' in HBase is the in-memory data written to HDFS. After that, a file named with a random string appears under /hbase/data/default/hbase_emp_table/c4440f96cf62f5394722fd28bef9a734/info; this is the HBase data file on disk.

In the Hive and HBase integration, it is still HBase that stores the data, and Hive still uses MapReduce programs to aggregate the HBase data.

In fact, HBase is deeply integrated with Hadoop. Although the HBase files stored on HDFS are binary files that cannot be read directly, you can easily read HBase data with Hadoop MapReduce. Reading HBase data with MapReduce is rarely needed in practice; the official apache_hbase_reference_guide.pdf explains it in detail, and HBase ships with concrete code examples, so it is not covered in this course.

Another important framework is Spark. The integration of Spark and HBase is also an important feature of HBase; interested readers can explore it on their own.

Finally, there is another framework that provides SQL query support for HBase: Phoenix (phoenix.apache.org/). This framework is relatively niche and is currently used mainly to provide standard JDBC support for HBase. After setting it up, you need a SQL client tool such as SQuirreL SQL to connect to it. I leave it to you to explore.

Finally, delete the table in Hive and delete the corresponding table in HBase.

7. Course Summary

As a big data NoSQL product built on Hadoop, HBase delivers strong performance in commercial projects. It supports more than 100 billion read and write operations per day, petabytes of data storage, and peak concurrency at the level of a million requests. Its stability has been tested in the business practice of major enterprises; big companies like Facebook and Alibaba are heavy users of HBase. All of this demonstrates the importance of HBase.

In our course, we built a complete big data processing platform step by step with the latest Hadoop, Spark, Hive and HBase, demonstrated how to use these big data components, and introduced the core mechanism and principle behind these components. However, with the explosive growth of Internet data, big data has gradually become the mainstream of Internet technology, the speed of technological update and iteration is also accelerating, and new components are constantly emerging. This, coupled with the big data ecosystem, means that new features of one component can quickly spread to other components, so self-learning is especially important in big data systems to keep up with new releases and technologies.

In fact, Hadoop+Spark+HBase is a cornerstone of open source big data, which is derived from Google’s core papers. Together with Hive, these four components are sufficient to support many big data development requirements. Later, we will design practical courses based on this and take you to get familiar with them.