The HBase Doesn't Sleep Book is a technical book on HBase that will not put you to sleep, and it is very good. To help the material stick, I have organized the important parts of the book into reading notes for later reference, and I hope they are of some help to anyone who is just starting to learn HBase.
Contents
- Chapter 1 – Getting to know HBase
- Chapter 2 – Get HBase Running
- Chapter 3 – HBase Basic Operations
- Chapter 4 – Getting started with the Client API
- Chapter 5 – HBase Internal Exploration
- Chapter 6 – Advanced usage of the client API
- Chapter 7 – Client API management capabilities
- Chapter 8 – Faster
- Chapter 9 – When HBase Meets MapReduce
This document does not describe the HBase installation process in detail.
1. Tips
1. Add a hadoop user and grant it sudo permission
(1) Switch to user root and create the user hadoop.
# useradd hadoop
# passwd hadoop
(2) Add hadoop to the sudoers list.
# chmod u+w /etc/sudoers
# vi /etc/sudoers
-- Add the following line --
hadoop ALL=NOPASSWD:ALL
2. Hadoop environment variable setting
Switch to the hadoop user and edit the ~/.bashrc file to add the following environment variables:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Why are HADOOP_PREFIX and HADOOP_HOME both set to the same value? In its early days Hadoop used HADOOP_HOME to mark the location of the application folder, but it later switched to HADOOP_PREFIX, so for compatibility just set both and keep the values identical.
3. Configure hadoop-env.sh
Edit $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh and add the following:
export HADOOP_NAMENODE_OPTS="-Xms1024m -Xmx1024m -XX:+UseParallelGC"
export HADOOP_DATANODE_OPTS="-Xms1024m -Xmx1024m"
export HADOOP_LOG_DIR=/data/logs/hadoop
- If the memory footprint is not set, the JVM will either run out of memory or use up all of the machine's memory. Add memory parameters wherever a JVM is used, so that you at least know how much memory each process takes and can keep the situation under control.
- If the log file path is not set, you will most likely run into a full log partition later on, followed by an endless stream of strange faults.
4. Add hbase to the supergroup
In both pseudo-distributed and fully distributed scenarios, HBase creates the /hbase folder directly in the root directory of HDFS. Creating a folder in the root directory requires the permissions of the superuser group. The superuser group is defined by dfs.permissions.superusergroup in hdfs-site.xml; if this parameter is not set, the default superuser group is supergroup. Assuming you have not set the dfs.permissions.superusergroup property, you now need to add hbase to the supergroup group in Linux. On CentOS you can run the following statements:
# groupadd supergroup
# groupmems -g supergroup -a hbase
5. Pay special attention
HBase ships with its own ZooKeeper and starts it by default. If HBase uses its built-in ZooKeeper, the ZooKeeper process you see in the jps output is named HQuorumPeer. If you are using an external ZooKeeper cluster, its name is QuorumPeer or QuorumPeerMain.
Whether the built-in ZooKeeper is enabled is controlled by the HBASE_MANAGES_ZK variable in conf/hbase-env.sh. This variable defaults to true, but you can change the value to false if you do not want to use the built-in ZooKeeper.
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false
6. HBase can read HDFS configurations in three ways
- Add HADOOP_CONF_DIR to HBASE_CLASSPATH (recommended);
- Copy the HDFS configuration files to the HBase conf folder, or create a soft link to hdfs-site.xml under hbase/conf;
- Write the few HDFS configuration items you need directly into the hbase-site.xml file.
2. HBase basic architecture
- HBase has a Master that manages metadata, similar to the NameNode in Hadoop.
- RegionServers store the data, which makes them equivalent to DataNodes in Hadoop.
- ZooKeeper is responsible for maintaining all HBase nodes. If ZooKeeper goes down, you cannot connect to any of them.
- In a production environment, the fully distributed deployment mode relies on HDFS to store data. In standalone deployment mode, HBase stores data directly on an ordinary file system.
- Data can still be read from and written to HBase even if the Master is down, but tables cannot be created or modified. This is because the client interacts only with ZooKeeper and the RegionServers when it reads data, which makes ZooKeeper even more important than the Master.
3. Enable data block encoding
1. Data block encoding
Data block encoding mainly encodes the Key of each KeyValue to reduce the space that Keys occupy, because many Keys share the same prefix.
Consider a table whose row keys, column families, and columns are defined as follows: each row key consists of the prefix myrow followed by a number, such as myrow001, myrow002, myrow003, and so on; there is one column family called mycf; and the mycf column family has 5 columns named col1, col2, col3, col4, and col5. Their storage structure is shown below.
If we store only the part that changes from one Key to the next, we can avoid storing the repeated prefix. This is called Prefix encoding.
2. Prefix encoding
If prefix encoding is used as the data block encoding, only the first Key is stored as a complete string; each subsequent Key stores only the characters that differ from the Key before it. The re-encoded data is shown as follows.
As you can see, the storage space of the Key is greatly reduced. After encoding, the total storage space of the Key is only 37 characters, compared to 180 characters before encoding, reducing the space usage by 79%.
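The arithmetic above can be checked with a minimal Python sketch. It is not HBase's implementation: it assumes two rows of five columns each, assumes ":" as the separator between row key, column family, and column (chosen because it makes each Key 18 characters long, matching the 180-character total), and counts only Key characters, ignoring the small integer needed to record each shared-prefix length. The function names are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(keys):
    """Return (shared_prefix_len, suffix) pairs, each relative to the previous key."""
    encoded = []
    previous = ""
    for key in keys:
        shared = common_prefix_len(previous, key)
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the full keys from (shared_prefix_len, suffix) pairs."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


# Two rows with five columns each; ":" is an assumed separator.
keys = [f"myrow00{r}:mycf:col{c}" for r in (1, 2) for c in range(1, 6)]
encoded = prefix_encode(keys)

raw_chars = sum(len(k) for k in keys)
stored_chars = sum(len(suffix) for _, suffix in encoded)
saving = 1 - stored_chars / raw_chars
print(raw_chars, stored_chars, f"{saving:.0%}")
```

Under these assumptions the first Key costs 18 characters, each following column in the same row costs 1, and the first Key of the second row costs 11, which gives 37 characters in total.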
3. Differential encoding (Diff)
Differential encoding (Diff) goes further than prefix encoding: it also applies difference encoding to the following fields.
- Key length (KeyLen);
- Value length (ValLen);
- Timestamp, also known as Version;
- Type, that is, the key Type.
The KeyValue structure after differential encoding is as follows:
- 1 byte: flag;
- 1-5 bytes: Key length (KeyLen);
- 1-5 bytes: Value length (ValLen);
- 1-5 bytes: prefix length (Prefix Len);
- ... bytes: remaining parts;
- ... bytes: the real Key, or just the differing suffix of the Key;
- 1-8 bytes: timestamp, or only the timestamp difference;
- 1 byte: Key type (Type);
- ... bytes: value.
The Prefix Len field is the length of the prefix that the current Key shares with the Key it is compared against.
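To see why the timestamp field can shrink from 8 bytes to as little as 1, here is a small Python sketch. It only illustrates the idea of storing a difference instead of a full value; the helper name and the sample timestamps are hypothetical, and HBase's real byte layout (including how the sign of the difference is recorded) is not reproduced here.

```python
def bytes_needed(n: int) -> int:
    """Smallest number of whole bytes that can hold the magnitude of n."""
    return max(1, (abs(n).bit_length() + 7) // 8)


# Two cells written 5 ms apart (hypothetical epoch-millisecond timestamps).
previous_ts = 1_500_000_000_000
current_ts = 1_500_000_000_005
delta = current_ts - previous_ts

full_size = bytes_needed(current_ts)
delta_size = bytes_needed(delta)
print(full_size, delta_size)
```

Cells written close together in time have small differences, so storing the difference is cheaper whenever it needs fewer bytes than the full timestamp.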
Flag bit
The flag is a binary number, for example 5=101 and 7=111. It records how the current KeyValue differs from the previous KeyValue. Here are some of the rules for generating flag bits:
- If the KeyLen (Key length) in the current KeyValue is the same as in the previous KeyValue, the flag code is 1.
- If the ValLen (Value length) in the current KeyValue is the same as in the previous KeyValue, the flag code is 10.
- If the Type in the current KeyValue is the same as in the previous KeyValue, the flag code is 100.
By computing a bitwise AND (&) of the flag with a flag code, we can quickly tell whether that field is the same as in the previous KeyValue: a bit set to 1 marks a field that is unchanged.
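The three rules can be sketched in Python as follows. This is an illustration rather than HBase's code: the constant names, the dictionary representation of a KeyValue, and the sample cells are all assumptions made for the example. Only the three flag codes come from the text.

```python
# Flag codes from the rules above, written as binary literals.
FLAG_SAME_KEY_LENGTH = 0b001    # flag code 1
FLAG_SAME_VALUE_LENGTH = 0b010  # flag code 10
FLAG_SAME_TYPE = 0b100          # flag code 100


def build_flag(current: dict, previous: dict) -> int:
    """Set one bit for every field that matches the previous KeyValue."""
    flag = 0
    if len(current["key"]) == len(previous["key"]):
        flag |= FLAG_SAME_KEY_LENGTH
    if len(current["value"]) == len(previous["value"]):
        flag |= FLAG_SAME_VALUE_LENGTH
    if current["type"] == previous["type"]:
        flag |= FLAG_SAME_TYPE
    return flag


# Hypothetical cells: same key length, different value length, same type.
previous = {"key": "myrow001:mycf:col1", "value": "abc", "type": "Put"}
current = {"key": "myrow001:mycf:col2", "value": "defgh", "type": "Put"}

flag = build_flag(current, previous)
print(format(flag, "03b"))

# Reading a field back is a single bitwise AND with its flag code.
same_key_length = bool(flag & FLAG_SAME_KEY_LENGTH)
same_value_length = bool(flag & FLAG_SAME_VALUE_LENGTH)
same_type = bool(flag & FLAG_SAME_TYPE)
```

Here the key length and type match but the value length does not, so the flag is 101 in binary, that is, 5.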
This encoding compresses the data almost as far as it can go, yet it is not enabled by default. Why? Because it is slow: every KeyValue has to be computed this way, so reading the data becomes slow as well. Do not use it unless you are after an extreme compression ratio and read performance is not a concern, for example when you want to archive data.
4. Fast differential encoding (Fast Diff)
Fast Diff encoding borrows the idea of Diff encoding while addressing its fatal defect, slowness. The KeyValue structure of Fast Diff encoding is exactly the same as that of Diff encoding, except that the storage rules for the Flag are different and the calculation of the Timestamp is optimized. Fast Diff is faster than Diff, and it is also a relatively recommended algorithm.
If you want to compress your data with a differential algorithm, it is best to use Fast Diff encoding. However, "fast" here is only relative to the original differential algorithm: there is still a lot of calculation, so Fast Diff remains relatively slow.
5. Prefix Tree
Prefix Tree is a variant of the Prefix algorithm that was added after version 0.96. The most important function of prefix tree coding is to improve the random read capability. However, its complex algorithm slows down the write speed and consumes more CPU resources. Therefore, it is necessary to make a choice between resource consumption and random read performance.
In summary, prefix encoding and fast differential encoding (which Kylin uses by default) are two common data block encoding methods.
4. Enable the compressor
1. Compressor
HBase data can be stored in a compressed format, which saves disk space. This is completely optional, but it is recommended that you install Snappy, which is currently the most highly rated compressor for HBase.
You can enable the compressor by modifying the column family description:
hbase> alter 'mytable',{NAME =>'mycf',COMPRESSION=>'snappy'}
2. Share Hadoop's built-in compressors
Hadoop's shared library contains many resources, including compressors, so you can use them directly in HBase. You can check which Hadoop compressors are currently available with the following command:
$ hbase --config $HBASE_HOME/conf org.apache.hadoop.util.NativeLibraryChecker
If the following error message is displayed, it indicates that NativeLibraryChecker cannot read the Hadoop Native library.
util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Native library checking:
hadoop: false
zlib: false
snappy: false
lz4: false
bzip2: false
The usual solution is to add the following statement to hbase-env.sh:
export HBASE_LIBRARY_PATH=<path of the Hadoop native library>
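If you run this check on several machines, the report can be scanned automatically. The Python sketch below is an assumption-laden helper, not part of HBase or Hadoop: it assumes the checker prints one `name: true` or `name: false` line per library, as in the sample output above, and the function name is hypothetical.

```python
def missing_native_libs(report: str):
    """Return the names of libraries that the report marks as false."""
    missing = []
    for line in report.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "name: false ..." lines; headers and log lines are skipped.
        if sep and fields and fields[0] == "false":
            missing.append(name.strip())
    return missing


# Sample text copied from the error output shown above.
sample = """util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
Native library checking:
hadoop: false
zlib: false
snappy: false
lz4: false
bzip2: false"""

missing = missing_native_libs(sample)
print(missing)
```

An empty list would mean every library in the report was loaded.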
3. Snappy compressor
Snappy is a compression library developed by Google. It has the following features:
- Fast: compression speeds of up to 250 MB/s;
- Stable: it has been used in several Google products for years;
- Robust: Snappy's decompressor is designed to cope with corrupted data rather than fail badly;
- Free and open source.
After the installation is complete, add the following statements in hbase-env.sh:
export HBASE_LIBRARY_PATH=<path of the compressor .so files>:$HBASE_LIBRARY_PATH
4. GZ compressor
The GZ compressor is generally not recommended, except for archival files that do not require high speed. The features of the GZ compressor are as follows:
- GZ compressors have the highest compression ratio;
- The speed is slow and occupies a lot of CPU.
- Easy to install.
Java already ships with GZ compression, so although GZ is not the best compressor, it is the easiest to use. You do not need to set anything up; just change the COMPRESSION property of the column family to GZ.
hbase> alter 'test1',{NAME=>'mycf',COMPRESSION=>'GZ'}
5. LZO compressor
Before Snappy, LZO was the compression algorithm officially recommended by HBase. The main reason is that GZ compression is too slow, whereas LZO focuses on speed, so LZO was the better choice over GZ. Since Snappy came along, however, LZO no longer has any advantage.
6. LZ4 compressor
LZ4 features:
- Has a low loss rate;
- Very fast, reaching 400 MB/s per core.
LZ4 is faster than Snappy. The LZ4 compressor is built into libhadoop.so, so you only need HBase to load the native Hadoop library.
5. Summary
Using block encodings or compressors depends on whether the data you store takes up more space for qualifiers or values.
- If a qualifier takes up a lot of space, block encoding is recommended.
- If the value occupies a large space, you are advised to use a compressor.
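As a rough way to apply this rule to a sample of your own data, the Python sketch below adds up the bytes on each side and picks whichever dominates. The function name, the use of the whole Key rather than only the qualifier, and the plain "larger side wins" comparison are all assumptions made for illustration; they are not a rule from the book.

```python
def recommend(cells):
    """cells: list of (key, value) pairs given as bytes."""
    key_bytes = sum(len(key) for key, _ in cells)
    value_bytes = sum(len(value) for _, value in cells)
    # Assumed rule of thumb: optimize whichever side takes more space.
    if key_bytes > value_bytes:
        return "block encoding"
    return "compressor"


# Hypothetical samples: long keys with tiny values, then a short key with a large value.
short_values = [(b"myrow001:mycf:col1", b"1"), (b"myrow001:mycf:col2", b"0")]
long_values = [(b"r1:cf:c1", b"x" * 500)]

print(recommend(short_values))
print(recommend(long_values))
```

The two techniques are not mutually exclusive, so a table can use both when keys and values are each sizeable.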
When I first learned HBase, I used the Java API to operate on tables and paid little attention to how HBase was installed. At the very least, block encoding and compressors should be considered whenever you create tables in the future. The Snappy compressor and prefix encoding are both recommended as simple and effective tuning methods.