This article analyzes the principles of HBase. You will learn the following:

  • The CAP theorem
  • Why NoSQL emerged
  • HBase features and application scenarios
  • The HBase data model and basic principles
  • Basic use of the client API
  • A summary of easily confused points for interviews

A friendly tip: this article is long; if you find it useful, consider bookmarking it, and remember to share it.

From BigTable

HBase is a highly reliable, high-performance, column-oriented, scalable distributed database modeled after Google's BigTable. HBase can store unstructured and semi-structured sparse data, and it supports large-scale data storage: by scaling horizontally it can handle tables with more than a billion rows and millions of columns.

BigTable is a distributed storage system that processes massive data with Google's MapReduce distributed parallel computing model, uses Google's distributed file system GFS for underlying storage, and relies on Chubby for coordination services. It is widely deployed and offers scalability, high availability, and high performance. The following table compares BigTable with HBase:

Dependency             BigTable     HBase
Data storage           GFS          HDFS
Data processing        MapReduce    Hadoop MapReduce
Coordination service   Chubby       ZooKeeper

The CAP theorem

In 2000, UC Berkeley professor Eric Brewer proposed the CAP conjecture; in 2002, Seth Gilbert and Nancy Lynch of MIT published a proof of Brewer's conjecture, establishing the CAP theorem. The theorem states that a distributed computing system cannot simultaneously satisfy all three of the following guarantees:

  • Consistency

    Every node sees the same, most recent copy of the data. That is, any read operation always returns the result of the latest completed write; in a distributed environment, different nodes accessing the same data see the same value.

  • Availability

    Every request receives a non-error response, although the data returned is not guaranteed to be the most recent. In other words, the system returns a result within a bounded time.

  • Partition tolerance

    In practical terms, partition tolerance is a time-bound requirement on communication. If the system cannot reach data consistency within the time limit, a partition has occurred, and it must choose between C and A for the current operation. That is, when a network partition occurs (some nodes in the system cannot communicate with others), the separated parts of the system can still operate correctly.

As the figure above shows, a distributed system can satisfy at most two of consistency, availability, and partition tolerance. When dealing with the CAP trade-off, there are several options:

  • Satisfy CA, but not P. Keeping all transaction-related content on a single machine limits the scalability of the system. Traditional relational databases such as MySQL, SQL Server, and PostgreSQL adopt this design principle.

  • Satisfy AP, but not C. That is, inconsistent data may be returned. For Web 2.0 sites, service availability matters more than strict consistency. For example, when you post a blog entry or a tweet, some of your friends see the post immediately while others see it only after a delay. The delay exists, but for an entertainment-oriented Web 2.0 site, a few minutes' delay does not noticeably hurt the user experience. Conversely, users would be unhappy if a post could not be published at all (a failure of availability). So for Web 2.0 sites, availability and partition tolerance take precedence over strict consistency; consistency is not abandoned entirely but relaxed to eventual consistency (with some latency). NoSQL databases such as Dynamo, Cassandra, and CouchDB follow this principle.

  • Satisfy CP, but not A. Consistency (C) and partition tolerance (P) are emphasized over availability (A). When a network partition occurs, the affected services stop responding while waiting for data to become consistent. For example, Neo4j, HBase, MongoDB, and Redis adopt this design principle.

Why NoSQL

NoSQL stands for "Not Only SQL", meaning more than SQL. The CAP theorem described above underpins NoSQL design. So why did NoSQL databases rise? With the advent of Web 2.0 and the big data era, relational databases increasingly failed to meet demand. As big data, the Internet of Things, mobile Internet, and cloud computing developed, unstructured data came to account for more than 90% of all data. Faced with big data, relational databases exposed more and more shortcomings, such as inflexible data models and poor horizontal scalability. NoSQL databases therefore emerged, better meeting the needs of the big data and Web 2.0 era.

Facing the challenges of Web 2.0 and big data, relational databases perform poorly in the following respects:

  • Poor performance in processing massive data

    In the Web 2.0 era, and especially with the growth of the mobile Internet, UGC (User Generated Content) and PGC (Professionally Generated Content) fill our daily lives. "We media" is flourishing everywhere; almost everyone has become a content creator, producing blog posts, comments, opinions, news, videos, and so on. This data is generated quickly and in large volumes. Platforms such as Weibo, WeChat public accounts, or Taobao may generate an astonishing amount of data in a single minute; once tables reach tens of millions or hundreds of millions of records, relational database query performance is clearly unacceptable.

  • Unable to meet high concurrency requirements

    In the Web 1.0 era, most web pages were static, so massive user visits could be served responsively. In the Web 2.0 era, with its emphasis on user interaction and user-generated content, pages must be generated dynamically, producing highly concurrent database access that may reach tens of thousands of reads and writes per second. For many relational databases, this load is unbearable.

  • Scalability and high availability requirements cannot be met

    In today's attention economy, hot topics (eye-catching, curiosity-satisfying) attract bandwagon traffic. For example, when Weibo reveals a celebrity scandal, the trending list quickly draws huge numbers of users (the so-called melon-eating masses), whose interactions drive the database's read/write load up dramatically. The database must therefore be able to scale up its performance quickly, on short notice, to absorb unexpected demand (after all, downtime hurts the user experience). However, relational databases are hard to scale horizontally; unlike web servers and application servers, they cannot simply grow their performance and load capacity by adding hardware and service nodes.

In summary, NoSQL databases emerged as an inevitable step in the evolution of IT.

HBase features and application scenarios

Features

  • Strongly consistent reads and writes

HBase is not an eventually consistent data store; reads and writes are strongly consistent. This makes it well suited to tasks such as high-speed counter aggregation.

  • Automatic sharding

HBase tables are distributed across the cluster as regions. As data grows, regions split automatically and are redistributed.

  • RegionServer automatic failover

  • Hadoop/HDFS integration

HBase supports HDFS as its distributed file system out of the box.

  • MapReduce integration

HBase supports massively parallel processing via MapReduce, and can serve as both a MapReduce source and sink.

  • Java client API

HBase provides an easy-to-use Java API for programmatic access.

  • Thrift/REST API

HBase also supports access via Thrift and REST APIs.

  • Block Cache and Bloom Filter

HBase supports the Block Cache and Bloom filters for query optimization, improving read performance.

  • Operations management

HBase provides built-in web pages and JMX metrics for operations and monitoring.

Usage scenarios

HBase is not suitable for every scenario; consider the following:

First, data volume. Make sure you have enough data. If you have hundreds of millions or billions of rows, or at least tens of millions of rows in a single table, HBase is a good choice. If you have only thousands or a few million rows, a traditional RDBMS may be a better choice.

Second, relational features. Make sure you do not depend on the extra features an RDBMS provides (typed columns, secondary indexes, transactions, advanced query languages, and so on). An application built on an RDBMS cannot be migrated to HBase by simply changing the JDBC driver; it requires a complete redesign.

Third, hardware. Make sure you have enough hardware. For example, HDFS defaults to three replicas, so at least five DataNodes are needed to make good use of it, plus a NameNode.

Finally, data analysis. Data analysis is a weakness of HBase: neither HBase nor the wider NoSQL ecosystem supports table joins. If your primary requirement is data analysis, such as reporting, HBase is not a good fit.

HBase data model

Basic terminology

HBase is a sparse, multi-dimensional, persistent map, indexed by row key, column family, column qualifier, and timestamp; the value of each cell is an uninterpreted byte array (byte[]). Before diving into HBase, become familiar with the following concepts:

  • Namespace

    A namespace is a logical grouping of tables, similar to a database in a relational database management system. HBase has two special predefined namespaces: hbase and default. The hbase namespace is the system namespace, used for HBase internal tables; default is the default namespace, used when no namespace is specified at table-creation time.

  • Table

    A table consists of rows and of columns grouped into column families.

  • Row

    A row key is an uninterpreted byte array. Within a table, rows are stored sorted by row key in lexicographic byte order. Each HBase table consists of a number of rows, each identified by its row key. You can exploit this ordering to store rows that are often read together near one another.

  • Column family

    In HBase, columns are organized into column families. All members of a column family share a common prefix; for example, the columns courses:history and courses:math are both members of the courses column family. The colon (:) delimits the column family prefix from the column qualifier. Column families must be declared when a table is created, whereas columns (qualifiers) can be added on the fly. In addition, all data stored in one column family usually has the same data type, which can greatly improve compression. Physically, the members of a column family are stored together in the file system.

  • Column qualifier

    Data within a column family is addressed by column qualifier. Qualifiers generally do not need to be defined at table-creation time, and they need not be consistent between rows. Columns have no explicit data type and are always treated as byte arrays (byte[]).

  • Cell

    A cell is the data stored at a specific combination of row key, column family, and column qualifier. Cell data likewise has no explicit data type and is always treated as a byte array (byte[]). Each cell's data is multi-versioned, and each version corresponds to a timestamp.

  • Timestamp

    HBase cell data is versioned, and versions are identified by timestamp. When a cell is written, modified, or deleted, HBase automatically generates and stores a timestamp for it. The versions of a cell are stored in descending timestamp order, so the most recent data is read first (the sketch after this list makes this concrete).

    For details of the HBase data model, see the following figure:
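To make these terms concrete, here is a minimal sketch that writes and reads a multi-version cell. The table name webtable, the column family contents, and the assumption that the family retains more than one version (VERSIONS > 1) are all hypothetical; the API is the 1.x-style Java client used later in this article.

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("webtable"))) { // hypothetical table

      byte[] row = Bytes.toBytes("com.cnn.www"); // row key
      byte[] cf  = Bytes.toBytes("contents");    // column family
      byte[] cq  = Bytes.toBytes("html");        // column qualifier

      // Each write creates a new version of the cell, identified by a timestamp
      table.put(new Put(row).addColumn(cf, cq, Bytes.toBytes("<html>v1</html>")));
      table.put(new Put(row).addColumn(cf, cq, Bytes.toBytes("<html>v2</html>")));

      // Read back the stored versions; they are returned newest first
      // (only as many versions as the family's VERSIONS setting are retained)
      Get get = new Get(row).addColumn(cf, cq).setMaxVersions();
      for (Cell cell : table.get(get).rawCells()) {
        System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
      }
    }
  }
}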

The conceptual model

In the HBase conceptual model, a table can be regarded as a sparse and multidimensional mapping, as shown in the following figure:

As the table above shows:

The table contains two rows, with row keys com.cnn.www and com.example.www.

There are three column families: contents, anchor, and people.

The first row (row key com.cnn.www) has two columns in the anchor column family, anchor:cnnsi.com and anchor:my.look.ca, and one column in the contents column family, contents:html.

The first row (row key com.cnn.www) holds five versions of data.

The second row (row key com.example.www) holds one version of data.

A cell in the table is located by the four-dimensional coordinates [row key, column family, column qualifier, timestamp], for example [com.cnn.www, contents, contents:html, t6].
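In client code those four coordinates map directly onto a Get. A fragment under the same assumptions as the earlier sketch (a Table named table in scope, and t6 standing in for a real timestamp):

// Locate a single cell by [row key, column family, column qualifier, timestamp]
long t6 = 1530000000000L;  // hypothetical timestamp, for illustration only
Get get = new Get(Bytes.toBytes("com.cnn.www"))
    .addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
get.setTimeStamp(t6);      // without this, the newest version is returned
byte[] html = table.get(get).getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));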

The physical model

Viewed through the conceptual model, HBase tables are sparse. In physical storage, data is stored by column family, and a new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.

In the physical model, the empty cells that appear in the conceptual model are simply not stored. For example, a request for contents:html at timestamp t8 would return no value. Note that if no timestamp is specified when reading, the latest version is returned by default, because versions are sorted in descending timestamp order.

In the table above, if you access row com.cnn.www, column contents:html without specifying a timestamp, the data at t6 is returned; similarly, accessing anchor:cnnsi.com returns the data at t9.

HBase working principles and operating mechanism

The overall architecture

The sections above should have given you a feel for HBase. Now let's look at its macro architecture, shown in the following figure:

This section describes the overall HBase architecture from a macro perspective. In deployment terms, HBase has two kinds of servers: the Master server and RegionServers. An HBase cluster typically has one Master and several RegionServers.

The Master maintains table structure information, while RegionServers store the actual data. In an HBase cluster, clients read and write data by talking directly to RegionServers, so even if the Master fails you can still query data; you just cannot create new tables.

  • Master

As is well known, Hadoop uses a master-slave architecture: the NameNode is the master node and DataNodes are the slave nodes. The NameNode is critical to a Hadoop cluster; if it fails, the whole cluster collapses.

In an HBase cluster, however, the Master service is not that important. Although it is called the master node, it is not a leader in the critical path; the Master is more of a 'handyman', playing a supporting role. When a client connects to an HBase cluster, it obtains the RegionServer's address directly from ZooKeeper and then fetches the data it needs from the RegionServer, without going through the Master. Likewise, when inserting or deleting data in an HBase table, the client interacts directly with the RegionServer; the Master service is not involved.

So what does the Master service do? The Master coordinates operations such as creating tables, deleting tables, moving regions, and merging regions. These operations share one characteristic: they need to cross RegionServers, so HBase assigns them to the Master. This structure greatly reduces the cluster's dependency on the Master, which matters because there are usually only one or two Master nodes; a design that depended heavily on the Master would create a single point of failure. In HBase, even if the Master goes down, the cluster still runs normally and data can still be stored and deleted.

  • RegionServer

A RegionServer is a container for regions, running as a service on a server; it stores data in the distributed file system HDFS. After a client obtains the RegionServer's address from ZooKeeper, it fetches data directly from that RegionServer. To an HBase cluster, RegionServers matter more than the Master service.

  • ZooKeeper

RegionServers depend on the ZooKeeper service; ZooKeeper acts as HBase's housekeeper. It tracks information about all RegionServers, including which RegionServer holds which data segments. Each time a client connects to HBase, it first talks to ZooKeeper to find out which RegionServer to connect to, and then connects to that RegionServer.

You can use zkCli to inspect HBase's znodes. For example, the following command retrieves the location of the hbase:meta table:

[zk: localhost:2181(CONNECTED) 17] get /hbase/meta-region-server

ZooKeeper's role in an HBase cluster is as follows: on the server side, it is an essential dependency for cluster coordination and control; for clients, it is an indispensable part of querying and manipulating data.

Note that read/write operations still work when the Master service is down; but if ZooKeeper goes down, no data can be read, because the location of the metadata table hbase:meta is stored in ZooKeeper. ZooKeeper is therefore vital to HBase.

  • Region

A Region is a slice of a table's data. An HBase table generally has one or more regions, and a region never spans servers: one RegionServer hosts one or more regions. When a table is created and its data volume is small, a single region can hold all of it; as data grows, the table splits into multiple regions. During HBase load balancing, a region may also be moved from one RegionServer to another. Region data lives in HDFS, and all of a region's data access operations go through the HDFS client interface. A region is roughly analogous to a partition of a partitioned table in a relational database.

The micro architecture

The previous section described the overall HBase architecture. The following figure zooms in on the internal structure of a RegionServer.

As shown in the preceding figure, one RegionServer can host multiple regions. A region is a data shard: each region has a start rowkey and an end rowkey that define the range of rows it stores. A region contains multiple Stores, and each Store corresponds to one column family. Each Store contains a MemStore, which keeps incoming data sorted; once a threshold is exceeded, the MemStore's data is flushed to an HFile, which is where the data ultimately resides.

All regions on a RegionServer share a single Write-Ahead Log (WAL). If WAL is enabled, data is written to the WAL first, which provides fault tolerance: the WAL is an insurance mechanism, written before the MemStore, so data can be recovered from it during failover. In addition, each Store has its own MemStore for sorting data, while each RegionServer has exactly one BlockCache, used to cache data for reads.

  • WAL (write-ahead log)

The Write-Ahead Log (WAL) records all writes to HBase. It is used for fault-tolerant recovery and is not strictly required. On HDFS, the WAL's default path is /hbase/WALs/, configurable via hbase.wal.dir.

WAL is enabled by default. To skip it for a given mutation, call Mutation.setDurability(Durability.SKIP_WAL). WAL writes can be asynchronous or synchronous: asynchronous writing is requested with Mutation.setDurability(Durability.ASYNC_WAL), and synchronous writing, the default, with Mutation.setDurability(Durability.SYNC_WAL).

With asynchronous WAL, Put, Delete, and Append operations do not trigger a WAL sync immediately; the sync happens at an interval configured by hbase.regionserver.optionallogflushinterval, which defaults to 1000 ms.
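A small sketch of these durability settings, assuming an open Table and the hypothetical row/family/qualifier names used in the other sketches in this article:

// Per-mutation WAL durability (Durability is org.apache.hadoop.hbase.client.Durability)
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));

put.setDurability(Durability.SKIP_WAL);   // no WAL: fastest, but data is lost on a crash
put.setDurability(Durability.ASYNC_WAL);  // append to WAL, sync in the background
put.setDurability(Durability.SYNC_WAL);   // append and sync before acknowledging (the default behavior)
table.put(put);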

  • MemStore

Each Store has one MemStore instance. After data is written to the WAL, it is placed in the MemStore, an in-memory storage structure. Only when the MemStore fills up is its data flushed to an HFile.

HBase uses an LSM-tree structure to keep data in sorted order for efficient reads: data is sorted into the LSM tree in the MemStore before being written to an HFile.

The MemStore can be confusing. Before being flushed to an HFile, data is already persisted in the WAL on HDFS, so why keep it in a MemStore as well? Recall that files in HDFS cannot be modified in place, while HBase must keep data sorted by row key; that sorting happens in the MemStore. It is worth noting that the MemStore is not there to speed up writes, but to sort row keys.

  • HFile

An HFile is the actual carrier of stored data: all the tables and columns we create end up in HFiles. When a MemStore reaches its size threshold, or the flush interval elapses, HBase flushes the MemStore to HDFS; the resulting on-disk files are called HFiles. At that point the data is truly persisted to disk.
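The flush threshold itself is configurable. A minimal sketch: the property name below is the standard one, and the value shown is simply the shipped default, used here for illustration.

// Flush a MemStore to a new HFile once it reaches 128 MB (the default)
Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);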

Locating a region

Before walking through HBase's read and write paths, let's see how a region is located. Regions are an important HBase concept: they live on RegionServers, so how does a client find the region it needs during reads and writes? The mechanism differs between earlier and later versions of HBase.

Earlier versions of HBase (before 0.96.0)

Earlier versions of HBase used a three-layer lookup architecture, as shown in the following figure:

As the figure shows, the first layer is a node in ZooKeeper that records the location of the -ROOT- table.

The second layer is the -ROOT- table, which records the locations of the .META. regions. The -ROOT- table has only one region, and through it the data in the .META. table can be reached.

The third layer is the .META. table, which records the region locations of user data tables. The .META. table can have multiple regions.

The lookup procedure is as follows:

Step 1: The client reads the /hbase/root-region-server znode in ZooKeeper (ZK) to learn which RegionServer hosts the -ROOT- table.

Step 2: It queries the -ROOT- table to find which .META. region holds the metadata for the table it needs, and which RegionServer that .META. region is on.

Step 3: It queries the .META. table to find which region contains the row key being looked up.

Step 4: It connects to the RegionServer where the data resides.

Newer versions of HBase

The old addressing scheme had a number of drawbacks, which newer versions improve upon: the hbase:meta table alone is used to locate regions. So where does a client find hbase:meta itself? The answer is ZooKeeper: it has a /hbase/meta-region-server znode from which the client obtains the location of the hbase:meta table, and it then uses hbase:meta to look up the region holding the required data.

The procedure is as follows:

Step 1: The client queries the /hbase/meta-region-server znode in ZooKeeper for the location of the hbase:meta table.

Step 2: The client connects to the RegionServer hosting the hbase:meta table. hbase:meta stores the rowkey ranges of all regions, so the client can look up which region the target rowkey belongs to and which RegionServer serves that region.

Step 3: With this information, the client connects directly to the RegionServer that owns the target rowkey and operates on it.

Step 4: The client caches the meta information so that the next access does not need to reload hbase:meta.
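The client API normally hides this addressing, but it can be observed explicitly through RegionLocator. A sketch, assuming an open Connection named connection and a hypothetical table name:

// Ask which region, and therefore which RegionServer, holds a given row key
try (RegionLocator locator = connection.getRegionLocator(TableName.valueOf("my_table"))) {
  HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("some-row-key"));
  System.out.println("Region: " + location.getRegionInfo().getRegionNameAsString());
  System.out.println("Server: " + location.getServerName());
}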

Basic use of the client API

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression.Algorithm;

public class Example {

  private static final String TABLE_NAME = "MY_TABLE_NAME_TOO";
  private static final String CF_DEFAULT = "DEFAULT_COLUMN_FAMILY";

  public static void createOrOverwrite(Admin admin, HTableDescriptor table) throws IOException {
    // If the table already exists, disable and delete it before recreating it
    if (admin.tableExists(table.getTableName())) {
      admin.disableTable(table.getTableName());
      admin.deleteTable(table.getTableName());
    }
    admin.createTable(table);
  }

  public static void createSchemaTables(Configuration config) throws IOException {
    try (Connection connection = ConnectionFactory.createConnection(config);
         Admin admin = connection.getAdmin()) {

      HTableDescriptor table = new HTableDescriptor(TableName.valueOf(TABLE_NAME));
      table.addFamily(new HColumnDescriptor(CF_DEFAULT).setCompressionType(Algorithm.NONE));

      System.out.print("Creating table. ");
      createOrOverwrite(admin, table);
      System.out.println(" Done.");
    }
  }

  public static void modifySchema(Configuration config) throws IOException {
    try (Connection connection = ConnectionFactory.createConnection(config);
         Admin admin = connection.getAdmin()) {

      TableName tableName = TableName.valueOf(TABLE_NAME);
      if (!admin.tableExists(tableName)) {
        System.out.println("Table does not exist.");
        System.exit(-1);
      }

      HTableDescriptor table = admin.getTableDescriptor(tableName);

      // Add a new column family to the table
      HColumnDescriptor newColumn = new HColumnDescriptor("NEWCF");
      newColumn.setCompactionCompressionType(Algorithm.GZ);
      newColumn.setMaxVersions(HConstants.ALL_VERSIONS);
      admin.addColumn(tableName, newColumn);

      // Update the existing column family
      HColumnDescriptor existingColumn = new HColumnDescriptor(CF_DEFAULT);
      existingColumn.setCompactionCompressionType(Algorithm.GZ);
      existingColumn.setMaxVersions(HConstants.ALL_VERSIONS);
      table.modifyFamily(existingColumn);
      admin.modifyTable(tableName, table);

      // Disable the table before destructive schema changes
      admin.disableTable(tableName);

      // Delete the existing column family
      admin.deleteColumn(tableName, CF_DEFAULT.getBytes("UTF-8"));

      // To delete a table, it must be disabled first
      admin.deleteTable(tableName);
    }
  }

  public static void main(String... args) throws IOException {
    Configuration config = HBaseConfiguration.create();

    // Add any necessary HBase/Hadoop configuration files
    config.addResource(new Path(System.getenv("HBASE_CONF_DIR"), "hbase-site.xml"));
    config.addResource(new Path(System.getenv("HADOOP_CONF_DIR"), "core-site.xml"));

    createSchemaTables(config);
    modifySchema(config);
  }
}
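The example above only covers schema administration. As a complementary sketch of reading and writing data, assuming a table like the one created above (same constants, plus imports for Scan, ResultScanner, Result, Table, and org.apache.hadoop.hbase.util.Bytes):

try (Connection connection = ConnectionFactory.createConnection(config);
     Table table = connection.getTable(TableName.valueOf(TABLE_NAME))) {

  // Write one cell: row key + column family + qualifier -> value
  Put put = new Put(Bytes.toBytes("row1"));
  put.addColumn(Bytes.toBytes(CF_DEFAULT), Bytes.toBytes("qual1"), Bytes.toBytes("value1"));
  table.put(put);

  // Scan a row-key range; the stop row is exclusive
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes(CF_DEFAULT));
  scan.setStartRow(Bytes.toBytes("row0"));
  scan.setStopRow(Bytes.toBytes("row9"));
  try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
      System.out.println(Bytes.toString(result.getRow()));
    }
  }
}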

Summary of easily confused points

Q1: What does MemStore do?

In HBase, a table can have multiple column families. Each column family is stored together physically and corresponds to one Store, and each Store contains one MemStore. As we know, HBase data is stored in HDFS, which does not support in-place modification, yet data must be sorted by RowKey. So HBase first writes incoming data into the MemStore, where it is sorted into an LSM-tree structure, before writing it out to an HFile.

In short: the MemStore's purpose is not to speed up writes or reads, but to maintain the sorted data structure.

Q2: Will data be read from MemStore first?

No. The MemStore exists to sort data by RowKey, not to accelerate reads. Reads are accelerated by a dedicated cache called the BlockCache. If the BlockCache is enabled, it is consulted first; only then are the HFiles and MemStore read.

Q3: What does BlockCache do?

The BlockCache keeps data in memory to improve read performance. With the BlockCache enabled, HBase first checks whether the requested data is cached; if not, it retrieves the HFiles stored on disk.

It is worth noting that each RegionServer has only one BlockCache. The BlockCache is not a required part of data storage; it exists purely to optimize read performance.

The BlockCache works as follows: when a read request reaches HBase, the system first tries the BlockCache; if the required data is not there, it is fetched from the HFiles and MemStore, and the containing block is then cached in the BlockCache as the data is returned.
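A sketch of the client-visible knobs around the BlockCache and Bloom filters, using the same 1.x descriptor API as the rest of this article (BloomType is org.apache.hadoop.hbase.regionserver.BloomType; the family name is hypothetical):

// Per-column-family caching and Bloom filter settings
HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setBlockCacheEnabled(true);         // cache this family's data blocks (default: true)
cf.setBloomFilterType(BloomType.ROW);  // row-level Bloom filter helps skip HFiles on reads

// Per-scan: avoid flushing useful entries out of the cache during large scans
Scan scan = new Scan();
scan.setCacheBlocks(false);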

Q4: How does HBase delete data?

An HBase delete does not actually remove the data; instead it writes a tombstone marker that makes the deleted version, along with earlier versions, invisible. This is done for performance: HBase can clean up deleted records periodically rather than on every delete. The periodic cleanup happens during compaction, when HBase merges files together, which keeps the impact on HBase performance minimal. Deleting large numbers of records works fine even under heavy concurrent load.

There are two types of merge operations: Minor Compaction and Major Compaction.

A Minor Compaction merges several HFiles in a Store into one HFile. During this process, data whose TTL has expired is removed, but manually deleted (tombstoned) data is not. Minor compactions are triggered relatively frequently.

A Major Compaction merges all HFiles in a Store into a single HFile. During this process, manually deleted (tombstoned) data is actually removed, as are cell versions exceeding the column family's MaxVersions setting. Major compactions are triggered infrequently, by default once every 7 days. Because a major compaction consumes significant resources, its timing is usually controlled manually rather than left to the schedule.

One caveat: a Major Compaction removes data marked with tombstones, while a Minor Compaction does not; however, when a Minor Compaction selects files to merge, it skips files whose contents have entirely expired, so such expired files are dropped outright during the merge.
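Since major compaction timing is often controlled manually, here is a sketch using the Admin API (the table name is hypothetical, and an Admin handle is assumed to be open as in the earlier example; the call is asynchronous, with the server scheduling the actual work):

// Trigger a major compaction explicitly rather than waiting for the periodic schedule
TableName tn = TableName.valueOf("my_table");
admin.flush(tn);         // optionally flush MemStores to HFiles first
admin.majorCompact(tn);  // request a major compaction of all the table's stores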

Q5: Why does HBase provide high-performance read/write capabilities?

HBase stores data using an LSM-tree structure, in which data is sorted before being persisted, improving read efficiency. The LSM tree is the fundamental storage algorithm of Google BigTable and HBase, an improvement over the B+ tree used by traditional relational databases. Its core idea is to keep data on disk as sequentially as possible, reorganizing it frequently to preserve that order.

An LSM tree is effectively a collection of small trees: the small tree in memory is the MemStore, and every time the MemStore flushes, it becomes a new StoreFile (HFile) on disk. This batching of read/write operations is what gives HBase its performance.

Q6: What is the relationship between a Store and a column family?

The Region is HBase's core module, and the Store is the core module of a Region. Each Store corresponds to one column family of a table, and each Store contains one MemStore plus some number of HFiles.

Q7: Is WAL shared at RegionServer or Region level?

In HBase, each RegionServer maintains only one WAL: all its Region objects share that WAL rather than each keeping its own. Logs generated by updates to multiple regions are simply appended to the single log file in sequence, so the server does not need to open and write many log files at once. This reduces disk seeks and improves write performance.

The drawback is that if a RegionServer fails, its WAL must be split according to the Region each entry belongs to and distributed to other RegionServers for replay during recovery.

Q8: Can I still query data after Master hangs?

Yes. The Master service manages tables and regions. Its main duties are:

  • Handling user requests to create, delete, and alter tables
  • Load-balancing regions across RegionServers
  • Handling region splits and merges
  • Migrating regions away from a failed RegionServer

When a client accesses HBase, it only needs to ask ZooKeeper for the hbase:meta address and then connects directly to a RegionServer for reads and writes. The Master merely maintains the metadata of tables and regions, so its load is small. That said, the Master should not stay down for long, since the operations it coordinates would remain unavailable.

Conclusion

This article started from Google BigTable, introduced the CAP theorem, and analyzed why NoSQL emerged. It then examined the HBase data model and described HBase's principles and operating mechanisms in detail. Finally, it demonstrated basic use of the client API and explained some commonly confused points.

Follow the public account "Big Data Technology and Data Warehouse" and reply "information" to receive a big data resource package.