Phoenix is an SQL layer built on HBase. It lets you insert, delete, update, and query HBase data through standard JDBC APIs, and it supports building secondary indexes. Of course, as with any open source product, there are pitfalls to watch out for; Aru will walk you through usage and best practices.

1. What is Phoenix

Written entirely in Java, Phoenix converts SQL queries into one or more HBase scans and orchestrates execution to produce standard JDBC result sets. Phoenix can do the following:

  • Compile SQL queries into HBase scans
  • Determine the start and stop keys of each scan
  • Execute scans in parallel
  • Push predicates in the WHERE clause to the server side for filtering
  • Execute aggregate queries through server-side hooks called coprocessors

In addition to these, Phoenix has some interesting enhancements to further optimize performance:

  • Secondary index to improve performance of non-row key queries (this is the main reason we introduced Phoenix)
  • Skip scan filters to optimize IN, LIKE, and OR queries (see the sketch after this list)
  • Optionally add salt to row keys to achieve load balancing and avoid hot spots
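As a rough illustration of the skip-scan case, consider queries like the following; the table and values here are hypothetical, not from the original article:

-- hypothetical composite-primary-key table: skip scan applies when a leading
-- PK column is constrained to discrete values and later PK columns are also constrained
select * from orders where region in ('east', 'west') and order_date = '2020-01-01';
-- an OR over the same column is the same discrete-value constraint in another form
select * from orders where region = 'east' or region = 'west';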

2. Phoenix architecture

Phoenix is structurally divided into client and server:

  • The client is embedded in the application: SQL is parsed and optimized into a QueryPlan, which is then converted into HBase scans; the HBase API is called to send query and computation requests and receive the returned results.
  • The server uses HBase coprocessors to handle secondary index maintenance, aggregation, and JOIN computation.

The earlier Phoenix architecture used a heavy-client model, executing the whole parse and query-plan pipeline on the client, as shown in the figure below.

This architecture has usage drawbacks:

  • The application is coupled to Phoenix core and must pull in the Phoenix kernel dependency; a single heavy-client integration package exceeds 120 MB.
  • Operation and maintenance are inconvenient. Phoenix is still being optimized and developed, and every Phoenix upgrade forces the application to be updated and re-released.
  • Only the Java API is supported. Developers in other languages cannot use Phoenix.

As a result, the community redesigned the architecture and introduced a new “light client” model.


The light client architecture divides Phoenix into two parts:

  • The client is a lightweight JDBC driver with minimal dependencies, decoupled from the Phoenix core; Java, Python, and Go clients are supported.
  • The QueryServer is deployed as an independent HTTP service that receives requests from light clients, parses and optimizes the SQL, and produces and executes the query plan.

3. Basic use

Portal: https://phoenix.apache.org/language/index.html (refer to the official syntax reference for day-to-day use)

3.1 Creating tables

By default, a table created from the Phoenix shell has only one region, so pay attention to pre-partitioning when creating tables.

-- Step 1: [Phoenix shell] Create the schema if the namespace does not exist yet
create schema "MDW_NS";
-- or: create schema mdw_ns; (unquoted identifiers are upper-cased automatically)

# Step 2: [HBase shell] Create the table in native HBase first. This step is needed
# because we want pre-partitioning and Phoenix's create-table clauses do not
# support a split algorithm
create 'MDW_NS:TABLE_DEMO', 'F1', {NUMREGIONS => 16, SPLITALGO => 'HexStringSplit'}

-- Step 3: [Phoenix shell] Create the table
create table if not exists mdw_ns.table_demo (
    id varchar not null primary key,
    f1.a varchar,
    f1.b varchar,
    f1.c varchar,
    f1.d varchar,
    f1.e varchar,
    f1.f varchar,
    f1.g varchar,
    f1.h varchar
)
TTL=86400,
UPDATE_CACHE_FREQUENCY=900000;
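
For completeness, a minimal write/delete sketch against the table above (the values are illustrative); note that Phoenix has no INSERT statement, writes go through UPSERT:

-- UPSERT has insert-or-update semantics
upsert into mdw_ns.table_demo (id, f1.a, f1.b) values ('0000000001', 'aa', 'bb');
-- deletes use standard DELETE syntax
delete from mdw_ns.table_demo where id = '0000000001';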

3.2 Creating indexes

You need to pay attention to spreading the index table across regions as well.

-- Method 1: pre-split the index table explicitly
create index table_demo_a on mdw_ns.table_demo (f1.a)
split on ('10000000','20000000','30000000','40000000','50000000','60000000','70000000','80000000','90000000','a0000000','b0000000','c0000000','d0000000','e0000000','f0000000');

-- Method 2: salt the index table
create index table_demo_a on mdw_ns.table_demo (f1.a) salt_buckets = 16;

-- [x] DESC is not supported in the index column list
create index table_demo_a on mdw_ns.table_demo (f1.a DESC);
-- [√] Declare the sort order in the primary key constraint when creating the table instead
-- (note: primary-key columns take no column family prefix)
create table if not exists mdw_ns.table_demo (
    a varchar not null,
    b varchar,
    f1.c varchar,
    f1.d varchar,
    f1.e varchar,
    f1.f varchar,
    f1.g varchar,
    f1.h varchar,
    constraint pk primary key(a, b desc)
);


3.3 SELECT queries

[√] select * from mdw_ns.table_demo limit 10;
[√] select * from mdw_ns.table_demo where id = '0000000000';
[x] select * from mdw_ns.table_demo where a = '0000000000';
[√] select id, b, c, d, e from mdw_ns.table_demo where a = '0000000000';
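
The third query above is marked [x] because the index on f1.a does not cover all the columns that select * needs, so Phoenix falls back to a full table scan. One possible fix, sketched here on the assumption that you also need columns b through e, is a covered index using INCLUDE:

-- copies b..e into the index table so the last [√] query above is served entirely by the index
create index if not exists table_demo_a_cover on mdw_ns.table_demo (f1.a) include (f1.b, f1.c, f1.d, f1.e);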


4. Best practices

4.1 Case sensitivity

Note: Phoenix is case sensitive for table names, column names, namespaces, and column families, and converts identifiers to uppercase by default. To suppress the conversion, wrap the identifier in double quotes ("). String values are enclosed in single quotes (').

-- case 1: query table_demo (table) in mdw_ns (namespace), where f1 (column family) + a (column)
-- holds the string '0000000000'
select id, a, b, c from "mdw_ns"."table_demo" where "f1".a = '0000000000';

-- case 2: query MODULE_REVISION (table) records in PS_NS (namespace)
select * from ps_ns.module_revision;


4.2 Precautions for adding salt

Salting addresses the scenario where write hotspots and range queries coexist. For details of how it works, see the Phoenix community documentation.

Generally, we only use salt when both of the following hold:

  • Writes are hotspotted or unbalanced
  • Range queries are required

Hotspots call for scattering the data, but scattered data is hard to range-scan. Meeting these conflicting demands at the same time requires a compromise: partial scattering plus partial ordering. That is what salt buckets provide: data is ordered within a bucket and distributed randomly across buckets. Each row is assigned to a bucket by hashing its row key modulo the bucket count, which balances write requests across buckets; a query reads all buckets to guarantee an ordered and complete result set.
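
Concretely, salting is declared when the table is created; a minimal sketch (the table name and the bucket count of 16 are illustrative only):

-- Phoenix prepends a single salt byte, hash(row key) % SALT_BUCKETS, to each row key:
-- writes spread across 16 buckets, and rows stay ordered inside each bucket
create table if not exists mdw_ns.salted_demo (
    id varchar not null primary key,
    f1.a varchar
) SALT_BUCKETS = 16;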

Side effects:

Write bottleneck: even as regions keep splitting, writes to the table are still spread over only the fixed set of buckets. As business volume grows, the bucket count cannot be adjusted, so additional regions cannot share the write load; write throughput therefore cannot grow linearly with cluster expansion, creating a bottleneck that limits business growth.

Read amplification: a SELECT is split by bucket and executed concurrently, one thread per bucket. Under high SELECT concurrency this can quickly exhaust the thread pool or push the QueryServer into full GC. At the same time, a simple query that would have been a single RPC is now split into multiple queries, greatly increasing query RT.

Both side effects constrain business growth, especially for large and fast-growing workloads: since the bucket count cannot be modified, the write bottleneck blocks scale-out, and the RT increase from read amplification sharply reduces resource efficiency.


Never use the salting feature merely as a pre-partitioning mechanism; design tables around the read/write patterns of the business.

The appropriate number of buckets depends on machine configuration and data volume. Estimate it as follows, where N is the number of Core/RegionServer nodes (for example, four nodes with 32 GB each suggest 4 x N = 16 buckets):

  • 8 GB of memory per node: 2 x N
  • 16 GB of memory per node: 3 x N
  • 32 GB of memory per node: 4 x N
  • 64 GB of memory per node: 5 x N
  • 128 GB of memory per node: 6 x N

Note: an index table inherits the salt bucket count of its base table by default; the bucket count cannot exceed 256; and the in-memory data structures of even an empty region take about 2 MB, which you can use to estimate the total number of regions a single RegionServer can carry. Some users have created large numbers of salted tables on low-spec nodes and exhausted the cluster's memory.


4.3 Use full table scans, OR, JOINs, and subqueries with caution

Although Phoenix supports various kinds of JOIN, it is still positioned primarily as an online database. Complex joins, such as subqueries returning large result sets or joins between two large tables, consume significant system resources at execution time, can seriously affect online traffic, and may even cause OutOfMemory exceptions. If you have high requirements for online stability and latency, stick to simple Phoenix queries in which every query hits the primary key of the base table or an index table. It is also recommended to always run explain on your SQL before executing it, to confirm that it hits an index or the primary key.
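
For example, assuming the table and index from section 3, the check looks roughly like this:

explain select id, b from mdw_ns.table_demo where a = '0000000000';
-- a plan line containing "RANGE SCAN OVER MDW_NS:TABLE_DEMO_A" indicates the index was hit;
-- "FULL SCAN OVER MDW_NS:TABLE_DEMO" means it was not, and the query should be reworked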

4.4 Phoenix does not support complex queries

The essence of Phoenix's secondary index is prefix matching. You can build multiple secondary indexes to broaden the query patterns available on the data. Index consistency is maintained through coprocessors, and index data is visible in real time. Phoenix does not support complex queries such as arbitrary AND/OR combinations, fuzzy search, or word segmentation; for those, a search engine (such as Solr or ES) is recommended. Of course, search engines build their indexes asynchronously in the background, which inevitably affects real-time visibility, so weigh the trade-off carefully.
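
As a concrete contrast (assuming the f1.a index from section 3.2), prefix matching means:

-- served by the index: exact match and prefix LIKE
select id from mdw_ns.table_demo where a = '00000000';
select id from mdw_ns.table_demo where a like '0000%';
-- not served: a leading-wildcard (fuzzy) LIKE cannot use the index and scans everything
select id from mdw_ns.table_demo where a like '%0000';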

4.5 Phoenix does not support complex analysis

Phoenix is positioned for operational analytics and is not suitable for complex analysis, such as the complex joins mentioned above. These are better handled by a dedicated big data compute engine such as Spark.

4.6 Phoenix can map an existing HBase table

Refer to the community documentation. Phoenix can create a VIEW or a TABLE mapped onto an existing HBase table. Note that if you map with a TABLE, running DROP TABLE in Phoenix also deletes the underlying HBase table. In addition, because column families and column names are case sensitive, they must match one-to-one for the mapping to succeed. Finally, most Phoenix field encodings differ from those of HBase's Bytes utility, so it is recommended to map only varchar fields.
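
A hedged sketch of the VIEW flavor of mapping (all names here are illustrative, and the quoted identifiers must match the HBase table exactly):

-- maps an existing HBase table 'NS:DEMO' with column family 'f'; dropping a VIEW,
-- unlike dropping a mapped TABLE, leaves the underlying HBase table intact
create view "NS"."DEMO" (
    pk varchar primary key,
    "f"."a" varchar,
    "f"."b" varchar
);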


5. Usage specifications

  • Case convention: Phoenix is case-sensitive and converts identifiers to uppercase by default, so standardize on uppercase when creating tables to avoid unnecessary trouble
  • Index names: creating an index creates an index table in the HBase cluster. For easy identification, use the naming convention {table name}_{indexed column} (for example, table TABLE_DEMO with index TABLE_DEMO_C)
  • Partitioning the base table: use the native HBase pre-partitioning method
  • Partitioning index tables: salt_buckets is recommended
  • SELECT usage: do not use select *
  • Table creation: do not keep the default UPDATE_CACHE_FREQUENCY policy (ALWAYS); change it to UPDATE_CACHE_FREQUENCY = 60000 (a template applying these conventions follows this list)
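
Putting these conventions together, a hypothetical table-creation template (the names are illustrative only):

-- uppercase identifiers, non-default UPDATE_CACHE_FREQUENCY
create table if not exists MDW_NS.TABLE_DEMO (
    ID varchar not null primary key,
    F1.A varchar,
    F1.B varchar
) UPDATE_CACHE_FREQUENCY = 60000;
-- index named {table}_{indexed column}
create index if not exists TABLE_DEMO_A on MDW_NS.TABLE_DEMO (F1.A);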

If you've read this far: original writing is not easy, so please follow and give this a like ~

Piecing knowledge fragments together into a Java knowledge map:
Github.com/saigu/JavaK… (for easy access to past articles)