Abstract: The advantages of HBase and Phoenix are well known, but putting them into production raises plenty of problems. Do replication's random request routing and Connection management give you a headache? In this talk, Didi takes you on a deep dive into HBase and Phoenix through typical application scenarios, shares its kernel improvements, and offers a number of internal tips on stability and capacity planning. Not to be missed!






The following content is compiled from the speakers' video presentation and slides.


This article covers the following topics:
1. HBase feature application and kernel improvement
2. Phoenix improvement and practice
3. GeoMesa application introduction and prospect
4. Stability & capacity planning

One. HBase feature application and kernel improvement

1. Typical application scenarios of HBase in Didi

Didi mainly uses simple HBase operations such as Scan and Get, and each fits different scenarios. Scan, for example, underpins time-series and report workloads. Time series are used in trajectory design: a business ID, a timestamp, and a trajectory position together form the ordered sequence. Likewise, in asset management, asset status moves through different stages, and a time series is built from the asset ID, timestamp, asset status, and other fields. Scan is also widely used in reports, and while there are many ways to implement them, the main one is Phoenix, which runs standard SQL over HBase for online transaction processing; with this approach, careful primary-key and secondary-index design is essential. Reports over users' historical behavior, historical events, and historical orders are all designed this way.
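To make the time-series pattern concrete, here is a minimal sketch assuming the HBase 2.x client API and a hypothetical trajectory table whose rowkey concatenates a business ID and a timestamp (the real schema is not shown in the talk):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TrajectoryScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("trajectory"))) {
            String businessId = "driver_0001";  // hypothetical business ID
            // Rowkey = businessId + "_" + timestamp, so one Scan over the
            // businessId prefix returns that entity's positions in time order.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes(businessId + "_"))
                    .withStopRow(Bytes.toBytes(businessId + "`")); // '`' sorts right after '_'
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()) + " -> "
                            + Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("pos"))));
                }
            }
        }
    }
}
```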




The Get operation applies to small objects such as voice files and Didi invoices stored in HBase. The most basic usage is fetching an entity's attributes by ID. In real-time computing, for example, multiple data streams need to be merged; the shared ID serves as the Rowkey in HBase, so data from several upstream sources can be aggregated into one table.
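A sketch of that merge-by-ID pattern: each stream writes its own columns into the same row (rowkey = entity ID), and a single Get returns the merged record. Table and column names here are hypothetical.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MergeById {
    static void writeAndRead(Connection conn) throws Exception {
        byte[] cf = Bytes.toBytes("cf");
        try (Table table = conn.getTable(TableName.valueOf("order_wide"))) {
            byte[] rowkey = Bytes.toBytes("order_42");  // the shared entity ID
            // Two upstream streams write different columns of the same row.
            Put fromStreamA = new Put(rowkey).addColumn(cf, Bytes.toBytes("price"), Bytes.toBytes("12.5"));
            Put fromStreamB = new Put(rowkey).addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("finished"));
            table.put(fromStreamA);
            table.put(fromStreamB);
            // A single Get by ID returns the merged view of both streams.
            Result merged = table.get(new Get(rowkey));
            System.out.println(Bytes.toString(merged.getValue(cf, Bytes.toBytes("price"))));
        }
    }
}
```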



In addition, other usage patterns are derived from HBase. Requirements at Internet companies change quickly and involve many business parties, and dynamic columns help meet such demands (see the sketch below). There are also composite applications such as graphs and coprocessors. Graph workloads include user-defined graphs that allow custom data sources and data layouts, and JanusGraph is also connected to the HBase cluster. Coprocessors are primarily used by Phoenix and GeoMesa.
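A minimal sketch of a dynamic column: only the column family must exist up front, while the qualifier is invented at write time. Names are hypothetical.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumns {
    static void tagUser(Connection conn, String userId, String tag) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("user_profile"))) {
            // New business requirement? Write a new qualifier -- no schema change needed.
            table.put(new Put(Bytes.toBytes(userId))
                    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes(tag), Bytes.toBytes("1")));
        }
    }
}
```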





2. Replication application and optimization

Assume the source cluster has three ReplicationSource hosts: ReplicationSource01, ReplicationSource02, and ReplicationSource03, and the target cluster has four RegionServers: RS01, RS02, RS03, and RS04. Traditionally, the source cluster sends each replication request to a random RegionServer in the target cluster. If the target tables live only on the two hosts of GROUP A, random routing can deliver requests to hosts in groups unrelated to that business. The routing policy is therefore optimized: for requests bound for the peer cluster, the source looks up the target cluster's GROUP assignment and sends each request to a host in the corresponding GROUP, so that other services are not affected. A conceptual sketch follows.
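This is a conceptual sketch of group-aware sink selection, not Didi's actual patch: the candidate sinks are restricted to the servers of the table's RSGroup, with the old random behavior as a fallback. The group membership list is assumed to come from the peer cluster's RSGroup metadata.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class GroupAwareSinkPicker {
    private final Random random = new Random();

    public String pickSink(List<String> peerServers, List<String> groupServers) {
        // Keep only peer RegionServers that belong to the table's RSGroup, e.g. GROUP A.
        List<String> candidates = peerServers.stream()
                .filter(groupServers::contains)
                .collect(Collectors.toList());
        if (candidates.isEmpty()) {
            candidates = peerServers; // fall back to the old random routing
        }
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```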




In addition, we hope to add table-level statistics to Replication in the future, counting per-table requests and connection errors so that replication can be analyzed and tuned from the user's perspective.


3. Connection management and use

On Didi's HBase version, users may run into Connection problems: once many connections are established, a large number of ZooKeeper connections pile up, so Connections need to be managed. A ClusterConnection is created inside the RegionServer to minimize connection creation. This applies to the Phoenix secondary-index scenario and is described in the Phoenix section.

4. Optimize ACL permission authentication

ACLs match user names and passwords together with IP addresses. A userinfo table is created to store user password information: the rowkey is the user name, and the Column Family (CF) holds the password. Just as the hbase:acl table has an /acl node in ZooKeeper, a corresponding /userinfo node is created.







The following figure shows the userInfo access flow. At the top, the client serializes and encapsulates the userInfo and sends it to the server, where the AccessController runs hooks such as preGetOp, preDelete, and prePut, and TableAuthManager then performs the permission check. The TableAuthManager class holds the permission cache: when new userInfo arrives, the cache is updated first so that clients can be authenticated against it. The traditional cache holds only the /hbase/acl information; after Didi's optimization it also includes /hbase/userinfo, and updating the cache requires joining the /acl and /userinfo information.
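A simplified sketch of that join, with hypothetical stand-in types for the internal structures:

```java
import java.util.HashMap;
import java.util.Map;

public class AuthCache {
    static class Entry { String password; String permissions; }

    private final Map<String, Entry> cache = new HashMap<>();

    // Invoked when the /acl or /userinfo ZK node fires a watch event.
    void rebuild(Map<String, String> aclByUser, Map<String, String> passwordByUser) {
        cache.clear();
        for (Map.Entry<String, String> acl : aclByUser.entrySet()) {
            Entry e = new Entry();
            e.permissions = acl.getValue();
            e.password = passwordByUser.get(acl.getKey()); // the join on user name
            cache.put(acl.getKey(), e);
        }
    }
}
```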



The native /acl mechanism has some problems in practice, and a key motivation for these changes is to reduce the dependence on ZooKeeper.
Didi has also optimized other areas, including the RPC audit log, RSGroup, Quota, and safe Drop table. The RPC audit log is used to analyze user request volume and error information. Quota can throttle user traffic, for example the number of put operations a user may perform on a table per second. And whereas dropping a table traditionally removed its contents outright, a snapshot is now taken before deletion to guard against accidental drops.


Two. Phoenix improvement and practice

1. Principle and architecture of Phoenix

Phoenix provides an SQL framework on top of HBase and manages its own metadata. In the figure, RegionServer3 stores the SYSTEM.CATALOG table. On each RegionServer, coprocessors perform operations such as query execution, aggregation, and secondary-index maintenance.




The Phoenix client manages connections and metadata, compiles SQL into physical execution plans, issues scans in parallel, and merges the returned scanner results. The RegionServer side makes heavy use of coprocessors.
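A basic round trip through the standard Phoenix JDBC client, which compiles the SQL into HBase scans and merges the parallel results; the ZooKeeper quorum and schema below are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQuery {
    public static void main(String[] args) throws Exception {
        // jdbc:phoenix:<zookeeper quorum> is the standard Phoenix JDBC URL form.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT ORDER_ID, STATUS FROM ORDERS WHERE USER_ID = ? LIMIT 10")) {
            ps.setString(1, "user_42");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getString(2));
                }
            }
        }
    }
}
```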





2. An important Phoenix business workload — full order history

Next comes the optimization and improvement Didi made for an important server-side Phoenix workload: historical order queries. Before the improvement, Didi's app could only query order data from the last three months; it now serves the full order history, meaning that in principle all order data can be queried. System stability reaches an SLA of 99.95%. Beyond the monitoring and recovery measures the HBase cluster itself provides, improvements were made in handling secondary-index write failures and in adding secondary indexes for the business. Didi also has strict query-latency requirements: P99 latency on the JDBC client is currently 30-40 ms. Functionally, a default value can be declared for each column.


2.1 Index Coprocessor improvement
Connection management is critical to stability. In HBase, ZooKeeper is a fragile component, and a large number of connections puts it under great pressure. Phoenix traditionally writes secondary indexes through the Index coprocessor. When a primary table declares many indexes and spans many regions, the number of ZooKeeper connections is the number of regions multiplied by the number of indexes, so a single table can account for several thousand ZooKeeper connections. To solve this, a ClusterConnection is established inside the RegionServer and multiplexed by all regions, as sketched below.
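A simplified illustration of the fix, not the actual patch: one process-wide connection shared by every region's index writes, instead of one connection (and ZooKeeper session) per region and index.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public final class SharedIndexConnection {
    private static volatile Connection connection;

    private SharedIndexConnection() {}

    // All regions on this RegionServer reuse the same connection for index writes.
    public static Connection get(Configuration conf) throws java.io.IOException {
        if (connection == null) {
            synchronized (SharedIndexConnection.class) {
                if (connection == null) {
                    connection = ConnectionFactory.createConnection(conf);
                }
            }
        }
        return connection;
    }
}
```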





2.2 Optimization of TupleProjector performance
TupleProjector was improved to optimize client P99 latency. Phoenix originally took about 150 ms to compile the query plan, mainly because the order table has more than 130 columns and building the column expression for each column takes about 1 ms; done serially, that total is an intolerable drain on the system. For wide tables of this kind, the most effective optimization is to build the column expressions in parallel. After the optimization, P99 reaches about 35 ms.
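A sketch of the idea behind the fix: when each of 130+ column expressions costs about 1 ms to build, the per-column work is fanned out in parallel. buildExpression() below is a placeholder for the real per-column construction, not Phoenix's actual code.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelProjection {
    static List<String> buildAll(List<String> columns) {
        return columns.parallelStream()              // fan the per-column work out
                .map(ParallelProjection::buildExpression)
                .collect(Collectors.toList());
    }

    static String buildExpression(String column) {
        return "expr(" + column + ")";               // stands in for the ~1 ms step
    }
}
```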





2.3 Secondary index design
Phoenix's salting feature is very effective but has a large impact on latency, so it is unsuitable when latency requirements are strict. We therefore avoid salting on both the primary table and the index table, and instead reverse the primary-key column to spread the keys. Function indexes are used in the indexes to reduce query latency. The sample code looks like this:
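The original sample was shown as an image; the DDL below is an illustrative reconstruction of the described design with a made-up schema: no SALT_BUCKETS, a rowkey the writer stores reversed, and a function index on the reversed value.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OrderDdl {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1:2181");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS ORDER_HISTORY ("
                    + " ORDER_ID VARCHAR NOT NULL PRIMARY KEY," // written as REVERSE(id) by the client
                    + " USER_ID VARCHAR, STATUS VARCHAR)");
            // A function index means queries can hit the index without the
            // caller reversing the lookup key itself.
            st.execute("CREATE INDEX IF NOT EXISTS IDX_USER"
                    + " ON ORDER_HISTORY (REVERSE(USER_ID)) INCLUDE (STATUS)");
        }
    }
}
```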





2.4 Pitfalls of the Default value
Most services do not use the Default value, and the traditional Default implementation has many bugs. One example is a write error in a secondary index whose Default column is declared: an attempt to write a new value still ends up writing the Default value into the index. An example follows:
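The original example was an image; this is a hedged illustration of the symptom with a made-up schema: the base table stores the value actually written, while the index row can still carry the DEFAULT.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DefaultValuePit {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1:2181");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE T (ID VARCHAR PRIMARY KEY,"
                    + " CITY VARCHAR DEFAULT 'beijing')");
            st.execute("CREATE INDEX IDX_CITY ON T (CITY)");
            st.execute("UPSERT INTO T VALUES ('o1', 'shanghai')");
            conn.commit();
            // Buggy behavior: SELECT CITY FROM T WHERE ID = 'o1' -> 'shanghai',
            // but a query served through IDX_CITY may still see 'beijing'.
        }
    }
}
```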



There are two fixes. One is to special-case the value on the client side before the index is built; the other is to correct the wrongly written value when generating the source data of the index table.
Another bug: when building an asynchronous secondary index, the rowkey generated at write time differs from the rowkey the Indexer builds, so queries against the index return duplicate rows. The cause is a logic bug in PTableImpl.newKey for columns with Default values.





2.5 Performance stress testing
After the optimizations, performance stress tests were run using the Java JDBC client and the Java QueryServer client. The results are as follows:



Non-Java languages go through the Go QueryServer, whose P99 results are slightly slower than those of the Java clients.


2.6 Secondary Index Policy
In the traditional logic, if a secondary-index write fails, the index table is disabled and rebuilt, which is unacceptable for an online service. Didi therefore turned off the automatic disable-and-rebuild behavior. With it off, how are write failures detected? The RegionServer logs are collected into ES and queried periodically, and the asynchronous partial rebuild of the index is invoked to repair the data. When the primary table is very large, an asynchronous full rebuild of the secondary index can split into so many MapTasks that the master cannot handle them, so the split logic was improved to merge some tasks. With these improvements in place, much more stable support can be provided.





2.7 Impact of Cluster Upgrade on Secondary Index Writes
During a cluster upgrade, some secondary-index writes are bound to fail. How can the impact be minimized? If the primary and index tables are deployed together, writes fail across the board whenever one machine is upgraded. As mentioned earlier, Didi added the GROUP function to HBase, so the primary table and index table can be deployed in separate groups to reduce the impact of a cluster upgrade. As shown below:





Three. GeoMesa application introduction and prospect

1. GeoMesa architecture principles

Didi has recently started investigating GeoMesa and does not yet have extensive online experience with it. GeoMesa is a large-scale spatio-temporal data query and analysis engine built on distributed storage and computing systems: you choose a storage backend and ingest data from a variety of sources, such as Spark and Kafka. The architecture is shown below:




GeoMesa's strength is indexing geospatial and temporal information. On HBase, GeoMesa maps two- or three-dimensional data down to a one-dimensional storage key, with Geohash as a common encoding. This kind of index is a point index and is widely used today. A concrete example is shown in the figure below:



Geohash maps high-dimensional geographic spatio-temporal data such as latitude and longitude to one dimension with a 0/1 encoding: latitude bits are interleaved into the odd positions and longitude bits into the even positions, and the resulting bit string is rendered with Base32 to reduce it to one dimension. The process is illustrated in the figure below, and a minimal encoder sketch follows:
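This is a self-contained sketch of the standard Geohash algorithm described above (not GeoMesa's internal implementation): bisect the longitude and latitude ranges alternately, emit one bit per step, and map every 5 bits to a Base32 character.

```java
public final class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double lat, double lon, int precision) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // even bit positions encode longitude, odd encode latitude
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            double[] range = evenBit ? lonRange : latRange;
            double val = evenBit ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            if (val >= mid) { ch = (ch << 1) | 1; range[0] = mid; } // upper half -> bit 1
            else            { ch = ch << 1;       range[1] = mid; } // lower half -> bit 0
            evenBit = !evenBit;
            if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Roughly Tiananmen Square (39.9075 N, 116.3972 E) -> "wx4g0"
        System.out.println(encode(39.9075, 116.3972, 5));
    }
}
```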



The theoretical basis of Geohash is the Z-order curve. The Z-order curve, however, has locality discontinuities: points that are far apart geographically can still end up adjacent in the encoded order, as shown in the figure below. As a result, a Geohash range query returns more candidates than are actually needed, many of them invalid.






As the figure above shows, some points share the same code prefix while their actual geographic locations are far apart, and conversely, points that are geographically close do not necessarily share a prefix.


2. GeoMesa application

GeoMesa is often used for heat maps. Didi uses it to record driving tracks, i.e., to query the vehicles that passed through a given area during a given time window, as well as for track-point processing and congestion analysis over vector paths. The XZ-ordered line and polygon indexes are expected to be applied to Didi's business scenarios in the future.



Didi has also built visualizations on top of GeoMesa, as detailed in the video.





3. GeoMesa Future Outlook

Given the Z-order curve's locality discontinuities mentioned above, we hope to generate indexes based on a Hilbert space-filling curve encoding in the future. The Hilbert curve preserves spatial locality: the closer two points are on the one-dimensional curve, the closer they are in 2D space, which suits Scan well. In addition, query planning in GeoMesa currently takes a long time, averaging 90 ms per query, which we hope to optimize.


Four. Stability & Capacity planning

In order to achieve stability and capacity planning, Didi mainly carried out the following work.

The first is machine planning, which considers three dimensions: reads/writes per second, average traffic, and storage space. Reads/writes per second bound the service's read/write capacity; average traffic affects both read/write capacity and disk I/O; storage space corresponds to disk space. Too many reads/writes per second hurt the service's GC and network traffic, and if the required storage exceeds the total disk space, the service also suffers. Machine capacity can therefore be planned by comparing these three dimensions using stress-test results and DHS statistics, and cluster groups can be planned from the same information. A sketch of the calculation follows.
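A sketch of the three-dimensional sizing rule: compute the number of machines each dimension demands and keep the maximum. The per-node limits below are illustrative placeholders, not Didi's real numbers.

```java
public class CapacityPlan {
    public static void main(String[] args) {
        double qps = 200_000;            // requests per second, from the stress test
        double trafficMBps = 800;        // average traffic in MB/s
        double storageTB = 120;          // required storage, from DHS statistics

        double qpsPerNode = 20_000;      // assumed per-node read/write capacity
        double trafficPerNodeMBps = 100; // assumed per-node disk/network bandwidth
        double storagePerNodeTB = 8;     // assumed usable disk per node, after replication

        // Each dimension demands some number of machines; the plan takes the max.
        long nodes = (long) Math.ceil(Math.max(qps / qpsPerNode,
                Math.max(trafficMBps / trafficPerNodeMBps, storageTB / storagePerNodeTB)));
        System.out.println("plan for " + nodes + " RegionServers");
    }
}
```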
In addition, a number of HBase metrics are monitored to determine whether user services are healthy; when a fault occurs, it can be diagnosed and fixed based on the monitoring. One example of the monitoring view is shown below:



Didi's current workflow is roughly as follows: when a user onboards, the request volume is estimated and stress-tested, and then the service goes live. After launch, workloads often change; for example, a fast-growing business writes ever more data. Some operations are therefore carried out periodically: online services are tuned based on monitoring data, through regular manual checks, or from user feedback. The goal for follow-up work is an automatic mode that uses monitoring data to optimize online services and automatically discover idle nodes and groups that can be optimized.



At the same time, Didi has built the Didi HBase Service (DHS) platform, which takes in user requirements and helps users understand service status through visualized, automatically verified information. Work is now under way to collect more detailed statistics about these services, with less manual calculation, to help administrators understand how users are using the clusters.



This article is original content of the Yunqi Community and may not be reproduced without permission.