Author: Zhou Yueyue, Sudan
In recent years, as data volumes have grown and demand for real-time data has surged, a series of big-data business scenarios have emerged. These scenarios are highly complex and span many business dimensions, so a variety of real-time data warehouse architectures have appeared to meet them.
This paper does not introduce big data scenarios themselves; it is aimed at readers who are building, or considering, a big data architecture. It covers the requirements of different roles, cost, and technology selection, and introduces an alternative HTAP product, TiDB, which meets the real-time, fast-response requirements of big data scenarios.
Why is it needed
Different roles have different concerns, from which typical business scenarios can be abstracted:
- Business decision makers need to understand the company's current state and set its direction for the next few years. For example, they may query real-time statements and profit-and-loss statements covering the past five years to assess the company's current development, and model existing data to forecast revenue for the coming years and determine the company's future direction. In this kind of analysis, the query time granularity may be hour, day, week, month, or year.
- The head of a business line needs to track the basics of the current business, such as revenue, health, and hot issues, and therefore needs to query real-time dashboards, real-time reports, revenue forecasts, and monetization data from any time within the past three years.
- The head of the security team must keep a constant watch on information security, which generally means querying risk-control platform data from any time within the past week to month.
Looking at these three common roles, two things stand out: analysis and judgment depend on real-time queries, and because the time ranges involved are long, the data volume must also be considered. At large data scale, satisfying real-time queries is a challenging task.
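To make these requirements concrete, here is a minimal SQL sketch of the kind of real-time aggregation such roles run. The `orders` table and its columns are hypothetical, used only to show how the time granularity can be switched:

```sql
-- Minimal sketch of a real-time revenue query; the orders table and
-- its order_time and amount columns are hypothetical examples.
SELECT
  DATE_FORMAT(order_time, '%Y-%m-%d %H:00') AS period,  -- hour granularity
  SUM(amount)                               AS revenue
FROM orders
WHERE order_time >= NOW() - INTERVAL 7 DAY
GROUP BY period
ORDER BY period;

-- Changing the granularity only changes the format string:
--   day: '%Y-%m-%d'   week: '%x-W%v'   month: '%Y-%m'   year: '%Y'
```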
How to choose a big data solution
Different roles need information along different dimensions, and they need it quickly, with real-time requirements on the data itself. Users, whether external or internal, are no longer satisfied with offline data analysis; they want fresher data, even direct analysis of ongoing transaction data, and that calls for a dedicated architecture. It also calls for a professional team: in general, a dedicated data team has to be set up to support each stakeholder's needs. Cost must be considered as well, including labor cost, time cost, and the cost of introducing various technology stacks. In other words, on top of the existing TP business scenarios, an AP-oriented data warehouse architecture has to be considered or built to meet real-time query requirements.
Existing solutions
Traditional data analysis / data warehouse solutions based on Hadoop-style analytical databases do not support real-time analysis well. NoSQL solutions such as HBase offer good scalability and real-time performance but cannot provide the analysis capabilities required. Traditional single-machine databases lack the scalability that data analysis needs, and traditional transactional databases, though they fully support transactions, also lack scalability and use row storage, which is not ideal for analytical scenarios. In addition, traditional big data platforms have problems of their own, as shown in the figure below:
As the figure shows, existing solutions suffer from:
- Complicated ETL logic, with high storage and time costs
- Long data links
- A complex technology stack
The cost
A complete data team consists of a data development group and a data product operations group. The data development group can be further divided into:
- Data development engineers: covering ETL, back-end, and front-end development, mainly responsible for building, operating, and maintaining the big data platform and data warehouse. Such talent is still relatively abundant on the current market.
- Data analysis engineers: responsible for summarizing the company's business data and performing data analysis. Such talent is also relatively abundant on the current market.
- Data algorithm engineers: responsible for building the mathematical and algorithmic models the business needs. Such talent is scarce and commands high salaries.
The data product operations group is subdivided into:
- Data product managers: responsible for evaluating the feasibility of solving business pain points with data-driven approaches, accurately identifying and evaluating requirements, and abstracting common data-driven needs into standardized products that reduce workload. Such talent is scarce on the current market.
- Data operations: responsible for helping business teams solve problems in using data products. Such talent is still relatively abundant on the current market.
Even a rough estimate of labor cost alone shows that such a team is no small expense, and the core talent is hard to acquire. Beyond labor cost, team building takes 0.5 to 1 year precisely because core talent is hard to find, and going from product design through iteration to output takes another 1 to 2 years. Introducing multiple technology stacks also makes maintenance complex. In short: high labor cost, high time cost, complex maintenance.
Why TiDB
TiDB features
TiDB is an HTAP database: it is positioned as a Hybrid Transactional/Analytical Processing (HTAP) converged database product featuring one-click horizontal scaling, strong consistency across multiple replicas, distributed transactions, and real-time HTAP, while remaining highly compatible with the MySQL protocol and ecosystem, making migration easy and keeping operation and maintenance costs low.
- **HTAP architecture based on row and column storage:**
  - Complete indexes and high-concurrency access allow precise location of detailed data, meeting high-QPS point queries.
  - A high-performance MPP framework and an updatable column storage engine keep the column store synchronized in real time as data is updated, so the system reads the latest data with the performance of an analytical database, meeting users' real-time query needs.
  - A single entry point serves both AP and TP: the optimizer automatically decides, per request, whether to use TP-style access, which index to pick, and whether to read column storage or run an MPP computation, which simplifies the architecture (see the SQL sketch after this list).
- **Flexible scaling:** TiDB, PD, and TiKV components can be scaled out or in conveniently in an online environment without affecting production; the operation is transparent. When the write and TP query capacity of the TiDB component becomes the bottleneck, it can be extended linearly by adding nodes, and storage nodes (TiKV) can likewise be expanded continuously as storage requirements grow. Conversely, when nodes are scaled in, online services barely notice.
- **Standard SQL and MySQL protocol compatibility:** Supports standard SQL syntax, including aggregation, JOIN, sorting, window functions, DML, and online DDL, so users can analyze data flexibly with standard SQL. TiDB is also compatible with the MySQL protocol and syntax, so MySQL-compatible tool chains and analysis tools can be used directly. Where MySQL is already widely deployed, a business migrated to TiDB can interconnect seamlessly without large code changes in the business layer.
- **Simple management:** The TiUP tool can quickly set up and deploy a cluster environment; normal operation does not depend on other systems, keeping operation and maintenance easy; and the built-in monitoring system facilitates performance bottleneck analysis and troubleshooting.
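The following SQL sketch illustrates this single-entry HTAP workflow. The `orders` table and its columns are the same hypothetical example as above, while `ALTER TABLE ... SET TIFLASH REPLICA` is standard TiDB syntax for adding columnar replicas:

```sql
-- Add two columnar (TiFlash) replicas of the row-store table so that
-- analytical reads can use column storage:
ALTER TABLE orders SET TIFLASH REPLICA 2;

-- A point query: the optimizer resolves it through the row store
-- (TiKV) and its indexes, keeping high-QPS TP access fast.
SELECT * FROM orders WHERE id = 10086;

-- A wide aggregation: the optimizer routes it to TiFlash, possibly
-- in MPP mode, with no change to the entry point.
SELECT region,
       DATE_FORMAT(order_time, '%Y-%m') AS month,
       SUM(amount) AS revenue
FROM orders
GROUP BY region, month;

-- EXPLAIN shows which engine served each part of the request
-- (cop[tikv] vs. cop[tiflash] / MPP tasks in the plan output).
EXPLAIN SELECT region, SUM(amount) FROM orders GROUP BY region;
```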
The surrounding ecosystem
- **Rich peripheral tools:** TiDB provides a rich set of tools for data inflow and outflow and for backup and recovery. For data inflow, DM performs full and incremental data synchronization from MySQL to TiDB without affecting read and write requests on either the business side or the TiDB side during synchronization, and the Lightning tool imports external offline data into TiDB in batches. For outflow, TiDB data can be sent through TiCDC and Binlog to downstream TiDB and Kafka environments for secondary processing. The BR tool backs up and restores data, full or incremental, as users require; it is effective especially for large data volumes and is easy to deploy and operate.
- **Good community ecosystem:** TiDB offers several channels for those who want to learn the product. The official documentation on the website covers every TiDB version, so you can look up the details of any given version. The blog section presents in-depth articles on each topic tag, and the user case section describes how TiDB is used today in core, typical scenarios across industries. For anyone who wants to use TiDB, or to see how others use it, TUG (TiDB User Group) provides a platform for exchange at any time, publishing user tutorials, experience reports, and articles explaining the internals; if you run into a problem, you can post on the AskTUG forum and get community support. For talent cultivation, the dedicated PU (PingCAP University) courses introduce the products and help those who need to learn and master the latest releases; after about three months of study, you can become familiar with and proficient in TiDB.
In summary, after adopting TiDB, it can largely replace a big data architecture assembled from many technology stacks, solving the classic problems of traditional big data platforms: complex ETL logic, long data links, diverse technology stacks, and the separation of data analysis from TP scenarios. Meanwhile, with some learning and training, engineers can quickly and effectively master TiDB; operation and maintenance costs are low, performance requirements are met, and the price/performance ratio is high.
Results of applying TiDB
The HTAP architecture based on TiDB has been used by many users in AP scenarios.
360 x TiDB
Why TiDB
- Single-instance write pressure forced database and table sharding. When a single MySQL instance served the business, testing showed it was under great pressure, so MySQL had to be split into multiple databases and tables to spread the writes.
- Data relocation was complex, and maintaining routing rules was costly. With large data volumes, the sharding rules had to change frequently; each change could involve relocating data, and the business side had to invest considerable manpower in maintaining the routing rules.
- Multiple database products drove up maintenance costs. To meet advertisers' reporting requirements, additional databases had to be introduced, and the offline ETL extraction from MySQL every early morning saturated the network cards, affecting other early-morning business operations.
Usage scenarios
**Advertisers' real-time and offline reporting business:** Real-time/offline reporting and advertising material delivery are the most important, core parts of an advertiser's advertising workflow.
Architecture
Benefits
- Good scalability and performance: sharding is no longer needed, and performance requirements are met, with 150 million rows of data written in two hours.
- Real-time analysis with strong consistency: with the TiFlash component, multi-dimensional global and detailed real-time analysis runs on the single table merged from the sharded databases and tables, bringing previously offline report statistics online while guaranteeing strong consistency (see the sketch after this list).
- The architecture is simplified, data links are shortened, and maintenance costs are reduced.
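A hedged sketch of that pattern, with hypothetical table and column names: once the MySQL shards are merged into a single TiDB table, one statement can run global, multi-dimensional analysis in real time.

```sql
-- ad_events stands in for the single table merged from many MySQL
-- shards; all names here are hypothetical illustrations.
SELECT advertiser_id,
       DATE(event_time) AS dt,
       COUNT(*)         AS impressions,
       SUM(cost)        AS spend
FROM ad_events
WHERE event_time >= CURDATE() - INTERVAL 3 DAY
GROUP BY advertiser_id, dt
ORDER BY spend DESC
LIMIT 100;
```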
For more details, please refer to 360 x TiDB | Performance increased by 10 times, how can 360 easily withstand double 11 traffic
ZTO Express x TiDB
Why TiDB
- With rapid business growth and fast-growing data volume, the data retention period in the Exadata appliance kept shrinking, while the business demanded ever-longer retention.
- The sharded database-and-table design could not meet the business's analysis and timeliness requirements; statistical analysis depended on stored procedures, and the system's scalability and maintainability were poor.
- At service peaks, single-machine performance became a bottleneck with high single-point-of-failure risk, and with T+1 data synchronization there was not enough time for analysis.
- The real-time data warehouse built on HBase and Kudu did not fit the existing technology stack and could not support multi-dimensional queries from the business side well.
Usage scenarios
**Express logistics business:** Every link in the logistics chain generates a large amount of data, and the data from each link is analyzed accordingly, including timeliness monitoring.
Architecture
Benefits
- The data retention period was extended from 15 days to 45 days.
- Online horizontal scaling is supported: storage and compute nodes can be added online at any time, transparently to applications.
- It meets the business's high-performance OLTP requirements, with performance only slightly lower than Oracle's, which is expected given that TiDB is a distributed database.
- TP and AP are separated, and the single-point database pressure is gone.
- More business dimensions can be analyzed.
- The overall architecture is clear, maintainability and system scalability are enhanced, and hardware costs are reduced.
For more details, see HTAP practice: from Exadata to TiDB
Conclusion
As an HTAP database, TiDB has unique advantages for real-time analysis and real-time data warehouse scenarios.
It supports not only real-time data writes but also integrated analysis. The hybrid row-and-column engine design extends its analytical capability beyond highly concurrent detailed-data location and analysis to large-scale interactive BI queries. Users can build real-time analysis on TiDB alone or combine it with the big data ecosystem to build an offline + real-time data warehouse system. TiDB is also exploring a Flink + TiDB architecture to fit more application scenarios, and several users are already running Flink + TiDB to meet their business needs.
To learn more about TiDB's explorations and solutions in real-time analysis scenarios, click the link at the end of this article, fill out the form, and contact our community technical experts for more exclusive in-depth content.
TiDB practice – community communication in real-time query service scenarios: forms.pingcap.com