This article is about one of the many projects I led a team on a few years ago, and it walks through the architecture evolution of a system serving traffic at the hundred-million level.

1. Background

First, a brief introduction to the project background. The company provides a paid commercial product to cooperating merchants. Behind this product, hundreds of developers collaborate across many business systems that provide a wide range of powerful business functions, and the platform also includes one crucial core data product, whose purpose is to fully support the users' business operations and rapid decision making.

This article covers the large merchant data platform behind that data product, and how the platform's architecture evolved under the technical challenges of distribution, high concurrency, high availability, high performance, massive data, and so on.

Because the whole system is very large, involved many people, and evolved over a long time, it is impossible to describe every technical detail and scheme, so this article mainly narrates from the perspective of the overall architecture evolution.

The reason for choosing this business data platform's architecture evolution as the topic is that the platform is only loosely coupled to the underlying business. Unlike the consumer-facing e-commerce platform and other business-heavy systems we are responsible for, it does not carry heavy business logic, so the article can focus on the evolution of the technical architecture without going into too much business detail.

In addition, this platform is a relatively simple one among the many projects the author's team has been responsible for, yet it still went through several rounds of architecture evolution, which makes it well suited to a written walkthrough.

2. Business process of the merchant data platform

The following points are the core business processes of this data product:

  • Collect, in real time, the various kinds of business data generated every day by the large number of business systems the users work with
  • Store that data in the platform's own data center
  • Run, in near real time, a large number of SQL statements of hundreds to thousands of lines each to produce the various data reports
  • Finally, provide these data reports to the users for analysis

Basically, while merchants use the business systems, any change in the data is reflected in the various data reports almost immediately. Users can see the changes in the reports right away and use them to guide their decisions and operations.

The whole process is shown in the picture below.

3. The crudest version that went online, from 0 to 1

The picture above seems pretty simple, doesn’t it?

It seems the data platform only needs to find a way to collect the business systems' data and land it in MySQL tables, then run the hundred-plus SQL statements of several hundred lines each, write the SQL results into other MySQL tables as report data, and finally let users query that report data from MySQL on the report pages. That's it!

In fact, the 0-to-1 version of almost any system is fairly crude. At the beginning, in order to get this data platform up quickly, we really did build it with exactly this architecture. See the figure below.

In fact, in the early days, with little business volume, few requests, and little data, the architecture above was no problem at all, and it was nice and simple.

Directly based on the database binlog acquisition middleware we developed ourselves (which is another complex system in its own right, out of scope for this article; we can cover it later), data changes in each business system's database could be detected and synchronized to the data platform's MySQL database within milliseconds.

The data platform then ran scheduled tasks that executed the hundreds of large, complex SQL statements every few seconds, computed the data for the various reports, and stored the results in the MySQL library.

Finally, users can refresh the report and immediately check the latest report data from the MySQL database.
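As a rough illustration of this first version (the table names, SQL, schedule, and connection details below are all made up for illustration), here is a minimal sketch of a scheduled task that reruns a report SQL against the synchronized data every few seconds and writes the result into a report table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReportJob {

    // Hypothetical report SQL: in the real system these are
    // hundreds-to-thousands-of-line statements, not a one-liner.
    private static final String REPORT_SQL =
        "INSERT INTO report_daily_sales (merchant_id, stat_date, order_cnt, amount) " +
        "SELECT merchant_id, CURDATE(), COUNT(*), SUM(pay_amount) " +
        "FROM ods_order WHERE DATE(create_time) = CURDATE() " +
        "GROUP BY merchant_id " +
        "ON DUPLICATE KEY UPDATE order_cnt = VALUES(order_cnt), amount = VALUES(amount)";

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Rerun the report SQL every few seconds, as described above.
        scheduler.scheduleWithFixedDelay(ReportJob::runReport, 0, 5, TimeUnit.SECONDS);
    }

    private static void runReport() {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/data_platform", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(REPORT_SQL);
        } catch (Exception e) {
            e.printStackTrace(); // a real job would alert and retry
        }
    }
}
```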

Basically, this simple architecture works well without any technical challenges. However, things are often not as simple as we think, because we all know that one of the biggest advantages and resources of domestic Internet giants is the abundance and mass of C-end users and B-end cooperative businesses.

In terms of C-end users, any Internet giant launching a new C-end product is likely to quickly reach hundreds of millions of users.

As for B-end merchants, any Internet giant that enters the B-end market is likely to quickly gather hundreds of thousands or even millions of paying B-end users through its huge influence and partner resources.

So unfortunately, over the next year or two, the system will face significant technical challenges and pressures from the rapid growth of the business.

4. Technical challenges of massive data storage and computing

The first big problem, like the first technical challenge of many large systems, was storing huge amounts of data.

When the system first launches, it may be used by only dozens of merchants, but as product sales and promotion keep ramping up, it may gather 100,000 users within a few months.

These users use the products you provide heavily every day, and every day they generate a huge amount of data. As you can imagine, with business users at the hundreds-of-thousands scale, the data added each day can easily reach tens of millions of rows. **Remember, this is new data every single day!** That puts enormous pressure on the crude architecture shown above.

If you are in charge of such a system and gradually find tens of millions of rows flooding into MySQL every day, the situation becomes very frustrating: the data in a single MySQL table can quickly grow to hundreds of millions or even billions of rows. And against those monster tables you are supposed to run SQL statements of hundreds or even thousands of lines, full of deeply nested queries and all kinds of multi-table joins?

I bet that if you try, your data platform will simply freeze, because a single large SQL statement can take hours to run, the MySQL server's CPU stays pegged at 100%, and eventually the database server goes down.

So this was the first technical challenge: the data volume kept growing, the SQL kept getting slower, and the MySQL servers came under more and more pressure.

At that time, we had seen the rapid growth of the business, so we absolutely had to rebuild the system architecture first. We couldn’t let the above situation happen. The first architecture reconstruction was imperative!

5. Separating offline computing from real-time computing

In fact, by the time we did this project a few years ago, big data technology was already being applied well in China. In the large Internet companies especially, big data technology was supporting many projects in production, and we ourselves had accumulated plenty of experience with it.

Given the requirements of this data product, it was entirely feasible to keep all data from yesterday and earlier in big data storage for offline storage and offline computation, and to collect only the current day's data in real time.

Thus, the heart of the first architecture refactoring against this technical challenge was the separation of offline computing from real-time computing.

As you can see from the diagram above, the new architecture is divided into offline and real-time computing links.

**One is the offline computing link:** every day in the early morning, all data from yesterday and earlier in the business systems' MySQL databases is imported into Hadoop HDFS for offline storage, and offline computation is then performed on that data with Hive/Spark.
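To make the offline link a bit more concrete, here is a minimal sketch of such a nightly job (the Hive table names, columns, and partition layout are assumptions for illustration, not the actual schema): a Spark job reads the day's partition of imported data and aggregates it into a report table.

```java
import org.apache.spark.sql.SparkSession;

public class OfflineReportJob {
    public static void main(String[] args) {
        // Assumed setup: the nightly import has already landed the data
        // in a Hive partition such as ods.ods_order/dt=<date>.
        SparkSession spark = SparkSession.builder()
                .appName("offline-report")
                .enableHiveSupport()
                .getOrCreate();

        String dt = args.length > 0 ? args[0] : "2019-01-01"; // partition to compute

        // Hypothetical report aggregation; real jobs chain many such SQL steps.
        spark.sql(
            "INSERT OVERWRITE TABLE dws.report_daily_sales PARTITION (dt = '" + dt + "') " +
            "SELECT merchant_id, COUNT(*) AS order_cnt, SUM(pay_amount) AS amount " +
            "FROM ods.ods_order WHERE dt = '" + dt + "' " +
            "GROUP BY merchant_id");

        spark.stop();
    }
}
```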

If you don't know much about big data, you can refer to an article I wrote earlier: "Dude, let me tell you in plain English what Hadoop architecture is all about". As some of the best and most widely used big data technologies in the world, Hadoop and Spark are naturally suited to distributed storage and computing of massive, PB-level data.

With the offline computing link fully backed by big data technologies, massive data storage is solved perfectly: even if hundreds of millions of rows arrive per day, the distributed storage can be expanded at any time, and distributed computing is naturally suited to offline computation over massive data.

Even if it takes a few hours every early morning to finish computing the data from yesterday and earlier, that is acceptable, because hardly anyone looks at the data in the small hours; it mainly has to be done before people come to work around 8 a.m.

**The other is the real-time computing link:** after midnight each day, the day's latest data changes are synchronized from the business databases into the data platform's storage within seconds, in the same way as before, and the data platform then runs its large number of SQL statements to compute the reports. Meanwhile, at midnight the data platform purges yesterday's data from its own storage, keeping only the current day's data.

The biggest change to the real-time computing link is that only one day’s data is kept in the local storage of the data platform, which greatly reduces the amount of data to be stored in MySQL.

For example, if tens of millions of rows land in MySQL per day, the single-table data volume stays at the level of tens of millions. At that scale, with the indexes and SQL tuned to the extreme, all the reports can just barely be computed within a few dozen seconds.
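For completeness, here is a minimal sketch of the midnight purge described above (the table name, column, and connection details are assumptions): it deletes everything older than the current day from the real-time store, since that data is already covered by the offline link.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Minimal sketch: keep only the current day's data in the real-time MySQL store. */
public class MidnightPurgeJob {

    // In the real system this runs as a scheduled job right after midnight.
    public static void purgeYesterday() {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/data_platform", "user", "password");
             Statement stmt = conn.createStatement()) {
            // Rows before today are already handled by the offline link, so drop them here.
            stmt.executeUpdate("DELETE FROM ods_order WHERE create_time < CURDATE()");
        } catch (Exception e) {
            e.printStackTrace(); // a real job would alert and retry
        }
    }
}
```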

6. Increasing data volume and computational pressure

But doing the above only temporarily relieved the pressure on the system architecture, because the business kept accelerating and growing.

You cannot count on just tens of millions of rows per day forever; the business will not give you that luxury. Before long, a single day could bring hundreds of millions, even billions, of rows.

Once the daily data volume reaches the billion level and a single table holds hundreds of millions of rows, no amount of SQL tuning can guarantee that the hundred-plus complex SQL statements, each hundreds of lines long, will still run quickly.

At that point the original problem returns: slow SQL freezes the core of the data platform and puts so much pressure on the MySQL servers that their CPUs hit 100% and they eventually go down.

On top of that there is another problem: a single MySQL server's storage capacity is limited. Once the data arriving in a single day approaches or exceeds what one MySQL server can store, that one database simply cannot hold all the data, which is another big problem!

The second architecture reconfiguration is imperative!

7. Shortcomings of real-time computing technology in the big data field

At the time of this project a few years ago, the available real-time computing technologies in the big data field were mainly Storm, which was relatively mature, and Spark Streaming in the Spark ecosystem. Flink and Druid, which are hot today, did not yet exist as options.

After careful investigation, we found no real-time computing technology in the big data field that could support this requirement.

Storm did not support SQL, and even if we had forced SQL support onto it, that support would have been so weak that running complex SQL of hundreds or even thousands of lines on such a streaming engine was out of the question.

Spark Streaming was similarly limited at the time: although it could execute simple SQL, it could not accurately compute SQL of this complexity at all.

So unfortunately, given the technology of the day, there was no off-the-shelf open-source solution to this real-time computing pain point. We had to design and build our own data platform architecture from scratch around the specific scenarios of our business.

8. Database and table sharding to solve the data growth problem

First we had to solve the first pain point: what do we do if a single database server cannot store a full day's data?

The preferred solution, of course, is to shard databases and tables: split one database into multiple databases, place the different databases on different database servers, and split the data in each database across multiple tables.

With this sharded architecture, each database server holds only part of the data, and as the data volume grows, more and more database servers can be added to accommodate it, so the system scales on demand.

Meanwhile, each table is split into multiple tables within its database, which keeps the data volume of any single table under control, on the order of millions of rows. With performance tuned to the extreme, the SQL statements run efficiently against tables of that size and can return results within seconds.

Similarly, here is a picture to give you an intuitive feeling:
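Beyond the picture, here is a minimal sketch of the routing idea (the shard key, the counts of databases and tables, and the naming scheme are all assumptions for illustration): hash a shard key to pick a physical database, then a physical table within it.

```java
public class ShardRouter {

    // Assumed layout: 8 physical databases, 16 tables per database.
    private static final int DB_COUNT = 8;
    private static final int TABLE_COUNT_PER_DB = 16;

    /** Picks the physical database for a given merchant. */
    public static String routeDatabase(long merchantId) {
        long hash = Long.hashCode(merchantId) & Integer.MAX_VALUE;
        return "data_platform_" + (hash % DB_COUNT);
    }

    /** Picks the physical table inside that database. */
    public static String routeTable(long merchantId) {
        long hash = Long.hashCode(merchantId) & Integer.MAX_VALUE;
        return "ods_order_" + (hash / DB_COUNT % TABLE_COUNT_PER_DB);
    }

    public static void main(String[] args) {
        long merchantId = 123456L;
        System.out.println(routeDatabase(merchantId) + "." + routeTable(merchantId));
    }
}
```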

9. Read/write separation to reduce database server load

After sharding the databases and tables, we faced another problem: if every database server handles both writes and reads, its CPU load and IO load become very high!

Why? Because the database server's CPU is already extremely busy handling thousands of concurrent writes per second, while queries are hitting it frequently at the same time.

Therefore we deployed MySQL with read/write separation: each primary database server has multiple replica servers, writes go only to the primary, and queries go only to the replicas.

Take a look at the image below:
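In addition to the diagram, here is a minimal sketch of how such routing can be wired in application code (the URLs and credentials are placeholders; real systems usually rely on a connection pool or database middleware for this): writes always go to the primary, reads are spread across the replicas.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Minimal read/write splitting sketch: writes go to the primary,
 * reads are spread across replicas. URLs are placeholders.
 */
public class ReadWriteRouter {

    private final String masterUrl = "jdbc:mysql://master-host:3306/data_platform";
    private final List<String> replicaUrls = Arrays.asList(
            "jdbc:mysql://replica-1:3306/data_platform",
            "jdbc:mysql://replica-2:3306/data_platform");

    /** Connection for INSERT/UPDATE/DELETE statements. */
    public Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(masterUrl, "user", "password");
    }

    /** Connection for SELECT statements; picks a replica at random. */
    public Connection readConnection() throws SQLException {
        String url = replicaUrls.get(ThreadLocalRandom.current().nextInt(replicaUrls.size()));
        return DriverManager.getConnection(url, "user", "password");
    }
}
```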

10. Self-developed sliding window dynamic computing engine

But even that was not enough. In the production environment we found that even with single-table data capped at a few million rows, running the hundred-plus complex SQL statements still took tens of seconds or even minutes, which was hard to accept for a paid product whose timeliness requirement was at the level of seconds!

Therefore we optimized the architecture once more and embedded a purely self-developed sliding window computing engine into the data platform. The core ideas are as follows (a simplified sketch follows after the list):

  1. While the database binlog acquisition middleware collects changes, it slices them into sliding time windows of a few seconds each, and tags every piece of data with the label of the window it falls into
  2. At the same time, it maintains an index of the sliding time windows, recording which window each slice of data belongs to, along with some index information and state for each window's data
  3. The core engine of the data platform then no longer reruns the huge SQL statements over the entire day's data every few seconds; instead it works window by window, extracting only the data tagged with a given window and computing over just that recent slice
  4. The data inside one sliding window may amount to only a thousand rows or so; the engine runs all the complex SQL against just that window's data, then merges the window's result with the results already computed for earlier windows, and finally writes the merged report data into MySQL
  5. In addition, a number of production-grade mechanisms have to be considered: what if the computation of a sliding time window fails? What if a window computes too slowly? If the system goes down in the middle of a window's computation, how is the computation recovered automatically after restart? And so on.
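Here is the simplified sketch promised above. It is only meant to convey the window-tagging and merge idea; the event shape, window size, and in-memory state are assumptions, and the real engine persists window state and results to MySQL and handles the failure cases listed in point 5.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Highly simplified sketch of the sliding-window idea described above. */
public class SlidingWindowEngine {

    private static final long WINDOW_MILLIS = 5_000; // a few seconds per window

    /** A single captured binlog change, tagged with the window it falls into. */
    static class ChangeEvent {
        final long windowId;
        final long merchantId;
        final long payAmount;
        ChangeEvent(long ts, long merchantId, long payAmount) {
            this.windowId = ts / WINDOW_MILLIS;   // the window tag
            this.merchantId = merchantId;
            this.payAmount = payAmount;
        }
    }

    // Window index: windowId -> events that belong to that window.
    private final Map<Long, List<ChangeEvent>> windowIndex = new ConcurrentHashMap<>();
    // Merged report state: merchantId -> running total across all windows.
    private final Map<Long, Long> mergedReport = new ConcurrentHashMap<>();

    /** Called by the binlog collector for each captured change. */
    public void onChange(long ts, long merchantId, long payAmount) {
        windowIndex.computeIfAbsent(ts / WINDOW_MILLIS, k -> new CopyOnWriteArrayList<>())
                   .add(new ChangeEvent(ts, merchantId, payAmount));
    }

    /** Computes one closed window and merges its result into the report state. */
    public void computeWindow(long windowId) {
        List<ChangeEvent> events = windowIndex.remove(windowId);
        if (events == null) return;
        // Only this window's events are aggregated (instead of the whole day),
        // then the partial result is merged with previously computed results.
        for (ChangeEvent e : events) {
            mergedReport.merge(e.merchantId, e.payAmount, Long::sum);
        }
        // In the real system the merged result is then written back to MySQL.
    }
}
```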

With this sliding window computing engine, the system's computing performance improved by dozens of times. Each sliding window's data can now finish all report calculations in just a few seconds, which means the real-time data finally presented to users is only a few seconds behind, rather than tens of seconds.

Again, take a look at the picture below.

11. Offline computing link performance optimization

The self-developed sliding window computing engine solved the performance problem of the real-time computing link, but at this point a performance problem surfaced on the offline computing link…

Every day in the early hours, the full historical data was imported offline from the business databases, and then a large number of complex SQL statements, thousands of lines each, had to run over the full data set of tens of billions of rows. At that volume the process took a very long time, sometimes lasting from the small hours well into the working morning.

The key problem was that the offline computing link imported and recomputed the full data set every single day, which was terribly wasteful.

The reason it was done this way is that data in the business databases is updated every day, while data in Hadoop cannot be updated in place the way it can in the business databases. So, at the beginning, the full historical data was imported every day as the latest snapshot and recomputed in full.

Here we optimized the offline computing link, **mainly by moving from full computation to incremental computation:** after each day's data is imported into Hadoop, the rows that changed that day are identified by their business timestamps, extracted as incremental data, and placed into a separate incremental data table.

In addition, the data lineage of each calculation has to be analyzed automatically according to the specific business requirements: the incremental data may need to be combined with part of the full historical data to complete a calculation, in which case the relevant slice of history is extracted and merged in, and once the calculation is done, its results are merged with the previously computed historical results.

After the switch from full to incremental computation, the offline computing link, even with tens of billions of rows of data, needs only an hour or two in the early morning to compute yesterday's incremental data and finish all offline computing tasks, at least a tenfold improvement over full computation.
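As a rough sketch of the incremental approach (the table names, the business timestamp column, and the partition layout are assumptions for illustration): first extract only the rows whose business timestamp changed on the given day into an incremental table, then recompute only the report slices affected by that increment and merge them back.

```java
import org.apache.spark.sql.SparkSession;

public class IncrementalOfflineJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("incremental-offline")
                .enableHiveSupport()
                .getOrCreate();

        String dt = args[0]; // the day being computed, e.g. "2019-01-01"

        // 1) Extract only rows whose business timestamp changed on that day.
        spark.sql(
            "INSERT OVERWRITE TABLE ods.ods_order_inc PARTITION (dt = '" + dt + "') " +
            "SELECT * FROM ods.ods_order_full " +
            "WHERE to_date(update_time) = '" + dt + "'");

        // 2) Recompute reports only for the merchants touched by the increment,
        //    rather than rescanning the entire history for every merchant.
        spark.sql(
            "INSERT OVERWRITE TABLE dws.report_daily_sales PARTITION (dt = '" + dt + "') " +
            "SELECT f.merchant_id, COUNT(*) AS order_cnt, SUM(f.pay_amount) AS amount " +
            "FROM ods.ods_order_full f " +
            "WHERE f.merchant_id IN (SELECT DISTINCT merchant_id FROM ods.ods_order_inc WHERE dt = '" + dt + "') " +
            "GROUP BY f.merchant_id");

        spark.stop();
    }
}
```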

12. Interim summary

This, then, is the architecture the system ran on in its early period: not overly complex, with plenty of flaws, far from perfect, but in the business context of the time it worked quite well.

In the business context of this architecture's early days, new data arrived at roughly the hundreds-of-millions-of-rows-per-day level, yet after database and table sharding, single-table volumes stayed at the millions level. At peak, a single database server took around 2000 writes per second and around 100 queries per second, while the database cluster as a whole peaked at around 10000 writes per second and 500 queries per second, and more database servers could be added at any time to take on more data and higher write and query concurrency.

Also, because of read/write separation, the CPU load and IO load of each database server will not be full during peak times, preventing the database server from being overloaded.

The self-developed sliding-time-window computing engine ensures that the day's real-time data changes are computed in micro-batches within a few seconds and fed back into the data reports that users see.

At the same time, this engine manages the state and log of calculation by itself. In case of calculation failure of a window, system downtime, calculation timeout and other abnormal situations, this engine can automatically retry and recover.

In addition, the massive data from yesterday and earlier is stored and computed offline by the Hadoop and Spark ecosystem; after the performance optimization, an hour or two each early morning is enough to finish computing it.

Finally, the real-time and offline calculation results are merged into the same MySQL database. When users operate on the business systems, the real-time reports refresh within a few seconds; and if they want to look at data from yesterday or earlier, they can select any time range and view it at any moment.

In the early months, hundreds of millions of rows were added daily, and the overall data volume across the offline and real-time links reached the tens-of-billions level. Whether for storage scaling or efficient computation, this architecture basically held up.

13. Prospects for the next stage

This architecture evolution story will span a whole series of articles, because the evolution of a large system's architecture goes on for a long time: the architecture is upgraded and refactored many times over to meet ever-growing technical challenges and to keep withstanding massive data, high concurrency, high performance, high availability, and so on.

The next article will cover the next step: refactoring the data platform into a highly available, fault-tolerant distributed system architecture, to address single points of failure, excessive CPU load on a single node, automatic failover, automatic data fault tolerance, and so on. Later articles will also describe the even more complex platform architecture we built to support high concurrency, high availability, high performance, and massive data.

Q&A from the previous article

In my last article I wrote about high-concurrency optimization for distributed locks; for details, see: Distributed lock high-concurrency optimization practice in a thousand-orders-per-second scenario! I received a lot of questions, but they all came down to one:

With the segmented distributed-lock approach the article used to prevent oversold inventory, what do you do if one segment's stock does not cover the quantity being purchased?

First, I did mention this in one sentence in the article, perhaps without enough detail: if one segment's stock is insufficient, you have to lock other segments as well and perform a combined deduction across them. Having to lock multiple segments like that is indeed quite troublesome.

If you look at the Java 8 LongAdder source code, its segmented-lock style optimization is just as troublesome, having to handle migration across segments.
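For readers who asked, here is a highly simplified, in-memory sketch of the combined-deduction idea (names and structure are made up; in production each segment would be a stock record guarded by its own distributed lock): try the home segment first, and if it cannot cover the purchase, lock the remaining segments in a fixed order and deduct across them.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Toy sketch of segmented stock deduction with a combined-deduction fallback. */
public class SegmentedStock {

    private final int[] stock;           // remaining stock per segment
    private final ReentrantLock[] locks; // one lock per segment

    public SegmentedStock(int segments, int stockPerSegment) {
        stock = new int[segments];
        locks = new ReentrantLock[segments];
        for (int i = 0; i < segments; i++) {
            stock[i] = stockPerSegment;
            locks[i] = new ReentrantLock();
        }
    }

    /** Deducts `quantity` from one segment, falling back to a combined deduction. */
    public boolean deduct(long userId, int quantity) {
        int seg = (int) (userId % locks.length);   // home segment for this request
        locks[seg].lock();
        try {
            if (stock[seg] >= quantity) {          // fast path: one segment is enough
                stock[seg] -= quantity;
                return true;
            }
        } finally {
            locks[seg].unlock();
        }
        return deductAcrossSegments(quantity);     // slow path: combine segments
    }

    private boolean deductAcrossSegments(int quantity) {
        for (ReentrantLock lock : locks) lock.lock();   // fixed order avoids deadlock
        try {
            int total = 0;
            for (int s : stock) total += s;
            if (total < quantity) return false;         // genuinely sold out
            for (int i = 0; i < stock.length && quantity > 0; i++) {
                int take = Math.min(stock[i], quantity);
                stock[i] -= take;
                quantity -= take;
            }
            return true;
        } finally {
            for (ReentrantLock lock : locks) lock.unlock();
        }
    }
}
```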

Secondly, I emphasized repeatedly in that article not to take the example literally. There are many other technical means of solving e-commerce inventory oversell; in practice we use those other solutions rather than this one, and I will discuss how in a future article.

The article was just using that example as a business case to illustrate the concurrency problem of distributed locks and high concurrency optimizations to get the idea across.

Third, to emphasize once more: focus on the idea of segmented locking itself, remember not to map the example literally onto real systems, and do not get hung up on the inventory-oversell business.

END

Stay tuned:

How to Design a Highly Fault-Tolerant Distributed Computing System for a Multi-Billion-Traffic System Architecture

How to Design a High-Performance Architecture for Ten-Billion-Level Traffic

How to Design a High-Concurrency Architecture for Hundreds of Thousands of Queries per Second

How to Design a Full-Link 99.99% High-Availability Architecture for a Hundred-Million-Level Traffic System

If you found this helpful, please help share it; your encouragement is the author's biggest motivation. Thank you!


**Recommended reading:**

1. Please! Don't ask me about the underlying principles of Spring Cloud in interviews

2. [Behind the Double 11 Carnival] How does the microservice registry handle tens of millions of accesses in a large-scale system?

3. [Performance Optimization] Spring Cloud parameter tuning practice for tens of thousands of concurrent requests per second

4. How does a microservice architecture guarantee 99.99% high availability during the Double 11 Carnival?

5. Dude, let me tell you in plain English what Hadoop architecture is all about

6. How does the Hadoop NameNode support thousands of concurrent accesses per second in a large-scale cluster?

7. [Secrets of Performance Optimization] How does Hadoop improve terabyte-scale large-file upload performance by 100x?

8. Please, don't ask me in interviews about the pitfalls of implementing TCC distributed transactions!

9. [What a pitfall!] How do eventually consistent distributed transactions guarantee 99.99% high availability in real production?

10. Please, don't ask me in interviews about the implementation principle of Redis distributed locks!

11. [Eye-opening!] See how Hadoop's underlying algorithms elegantly improve large-scale cluster performance by more than 10x

12. Distributed lock high-concurrency optimization practice in a thousand-orders-per-second scenario!