First, a few words up front
Earlier in the “Hundred-Million-Level Traffic System Architecture” series, we mainly covered the following aspects of a large merchant data platform:
- How to support the storage and computation of tens of billions of records
- How to design a highly fault-tolerant distributed architecture
- How to design a high-performance architecture that carries tens of billions of requests
- How to design a high-concurrency architecture that handles 100,000 queries per second
- How to design a full-link 99.99% high-availability architecture
In the next few articles, we will continue by discussing the scalable architecture, data consistency guarantees, and other aspects of this system.
Those of you who haven’t read this series can go back and look at some of the previous posts:
The Hundred-Million-Level Traffic System Architecture series:
- How to Support the Storage and Computation of Tens of Billions of Records
- How to Design a Highly Fault-Tolerant Distributed Computing System
- How to Design a High-Performance Architecture for Tens of Billions of Requests
- How to Design a High-Concurrency Architecture for 100,000 Queries per Second
- How to Design a Full-Link 99.99% High-Availability Architecture
Second, background review
If you have seen the previous series of articles, you will recall that by the end of the last one, the overall system architecture had evolved roughly to the state shown below.
If you haven’t read the previous articles, the diagram below will probably look like a confusing jumble of colored boxes. That can’t be helped; complex systems really are this complex.
Third, the coupling between the real-time computing platform and the data query platform
All right, let’s get started! In this article we will talk about how to handle the communication between the different subsystems of this platform with a scalable architecture.
It covers real scenarios and pain points of interaction between complex online systems, and I believe you will find it instructive.
Let’s focus on the left part of the architecture diagram above. The real-time computing platform in the middle writes the result of each data slice to the data query platform on the far left as soon as the computation finishes.
For various reasons, and because the computed results are an order of magnitude smaller than the raw data, we chose to have the real-time computing platform write its results directly into the MySQL database cluster of the data query platform, which then serves external query requests from that cluster.
In addition, to make sure the day’s real-time results could be queried by users under high concurrency, the real-time computing platform double-wrote each result to both the cache cluster and the database cluster.
This way, the data query platform reads the cache cluster first and falls back to the database cluster only on a cache miss.
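To make this interaction concrete, here is a minimal Java sketch of the coupling pattern. The class and method names (CacheCluster, DatabaseCluster, and so on) are hypothetical stand-ins, not the platform’s real code: the real-time computing platform double-writes each computed data slice to the cache cluster and the MySQL cluster, while the data query platform reads the cache first and falls back to the database on a miss.

```java
// Hypothetical stand-ins for the shared storage that couples the two platforms.
interface CacheCluster {
    void set(String key, String value, int ttlSeconds);
    String get(String key);   // returns null on a cache miss
    void delete(String key);  // used for invalidation
}

interface DatabaseCluster {
    void upsertResult(String sliceKey, String resultJson);
    String queryResult(String sliceKey);  // returns null if no row exists
}

class RealtimeComputePlatform {
    private final CacheCluster cache;
    private final DatabaseCluster db;

    RealtimeComputePlatform(CacheCluster cache, DatabaseCluster db) {
        this.cache = cache;
        this.db = db;
    }

    /** Called after each data slice finishes computing: double-write DB and cache. */
    void publishSliceResult(String sliceKey, String resultJson) {
        db.upsertResult(sliceKey, resultJson);       // database is the source of truth
        cache.set(sliceKey, resultJson, 24 * 3600);  // cache the day's result for hot queries
    }
}

class DataQueryPlatform {
    private final CacheCluster cache;
    private final DatabaseCluster db;

    DataQueryPlatform(CacheCluster cache, DatabaseCluster db) {
        this.cache = cache;
        this.db = db;
    }

    /** Query path: try the cache first, fall back to the database cluster on a miss. */
    String querySliceResult(String sliceKey) {
        String cached = cache.get(sliceKey);
        return cached != null ? cached : db.queryResult(sliceKey);
    }
}
```

Note that both classes depend on the same two stores; that shared dependency is exactly the coupling discussed next.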
So the above is the typical coupling architecture between the real-time computing platform and the data query platform at that stage of the system’s evolution.
The two different systems are coupled through the same set of data stores (database cluster + cache cluster).
Take a look at the figure below to get a clear sense of the coupling between the systems.
System coupling pain point 1: passively absorbing high-concurrency write pressure
If you have read the earlier articles in this series, you know that the early evolution focused mainly on the architecture of the real-time computing platform, so that it could support high-concurrency writes and high-performance computation over massive data, eventually withstanding an influx of tens or even hundreds of thousands of records per second for storage and computation.
However, because we adopted the simplest and most direct coupling shown above early on, with the real-time computing platform writing the computed result of each data slice straight into the shared storage, a big problem emerged.
The real-time computing platform itself has no trouble withstanding ultra-high-concurrency writes and performing fast, high-performance computation.
However, as data volume grows, it also writes its results to the database cluster with ever-increasing concurrency, and once the teams were split up, that database cluster is actually maintained by the data query platform team.
In other words, the real-time computing team does not care about the state of the database cluster, but simply writes data to the cluster.
The data query platform team, however, has to passively absorb the data written by the real-time computing platform under ever-growing concurrency.
At this point the engineers on the data query platform team are likely stuck in a constant state of anxiety: there are many architectural improvements they want to make to their own system, such as the self-developed cold-data query engine mentioned earlier,
yet they keep getting swamped by alarms from the online database servers.
As the business grows, the single-node write pressure on a database server quickly climbs to 5,000-6,000 writes per second. During peak hours every day, the CPU, disk I/O, and network of the online servers are under heavy load, triggering frequent alarms.
The data query platform team’s rhythm is disrupted: their architecture evolution has to be adjusted passively around the pressure coming from the real-time computing platform. They must stop what they are doing and immediately work out a sharding plan for the database cluster, deciding how to split databases and tables and how many to add.
At the same time, because of the sharding scheme, the query mechanism of the data query platform itself has to change along with it, bringing a great deal of refactoring, research, data migration, code changes, and deployment work. A sketch of the kind of routing logic involved follows below.
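To give a sense of why sharding drags both teams in, here is a hedged sketch (the shard key, database names, and counts are hypothetical) of the routing rule that every writer on the real-time computing platform and every reader on the data query platform would have to agree on once the cluster is split:

```java
// Hypothetical sharding router: maps a shard key (e.g. a merchant id) to a
// concrete database and table once the cluster is split into N databases
// with M tables each.
class ShardRouter {
    private final int dbCount;      // e.g. 8 physical databases
    private final int tablesPerDb;  // e.g. 32 tables per database

    ShardRouter(int dbCount, int tablesPerDb) {
        this.dbCount = dbCount;
        this.tablesPerDb = tablesPerDb;
    }

    /** Returns a "database.table" target such as "result_db_3.result_table_17". */
    String route(long shardKey) {
        int totalTables = dbCount * tablesPerDb;
        int slot = (int) Math.floorMod(shardKey, (long) totalTables);
        int dbIndex = slot / tablesPerDb;
        int tableIndex = slot % tablesPerDb;
        return "result_db_" + dbIndex + ".result_table_" + tableIndex;
    }
}
```

Any change to dbCount or tablesPerDb means a coordinated code change, data migration, and release on both sides, which is exactly the passive, unplanned work described above.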
In fact, the situation described above is absolutely unreasonable.
Because the whole data platform is a core system of the core business department of a big Internet company, it is jointly developed by dozens of Java engineers and big data engineers, and it is divided into multiple teams.
For example, one team is responsible for data access system, one team is responsible for real-time computing platform, one team is responsible for data query platform, one team is responsible for offline data warehouse, and so on.
Therefore, once work is divided among teams, no team should have to passively absorb a sudden surge of write pressure from another team; that breaks each team’s own working rhythm.
The root cause of this problem is that there is no decoupling between the two systems.
As a result, the data query platform team has no way to exercise any effective control or management over the data flooding in from the real-time computing platform, which is exactly the problem of “passively absorbing high-concurrency write pressure”.
The passive high-concurrency write pressure caused by system coupling is not limited to the above. In this scenario, all kinds of strange things actually happened in the online production environment:
Once, a large amount of hot data was suddenly generated online, and the computed results for that hot data poured into the data query platform. Because there was no flow control at all, the number of concurrent writes to one database server instantly exceeded 10,000. The DBA feared the database would go down, and everyone was on the verge of a breakdown.
System coupling pain point 2: online performance jitter caused by database maintenance operations
On the other side of this coupling, the engineers on the real-time computing platform team also want to cry out: our lives are hard too!
Think about it from the other direction: table structure changes to an online database are completely normal, especially for a rapidly iterating business.
At requirements review meetings you may run into a product manager who changes the requirements today and changes them again tomorrow. The engineers get furious and want to throttle someone, but in the end there is no way around it: the requirements have to change, so the table structure has to change and indexes have to be added.
But consider this: under the strongly coupled architecture above, a single table holds tens of millions of rows, and a single database server is taking several thousand writes per second.
In this scenario, try running a MySQL DDL statement against the online database? I’d advise you not to, because the younger engineers on the data query platform team have done exactly that.
The actual result: the DDL that modified the online table structure caused the real-time computing platform’s database write performance to drop by more than 10x…
That in turn delayed a large number of the real-time computing platform’s data-slice computation tasks. Because the computed results could not be flushed to storage promptly, users could not query them, which led to a flood of online complaints.
What’s more, the DDL statement itself was extremely slow, taking tens of minutes to finish, and during those tens of minutes the whole system suffered large-scale computation and data delays.
Only after the DDL statement finally finished, tens of minutes later, did the real-time computing platform slowly return to normal computation through its own automatic delayed-scheduling recovery mechanism.
orz… Therefore, from then on, the engineers of the data query platform had to carefully schedule such database maintenance operations between 2 and 3 o’clock in the morning, to avoid affecting the performance and stability of the online system.
But don’t the young engineers have girlfriends? Don’t the older engineers have wives and children? Nobody wants to be staring out the window of a taxi home at 3 a.m.
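As an aside, one common mitigation for this kind of incident (not necessarily what the team did at the time) is MySQL’s online DDL, available since 5.6: the ALTER TABLE is asked to build the index in place without locking writes, and to fail fast if the server cannot honor that request. A minimal JDBC sketch, with a hypothetical connection string, table, and column:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Hedged sketch: request an in-place, non-locking index build (MySQL 5.6+).
// If the server cannot satisfy ALGORITHM=INPLACE / LOCK=NONE, the statement
// fails immediately instead of silently blocking the write path.
// Assumes the MySQL JDBC driver is on the classpath; all names are hypothetical.
public class OnlineDdlExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://db-host:3306/result_db_0";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "ALTER TABLE result_table_0 "
              + "ADD INDEX idx_slice_time (slice_time), "
              + "ALGORITHM=INPLACE, LOCK=NONE");
        }
    }
}
```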
To put it bluntly, all of the problems above come down to the fact that the two systems are coupled directly through shared storage, so the slightest change in one system immediately affects the other. Coupling, coupling, and more coupling!
System coupling pain point N…
The above covers only two of the coupling pain points. Space is limited, and it is hard to walk through every pain point of those months of coupling one by one, but the pain points in the actual online production environment also included, among others:
- Data loss caused by bugs in the real-time computing platform’s write path, which the data query platform engineers ended up having to troubleshoot;
- When the real-time computing platform double-writes the cache cluster and the database cluster, it has to implement the double-write consistency guarantee itself, which stuffs its codebase with a large amount of logic that is not really its own business logic (a sketch of this follows the list);
- Whenever the data query platform carries out sharding maintenance, such as expanding the databases or tables, the real-time computing platform engineers have to change their code and configuration along with it, and test and deploy together;
- In all of these coupling scenarios, the engineers of the two teams often worked overtime together until midnight, to the point where their girlfriends suspected they were getting a little too close; in reality it was just a bunch of weary guys grinding through misery every day who could barely stand the sight of each other any more;
- Because both teams had to spend time and energy on the problems caused by the coupling, the architecture evolution of their own systems stalled, and neither could focus its people and time on work of real value.
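As a rough illustration of the second bullet, here is a minimal sketch of the kind of hand-rolled double-write consistency logic the real-time computing platform ends up carrying. It reuses the hypothetical CacheCluster and DatabaseCluster interfaces from the earlier sketch; the retry policy and names are assumptions, not the platform’s actual implementation.

```java
// Hand-rolled double-write "consistency": write the database first, then
// best-effort refresh the cache, falling back to invalidation on failure.
// CacheCluster / DatabaseCluster are the hypothetical interfaces shown earlier.
class DoubleWriteResultSink {
    private static final int MAX_CACHE_RETRIES = 3;

    private final DatabaseCluster db;
    private final CacheCluster cache;

    DoubleWriteResultSink(DatabaseCluster db, CacheCluster cache) {
        this.db = db;
        this.cache = cache;
    }

    void write(String sliceKey, String resultJson) {
        // 1. The database cluster is the source of truth, so it is written first.
        db.upsertResult(sliceKey, resultJson);

        // 2. Then try to refresh the cache; after repeated failures, delete the
        //    key so readers at least reload fresh data from MySQL on a miss.
        for (int attempt = 1; attempt <= MAX_CACHE_RETRIES; attempt++) {
            try {
                cache.set(sliceKey, resultJson, 24 * 3600);
                return;
            } catch (RuntimeException cacheError) {
                if (attempt == MAX_CACHE_RETRIES) {
                    tryInvalidate(sliceKey);
                }
            }
        }
    }

    private void tryInvalidate(String sliceKey) {
        try {
            cache.delete(sliceKey);
        } catch (RuntimeException ignored) {
            // In the real system this would have to be logged and alerted on;
            // none of it has anything to do with real-time computation itself.
        }
    }
}
```

None of this retry and invalidation code is real-time-computation business logic, yet under the coupled design it lives inside the real-time computing platform’s codebase.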
Fourth, a preview of the next article
In the next article, starting from these pain points, we will talk about how to use MQ middleware flexibly to decouple a complex system, and how to control the data traffic and solve the coupling problems once the systems are decoupled.
Stay tuned:
- How to Design a Scalable Architecture for Tens of Thousands of Concurrent Requests (Part 2)?
- How to Design a Scalable Architecture for Tens of Thousands of Concurrent Requests (Part 3)?