Contents
1. Previously in this series
2. What is data consistency?
3. Walking through a data computing link
4. Bugs in the data computing link
5. Inventory data inconsistency in e-commerce
6. Why data inconsistency is so hard to troubleshoot in large systems
7. Coming up next
1. Previously in this series
In this article, let's continue with the evolution of the billion-level traffic architecture from earlier in this series. The previous articles have covered everything up to the design of the scalable architecture. If any of that is unfamiliar, please read them first:
- How to support the storage and computation of billions of data
- How to design a highly fault-tolerant distributed computing system
- How to design a high-performance architecture for ten billion traffic
- How to design a high-concurrency architecture for 100,000 queries per second
- How to design a full-link 99.99% high-availability architecture
- How to design a scalable architecture for tens-of-thousands concurrency scenarios (Part 1)?
- How to design a scalable architecture for tens-of-thousands concurrency scenarios (Part 2)?
- How to design a scalable architecture for tens-of-thousands concurrency scenarios (Part 3)?
As usual, let's start by looking at the overall architecture diagram now that this complex system has evolved to its current stage.
Once again, if the complex architecture diagram below is hard to follow, please go back to the previous articles, because this series only makes sense with its context clearly in mind.
Next, we will talk about how to ensure data consistency in a complex system, against the background of a core system that carries tens of billions of traffic every day.
2. What is data consistency?
Simply put, a complex system inevitably performs some very complex processing on its data, and that processing may span multiple subsystems or even multiple services.
These systems apply complex business logic to a piece of data in a fixed order, and in the end they may produce a valuable piece of core system data that is written to storage, for example.
Here is a hand-drawn color diagram to give a feel for the scenario:
In the figure above, we can see how multiple systems process a piece of data in turn to produce a piece of core data and write it to storage.
It is in this process that the so-called data inconsistency problem can arise.
What does that mean? To take the simplest example, we expect a piece of data to change as follows: data 1 -> data 2 -> data 3 -> data 4.
So what finally lands in the database should be data 4, right?
The result? Somehow, after passing through the cooperating subsystems or services of the complex distributed system above, we ended up with data 87.
After all that work, something that has nothing to do with data 4 is what finally landed in the database.
And then the end user of the system might see a puzzling 87 on the front end.
This is embarrassing. The user will obviously feel that the data is wrong and will report it to the company's customer service. The bug is then passed on to the engineering team, and everyone starts troubleshooting.
This scenario is a data inconsistency problem, and it is the problem we will discuss in the next few articles.
In fact, similar problems exist in any large-scale distributed system. Whether it is e-commerce, O2O, or the data platform system used as the example in this article, it is the same.
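The 1 -> 2 -> 3 -> 4 example can be mimicked in a few lines of Python. This is only a toy sketch: the stage functions and the faulty rule in `stage_c_buggy` are invented for illustration and are not taken from the real system.

```python
# Hypothetical sketch: every stage name and rule here is invented for illustration.

def stage_a(value):
    # Expected behaviour: data 1 -> data 2
    return value + 1

def stage_b(value):
    # Expected behaviour: data 2 -> data 3
    return value + 1

def stage_c_buggy(value):
    # Expected behaviour: data 3 -> data 4.
    # Hidden bug: the owner misread the rule and multiplies by 29 instead of adding 1.
    return value * 29

def run_link(value, stages):
    """Pass the value through each stage in order and return what would be stored."""
    for stage in stages:
        value = stage(value)
    return value

expected = 4
stored = run_link(1, [stage_a, stage_b, stage_c_buggy])
print(f"expected {expected}, stored {stored}")  # expected 4, stored 87
```

Each stage looks harmless on its own, yet the value that reaches storage bears no obvious relation to the one we expected.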
3. Walking through a data computing link
So now that we have identified the problem, let's look at the data platform system: what exactly might cause the data that finally lands in storage to be wrong?
To understand this, let's first look back at the data platform project from the earlier articles: what is the computing link behind a piece of data that finally lands in storage?
Take a look at the picture below:
In the simplest terms, as shown in the figure above, the data computing link looks roughly like this.
- First, data is captured by the MySQL binlog collection middleware and forwarded to the data access layer.
- The data access layer then writes the raw data into KV storage.
- Next, the real-time computing platform extracts data from KV storage and runs its calculations.
- Finally, the results are written to the database + cache cluster. The data query platform reads from the database + cache cluster and serves user queries.
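The four steps above can be sketched as a tiny in-memory simulation. Everything here is an assumption made for illustration: the dictionaries stand in for KV storage, the database, and the cache, and the event fields (`id`, `metric`, `value`) are made up.

```python
from collections import defaultdict

# All names below are hypothetical stand-ins for the components in the list above.
kv_store = {}                  # raw-data KV storage
result_db = defaultdict(int)   # result database
result_cache = {}              # cache in front of the database

def access_layer(binlog_event):
    """Data access layer: land the raw binlog event in KV storage."""
    kv_store[binlog_event["id"]] = binlog_event

def realtime_compute():
    """Real-time computing platform: read raw data from KV storage and aggregate it."""
    totals = defaultdict(int)
    for event in kv_store.values():
        totals[event["metric"]] += event["value"]
    for metric, total in totals.items():
        result_db[metric] = total      # write the result to the database
        result_cache[metric] = total   # and refresh the cache

def query(metric):
    """Data query platform: serve from the cache, falling back to the database."""
    if metric in result_cache:
        return result_cache[metric]
    return result_db.get(metric)

# Events as the binlog collection middleware might forward them (made-up data).
for event in [
    {"id": 1, "metric": "orders", "value": 2},
    {"id": 2, "metric": "orders", "value": 3},
]:
    access_layer(event)

realtime_compute()
print(query("orders"))  # 5
```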
Seems simple enough, right?
But even in this system, the data computing link is definitely not that simple.
If you have read the previous articles in the series, you know that this system introduces a number of complex mechanisms to support high concurrency, high availability, and high performance.
So in reality, from the moment a raw piece of data enters the system until it lands in storage, the computing link includes the following:
- Traffic limiting at the access layer
- Failure retries in the real-time computing layer
- The degradation mechanism that falls back to local in-memory storage in the real-time computing layer
- Sharded aggregation and computation, where a single piece of data may be routed into one particular data shard
- The multi-level caching mechanism in the data query layer
These are just a few examples. However, even a few of these can make a data computing link many times more complex.
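To make this concrete, here is one way a single mechanism from the list, failure retries, could quietly distort a result. The failure mode shown (a write that succeeds but whose acknowledgement is lost) is an assumption chosen for illustration, not something the article attributes to the real system, and `FlakyResultStore` and `write_with_retry` are hypothetical names.

```python
class FlakyResultStore:
    """Hypothetical result store whose first write succeeds but whose acknowledgement is lost."""

    def __init__(self):
        self.total = 0
        self.calls = 0

    def add(self, amount):
        self.calls += 1
        self.total += amount                 # the write is applied...
        if self.calls == 1:
            raise TimeoutError("ack lost")   # ...but the caller never hears back


def write_with_retry(store, amount, max_attempts=3):
    """Naive failure retry: re-send the same increment until no error is raised."""
    for _ in range(max_attempts):
        try:
            store.add(amount)
            return
        except TimeoutError:
            continue
    raise RuntimeError("all attempts failed")


store = FlakyResultStore()
write_with_retry(store, 10)
print(store.total)  # 20, although only 10 was meant to be added
```

The retry logic and the store are each reasonable in isolation. The wrong total only appears when the two interact, which is exactly why extra mechanisms make the link harder to reason about.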
4. Bugs in the data computing link
Now you can see that in a complex system, a piece of core data may go through a very complex computing link, with countless twists and turns along the way, and anything can happen.
That makes it easy to understand how data inconsistency arises in a large distributed system.
The reason is actually very simple. To put it bluntly, it is a bug in the data computing link.
That is to say, somewhere in the computation a subsystem has a bug and does not behave the way we expect, so the data that finally comes out is wrong.
So why do such bugs appear in the data computing link?
The reason is simple. If you have ever worked on a large distributed system developed by hundreds of people, or led its architectural design, you will be all too familiar with core data going wrong, and with how painful that is.
In a large-scale distributed system developed by hundreds of people, it is quite possible that the owner of some subsystem or service misunderstood the data logic and wrote a hidden bug into the code.
This bug is not easily triggered and was never caught in the QA test environment, so the system goes live carrying a time bomb.
Eventually, some special situation in production triggers the bug, and the final data goes wrong.
5. Inventory data inconsistency in e-commerce
Readers who have worked with e-commerce will probably think of a classic scenario of this kind right away: inventory.
In large-scale e-commerce systems, inventory data is the core of the core. But in practice, in a distributed system, many different systems may each contain some logic that updates the inventory.
This can lead to the kind of problem described above: multiple systems update the inventory, and one of them does so with a bug.
Perhaps the person in charge of that system did not fully understand how the inventory should be updated, or the update logic did not take some special cases into account.
As a result, the inventory in the system does not match the actual inventory in the warehouse, and nobody can tell which step went wrong and corrupted the inventory data.
This is a typical data inconsistency problem.
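A minimal sketch of how this can happen, assuming two hypothetical systems (an order system and a refund system) that both touch the same inventory record. The function names and the specific missed special case are invented for illustration.

```python
class Inventory:
    """Hypothetical shared inventory record that several systems update."""

    def __init__(self, stock):
        self.stock = stock


def order_system_place(inv, order, qty):
    """Order system: deduct stock only when enough is available."""
    if inv.stock >= qty:
        inv.stock -= qty
        order["deducted"] = True
    else:
        order["deducted"] = False   # order rejected, nothing deducted


def refund_system_cancel_buggy(inv, order, qty):
    """Refund system with a hidden bug: it restores stock on every cancel,
    ignoring the special case where the order never deducted any stock."""
    inv.stock += qty


def refund_system_cancel_fixed(inv, order, qty):
    """What the owner should have written: restore only what was deducted."""
    if order.get("deducted"):
        inv.stock += qty


warehouse_stock = 5                     # physical units actually in the warehouse
inv = Inventory(stock=5)
order = {}
order_system_place(inv, order, qty=8)   # rejected: only 5 in stock
refund_system_cancel_buggy(inv, order, qty=8)
print(inv.stock, warehouse_stock)       # 13 5
```

The buggy path is only hit when a rejected order is later cancelled, which is the sort of special case that can slip through testing unnoticed.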
6. Why data inconsistency is so hard to troubleshoot in large systems
When you work on a large distributed system, if you have never thought about data inconsistency before, I bet you will be completely lost the first time customer service reports that some core data in your production system is inconsistent.
That is because processing a piece of core data involves several cooperating systems at the least, and more than ten at the most.
If you don't keep any logs, or only keep partial logs, then you are basically left with everyone staring at their own code.
All you have to go on is an incorrect result, say the number 87. More than ten people each go over their own code again and again, thinking hard.
Everyone is frantically simulating their own code in their head, but nobody can figure out why the number 87 comes out instead of the number 4.
So that is the real situation. Data inconsistency comes with roughly the following pain points:
- You basically cannot detect data problems proactively. You have to wait passively for users to find them and report them to customer service, which is likely to bring a flood of complaints about your product, make your boss very angry, and lead to serious consequences.
- Even when customer service tells you the data is wrong, you cannot reconstruct the scene. No evidence is left, so it comes down to a group of engineers imagining and guessing at the code.
- Even if you fix a data inconsistency once, there may be a next time, and if things go on like that, several talented people on your team will end up spending all their time on this one thing.
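To see why the missing logs hurt so much, compare the guessing game above with a link that records every hop. This is not the solution the series goes on to present, only a hedged sketch: `traced_run`, the stage names, and the per-stage expectations are all hypothetical.

```python
def traced_run(value, stages, expectations):
    """Run the stages in order, logging each hop.

    Returns (final value, log, name of the first stage whose output
    differs from `expectations`, or None if every stage matched).
    """
    log = []
    first_bad = None
    for name, stage in stages:
        out = stage(value)
        log.append({"stage": name, "input": value, "output": out})
        if first_bad is None and out != expectations[name]:
            first_bad = name
        value = out
    return value, log, first_bad


# Made-up stages: "compute" contains the hidden bug.
stages = [
    ("access", lambda v: v + 1),    # data 1 -> data 2
    ("compute", lambda v: v * 43),  # should be data 2 -> data 3
    ("store", lambda v: v + 1),     # should be data 3 -> data 4
]
expectations = {"access": 2, "compute": 3, "store": 4}

final, log, first_bad = traced_run(1, stages, expectations)
print(final, first_bad)  # 87 compute
```

With only the final 87 in hand, all three owners have to reread their code. With the per-hop log, the first stage whose output diverges is visible immediately.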
7. Coming up next
So, for the data inconsistency problem of large-scale distributed systems described in this article, the next article will explain how to build a complete core data guarantee scheme for a complex system under ten-billion-level traffic.
Stay tuned:
- How to ensure data consistency in a multi-billion-traffic system architecture (Part 2)?
- How to ensure data consistency in a multi-billion-traffic system architecture (Part 3)?
Author: Architectural Notes of Huoia
Link: juejin.cn/post/684490…
Source: Juejin (Nuggets). Copyright belongs to the author; please contact the author to obtain authorization.