Something’s wrong with the data!

A problem every service runs into from time to time.

You dig through logs, monitoring, the data itself, and the code logic.

When a service lacks transactional guarantees, data inconsistencies are inevitable: data is in the cache but not in the database, or vice versa; the data from phase A exists while the data from phase B is missing; data is absent where it should exist and present where it should not.

A master-slave setup runs into strong-consistency requirements, and the cache occasionally falls out of sync.

Faulty logic buried in the service quietly accumulates bad data over time.

…

In any case, the data ends up dirty.

So what do we do?

1. Deal with the current problem data

Right: not all of the problem data, just the data behind the current issue, the kind we call urgent and handle case by case.

Users’ problems and complaints must be answered immediately; they affect the product experience and the company’s image, and cannot be postponed.

Of course, before you get to that, preserve the scene: the logs, data snapshots, and intermediate findings from each step of your investigation are the evidence that confirms the problem.

Experience must be accumulated and lessons retained.

2. Fix the cause of the problem

Once the cause has been identified and the individual cases handled, the next thing to do is fix the service itself.

1. Transactional guarantees

First determine whether the problem sits in a core module, and whether the inconsistency can be tolerated.

For example, in a trading system the linkage between orders, inventory, account balances, and so on cannot tolerate inconsistency, so data consistency and correctness must be guaranteed at every layer.

At the service-logic level: idempotent interfaces, transactional business operations, and compensation plus retry to guarantee reliability. At the message-middleware level: message persistence, mirrored queues for fault tolerance, and so on. At the storage level: distributed caches, database master/slave replication, and so on.
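As a minimal sketch of the interface-idempotence idea, assuming a request-id dedup table committed in the same transaction as the business change (the schema and names are illustrative, not taken from any particular system):

```python
import sqlite3

# Minimal setup: a dedup table keyed by request id, plus the business table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (request_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO balances VALUES ('alice', 100)")
conn.commit()

def deduct(request_id: str, account: str, amount: int) -> None:
    """Idempotent deduction: replaying the same request_id changes nothing."""
    try:
        with conn:  # one transaction: dedup mark and balance change commit together
            conn.execute("INSERT INTO processed VALUES (?)", (request_id,))
            conn.execute("UPDATE balances SET amount = amount - ? WHERE account = ?",
                         (amount, account))
    except sqlite3.IntegrityError:
        pass  # duplicate request_id: a retry of an already-applied request, safe to ignore

deduct("req-1", "alice", 30)
deduct("req-1", "alice", 30)  # retried message: no double deduction
print(conn.execute("SELECT amount FROM balances").fetchone())  # (70,)
```

Because the dedup mark and the balance change commit together, a retried message can never apply the deduction twice.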

On the other hand, for view counts, like counts, and the like, few people care about one more or one less (unless the error is serious). When a big V (an influencer) publishes an article, the view count jumps to tens of thousands within moments, with thousands of reposts; the author will not care that a count that should read 30,001 shows 30,000.

For such data, eventual consistency is enough, and the occasional lost update can simply be ignored. There is no need to add transactional safeguards that cost performance.
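For contrast, a best-effort sketch for this class of counters, assuming a Redis-backed count via the redis-py client (host and key naming are assumptions):

```python
import logging
import redis  # assumes the redis-py client; connection details are illustrative

r = redis.Redis(host="localhost", port=6379)

def record_view(article_id: str) -> None:
    """Best-effort counter: no transaction, no retry queue."""
    try:
        r.incr(f"article:{article_id}:views")  # atomic increment; losses are tolerable
    except redis.RedisError:
        # Swallow the failure: one missing view out of 30,000 is acceptable,
        # and we avoid paying for transactional guarantees this data doesn't need.
        logging.warning("dropped a view increment for article %s", article_id)
```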

2. Master-slave and strong consistency

Master-slave replication and strong consistency are like the like poles of two magnets: they repel each other and usually cannot coexist.

I once had a colleague who, whenever **"query it right after writing it"** came up, would always laugh, the kind of laugh that says "how ridiculous."

At such moments I did not laugh; I only wondered and thought. I wondered why my colleague was laughing; and I asked myself whether this was really the only way, or whether there was a better alternative.

In reality, business logic usually nests: a large operation contains smaller ones, which contain smaller ones still, linked together ring after ring. It is like a man crossing a river with a dozen planks: he lays one down, steps on it, then lays the next.

So it is perfectly normal to need the previous change to be durable and visible before making the next one.

However, with the explosive growth of the information age, the load on information services has climbed by orders of magnitude, evolving into the many complex information systems of today. And it is in this setting that the master-slave revolution came into being.

Master-slave is a divide-and-conquer idea: read and write scenarios differ in volume, in processing logic, and in required guarantees, so they can be served in different ways.

Whether cache or database, one or more slave nodes are split off from the master to serve queries, and they replicate the master's changes in near real time. The data lags a little, but that fits most realistic scenarios; what we pursue here is a not-so-strict eventual consistency.

The master node handles data changes, plus the queries in scenarios that genuinely require strong consistency.

In real application development, pay special attention to identifying the business logic with zero tolerance for replication lag, and route those reads to the master node (or use other means) to satisfy the strong-consistency requirement.
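A minimal sketch of such routing, with stand-in nodes instead of a real driver (all names here are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    def query(self, sql: str) -> str:
        # Placeholder for a real driver call; returns which node served the read.
        return f"{self.name} executed: {sql}"

class ReadRouter:
    """Route strong-consistency reads to the master, everything else to a replica."""
    def __init__(self, master: Node, replicas: list[Node]):
        self.master = master
        self.replicas = replicas
        self._i = 0

    def query(self, sql: str, require_strong: bool = False) -> str:
        if require_strong or not self.replicas:
            return self.master.query(sql)          # zero tolerance for lag
        self._i = (self._i + 1) % len(self.replicas)
        return self.replicas[self._i].query(sql)   # round-robin over replicas

# e.g. a balance check before a payment must see the latest write:
router = ReadRouter(Node("master"), [Node("replica-1"), Node("replica-2")])
print(router.query("SELECT balance FROM accounts WHERE id = 42", require_strong=True))
print(router.query("SELECT COUNT(*) FROM page_views"))
```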

3. Avoid unnecessary low-level logic errors

Here the emphasis is on logic errors caused by negligence, not by lack of ability.

People are emotional creatures who wrestle with their own weaknesses all their lives: arrogance, irritability, impatience. These often lead to unnecessary mistakes.

An attribute assigned the wrong value, or not assigned at all; null values not checked; special state values not filtered out; and so on.

People will always make mistakes; what we can do is avoid the unnecessary ones.
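A few of these guards, sketched with hypothetical field names and a hypothetical Status enum:

```python
from enum import Enum
from typing import Optional

class Status(Enum):
    ACTIVE = 1
    DELETED = 2  # a "special state" that normal flows must filter out

def display_name(nickname: Optional[str], username: str) -> str:
    # Guard for "null value not checked": fall back instead of crashing on None.
    if nickname is None or not nickname.strip():
        return username
    return nickname

def visible_users(users: list[dict]) -> list[dict]:
    # Guard for "special state values not filtered": soft-deleted rows
    # must never leak into normal query results.
    return [u for u in users if u.get("status") is Status.ACTIVE]

users = [{"id": 1, "status": Status.ACTIVE}, {"id": 2, "status": Status.DELETED}]
print(visible_users(users))       # only user 1
print(display_name(None, "u42"))  # "u42", not a crash
```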

The importance of Code Review deserves a mention here. It is hard for anyone to find their own mistakes; subjective emotional factors and objective blind spots both get in the way. The onlooker sees most clearly, and that is exactly what Code Review offers.

The review process is one of discovery, correction, optimization, and improvement. Everyone focuses on something different: some on the overall architecture and design, some on details such as variable naming.

One thing worth pointing out: developers new to Code Review often resist it, consider it unnecessary, or feel ashamed when others find their bugs. Teams need to build shared understanding, awareness, and ground rules around the practice.

3. Process the dirty data

What if the data is dirty? Just wash it!

Is dirty data easy to handle? It is. The question is: where is it?

Data behind individual user reports can be fixed case by case; the dirty data still hiding in the system has to be located and cleaned.

1. Delimit the scope of impact

What data was affected? Since when? What is the impact?

Delimit the business logic involved and the data surfaces it touches. For example, in the data cluster of fans + friends + groups: when A follows B, A gains a friend, B gains a fan, and A may place B into a specific group.

Delimit the time range: from when the change went live, or when the offending logic was first triggered. For example, if the modification to object X went live on the 1st, then all data touched by X-related changes from the 1st until now needs to be found.

Delimit the impact itself: determine exactly what happened to the data, which in turn determines what to do next. For example, if an attribute was left unassigned, can it be reconstructed from other state attributes? If data that should have been deleted was not, can simply deleting it now resolve the problem?
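A minimal sketch of delimiting the time range with a query (table, columns, and dates are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x_objects (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO x_objects (status, updated_at) VALUES (?, ?)",
    [("ok", "2023-12-28"), ("suspect", "2024-01-02"), ("suspect", "2024-01-05")],
)

def find_suspect_rows(conn, since: str):
    """Every row the suspect logic may have touched since it went live."""
    cur = conn.execute(
        "SELECT id, status, updated_at FROM x_objects "
        "WHERE updated_at >= ? ORDER BY updated_at",
        (since,),
    )
    return cur.fetchall()

# The faulty change shipped on the 1st; pull everything from then until now.
print(find_suspect_rows(conn, "2024-01-01"))
```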

2. Screen the data

With step 1 as the foundation, this step is about finding the problem data: comparing each record against its calibrated (known-good) value and filtering out the mismatches. Typically we add a temporary processing interface to do this validation in real time.

It should be pointed out that the screening approach differs with the scale of the data.

If the data volume is in the tens of thousands, a single sequential scan is fine; nothing extra is needed.

If it is in the hundreds of thousands, you may need to add batching to the processing interface.

If it is in the millions, the script will need multithreading on top of interface batching.

If it is in the tens of millions, temporarily adding a few extra processing nodes can further raise throughput considerably.
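A sketch combining the batching and multithreading tiers above; the local calibration function here stands in for the temporary validation interface (in practice each batch would call that interface, which is why threads help), and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def expected_value(record: dict) -> int:
    # Hypothetical calibration: recompute what the value *should* be
    # from the source of truth (stands in for the validation interface).
    return record["a"] + record["b"]

def screen_batch(batch: list[dict]) -> list[int]:
    """Return ids of records whose stored value disagrees with the calibrated one."""
    return [r["id"] for r in batch if r["value"] != expected_value(r)]

def screen_all(records: list[dict], batch_size: int = 1000, workers: int = 8) -> list[int]:
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    dirty: list[int] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for ids in pool.map(screen_batch, batches):  # batching + threads, per the ladder above
            dirty.extend(ids)
    return dirty

# e.g. one record whose stored value drifted:
records = [{"id": i, "a": i, "b": 1, "value": i + 1} for i in range(10_000)]
records[1234]["value"] = 0
print(screen_all(records))  # -> [1234]
```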

3. Handle dirty data

Handling dirty data is never a routine, mass-production affair; verification and validation are required before any processing.

Pick some test data, or a small sample of the dirty data, and validate the processing result on it. Only after it fully checks out should you run the full cleanup.
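A minimal sketch of this sample-first discipline, with a hypothetical repair rule and field names:

```python
import random

def fix_record(record: dict) -> dict:
    """Hypothetical repair: backfill a missing status from another attribute."""
    fixed = dict(record)
    if fixed.get("status") is None:
        fixed["status"] = "active" if fixed.get("last_login") else "inactive"
    return fixed

def validate_on_sample(dirty: list[dict], sample_size: int = 20) -> bool:
    """Run the repair on a small random sample and verify it before touching everything."""
    sample = random.sample(dirty, min(sample_size, len(dirty)))
    return all(fix_record(r)["status"] is not None for r in sample)

dirty = [{"id": i, "status": None, "last_login": i % 2 == 0} for i in range(500)]
if validate_on_sample(dirty):
    repaired = [fix_record(r) for r in dirty]  # full processing only after the sample passes
```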