1 Why split?

Let’s start with a dialogue.

From the dialogue above, we can see the reasons for the split:

1) Serious coupling between applications. Different applications in the system cannot reach each other, so the same function has been reimplemented in multiple applications. The consequence is that changing one function means changing every application in the system at the same time. This situation mostly exists in systems with a long history: for various reasons, each application has formed its own small closed loop.

2) Poor business extensibility. The data model was designed to support only one type of business. When a new type of business arrives, the code has to be written all over again; projects get delayed and the speed of onboarding new business suffers greatly.

3) The code is old and difficult to maintain. All kinds of ad-hoc if-else branches and hard-coded logic are scattered in every corner of the application; pitfalls lurk everywhere, and both development and maintenance are feared.

4) Poor system scalability. Support for the existing business is already shaky; neither the applications nor the DB can bear the pressure brought by rapid business growth.

5) A vicious circle in which new pits keep being dug. If you don't change it, the system will eventually die.

2 What to prepare before splitting?

2.1 A multi-dimensional grasp of business complexity

A perennial question: what is the relationship between the system and the business?

The ideal situation is the first kind of relationship (vehicle and driver): when the business finds the system unsuitable, it can immediately swap in a new one. The reality is more like the relationship between a pacemaker and its patient: you cannot simply replace it. The more services a system connects, the tighter the coupling. If you jump in without truly grasping the complexity of the business, the split may take the heart out along with it.

How do you master business complexity? It takes thinking and practice along multiple dimensions.

The first dimension is technical. Through discussions with PDs and developers, become familiar with the domain models of the existing applications and their strengths and weaknesses. Such discussions give only a general picture; the details, such as code and architecture, must be mastered through hands-on work on requirements, refactoring, and optimization.

Beyond familiarity with each application, you must also think at the system level. We want to build platform-style products, so the most important and the hardest task is centralized control: breaking each application's small business closed loop and unifying them. Reaching this resolve requires consensus among business development, product, and every team involved; see the microservices idea of "organizing resources around business or customer needs."

In addition, keep communicating with the business side about functionality and planning, to make sure the applications still meet usage and expansion requirements after the split, and to win their support.

2.2 Defining boundaries. Principles: high cohesion, low coupling, single responsibility!

After grasping the complexity of the business, define the service boundary of each application. What does a good boundary look like? Applications like the Calabash Brothers are good!

For example, the abilities of the Calabash Brothers are independent of one another and follow the single-responsibility principle: Water Baby can only spray water, Fire Baby can only breathe fire, and Invisible Baby cannot spray water or breathe fire but can turn invisible. More importantly, the brothers can ultimately combine into the Diamond Calabash Baby. In other words, although these applications are independent of each other, they interoperate and together finally form our platform.

Many people get confused here: how do you control the granularity of the split? There is no definitive answer, only a compromise among business scenarios, goals, and schedule. The general rule, though, is to start with a large service boundary rather than splitting too finely, because as the architecture and the business evolve, applications will be split again; it makes sense to let the right things happen at the right time.

2.3 Determine the application objectives after splitting

Once the macro-level map of the application split has been drawn, the split of each specific application must be carried out.

The first thing to determine is the goal of splitting the application at hand. Split-and-optimize work is bottomless: it can go ever deeper with ever fewer results, and then it hurts your morale and the team's. For example, the goal of this phase might be just to separate the DB from the application; redesigning the data model can be phase two.

2.4 Determine the current architecture state, code state, and dependency state of the application to be split, and anticipate possible exceptions.

The cost of thinking before acting is far lower than the cost of fixing problems after acting. What an application split fears most is someone saying halfway through, "Damn, this can't be touched; it was designed that way for a reason; we have to find another way!" The pressure at that moment is easy to imagine. Once the rhythm deviates from the plan, similar problems tend to arrive one after another; colleagues' morale drops, you lose confidence, and the split may fail.

2.5 Keep a little something up your sleeve.

The tip is just one phrase: "Be prepared." Paste it on your desk or your phone. During the concrete implementation that follows, keep asking: are there multiple options for this scheme? Can this complex problem be broken down? Is a contingency plan in place? In practice, application splitting comes down to the word "careful": one more option, one more plan not only raises the probability of success, it also gives you confidence.

2.6 Relax and relieve pressure

Clear your mind and get to work!

3 The practice

3.1 DB split practice

DB splitting is the most complex link in the whole application split. It divides into vertical and horizontal splitting, and we encountered both. Vertical splitting moves tables out of one database into the appropriate databases; for example, if one library contains both message tables and personnel/organization tables, splitting the two into separate databases is more appropriate.

Horizontal splitting: taking the message table as an example, once a single table passes tens of millions of rows, query efficiency drops, and at that point the database and table must be sharded.

3.1.1 Primary-key IDs: connect to a global ID generator

The first thing a DB split requires is using a global ID generator to produce the primary-key IDs of each table. Why?

For example, suppose a table has two fields, id and token, where id is an auto-increment primary key, and we need to shard the table by the token dimension. Continuing to use the auto-increment primary key will cause problems.

In the forward direction (scaling out), auto-increment keys in the new sharded tables remain unique within each table. But consider the scenario where the migration fails: suppose a new record with primary key id 2 has already been inserted into one of the new tables, and a rollback then begins that must merge the data of the two tables back into one (reverse flow). A primary-key collision occurs!

Therefore, before migrating, the auto-increment primary key must be replaced with IDs produced by a globally unique ID generator. There are several ways to generate globally unique IDs:

1) Snowflake: github.com/twitter/sno… (not globally increasing);

2) MySQL auto-increment: keep one dedicated sequence table and issue IDs from its auto-increment column;

3) Some ask: with only one table, how do you ensure high availability? Use two tables (in two different DBs): one produces odd IDs, the other even. Or n tables, each responsible for a different step range (not globally increasing);

4)…

We used TDDL-sequence (MySQL + memory), which reserves ranges of IDs from MySQL and allocates them in memory. Two points to note with it:

1) Change any SQL that sorts by primary-key id ahead of time. Because the IDs are no longer guaranteed to increase, results may come back out of order; such queries can be switched to sort by gmt_create;

2) Primary-key conflicts may be reported: find any insert SQL that still sets the id column explicitly and switch it to the generated IDs.
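To make the segment idea concrete, here is a minimal sketch in the spirit of TDDL-sequence, under assumed table and method names; it illustrates the approach, not TDDL's actual implementation.

```java
/**
 * Sketch of a segment-style global ID generator (in the spirit of
 * TDDL-sequence, not its real code): reserve a block of IDs from a
 * MySQL sequence table in one round trip, then serve them from memory.
 * The sequence-table SQL in the comment is an assumption.
 */
public class SegmentIdGenerator {
    private final int step;  // IDs reserved per DB round trip
    private long current;    // next ID to hand out
    private long max;        // exclusive upper bound of the current segment

    public SegmentIdGenerator(int step) {
        this.step = step;
    }

    public synchronized long nextId() {
        if (current >= max) {
            // e.g. UPDATE id_sequence SET value = value + ? WHERE name = 'msg'
            // then read the new value back, all inside one transaction.
            max = reserveSegmentFromDb(step);
            current = max - step;
        }
        return current++;
    }

    /** Placeholder for the DB round trip described above. */
    protected long reserveSegmentFromDb(int step) {
        throw new UnsupportedOperationException("wire up to your sequence table");
    }
}
```

Because each node holds its own reserved segment, the IDs are globally unique but not globally increasing, which is exactly why the sorting caveat above matters.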

3.1.2 Create a new table & Migrate Data & Binlog synchronization

1) The new table's character set should be utf8mb4, which supports emoji. After the new table is built, do not omit any indexes, or slow SQL may result! Experience shows index omissions happen from time to time; it is recommended to write these points down when planning and then check them off one by one.

2) Use a full-synchronization tool, or write your own job, to do the full migration. The full data migration must avoid peak business hours, and concurrency must be adjusted according to system conditions.

3) Incremental synchronization. After the full migration completes, a binlog-based incremental synchronization tool can catch up the data. For example, Alibaba uses Jingwei internally; other companies may have their own incremental systems, or can use canal/otter: github.com/alibaba/can…

github.com/alibaba/ott…

The binlog position taken at the start of incremental synchronization must be earlier than the start of the full migration; otherwise data will be lost. For example, if the full migration starts at 12:00 and ends at 13:00, the binlog position for incremental synchronization must be earlier than 12:00.

Does starting earlier lead to duplicate records? No! If a delete statement removes 100 records, the binlog does not record one logical delete statement; it records 100 row-level events. And an insert of an already-present record fails on primary-key conflict, so it cannot create a duplicate.
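A sketch of why the overlap is harmless, assuming MySQL and an illustrative message_new table: replay is row-level and idempotent.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/**
 * Sketch of idempotent row-level replay (table and columns are
 * illustrative): re-applying binlog events that the full migration
 * already copied does not create duplicates.
 */
public class BinlogReplayer {
    private final Connection conn;

    public BinlogReplayer(Connection conn) {
        this.conn = conn;
    }

    /** Re-inserting an already-copied row hits the primary key and is skipped. */
    public void applyInsert(long id, String payload) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT IGNORE INTO message_new (id, payload) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }

    /** Deletes arrive as one row-level event per record; re-deleting is a no-op. */
    public void applyDelete(long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM message_new WHERE id = ?")) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }
}
```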

3.1.3 SQL transformation for join queries

Now that the primary keys come from the global ID generator, the new tables and indexes are in place, and the data is being synchronized in real time, can we start the cutover? No!

Consider a very simple join query (illustrative), such as `select * from A join B on A.b_id = B.id`. What happens to this SQL if table B is split into another database? Cross-database joins simply are not supported!

Therefore, before cutting over the database, we had to transform the hundreds of join queries in the system.

How do you transform them?

1) Business avoidance

Technology can be loosely coupled only when the business is loosely coupled, which avoids join SQL in the first place. But this is unrealistic in the short term; it needs time to settle;

2) Global table

Every application's database keeps a redundant copy of the table. Disadvantages: it is as if nothing were split; many scenarios are unrealistic; and table-structure changes become troublesome;

3) Redundant fields

Like an order table that redundantly stores the commodity ID. But here we would need too many redundant fields, and we would have to handle data updates when the source fields change;

4) Memory splicing

4.1) Fetch the other table's data via RPC calls, then assemble it in memory. This suits job-style SQL, or SQL whose RPC query volume is small after the transformation; it does not suit real-time queries over large data volumes. Suppose there are 10,000 IDs: paging the RPC at 100 IDs per call, with 5 ms per call, takes 500 ms in total, and the RT is too high.

4.2) Cache data from another table locally

Suitable for SQL where the data changes rarely, the query volume is large, and the interface demands high performance and stability.
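As an illustration of option 4.1, here is a sketch that replaces the join with a paged RPC fetch plus an in-memory map lookup; all names (ItemRpc, Order, Item) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of in-memory assembly: the former SQL join becomes a paged
 * RPC fetch followed by a map lookup. Names are illustrative.
 */
public class OrderAssembler {
    interface ItemRpc { List<Item> getItemsByIds(List<Long> ids); }
    public static class Order { long itemId; String itemTitle; }
    public static class Item { long id; String title; }

    private static final int PAGE = 100; // IDs per RPC call

    private final ItemRpc itemRpc;

    public OrderAssembler(ItemRpc itemRpc) {
        this.itemRpc = itemRpc;
    }

    public void fillTitles(List<Order> orders) {
        List<Long> ids = new ArrayList<>();
        for (Order o : orders) ids.add(o.itemId);

        // Paged RPC fetch of the "other table"
        Map<Long, Item> byId = new HashMap<>();
        for (int i = 0; i < ids.size(); i += PAGE) {
            List<Long> page = ids.subList(i, Math.min(i + PAGE, ids.size()));
            for (Item it : itemRpc.getItemsByIds(page)) byId.put(it.id, it);
        }

        // The former SQL join, done in memory
        for (Order o : orders) {
            Item it = byId.get(o.itemId);
            if (it != null) o.itemTitle = it.title;
        }
    }
}
```

With 10,000 IDs at 100 per page this is exactly the 100-call, 500 ms arithmetic above: fine for a job, too slow for a latency-sensitive query.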

3.1.4 Design and implementation of the cutover scheme (two schemes)

After the above steps are complete, the real cutover begins. Here we offer two schemes for different scenarios.

A) DB stop-write scheme

Advantages: fast, low cost;

Disadvantages:

1) If you need to roll back, you have to ask the DBA to perform another online stop-write, which is high-risk because it may fall during a business peak;

2) Everything is verified in a single shot, so the probability of failure, and therefore of rollback, is high

For example, when migrating a complex business, a rollback is quite likely, triggered by things such as:

incomplete transformation of join queries;

errors or performance problems in the transformed join queries;

missing indexes causing performance problems;

character-set problems.

In addition, binlog backflow is prone to character-set problems (utf8mb4 to GBK) that make the backflow fail. These binlog synchronization tools guarantee strong eventual consistency: if one record fails to flow back, synchronization gets stuck, the old and new tables fall out of sync, and rollback becomes impossible!

B) Double-write scheme

Step 2: "Turn on the double-write switch, writing old table A first and then new table B." When writing new table B, be sure to try/catch and mark exceptions with a clear tag to ease troubleshooting. After double-writing has run for a short while (for example, half a minute), the binlog synchronization task can be switched off.
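A minimal sketch of this double-write step, with illustrative DAO, Message, and switch names (none from the original system):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Sketch of the double-write step: write old table A synchronously,
 * then best-effort write new table B inside try/catch, tagging
 * failures with a clear marker for troubleshooting.
 */
public class DoubleWriter {
    interface Dao { void insert(Message msg); }
    public static class Message { long id; public long getId() { return id; } }

    private static final Logger log = LoggerFactory.getLogger(DoubleWriter.class);

    private final Dao oldTableA;
    private final Dao newTableB;
    private volatile boolean doubleWriteOn; // flipped by a pushed config switch

    public DoubleWriter(Dao oldTableA, Dao newTableB) {
        this.oldTableA = oldTableA;
        this.newTableB = newTableB;
    }

    public void insert(Message msg) {
        oldTableA.insert(msg);              // old table A first: source of truth
        if (doubleWriteOn) {
            try {
                newTableB.insert(msg);      // then new table B
            } catch (Exception e) {
                // clear marker makes these failures easy to grep and repair
                log.error("DOUBLE_WRITE_FAIL id={}", msg.getId(), e);
            }
        }
    }
}
```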

Advantages:

1) It breaks a complex task into a series of verifiable steps, winning step by step;

2) No downtime for online service, and rollback is easy;

3) Character-set problems have little impact

Disadvantages:

1) The process has many steps and a long cycle;

2) Double-writing increases RT

3.1.5 Get the switches right

Whatever cutover scheme is used, switches are indispensable, and the initial value of each switch must be set to null!

If you set an arbitrary default value, such as "read old table A", and the system is already in the "read new table B" phase, then at the instant the application restarts, the latest "read new table B" value may not have been pushed yet. The application would fall back to the default, read the wrong table, and produce dirty data!
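A sketch of the rule, with hypothetical names: the in-memory value starts as null and the code fails fast until a real value is pushed, rather than guessing a default.

```java
/**
 * Sketch of a null-initial switch: "not yet pushed" is kept
 * distinguishable from every real state.
 */
public class ReadSwitch {
    public enum Target { OLD_TABLE_A, NEW_TABLE_B }

    // Deliberately null at startup: no default to fall back on.
    private volatile Target target = null;

    /** Called by the config-push framework when a value arrives or changes. */
    public void onPush(Target pushed) {
        this.target = pushed;
    }

    public Target required() {
        Target t = target;
        if (t == null) {
            // Fail fast instead of silently reading the wrong table.
            throw new IllegalStateException("read switch not pushed yet");
        }
        return t;
    }
}
```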

3.2 How to ensure consistency after splitting?

In the past, many tables lived in one database, and using transactions was very convenient. Now that they are split apart, how do we ensure consistency?

1) Distributed transactions

Performance is poor; we barely considered it.

2) Message-based compensation (and how can a messaging system itself avoid distributed transactions?)

3) Timed task compensation

This is used the most. It achieves eventual consistency and comes in two forms: compensating by adding missing data, and compensating by deleting extra data.
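A sketch of the add-missing-data form, with all store and row names invented for illustration:

```java
import java.util.List;

/**
 * Sketch of scheduled-task compensation: scan a recent window on the
 * source side and re-upsert rows the target is missing or has stale,
 * achieving eventual consistency without a distributed transaction.
 * Deletion compensation would scan in the other direction.
 */
public class CompensationJob {
    public interface Store {
        List<Row> changedSince(long epochMillis);
        Row get(long id);
        void upsert(Row row);
    }
    public static class Row {
        long id;
        long version; // e.g. gmt_modified used as a comparable version
    }

    private final Store source; // system of record
    private final Store target; // the split-out copy

    public CompensationJob(Store source, Store target) {
        this.source = source;
        this.target = target;
    }

    /** Run periodically, e.g. every minute with a 10-minute window. */
    public void runOnce(long windowMillis) {
        long since = System.currentTimeMillis() - windowMillis;
        for (Row s : source.changedSince(since)) {
            Row t = target.get(s.id);
            if (t == null || t.version < s.version) {
                target.upsert(s); // add missing rows, refresh stale ones
            }
        }
    }
}
```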

3.3 How to ensure stability after application splitting?

In one sentence: suspect third parties, guard against users, and do your own job well!



1) Suspect third parties

A) Program defensively; prepare a variety of degradation strategies;

  • For example, active/standby cache, push/pull combination, local cache…

B) Follow the fail-fast principle: always set timeouts and catch exceptions (see the sketch after this list);

C) Turn strong dependencies into weak dependencies; make side-branch logic asynchronous

  • After the side-branch logic of one core application was made asynchronous, response time dropped by almost 1/3, and when the middleware and applications behind it jittered, the core link stayed normal.

D) Protect the third party appropriately; choose retry mechanisms carefully
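To make items A) and B) concrete, here is a sketch with a hypothetical remote call: give the third party a hard time budget and degrade instead of hanging the core link.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Sketch of "suspect the third party": a hard timeout plus a
 * degradation value, so a slow dependency cannot drag down the core
 * link. remoteCall is an illustrative stand-in for a third-party RPC.
 */
public class GuardedClient {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public String fetch(String key, String fallback) {
        Future<String> f = pool.submit(() -> remoteCall(key));
        try {
            return f.get(200, TimeUnit.MILLISECONDS); // fail fast: 200 ms budget
        } catch (TimeoutException e) {
            f.cancel(true);
            return fallback; // degradation: serve a default or cached value
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (Exception e) {
            return fallback; // execution failure: catch, never propagate raw
        }
    }

    private String remoteCall(String key) {
        throw new UnsupportedOperationException("illustrative third-party call");
    }
}
```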

2) Guard against users

A) Design a good interface to avoid misuse

  • Follow the principle of exposing as few interfaces as possible. Many developers expose a pile of interfaces after building a new application; because nobody uses them, they go unmaintained and become traps for the future. I have heard more than one conversation like: "You're calling my interface like that? It was written casually; its performance is terrible."
  • Don't make users do what the interface can do for them. For example, if you only expose a getMsgById interface, callers who need a batch will simply call it in a for loop; provide a getMsgListByIdList interface and the problem goes away.
  • Avoid long-running interfaces, especially in older systems, where one interface may hide a select-from-DB-in-a-for-loop scenario.

B) Capacity constraints

  • Apply flow control by application priority; for example, core applications must get a higher quota than non-core applications (see the sketch after this list).
  • Control capacity at the business level, not only the system level; for example, some SaaS systems limit a tenant to at most 10,000 users.
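A sketch of priority-based flow control using guava's RateLimiter; the caller names and quotas are invented for illustration.

```java
import com.google.common.collect.ImmutableMap;
import com.google.common.util.concurrent.RateLimiter;
import java.util.Map;

/**
 * Sketch of flow control by application priority: core callers get a
 * larger quota, and excess traffic is rejected rather than queued.
 */
public class QuotaGuard {
    private final Map<String, RateLimiter> limiters = ImmutableMap.of(
            "core-app", RateLimiter.create(5000),      // 5000 permits/second
            "non-core-app", RateLimiter.create(500));  // 500 permits/second

    public boolean tryPass(String caller) {
        RateLimiter limiter =
                limiters.getOrDefault(caller, limiters.get("non-core-app"));
        return limiter.tryAcquire(); // false -> flow-control the request
    }
}
```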

3) Do your own job well

A) Single responsibility

B) Clean up historical pits promptly

  • For example, during the transformation we found a pit left a year earlier; after removing it, the CPU usage of the whole cluster dropped by 1/3

C) SOPs for operations

  • To be honest, when a problem occurs online without a prepared plan, no amount of scrambling beats the clock. I once hit a DB failure that produced dirty data and had to write cleanup code on the spot; it took so long that I could only watch the incident escalate. After that experience, we mapped out the various dirty-data scenarios and launched three dirty-data cleanup jobs to cover unexpected failures. Now, whenever a dirty-data incident occurs, we trigger those three jobs immediately: recover first, investigate afterwards.

D) Predictable resource use

  • Know the CPU, memory, network, and disk usage of your application
    • Regular-expression matching consumes CPU
    • Optimize, degrade, or take offline performance-hungry jobs (loops of RPC or SQL calls)
    • Optimize, degrade, and rate-limit slow SQL
    • Keep Tair/Redis and DB usage predictable
    • Example: Tair and DB (below)

For example: an interface similar to a flash-sale function has very high QPS. Requests go to Tair first and fall back to the DB on a miss. When requests surge, they can even trigger Tair/Redis rate limiting; and since the cache holds no data at first, the requests penetrate to the DB and crush it.

The core issue here is that resource usage at the Tair/Redis layer is unpredictable, because it depends on the QPS of the interface. How do you make the requests predictable?

If we add another layer of local cache (guava, for example, with a 1-second expiry) and ensure that only one request per key goes back to the source, then Tair/Redis resource usage becomes predictable: with 500 clients, at most 500 requests for a given key can reach Tair/Redis at one moment, and the same reasoning carries through to the DB.

Extending the arithmetic: with 500 clients, at most 500 requests for one key reach the DB at a given moment; with 10 such keys, at most 5,000. A scheduled program can then continuously flush data from the DB into the cache, turning that uncontrollable 5,000 QPS of DB access into a controllable single-digit QPS. A sketch follows.
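Here is a sketch of that local-cache layer using guava; remoteGet stands in for the Tair/Redis lookup and is an assumption, not an API from the original system.

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the extra local-cache layer: a 1-second guava cache in
 * front of Tair/Redis. LoadingCache guarantees that, per process, only
 * one load per key runs at a time, which is what bounds back-to-source
 * traffic to one request per key per client.
 */
public class HotKeyCache {
    private final LoadingCache<String, String> local =
        CacheBuilder.newBuilder()
            .expireAfterWrite(1, TimeUnit.SECONDS) // stale by at most 1 s
            .maximumSize(10_000)
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) {
                    // One back-to-source call per key per JVM: with 500
                    // clients, at most 500 concurrent requests per key
                    // can reach Tair/Redis (and, on a miss, the DB).
                    return remoteGet(key);
                }
            });

    public String get(String key) {
        return local.getUnchecked(key);
    }

    private String remoteGet(String key) {
        throw new UnsupportedOperationException("wire up to Tair/Redis");
    }
}
```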

4 Summary

1) Prepare for pressure!

2) Break complex problems into multiple steps, each of which can be tested and rolled back!

This is the most valuable practical lesson in application splitting!

3) Murphy's Law: whatever you worry about will happen, and it will happen fast, so have your SOPs (standard operating procedures) ready!

One Friday at dinner with colleagues, we discussed the risk in a certain function and agreed to fix it the following week. The function broke when I got to work on Monday. People used to say small-probability events never happen, but no matter how small the probability, it matters: say p = 0.00001%; in an internet environment, with a large enough number of requests, the small-probability event will actually occur.

4) Borrow the false to cultivate the true

The term sounds a bit mystical. As the name suggests, it means using one thing to cultivate another ability: the former is called the "false," the latter the "true." In any organization, chances to overhaul a core system are rare, so once you take on the responsibility, go for it without hesitation! Don't be intimidated by the twists and turns of the process; the tempering of your mind is the true gain.