NetEase Qiyu service governance practice

As we all know, business architecture evolves gradually. Architecture keeps changing as the business and the organization evolve, and this change usually shows up in how business domains are divided. The adjustment is an ongoing process, and in practice it usually means splitting first and governing afterward. This article focuses on the service and module boundary problems that arise as the business architecture evolves, and on the practices we used to address them.

1. Service architecture evolution

Because Qiyu itself is highly complex, a microservice architecture was adopted from the start of system design. As the organization and the business developed, this architecture changed significantly.

Fundamentally, these changes amount to moving from “splitting by function” to “splitting by business domain”.

Split by function

In the early days, everyone was on one team, and responsibilities were defined by function. This approach is simple and intuitive, and it is consistent with the “single responsibility” principle.

At the same time, because programming was data-oriented and procedural (each service assembled whatever data it needed by itself, and common logic was extracted and reused through shared Jar packages), the degree of coupling between services was low.

Single responsibility, high cohesion, and low coupling allowed Qiyu to iterate on versions and features very quickly in the early stage of the business.

Split by business domain

As the business kept developing, Qiyu gradually formed several large business lines that are sold independently, along with some relatively small but strongly independent supporting businesses.

At this stage, the early services divided by function clearly could no longer meet the needs of the growing organization. The most typical scenario is that every business group modifies services in the basic service domain while developing its own features.

At this point, although the “single responsibility” principle is preserved, the “high cohesion, low coupling” principle is broken. The result is heavy code coupling, unreasonable service dependencies, and release dependencies. These problems hurt the stability and maintainability of online systems and slow down R&D.

To solve these problems, we started the Qiyu service governance project.

2. Service classification

Before we can judge whether a coupling or dependency is reasonable, we need to grade the services. Without grading, technical optimizations such as service prioritization and dependency inversion would have no starting point, and resource and schedule planning would have no basis.

The idea of distinguishing core modules from non-core modules has a long history. The importance of services can be graded along the same lines.

In Qiyu, we define the service levels as follows (note that middleware and database infrastructure are not included here):

P0: System-level basic services, where an outage causes widespread, customer-perceivable service exceptions (management and querying of underlying shared data, etc.)
P1: Basic services and core functions of core business. In case of downtime, the main process of a core business is unavailable
P2: Non-core business applications and core business non-core functions (data reports, system notifications, etc.)
P3: Internal support business (operation background, operation and maintenance background, etc.)

After defining the service hierarchy, we have the following basic principles:

Lower-level services cannot directly invoke upper-level services
The availability of lower-level services must not depend on upper-level services
Maintain as much logical separation between services at the same level as possible
The underlying services provide only basic capabilities and keep the model stable
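
The first principle lends itself to an automated check. The following is a minimal sketch, assuming a hypothetical table of service levels and a hypothetical list of direct calls; the service names are illustrative and are not Qiyu's real services.

```python
# Hypothetical service levels: 0 = P0 (lowest, most basic) ... 3 = P3.
LEVELS = {
    "enterprise-info": 0,
    "session-core": 1,
    "report": 2,
    "ops-console": 3,
}

# Hypothetical direct calls, written as (caller, callee).
CALLS = [
    ("session-core", "enterprise-info"),  # P1 -> P0: allowed
    ("report", "session-core"),           # P2 -> P1: allowed
    ("enterprise-info", "report"),        # P0 -> P2: reverse dependency
]


def find_reverse_dependencies(levels, calls):
    """Return the calls in which a lower-level service invokes an upper-level one."""
    return [
        (caller, callee)
        for caller, callee in calls
        if levels[caller] < levels[callee]
    ]


violations = find_reverse_dependencies(LEVELS, CALLS)
for caller, callee in violations:
    print(f"P{LEVELS[caller]} {caller} must not call P{LEVELS[callee]} {callee}")
```

A check of this kind could run against a call graph exported from tracing or from service registration data, though where that data comes from is outside the scope of this sketch.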

3. Identifying boundary issues

On the basis of service classification and module division, we identified the following problems in daily development:

Code coupling: while moving from splitting by function to splitting by business domain, we passed through an intermediate phase in which some services hosted logic from several business domains. Although each such service was assigned to one business group, the code coupling remained, so the Owner had to modify the code to meet the needs of other groups. Code coupling causes the following problems:

The Owner is not in full control of their own code and schedule; responding to other groups' needs may disrupt the Owner's own plan.
When the Owner cannot fit the work into the schedule, a non-owner develops it to save time, and unfamiliarity with the code leads to defects.
Going live involves dependencies, which raises questions of release permissions and release order.

Unreasonable dependencies, which fall into three kinds (reverse dependencies, circular dependencies, and unreasonable strong and weak dependencies):

Reverse dependency: a lower-level service depends on an upper-level service. The call direction is inverted, so the stability of the lower-level service is affected by the upper-level service.
Circular dependency: A depends on B and B depends on A; it may also run through intermediate services, as in A->B->C->A. This can make the release order impossible to control.
Unreasonable strong and weak dependencies: the dependency is weak at the business level but strong at the service-invocation level. That is, the failure of the service should not affect core functions, yet in practice the core functions become unavailable once it fails.
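
Circular dependencies can also be found mechanically. The sketch below uses a depth-first search over a hypothetical dependency graph; the service names A to D are placeholders, and this is an illustration rather than the tooling actually used in Qiyu.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of services, or None if acyclic.

    graph maps a service name to the list of services it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the loop is closed
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical graph containing the loop A -> B -> C -> A.
dependencies = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
print(find_cycle(dependencies))
```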

In governing the boundaries of Qiyu's basic services, we used the following technical means to optimize services. Optimizing a single scenario may require combining several of them.

Let’s look at these techniques in the context of concrete scenarios.

4. Boundary governance practices

Splitting

Splitting falls into the following cases:

If there is no shared code, the functions generally belong independently to each business party and can be split out directly.
If there is shared code, there are two cases: the shared part is a basic capability, or the shared part is business logic:
- Shared basic capabilities can be extracted on their own as Jar packages or as separate services
- If the shared part is coupled business logic:
If the coupling lies in the underlying model and cannot be broken down, this is a sign of a problem with the business domain division;
If the coupling exists only for display purposes, it can be handled by an aggregation service, which does not itself affect the business domain division;

Early on, all the page interfaces were hosted in a single service, so that application had to be classified as P0. Most of these interfaces serve settings and data viewing inside individual businesses, so they share no code and can be split out directly.

In addition, all pages rely on a set of basic data, such as enterprise information, customer service information, and permission information. This is a case of global reliance on a basic capability for display purposes, so we moved the query of basic page data into a separate service. Since all pages still rely on this data, the service remains P0.

This split turns a patchwork P0 into a single, simple, stable P0 plus a series of P1 and P2 services.

Load on demand + weak dependency degradation

A scenario that relies on multiple business parties usually has both strong and weak dependencies.

Weak dependency: scenario A relies on service B, but A is not strongly related to B. That is, if B is unavailable, A’s main process can still run.
Strong dependency: scenario A relies on service B, and A is strongly related to B. That is, if B is unavailable, A’s main process cannot complete.

For strong dependencies, availability must be guaranteed, and loading on demand keeps unnecessary risk to a minimum. Weak dependencies are allowed to be unavailable, but to avoid unfriendly prompts when that happens, a degradation scheme must be provided.
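
The difference between the two kinds of dependency can be sketched in a few lines. This is a minimal illustration, assuming hypothetical loader functions and a hypothetical `fallbacks` table that marks which items are weak dependencies; a production system would normally delegate this to a circuit-breaker library.

```python
def load_page_data(loaders, fallbacks):
    """Load each data item; weak dependencies degrade to a fallback on failure.

    loaders:   name -> zero-argument callable that fetches the data
    fallbacks: name -> default value; only weak dependencies have an entry
    """
    result = {}
    for name, loader in loaders.items():
        try:
            result[name] = loader()
        except Exception:
            if name not in fallbacks:
                raise  # strong dependency: the main flow cannot continue
            result[name] = fallbacks[name]  # weak dependency: degrade quietly
    return result


def load_enterprise():
    # Hypothetical strong dependency that succeeds.
    return {"id": 1, "name": "Example Corp"}


def load_announcements():
    # Hypothetical weak dependency that is currently unavailable.
    raise TimeoutError("announcement service unavailable")


page = load_page_data(
    loaders={"enterprise": load_enterprise, "announcements": load_announcements},
    fallbacks={"announcements": []},
)
print(page)
```

Here the page still renders with an empty announcement list, whereas a failure in `load_enterprise` would propagate because no fallback is declared for it.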

In the previous example, before the split, all the basic data that pages relied on was loaded together. A failure to load one item could cause the whole response to fail. Although no business needs all of the basic data, each was in effect strongly dependent on all of it.

Because so much of the data is shared, however, writing a separate data-assembly interface for each page is not cost-effective. So we introduced GraphQL in the new data loading service to solve this problem.

GraphQL requires the data to be broken down into basic units. The client assembles a query statement that names both the atomic data items and the final desired data shape, and sends it to the server.

Compared with writing a separate data interface for each page to support on-demand loading, this has many advantages:

Atomic data queries are reusable
Data is loaded on demand
The data format is flexible and adjustable
It is easy to extend
It provides rich data manipulation and assembly capabilities
It works across front-end technology stacks
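
The on-demand idea behind these advantages can be illustrated without a GraphQL library. The sketch below is not GraphQL itself: it assumes hypothetical resolver functions for three atomic data items and a plain dictionary in place of a query statement, and it only shows that resolvers for unrequested items never run.

```python
calls = []  # records which resolvers actually ran


def resolve_enterprise(ctx):
    calls.append("enterprise")
    return {"id": ctx["enterprise_id"], "name": "Example Corp"}


def resolve_agent(ctx):
    calls.append("agent")
    return {"id": ctx["agent_id"], "nickname": "Alice"}


def resolve_permissions(ctx):
    calls.append("permissions")
    return ["session:read", "ticket:write"]


# Hypothetical registry of atomic data items and their resolvers.
RESOLVERS = {
    "enterprise": resolve_enterprise,
    "agent": resolve_agent,
    "permissions": resolve_permissions,
}


def query(selection, ctx):
    """Run only the resolvers named in the selection.

    selection maps an atomic data item to the list of fields the page wants,
    or to None to take the item as it is.
    """
    result = {}
    for item, fields in selection.items():
        data = RESOLVERS[item](ctx)
        if fields is not None:
            data = {field: data[field] for field in fields}
        result[item] = data
    return result


ctx = {"enterprise_id": 1, "agent_id": 42}
page_data = query({"enterprise": ["name"], "permissions": None}, ctx)
print(page_data)
print(calls)  # the agent resolver was never invoked
```

A real GraphQL server adds a typed schema, a query language, and nested selections on top of this basic shape.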

I won’t go into GraphQL in depth here; if you’re interested, see Graphql.org/.

Degradation is usually implemented with Hystrix or Sentinel, which I won’t cover in detail here either.

Boundary changes

There are many ways to divide business domains, but sometimes the division that best fits the business is not the most practical one.

Sometimes boundaries have to be redrawn for the sake of code maintainability and online stability. Here are a few guidelines:

Reduce the number of P0 applications
Business logic with a stable model, a high call volume, and global impact can be put together
After the adjustment, model boundaries must have a clear business meaning that is easy to understand and maintain, rather than being forced together.

Qiyu’s “enterprise information management” and “orders and service packages” had been split into two services. In daily work, however, we found that enterprise management has a high call volume and a stable model, whereas the order logic is complex and changes often, yet most of its calls are low-volume; only “service package query” has a high call volume and a stable model.

We migrated the “service package query” function into “enterprise information management” and conceptually renamed that module “enterprise runtime management”. The boundary change let us turn two P0 services into one P0 and one P1, while ensuring that complex, volatile logic does not affect the stable underlying service.

Domain model optimization

If the domain model is coupled to data from other domains, then the code is bound to be strongly coupled as well. However, as long as the business domain division can be settled, decoupling can be achieved by optimizing the domain model.

The usual practice is to store the associated data of other domains in a KV table and update that table asynchronously through an event-driven process, so that the current domain model can ignore the business meaning of the data.

Because data is stored without regard to its business meaning, the underlying model can stay generic and stable, which eliminates reverse dependencies and code coupling altogether.

Besides basic user information, Qiyu’s User table also stores business-side data such as “last contact time”. This coupling clearly pollutes the User model, yet business functions need the information when they display user details.

If User has to display “last contact time” today, it may well have to display “last work order time”, “last text message time”, and so on later. Changing the code for each such requirement creates code coupling.

At the model level, we added a UserInfoExt table that stores extended information as key-value pairs. Business systems update the data by actively writing K-V values. This keeps the User model layer stable, straightens out the call relationships, and fully decouples the code layer.
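
The key-value extension can be sketched as follows. This is a minimal in-memory illustration: `UserInfoExtStore`, the event shape, and the key names are assumptions made for the example, not Qiyu's actual schema.

```python
class UserInfoExtStore:
    """In-memory stand-in for a key-value extension table keyed by user."""

    def __init__(self):
        self._rows = {}  # user_id -> {key: value}

    def put(self, user_id, key, value):
        self._rows.setdefault(user_id, {})[key] = value

    def get_all(self, user_id):
        return dict(self._rows.get(user_id, {}))


store = UserInfoExtStore()


def on_business_event(event):
    """Store the pair carried by the event without interpreting its meaning."""
    store.put(event["user_id"], event["key"], event["value"])


def user_view(user, ext_store):
    """Combine the stable core model with whatever extension data exists."""
    return {**user, "ext": ext_store.get_all(user["id"])}


# Business domains publish updates; the User domain only stores them.
on_business_event({"user_id": 7, "key": "last_contact_time", "value": "2020-01-01T10:00:00"})
on_business_event({"user_id": 7, "key": "last_work_order_time", "value": "2020-01-02T09:30:00"})

view = user_view({"id": 7, "name": "Example User"}, store)
print(view)
```

Adding a new display item such as “last text message time” then needs only a new key written by the owning business, with no change to the User model or its code.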

Capability push-up

This continues from domain model optimization.

Optimizing a domain model is expensive, and in practice the resources for such a refactoring may not be available. In particular, changes to the underlying model of a P0 application are often risky.

When low-level data and upper-level data need to be displayed together, the association logic does not have to live in the underlying model. The assembly process can be pushed up to an upper-level business system, which removes the data coupling from the underlying model.

Continuing the example above: User is one of the most central services in the whole system, so the risk of reworking it is very high. In the end we did not adopt the domain model optimization scheme. Instead we pushed the capability up, from a P0 application to a P1 application.

The User core model drops “last contact time”, and the process of retrieving it is pushed up to the User-Gateway service. Although User-Gateway belongs to the basic service domain, it is only responsible for providing the User data that pages need. Its failure does not affect data flows such as low-level sessions and work orders, so it is a P1 service.
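
The shape of this change can be sketched as three small classes. The class names, method names, and the choice of a session service as the source of “last contact time” are assumptions for illustration only.

```python
class UserCore:
    """P0: owns only the stable User model and calls nothing above it."""

    def __init__(self, users):
        self._users = users

    def get_user(self, user_id):
        return dict(self._users[user_id])


class SessionService:
    """Hypothetical P1 business service that knows the last contact time."""

    def __init__(self, last_contact):
        self._last_contact = last_contact

    def last_contact_time(self, user_id):
        return self._last_contact.get(user_id)


class UserGateway:
    """P1: assembles the user data that pages need from lower-level sources."""

    def __init__(self, user_core, session_service):
        self._user_core = user_core
        self._session_service = session_service

    def get_user_for_page(self, user_id):
        user = self._user_core.get_user(user_id)
        # The association happens here, not inside the P0 core model.
        user["last_contact_time"] = self._session_service.last_contact_time(user_id)
        return user


core = UserCore({7: {"id": 7, "name": "Example User"}})
sessions = SessionService({7: "2020-01-01T10:00:00"})
gateway = UserGateway(core, sessions)
print(gateway.get_user_for_page(7))
```

The core service stays unaware of the business field, and only the gateway, which may fail without affecting sessions or work orders, depends on both sides.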

Event-driven

When a business process is coupled to processes in other business domains, there are two possibilities:

It strongly depends on the results of the upper-level business process;
It does not depend on the results of the upper-level business.

If it does not depend on the results of the upper-level business, the process and its core nodes can be broadcast as life cycle events so that the upper-level business can complete its follow-up steps independently.

To ensure that life cycle events are consumed reliably and trigger the business logic, the middleware layer must guarantee message delivery and idempotency, and a compensation mechanism must be available to re-execute messages in the extreme case that execution fails.

Initially, Qiyu’s enterprise registration was a simple serial process: if any intermediate setup step failed, the enterprise could not finish initialization, and registration failed.

Such a failure is not worth its cost. Even if one business is not initialized properly, the customer can still try the other businesses, and we do not lose a potential customer entirely.

By modeling the life cycle of enterprise registration, we broadcast an “enterprise created” event and let events drive the rest of the registration process. The benefits are:

Initializing a new service does not touch the existing official website code, which decouples the code;
An initialization failure in one service does not affect the overall registration process, which removes the strong dependence on any single service;
As a P1 service, the official website directly calls only the P0 enterprise and customer service management services, so there is no reverse dependency.
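
The registration flow above can be sketched with a tiny in-process event bus. This is an illustration only: in practice delivery, idempotency, and compensation are handled by the message middleware, and the `EventBus` class, event fields, and subscriber names here are all hypothetical.

```python
class EventBus:
    """In-process stand-in for message middleware."""

    def __init__(self):
        self._handlers = {}  # event name -> list of (subscriber, handler)
        self._seen = set()   # (event id, subscriber) pairs already handled
        self.failed = []     # (event, subscriber) pairs left for compensation

    def subscribe(self, event_name, subscriber, handler):
        self._handlers.setdefault(event_name, []).append((subscriber, handler))

    def publish(self, event):
        for subscriber, handler in self._handlers.get(event["name"], []):
            key = (event["id"], subscriber)
            if key in self._seen:
                continue  # idempotent: a redelivered event is not handled twice
            try:
                handler(event)
                self._seen.add(key)
            except Exception:
                # One subscriber failing does not stop the others.
                self.failed.append((event, subscriber))


initialized = []


def init_online_chat(event):
    initialized.append(("online-chat", event["enterprise_id"]))


def init_call_center(event):
    raise RuntimeError("call center initialization failed")


bus = EventBus()
bus.subscribe("enterprise.created", "online-chat", init_online_chat)
bus.subscribe("enterprise.created", "call-center", init_call_center)

event = {"id": "evt-1", "name": "enterprise.created", "enterprise_id": 1001}
bus.publish(event)
bus.publish(event)  # simulated redelivery of the same event

print(initialized)
print([subscriber for _, subscriber in bus.failed])
```

The online chat business is initialized exactly once despite the redelivery, while the failed call center initialization is recorded for a later retry instead of failing the registration.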

Asynchronous invocation

This continues from the event-driven approach.

When an underlying step in a business process depends on the results of an upper-level business, there are two ways to handle the dependency:

Rework the domain model to remove the strong dependency. This is more thorough but often costly.
Call directly and depend on the result. This creates a reverse dependency in the call.

Asynchronous invocation is designed to solve the code coupling and reverse dependency that direct invocation creates. The results of the upper-level business are obtained in an event-driven manner rather than by calling it and reading the result directly. Unlike ordinary message-driven processing, an asynchronous invocation depends on the returned result; unlike a direct call, it does not depend on the callee’s interface.

In Qiyu, deleting a customer service agent requires checking whether the agent still has unfinished calls, conversations, work orders, and so on. Deleting an agent is part of basic customer service management, so it lives in a P0 service. To validate the business information, however, P1 services must be invoked.

This reverse dependency could be removed by reworking the domain model so that each business tells customer service management whether an agent may be deleted and keeps that information up to date as its own process advances. But this approach intrudes into every business party’s core process and changes the current business logic, which is too costly.

In Qiyu, we designed a component that implements asynchronous invocation. Suppose the deletion depends on the results of businesses A, B, and C. A, B, and C each register their interest in the deletion event with the registry. Each deletion run fetches the watch list and then broadcasts a delete message; the businesses receive it and return their results to the registry. The deletion relies on the registry’s notification mechanism to collect the results and decide whether to complete.

Because the procedure is asynchronous, it involves waiting with a timeout. There are two modes: in the strongly dependent mode, a timeout means the operation fails; in the weakly dependent mode, the operation can still succeed after a timeout.

The benefits of this are:

The coupling at the code layer is removed. If a newly added business party D needs to take part in the check, it only has to register its interest in the deletion event on the D service.
This transformation has no impact on existing business logic and core processes, and the scope of change is limited.
Nothing is called directly, so there is no reverse dependency.
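
The registry and its two timeout modes can be sketched as follows. This is not the component described above: a thread pool stands in for message broadcast and reply, and `DeletionRegistry`, `ask_all`, and the business names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


class DeletionRegistry:
    """Hypothetical registry where businesses register interest in deletions."""

    def __init__(self):
        self._watchers = {}  # business name -> callable(agent_id) -> bool

    def register(self, business, can_delete):
        self._watchers[business] = can_delete

    def ask_all(self, agent_id, timeout, strong=True):
        """Broadcast the deletion request and collect replies until timeout."""
        pool = ThreadPoolExecutor(max_workers=max(1, len(self._watchers)))
        futures = [pool.submit(check, agent_id) for check in self._watchers.values()]
        done, not_done = wait(futures, timeout=timeout)
        pool.shutdown(wait=False)  # do not block on replies that are still pending
        if not_done and strong:
            return False  # strong mode: a missing reply fails the operation
        # Weak mode ignores missing replies; any received veto still blocks.
        return all(future.result() for future in done)


def slow_call_check(agent_id):
    time.sleep(0.3)  # simulates a business that replies late
    return True


registry = DeletionRegistry()
registry.register("session", lambda agent_id: True)
registry.register("ticket", lambda agent_id: True)
registry.register("call", slow_call_check)

strong_ok = registry.ask_all(42, timeout=0.05, strong=True)
weak_ok = registry.ask_all(42, timeout=0.05, strong=False)
patient_ok = registry.ask_all(42, timeout=1.0, strong=True)
print(strong_ok, weak_ok, patient_ok)
```

The deletion code only knows the registry, so adding a business D means one more `register` call on D's side and no change to the deletion logic.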

Anti-corruption layer

This continues from the event-driven approach and asynchronous invocation.

Ideally, the business side responds to the events emitted by the core system and completes its own business process.

In reality, the other business parties are not under our team’s control, so their development schedule cannot be controlled and their motivation to do the work is low. For our team’s optimization to go ahead, the code that depends on the business side has to be packaged together and moved out of the P0 service, so that it does not contaminate the core model and the core process.

Decoupling enterprise registration requires the business side to respond to the life cycle events of enterprise registration. Decoupling agent deletion requires the business side to integrate the asynchronous invocation component.

This made our development dependent on other business teams. To avoid that, we added a separate anti-corruption layer application and moved into it the logic that responds to life cycle events and integrates the asynchronous invocation component.

As a result, our development finished on time and smoothly. The coupling to the business systems is confined to a single application, which limits the scope of the corruption and makes it much easier for business parties to migrate later.

5. Summary

Splitting, on-demand loading, weak dependency degradation, boundary changes, and the event-driven approach are the starting point of governance. As governance goes deeper, many problems cannot be solved simply by splitting and moving boundaries. Only by reworking the domain model can services be fully untangled.

However, reworking a model is often very expensive. In reality, we have to use the anti-corruption layer, capability push-up, asynchronous invocation, and similar means to make sure the rework can actually proceed instead of getting stuck in endless scheduling and testing.

Even after the boundary relationships are sorted out, upper-level services may still affect the stability of lower-level services. The main scenario is system pressure caused by uncontrolled calls. This falls under circuit breaking, rate limiting, and degradation, which will not be discussed in detail here.
