Introduction: There are many discussions about service governance in the industry, but no unified interpretation. In practice, many practitioners simply equate service governance with Service Mesh. The microservice governance of Baidu's search engine and recommendation engine has its own characteristics and definitions. In the second issue of Geek Talk, we invited Chuanyu from MEG's Recommendation Technology Architecture department to talk about how our own microservice governance has been implemented and how well it has worked.
The full text is 3,864 words; estimated reading time is 7 minutes.
Guest profile: Chuanyu
Technical expert in the Recommendation Technology Architecture department. Has focused on search engines and recommendation engines since 2012; since 2016, responsible for the research and development of in-house resource scheduling and container scheduling systems. In 2019, began leading the development of several general-purpose basic components and comprehensively promoting the cloud-native architecture transformation of MEG user products.
Q1: How do you define service governance?
In order to promote service governance internally, we gave it a clear definition: we expect service governance to keep all our microservices in a reasonable operating state. A reasonable operating state includes:
- The capacity of all managed services is reasonable, with neither too much nor too little redundancy;
- All managed services are healthy, able to tolerate localized short-term exceptions without long-term failures, and provide high-quality service throughout their life cycle;
- The whole system is observable: its traffic topology and main technical indicators are transparent, monitored, and observable.
Q2: When did we start with service governance?
The recommendation system got started at the end of 2016 and the beginning of 2017. It started like any other business: we threw together a quick-and-dirty system and went straight to launch. The initial focus was supporting the rapid development of the business. As long as oversized services, resource utilization, and observability did not block business development, we did not pay much attention to them. For example, some services were launched as experiments, did not perform well, and were taken offline, but the resources used in these attempts were not reclaimed in time and were wasted. This situation lasted until 2018.
Later, when the business grew to a certain scale, it could no longer sustain such rapid growth and entered a stage of fine-grained optimization. At that point we needed to pay attention to making the resource utilization of these services more reasonable, and it was time to pay back the "technical debt".
But we did not want to clean up these debts through one-off manual campaigns. Instead, we hoped to solve the problem in a sustainable way, through systematic automation, and turn technical-debt cleanup from campaign-style operations into routine operations. That is when we decided to start fully implementing service governance.
Q3: Service governance, what exactly does it govern?
As the number of services grows, back-end systems face two key problems: how to allocate service resources rationally, and how to ensure that services run efficiently and healthily. At present, our microservice governance system covers three areas: capacity governance, traffic governance, and stability engineering.
Capacity governance aims to ensure that online services use resources rationally, through automatic scaling up and down.
We monitor the load of online services and automatically adjust their resources so that redundancy is neither too high nor too low. A more refined approach is to run real-time pressure tests against online services and allocate redundancy according to each service's tolerance limit. We do not require all services to meet the same high standard, but we do set a baseline to avoid waste.
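To make the idea concrete, here is a minimal sketch (in Go, with hypothetical names and thresholds, not ALM's actual logic) of how a target instance count can be derived from a pressure-test capacity figure and a redundancy baseline:

```go
package main

import (
	"fmt"
	"math"
)

// targetInstances computes how many instances a service needs so that,
// at peak traffic, each instance stays below its measured capacity
// with the configured redundancy margin.
//
// peakQPS:        observed or forecast peak traffic
// qpsPerInstance: per-instance capacity from pressure testing (tolerance limit)
// redundancy:     baseline redundancy, e.g. 0.5 means 50% headroom
func targetInstances(peakQPS, qpsPerInstance, redundancy float64) int {
	needed := peakQPS * (1 + redundancy) / qpsPerInstance
	return int(math.Ceil(needed))
}

func main() {
	// Hypothetical numbers: 12,000 QPS peak, 800 QPS per instance, 50% headroom.
	fmt.Println(targetInstances(12000, 800, 0.5)) // -> 23
}
```

Auto-scaling then simply moves the actual instance count toward this target as the measured load changes.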
Traffic governance has two objectives: one is that traffic should be observable, monitored, and controllable; the other is that its management should be automated.
For example, when we migrated a machine room in the past, less than 5% of the traffic had an unknown origin, yet we did not dare to act rashly, because we did not know whether cutting it off would cause irreparable losses. We had to ask OPs and RDs to chase down the sources of that traffic, which could take a long time. The underlying problem is that the traffic of our entire online system was effectively opaque; it needs to be observable, monitored, and controllable across services.
In addition, after the microservice transformation, the online connection relationships became complicated and the number of modules multiplied, so the connections between services can no longer be maintained by hand. The biggest problem is that once a service has a problem and we apply a temporary rollback or shielding, we do not know how large the blast radius is or how many important services depend on the one being operated on. Therefore, this part of governance has to be automated.
Stability engineering is about establishing stability monitoring and early-warning mechanisms.
The microservice transformation has made iteration faster and resource efficiency higher, but it has also made the overall system architecture more complex and weakened our grip on system reliability. You become less and less sure what will go wrong, what will happen when it does, and whether the contingency plan will work this time…
We ran into exactly this problem. After an online failure, we added an intervention interface and wrote a contingency plan: when the module went wrong, we would call this interface to stop the loss. However, when the same fault occurred again half a year later and the plan was executed, we found that the interface had been broken by some intermediate iteration.
Therefore, we need a more systematic mechanism to detect and verify the stability risks of the system and the effectiveness of our contingency plans.
Q4: Services grow with the size of the system. What capacity governance framework can meet our needs?
In terms of capacity governance, we have introduced a fully automated application-layer management system, ALM (Application Lifecycle Management), which has a relatively complete software life-cycle management module. We automatically build a capacity model for each application through pressure testing to determine reasonable redundancy in resource utilization.
In 2020, we reclaimed more than 10,000 machines this way. This year, we teamed up with INF to make load feedback more real-time, pushing real-time capacity tuning closer to a Serverless model.
Q5: In terms of traffic governance, do we also use a Service Mesh to solve the observability problem?
Yes and no.
On the one hand, like the rest of the industry, we try to solve traffic management and observability problems through a Service Mesh.
We introduced the open-source stack Istio + Envoy, then did secondary development to fit our in-house ecosystem, along with a lot of performance optimization to meet the latency requirements of online services. On top of Istio + Envoy we also implemented unified load-balancing strategies and traffic intervention capabilities such as traffic fusing, traffic black holes, and traffic replication.
In this way, many contingency plans no longer need to be implemented independently by each module; they go through the standardized interfaces of the mesh, which largely avoids problems such as plans silently degrading.
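As a rough illustration of what such a standardized intervention capability does (a minimal Go sketch with a made-up policy, not our Envoy/Istio or BRPC implementation): a traffic fuse opens when the recent error rate of a downstream crosses a threshold and, while open, drops calls like a temporary black hole.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the fuse is open and traffic is being rejected.
var ErrOpen = errors.New("circuit open: traffic fused")

// Breaker is a minimal traffic-fuse sketch: it opens when the recent error
// rate crosses a threshold and rejects calls until a cool-down has passed.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	total     int
	openUntil time.Time
}

// Call wraps one downstream request with the fuse.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen // while open, this path behaves like a temporary black hole
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	b.total++
	if err != nil {
		b.failures++
	}
	// Hypothetical policy: after 20+ calls, an error rate above 50%
	// opens the fuse for 5 seconds and resets the counters.
	if b.total >= 20 && float64(b.failures)/float64(b.total) > 0.5 {
		b.openUntil = time.Now().Add(5 * time.Second)
		b.failures, b.total = 0, 0
	}
	return err
}

func main() {
	var b Breaker
	// Pretend the downstream call succeeded.
	fmt.Println("call result:", b.Call(func() error { return nil }))
}
```

The point of the mesh is that this kind of policy lives in one standardized place rather than being re-implemented, and quietly broken, inside every module.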
On the other hand, we are constructing a Service Graph of online services, that is, a global traffic topology.
The reason is that the search and recommendation back-end services are very latency-sensitive, which makes it difficult to reach 100% Service Mesh coverage on our internal network. Therefore, we maintain the global service connection relationships separately: we collect and aggregate connection relationships and traffic indicators from both the Service Mesh and the BRPC framework to achieve overall observability of the architecture.
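A simplified sketch of how such a topology can be assembled (illustrative only; the report fields and merge policy are assumptions, not the internal Service Graph schema): connection reports from mesh sidecars and the RPC framework are keyed by caller and callee and merged together with their traffic indicators.

```go
package main

import "fmt"

// EdgeReport is one observation of a caller->callee connection, as it might
// be reported by a mesh sidecar or by the RPC framework itself.
type EdgeReport struct {
	Caller, Callee string
	QPS            float64
	LatencyMsP99   float64
	ErrorRate      float64
}

// Edge aggregates all reports for one caller->callee pair.
type Edge struct {
	QPS          float64
	LatencyMsP99 float64 // keep the worst observed P99 as a simple summary
	ErrorRate    float64 // traffic-weighted error rate
}

// ServiceGraph is the global traffic topology: caller -> callee -> metrics.
type ServiceGraph map[string]map[string]*Edge

// BuildGraph merges individual reports into the global topology.
func BuildGraph(reports []EdgeReport) ServiceGraph {
	g := ServiceGraph{}
	for _, r := range reports {
		if g[r.Caller] == nil {
			g[r.Caller] = map[string]*Edge{}
		}
		e := g[r.Caller][r.Callee]
		if e == nil {
			e = &Edge{}
			g[r.Caller][r.Callee] = e
		}
		if total := e.QPS + r.QPS; total > 0 {
			e.ErrorRate = (e.ErrorRate*e.QPS + r.ErrorRate*r.QPS) / total
			e.QPS = total
		}
		if r.LatencyMsP99 > e.LatencyMsP99 {
			e.LatencyMsP99 = r.LatencyMsP99
		}
	}
	return g
}

func main() {
	g := BuildGraph([]EdgeReport{
		{Caller: "recall", Callee: "rank", QPS: 1000, LatencyMsP99: 35, ErrorRate: 0.001},
		{Caller: "recall", Callee: "rank", QPS: 500, LatencyMsP99: 42, ErrorRate: 0.004},
	})
	fmt.Printf("recall -> rank: %+v\n", *g["recall"]["rank"])
}
```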
In both the BRPC framework and the data proxy layer, we record a large set of common "gold indicators" in a unified, standard format, including latency and load, which we use to observe the traffic health of the whole system link. In addition, we are trying out the Proxyless mode of Service Mesh: basic capabilities such as traffic fusing, traffic black holes, and traffic replication are embedded into the BRPC framework and controlled through the Service Mesh's control-command delivery channel.
In this way, the baseline of intervention capability stays consistent whether or not a service routes its traffic through the mesh, giving us a standardized way to handle online exceptions and avoiding capabilities quietly degrading as services iterate.
One of the immediate benefits of the Service Graph is that it improves the efficiency of machine room construction.
Machine room construction is usually complex. Each service manages its own downstream connections and there is no global view, so when bringing up a new machine room it is hard to find a single configuration error among hundreds of services, and debugging takes a long time. With the Service Graph, we now have a global view of traffic and connection relationships, which makes such problems easy to spot.
Q6: Stability work is mainly about nipping problems in the bud. How can we find the system's hidden risks in advance?
Our work on stability engineering mainly covers automated management and chaos engineering.
Stability work usually requires experienced engineers, but in the past it was invisible work: there was no measurement mechanism, no positive feedback when the work was done well, and when something went wrong it became a well-known incident that could not have been avoided in advance. We introduced chaos engineering mainly to solve two problems: one is to find hidden stability risks in the system by injecting random faults online; the other is to try to quantify stability work through chaos-engineering methods.
In terms of fault injection, we can inject faults at the container, whole-machine, switch, and IDC level, from local to global, and we work with OPs to develop external network faults and customized faults for some common complex systems. Online, we systematically inject faults at random and spot potential problems by observing how the system behaves.
As for the measurement mechanism, we set up a “resilience index” to measure the online system’s tolerance to stability problems.
Take tolerance of single-machine failure as an example: a chaos experiment takes down one machine online; if your system goes down with it, you cannot tolerate a single-machine failure, and that experiment scores 0 points.
A round of chaos engineering runs dozens of experiments and produces a score at the end; the higher the score, the more stable the system is inferred to be. In the scoring, the more common a fault is, the more points it is worth, and the more difficult and complex a fault is, the fewer points it is worth. The idea is to encourage people to focus on common problems first, rather than going straight after uncommon but difficult ones.
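A toy illustration of the scoring idea (the experiments and weights below are invented, not our actual resilience index): each chaos experiment carries a weight that is higher for common faults, and a round's score is the weighted fraction of experiments the system tolerated, scaled to 0-100.

```go
package main

import "fmt"

// Experiment is one chaos-engineering case: a fault type, a weight that is
// higher for common faults, and whether the system tolerated it.
type Experiment struct {
	Name      string
	Weight    float64
	Tolerated bool
}

// resilienceIndex returns the weighted pass rate, scaled to 0-100.
func resilienceIndex(round []Experiment) float64 {
	var got, total float64
	for _, e := range round {
		total += e.Weight
		if e.Tolerated {
			got += e.Weight
		}
	}
	if total == 0 {
		return 0
	}
	return 100 * got / total
}

func main() {
	round := []Experiment{
		{"kill one instance", 10, true}, // common fault: worth more
		{"kill one switch", 5, true},
		{"lose one IDC", 2, false}, // rare, hard fault: worth less
	}
	fmt.Printf("resilience index: %.1f\n", resilienceIndex(round))
}
```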
Q7: What are the challenges we face in terms of automation?
In my opinion, there are three aspects: first, how to obtain data for decision-making; second, how to design standardized operation interfaces; and third, how to make operational decisions based on data.
In terms of data, the gold indicators and some general load-related indicators we developed have accumulated well, and we are continuously pushing their coverage. However, for indicators such as service level and service quality, different businesses may have different standards, which makes it harder to expose them in a standardized way. How to formulate uniform norms is a relatively big challenge.
There are two kinds of standardized interfaces: one is the control interface for instances on the cloud (the capacity interface), which connects to the underlying PaaS systems through ALM; the other is the traffic interface, which goes through the Service Mesh control plane and whose coverage is still being rolled out.
Once you have standardized data and standardized operation interfaces, the policies in between are not particularly hard to implement; most scenarios can be covered by simple rules. The challenge lies more in balancing the sensitivity and accuracy of the policy's decisions. In some scenarios, if automatic operations are too sensitive, accuracy drops and the system jitters; if they are not sensitive enough, problems are not handled in time. So different policies have to be tuned for different scenarios.
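One common way to trade sensitivity against jitter (a generic sketch with hypothetical thresholds, not our production policy) is to require the triggering condition to hold for several consecutive samples and to enforce a cool-down after each action:

```go
package main

import "fmt"

// scaleDecider fires a scale-up only after the high-load condition has held
// for `confirm` consecutive samples, and then stays quiet for `cooldown`
// samples, trading a little sensitivity for stability.
type scaleDecider struct {
	threshold float64
	confirm   int
	cooldown  int

	hot  int // consecutive samples above threshold
	wait int // samples remaining in cool-down
}

func (d *scaleDecider) observe(cpuUtil float64) bool {
	if d.wait > 0 {
		d.wait--
		return false
	}
	if cpuUtil > d.threshold {
		d.hot++
	} else {
		d.hot = 0
	}
	if d.hot >= d.confirm {
		d.hot = 0
		d.wait = d.cooldown
		return true // recommend scaling up
	}
	return false
}

func main() {
	d := scaleDecider{threshold: 0.7, confirm: 3, cooldown: 10}
	for _, u := range []float64{0.65, 0.75, 0.8, 0.82, 0.9} {
		fmt.Println(u, d.observe(u)) // fires only once the condition persists
	}
}
```

Raising `confirm` lowers sensitivity but reduces jitter, and the cool-down keeps back-to-back actions from fighting each other; which balance is right depends on the scenario.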
Q8: Talk about the change in thinking that impressed you the most
In my opinion, we should not rely only on processes and norms to solve problems; we should also value systematization and encode processes, norms, and experience into code. By strengthening the system's observability and controlling and managing it through automated mechanisms, we can solve problems that previously had to be solved manually. With processes and norms alone, on the one hand experience leaves with the people who hold it and we keep repeating historical mistakes; on the other hand, every new product requires training a group of highly specialized maintainers from scratch, which is too expensive.
There is an old saying about building the OP into the architecture; that is essentially what we are doing.
Recruitment information
If you are interested in microservices, please contact me and we can talk face to face about the many possibilities ahead. In addition, if you want to join us, follow the Baidu Geek public account of the same name and send "internal referral"; we look forward to having you!
Recommended reading
| Cloud-native transformation of Baidu's search and recommendation engines | Geek Talk, Issue 1
———- END ———-
Baidu Geek Talk
Baidu's official technology public account is now online!
Technical insights · industry news · online salons · industry conferences
Recruitment information · internal referrals · technical books · Baidu swag
Welcome to follow us