Image credit: unsplash.com

Author: Di Qing

I. Background

As service-oriented architecture has expanded at scale, service stability has drawn more and more attention. Cloud Music began building its stability capabilities in 2018, and they have since become general-purpose capabilities. Cloud Music has gone through roughly the following stages: from "running naked" in 2018, with no stability safeguards at all, to establishing basic stability capabilities; then integrating them into a platform to improve ease of use and efficiency; and most recently building a platform for plan management. Each step has contributed to the overall stability of Cloud Music. This article introduces some of our practices in building the plan platform.

1. What is the plan?

Baidu Encyclopedia defines it as follows: a plan (pre-plan) is an emergency response scheme formulated in advance, based on evaluation, analysis, or experience, according to the category and degree of impact of potential or possible emergencies.

Plans are involved in matters large and small, from major national events down to everyday ones:

  • In movies we often see Plan A, Plan B, Plan C, and so on. These are plans in exactly our sense: whether Plan A, Plan B, or Plan C is carried out depends on the scenario.
  • Take the recent coronavirus outbreak: the government already has its own series of plans for different levels of emergency (the Ministry of Emergency Management of the People's Republic of China is mainly responsible for organizing the formulation of the overall national emergency plan and related planning, guiding regions and departments in responding to emergencies, and promoting the construction of the emergency plan system and plan drills);
  • Cloud Music often runs large campaigns. Before each one we do a lot of stress testing and drills, during which we sort out plans for specific situations: for example, if a service cannot handle the load, we degrade some features; if it still cannot cope, we degrade further or apply rate limiting, and so on;
2. Current situation of Cloud Music plans

When safeguarding large-scale Cloud Music campaigns, we run routine drills and stress tests to assess and map out capacity, evaluate possible risks, and make preventive preparations. These preparations are in fact plans, such as Plan A and Plan B. Previously, these plans were maintained on the wiki (for example, the business side's typhoon project plan) or in our own notes, and when a specific scenario occurred we would go and execute Plan A or Plan B. All of this was scattered and not unified. When a large number of people were involved, collaboration costs were very high and the plans were hard to execute reliably. In addition, without rehearsal, the effectiveness of a plan could not be verified.

3. What does the plan look like?

What exactly is a plan, and what capabilities does it provide? Plans generally share some common elements. Searching for "plan" in a search engine returns all kinds of answers, but they can be summarized into the following parts: the purpose of the plan, its scope of application, its level, its trigger conditions, the outcome after execution, and post-processing (such as how to recover):

  • Purpose and scope of application: what the plan is mainly meant to accomplish and in which scenarios it can be used, such as stress testing scenarios or the level of campaign it safeguards.
  • Trigger conditions: the conditions under which the plan is enabled. Normally we decide which plan to enable based on the state of the system, which monitoring, alerts, and so on report to us. When the plan's execution conditions are met, for example when server timeouts reach a certain threshold, we enable the rate limiting or proactive degradation plan;
  • How to recover: a plan is generally set up for a particular emergency scenario, so after the plan has been executed, some follow-up work is needed to restore things to the normal state;
  • Outcome: after the plan is executed, the outcome should be filled in promptly to verify whether the plan met expectations. If it did not, the deviations should be recorded to provide data for later optimization and review of the plan;
  • Plan level: we generally assign a level to each plan. The level tells us how important the plan is and whether it is lossy to the business, which makes higher-level arrangement easier. The levels of Cloud Music plans are as follows:
    • L0 plan: a plan arranged in advance, with minimal impact on the business and imperceptible to users (no perceptible effect on the user experience);
    • L1 plan: lossy to the business (affects business data, etc.); this has been communicated to the business line during plan preparation;
    • L2 plan: generally has a relatively large impact, so it can be executed only after an on-site decision;
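
The elements above can be sketched as a small data structure. This is only an illustration under stated assumptions, not the platform's actual schema: the class names, the field names, the `timeout_rate` metric, and the 5% threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class PlanLevel(Enum):
    """The three levels described above (wording paraphrased)."""
    L0 = "minimal impact, imperceptible to users"
    L1 = "lossy, communicated to the business line in advance"
    L2 = "large impact, needs an on-site decision"


@dataclass
class Plan:
    name: str
    level: PlanLevel
    purpose: str
    # Hypothetical trigger: a predicate over a dict of current metrics.
    trigger: Callable[[Dict[str, float]], bool]
    recovery: str  # free-text description of how to restore the normal state

    def should_trigger(self, metrics: Dict[str, float]) -> bool:
        """Return True when the monitored state meets the trigger condition."""
        return self.trigger(metrics)

    def needs_onsite_decision(self) -> bool:
        """Only L2 plans require a decision on the spot before execution."""
        return self.level is PlanLevel.L2


# Example: enable rate limiting when the timeout rate reaches 5% (assumed value).
rate_limit_plan = Plan(
    name="limit-song-api",
    level=PlanLevel.L1,
    purpose="protect the service during a campaign peak",
    trigger=lambda m: m.get("timeout_rate", 0.0) >= 0.05,
    recovery="restore the original rate limiting threshold",
)
```

Keeping the trigger as a predicate over metrics makes the condition explicit and testable, instead of leaving it as free text in a description field.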

On the current plan platform, the purpose, scope of application, execution conditions, recovery steps, and outcome of a plan are handled fairly simply: they are all written into the plan's description rather than broken out into separate fields.

Plans are fully prepared in advance, discussed, and sorted out. When a condition is triggered, the plan is executed and whether it met expectations is recorded. Afterwards, the plan is optimized and reviewed, as shown in the figure below:

II. Plan platform construction

1. Background of the plan platform:
  • The plan platform originated from batch-operation requirements, such as adjusting the rate limiting thresholds of multiple applications in one batch and toggling degradation switches in batches.
  • In most cases, a business plan adjusts configuration center values, rate limiting thresholds, and degradation rules.
  • Plans were not standardized: their forms varied widely. Some simply notify us to do something, such as check a configuration or check whether monitoring is in line with expectations; others adjust configuration center values, rate limiting thresholds, or degradation switches. In addition, people did not share a consistent understanding of plan levels such as L0, L1, and L2;
  • There was no platform for formulating, executing, and controlling plans: for a long time, the documented plans were maintained on the wiki, scattered and inconsistent. When a large number of people were involved, collaboration costs were very high and the plans were hard to execute reliably.
2. Concept alignment of the plan platform:

There are three basic concepts in the Cloud Music plan platform:

  • Resource: a resource can be a key/value pair in the configuration center, a rate limiting rule, or a circuit-breaking and degradation rule.
  • Plan: belongs to a plan group and is triggered under certain circumstances. For example, before a stress test, a plan can turn off rate limiting so that the test can probe the service's peak capacity;
  • Plan group: manages one set of resources. Multiple plans can be created in a plan group. For example, in a drill scenario where several rate limiting resources need to be adjusted, different plans in the plan group may use different rate limiting thresholds.

The overall plan group can be understood from the figure above. A plan group contains multiple plans and a resource management function. Each plan in the group inherits its resources from those imported through resource management, and the resources in each plan can be adjusted independently.
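
The relationship between resources, plans, and plan groups can be sketched roughly as follows. This is a minimal model under assumptions, not the platform's real implementation: the `PlanGroup` class, its method names, and the resource identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class PlanGroup:
    """A plan group owns a set of resources; each plan adjusts some of them."""
    name: str
    # resource id -> baseline value imported through resource management
    resources: Dict[str, Any] = field(default_factory=dict)
    # plan name -> per-plan overrides of the inherited resources
    plans: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def import_resource(self, resource_id: str, baseline: Any) -> None:
        self.resources[resource_id] = baseline

    def create_plan(self, plan_name: str, overrides: Dict[str, Any]) -> None:
        # A plan may only adjust resources that the group already manages.
        unknown = set(overrides) - set(self.resources)
        if unknown:
            raise KeyError(f"resources not imported into group: {sorted(unknown)}")
        self.plans[plan_name] = dict(overrides)

    def resolve(self, plan_name: str) -> Dict[str, Any]:
        """Inherited baseline values, with this plan's own adjustments applied."""
        return {**self.resources, **self.plans[plan_name]}


group = PlanGroup("campaign-drill")
group.import_resource("ratelimit:song-api", 1000)
group.import_resource("ratelimit:comment-api", 500)
group.create_plan("stress-test", {"ratelimit:song-api": 5000})
group.create_plan("degrade", {"ratelimit:comment-api": 100})
print(group.resolve("stress-test"))
```

Each plan stores only its own adjustments, so two plans in the same group can set different thresholds for the same resource without affecting each other.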

With the platform in place, the plan format can be standardized, with fields such as plan module, owner, plan level, type, and other information;

3. Platform capabilities

  • The productized plan platform covers more than the plan capabilities described above. It is also designed from a platform-building perspective, with approval, permission management, configuration safety (release preview, configuration comparison, etc.), and basic governance capabilities.
  • It currently supports plan management for rate limiting, degradation, and the configuration center;
  • Execution plans: why this concept exists is explained in the section on plan arrangement (orchestration);
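
As an illustration of what a release preview with configuration comparison might compute, here is a minimal sketch. The function name, the resource keys, and the values are assumptions for illustration and do not come from the platform.

```python
from typing import Any, Dict, Tuple


def preview_changes(
    current: Dict[str, Any], planned: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {resource: (current value, planned value)} for values that differ.

    A resource absent from `current` is reported with None as its current value.
    """
    changes = {}
    for key, new_value in planned.items():
        old_value = current.get(key)
        if old_value != new_value:
            changes[key] = (old_value, new_value)
    return changes


current = {"ratelimit:song-api": 1000, "switch:lyrics-degrade": False}
planned = {
    "ratelimit:song-api": 5000,
    "switch:lyrics-degrade": False,
    "config:cache-ttl": 30,
}
for resource, (old, new) in preview_changes(current, planned).items():
    print(f"{resource}: {old} -> {new}")
```

Showing only the values that would actually change gives an approver a short list to review before a plan is executed.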
4. Platform view

The following is the plan-level management view of the plan platform:

III. Plan orchestration

One question many colleagues ask is: do plans need orchestration (arrangement)? Orchestration here means that a campaign involves multiple plans, which are executed in sequence or on a schedule. Whether this applies depends on business needs. If the execution of a plan cannot be predicted or evaluated in advance and the plan will only be enabled under certain circumstances (that is, it is purely preventive), orchestration is not needed. If there is a high probability that certain plans will definitely be executed, this capability comes into play.

Therefore, we support another set of capabilities: the execution plan. Below we clarify the goals of the execution plan and its related concepts and capabilities.

  • Goals of the execution plan:

    • Provide process management and notification along a timeline;
    • Provide a unified execution view for display.
  • Concept alignment:

    • Execution plan: the unit that manages execution, containing one or more execution processes;
    • Referenced execution plan: by referencing other execution plans, the execution processes of the referenced plan can be displayed via a soft link;
    • Execution process: the smallest unit of an execution plan, used to store one item that needs to be executed;
    • Descriptive plan: a record of execution content other than execution-type plans;
    • Execution-type plan: a plan in the plan module that the process connects to directly.
  • Execution plan capabilities:

    • Rich notification mechanisms: popO, Stone, email, and SMS notification policies are currently supported, and the notification timing can be configured flexibly;
    • Easy-to-use permission control: permissions can currently be managed at the execution plan level and at the process level;
    • Execution plan association: execution plans can be linked together conveniently. For example, several people each design an execution plan for their own scenario, and these are then combined into one execution plan for unified control;
    • Friendly execution plan visualization, as shown in the diagram below:

  • Integration with plans supports plan orchestration: an execution plan contains multiple execution processes, each of which can be a plan on the plan platform.

It should be emphasized that the plan platform and the execution plan are two different concepts. The execution plan provides orchestration and notification, and plans are what can be orchestrated: each plan can be an execution process within an execution plan. The execution plan mainly helps business developers with process orchestration and process notification. Beyond orchestrating plans, it can also be used to manage checklists, notices, and kanban boards.
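
The execution plan concepts can be sketched as follows. This is a simplified model under assumptions: the class and field names are hypothetical, the timeline is reduced to minute offsets, notification is left out, and references between execution plans are assumed to contain no cycles.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Process:
    """Smallest unit of an execution plan: one item to be executed."""
    title: str
    start_minute: int                # offset on the timeline (assumed unit)
    plan_name: Optional[str] = None  # None means a descriptive step


@dataclass
class ExecutionPlan:
    name: str
    processes: List[Process] = field(default_factory=list)
    # Other execution plans referenced by this one (assumed acyclic).
    references: List["ExecutionPlan"] = field(default_factory=list)

    def timeline(self) -> List[Process]:
        """Unified view: own and referenced processes, ordered by start time."""
        items = list(self.processes)
        for referenced in self.references:
            items.extend(referenced.timeline())
        return sorted(items, key=lambda p: p.start_minute)


comment_team = ExecutionPlan(
    "comment-team",
    [Process("Lower comment rate limit", 30, plan_name="degrade")],
)
campaign = ExecutionPlan(
    "campaign",
    [
        Process("Check dashboards", 0),  # descriptive step
        Process("Raise song-api limit", 60, plan_name="stress-test"),
    ],
    references=[comment_team],
)
for step in campaign.timeline():
    print(step.start_minute, step.title, step.plan_name or "(descriptive)")
```

In this sketch a referenced execution plan stays owned by its own team, while the combined plan only reads its processes to build the unified view.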

Some best practices

The plan platform has been widely adopted within Cloud Music, mainly for configuring stability capabilities. At the same time, there are some issues to watch for when configuring plans.

1. Rate limiting, degradation, and configuration center plans

The platform provides single-plan capabilities for rate limiting, degradation, and the configuration center. It can also combine the rate limiting, degradation, and configuration center values of multiple applications and products into one plan group, which enables plan management for combined scenarios, with permission management included.

2. Issues needing attention in plan configuration

When designing a plan, we make assumptions based on certain scenarios or drill results, and those assumptions may reflect temporary problems in our products, technical design, or code. If we later fix those problems but do not adjust the plan in time, the plan's effectiveness drops sharply. We therefore need to consider how to keep plans effective and prevent them from decaying.

  • Plan decay ("corruption")

    • Plan construction is closely tied to the code and product implementation, both of which iterate rapidly. Like a system architecture, a plan keeps decaying as one requirement iteration follows another. How do we keep plans fresh? At present, a good approach is to carry plan drills through to the end, which can be done together with fault drills. Plans are then continuously improved and supplemented through drills and production practice, which is a win-win process.
    • To avoid plan decay and keep plans executable, plans should be kept lean: trim their content rather than expand it.
  • Regular plan drills:

    • Plan construction is a very important part of stability work. But is it enough, once a plan is designed, to simply switch it on when its trigger condition occurs? Imagine a scenario: a plan was designed half a year ago, and now a problem arises that matches its trigger condition. We want to enable it, but the plan may no longer match the conditions it was designed for, and the engineers involved may not have much confidence in it. This is the result of lacking routine drills. Plans should therefore be drilled regularly so that they are managed and optimized under normal conditions and do not decay.
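
One simple way to support regular drills is to flag plans that have not been drilled recently. The sketch below is an assumption-laden illustration, not a platform feature: the function name, the plan names, and the 90-day window are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def plans_due_for_drill(
    last_drilled: Dict[str, Optional[date]],
    today: date,
    max_age_days: int = 90,  # assumed drill interval
) -> List[str]:
    """Return names of plans never drilled, or last drilled before the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, drilled in last_drilled.items()
        if drilled is None or drilled < cutoff
    )


records = {
    "limit-song-api": date(2024, 1, 10),
    "degrade-comments": date(2024, 5, 20),
    "disable-recommendations": None,  # designed but never drilled
}
print(plans_due_for_drill(records, today=date(2024, 6, 1)))
```

A list like this could feed a routine reminder, so that a plan designed half a year ago is re-verified before it is needed in production.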

VI. Future planning

In the future, we will mainly work on scenario-based integration, making timely adjustments based on how the business teams' trials go, so as to build more extensible capabilities. At the same time, we will continue to build out the platform.

  1. Scenario-based integration: the existing fault drill platform, NPT stress testing platform, and activity support platform all need plan capabilities;
    • Fault drill platform: to integrate with it, plans need to be designed for fault drills. When a fault is injected, a plan needs to be enabled so that the service can be brought back up quickly, or so that the service stays alive with no loss or only partial loss;
    • NPT stress testing platform: during stress tests, many engineers need to probe peak capacity while still guaranteeing service stability, so that the service is not knocked over by sustained traffic. This may call for adjusting rate limiting thresholds. That adjustment can be captured as a plan and associated with NPT stress tests on a routine basis, keeping the plan continuously effective rather than letting it decay quickly;
    • Activity support platform: a campaign, especially a large one, requires a lot of stress testing and stability assurance. This involves many switches, configurations, rate limits, and similar capabilities, so N sets of plans need to be designed to handle it;
  2. Strengthen the capabilities of the plan platform:
    • Monitoring at the plan resource layer;
    • Visual verification of plan effects;
    • Risk inspection;
    • Richer governance capabilities;
    • Foundations for automatic plan execution;
    • Dependency support for execution plans;
  3. Push plan work forward as a routine practice to avoid plan decay;

VII. Summary

  • The plan platform grew out of the need for batch configuration operations, and batch capability is only the simplest of the demand scenarios. Plan construction is a very important part of business stability assurance, and building plans into a platform lets that work achieve twice the result with half the effort;
  • Building plans along more dimensions (through scenario-based integration) helps improve risk awareness and stability;
  • How to make plan drills routine and prevent plan decay is still a direction we need to keep exploring.

This article is published by NetEase Cloud Music Technology team. Any unauthorized reprinting of this article is prohibited. We recruit technical positions all year round. If you are ready to change your job and you like cloud music, join us!