1 Overview

This article explains why a good public or private cloud needs an orchestration system to support on-cloud automation, and describes the difficulties and effort involved in implementing one. It also presents a prototype implementation of an orchestration system, including the theoretical analysis and the main plug-in framework, along with some detailed design suggestions. I hope it gives you a deeper understanding of “resource orchestration & application orchestration”, and I hope to work with you, with an open mind, to make the cloud as natural and universal as water and electricity.

2 Why on-cloud automation

The value of automation in IT goes without saying; every programmer knows it is a must. Automated scripts, automated tests, automated deployments and so on all exist to make life easier for programs and the programmers around them. Do we still need automation on the cloud? In short: first-time users can do without it, but heavy users need on-cloud automation. The need shows up in the following ways:

2.1 Repetitive execution actions

Verifying an application on the cloud involves many repeated operations, such as destroying and restoring an environment, or, in a capacity expansion scenario, configuring several new instances one after another. Once these operations become frequent, say once or several times a day, you get bored and start trying to automate the process so that each execution is repeatable. You might write a Shell or Python script, call the cloud provider’s API yourself, or even use a tool like Chef or Puppet.

Repetition is the first thing that drives automation.

2.2 Saving time

Some operations on the cloud are very time-consuming: creating a database or a VM, for example, takes minutes. When several such resources must be created one after another, the user has to sit and wait. If the process is automated, that waiting time is freed up and the programmer can work on other, more valuable tasks.

Automating a process on the cloud does not reduce the overall time needed to perform it, but the waiting can be shifted, for example to the middle of the night. For this reason, repetition is the precondition for saving time through automation. For a one-time operation, the time saved by automation generally does not justify the time spent building the automation.

2.3 Replication of the basic environment

The basic environment here means infrastructure: the collection of all the cloud services an application needs in order to run on the cloud. For example, a typical website has three layers: front end, back end and database. After a complete system has been built in one region of the cloud (for example, North China), the need to replicate it arises when the same environment must be rebuilt in South China, or even on another provider’s cloud. Should programmers install the components manually, one by one, or repeat the deployment automatically with one click? If one-click deployment is available, it is of course preferred.

Many cloud vendors are now promoting a concept called Infrastructure as Code, which uses machine-readable configuration files instead of manual, interactive configuration. These configuration files can be versioned like code in a version control system. The benefits for enterprises fall mainly into three areas: cost reduction, efficiency improvement and risk reduction.

Cost reduction is easy to understand: as mentioned above, automation frees people for other tasks and improves programmer productivity. Efficiency improves because automated configuration shortens environment installation, especially when multiple components or teams are involved. Risk is reduced because automation eliminates human error, and repeatable execution makes the process more reliable.

2.4 Self-service

Cloud services, if done well, should be self-service and pay-as-you-go, like tap water and electricity. Only then can automated on-demand provisioning and on-demand scaling be supported, which is what the cloud is meant to offer.

So this is a requirement on cloud providers. A cloud platform should support self-service, on-demand use of its services and provide the corresponding usage metering (billing) and usage reports. Only when the platform’s back-end processes are fully automated can the user experience be completely self-service. It is like an “in stock” listing from a Taobao merchant: without it, buyers have to contact the store before placing an order and cannot serve themselves on demand.

2.5 Summary: Cost of automation

Something is worth automating only if it is done repeatedly, and only when the benefit gained over those repetitions outweighs the cost of building the automation. If a task is a one-off, there is no need to automate it; doing it manually is more cost-effective.

For example, when you are thirsty on the road you do not install a tap water system; you just buy a bottle of mineral water. At home, where you need water every day, installing a tap water system makes sense.
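The trade-off above can be put into numbers. The following is a minimal sketch, assuming the cost and benefit are both measured in minutes of human effort; the function name and parameters are illustrative and not part of the original article.

```python
import math


def break_even_runs(build_minutes, manual_minutes, automated_minutes=0.0):
    """Smallest number of runs at which automating beats doing it by hand.

    build_minutes:     one-off human effort to write the automation
    manual_minutes:    human effort for each manual run
    automated_minutes: human effort still needed for each automated run
    """
    saved_per_run = manual_minutes - automated_minutes
    if saved_per_run <= 0:
        return math.inf  # automation never pays off
    # We need runs * saved_per_run > build_minutes (strictly greater).
    return math.floor(build_minutes / saved_per_run) + 1


# Example: 4 hours to script a task that takes 30 minutes by hand.
runs_needed = break_even_runs(build_minutes=240, manual_minutes=30)
```

With these assumed figures, the script only pays for itself from the ninth run onward, which is why a one-off task is better done by hand.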

Automation on the cloud provides reliability: cloud resources and cloud services are created consistently every time, and execution is repeatable for any user and any organization. It also eliminates the problems caused by human error. For heavy users of the cloud, this is a necessity.

3 Automation evolution on the cloud

3.1 Difficulties faced by automation

(1) The variety of cloud services makes full automation difficult. A typical cloud platform provides ECS (virtual machines), EVS (disks), VPC (networks), RDS (databases), ELB (load balancing) and a seemingly endless list of other services. There is even a term for this, “AWS fatigue”: AWS releases so many new services and features every year that users grow tired of keeping up.

(2) Creating cloud services involves complex dependencies. A typical example: a VM needs to be bound to an EIP, a subnet must exist before a VM can be created, and a VPC must exist before a subnet can be created. Layers of dependencies and cross-dependencies put roadblocks in the way of developers and make automation much more expensive. When that cost outweighs the benefit of repetition described above, automation is abandoned.

(3) Resources on the cloud are used differently from traditional ones. The user is no longer the full owner of a resource, only its consumer. With reduced back-end permissions you cannot control everything, which makes it hard to find the cause when resource initialization fails (perhaps because of a bug in the cloud platform itself); sometimes you have to ask the cloud provider for help. The workflow changes slightly too: in the past you copied your software package directly to the verification environment, whereas on the cloud you may need to go through a jump host. All of this makes automation harder to implement.

3.2 Attempts at automation

A single diagram summarizes the evolution of automation on the cloud and makes the development of this field easier to grasp. Note that the boundary between resource provisioning automation and resource orchestration is not sharp. The main difference is flexible syntax: gradually adding flexible syntax to existing automation templates is basically enough to achieve flexible orchestration.

4 The ultimate automation system: orchestration

Automation means that a task flow needs no human intervention. Orchestration means that multiple task flows can be planned in advance and their tasks executed in parallel or in series. By this plain definition, orchestration is an upgraded form of automation, reached only when arbitrary automated flow control is possible. It follows that a cloud vendor’s orchestration system is not a good one if it cannot handle even basic automation flows.

4.1 The orchestration benchmark on the cloud

Any discussion of orchestration on the cloud has to mention AWS CloudFormation (CFN). It is essentially a standard of the AWS cloud ecosystem and underpins the application marketplace and the service catalog, handling the initialization of any IT software and IT infrastructure.

The main principle is that the user provides the properties of the object to be created, and CFN creates the object on the user’s behalf. For example, initializing EC2 means creating VMs. The user provides properties such as host name, image, disk size, network, instance specification and security groups. With these properties, CFN can determine how to call the EC2 API to create the VM.

It is powerful because it not only controls execution order but also provides many built-in functions at the syntax level, with which users can reference variables, concatenate variable values and control execution details. Together with its very rich set of orchestration objects, this makes it possible to automate the creation of virtually any AWS resource with CFN.
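To make “reference variables” and “concatenate variable values” concrete, here is a minimal sketch of how a template engine might resolve two such built-in functions. `Ref` and `Fn::Join` are the names CloudFormation uses for these functions, but the resolver below is a simplified illustration written for this article, not AWS’s implementation, and the property names and values are made up.

```python
from typing import Any, Dict


def resolve(value: Any, known: Dict[str, Any]) -> Any:
    """Recursively replace {"Ref": name} and {"Fn::Join": [sep, parts]}."""
    if isinstance(value, dict):
        if set(value) == {"Ref"}:
            # Look up a parameter or a previously created resource.
            return known[value["Ref"]]
        if set(value) == {"Fn::Join"}:
            sep, parts = value["Fn::Join"]
            return sep.join(str(resolve(part, known)) for part in parts)
        return {key: resolve(item, known) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, known) for item in value]
    return value  # plain literal, nothing to resolve


# Hypothetical parameter values and VM properties.
known = {"EnvName": "test", "VpcId": "vpc-123"}
properties = {
    "Name": {"Fn::Join": ["-", ["web", {"Ref": "EnvName"}]]},
    "Network": {"Ref": "VpcId"},
    "DiskGB": 40,
}
resolved = resolve(properties, known)
```

After resolution, every property holds a plain value, so the engine can pass the dictionary straight to the API call that creates the resource.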

4.2 Comparison of orchestration systems on the cloud

Here is a capability comparison of three typical cloud vendors that provide orchestration (Amazon CFN, Alibaba ROS, Huawei AOS). Please contact us to correct any errors.

√ means “strong / well done”, O means “fair / to be enhanced”, and X means “feature not available”.

Note: the orchestration capabilities of OpenStack Heat are similar to those of AWS, but it has no graphical designer, so it is not listed here.

4.3 Shortcomings of orchestration systems

Current orchestration systems require a description file that specifies the execution flow the user wants. This file is commonly called a “template”. Using templates to control execution logic is not a problem in itself; every orchestration system in the industry has its own template syntax. The problem is that writing a new template from scratch is difficult and has a learning threshold; at first it always feels like learning a new programming language.

This comes from the complexity of the objects being orchestrated: creating an RDS database requires more control parameters than creating a single VM. A new template syntax is therefore equivalent to a new programming language, and anyone who has written code knows that fast coding needs proper IDE support. For this reason, the more capable orchestration systems ship a graphical designer, positioned as the IDE for writing templates.

For example, AWS, Alibaba and Huawei all provide an online IDE for template editing. A designer is good if it makes templates convenient to write.

5 How to implement an on-cloud orchestration system

At the heart of an orchestration system is a workflow engine that analyzes the dependencies between steps and controls their execution order according to a DAG (directed acyclic graph) model. Orchestration on the cloud is a more specific case: it creates cloud services in dependency order.

Algorithmically, we can call each cloud service an element. The process of creating cloud services is the process of creating elements in sequence.

5.1 Directed acyclic graph (DAG)

A directed acyclic graph (DAG) is a directed graph that, as the name says, contains no cycles. It is often used to represent dependencies between events and to manage the scheduling of tasks.

Figure: An example of a directed acyclic graph

Directed acyclic graphs are often processed by topologically sorting their nodes, and our system prototype is built on this theoretical basis. Topological sorting determines an order for all elements that respects the DAG dependencies; the algorithm is well documented elsewhere and is not described in detail here. Once the elements are sorted, execution completes the lower-level elements first, then the upper-level ones, until all elements are initialized. This is the theoretical basis of our orchestration system model.
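As a quick illustration of topological ordering, the sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The element names and their dependencies are assumptions chosen for the example; they are not taken from the prototype.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each key depends on the elements in its set.
depends_on = {
    "vm": {"subnet", "eip"},
    "subnet": {"vpc"},
    "vpc": set(),
    "eip": set(),
}

# static_order() yields every element after all of its dependencies.
# It raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(depends_on).static_order())
```

Any order produced this way creates the VPC before the subnet and the subnet before the VM; the relative position of independent elements such as the EIP is not fixed.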

5.2 Orchestration system prototype

Here we assume that there is a system initialization process as follows:

For all elements to be created in the desired order, we follow two rules: (1) execute in parallel by default; (2) elements with no dependencies are executed first. In the implementation, we first turn the element start-up sequence into a directed graph and compute the dependency count of each node by traversal, as follows:

Note: the dependency count only includes directly adjacent nodes.

Following the previous two rules: the dependency count of elements B and D is 0, so they can be initialized first. B and D are independent of each other and can be executed in parallel.

After any element finishes, the dependency count of every node that depends on it is reduced by one, giving the new dependency counts of all nodes:

This time the only elements that can be executed are C and F, because their dependency counts are 0. After these two elements finish, subtract one from the dependency count of each element that depends on them to get the counts of all nodes again:

Repeat this logic until all elements have been executed and the workflow is complete. This ensures that the whole process takes the shortest possible time while still respecting the order. As the workflow principle shows, the strength of an orchestration system lies less in flow control than in the richness of its orchestration elements and syntax. A good orchestration system makes it quick to develop new elements and thus to offer orchestration for new services.
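The dependency-counting procedure can be sketched as follows. This is an illustration rather than the prototype’s actual code: the function name is hypothetical, and the edges of the example graph are assumed (chosen only so that B and D start first and C and F come next, as in the text), since the original figure is not reproduced here.

```python
from typing import Dict, List, Set


def execution_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group elements into waves; elements in one wave can run in parallel."""
    # Dependency count of each node: direct dependencies only.
    remaining = {node: len(deps) for node, deps in depends_on.items()}
    # Reverse edges: which elements are waiting on a given element.
    dependents: Dict[str, List[str]] = {node: [] for node in depends_on}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(node)

    waves: List[List[str]] = []
    done = 0
    ready = sorted(node for node, count in remaining.items() if count == 0)
    while ready:
        waves.append(ready)
        done += len(ready)
        next_ready = []
        for finished in ready:
            for waiting in dependents[finished]:
                remaining[waiting] -= 1  # one dependency satisfied
                if remaining[waiting] == 0:
                    next_ready.append(waiting)
        ready = sorted(next_ready)
    if done != len(depends_on):
        raise ValueError("dependency cycle detected")
    return waves


# Assumed edges, for illustration only.
depends_on = {
    "A": {"C", "F"},
    "B": set(),
    "C": {"D"},
    "D": set(),
    "E": {"A"},
    "F": {"B"},
}
waves = execution_waves(depends_on)
```

For this assumed graph the waves are B and D, then C and F, then A, then E. A real scheduler could go one step further and start each element as soon as its own count reaches zero instead of waiting for the whole wave to finish.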

5.3 Information transfer between elements

If each element had to record information about other elements when it is initialized, the elements would be coupled in the implementation. To keep each element independent at execution time (that is, an element is initialized without knowing about the others), the main framework holds the global information and, when it initializes an element, passes in what that element needs. The element has no idea what the other elements are, but it has all the information it requires.

As an example, the orchestration framework maintains a global record of the parameters each element needs for initialization. The green ones are provided by the user, and the red ones are obtained automatically after the object they depend on has been created. For example, if creating a VM requires the VPC ID, the VPC ID is known once the VPC has been created.

So after D is initialized, C is ready to be initialized. At this point all the arguments for creating C should hold valid values, and no information is missing when C’s initialization API is called. In this way, C’s creation and destruction APIs are implemented independently and deal only with the C service itself.

As shown in the figure above, when developing a new service you only need to know about the new service itself; all the information it needs (either requested directly from the user or obtained through dependencies) is managed and delivered by the framework.

This is our plug-in framework. It makes adding a service very easy, because the driver for each service is developed completely independently.
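A minimal sketch of this information flow is shown below, assuming a framework that keeps one global record of parameters and of the outputs of elements already created. All names here (`SPEC`, `DRIVERS`, `run`, the parameter names and IDs) are hypothetical, and the two drivers are stand-ins for real API calls.

```python
from typing import Callable, Dict, List

# Global record kept by the framework: parameters supplied by the user and
# parameters taken from the outputs of the elements an element depends on.
SPEC = {
    "vpc": {"user": {"cidr": "10.0.0.0/16"}, "from_deps": {}},
    "vm": {"user": {"flavor": "small"}, "from_deps": {"vpc_id": ("vpc", "id")}},
}


def create_vpc(params: Dict[str, str]) -> Dict[str, str]:
    return {"id": "vpc-001"}  # stand-in for a real API call


def create_vm(params: Dict[str, str]) -> Dict[str, str]:
    # The VM driver sees only its own parameters, never the VPC element.
    return {"id": "vm-001", "vpc_id": params["vpc_id"]}


DRIVERS: Dict[str, Callable[[Dict[str, str]], Dict[str, str]]] = {
    "vpc": create_vpc,
    "vm": create_vm,
}


def run(order: List[str]) -> Dict[str, Dict[str, str]]:
    """Create elements in dependency order, wiring outputs into inputs."""
    outputs: Dict[str, Dict[str, str]] = {}
    for name in order:
        spec = SPEC[name]
        params = dict(spec["user"])
        for param, (dep, field) in spec["from_deps"].items():
            params[param] = outputs[dep][field]  # filled in by the framework
        outputs[name] = DRIVERS[name](params)
    return outputs


outputs = run(["vpc", "vm"])
```

Because only `run` knows how outputs map to inputs, a new driver can be written and tested against a plain parameter dictionary without touching any other driver.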

5.4 Plug-in Design

5.4.1 Life cycle of an element

From the perspective of the orchestration system, each cloud service object is an element. When an element is added to the orchestration system, it must provide basic operations such as create, delete, update and query. The plug-in management framework invokes the element’s API according to the user’s action, such as create or destroy.

With the element execution framework from the previous section in place, adding an orchestration object only requires implementing the drivers for that element’s operations. For example, as long as methods (APIs) to create and destroy VMs exist, an EC2 service can be added as an orchestration element and used in templates. The orchestration framework simply treats it as an ordinary element.
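One possible shape for such a plug-in contract is sketched below. The class name `ElementDriver`, the `register` decorator and the in-memory VM driver are assumptions made for illustration; the article does not specify the prototype’s actual interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class ElementDriver(ABC):
    """Hypothetical contract that every orchestration element implements."""

    @abstractmethod
    def create(self, params: dict) -> dict: ...

    @abstractmethod
    def query(self, resource_id: str) -> dict: ...

    @abstractmethod
    def update(self, resource_id: str, params: dict) -> None: ...

    @abstractmethod
    def delete(self, resource_id: str) -> None: ...


REGISTRY: Dict[str, Type[ElementDriver]] = {}


def register(type_name: str):
    """Class decorator that makes a driver known to the framework."""
    def decorator(cls: Type[ElementDriver]) -> Type[ElementDriver]:
        REGISTRY[type_name] = cls
        return cls
    return decorator


@register("vm")
class VmDriver(ElementDriver):
    """In-memory stand-in; a real driver would call the cloud API."""

    def __init__(self) -> None:
        self.vms: Dict[str, dict] = {}

    def create(self, params: dict) -> dict:
        vm_id = f"vm-{len(self.vms) + 1}"
        self.vms[vm_id] = dict(params)
        return {"id": vm_id}

    def query(self, resource_id: str) -> dict:
        return self.vms[resource_id]

    def update(self, resource_id: str, params: dict) -> None:
        self.vms[resource_id].update(params)

    def delete(self, resource_id: str) -> None:
        del self.vms[resource_id]


# The framework looks a driver up by the type name used in the template.
driver = REGISTRY["vm"]()
created = driver.create({"flavor": "small"})
```

The framework depends only on the four methods, so supporting a new service means registering one more class.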

5.4.2 User-defined Plug-ins

Because every element driver in the plug-in framework is independent, and because Kubernetes also lets users define custom resources, we can design an element plug-in that lets users define their own K8S orchestration objects. The “information” provided by the user is passed to the underlying API intact, and the underlying system interprets it. The orchestration system is reduced to flow control plus an information transfer channel.

5.4.3 Operation wait & Progress

As mentioned above, some cloud service operations are very time-consuming. Without intuitive feedback on overall progress, the user experience is poor and the whole process appears to hang. So element drivers must report progress and waiting status so that the orchestration framework is aware of execution progress. The user can then see which element is currently executing and how far it has got, and the orchestration process as a whole can respond to the user in the most direct and friendly way.
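One simple way for a driver to report progress is to be written as a generator that yields a percentage while it works. The sketch below assumes this design; the function names and the fixed progress values are hypothetical, and a real driver would poll the cloud API between steps.

```python
from typing import Callable, Iterator, List, Tuple


def create_database() -> Iterator[int]:
    """Hypothetical slow driver: yields percent complete as work proceeds."""
    for percent in (10, 50, 100):
        # A real driver would poll the cloud API here and sleep between polls.
        yield percent


def run_with_progress(
    name: str,
    driver: Callable[[], Iterator[int]],
    report: Callable[[str, int], None],
) -> None:
    """Run one element and forward every progress update to the caller."""
    for percent in driver():
        report(name, percent)


# Collect the updates; a UI would render them as a progress bar instead.
events: List[Tuple[str, int]] = []
run_with_progress("rds", create_database, lambda n, p: events.append((n, p)))
```

Since every update carries the element name, the framework can always tell the user which element is running and how far it has got.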

5.5 TOSCA model

With the orchestration framework and plug-in framework in place, what remains is the configuration file syntax. The main reference syntaxes are AWS CloudFormation and TOSCA. AWS CFN is centered on resource initialization. TOSCA is a specification that aims to standardize how we describe software applications and everything required for them to run in the “cloud”, so TOSCA is more application-oriented. Given the popularity of container technology, more and more applications ship as stand-alone containers, which reduces the need for traditional VMs. We feel that TOSCA is a good choice for template syntax.

In fact, as the automation process shows, template syntax is not the key point. As long as the process can be automated, how the template is written makes little difference; what matters is the automation capability. It is like choosing a programming language: whether you pick Java or Go, writing a binary tree traversal does not depend on whether the loop is a for or a while. The main difference between programming languages lies in their built-in functions and libraries, so the goal is to provide rich automation conveniences in the template syntax. AWS, with its many built-in functions, is a good model to learn from.

6 Summary

On the cloud, automation is simply a necessity. Only when the foundation of automation is in place can a complete cloud ecosystem be built. Orchestration, as an advanced automation capability, is what brings the cloud ecosystem to its full potential, and it is the hard currency by which a cloud vendor’s strength is measured.

The Huawei PaaS team has years of exploration and experience in the cloud, especially in automation and orchestration on PaaS. We hope to share this with the industry and promote the development of cloud orchestration together, to bring a better experience of using the cloud and to make cloud automation as ubiquitous as the cloud itself.

If you have good ideas or suggestions, feel free to share them with us.

Author: Huawei PaaS architect Tang

