Introduction: In a traditional software architecture, compute nodes, storage, and network resources must be provisioned independently of the business-layer code, after which the operating system is installed and configured. The essence of cloud services is the softwarization of the IT architecture and the intelligence of the IT platform: hardware resources are defined in software and their operation interfaces are fully abstracted and encapsulated, so any resource can be created, deleted, modified, or queried directly by calling the relevant APIs.

Author: Yuan Yin | Source: Alibaba Tech official account

I Background

On July 9, 2018, I joined Alibaba Cloud as a campus hire and began my career. I was fortunate to take part in the full design, development, and testing of the Resource Orchestration Service from 1.0 to 2.0, which shaped my understanding of cloud services. This article distills my thinking and observations from that design and development process.


Relying on Alibaba Cloud's full abstraction of resources and its highly unified OpenAPI, it is possible to build a complete IT architecture on Alibaba Cloud and manage the life cycle of every resource. The customer provides a resource template describing what is required, and the orchestration service automatically creates and configures all resources according to the orchestration logic.

II Architecture Design

As business scenarios multiplied and business scale grew exponentially, the original architecture gradually exposed problems such as coarse-grained tenant isolation, low concurrency, and heavy dependence on downstream services. Refactoring the service architecture became urgent; topology design, concurrency model design, and workflow design were the three most important aspects.

1 Topology Design

The core problems of topology design are clarifying the product form and user demands and settling the data path. From the product perspective, two points matter: 1. resource ownership (service-owned resources [the billing unit] versus user-owned resources); 2. resource access rights (isolation, authorization). From the user perspective, two more: 1. service type (a web service requires public network access, while data computing needs only intranet access); 2. data access (source data, destination data).

Resource owners fall into service accounts and user accounts. The mode in which resources belong to the service account is also called the big-account mode. Its advantages: 1. stronger control; 2. simpler billing. Its bottlenecks: 1. resource quotas; 2. flow control on the interfaces of dependent services. Clearly, fully hosting every resource is unrealistic. Resources such as VPC, VSwitch, SLB, and SecurityGroup need to connect to other systems and are usually provided by users, while ECS instances are well suited to creation under the big account.

Multi-tenant isolation is a key issue in big-account mode: a single user's resources must be able to reach one another, while no user may trespass into another's. A common example: if the ECS instances of all users sit in the same service VPC, instances in that VPC can reach each other by default, which creates security risks.

Our design for these problems: ECS instances are created in big-account mode inside a resource VPC under the service account, and access between different users' instances is isolated through enterprise-level security groups. To access user data (such as NAS and RDS), the user provides the VPCs and VSwitches where the access points are located; we then create ENIs on the instances and bind them to the user VPCs to reach the data. The specific data path is shown in the figure.

Common service architectures

2 Concurrency Model Design

The core of the concurrency model design is to address high concurrency, high performance, and high availability.

The main indicator of high concurrency is QPS (queries per second). For orchestration logic that often runs for minutes, a synchronous model cannot support highly concurrent requests. The main indicator of high performance is TPS (transactions per second). When orchestrating from a user's resource template, resources depend on each other to some extent; creating them linearly leaves the service busy-waiting for long stretches and severely limits throughput. The main indicator of high availability is the SLA (Service Level Agreement). If the CRUD operations can be decoupled from internal services in a highly available way, the impact on the SLA can be reduced when a service upgrade or exception occurs.

Our design for these problems: after a simple parameter check at the service front end, the user template is written to the persistence layer, and a resource ID is returned as soon as the write succeeds; the persisted template is then treated as an unfinished task awaiting scheduling. We periodically scan the table for such tasks, create resources and synchronize their states in order, and return immediately whenever a resource's state does not yet allow the task to advance. After several rounds of processing, the desired state is finally reached. A simplified distributed model is shown in the figure.
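The write-then-reconcile flow above can be sketched as follows. This is a minimal stand-in, not the actual service: the in-memory store, the state names, and the single-step transitions are all illustrative assumptions.

```python
import uuid

PENDING, CREATING, RUNNING = "Pending", "Creating", "Running"

class TaskStore:
    """Stand-in for the persistence layer: templates become unfinished tasks."""
    def __init__(self):
        self.tasks = {}

    def submit(self, template: dict) -> str:
        # Cheap parameter check happens in the front end; then persist
        # and return the resource ID immediately, before any resource exists.
        task_id = "cls-" + uuid.uuid4().hex[:8]
        self.tasks[task_id] = {"template": template, "state": PENDING}
        return task_id

def reconcile_once(store: TaskStore) -> None:
    """One scan round: advance each task toward its desired state and
    return immediately when it cannot advance further this round."""
    for task in store.tasks.values():
        if task["state"] == PENDING:
            task["state"] = CREATING   # kick off async resource creation
        elif task["state"] == CREATING:
            task["state"] = RUNNING    # pretend the resources came up

store = TaskStore()
tid = store.submit({"instance_type": "ecs.g6.large", "count": 2})
reconcile_once(store)   # Pending -> Creating
reconcile_once(store)   # Creating -> Running
print(store.tasks[tid]["state"])  # -> Running
```

The point of the shape is that `submit` never blocks on resource creation; all slow work happens in later scan rounds.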

Distributed concurrency model

To avoid explicit locking when multiple nodes contend for the same task, we designed a task-discovery plus lease-renewal mechanism: once a node wins a task from the database pool, the task is added to that node's scheduling pool and given a lease, and a lease manager renews leases that are about to expire (serving as the lock). This ensures each cluster task is processed by only one node until the service is next pulled up. If the service restarts, the lease times out, and the task is automatically unlocked and claimed by another node.
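A minimal sketch of the claim-and-lease idea, with hypothetical task and node IDs and time passed in explicitly; the real implementation sits on a database with atomic compare-and-set, not an in-memory dict.

```python
class LeaseManager:
    """Sketch: a claimed task is locked to one node until its lease expires;
    a background renewer extends leases the node is still working on."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.leases = {}  # task_id -> (node_id, expiry)

    def try_claim(self, task_id: str, node_id: str, now: float) -> bool:
        holder = self.leases.get(task_id)
        if holder and holder[1] > now:          # lease still live: locked
            return holder[0] == node_id
        self.leases[task_id] = (node_id, now + self.ttl)
        return True

    def renew(self, task_id: str, node_id: str, now: float) -> bool:
        holder = self.leases.get(task_id)
        if holder and holder[0] == node_id:
            self.leases[task_id] = (node_id, now + self.ttl)
            return True
        return False

mgr = LeaseManager(ttl=30.0)
assert mgr.try_claim("cls-1", "node-a", now=0.0)       # node-a wins the task
assert not mgr.try_claim("cls-1", "node-b", now=10.0)  # locked against node-b
assert mgr.renew("cls-1", "node-a", now=20.0)          # renewer extends the lock
assert mgr.try_claim("cls-1", "node-b", now=60.0)      # node-a restarted: lease
                                                       # expired, node-b takes over
```

The expiry-based unlock is what makes restarts safe without any explicit unlock step.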

3 Workflow Design

The core of workflow design is to solve dependency problems, of which there are two kinds: a predecessor resource's state is not as expected, or the resource's own state is not as expected. Assume first that each resource has only two states, available and unavailable, and that an available resource never jumps back to unavailable. The simplest case is then a linear task, as shown in the figure. Since some resources can be allocated in parallel, the allocation can be treated as a Directed Acyclic Graph (DAG) task.
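The DAG view of a template can be sketched with Python's standard `graphlib`; the resource dependencies below are illustrative, not taken from a real template. Each batch contains resources whose predecessors are all available, so the whole batch can be created in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies among resources in one template:
# each key depends on the resources in its set.
deps = {
    "VSwitch": {"VPC"},
    "SecurityGroup": {"VPC"},
    "ECS": {"VSwitch", "SecurityGroup"},
    "ENI": {"VSwitch"},
}

ts = TopologicalSorter(deps)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())   # everything creatable in parallel now
    batches.append(ready)
    ts.done(*ready)                  # mark the batch available

print(batches)
# -> [['VPC'], ['SecurityGroup', 'VSwitch'], ['ECS', 'ENI']]
```

Compared with linear creation (five sequential steps here), the DAG schedule finishes in three rounds, which is exactly the throughput gain the text describes.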

Resource linear orchestration structure

But the world is not black and white, and neither are resource states. "Directed and acyclic" turns out to be wishful thinking; only a model that allows cycles matches how the real world operates. A simple workflow struggles to cover such complex processes, so we abstracted the workflow further and designed an FSM (Finite State Machine) that meets the requirements. A finite state machine sounds abstract, but the state transitions of an ECS instance are familiar; the ECS instance state transition model is shown below.

ECS instance state transition model

Combining the actual business requirements, I designed the cluster state transition model shown in the figure below. It simplifies the transition logic: Running is the only steady state, while the other three states (Rolling, Deleting, and Error) are intermediate. A resource in an intermediate state migrates toward the steady state according to the current resource status, each migration following a particular workflow.
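A minimal sketch of such a state machine, assuming the four states named above; the event names and the terminal Deleted state are illustrative additions, not taken from the actual model.

```python
# Running is the only steady state; Rolling, Deleting and Error are
# intermediate states that workflows drive back toward steady state.
TRANSITIONS = {
    ("Running", "update"):  "Rolling",
    ("Running", "delete"):  "Deleting",
    ("Rolling", "succeed"): "Running",
    ("Rolling", "fail"):    "Error",
    ("Error", "repair"):    "Rolling",   # a cycle: the graph is not acyclic
    ("Deleting", "succeed"): "Deleted",
}

def step(state: str, event: str) -> str:
    """Apply one event; reject transitions the model does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} + {event}")

s = "Running"
for event in ("update", "fail", "repair", "succeed"):
    s = step(s, event)
print(s)  # -> Running
```

Note the Error → Rolling edge: it is precisely this kind of cycle that a plain DAG workflow cannot express and the FSM can.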

Cluster state transition model

At this point, the overall architecture and design ideas of the service were basically settled.

III Core Competitiveness

The shortage of resources (ECS) is becoming increasingly serious, and coarse-grained scale-out/in and configuration upgrade/downgrade can no longer meet customer needs. Resource pooling, auto scaling, and rolling upgrade were put on the agenda and became powerful levers for product competitiveness.

1 Resource pooling

Resource pooling simply means reserving certain resources in advance for a rainy day; its obvious prerequisite is the big-account mode. Thread pools are nothing new to developers, but resource pools are relatively novel. Pooling addresses the high time cost of creating and deleting resources and the unpredictability of inventory. It rests on another assumption as well: the pooled resources will be used frequently and can be recycled (with relatively uniform specifications and configurations).

Since computing resources have long creation cycles and are often plagued by inventory problems, and the product was expected to grow, we designed the resource pooling model shown in the figure, abstracting the various computing resources and providing one set of processing logic that can handle heterogeneous resources.

Resource pooling model

Resource pooling greatly shortens the wait for resource creation and mitigates insufficient inventory. It also decouples the complex state transition logic from the upper-layer services that consume resources, simplifying resource states down to Available and Unknown. But two questions must still be considered:

  • Whether the creation of an ECS instance is limited by the user's resources (e.g. the availability zones open to ECS are limited by the VSwitches the user provides).
  • How to handle idle resources (the cost problem).

For the first question, there is currently no good solution as long as the VSwitches come from the customer; we can only ask customers to provide VSwitches covering more availability zones. If the VSwitches belonged to the service account, we could plan much better which AZs to build the resource pool in. For the second question, the resource pool is itself a kind of resource, and its cost control is answered by the auto scaling described next.
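The pooling idea can be sketched as follows; the low-watermark refill, the fake create function, and the instance naming are all illustrative assumptions rather than the product's actual logic.

```python
from collections import deque

class ResourcePool:
    """Sketch: instances are pre-created up to a watermark so acquisition
    is instant, and released instances are recycled instead of destroyed."""
    def __init__(self, create_fn, low_watermark: int):
        self.create_fn = create_fn
        self.low_watermark = low_watermark
        self.idle = deque()
        self.busy = set()

    def refill(self):
        # Background task: keep enough warm instances for a rainy day.
        while len(self.idle) < self.low_watermark:
            self.idle.append(self.create_fn())

    def acquire(self) -> str:
        # Warm path: pop an idle instance; cold path: fall back to creation.
        inst = self.idle.popleft() if self.idle else self.create_fn()
        self.busy.add(inst)
        return inst

    def release(self, inst: str):
        self.busy.discard(inst)
        self.idle.append(inst)     # recycle rather than delete

counter = 0
def fake_create_ecs() -> str:      # stand-in for the slow CreateInstance call
    global counter
    counter += 1
    return f"i-{counter:04d}"

pool = ResourcePool(fake_create_ecs, low_watermark=3)
pool.refill()
a = pool.acquire()                 # served from the warm pool, no waiting
pool.release(a)
print(len(pool.idle))  # -> 3
```

The idle-but-warm instances are exactly the cost question raised above: the watermark is what auto scaling would tune over time.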

2 Auto Scaling

The biggest attraction of cloud computing is cost reduction, and for resources the biggest benefit is pay-as-you-go. Almost all online services have peaks and valleys, and auto scaling solves the cost-control problem: it adds ECS instances when the customer's business grows to guarantee computing power, and removes them when the business declines to save cost, as shown in the figure.

Automatic expansion diagram

My design for auto scaling is to trigger scheduled tasks per time segment and configure a scaling policy for each segment. A scaling policy has two parts: the maximum and minimum ECS scales, which bound how far the cluster size may float within the period; and the monitoring metrics, tolerance, and step rules, which provide the criteria for scaling. The monitoring metrics are the interesting part: besides the CPU and memory utilization collected by Cloud Monitor, the ratio of working nodes can be computed by marking each ECS as idle or busy. Once a metric leaves the tolerance band, a scale-out or scale-in event is triggered according to the step size.
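One scaling decision under such a policy can be sketched as a pure function; the metric is assumed here to be a busy-node ratio, and all the numbers are illustrative, not product defaults.

```python
def desired_size(current: int, metric: float, target: float,
                 tolerance: float, step: int,
                 min_size: int, max_size: int) -> int:
    """If the monitored metric leaves the tolerance band around the target,
    move the cluster size by one step, clamped to [min_size, max_size]."""
    if metric > target + tolerance:
        current += step        # overloaded: scale out to guarantee capacity
    elif metric < target - tolerance:
        current -= step        # idle: scale in to save cost
    return max(min_size, min(max_size, current))

# Busy ratio 0.92 against target 0.70 +/- 0.10: scale out by 3.
print(desired_size(10, metric=0.92, target=0.70, tolerance=0.10,
                   step=3, min_size=2, max_size=20))  # -> 13
# Busy ratio 0.40: scale in by 3.
print(desired_size(10, metric=0.40, target=0.70, tolerance=0.10,
                   step=3, min_size=2, max_size=20))  # -> 7
```

The clamp to the policy's min/max range is what keeps a noisy metric from running the cluster to zero or past quota.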

3 Rolling Upgrade

Changing a customer's service architecture often involves complex rebuild logic, which inevitably affects service quality during the rebuild. Upgrading and downgrading gracefully and smoothly, without interrupting service, has become a hard requirement for many customers. Rolling upgrade solves exactly this: no service interruption, with an adjustable pace of replacement.

Rolling Upgrade Diagram

A simplified rolling upgrade process is shown in the figure above. The core of a rolling upgrade is grayscale: a certain proportion of standby resources is brought up and, once ready for service, the corresponding number of old resources is taken offline. After several rounds of rolling, all resources have been updated to the latest desired state; this redundancy is what makes the upgrade zero-downtime.
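The batch arithmetic of a rolling upgrade can be sketched as follows; health checking and traffic switching are reduced to a comment, and the node naming is illustrative.

```python
import math

def rolling_upgrade(nodes, batch_ratio: float):
    """Each round: launch a batch of replacement (standby) nodes, wait until
    they are ready to serve, then take the same number of old nodes offline.
    Repeat until every node runs the new version."""
    batch = max(1, math.ceil(len(nodes) * batch_ratio))
    upgraded, remaining, rounds = [], list(nodes), 0
    while remaining:
        old_batch, remaining = remaining[:batch], remaining[batch:]
        new_batch = [n + "-v2" for n in old_batch]   # launch replacements
        # ... health-check new_batch here before shifting traffic ...
        upgraded.extend(new_batch)                   # ready: serve traffic
        rounds += 1                                  # old_batch goes offline
    return upgraded, rounds

nodes = [f"i-{k}" for k in range(5)]
upgraded, rounds = rolling_upgrade(nodes, batch_ratio=0.4)
print(rounds, upgraded[0])  # -> 3 i-0-v2
```

`batch_ratio` is the adjustable pace mentioned above: a small ratio means more rounds but less capacity briefly held in reserve each round.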

IV Observability

Observability will become one of the core competitive strengths of cloud services, and it includes observability for users and observability for developers. To this day I remember the dread of being ruled by customer phone calls in the middle of the night, the helplessness of digging through massive logs, and the cluelessness that followed a customer complaint.

1 For Users

Yes, I hope that when users report problems, the information they provide is useful, ideally pointing straight at the heart of the problem. If users can query, directly through the API, the phases of resource orchestration and the state of each phase, the user experience improves greatly. To this end, I analyzed the system's processing flow and designed a "stage-event-state" running-state collector.

Specifically, this means splitting the business process into multiple processing stages, sorting out the events (resources and their states) each stage depends on, and structuring the possible states of each event (especially the abnormal ones). A typical example is shown in the code sample.

[
  {"Condition": "Launched", "Status": "True",
   "LastTransitionTime": "2021-06-17T18:08:30.559586077+08:00"},
  {"Condition": "Authenticated", "Status": "True",
   "LastTransitionTime": "2021-06-17T18:08:30.941994575+08:00",
   "LastProbeTime": "2021-06-18T14:35:30.592222594+08:00"},
  {"Condition": "Timed", "Status": "True",
   "LastTransitionTime": "2021-06-17T18:08:30.944626198+08:00",
   "LastProbeTime": "2021-06-18T14:35:30.599628262+08:00"},
  {"Condition": "Tracked", "Status": "True",
   "LastTransitionTime": "2021-06-17T18:08:30.947530873+08:00",
   "LastProbeTime": "2021-06-18T14:35:30.608807786+08:00"},
  {"Condition": "Allocated", "Status": "True",
   "LastTransitionTime": "2021-06-17T18:08:30.952310811+08:00"},
  {"Condition": "Managed", "Status": "True",
   "LastTransitionTime": "2021-06-18T10:09:00.611588546+08:00",
   "LastProbeTime": "2021-06-18T14:35:30.627946404+08:00"},
  {"Condition": "Ready", "Status": "False",
   "LastTransitionTime": "2021-06-18T10:09:00.7172905+08:00",
   "LastProbeTime": "2021-06-18T14:35:30.74967891+08:00",
   "Errors": [{"Action": "ScaleCluster", "Code": "SystemError",
               "Message": "cls-13LJYthRjnrdOYMBug0I54kpXum : destroy worker failed",
               "Repeat": 534}]}
]

Example code: Collect the cluster dimension status
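Building on a condition list like the sample above, a user-facing API can surface the first failing stage directly instead of pointing customers at raw logs. The shortened JSON and the `first_failure` helper below are illustrative, not part of the actual service.

```python
import json

# Abbreviated condition list in the shape of the sample above.
conditions_json = """[
  {"Condition": "Launched", "Status": "True"},
  {"Condition": "Allocated", "Status": "True"},
  {"Condition": "Ready", "Status": "False",
   "Errors": [{"Action": "ScaleCluster", "Code": "SystemError",
               "Message": "destroy worker failed", "Repeat": 534}]}
]"""

def first_failure(conditions):
    """Return a one-line summary of the first condition that is not True."""
    for cond in conditions:
        if cond["Status"] != "True":
            err = cond.get("Errors", [{}])[0]
            return f'{cond["Condition"]}: {err.get("Code")} - {err.get("Message")}'
    return None  # everything healthy

print(first_failure(json.loads(conditions_json)))
# -> Ready: SystemError - destroy worker failed
```

The `Repeat` counter in the sample plays a similar role: it shows at a glance that one error has recurred hundreds of times rather than flooding the user with duplicates.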

2 For Developers

For developers, observability consists of monitoring and logging. Monitoring can help developers check the running status of the system, and logging can help troubleshoot and diagnose problems. The product monitors and aggregates data from four dimensions: infrastructure, container service, service itself, and customer business. The specific components used are shown in the figure.

Monitoring and alarm systems at all levels

The infrastructure relies mainly on Cloud Monitor to track CPU and memory usage. The container service relies on Prometheus to monitor the K8s cluster where the service is deployed. For the service itself, we added traces at each running stage for fault location. For the hardest dimension, the customer's business, we collect usage through SLS, aggregate the data by UserId and ProjectId, and organize Prometheus dashboards to quickly analyze a given user's usage.

Beyond monitoring, we added Cloud Monitor alarms, Prometheus alarms, and SLS alarms, set different alarm priorities for the system and the services, and prepared emergency response plans for each.


This article is the original content of Aliyun and shall not be reproduced without permission.