Summary: This article shares best practices for the efficient use and management of large-scale cloud servers.

On October 22, 2021, Jia Shaotian, a senior technical expert at Alibaba Cloud, delivered a speech titled "Best Practices for Efficient Use and Management of Large-scale Cloud Servers" at the "Best Practices for Operation and Maintenance on the Cloud" sub-forum of the Cloud Computing Conference. This article is compiled from his speech. The following three parts introduce the best practices for efficient use and management of large-scale cloud servers:

  1. How to migrate to the cloud quickly
  2. How to build large-scale resource scenarios at low cost
  3. How to manage resources efficiently

01 How to Migrate to the Cloud Quickly

We divide cloud migration into four stages: the overall evaluation before migration, the migration process itself, migration verification, and the final business cutover. The Server Migration Center product we are presenting today helps you optimize the migration process and migration verification, making this part faster and more efficient.

There are three modes of migration:

◾ The first option is redeployment. You rebuild the original offline environment on the cloud step by step, which is not recommended in terms of ease of use, speed, or fidelity of restoration.

◾ The second option is image export. You export an image from your local environment according to Alibaba Cloud's image specifications and then upload it to Alibaba Cloud for use. The fidelity of the restored system is guaranteed, but this is not optimal in terms of ease of use or speed.

◾ The third option is Alibaba Cloud's Server Migration Center. All you need to do is download a client, run it locally, and create a migration task; the Server Migration Center then automates the migration for you.

What are the advantages of the Alibaba Cloud Server Migration Center?

◾ First, it is a highly mature product that supports a wide variety of operating system images used in the industry.

◾ Second, it is highly automated. One command starts the whole process, which runs unattended end to end. We provide APIs and a console so you can observe both the process and the results.

◾ Third, it is highly intelligent. From the start of the migration onward, any problems encountered trigger automatic repair work, making the whole process more efficient and smooth.

Users can also choose among multiple migration forms based on their own scenarios. We support both incremental and full migration to achieve complete synchronization between the online and offline environments, and users can choose from a variety of replication modes according to their situation.

Server Migration Center is a highly automated product that supports batch, multi-instance migration and can efficiently handle resource migration at any scale. If you encounter migration problems while moving to Alibaba Cloud, we strongly recommend this product.

02 How to Build Large-scale Resource Scenarios at Low Cost

How do you build large-scale server fleets at low cost? There are two key phrases here: low cost and large scale. Let's see how to use Alibaba Cloud ECS for the least money.

If you use ECS at scale, the first question is efficiency. For example, if a business peak arrives today and you need 1,000 machines, can they be delivered in the shortest possible time? Second, can those 1,000 machines be used at a lower cost? Third, can the machines be managed in an automated way that reduces human involvement and makes management and maintenance cheaper?

We recommend the ECS launch template feature. A launch template is a persistence tool for ECS configuration data: for any ECS instance created on Alibaba Cloud, you can save all of its configuration in a template. That configuration can then be used to create an instance quickly at any later time, with no reconfiguration required, and every change is managed through versioning. Even if you have never used it before, it is easy to start: you can quickly generate a launch template from any existing instance, and the template's configuration will match that instance's configuration.
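The save-and-version behavior described above can be sketched as a tiny local model. This is a toy illustration of the concept, not the real Alibaba Cloud SDK; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field

# Toy model of launch-template versioning: each save creates a new
# immutable version that can later be used to launch identical instances.
@dataclass
class LaunchTemplate:
    name: str
    versions: list = field(default_factory=list)  # list of config dicts

    def save_version(self, config: dict) -> int:
        """Persist a new template version and return its version number."""
        self.versions.append(dict(config))
        return len(self.versions)

    def latest(self) -> dict:
        """Return the most recently saved configuration."""
        return self.versions[-1]

tpl = LaunchTemplate("web-server")
v1 = tpl.save_version({"instance_type": "ecs.g6.large", "image_id": "aliyun_3_x64"})
v2 = tpl.save_version({"instance_type": "ecs.g6.xlarge", "image_id": "aliyun_3_x64"})
print(v2, tpl.latest()["instance_type"])  # → 2 ecs.g6.xlarge
```

Because old versions are kept, you can always roll a fleet back to a previous known-good configuration.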

Launch templates can be used in more ways than quickly creating instances. For example, suppose you need to build a highly elastic web application, such as an online web service with daily peaks: more resources are used at peak hours and fewer at off-peak hours. In this case, you can quickly create an Auto Scaling group from an existing launch template.

For example, it supports a scheduled mode: if the business peak starts at 8 a.m., scale-out can be scheduled for 8 a.m., and if the peak ends at 6 p.m., machines can be removed at 6 p.m. Second, there is a dynamic mode: add machines when CPU usage exceeds 50% and remove machines when it falls below 40%. Third, there is a manual mode, in which users trigger scaling activities themselves through their own locally built systems.
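The dynamic mode above boils down to a simple threshold rule. Here is a minimal sketch using the example thresholds from the talk (50% and 40% are illustrative values, not product defaults):

```python
def scaling_decision(cpu_percent: float) -> int:
    """Dynamic-mode rule: +1 instance above 50% CPU, -1 below 40%,
    otherwise hold steady. The gap between thresholds prevents flapping."""
    if cpu_percent > 50:
        return 1
    if cpu_percent < 40:
        return -1
    return 0

print(scaling_decision(72))  # → 1  (scale out)
print(scaling_decision(35))  # → -1 (scale in)
print(scaling_decision(45))  # → 0  (no change)
```

Note the 40-50% dead band: without it, load hovering around a single threshold would trigger constant scale-out/scale-in churn.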

In addition, if you want more comprehensive control over the whole process, we also provide lifecycle hook capabilities. For example, when the scaling group is about to release a resource and you find the instance still needs its log files backed up, you can use a lifecycle hook to reject the current scale-in action, and the scaling group will keep the resource. There is also a notification capability: any scale-out or scale-in event can be reported to you via DingTalk, SMS, or email. Scaling groups can also attach instances to SLB and RDS for you, helping users quickly build elastic web capabilities this way.

If you don't need continuous elasticity but simply need to use large-scale computing resources in bulk, say 1,000 machines, we recommend elastic provisioning groups. Elastic provisioning groups are designed for scenarios where large amounts of computing power must be delivered. For example, if you need 10,000 vCPUs, you can set the target capacity in vCPUs, and the system automatically determines how many instances must be created to reach that total. At the same time, you can choose whether to fill your demand with pay-as-you-go or spot instances, depending on your cost considerations.
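Capacity-based delivery is easiest to see with a small calculation. This is an illustrative sketch of the planning arithmetic, not the product's actual allocation algorithm:

```python
import math

def instances_for_vcpus(target_vcpus: int, vcpus_per_instance: int) -> int:
    """How many instances of a single spec are needed to cover a vCPU
    target. Rounds up so the target is always met or exceeded."""
    return math.ceil(target_vcpus / vcpus_per_instance)

# Filling the talk's 10,000-vCPU example with 16-vCPU instances:
print(instances_for_vcpus(10_000, 16))  # → 625
# With 8-vCPU instances the count doubles:
print(instances_for_vcpus(10_000, 8))   # → 1250
```

In practice a provisioning group mixes several instance specs, but the principle is the same: the user states capacity, and the system derives instance counts.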

In addition, we have multiple delivery strategies. In cost-optimized mode, the system always creates the lowest-priced instances, minimizing your cost; balanced mode spreads instances across multiple availability zones, improving the system's availability. To cover more scenarios, elastic provisioning groups provide three delivery modes to meet different needs: the maintain mode for continuous delivery, which keeps the amount of resources you need at all times, and the request and instant modes for one-time delivery. The instant mode can be understood as an upgraded version of the RunInstances API capability: where RunInstances supports only a single instance specification and a single availability zone, instant mode adds more comprehensive capabilities on that basis.
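The cost-optimized strategy reduces to picking the cheapest candidate specification at creation time. A minimal sketch, with made-up illustrative prices (not real quotes):

```python
# Hypothetical hourly prices for three candidate instance specs.
candidates = {
    "ecs.g6.large": 0.35,
    "ecs.c6.large": 0.31,
    "ecs.r6.large": 0.42,
}

def cost_optimized_pick(prices: dict) -> str:
    """Cost-optimized mode in miniature: always take the lowest price."""
    return min(prices, key=prices.get)

print(cost_optimized_pick(candidates))  # → ecs.c6.large
```

The real service re-evaluates prices for every creation, so the chosen spec can differ from one scale-out to the next as market prices move.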

Elastic provisioning groups make the delivery process smoother and more reliable.

If you use these elasticity capabilities to create resources, you can easily achieve a 99.9% scaling success rate and deliver 1,000 ECS instances per minute. On this foundation, you can quickly build your own elastic scenarios; even the fastest and most demanding extreme-elasticity scenarios can be built this way.

Earlier I mentioned lowering costs, that is, using these resources cheaply. A quick introduction: a spot instance is a pay-as-you-go instance with two characteristics. One is low price: its price floats between 10% of the pay-as-you-go price and the full pay-as-you-go price. The other is that it can be reclaimed: you set a maximum price based on what you can accept, and if your bid falls below the current market price, the instance may be released by the system. The key characteristic is cheap, but possibly released.

If your current business scenario is built fully or partly on pay-as-you-go instances, you can gradually try replacing some of them with spot instances. As the spot ratio rises, the cost approaches the minimum, in the best case about 10% of the pay-as-you-go price. At this point you will surely ask: if I use so many spot instances and a price change causes them to be released, won't my business suffer? On top of this we provide more capabilities to mitigate that problem.
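The cost effect of gradually raising the spot ratio can be sketched with simple arithmetic. Prices here are illustrative (in cents per hour, with spot at the 10% floor mentioned above); real spot prices fluctuate with the market:

```python
def blended_cost(n: int, spot_ratio: float, od_cents: int, spot_cents: int) -> int:
    """Blended hourly fleet cost when a fraction of n instances is spot
    and the rest is pay-as-you-go (on-demand)."""
    n_spot = round(n * spot_ratio)
    return n_spot * spot_cents + (n - n_spot) * od_cents

# 100 instances, pay-as-you-go at 100 cents/h, spot at 10 cents/h:
print(blended_cost(100, 0.0, 100, 10))  # → 10000 (all pay-as-you-go)
print(blended_cost(100, 0.5, 100, 10))  # → 5500  (half spot)
print(blended_cost(100, 1.0, 100, 10))  # → 1000  (all spot, ~10% of cost)
```

Each step up in spot ratio directly lowers the blend, which is why the talk suggests replacing pay-as-you-go instances incrementally rather than all at once.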

First, consider a business that runs entirely on a single spot instance specification: if that specification's price spikes, all of its instances are released and the whole business is affected. So we developed an optimization for spot scenarios that lets you specify multiple low-cost instance specifications when using spot instances, say 3, as shown on the left. By spreading across multiple instance specifications, you avoid the problem of a single specification being released all at once.

We also added a second capability: spot auto-replacement. Without it, when all spot instances are reclaimed you get a cliff-like outage on only 2 minutes' notice, and all services are damaged. With it enabled, our system automatically determines that replacement instances should be created about five minutes earlier: before the old instances are released, the replacements are already created and swapped in, so there is no longer a cliff-like anomaly in the middle. With both approaches, you can more confidently host business scenarios on spot instances while reducing overall resource costs.
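The timing argument above can be checked with a toy simulation: if replacements come up five minutes before the reclaim, serving capacity never dips below the fleet size. The minute values are illustrative; only the 2-minute reclaim notice and ~5-minute lead come from the talk.

```python
RECLAIM_AT = 60      # minute at which the spot instances are reclaimed
LEAD_MINUTES = 5     # replacements are launched this many minutes earlier

def capacity(minute: int, fleet: int = 10) -> int:
    """Serving capacity at a given minute: the old spot fleet until it is
    reclaimed, plus the replacement fleet once it is ready."""
    replacements_ready = minute >= RECLAIM_AT - LEAD_MINUTES
    reclaimed = minute >= RECLAIM_AT
    return fleet * ((0 if reclaimed else 1) + (1 if replacements_ready else 0))

# Capacity never drops below the fleet size of 10 across the whole window:
print(min(capacity(m) for m in range(0, 120)))  # → 10
# There is a brief overlap where both fleets run (the cost of smoothness):
print(capacity(57))  # → 20
```

Without the lead time (LEAD_MINUTES = 0), capacity would fall to 0 at minute 60 before replacements arrive, which is exactly the cliff the feature removes.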

Beyond these basic capabilities, there are also automation capabilities; here are a few examples. First, we provide scaling rules for Auto Scaling groups, of which there are several types.

◾ Simple scaling rules. For example: add four ECS instances when CPU usage exceeds 20%. This mode suits services that do not change frequently, and can be likened to a manual air conditioner.

◾ Step scaling rules. An enhanced mode built on simple scaling rules: you can define multiple intervals, each handled differently. Based on your own accumulated experience, you can decide how much capacity to add under each load condition in order to bear the business pressure. It is more flexible, like a semi-automatic air conditioner.

◾ Target tracking scaling rules. A fully automatic scaling capability: with this strategy you only specify the level at which the current load should be maintained. For example, if CPU usage should stay at 50%, the system automatically decides how many machines to add or remove. The whole process is smoother, with no human intervention.
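A step scaling rule from the list above is essentially a table of load intervals mapped to adjustments. A minimal sketch; the interval boundaries and step sizes are made-up examples of the kind of experience-based tuning the talk describes:

```python
# (lower_bound, upper_bound, instances_to_add) — higher load, bigger step.
STEPS = [
    (50, 70, 2),
    (70, 90, 4),
    (90, 101, 8),
]

def step_adjustment(cpu_percent: float) -> int:
    """Return the scale-out size for the interval the load falls into."""
    for lo, hi, add in STEPS:
        if lo <= cpu_percent < hi:
            return add
    return 0  # below all thresholds: no change

print(step_adjustment(65))  # → 2
print(step_adjustment(95))  # → 8
print(step_adjustment(30))  # → 0
```

Compared with a simple rule's single fixed step, this reacts proportionally: mild overload adds a little capacity, severe overload adds a lot in one move.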

Beyond these, we added a further rule: the predictive scaling rule. For any scaling group with predictive scaling enabled, we use machine learning models to learn the overall resource usage and load changes over the past 1 to 14 days. Based on the forecast, the system automatically generates hourly scheduled tasks for the scaling group to prepare resources in advance. This suits cyclical business scenarios well. For example, if your website is visited at a fixed time and scale every day, you can use this mode; once it is enabled, no manual intervention is needed.
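The idea behind predictive scaling can be illustrated with a deliberately naive forecast: predict each hour's load from the same hour on previous days, then emit an hourly scheduled capacity. This is a toy average, not the ML models the service actually uses; the traffic numbers and per-instance throughput are invented.

```python
# Requests per hour for the past 3 days (24 values per day):
# quiet nights, a 10-hour daytime peak.
history = [
    [10] * 8 + [80] * 10 + [10] * 6,
    [12] * 8 + [90] * 10 + [12] * 6,
    [11] * 8 + [85] * 10 + [11] * 6,
]

REQS_PER_INSTANCE = 20  # assumed capacity of one instance

def scheduled_capacity(hour: int) -> int:
    """Forecast the hour's load as the mean of past days at that hour,
    then size the fleet to carry it (at least one instance)."""
    forecast = sum(day[hour] for day in history) / len(history)
    return max(1, round(forecast / REQS_PER_INSTANCE))

print(scheduled_capacity(3))   # → 1 (off-peak hour)
print(scheduled_capacity(12))  # → 4 (peak hour)
```

Because the schedule is generated ahead of time, instances for the daily peak are already running when the traffic arrives, instead of being launched reactively.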

But what about sudden traffic in between: can that be predicted? Predictive mode can be layered on top of an existing target tracking mode and other modes. Prediction handles the daily periodicity, while target tracking responds to unexpected bursts. By superimposing multiple modes, you finally achieve an effective and stable result.

Next, I'd like to share the rolling upgrade feature, which mainly solves the release problems frequently encountered in daily work. With rolling upgrades, the system does the work automatically; all you need to set is how many machines to update per batch. Before a machine is updated, it enters a standby state and stops serving traffic; after the update, it exits standby and serves traffic again, and the process moves on to the next batch. You can also decide at each point whether to retry, roll back, or continue. Through this overall process, the release finally takes effect. In this way, the overall cost of releasing is reduced, and people can complete daily application releases more easily, without having to build their own release system.
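The batch loop described above can be sketched in a few lines. Here `update` is a hypothetical stand-in for the real deployment step; the point is the standby/serve bookkeeping that keeps most of the fleet in service at all times:

```python
def rolling_upgrade(instances, batch_size, update):
    """Update instances in batches: each batch leaves service (standby),
    gets updated, and returns to service before the next batch starts."""
    in_service = set(instances)
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        in_service.difference_update(batch)   # standby: stop serving
        for inst in batch:
            update(inst)                      # the actual release step
        in_service.update(batch)              # back in service
        # at no point are more than batch_size instances out of service
        assert len(in_service) >= len(instances) - batch_size
    return in_service

updated = []
done = rolling_upgrade(["i-1", "i-2", "i-3", "i-4"], 2, updated.append)
print(sorted(updated), len(done))  # → ['i-1', 'i-2', 'i-3', 'i-4'] 4
```

Retry and rollback fit naturally into this loop: if `update` fails for a batch, you either re-run the batch or re-apply the previous version to the machines already updated.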

With efficiency, low cost, and automation covered, let's look at two customer examples. The first is Mobvista, which runs its online advertising business on Auto Scaling. Since its final profit is advertising revenue minus resource cost, its resource cost is very important; it also uses a large volume of resources, so it uses the elasticity products. By combining pay-as-you-go and spot instances and enabling the spot auto-replacement mechanism, it keeps the overall cost at roughly 30-40% of the pay-as-you-go price.

The second example is Deep Potential, a company that builds artificial intelligence and molecular simulation algorithms. Its workloads are all interactive tasks: each run requires a lot of resources under strict cost control. In this scenario it chose an all-spot approach, keeping costs to a minimum while also setting a spot price ceiling for each run to ensure the overall cost boundary is never exceeded, ultimately satisfying its overall business scenario.

03 How to Efficiently Manage Resources

When you have more and more resources on Alibaba Cloud, what's the next step for managing them efficiently?

There are many scenarios for managing resources; only three are listed here: cost, efficiency, and security.

◾ Cost. When many teams are involved and resources are plentiful, how do you know which resources cost how much? How do you know the cost of each team's resources?

◾ Efficiency. How do you quickly reach resources and carry out daily operation and maintenance work efficiently?

◾ Security. As sub-accounts multiply, how do you control API call permissions between accounts to ensure security?

Here is Alibaba Cloud's recommended best practice: grouping resources with tags.

Suppose you have bought all kinds of resources on Alibaba Cloud, and these resources belong to different environments and different teams; for example, one team is the information department in the Beijing region, and that team uses a batch of production-environment resources. Looking at the resources alone, you cannot clearly tell which of them belong to the Beijing information department's production environment. But if you define region, department, and environment as tags and attach them to the instances, you can switch to a clear tag-based view that automatically groups your resources, even across products. You can group resources by one tag or by several, customizing to your scenario, and you can add up to 20 custom tags to a resource.
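The grouping described above is just filtering on tag key/value pairs. A minimal sketch using the example's region/department/environment keys (the resource IDs and tag values are invented):

```python
resources = [
    {"id": "i-a", "tags": {"region": "beijing", "dept": "info", "env": "prod"}},
    {"id": "i-b", "tags": {"region": "beijing", "dept": "info", "env": "test"}},
    {"id": "i-c", "tags": {"region": "shanghai", "dept": "ads", "env": "prod"}},
]

def filter_by_tags(resources, **wanted):
    """Return IDs of resources whose tags contain every requested pair."""
    return [r["id"] for r in resources
            if all(r["tags"].get(k) == v for k, v in wanted.items())]

# The Beijing information department's production environment:
print(filter_by_tags(resources, region="beijing", dept="info", env="prod"))  # → ['i-a']
# One tag gives a coarser grouping:
print(filter_by_tags(resources, env="prod"))  # → ['i-a', 'i-c']
```

The same filter logic underlies the billing, operations, and access-control uses discussed below: each of those systems just selects resources (or requests) by tag.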

Once resources are tagged, many things become easier: billing, operations, and security control are all straightforward with tag capabilities.

After grouping, you can easily do cost accounting and operations. Once the relevant tags are in place, you can open the Cost Center console and query the cost of all resources under a given tag. It shows expense details by month, by day, or by hour, enabling fast bookkeeping. If you need to look at resources where multiple sets of tags overlap, you can create a financial unit to turn on cost analysis; the financial unit's filter function supports binding costs to multiple tags. One thing to note here: tag billing data is entered on a T+1 basis, so if you add a tag to a resource, you will only see its billing data after T+1.

After tagging, you can go to the Operations Orchestration console for quick operations on resources, where you can find related operations such as sending commands, executing scripts, batch restarts, and batch renewals.

Similarly, after tagging you can go to the access control (RAM) console. There you attach policies that reference tag information: an API call must carry the required tag, and if it does not, the entire request is rejected. In this way, you can isolate permissions between accounts.
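The require-a-tag-or-reject behavior can be modeled in a few lines. This is a toy model of the described effect, not the real RAM policy engine or its policy syntax:

```python
# A policy that requires requests to carry a specific tag pair.
policy = {"required_tag": {"dept": "info"}}

def authorize(request_tags: dict) -> bool:
    """Allow the call only if every required tag pair is present on the
    request; anything missing or mismatched rejects the whole request."""
    return all(request_tags.get(k) == v
               for k, v in policy["required_tag"].items())

print(authorize({"dept": "info", "env": "prod"}))  # → True
print(authorize({"dept": "ads"}))                  # → False (wrong dept)
print(authorize({}))                               # → False (no tag: rejected)
```

Because the check is on the request rather than the caller's identity alone, sub-accounts can share one account yet only touch the resources their team's tag grants.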


This article is original content of Alibaba Cloud and may not be reproduced without permission.