The authors

Pengfei He, Tencent Cloud expert product manager, was previously the product manager and architect for container private cloud and TKEStack, and participated in designing container transformation schemes for Tencent Cloud internal businesses and external customers. He is currently responsible for the product design of cloud native hybrid cloud.

Hu Xiaoliang, Tencent Cloud expert engineer, focuses on the cloud native field. He is currently responsible for the design and development of the open source TKEStack community and hybrid cloud projects.

Preface

Hybrid cloud is a deployment mode. On the one hand, enterprises may choose hybrid cloud for reasons of asset aging, cost control, and reducing risk and vendor lock-in. On the other hand, enterprises can also obtain the comparative advantages of different cloud service providers through mixed deployment, letting the providers complement each other. Containers and hybrid cloud are a match made in heaven: standardized container packaging greatly reduces the coupling between the application runtime environment and the heterogeneous infrastructure of a hybrid cloud, making agile development and continuous delivery across multi-cloud/hybrid cloud easier for enterprises, and making standardized multi-region application management possible. The TKE container team provides a range of product capabilities for hybrid cloud scenarios, and this article introduces a product feature for burst traffic scenarios: third-party clusters elastically bursting to EKS.

Low-cost expansion

IDC resources are limited. When unexpected service traffic needs to be handled, the computing resources in the IDC may be insufficient, so using public cloud resources to absorb temporary traffic is a good choice. The common deployment architecture is as follows: create a cluster on the public cloud, deploy part of the workloads to the cloud, and route traffic to the different clusters based on DNS rules or load balancing policies.

In this mode the deployment architecture of the business changes, so a full evaluation is needed before adopting it:

  1. Which business workloads need to be deployed on the cloud, in whole or in part;
  2. Whether the services deployed on the cloud depend on the IDC environment, such as internal DNS, databases, and public services;
  3. How to present logs and monitoring data of on-cloud and off-cloud services in a unified way;
  4. How to schedule service traffic between on-cloud and off-cloud clusters;
  5. How the CD tooling adapts to multi-cluster service deployment.

Such a transformation investment is worthwhile for services requiring long-term multi-region access, but costly for services that only see burst traffic. Therefore, we developed the ability to conveniently use public cloud resources within a single cluster to handle sudden business traffic. EKS is Tencent Cloud's elastic container service, which can create and destroy large numbers of Pods in seconds; users only need to request Pod resources, without maintaining the availability of cluster nodes, which makes it very suitable for elastic scenarios. You only need to install the relevant plugin packages in the cluster to quickly gain the ability to scale out to EKS.

Compared with directly using virtual machine nodes on the cloud, this approach scales out and in faster, and we also provide two scheduling mechanisms to meet customers' scheduling priority requirements:

Global switch: at the cluster level, when cluster resources are insufficient, any workload that needs to create new Pods can create replicas on EKS;

Local switch: at the workload level, users can specify that a single workload keeps N replicas in the local cluster and creates any additional replicas on EKS.

To ensure that all workloads keep enough replicas in the local IDC, when the burst traffic has passed and scale-in is triggered, the EKS replicas on Tencent Cloud are removed first (this requires a TKE distribution cluster; please look forward to the upcoming article in this series on the TKE distribution).

In this mode the business deployment architecture does not change: cloud resources can be used elastically within a single cluster, avoiding a series of derivative problems such as rearchitecting the business, rebuilding CD pipelines, managing multiple clusters, and unifying monitoring and logging; and cloud resources are used on demand and billed on demand, greatly reducing user costs. However, to ensure the security and stability of workloads, we require the user's IDC to be connected to the Tencent Cloud VPC via a dedicated line, and users also need to evaluate applicability in terms of storage dependencies and latency tolerance.

EKS Pods can communicate with local cluster Pods and Nodes in Underlay network mode (routes for the local Pod CIDR need to be added in the Tencent Cloud VPC; refer to the route configuration documentation). The third-party cluster elastic-EKS capability has been open sourced in TKEStack; for usage details and examples, see the usage documentation.

Practical demonstration

Steps

Obtain the tke-resilience Helm chart:

```shell
git clone https://github.com/tkestack/charts.git
```

Configure VPC information:

Edit charts/incubator/tke-resilience/values.yaml and fill in the following information:

```yaml
cloud:
  appID: "{Tencent Cloud account APPID}"
  ownerUIN: "{Tencent Cloud account ID}"
  secretID: "{Tencent Cloud account secretID}"
  secretKey: "{Tencent Cloud account secretKey}"
  vpcID: "{ID of the VPC where EKS Pods are placed}"
  regionShort: {short name of the region where EKS Pods are placed}
  regionLong: {long name of the region where EKS Pods are placed}
  subnets:
    - id: "{ID of the subnet where EKS Pods are placed}"
      zone: "{availability zone where EKS Pods are placed}"
eklet:
  podUsedApiserver: {API Server address of the current cluster}
```
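For illustration, a filled-in values.yaml might look like the following; all IDs, keys, and addresses below are hypothetical placeholders, and the regionShort/regionLong values assume Tencent Cloud's region naming (e.g. bj / ap-beijing):

```yaml
cloud:
  appID: "1251234567"                # hypothetical account APPID
  ownerUIN: "100001234567"           # hypothetical account UIN
  secretID: "AKIDxxxxxxxxxxxxxxxxxxxx"
  secretKey: "xxxxxxxxxxxxxxxxxxxx"
  vpcID: "vpc-0abcd123"              # VPC where EKS Pods will be placed
  regionShort: bj
  regionLong: ap-beijing
  subnets:
    - id: "subnet-0abcd123"
      zone: "ap-beijing-3"
eklet:
  podUsedApiserver: "https://10.0.0.1:6443"  # API server address of the current cluster
```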

Install the tke-resilience Helm chart:

```shell
helm install tke-resilience --namespace kube-system ./charts/incubator/tke-resilience/
```

Confirm that the chart's Pods are running properly.
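A quick way to check is to list the plugin's Pods in kube-system (component names such as eklet and tke-scheduler are taken from the configuration above; the exact Pod names depend on the chart version):

```shell
kubectl get pods -n kube-system | grep -E "eklet|tke-scheduler|resilience"
```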

Create the demo nginx application ngx1.
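A minimal sketch of such a demo Deployment (illustrative only; any plain nginx Deployment works):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ngx1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ngx1
  template:
    metadata:
      labels:
        app: ngx1
    spec:
      containers:
        - name: nginx
          image: nginx:latest
```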

Effect demonstration:

Global scheduling

Since this feature is enabled by default, we first set AUTO_SCALE_EKS in kube-system to false. By default, ngx1 has 1 replica; we then adjust the replica count of ngx1 to 50.
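Assuming ngx1 is a Deployment labeled as in the sketch above, the scale-out and the resulting Pending Pods can be observed with kubectl:

```shell
# Scale ngx1 from 1 to 50 replicas; with AUTO_SCALE_EKS=false,
# replicas that exceed local IDC capacity stay in Pending state
kubectl scale deployment ngx1 --replicas=50
kubectl get pods -l app=ngx1 | grep Pending
```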

After setting AUTO_SCALE_EKS in kube-system back to true and waiting a short while, observe the Pod states: the Pods that were originally Pending are scheduled to the EKS virtual node (a virtual node whose name carries the eklet-subnet- prefix).
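The placement can be verified with the wide output, where Pods scheduled to EKS show the eklet virtual node in the NODE column:

```shell
kubectl get pods -l app=ngx1 -o wide
```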

Specified scheduling

Again, we adjust the replica count of ngx1 back to 1, then edit the ngx1 YAML to enable the local switch:

```yaml
spec:
  template:
    metadata:
      annotations:
        LOCAL_REPLICAS: "2"   # keep 2 replicas in the local cluster
    spec:
      schedulerName: tke-scheduler
```

Change the replica count of ngx1 to 3. Although there is no resource shortage in the local cluster, you can see that once the 2 local replicas are exceeded, the third replica is scheduled to EKS.

Uninstall the tke-resilience plugin:

```shell
helm uninstall tke-resilience -n kube-system
```

In addition, tke-resilience has been integrated into TKEStack, so users can install it through the interface of the TKEStack application marketplace.

Application scenarios

Cloud bursting

Scenarios such as e-commerce promotions and live streaming need to scale out a large number of temporary workloads in a short time. Because the resources are needed only briefly, reserving a large amount of capacity day-to-day for such short-term demand is bound to waste resources. With this feature, you no longer need to worry about resource preparation: relying only on the Kubernetes autoscaling capabilities, you can quickly create large numbers of workloads to safeguard the business, and after the traffic peak the Pods on the cloud are destroyed first, ensuring no resources are wasted.
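For instance, a standard HorizontalPodAutoscaler can drive such expansion; with the global switch enabled, replicas beyond local capacity land on EKS. A minimal sketch, assuming the ngx1 Deployment from the demo above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx1-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx1
  minReplicas: 1      # baseline kept in the local IDC
  maxReplicas: 100    # burst capacity, absorbed by EKS when local resources run out
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```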

Offline computing

In big data and AI scenarios, computing tasks also place highly elastic demands on compute power: to finish a task quickly, a large amount of compute is needed for a short period, but after the computation completes the capacity sits at low load, so resource utilization fluctuates heavily and resources are wasted. Moreover, because GPU resources are scarce, hoarding large numbers of GPU devices is not only very expensive but also brings many resource management problems, such as improving utilization, adapting to new card models, making continued use of older cards, and heterogeneous computing. The rich variety of GPU card types on the cloud gives users far more choice, and pay-per-use ensures zero resource waste: every penny is spent on real business needs.
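A minimal sketch of a batch task requesting a GPU (nvidia.com/gpu is the standard NVIDIA device-plugin resource name; the image, entrypoint, and counts are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-task
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: tensorflow/tensorflow:latest-gpu   # placeholder training image
      command: ["python", "train.py"]           # hypothetical entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1                     # request one GPU card
```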

Future evolution

  1. Multi-region support: applications can be deployed to multiple cloud regions, with scheduling by region;
  2. Cloud-edge collaboration: together with TKE-Edge, provide application deployment and scheduling policies for weak-network scenarios, removing the dependency on dedicated lines.