Introduction

As our department's business security work has expanded, Tencent Shuidi (Water Drop), a real-time risk control system supporting business security confrontation in scenarios such as CAPTCHA verification and financial advertising, faces increasingly strict real-time requirements for online tasks and a growing volume of business requests. To bring new business online quickly, to scale resources up and down rapidly, and to align with the company's push toward a fully containerized self-developed cloud, the Shuidi risk control platform began its own cloud transformation. This article summarizes our practice in migrating the Shuidi platform to the cloud, in the hope of providing a useful reference for other businesses doing the same.

Shuidi platform architecture

The Shuidi platform is a real-time risk control strategy platform for business security confrontation, offering high availability, high performance, and low latency. It provides a series of basic components for strategy personnel to construct strategy models, helping them quickly complete the construction, testing, and verification of a strategy model. The system architecture is shown below:

The Shuidi real-time risk control platform is composed mainly of a configuration processing module and a data processing module.

The configuration processing module is composed mainly of the front-end web page, CGI, Mc_srv, and ZooKeeper. Policy developers edit the policy model and create, launch, and update policy tasks on the Shuidi front-end page, and the completed policy model information is stored in ZooKeeper in JSON format through the CGI and Mc_srv interfaces. The data processing module uses an Agent to pull the policy information corresponding to different services from ZooKeeper.
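For illustration, here is a minimal sketch of how policy model information could be written to ZooKeeper as JSON, using the open-source go-zookeeper client. The node path and the PolicyModel structure are assumptions for the example, not the platform's actual schema:

```go
package main

import (
	"encoding/json"
	"time"

	"github.com/go-zookeeper/zk"
)

// PolicyModel is a hypothetical shape for the policy model information;
// the real JSON schema stored via Mc_srv is internal to the platform.
type PolicyModel struct {
	TaskID     string   `json:"task_id"`
	Name       string   `json:"name"`
	Components []string `json:"components"`
}

func main() {
	// Connect to the ZooKeeper ensemble (address is a placeholder).
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	model := PolicyModel{TaskID: "10023", Name: "captcha_policy", Components: []string{"freq_limit"}}
	data, err := json.Marshal(model)
	if err != nil {
		panic(err)
	}

	// Store the policy model as JSON under a per-task node (assumes the
	// parent path exists); the Agent on the data processing side pulls
	// these nodes for its services.
	if _, err := conn.Create("/shuidi/tasks/10023", data, 0, zk.WorldACL(zk.PermAll)); err != nil {
		panic(err)
	}
}
```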

The data processing module is composed mainly of Access, Engine, and external systems, with the core processing logic in the engine module. Each service starts independent engine instances to guarantee isolation between services. Business data requests are sent to the specified Polaris service address or IP:port address; the Access layer receives the request data and forwards it to the engine instance of the corresponding task according to the task number. When the strategy model contains components that access external systems, the engine queries those external systems with the request data.
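As a rough sketch of the Access-layer routing idea, the snippet below maps a task number to the Polaris service of that task's engine instances. The request shape, service names, and lookup table are illustrative assumptions, not the real module:

```go
package main

import "fmt"

// Request carries the task number used for routing; the real request
// format of the Access layer is internal.
type Request struct {
	TaskID  string
	Payload []byte
}

// taskEngines maps each task number to the Polaris service name of the
// engine instances dedicated to that task (tasks and engines are 1:N).
var taskEngines = map[string]string{
	"10023": "Production/shuidi.engine.10023",
}

// route returns the engine service for a request's task; the caller then
// resolves an instance via Polaris and forwards the payload to it.
func route(req Request) (string, error) {
	svc, ok := taskEngines[req.TaskID]
	if !ok {
		return "", fmt.Errorf("no engine registered for task %s", req.TaskID)
	}
	return svc, nil
}

func main() {
	svc, err := route(Request{TaskID: "10023"})
	if err != nil {
		panic(err)
	}
	fmt.Println("forward to", svc)
}
```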

Self-developed cloud migration practice

In the process of the Shuidi platform's cloud transformation, we first familiarized ourselves with and tested the TKE (Tencent Kubernetes Engine) platform, and sorted out the key problems affecting the migration of the service to the cloud:

  1. The Monitor monitoring system is not integrated with TKE, so continuing to use Monitor for indicator monitoring would require a great deal of manual intervention.
  2. Manually expanding or shrinking capacity in response to traffic bursts is cumbersome and slow.
  3. TKE supports Polaris routing rules, but the original CL5 (Cloud Load Balance 99.999%) has several problems.

In view of the key problems above, we carried out transformation and optimization of indicator monitoring, containerization, traffic distribution, and other aspects to ensure that the business services run smoothly on the cloud.

Indicator monitoring transformation

The Tencent Shuidi platform used the Monitor monitoring system to view system indicators and manage alarms, but during the migration we found many problems in the Monitor indicator system that affected the move to the cloud. To solve them, we switched indicator monitoring from the Monitor system to Zhiyan, Tencent's intelligent monitoring system.

Problems with the Monitor monitoring system

  1. The Monitor monitoring system is not integrated with TKE, so whenever the IP address of a container instance on the cloud changes, the new address has to be added to the Monitor system manually. Instance IP addresses change frequently in cloud scenarios, which makes this hard to maintain.
  2. In NAT network mode, Monitor cannot distinguish container instance indicators at the instance level, so indicators with the same attributes cannot be queried per instance.
  3. The Monitor monitoring system is inflexible: adding a new attribute requires applying for an attribute ID and then adjusting and updating the code.

Zhiyan indicator transformation process

Indicator reporting in the Monitor system consists mainly of an attribute ID and an attribute value. An attribute ID must be applied for in advance for each distinct indicator, and the platform integrates the Monitor SDK and instruments the code with reporting calls for the different attribute IDs. For example, the request volume indicator of each task requires its own pre-applied attribute ID.

In the Zhiyan transformation we integrated the Zhiyan Golang SDK into the platform implementation and converted the original Monitor reporting to Zhiyan calls. The most important part of the transformation is turning the single-attribute indicators of the Monitor system into multi-dimensional indicators, which requires a good understanding of Zhiyan's concepts of dimensions and metrics and of indicator design.

For example, to report per task, define a task dimension and attach the task number as the dimension value when reporting, as in the sketch below.
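Since the Zhiyan SDK is internal to Tencent, the sketch below only models the multi-dimensional reporting idea with placeholder types: one metric value plus a set of dimension labels, instead of one pre-applied Monitor attribute ID per task. All names here are illustrative, not the real Zhiyan API:

```go
package main

import "fmt"

// Metric models Zhiyan-style multi-dimensional reporting: a metric name
// and value plus dimension labels. The real Zhiyan Golang SDK differs.
type Metric struct {
	Name       string
	Dimensions map[string]string
	Value      float64
}

// report stands in for the SDK call that would send the data point to the
// Zhiyan backend; here it just prints the point.
func report(m Metric) {
	fmt.Printf("metric=%s dims=%v value=%v\n", m.Name, m.Dimensions, m.Value)
}

func main() {
	// One request for task 10023: the task number is a dimension value,
	// so no per-task attribute ID needs to be applied for in advance.
	report(Metric{
		Name: "task_request_total",
		Dimensions: map[string]string{
			"task_id":     "10023",
			"instance_ip": "10.0.0.12",
		},
		Value: 1,
	})
}
```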

Zhiyan metric and dimension design: in the Zhiyan indicator transformation, the most important thing is to understand the meaning of metrics and dimensions.

Metric: a measurement field used for aggregation or related calculations.

Dimension: an attribute of metric data, typically used to filter metric data by attribute. Dimension attributes can be abstracted uniformly from the metric data, such as instance IP, task number, component ID, and indicator status; attributes that cannot be abstracted into dimensions remain metric attributes. In the early stage of the Zhiyan transformation we did not design the dimensions properly, which made the choice of metrics and dimensions chaotic and inconvenient to extend and maintain later.

Zhiyan alarm notification optimization

After the Zhiyan indicator transformation was completed, we separated platform-side and business-side alarms. Alarms relevant to the business side are forwarded to them directly through the alarm callback mechanism, notifying the business side of abnormal situations promptly; this improves the timeliness of alarm handling on the business side and reduces the interference of business-side alarms on the platform team.
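A minimal sketch of this forwarding idea, assuming Zhiyan delivers alarms as JSON HTTP callbacks and that the business side exposes an endpoint to receive them; the route, payload handling, and addresses are all illustrative assumptions:

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

// businessCallback is a placeholder for the business side's alarm endpoint.
const businessCallback = "http://business-side.example.com/alarm"

// forwardAlarm relays a business-relevant Zhiyan alarm callback directly to
// the business side, keeping the platform team out of the notification path.
func forwardAlarm(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if _, err := http.Post(businessCallback, "application/json", bytes.NewReader(body)); err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/zhiyan/alarm", forwardAlarm)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```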

We also optimized the Dashboard display of the Zhiyan indicator views, consolidating the commonly used views into one Zhiyan Dashboard page so that operations personnel can quickly check the key indicators.

Route distribution optimization

Route distribution problems

1. When CL5 queries a SID for the first time, the query is likely to fail with error code -9998.

2. In NAT network mode, the CL5 SDK cannot route requests to the nearest instance, so data requests to services in remote regions are prone to timeouts.

Migration to Polaris (Tencent's service discovery and governance platform)

With CL5 request routing, when a container instance uses NAT mode the CL5 interface cannot obtain the IP address of the underlying physical machine, so nearest access for request data fails. After the load balancing API was switched from CL5 to the Polaris service interface, Polaris can obtain the IP information of the physical machine hosting the container instance even in NAT network mode, so nearest access works.

The load balancing code was rewritten against Polaris-Go (the Golang version of the Polaris SDK). The service discovery and call reporting logic is as follows:

```go
// ********************** Get a service instance **********************
// Build a single-instance request object
getInstancesReq := &api.GetOneInstanceRequest{}
getInstancesReq.FlowID = atomic.AddUint64(&flowId, 1)
getInstancesReq.Namespace = papi.Namespace
getInstancesReq.Service = name
// Service discovery
getInstResp, err := consumer.GetOneInstance(getInstancesReq)
if nil != err {
    return nil, err
}
targetInstance := getInstResp.Instances[0]

// ********************** Report the service call **********************
// Build the service call result report
svcCallResult := &api.ServiceCallResult{}
// Set the called instance information
svcCallResult.SetCalledInstance(targetInstance)
// Set the call status: success or failure
if result >= 0 {
    svcCallResult.SetRetStatus(api.RetSuccess)
} else {
    svcCallResult.SetRetStatus(api.RetFail)
}
// Set the return code of the call
svcCallResult.SetRetCode(result)
// Set the latency of the call
svcCallResult.SetDelay(time.Duration(usetime))
// Report the call result
consumer.UpdateServiceCallResult(svcCallResult)
```

Containerization transformation

According to the Shuidi platform architecture diagram, after the business side creates different tasks on the platform, the platform starts different engine instances for the computation of the corresponding tasks. The relationship between tasks and engine instances is 1:N, so the more tasks there are, the more engine instances need to be deployed online. To bring the engine instances of different tasks online quickly, the engine instances corresponding to a task must be deployable rapidly; containerizing the engine instance module and moving to the self-developed cloud improves this operational efficiency.

As the request volume changes, the data processing module of the Shuidi platform needs to scale its Access and Engine instances, so both are expanded and shrunk frequently. The data processing module architecture is shown below:

Physical server deployment

  1. Task creation: for a new task, apply for the corresponding Polaris name service address, deploy and start the engine processes of the task on different physical machines, and manually bind the engine instances to the Polaris name service. Processes have to be started and managed by hand and the corresponding load balancing service added and modified manually, so management is complex and O&M costs are high.

  2. Task upgrade: upgrading a task means upgrading all engine processes corresponding to the task and restarting them.

  3. Task scaling: to expand a task, deploy the engine process on a physical server, start it, and add the new process instance to the Polaris name service. To shrink a task, remove the processes being retired from the Polaris name service and then stop them. The scaling procedure is similar to the upgrade procedure.

TKE platform deployment

  1. Task creation: when adding a task, apply for the Polaris name service address corresponding to the new task, then create an engine application instance on the TKE platform.

  2. Task upgrade: upgrade the engine instance version.

  3. Task scaling: configure a Horizontal Pod Autoscaler (HPA) on the TKE platform page so that application instances are expanded and shrunk automatically, as in the sketch below.
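On TKE this is configured in the console, but the underlying object is a standard Kubernetes HPA. As a hedged sketch, an equivalent object could be created with client-go as follows; the kubeconfig path, workload name, namespace, replica bounds, and CPU threshold are placeholder assumptions:

```go
package main

import (
	"context"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load cluster credentials from a local kubeconfig (path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	minReplicas, cpuTarget := int32(2), int32(60)

	// HPA for a hypothetical "engine" Deployment: keep 2-10 replicas and
	// scale out when average CPU utilization exceeds 60%.
	hpa := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "engine-hpa", Namespace: "default"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "engine",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &cpuTarget,
					},
				},
			}},
		},
	}

	if _, err := clientset.AutoscalingV2().HorizontalPodAutoscalers("default").
		Create(context.TODO(), hpa, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```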

Experience in improving cloud native maturity

1. Divide services by service type: create Pods with large cores for CPU-intensive services and Pods with small cores for I/O-intensive services (the latter currently handle mainly bursty-traffic workloads, where network buffers easily become the bottleneck). A single Pod for a bursty-traffic service uses 0.25 core or 0.5 core.

2. Container services use the HPA mechanism, and when a service is onboarded its estimated CPU and memory consumption is used to set the Request values of the Pod service, as sketched below.
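For example, the Request and Limit values of a small-core Pod for a bursty, I/O-intensive service could be declared with the Kubernetes Go types like this; the numbers are illustrative placeholders, not the platform's actual settings:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Estimated consumption measured during service onboarding; the values
	// here stand in for a 0.25-core, bursty-traffic Pod.
	engineResources := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("250m"), // 0.25 core
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
	}
	fmt.Printf("%+v\n", engineResources)
}
```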

Self-developed cloud migration results

Migrating the Shuidi platform to TKE brought the self-developed platform substantial efficiency gains, mainly in the following aspects:

  1. Applying for resources is simpler and faster after migrating to the cloud. Previously, migrating VMs, applying for virtual IP addresses, and transferring machines took about one week; now the resource application period is measured in hours.
  2. Machine resource utilization increased by 67%: CPU utilization was about 36% before the migration and 59.9% after.
  3. With the HPA mechanism, sudden traffic no longer requires manual scaling, shortening the response period from 15 minutes to one or two minutes.
  4. The online deployment period for a business policy was shortened from 2 hours to 10 minutes.