Background

The QQ Browser (QB) information-feed recommendation architecture supports real-time recommendation for QQ Browser, the Express main feed, floating-layer cards, and other feed scenarios. The architecture must not only support multiple businesses and products, such as QB, Express, and external partners, but also quickly support various types of scenarios, such as the main timeline and floating layers, and rapidly extend to vertical channels and apps. The information-feed recommendation architecture therefore needs to be flexible, modular, and easy to scale horizontally.

To deliver accurate real-time recommendations at this scale, the recommendation architecture is divided into four layers: a control layer, a ranking layer (fine ranking / coarse ranking), a recall layer, and an index layer. Alongside these layers, real-time user-profile, model, and feature systems accumulate user and item features in real time and feed them back into the recommendation pipeline, while the index layer filters articles and videos for precise recommendation. The detailed architecture module diagram is as follows:

As the diagram shows, the recommendation architecture contains a large number of modules, supports many business forms, and must handle hundreds of millions of users and tens of billions of features. A sound technical architecture system is therefore needed to develop, scale, manage, and maintain services and storage of this size.

Challenges

Challenge 1: Real-time

(1) Real-time update of features

(2) Online real-time learning of the model

Challenge 2: Massive scale

(1) User level: 100 million users

(2) Large feature scale: billions of features and billions of parameters (effectively unbounded)

(3) Large sample size: nearly 100 TB of fine-ranking samples per day

Challenge 3: High performance

(1) End-to-end latency must reach an industry-leading level.

Solution

Although the recommendation system is a large system with many modules and functions, we split it along microservice lines. The huge system is divided into a great number of microservice modules, each responsible for a relatively independent function and providing service through a remote RPC interface. Modules call one another, forming a complex call network that, as a whole, constitutes the recommendation system. After the microservice split, the related cloud-native technology stack is applied to ensure application stability and to improve resource utilization and R&D efficiency.

Containerization

In the cloud computing 1.0 era, the industry separated applications from hardware through virtualization. The emergence of containers marks the arrival of cloud computing 2.0, in which containers isolate applications at the operating-system level.

In the containerization era, Docker made isolated development, test, and continuous-integration environments a widespread practice, while K8s brought container applications into large-scale industrial production. Clustering also maximizes containers' strengths in isolation, resource allocation, and orchestration management. The biggest value containerization brings to an enterprise is savings in manpower and machine cost. The following figure compares virtualization and containerization in terms of stability, scalability, and resource utilization.

The Cloud Native Era

The concept of Cloud Native was first introduced by Matt Stine of Pivotal. The concept has kept evolving over time, and cloud-native architecture has become a hot technology in the Internet industry. Major companies at home and abroad have begun to push their businesses toward cloud native, which has significantly reduced IT costs and promoted enterprise development.

Introduction to Cloud Native Architecture

Cloud Native is a set of technical architectures and methodologies that embraces containerization, continuous delivery, DevOps, and microservices. Containerization is the best vehicle for microservices; continuous delivery reduces quality risk through frequent, rapid releases; DevOps automates release and deployment; and microservices, at their core, can be independently deployed, updated, scaled, and restarted.

The Cloud Native Computing Foundation (CNCF) maintains a complete list of cloud-native products and services in the CNCF Landscape. Its 1,381 projects make up the vast cloud-native world. The landscape is divided into 29 modules by function, belonging to 9 major categories, including:

  • Application Definition and Development
  • Orchestration and Management
  • Runtime
  • Provisioning
  • Platform
  • Observability and Analysis
  • Serverless


QB Cloud Native Architecture

The following figure shows the overall cloud-native architecture of the QB information-feed recommendation system. DevOps carries code/release/build management, pipelines, automated deployment, monitoring, CI/CD, and integration testing, while the test platform handles stress testing, interface testing, integration testing, and fault-analysis drills. The microservice framework provides convenient, effective, and stable service links through service discovery/routing, load balancing, dynamic configuration, and other functions. Data stores managed by the cloud platform, such as Redis, MySQL, MQ, and Flink, keep service data under unified, secure management and backup. The container service provides stateless automatic adjustment and parallel scaling based on cluster, container, image, and storage management to optimize machine-resource utilization. Daily link troubleshooting, deployment, and maintenance are tied together through configuration, log management, and alarm monitoring with timely response.

In terms of technology selection, the QB information-feed recommendation architecture uses the Polaris and tRPC frameworks for orchestration and management, the Rainbow platform for basic configuration management, the 123 platform for service deployment management, and Tianjige and 007 for request dyeing (tracing) and monitoring.

DevOps

The QB information-feed recommendation architecture services have migrated to the tRPC framework and the 123 platform. The 123 platform is a universal, open DevOps platform for development and operations that supports plug-in extension and customization of business features. It enables automatic service deployment and O&M and allocates service container resources sensibly to improve resource-pool utilization. In addition, by isolating deployment environments (such as production and testing), new versions can be deployed and released stably and safely.

For CI/CD, the Blue Shield pipeline is used to strictly standardize code style, specifications, quality, and commit-log conventions for Git submissions. The usability and stability of MR branches and the default mainline branch are guaranteed through compilation/build checks and unit-test pass-rate and coverage thresholds. Standardizing builds and image releases through the build pipeline also keeps service and code quality high during frequent continuous delivery.

Observability and analysis

Because the business service modules are large and complex, the links are long, and multiple businesses cooperate, controlling link stability or tracing problems in such a huge business and technical architecture becomes complicated and time-consuming without observability tooling. For the observability component we chose Tianjige, which is designed to solve exactly these problems: fault location, link sorting, and performance analysis. It provides multi-dimensional monitoring and correlation through logs, traces, and metrics. With Tianjige we can quickly dye (tag) requests to troubleshoot link problems, and perform traffic and performance analysis on services.

Configuration management

A service configuration management platform needs to provide configuration synchronization, version management, permission management, and similar capabilities. For these reasons, and because the tRPC framework also supports the Rainbow plug-in, we chose Rainbow as our service configuration management platform.

Cloud Native Applications

Traditional vs. Cloud Native Applications: an Overview

What is the difference between traditional and cloud-native applications? It shows mainly in several aspects: development model, architecture design, deployment model, and O&M model. The deployment and O&M of cloud-native applications are automated.

Traditional Applications

Traditional applications tend to have long life cycles and are often built as tightly coupled monoliths. They conform to specifications established at definition time, but those specifications are often drawn up long before the application is delivered. Many of the foundational applications a business runs on were never designed to provide a digital experience.

Traditional application development is mostly waterfall or incremental, with long time horizons, and has only recently become semi-agile. Applications pass through phases of development, testing, security compliance, deployment, and management, divided into functional areas, each handled by a different team with different roles and responsibilities, all communicating through a linear process.

For most traditional applications, the infrastructure is provisioned in advance for the application's peak capacity, and performance is improved by vertically scaling the server hardware.

Cloud Native Applications

Cloud-native applications focus heavily on R&D and delivery efficiency, which the information-feed business urgently needs. Development therefore requires a more agile, service- and API-based solution and a continuous-delivery strategy. The perspective has shifted from server-based to container-centric; the solution has shifted from a single, tightly coupled application to loosely coupled services that favor indirect communication. Delivery cycles are shorter and demand continuous, iterative delivery. The success of these solutions depends on DevOps collaboration between development and delivery teams; a more modular architecture; and a flexible infrastructure that scales out on demand, supports multiple environments, and keeps applications portable.

Cloud Native Application Development Paradigm

Given the many benefits of cloud-native applications, what should we focus on when developing them? The following sections cover microservices, name services, data management, configuration management, and log management.

Microservices

A microservices architecture develops a single application as a set of small services, each running in its own process and communicating with the others through lightweight mechanisms. These services are built around business functions and can be deployed independently through automated deployment. They require minimal centralized management and can be written in different programming languages and backed by different databases, achieving decentralized service governance. Keywords for microservices: accelerated delivery, high scalability, excellent flexibility, easy deployment, easy access, greater openness.

Why switch to a microservice architecture? The traditional information-feed recommendation architecture was server-based: fine ranking, coarse ranking, dispatch control, and other modules were deployed on the same physical machine, so one machine going down could make the whole related system unavailable. For O&M, when a single machine ran short of capacity, it could only be resized manually by the O&M team. Microservices emphasize low coupling and high cohesion. By cutting the business flow into complete, independent services, each service can be upgraded, gray-released, or reused as an independent component, with little impact on the application as a whole. Each service can be owned by a dedicated team; as long as the inputs and outputs are agreed, dependent parties can develop in parallel, and even the organizational structure becomes leaner. Communication costs fall and efficiency rises.
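To make the split concrete, here is a toy sketch of one such module: a "fine-ranking" service exposed behind an RPC-style HTTP interface. The endpoint, payload shape, and scoring rule are illustrative assumptions, not QB's actual protocol (which uses tRPC):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Toy "fine-ranking" microservice: one relatively independent function
# (scoring candidate items) exposed over an RPC-style HTTP interface.
class RankHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Stand-in for a real model: order items by a single "ctr" feature.
        ranked = sorted(body["items"], key=lambda it: it["ctr"], reverse=True)
        payload = json.dumps({"ranked": [it["id"] for it in ranked]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo's output quiet
        pass

def start_rank_service(port: int) -> HTTPServer:
    server = HTTPServer(("127.0.0.1", port), RankHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def call_rank(port: int, items) -> list:
    req = Request(f"http://127.0.0.1:{port}/rank",
                  data=json.dumps({"items": items}).encode(),
                  headers={"Content-Type": "application/json"})
    return json.loads(urlopen(req).read())["ranked"]

if __name__ == "__main__":
    server = start_rank_service(8901)
    print(call_rank(8901, [{"id": "a", "ctr": 0.1},
                           {"id": "b", "ctr": 0.7},
                           {"id": "c", "ctr": 0.3}]))  # ['b', 'c', 'a']
    server.shutdown()
```

Because the module only agrees on its input/output contract, the caller (e.g. the control layer) never needs to know how ranking is implemented, which is exactly what lets each service be upgraded or gray-released independently.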

Problems encountered, and shifts in thinking, during the microservice migration:

  • One codebase under version management, multiple deployments

Historically, the information-feed service stored code and resources under a single path in centralized SVN; if the central server was not backed up, data could be lost. To support the microservice migration and multi-person collaborative development, the Git version control tool was introduced, improving the productivity of parallel development. Each service shares one copy of benchmark code but can be deployed in multiple copies; it is common to deploy different versions of a service in the different release environments provided by the 123 platform for verification and grayscale testing.

  • Build and release system

A distributed compilation system accelerates source-code builds. The 123 service operation platform lets each team customize its own build and runtime images to meet various needs. It also records compilation and release history, so any historical version can be traced at any time.

  • Explicitly declare dependencies and manage them

When splitting the business's monolithic applications into microservices, maintaining build dependencies between services and keeping them clear is a top priority. The business first divides each application into finer-grained microservices by function and module and manages them in separate Git repositories. On top of this loose coupling, dependency-management tools such as Bazel, Maven, and Go Modules are introduced. With these productivity tools, dependencies become clearer, and the problems caused by inconsistent protocol versions in historical applications are solved.

  • Processes and concurrency

Developers and operators need to think in terms of a process model when designing the application architecture: assign tasks to different process types, such as Web and Worker processes, and design processes to be stateless and share-nothing. The application can then scale by replicating its processes, and stateless processes stay portable across computing infrastructures. As design thinking becomes more distributed, the whole system can scale horizontally simply by adding processes when pressure mounts.
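The stateless-process idea can be sketched as follows (an illustration, not QB's worker code): because workers hold no local state, scaling out is just starting more identical processes against the same queue, and the result does not depend on how many there are.

```python
import multiprocessing as mp

# Stateless Worker process: reads a task, computes, emits a result.
# No local state survives between tasks, so any worker can take any task.
def worker(tasks: mp.Queue, results: mp.Queue):
    while True:
        item = tasks.get()
        if item is None:  # poison pill: shut down cleanly
            break
        results.put(item * item)  # stand-in for real work (e.g. feature calc)

def run(num_workers: int, items):
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    for it in items:
        tasks.put(it)
    for _ in procs:
        tasks.put(None)
    out = [results.get() for _ in items]
    for p in procs:
        p.join()
    return sorted(out)

if __name__ == "__main__":
    print(run(1, [1, 2, 3]))  # [1, 4, 9]
    print(run(4, [1, 2, 3]))  # same answer with 4x the processes
```

Horizontal scaling then reduces to changing `num_workers` (or, on a platform, the replica count) without touching the application logic.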

Name service

Service discovery is a concept that has evolved along with computer architecture. In the early network era, computers located one another through a global text file, hosts.txt; because new hosts were added infrequently, the file's address list was maintained by hand. As the Internet grew, the number of hosts increased ever faster, demanding a more automated, scalable system, which led to the invention and widespread adoption of DNS.

Service discovery generally consists of three modules: the master (control node), the client, and the server. The server registers its current node information with the master; when a client needs to call the server, it obtains the server's node information from the master (or from already-cached data) and then makes the call. Their relationship is shown in the figure below:
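A minimal in-memory sketch of that master/registry role (hypothetical, not the Polaris API): servers register their address and refresh it with heartbeats, and clients resolve a service name to the set of live instances before picking one to call.

```python
import random
import time

class Registry:
    """Toy service-discovery master: name -> live instance addresses."""

    def __init__(self, ttl: float = 5.0):
        self.ttl = ttl  # instances expire if no heartbeat within ttl seconds
        self._instances = {}  # service name -> {address: last_heartbeat}

    def register(self, service: str, address: str):
        self._instances.setdefault(service, {})[address] = time.monotonic()

    def heartbeat(self, service: str, address: str):
        self.register(service, address)  # a heartbeat just refreshes the entry

    def resolve(self, service: str):
        """Return only the addresses whose heartbeat is still fresh."""
        now = time.monotonic()
        return [addr for addr, t in self._instances.get(service, {}).items()
                if now - t < self.ttl]

    def pick(self, service: str):
        """Client-side load balancing: random choice among live instances."""
        live = self.resolve(service)
        return random.choice(live) if live else None
```

The TTL-plus-heartbeat mechanism is what lets the address list track scaling, failures, and rolling upgrades automatically, instead of being edited by hand like hosts.txt.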

Now, microservice architecture is driving the next evolution of service discovery. With the rising popularity of containerized and cloud platforms, service lifecycles are measured in seconds and minutes. The flexibility of microservices pushes service discovery further still: automatic scaling, failures, and release upgrades mean address lists change dynamically. Today's container- and cloud-based microservice applications must handle dynamically changing service addresses. Tencent also offers a number of excellent internal platforms and components to choose from, such as CL5/L5/Polaris. Below are the similarities and differences between traditional and cloud-native applications with respect to name services:

Problems encountered, and shifts in thinking, when the business switched name services:

  • Port binding

In a non-cloud environment, Web applications are typically written to run inside an application container. Cloud-native applications, in contrast, do not rely on an external application container; they package the Web server library as part of the application itself and expose the service through port binding, listening for all requests sent to that port. Services connected to the 123 platform are automatically registered with the Polaris name service. For each service on the 123 platform, Polaris can resolve the service name to a specific instance (an IP, port, and related configuration). When a service goes offline or changes, the name service saves substantial manual effort.
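Port binding can be sketched in a few lines: the app bundles its own web server library and exports its service purely by listening on a port. Taking the port from a `PORT` environment variable is a common 12-factor convention assumed here for illustration, not a 123-platform contract:

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Self-contained web service: no external application container needed.
class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"pong"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging for the demo
        pass

def serve(port: int) -> HTTPServer:
    server = HTTPServer(("127.0.0.1", port), PingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    port = int(os.environ.get("PORT", "8902"))  # environment supplies the port
    server = serve(port)
    print(urlopen(f"http://127.0.0.1:{port}/ping").read().decode())  # pong
    server.shutdown()
```

Because the only interface to the outside world is the bound port, the name service can register that port under a service name and route callers to it without the application knowing anything about its deployment environment.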

Data management

Internet services, including information-feed services, inevitably use a variety of storage systems because of differing data volumes and access-speed requirements. With the rise of cloud native, traditional data management is gradually shifting to cloud storage. For example, Redis, a common NoSQL component, and MySQL, a relational database, have both gone through the transition from local storage to cluster storage to cloud storage.

Compared with traditional data management, cloud native data management has the following features:

  • Distributed

Users' storage is distributed across multiple machines, freeing them from capacity and resource constraints.

  • High availability

Each instance provides primary/secondary hot backup, automatic downtime detection, and automatic disaster recovery.

  • High reliability

Data is stored persistently and supports cold backup and restore.

  • Elastic scaling

The clustered storage system supports capacity expansion of a single instance without an upper limit; expansion does not interrupt service and is imperceptible to users.

  • Complete monitoring

A complete operations system is provided, including real-time traffic and feature monitoring with alarms; component metrics are monitored in real time, and alarms can be configured against thresholds.

The following is a distributed storage monitoring diagram:

Configuration management

The information-feed recommendation system is split into thousands of microservice modules, each deployed to multiple environments on the 123 operating platform. Traditional file-based configuration management has clear limitations here, which is why cloud-native configuration management systems emerged.

Problems encountered, and shifts in thinking, when switching configuration management systems:

  • Store the configuration in the environment

The information-feed service uses Rainbow as its core configuration management system. On the 123 platform, Rainbow exists in the form of a plug-in. Service configuration still differs across environments (such as pre-release and production); Rainbow and the 123 platform can isolate configuration per environment to meet the service's requirements.
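The store-config-in-the-environment idea can be sketched as follows (variable names like `REDIS_ADDR` are illustrative, not QB's real keys): the same artifact runs in every environment, and only the injected environment variables differ.

```python
import os

def load_config(env: dict = os.environ) -> dict:
    """Build runtime config purely from the injected environment."""
    return {
        "redis_addr": env.get("REDIS_ADDR", "127.0.0.1:6379"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "is_production": env.get("DEPLOY_ENV", "test") == "production",
    }

# Pre-release and production share one build; only the environment that the
# platform (e.g. the 123 platform) injects changes per deployment.
test_cfg = load_config({"DEPLOY_ENV": "test"})
prod_cfg = load_config({"DEPLOY_ENV": "production", "LOG_LEVEL": "WARN"})
```

Keeping all environment-specific values out of the codebase is what makes the isolation between pre-release and production purely a deployment concern.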

  • Environment parity

Different environments may have different configurations. To ensure continuous deployment and system stability, the differences between configuration files must be minimized. For the information-feed service itself, it is worthwhile to introduce DevOps to shorten deployment time and to keep configuration changes, as far as possible, in the hands of the developers.

Log management

In traditional log management, logs are stored as local files on the Web server, and a single log server's capacity is limited. When developers and operators need to analyze data, they can only log in to a specific machine to view a specific log, and the logs that one event triggers across related modules cannot be aggregated.

Under cloud-native log management, a service's logs should be an aggregated stream of events. The log collection system gathers the output streams of all running processes and backend services in chronological order; logs are appended to log files through pipelines, and developers and operators can follow the standard output stream in real time at a terminal. With the log collection and tracing component, the dyed logs produced by all modules involved in an event can be traced end to end. The information-feed business currently uses the Hawk-eye log system to analyze the various problems that may arise across the business life cycle.
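A minimal sketch of logs-as-event-streams (the field names and JSON-lines format are assumptions, not the Hawk-eye schema): each module writes one structured event per line to its output stream, and the collector stitches streams together across modules by trace id.

```python
import json
import sys
import time

def log_event(event: str, trace_id: str, stream=sys.stdout, **fields):
    """Emit one structured event as a single JSON line on the output stream."""
    record = {"ts": time.time(), "event": event, "trace_id": trace_id, **fields}
    stream.write(json.dumps(record) + "\n")

# Two modules on the same request link share a trace id, so the collector
# can reassemble the whole event flow even though the processes are separate.
log_event("recall.done", trace_id="req-123", candidates=500)
log_event("rank.done", trace_id="req-123", kept=20, latency_ms=35)
```

Because the application only writes to stdout and never manages log files itself, rotation, shipping, and aggregation all become platform responsibilities.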

Problems encountered, and shifts in thinking, when switching log systems:

  • Centralized remote log center
  • Centralized collection and unified management
  • Convenient log analysis capability

Resource Utilization Optimization

The Docker and K8s technologies the business uses can, in theory, effectively improve resource utilization. But under a complex business architecture with an expanding number of services, most applications still suffer from low resource utilization, mainly because tool platforms are imperfect and each business uses resources in its own way. How to improve resource utilization is therefore an important question for a cloud-native architecture.

Resource Waste Scenario

To address resource waste, let us first analyze how the business commonly uses resources; these scenarios point to some optimization directions.

Claiming more than is used

For example, a service may actually need only one CPU core, but the service owner, unable to estimate the application's usage accurately and wanting to guarantee stable operation, applies for far more, say four cores, to avoid problems later. This inevitably causes significant waste. As shown in the figure below, CPU usage stays around 1%-1.5%.

Peaks and troughs

Most businesses have peaks and troughs. In our business, for example, there is a peak after lunch and before bedtime, when users have more free time to consume the information feed, and a trough in the early morning while users are resting. As shown in the figure below, a crest appears around 10-11 PM.

Differences in application resource usage

Different service applications use resources differently across time periods, especially in information-feed services: online services carry high load with tight latency requirements during the day, while offline computing runs at high load in the early morning with loose latency requirements.

Optimization scheme

For the resource-waste scenarios above, utilization can be tuned both manually and automatically. The claim-more-than-use scenario can be fixed manually by resizing over-allocated resources to match real usage, but that approach depends heavily on developers' ability and willingness and is hard to land in practice. Optimization therefore needs to be automated, relying on developers as little as possible. The automated methods are as follows:

  • Elastic scaling based on HPA, CA, and similar mechanisms
  • Scheduling: K8s's resource-allocation mechanism automatically finds suitable nodes
  • Co-locating online and offline services

These automated capabilities are supported on TKE; see the TKE documentation for details.

Monitoring

Although automatic scale-out and scale-in can schedule resources effectively and improve utilization, the resource usage of the whole service still needs monitoring. Below is part of the service's monitoring; the monitoring data helps quickly identify optimization directions.

  • Overall Resource Profile

  • Service scaling thresholds

  • Module resource utilization list

Afterword

The cloud-native architecture system is a huge technical ecosystem, with mature components or active research in every technical direction, and the company also has Oteams building in related fields. As a business, we need to integrate with the platforms and Oteam capabilities; only by standing on the shoulders of giants can we fly higher. Finally, thanks to all the teams and individuals who have contributed to cloud-native technology.

Reference documentation

The road to cloud native apps: www.sohu.com/a/211846555…

The Twelve-Factor App: 12factor.net/zh_cn/