Abstract: Duan Liang of Huawei Cloud Live describes in detail the problems, challenges, and solutions that Huawei Cloud Video encountered in its cloud-native transformation.

With the development of cloud infrastructure services and edge-computing technologies, cloud-native architecture concepts and R&D practices are becoming more and more popular. The transition from traditional software architecture to cloud-native architecture will take time to mature. Drawing on experience and lessons learned, Duan Liang of Huawei Cloud Live describes in detail the problems, challenges, and solutions that Huawei Cloud Video encountered in its cloud-native transformation.

The talk is divided into three parts: the first two review the characteristics and architecture of cloud native, and the third covers how Huawei's cloud video business explored and practiced cloud native and what problems it ran into along the way. I hope this sharing helps anyone who wants to adopt cloud native.

I. The origins of Cloud Native

1. Industry description of Cloud Native

To review: in 2010, Paul Fremantle proposed the concept of cloud native. At first it covered only basic characteristics such as elasticity, distribution, and multi-tenancy. With practice and further development, Adrian Cockcroft (2013) and Matt Stine (2015) refined the key features one after another, gradually adding new ideas such as antifragility, DevOps, and continuous delivery. This was the early development of cloud native.

On the organizational side, the CNCF has set out the features and objectives of cloud native, and Gartner has also published important specifications.

2. Definition and requirements for cloud native

As early as 2016 and 2017, Huawei began defining and standardizing cloud-native terminology in internal documents so that different departments and services could align on a common vocabulary, covering (micro)servitization, elastic scaling, distribution, and more. The definition and scope of each of these key features are specified in detail.

3. Definition and key features of Cloud Native

We think cloud native should be divided into three parts. By Conway's Law, an organization's structure determines the systems it builds. In cloud-native practice, organizational change is what people feel most, especially the marked shift in the skills required of each person. For example, R&D engineers are no longer expected only to implement requirements as in the traditional model; now everything from front-end requirement discussion, requirement analysis, and development, to participating in testing and the grayscale process, and finally to going live and operating the service online, including monitoring and alarms, needs end-to-end attention. As a result, the demands on the R&D team's skills have increased.

Today I will focus on the architectural and engineering aspects of cloud native; every company, whatever its business, touches these two aspects, so they have the most reference value. The core of the architecture is the microservice architecture, followed by the elastic-scaling and distributed features. Engineering capability covers DevOps, continuous delivery, grayscale release, and so on. The ultimate goal is rapid, efficient deployment and scaling of cloud applications, and high availability of the whole service. That is the overview of cloud native; I will now share from both the architectural and engineering perspectives.

II. Basic features and architecture of Cloud Native

1. Definition and advantages of microservices architecture

The core of the cloud-native architecture is the microservice architecture. The primary characteristics of microservices are high cohesion and a single function: in the ideal state, a microservice does only one thing, and each microservice can be tested, deployed, and upgraded independently thanks to process isolation and an independent code base. Individual microservices are therefore small, flexible, and easy to manage. Once microservitization is in place, delivery and small-team communication can proceed quickly, because microservices communicate through APIs (application programming interfaces) with no constraint on development language, so collaboration between small teams becomes simpler and smoother. If a function point or microservice fails, the fault affects only that local service. In this way the overall availability of the system improves, and each microservice can evolve independently.
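
As a concrete illustration (not Huawei's actual code), the language-neutral API boundary between microservices can be sketched as a single-purpose service that accepts and returns only serialized JSON. The service name, request fields, and bitrate rule here are all hypothetical:

```python
import json

def transcode_service(raw_request: str) -> str:
    """Single-purpose microservice: return target renditions for a stream.
    The caller talks to it only through serialized JSON, never shared memory."""
    req = json.loads(raw_request)
    # Hypothetical rule: offer every standard rendition at or below the source height.
    renditions = [{"height": h, "bitrate_kbps": h * 2}
                  for h in (1080, 720, 480) if h <= req["source_height"]]
    return json.dumps({"stream_id": req["stream_id"], "renditions": renditions})

# Client side: any language could make this call over HTTP; the JSON
# contract is the only thing the two sides share.
response = json.loads(transcode_service(
    json.dumps({"stream_id": "s1", "source_height": 720})))
```

Because the contract is plain JSON over an API, the client team and the transcoding team can use different languages and release on independent schedules.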

2. Comparison between traditional software architecture and microservice architecture

Since microservices are so important, let's compare them more closely with the traditional architecture. Software in a traditional monolithic architecture is divided into modules; a complex system may have dozens of modules or more, each performing certain functions, with code-level interface calls or local API calls between them. For a simple architecture or a system with a single function, the monolith may well be more efficient in the initial stage: the whole system uses one code base, which is easier to manage during deployment and static inspection, and memory is shared, so calls are local and latency is lower. Those are its benefits. Under cloud native, by contrast, every function point is invoked through microservice APIs, so there is latency between calls.

So what are the benefits of a microservices architecture? Process isolation and complete decoupling between code bases mean that each microservice can evolve independently. A traditional monolithic architecture is certainly also designed with decoupled, functionally independent modules; however, as the architecture evolves, developers inevitably lose effective control, and the coupling becomes increasingly unclear. A change to one module then requires upgrades and changes to other modules, and can even constrain the technology stack; refactoring one module has an impact on another, making evolution of the entire system extremely difficult. In a service-oriented architecture these problems are much easier to solve: each service collaborates through APIs, which is more flexible and efficient. As the system scales up, efficiency does not decline, and availability and development efficiency are both preserved.

3. Take advantage of cloud infrastructure and platform services

There has also been a big change in how our architects and designers think about leveraging cloud infrastructure and platform services. Cloud-native software is built on top of the entire cloud, including compute, network, and storage resources and message queues. Existing cloud-service resources are used preferentially and are invoked in an orchestrated way, which benefits the availability of the whole system. In other words, each service and each application only needs to focus on the part it implements itself, rather than implementing every function. In a monolithic architecture, the common habit of grabbing some open-source code, software, or module whenever a feature is needed is not desirable; under cloud native, it is more focused and efficient to invoke other services instead.

4. Elastic scaling

Elastic scaling is also part of the essence of cloud native, and it mainly has to solve two problems: scaling out ("stretch") and scaling in ("shrink"). In the video business, scale is often uncontrollable: if a streamer hits a viral moment during a live broadcast, a large number of users flood in and business volume spikes. At that point our service needs to expand resources automatically; if operations staff only scale out after receiving a high-load alert, it may already be too late. That is the "stretch" side. The necessity of "shrink" is mainly about cost. Video traffic peaks every day: educational video services peak in the morning during school hours, while live game streaming peaks between 8 and 10 PM. Between peak and off-peak, business volume can differ by a factor of ten or even more than a hundred. If resources are not released automatically, they are wasted during idle periods; shrinking frees those resources for other services.
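
The "stretch fast, shrink carefully" policy described above can be sketched as a small decision function. The thresholds, cooldown window, and choice of CPU as the trigger metric are illustrative assumptions, not the product's actual configuration:

```python
def desired_instances(current: int, cpu_samples: list[float],
                      high: float = 0.7, low: float = 0.3,
                      cooldown: int = 3) -> int:
    """Return the target instance count given recent CPU utilization samples."""
    if cpu_samples and cpu_samples[-1] > high:
        return current + 1                # "stretch": react instantly to a spike
    # "shrink": only after load has stayed low for a full cooldown window,
    # so a brief dip does not evict viewers from their instances.
    if len(cpu_samples) >= cooldown and all(s < low for s in cpu_samples[-cooldown:]):
        return max(1, current - 1)        # never scale below one instance
    return current
```

The asymmetry is deliberate: scale-out is triggered by a single sample because latecomers must be served immediately, while scale-in waits for sustained low load because it can disrupt live sessions.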

5. Distribution

Distribution is a core concept of cloud native and is mainly used to improve availability. It has three parts. First, applications are distributed across multiple AZs (availability zones) and multiple regions, so that a failure in one place does not affect the others: the failure of a city's electrical grid, or a cut optical fiber, will not affect the availability of the overall service. Second, data distribution: important data must be deployed and stored synchronously across regions and AZs. Third, deployment across availability zones with overall scheduling. Taking today's sharing as an example, the entire media-processing pipeline is distributed across different regions; if the optical fiber in a certain city has a problem, the live-broadcast workflow and quality are unaffected. That is the benefit of distribution.
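
A minimal sketch of the multi-region idea, with hypothetical region names and latencies: route each request to the nearest healthy region, and fall back automatically when one goes down (e.g. because of a cut fiber):

```python
# Hypothetical region table: (name, nominal latency in ms from the user).
REGIONS = [("cn-east", 10), ("cn-south", 25), ("cn-north", 40)]

def pick_region(healthy: set[str]) -> str:
    """Choose the lowest-latency region that is currently healthy."""
    candidates = [(lat, name) for name, lat in REGIONS if name in healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates)[1]
```

Because the scheduler re-evaluates health on every decision, losing one region simply shifts traffic to the next-best one without manual intervention.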

6. High availability

In cloud native, availability is fundamentally different from the traditional model; the design philosophy is completely different. Cloud native means designing antifragile systems on top of unreliable, disposable resources. Consider an analogy: how do you build a solid building in the desert? You cannot refuse to build, or build badly, just because the sand underneath is unstable. Cloud-native system design does not assume the underlying resources are stable; in fact, any resource can fail. How, then, should the system's antifragility be designed? That is part of the essence of cloud-native design.

7. Designing with failure in mind

In the traditional approach, we made great efforts on security, availability, and so on, hoping to remove every bug and failure from the system and leave no hidden dangers. That thinking is not wrong; however, as systems and services on the cloud grow larger and larger, eliminating all bugs becomes almost impossible. In design we should accept that failures happen from time to time, and consider how to keep services running when the system or a function fails: how to quickly detect a failed microservice and isolate it so it cannot affect the whole system, and even how to degrade a service when an entire availability zone or a core service fails, rather than letting the whole service go down. For example, if there are 10 services in the system and a fault makes 3 of them unavailable, can the other 7 remain available? This is a core concept for designing systems under cloud native.
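
The detect-and-isolate idea can be sketched as a minimal circuit breaker. Production breakers (such as those in a service mesh) also add a half-open probing state for recovery; the failure threshold here is an illustrative assumption:

```python
class CircuitBreaker:
    """After `threshold` consecutive failures, isolate the dependency:
    calls short-circuit to a degraded fallback instead of piling up."""

    def __init__(self, threshold: int = 3):
        self.failures = 0
        self.threshold = threshold

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn, fallback):
        if self.open:
            return fallback()          # fail fast: degraded but available
        try:
            result = fn()
            self.failures = 0          # any success resets the count
            return result
        except Exception:
            self.failures += 1
            return fallback()
```

Wrapping each of the 10 services in a breaker like this is what lets the 7 healthy ones keep serving while the 3 failed ones are isolated.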

8. Automatic operation and maintenance — comprehensive fault monitoring based on data analysis

In a cloud-native environment the number of microservices is large. For a medium-sized system there are typically dozens or hundreds of microservices, or even more, and manual operation and maintenance is almost impossible: the running state of each microservice is complex, and the relationships among the services are complex as well. If you judge the health of the system only from the top-level customer-facing golden metrics, by the time something goes wrong it may already be serious. Therefore, across development, deployment, upgrades, problem location, and beyond, automated operation and maintenance is a very important part of cloud native.

9. Grayscale release

Grayscale release is also part of the cloud-native core. Our system is under constant development; without grayscale release, how could a system that is always changing keep changing safely? An aircraft flying at high altitude cannot wait until it lands to change its engine; likewise, as a customer, I could never accept the system being stopped in order to change it. So keeping all business available while making changes, and keeping the system moving forward, is a big challenge. Grayscale release is the most widely used approach today; there are others, such as rolling release and blue-green release, though blue-green release wastes resources. With a grayscale upgrade you gradually expand the gray range to ensure the availability of the whole service, and if a problem appears midway you can roll back quickly.
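
A minimal sketch of how a gray range can be expanded gradually: each user hashes to a stable bucket, so raising the percentage widens the gray audience step by step, and dropping it back to 0 is an instant rollback. The version labels are hypothetical:

```python
import hashlib

def route(user_id: str, gray_percent: int) -> str:
    """Deterministically map a user to a bucket in [0, 100); users whose
    bucket falls below the current gray percentage see the new version."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "v2-gray" if bucket < gray_percent else "v1-stable"
```

Determinism matters: the same user always lands on the same version at a given percentage, so a session is never flip-flopped between old and new code mid-stream.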

III. Cloud Native practice of Huawei Cloud Video

Below, I would like to share with you some of our experiences and lessons in implementing cloud native in our cloud video business.

1. Cloud Native architecture ability

Huawei Cloud has not only unified the definition and vocabulary of cloud native but also distilled a number of cloud-native assets, including an internal architecture design guide, that is, how to design a cloud-native architecture, along with a set of excellent cases. For different scenarios and patterns we built an architecture pattern library that can be consulted directly during design, making architecture design efficient and high-quality. We have also carried out comprehensive standardization across cloud native: there are unified specifications for the console and for style across services, including how to authenticate, how to connect to the API gateway, and interface style. The last point is very important: according to the cloud-native R&D maturity of each business, evaluation criteria for cloud-native architecture are built into the tooling and given numerical scores. That way every business, including cloud video, can measure the gap between its current state and the ideal state and know where it needs to improve.

Cloud native is a distillation of experience on the cloud. When implementing it, you do not need to adopt every practice; just follow the practices your business requires. What matters is the value of the experience.

2. Architecture – Microservices Architecture

We also codified the microservice architecture: service discovery, service registration, service partitioning and deployment, and so on each have unified requirements as patterns. These also cover availability, automated operations, and more, which I will not detail here.

Compare the left and right pictures. The left picture shows the microservice architecture at the beginning of business construction, when our understanding of cloud native was not deep and the business logic was relatively simple. After about a year we found the old structure breaking down. First, development efficiency kept dropping: developing one service often involved other services and required frequent changes, so a requirement took a long time to develop and release, and customer response time noticeably worsened. Second, testing became harder, the architecture decayed, and the code developed bad smells. For these reasons we decided the architecture needed restructuring. In the diagram on the right, the VodManager microservice is split into four or five services; the functional logic of each is relatively independent, so a new requirement lands in only one or two services, making testing easier and development more efficient. In my opinion, microservice partitioning is dynamic: there is no ideal architecture or partitioning method, only a few guiding principles, which we covered earlier and won't repeat here. You need to take a holistic view based on the task scenario, service complexity, number of users, and the system's specific requirements. If something is wrong, refactor it; if nothing is wrong, keep it as simple as possible. These are some of our practices in refactoring the microservice architecture.

To emphasize: this microservice restructuring was not a one-off change but a gradual evolution. One microservice might be split off one week and another the next; you cannot split everything up front and go live all at once. The reason is mainly quality: if the whole architecture were restructured suddenly, dozens of code changes or more might go live at the same time, and quality would be hard to guarantee. Every microservice change goes through a grayscale process to protect customer-facing availability. At present the number of services across our cloud video business is about 200, roughly one per person or slightly more. All of them expose external APIs, about 2,000 interfaces in total. More is not necessarily better; these numbers are only for reference.

3. Containerization practice for RTC user-access microservices

We believe containerization is necessary for cloud services, because microservices are independent of each other and should be designed stateless so they can be destroyed at any time. The picture above shows a microservice of our real-time video service, which is now commercially available, and is a case worth sharing. All microservices are fully containerized: any single instance can be destroyed, and if there is a change or a traffic increase, one microservice can be scaled up at will while the others keep working. To repeat one point: we suggest addressing externally facing microservices by domain name rather than by IP, because there may be multiple regions; as is well known, if one region fails, the domain name can be resolved to another region. If an IP address must be exposed externally, it should be an EIP in active/standby mode rather than a single IP address: because the EIP can map to multiple IP addresses, the failure of any one IP address or host does not affect the overall service. This is a key consideration for externally facing microservices.
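
The active/standby addressing point can be sketched as resolution that tries a list of addresses in order, so no single failed IP takes the service down. The domain and addresses below are placeholders:

```python
# Placeholder mapping: a service domain resolves to an ordered list of
# addresses (active first, then standby), as with an active/standby EIP.
ENDPOINTS = {"rtc.example.com": ["203.0.113.10", "203.0.113.20"]}

def resolve(domain: str, reachable) -> str:
    """Return the first reachable address for the domain.
    `reachable` is a health probe, e.g. a TCP connect check."""
    for ip in ENDPOINTS[domain]:
        if reachable(ip):
            return ip
    raise RuntimeError(f"all addresses for {domain} are down")
```

Clients that pin a single hard-coded IP lose this failover entirely, which is why the text recommends domain-name scheduling.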

In addition, microservices should not call each other directly; they should call through the service mesh. On Huawei Cloud, CSE is currently the more mature option and has been tested at large scale. A relatively new option is Istio, which has been gradually adopted over the past two years, with the number of service lines growing rapidly. The number of containers across our entire cloud video business reached several thousand at its peak.

4. Benefits of containerization

Here is a concrete explanation of the benefits of containerization based on our own situation. We did not use containerization at first, but later found that it significantly reduced resource consumption, because elastic scaling became much easier. The other benefit is fast start and stop: with containers we can basically restart in seconds, launch in seconds, and deploy in seconds when needed, and service migration, dependency decoupling, service packaging, and so on all happen very quickly. Our current code upgrades, changes, pipelines, and so on are all tied to containerization. That is our containerization practice.

5. Elastic scaling practice of real-time transcoding service

As mentioned earlier, elastic scaling is an important part of cloud native because it directly affects availability and cost. What are the main concerns for elastic scaling in video? First, scaling out ("stretch") is not a problem; the configuration diagram above hides the data. It is very simple to use: scaling is event-driven, on signals such as memory, network, and business volume, and when a threshold is reached the corresponding instances are launched, basically at second-level speed. So the "stretch" side is fine. For the video business, however, "shrink" matters just as much. What specific problems does shrinking run into? Video covers live broadcast, conferencing, RTC, and other services with high real-time requirements. Suppose an instance has 1,000 concurrent viewers, 800 of whom leave, so only 200 still occupy the instance. We handle this in two ways. One is to preferentially schedule new business onto certain designated containers; especially when business volume is declining, we may set aside anywhere from half an hour to an hour for shrinking, depending on the business. For an instance whose load has become really small, the question is how to move its live streams to another container without affecting viewers' experience. This is our elastic-scaling practice; for the cloud video business, getting "shrink" right is very important.
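
A sketch of the scale-in planning step, under the assumption that instances below a viewer-count watermark are drained emptiest-first while always keeping a minimum number active. The watermark value is illustrative, not the product's real threshold:

```python
def plan_scale_in(instances: dict[str, int], low_watermark: int = 200,
                  min_keep: int = 1) -> list[str]:
    """instances maps instance name -> current viewer count.
    Return the instances to drain: stop scheduling new viewers onto them,
    migrate or wait out remaining sessions, then reclaim the instance."""
    candidates = sorted((n for n, v in instances.items() if v < low_watermark),
                        key=lambda n: instances[n])   # emptiest first
    max_drain = max(0, len(instances) - min_keep)     # never drain everything
    return candidates[:max_drain]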

The graph in the lower left shows resource utilization during peak and low peak periods over a 24-hour period. Originally, the peak between peak and low peak is 10 times or even 100 times different, but CPU resource utilization is still stable, there is no big ups and downs. This is one of the capabilities that elastic scaling requires.

6. Cloud video unified OPS platform

No OPS in the cloud, no eyes. Operating every piece of business, visualizing every service, quickly locating when something goes wrong, OPS is a core competency. Business monitoring, configuration, call chains, log sizing, and so on are all well represented in OPS. This section is closely related to services, so only cloud video capabilities are displayed, including fault location, demarcation, operation, management, and configuration.

7. Cloud video surveillance operation and maintenance system architecture

The OPS platform relies on a large amount of data, especially log data, for all analysis, such as problem location and demarcation, failures, and alarms. In cloud services, the overall operation and maintenance architecture is very important. For the cloud video plate, I will show you our process for you to learn from.

For data collection and reporting, local redundancy needs to be considered. It cannot be reported directly and will be discarded after a failure. In addition, considering the cost, large data channels are not reserved for use, so local caching capability is required, and the ability to report failure repeatedly. For data access, Kafka is used for secondary aggregation of data, because the original data is too large to meet the scale of analysis, storage, query and so on. Currently, there are hundreds of terabytes of logs per day and they cannot be stored for long. I think the operations architecture can evolve. When our operations architecture was originally designed, it was much simpler than it is today, and the architecture above is what we have studied so far. When building a log system or o&M system, you need to consider the following two points: 1. Real-time log reporting. Logs are like neurological manifestations. If you quickly obtain logs, you can quickly understand the problems in the system and solve them. Real time is a challenge, and of course cost is involved. 2. Log real-time data. Including data aggregation, analysis and presentation.
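
The report-with-local-cache idea can be sketched as a bounded buffer that retries on the next flush instead of dropping records on failure. The `send` callable stands in for, say, a Kafka producer call; all names here are hypothetical:

```python
import collections

class LogReporter:
    """Buffer log records locally and retry failed uploads later.
    The buffer is bounded so a long outage cannot exhaust local memory
    or disk (when full, the oldest records are dropped first)."""

    def __init__(self, send, max_buffer: int = 10_000):
        self.send = send
        self.buffer = collections.deque(maxlen=max_buffer)

    def report(self, record: str) -> None:
        self.buffer.append(record)
        self.flush()

    def flush(self) -> None:
        while self.buffer:
            record = self.buffer[0]
            try:
                self.send(record)
            except Exception:
                return                 # channel down: keep buffered, retry later
            self.buffer.popleft()      # only discard after a confirmed send
```

The key choice is that a record leaves the buffer only after the send succeeds, which is exactly the "cannot be discarded after a failure" requirement above.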

8. Engineering capability – Service autonomy

As mentioned earlier, a microservices architecture has many microservices with separate code bases, more independent than monolithic software, but that creates a management problem: if each microservice required manual deployment, the frequency would make it almost impossible. Therefore, tooling and automation of engineering capabilities, and the ability to achieve service autonomy, are very important parts of cloud native and need to be built from the beginning. What steps does a developer go through from writing code to going live? As the first figure shows, they include developing code, static checking, compliance scanning, Alpha testing, Gamma testing, automated deployment, grayscale release, online testing, and more. A feature may take only 30 to 50 lines of code to develop, yet this process is almost impossible to complete manually. With tooling, however, you only need to write the code locally and submit it with one click after testing; the whole flow needs manual confirmation only at the deployment step, and every other step completes automatically, so one person can operate and maintain multiple microservices. Currently our team makes approximately 500+ changes per month, thousands per year. As the business develops, and the video field is quite active, including RTC and other scenarios, the number of changes will grow next year, so service autonomy is very important.
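
The gated pipeline described above can be sketched as a stage list where every stage runs automatically and only deployment waits for human confirmation. The stage names mirror the figure; the callback interface is an assumption for illustration:

```python
STAGES = ["static_check", "compliance_scan", "alpha_test", "gamma_test",
          "deploy", "gray_release", "online_test"]

def run_pipeline(run_stage, confirm_deploy) -> list[str]:
    """run_stage(name) -> bool runs one automated stage; confirm_deploy()
    is the single manual gate. Returns the stages that completed."""
    completed = []
    for stage in STAGES:
        if stage == "deploy" and not confirm_deploy():
            break                      # the only manual confirmation in the flow
        if not run_stage(stage):
            break                      # fail fast: nothing later runs
        completed.append(stage)
    return completed
```

Fail-fast ordering is what makes 500+ changes a month safe: a change that fails static checking never reaches testing, let alone deployment.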

9. Engineering capability – Grayscale release

Grayscale release is also a core part of cloud services and the basic guarantee of quality for going live and for the development process. It is our main release method at present. Gray matching can be based on traffic, content, domain name, features, and so on, tied to the feature under development, and changing a rule in the script makes rapid release easy. Every release has validation, which can be automated test cases, manual test cases, or joint testing with customers.
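
The rule-based gray matching can be sketched as follows; the domain, feature, and traffic-percentage rules are all hypothetical examples of the matching dimensions listed above:

```python
def is_gray(request: dict, rules: dict) -> bool:
    """A request goes to the gray version when any configured rule matches:
    an allow-listed domain, an enabled feature flag, or a traffic bucket
    below the configured percentage."""
    if request.get("domain") in rules.get("domains", set()):
        return True
    if rules.get("feature") and rules["feature"] in request.get("features", ()):
        return True
    return request.get("bucket", 100) < rules.get("percent", 0)
```

In practice such rules live in release configuration rather than code, so widening or rolling back the gray range is a config change, not a redeploy.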

10. Architecture – Distributed & highly available

Beyond improving efficiency and reducing cost, the core of everything above is the availability of cloud services. Availability is a service commitment to customers; when problems occur, there is not only compensation but, more seriously, damage to brand reputation. So how do you achieve availability? In summary: design around the philosophy of accepting failure, so that failed business can be degraded; at the same time, ensure the underlying resources span multiple AZs, so that when an equipment room in one city fails, the overall service is unaffected; and on that basis implement multi-region disaster recovery (DR) for overall availability. Availability can only be made as high as possible, never 100 percent, and the whole journey has to be explored together. The figure above shows a serious incident in which almost all underlying resources in one region became unavailable; the red line shows our business, which was barely affected. Improving service availability without affecting the business should be our common aspiration.

This article is shared from the Huawei Cloud community post "Cloud Native Cloud Video Practice"; original author: audio and video manager.
