This article was published by NetEase Cloud.

Author: Liu Chao, Chief Solution Architect at NetEase Cloud

I've been wondering lately why Kubernetes is winning the battle among container platforms and becoming the standard bearer for microservices, when in many respects the three major container platforms end up virtually identical in functionality. (See "Ten Models of Container Platform Selection: Docker, DC/OS, K8S, Who Is Leading?")

After some reflection, and interviews with NetEase Cloud architects who have practiced Kubernetes since its early days, I have summarized my thoughts into today's post.

1. Three perspectives on container platforms, from the three architectures of enterprise cloud

As shown in the figure, the three major cloud architectures in an enterprise are the IT architecture, the application architecture, and the data architecture. Different companies, different people, and different roles pay attention to different ones.

For most enterprises, the move to the cloud is initiated by the IT department, usually the operations team. They focus on computing, networking, and storage, and hope to reduce CAPEX and OPEX through cloud services.

Companies with consumer-facing (ToC) businesses accumulate large amounts of user data, which they use for big data analysis and digital operations, so they need to focus on the data architecture.

Enterprises building Internet applications usually focus first on the application architecture: can it meet end customers' needs and give them a good experience? Business volume tends to explode in a short period, so they care about high-concurrency application architectures and want an architecture that can iterate fast enough to catch the trend.

Before containers appeared, all three architectures were usually served by virtual machine cloud platforms. When containers emerged, their many attractive characteristics (lightweight, self-packaging, standardized, easy to migrate, easy to deliver) made people's eyes light up, and container technology spread quickly.

However, there are a thousand Hamlets in a thousand people's eyes. Shaped by their original work, the three roles each see the convenience of containers from their own perspective.

For IT o&M engineers who used to be in charge of computing, network, and storage in the equipment room, containers are more like a lightweight O&M mode. For them, the biggest difference between containers and virtual machines is that they are lightweight and can start quickly. Therefore, they are more likely to introduce virtual machine mode containers.

For the data role, which runs all kinds of data computation tasks every day, a container is a task-execution mode that offers better isolation and more efficient resource use than the original JVM.

From an application architecture perspective, containers are the delivery form of microservices. Containers are not just for deployment but for delivery: the D in CI/CD.

So people with these three perspectives use containers in different ways and choose container platforms differently.

2. Kubernetes is the bridge between microservices and DevOps

Swarm: the IT operations engineer's view



From the IT operations engineer's point of view, containers mean lightweight, fast startup, automatic restart, automatic association, and elastic scaling: technologies that seem to promise that operations engineers will no longer have to work overtime.

Swarm's design clearly fits the management model of traditional IT engineers.

They want to see clearly how containers are distributed across machines and what state they are in, and to be able to SSH into a container whenever needed to see what is going on.

They would rather restart a container in place than have a new one scheduled onto a random node, so that everything installed inside the container is still there.

Rather than starting from a Dockerfile, they prefer to snapshot a running container into an image, so that the next startup reuses the hundred things that were done by hand inside the container.

They prefer a well-integrated container platform. The point of a platform is to simplify operations; if the platform itself is very complex, like Kubernetes with its many processes whose high availability and operating costs must also be managed, it is not cost-effective: the work is no lighter than before, and the cost has gone up.

Preferably the platform is a thin layer, like a cloud management platform, that also makes cross-cloud management convenient; after all, container images migrate easily across clouds.

Used this way, Swarm feels familiar to IT engineers, but it does everything OpenStack does, only faster.

The problems with Swarm

However, containers used as lightweight virtual machines are exposed to customers, whether external customers or the company's own developers, rather than kept to the IT staff themselves. Users assume containers are the same as virtual machines, and when they discover the differences, the complaints pile up.

Take self-healing. After a restart, the software a user manually installed over SSH is gone, and even files on the disk may be gone; and because the application was not in the Entrypoint, it does not start automatically after self-healing, so the user still has to log in and start the process by hand. The customer complains: what is the point of your self-healing?

Or a user runs ps inside the container, sees a process they do not recognize, and kills it directly. That process happens to be the Entrypoint process, so the whole container exits. The customer complains that your containers are unstable and always dying.

When a container is automatically rescheduled, its IP address is not preserved, so the original IP is gone after the restart. Many users ask whether the IP can be kept, because the IP they wrote into configuration files changes every time.

The container's system disk, that is, the operating system disk, is usually fixed in size. It can be configured up front, but it is hard to change afterwards, and there is no way to let each user choose a system disk size. Some users complain: we put a lot of things directly on the system disk, and if it cannot be resized, what kind of cloud elasticity is that?

Mounting a data disk is also awkward. Once a container has started, some customers expect to attach another disk to it, as they would to a cloud host; this is hard to do with containers, and it draws more complaints.

If users are not told they are using containers, they will try to use them the way they use virtual machines, find them hard to use, and conclude that the platform is simply bad.

Swarm is relatively easy to use, but when problems arise, the people operating the container platform find them hard to solve.

Swarm has many built-in functions coupled together; once an error occurs, it is not easy to debug, and if the current functionality does not meet your needs, it is hard to customize. Many functions are coupled into the Manager, so operating on or restarting the Manager has too broad an impact.

Mesos: the data operations engineer's view

From the perspective of big data platform operations, how to schedule data processing jobs faster and run more jobs within limited time and space is a crucial concern.

So when we evaluate a big data platform, we usually measure the number of tasks run per unit of time and the volume of data that can be processed.

From the data operations perspective, Mesos is a great scheduler. Since it can run tasks, it can also run containers, and Spark's native integration with Mesos allows tasks to execute in a more fine-grained way.

Without fine-grained task scheduling, task execution looked like this: a Master node manages the whole job, and Worker nodes execute the sub-tasks. When the job starts, resources are allocated to the Master and all the Workers and the environments are configured, so that sub-tasks can run there. When no sub-task is running, those environments and resources stay reserved. Clearly, not all Workers run at full utilization all the time, so a lot of resources are wasted.

In fine-grained mode, when the job starts, resources are allocated only to the Master, not to the Workers. When a sub-task needs to run, the Master requests resources from Mesos on the spot. The environment is not prepared in advance, but fortunately there is Docker: start a container and the environment is all there, then run the sub-task inside it. When no task is running, the resources on every node can be used by other jobs, greatly improving resource utilization.

This is Mesos's biggest advantage. In the Mesos paper, the headline result is the improvement in resource utilization, and Mesos's two-level scheduling algorithm is the core.

Former big data operations engineers therefore naturally choose Mesos as their container management platform. Mesos originally ran short tasks, and Marathon extended it to long-running services. (Spark, however, later deprecated the fine-grained mode because it was still not efficient enough.)

The problems with Mesos

Scheduling is the core of the core in big data; it is important in a container platform too, but it is not the whole story. Containers also need orchestration: all the peripheral components that let containers run as long-lived services and reach each other. Marathon is only the first step on a long journey.

So most early adopters of Marathon + Mesos used them bare. Because the surrounding ecosystem was incomplete, each had to build its own wrappers. If you are interested, look around the community: the vendors running bare Marathon and Mesos each have their own load balancing solution and their own service discovery solution.

Hence DC/OS was born: a large set of peripheral components added around Marathon and Mesos to fill out the functions of a container platform. Unfortunately, by then many vendors had already built their own customizations, and most still run bare Marathon and Mesos.

Mesos is a great scheduler, but it handles only part of the scheduling; you have to write your own framework and scheduler, and sometimes your own executor as well. These can be complicated to develop, and the learning curve is steep.

Although the later DC/OS is much more complete, it does not feel like Kubernetes, which uses one unified language; it is a hodgepodge. In the DC/OS ecosystem, Marathon is written in Scala, Mesos in C++, Admin Router is Nginx + Lua, mesos-dns is in Go, marathon-lb in Python, and Minuteman in Erlang: fixing a bug across all of that is just too complicated.

Kubernetes

Kubernetes is different. At first it feels strange: before you have created a single container, there are many concepts to learn, many documents to read, complex orchestration files, and many components, which scares plenty of people away. I just want to create a container; why so many prerequisites? If you put Kubernetes concepts directly on the interface and ask customers to create containers there, the customers will curse you.

From a developer's point of view, using Kubernetes is definitely not like using a virtual machine. Besides writing code, building, and testing, you also need to know that your application runs in containers, not just hand it over and walk away. Developers need to know that containers are deployed differently from before, distinguish stateful from stateless applications, write Dockerfiles, care about environment delivery, and learn many things they never had to know. Honestly, it is not convenient.

From the operations staff's point of view, using Kubernetes is absolutely unlike operating virtual machines. There, the attitude was: I deliver the environment and keep the network connected; how applications call each other is not my problem. In operations' eyes, Kubernetes meddles in too many things it should not care about, such as service discovery, configuration management, circuit breaking, and degradation. Those belong at the code level, to Spring Cloud or Dubbo; why should the container platform layer care about them?

Kubernetes + Docker is a bridge between Dev and Ops.

Docker is the delivery tool of microservices. After moving to microservices there are too many services for operations to manage alone, and mistakes come easily, so development has to start caring about environment delivery. For example, only the developers know what configuration they changed, which directories they created, and how permissions were set. It is hard to pass this information to operations accurately and on time through documents, and even when it is passed, the maintenance burden on operations is enormous.

So the biggest change containers bring is that the environment is delivered earlier: each developer spends roughly 5% more time, operations effort drops by far more than that, and stability improves.

On the other hand, operations no longer just delivers resources ("here is your virtual machine; how the applications inside it reach each other is not my business"). With Kubernetes, operations takes charge of service discovery, configuration management, circuit breaking, and degradation.

The two sides meet in the middle. From the perspective of microservice development, although Kubernetes is complex, its design is reasonable and matches the ideas of microservices.

3. Ten design points of microservices

What are the key points of microservice design? The first picture shows the whole Spring Cloud ecosystem.

The second chart shows the twelve factors of microservices and their practice on NetEase Cloud.

The third diagram shows all the points to consider when building a high-concurrency microservice. (A quick plug: this will be a course, coming soon.)

Let’s talk more about the design of microservices.

Design point 1: API gateway.

In the process of implementing microservices, you inevitably face service aggregation and splitting. When back-end services are split and re-split frequently, the mobile app needs a unified entry point that routes requests to the right services, so that no matter how the back end splits or aggregates, it remains transparent to the mobile client.

With an API gateway, simple data aggregation can be done at the gateway layer instead of on the phone, so the mobile app consumes less power and the user experience is better.

A unified API gateway can also take on unified authentication and authorization. Even though calls between services are complex and the interfaces numerous, the gateway usually exposes only the external interfaces and authenticates them in one place, so that internal services calling each other need not authenticate again, which is more efficient.

With a unified API gateway, you can also apply policies at this layer: A/B testing, blue-green release, traffic diversion to a pre-release environment, and so on. API gateways should be stateless and horizontally scalable so that they never become the performance bottleneck.
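On Kubernetes, one common way (though not the only one) to realize such a gateway is an Ingress in front of the back-end services. A minimal sketch; the hostname and the service names `order-service` and `payment-service` are illustrative assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
spec:
  rules:
    - host: api.example.com            # assumed external entry point
      http:
        paths:
          - path: /orders              # route by path to the right back end
            pathType: Prefix
            backend:
              service:
                name: order-service    # assumed service name
                port:
                  number: 80
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service  # assumed service name
                port:
                  number: 80
```

However the back end is split or aggregated later, only these routing rules change; the mobile client keeps calling the same host.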

Design point 2: statelessness; distinguish stateful from stateless applications.

An important factor in application migration and horizontal scaling is the application's state. A stateless service moves that state outward, storing session data, file data, and structured data in unified back-end stores, so that the application itself contains only business logic.

State cannot be avoided entirely: ZooKeeper, databases, caches, and the like all hold it, and all of these stateful components should converge into a few centralized clusters.

The whole business is thus divided into two parts: a stateless part and a stateful part.

The stateless part achieves two things: it can be deployed anywhere, across machine rooms (mobility), and it can scale out elastically (easy expansion).

The stateful parts, such as databases, caches, and ZooKeeper, have their own high-availability mechanisms, and those native mechanisms should be used to build the stateful clusters.

Even a stateless service holds the data it is currently processing in memory, so when the process dies, part of that data is inevitably lost. To cope with this, services need a retry mechanism, interfaces must be idempotent, and the retry should reach another instance of the back-end service through the service discovery mechanism.
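On Kubernetes, the stateless part maps naturally onto a Deployment: replicas can land on any node and be scaled at will. A minimal sketch, with an assumed image name:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                      # horizontal scaling is just a number
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:1.0   # assumed image; business logic only
          ports:
            - containerPort: 8080
          # Session, file, and structured data live in back-end stores,
          # so any replica can serve any request and any replica can die.
```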

Design point 3: horizontal scaling of the database.

The database is where state is kept; it is the most important component and the most bottleneck-prone. A distributed database lets database performance grow linearly with the number of nodes.

At the bottom of the distributed database DDB sits RDS in master/slave mode; through MySQL kernel development we achieve master/slave switchover with zero data loss, so data that lands in this layer is safe: even if a node dies, nothing is lost after the switchover.

Above that sit NLB load balancers, built with LVS, HAProxy, and Keepalived, and then a layer of Query Servers. Query Servers scale horizontally based on monitoring data, and a failed one can be replaced at any time, all invisible to the business layer.

There is also dual-machine-room deployment: DDB's data canal component, NDC, synchronizes DDB instances across machine rooms, so the database is distributed not just within one data center; across data centers there is something like an active/standby replica, giving high availability a solid guarantee.

Design point 4: Cache

Caching is crucial in high-concurrency scenarios. There should be layered caches so that data sits as close to the user as possible: the closer the data is to the user, the higher the concurrency it can serve and the shorter the response time.

The mobile client app should have its own cache layer. Not all data has to be fetched from the back end every time; only the important, critical, frequently changing data does.

Static data in particular can be fetched once and refreshed only after a while, with no need to go back to the data center each time. Through a CDN, such data can be cached on the node nearest to the client for nearby download.

Sometimes there is no CDN, or the request must still go back to the data center, which is called returning to origin. At the outermost layer of the data center, which we call the access layer, a cache can be set up to intercept the majority of requests so they never pressure the back-end database.

For dynamic data, which must reach the application and be generated by business logic or read from the database, the database load can be reduced by caching inside the application: either a local cache or a distributed cache such as Memcached or Redis, so that most requests are served from the cache and never touch the database.

Of course, dynamic data can also be staticized, that is, degraded to static data, to further relieve the back end.
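As a sketch of the innermost cache layer, a distributed cache such as Redis can run on the container platform itself. A deliberately minimal, single-instance example for illustration only (a production cache would rely on its own high-availability mechanism, as noted in design point 2):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:6                             # assumed image version
          args: ["--maxmemory", "256mb",             # bound the cache...
                 "--maxmemory-policy", "allkeys-lru"] # ...and evict old keys
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis-cache          # applications reach the cache by this name
spec:
  selector:
    app: redis-cache
  ports:
    - port: 6379
```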

Design point 5: service splitting and service discovery

When the system can no longer cope and the application changes quickly, it is often time to consider splitting large services into a series of smaller ones.

The first advantage is independent development. When many people maintain the same code repository, code changes constantly interfere with one another: tests fail for code you never touched, commits conflict, and merges multiply, all of which drags development efficiency down.

Another advantage is going live independently. If the logistics module, integrating a new courier company, has to go live together with the ordering module, that is plainly unreasonable.

Another is scaling during high-concurrency periods. Usually only the critical ordering and payment flow is core, and only that key transaction chain needs to scale out; if many other services are still attached to it, scaling them along is uneconomical and risky.

Then there is disaster recovery and degradation: during a big promotion you may need to sacrifice peripheral features, but if all the code is coupled together, it is hard to degrade only the periphery.

Of course, after the split, the relationships between applications become more complex, so a service discovery mechanism is needed to manage them and to achieve automatic repair, automatic association, automatic load balancing, and automatic failover.
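On Kubernetes, the built-in Service object is exactly such a discovery mechanism: a stable name that load-balances across whatever pods currently match its selector. A minimal sketch (names assumed):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order          # any healthy pod carrying this label gets traffic
  ports:
    - port: 80
      targetPort: 8080  # pods may listen on a different port internally
```

Callers simply use http://order-service; when pods die and are replaced, the association repairs itself with no change on the caller's side.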

Design point 6: Service orchestration and elastic scaling

Once services are split apart, there are very many processes, so service orchestration is needed to manage the dependencies between services and to describe their deployment as code, which is what is usually meant by infrastructure as code. Services can then be published, updated, rolled back, scaled out, or scaled in by editing the orchestration files, which adds traceability, manageability, and automation.

Since orchestration files can themselves be managed in a code repository, you can update five services out of a hundred by changing just those five services' configuration in the orchestration files. When the change is committed, the repository automatically triggers the deployment and upgrade script, updating the online environment. If something turns out to be wrong with the new environment, you naturally want to roll back exactly those five services atomically. Without orchestration files, you would have to record by hand which five services were upgraded this time; with them, you simply revert to the previous version in the repository. Every operation is visible in the code repository.
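A hedged sketch of the Deployment fields that make this declarative, versioned workflow possible; the image tag is the line you would edit and commit to upgrade one of the five services:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment
spec:
  replicas: 4
  revisionHistoryLimit: 10      # keep old ReplicaSets so rollback is possible
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1         # upgrade gradually, one pod at a time
      maxSurge: 1
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      containers:
        - name: payment
          image: example/payment:2.3   # assumed image; bump this tag to upgrade
```

Reverting the commit in the repository (or a rollout undo) restores the previous version; the repository history is the audit trail.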

Design point 7: unified configuration center

After services are split up, there are very many of them. If every configuration lives in a local file on the application, it becomes very hard to manage: imagine hunting for the one faulty configuration when hundreds or thousands of processes misbehave. So a unified configuration center is needed to manage all configuration and distribute it centrally.

In microservices, configuration usually falls into several classes. One class is configuration that barely changes, which can be baked directly into the container image. The second is configuration that is fixed at startup, which is usually passed in through environment variables when the container starts. The third is unified configuration, distributed on demand by the configuration center; for example, during a big promotion, which features may and may not be degraded can be pushed out through configuration.
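Kubernetes's ConfigMap covers the second and third classes: values injected as environment variables at startup, and file-style configuration that the center can update afterwards. A minimal sketch; the key names and image are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"           # startup-time value, injected as an env var
  degrade.properties: |       # file-style config, mounted as a volume
    degrade.recommendation=true
    degrade.checkout=false
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example/app:1.0  # assumed image
      env:
        - name: LOG_LEVEL
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: LOG_LEVEL
      volumeMounts:
        - name: config
          mountPath: /etc/app # files here follow later ConfigMap updates
  volumes:
    - name: config
      configMap:
        name: app-config
```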

Design point 8: Unified log center

Likewise, with so many processes, it is impossible to log into hundreds or thousands of containers one by one to read logs, so a log center is needed to collect them. To make the collected logs easy to analyze, log formats must meet certain standards; once every service follows the unified log standard, a transaction can be traced end to end in the log center. For example, searching the log search engine for a transaction ID shows exactly where in the flow the error or exception occurred.
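On Kubernetes, the standard pattern is a log agent on every node, which is exactly what a DaemonSet guarantees. A hedged sketch using Fluentd as the agent; the image tag and host path are common choices, not prescriptions:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd:v1.16-1   # assumed agent image
          volumeMounts:
            - name: varlog
              mountPath: /var/log         # read container logs from the host
      volumes:
        - name: varlog
          hostPath:
            path: /var/log                # one agent per node ships these on
```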

Design point 9: circuit breaking, rate limiting, and degradation

Services need the ability to break circuits, limit rates, and degrade. When a service calls another and the call times out, it should return promptly with default fallback data rather than block there and hold up other users' transactions.

When a service finds that the service it calls is too busy, with thread pools exhausted, connection pools full, or errors piling up, it should cut the connection promptly, to stop the downstream failure or congestion from spreading upstream and avalanching the whole application.

When the whole system really is overloaded, you can choose to degrade certain features or certain calls, guaranteeing that the most important transaction flows get through and that the most important resources go to the most core flows.

Another measure is rate limiting. When setting the circuit-breaking and degradation policies, you should already know, from full-link stress tests, how much load the whole system can support. The rate-limiting policy then keeps the system serving within that tested capacity and refuses service beyond it. When you place an order and a dialog pops up saying "system busy, please try again later", it does not mean the system is down; it means the system is working normally, and the rate-limiting policy has kicked in.
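At the platform layer, a service mesh can express these policies declaratively. A hedged sketch using Istio's DestinationRule, which assumes Istio is installed on the cluster; the thresholds are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-circuit-breaker
spec:
  host: order-service                # assumed target service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # a limiting knob: cap concurrent connections
      http:
        http1MaxPendingRequests: 50  # reject excess requests instead of queueing
    outlierDetection:                # the circuit-breaking part
      consecutive5xxErrors: 5        # after five straight errors...
      interval: 10s
      baseEjectionTime: 30s          # ...eject the unhealthy endpoint for a while
```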

Design point 10: Comprehensive monitoring

When the system is very complex, unified monitoring has two main concerns: whether the system is healthy, and where the performance bottlenecks are. When anomalies appear, the monitoring system works with the alerting system to detect, notify, and intervene promptly, keeping the system running smoothly.

During stress tests you constantly hit bottlenecks, and comprehensive monitoring is also what finds them, while preserving the evidence so that problems can be traced, analyzed, and optimized from every angle.
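Health is the part the platform can check directly. On Kubernetes, liveness and readiness probes give monitoring its most basic hooks: the platform detects failures and intervenes on its own. A minimal sketch (the paths, port, and metrics annotation are assumptions about the surrounding setup):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  annotations:
    prometheus.io/scrape: "true"  # common convention if Prometheus is collecting
spec:
  containers:
    - name: web
      image: example/web:1.0      # assumed image
      ports:
        - containerPort: 8080
      livenessProbe:              # restart the container when this fails
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      readinessProbe:             # pull it out of load balancing until ready
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```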

Kubernetes itself is a microservices architecture

With these ten design points in hand, let's come back to Kubernetes, which starts to look better and better.

First of all, Kubernetes itself is a microservices architecture. Although it looks complex, it is easy to customize and easy to scale horizontally.

In the figure, the black parts are native Kubernetes, while the blue parts are NetEase Cloud's customizations for supporting large-scale, high-concurrency applications.

The Kubernetes API Server works much like a gateway, providing a unified interface for authentication and access.

As is well known, Kubernetes's tenant management is relatively weak, especially for public cloud scenarios with complex tenant relationships. We only needed to customize the API Server to work with Keystone in order to manage complex tenant relationships, without touching any other component.

In Kubernetes almost every component is stateless; all state lives in a unified etcd. This makes scalability excellent: components finish their own tasks asynchronously and put the results into etcd, without being coupled to one another.

For example, in the pod creation flow shown in the figure, the client does nothing more than create a record in etcd, and the other components, watching for that event, asynchronously do their own parts and write their results back to etcd. It is not that one component remotely calls the kubelet and orders it to create a container; rather, the kubelet discovers in etcd that a pod has been bound to its node, and simply starts it.

To achieve tenant isolation in the public cloud, our policy is that different tenants do not share nodes. This requires Kubernetes to be aware of the IaaS layer, so we implemented our own Controller; Kubernetes's design let us create this controller independently instead of modifying its code.

As the access layer, the API Server has its own caching mechanism that keeps every request from hitting the back-end database directly. When it still cannot handle the concurrency, the bottleneck lies in the back-end etcd store, just as in an e-commerce application, and the answer is the same: shard it, splitting etcd into multiple clusters so that different tenants are stored in different etcd clusters.

With the API Server acting as the API gateway, all the customized back-end services stay transparent to clients and to the kubelet.

The figure shows the customized container creation flow. Because the number of nodes needed during a big promotion differs greatly from ordinary times, we cannot pre-create every node in advance; that would waste resources. So NetEase Cloud's Controller module and IaaS management were added in the middle: when resources run short while creating containers, it dynamically calls the IaaS interface to create more. All of this is invisible to clients and to the kubelet.

To support a scale of more than 30,000 nodes, NetEase Cloud needed to optimize the various modules. Because each module does only its own job, the Scheduler only schedules and the Proxy only forwards, rather than being coupled together, each component can be optimized independently, which matches the microservice principle of independent functions being independently optimized without affecting one another. Moreover, all Kubernetes components are written in Go, which makes this easier. So Kubernetes is slow to get started with, but once you need to customize it, you find it is actually easy.

Kubernetes is a better fit for microservices and DevOps

Having talked about Kubernetes itself, let's talk about its conceptual design and why it suits microservices so well.

One of the ten design points above is distinguishing stateless from stateful. In Kubernetes, stateless corresponds to Deployment and stateful to StatefulSet.

Deployment solves horizontal scaling mainly through the replica count.

StatefulSet lets a stateful application make good use of its own high-availability mechanism by guaranteeing stable network identities, stable storage, and ordered upgrades, scaling, and rollbacks. Most clusters' high-availability mechanisms can tolerate one node failing temporarily, but not most nodes failing at once. And although the high-availability mechanism can bring a failed node back through some repair procedure, it has to know which node just failed. The StatefulSet machinery gives the scripts inside the containers enough information to handle these situations, so even stateful applications can recover quickly.
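A minimal StatefulSet sketch showing those guarantees; the image, storage size, and names are assumptions:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless   # assumes a matching headless Service exists;
                             # gives stable DNS names: zk-0.zk-headless, zk-1..., ...
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
        - name: zk
          image: zookeeper:3.8          # assumed image
          ports:
            - containerPort: 2181
          volumeMounts:
            - name: data
              mountPath: /data          # each replica keeps its own data
  volumeClaimTemplates:                 # storage survives rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

The pod named zk-1 always comes back as zk-1 with the same volume, so the scripts inside know exactly which member just failed and what to repair.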

For microservices it is recommended to use the cloud platform's PaaS, such as databases, message queues, and caches. The configuration can get complicated, though, because different environments must connect to different PaaS instances.

The headless Service in Kubernetes solves this: create a Service for the external service, point it at the corresponding PaaS instance, and configure only the service name in the application. Because the production and test environments live in separate namespaces, the application is configured with the same service name everywhere yet can never reach the wrong environment, which simplifies configuration.
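A hedged sketch of the idea: the application always connects to the name `mysql`, while each namespace maps that name to its own backing instance. Here the mapping uses an ExternalName alias for brevity; a headless Service without a selector plus manually managed Endpoints achieves the same for raw IPs. Hostnames are illustrative:

```yaml
# In the test namespace, "mysql" resolves to the test PaaS instance.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: test
spec:
  type: ExternalName
  externalName: test-rds.example.internal   # assumed hostname
---
# In production, the same name resolves to the production instance.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: production
spec:
  type: ExternalName
  externalName: prod-rds.example.internal   # assumed hostname
```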

For service discovery, Spring Cloud or Dubbo can be used at the application layer; at the container platform layer, Service of course provides load balancing, self-healing, and automatic association.

As for service orchestration, Kubernetes was built as the orchestration standard: the YAML files can be managed in the code repository, and flexible scaling is just a change to the Deployment's replica count.

For the configuration center, Kubernetes provides ConfigMap, whose contents can be injected into environment variables or volumes when the container starts. The only drawback is that configuration injected through environment variables cannot change dynamically; fortunately, configuration mounted as a volume can, as long as the process in the container has a reload mechanism.

Unified logging and monitoring usually require an agent deployed on every node to collect logs and metrics; the DaemonSet design puts one on each node, which makes this easy to implement.

And of course the currently popular Service Mesh enables even finer-grained service governance, such as circuit breaking, routing, and degradation. A Service Mesh works by intercepting service traffic and controlling it through a sidecar, which is possible thanks to the Pod concept: a Pod can hold multiple containers. If the original design had started containers directly, with no Pod, this would be very inconvenient.
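The Pod concept makes the sidecar pattern trivial to express: two containers sharing one network namespace. A minimal hand-written sketch (real meshes usually inject the proxy automatically; both images are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
    - name: app
      image: example/web:1.0          # the business container
      ports:
        - containerPort: 8080
    - name: proxy                     # the sidecar that intercepts traffic
      image: envoyproxy/envoy:v1.28.0 # assumed proxy image
      ports:
        - containerPort: 15001
  # Both containers share localhost, so the proxy can sit transparently
  # in front of the app's traffic and apply routing/breaking policies.
```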

So the many designs of Kubernetes may look redundant and complex, and the barrier to entry is high, but once you set out to build real microservices, Kubernetes has a combination ready for every need. People who have actually practiced microservices tend to feel this deeply.

Kubernetes common usage

For details on how Kubernetes is used in different stages of microservitization, see this article: Different ways of playing Kubernetes in different stages of microservitization


Learn more about NetEase Cloud:

Official website of NetEase Cloud: www.163yun.com/

New user package: www.163yun.com/gift

NetEase Cloud community: sq.163yun.com/

Free trial of NetEase Cloud basic services: www.163yun.com/product-clo…