In the 618 Promotion of 2015, JD boldly adopted Docker-based container technology to carry key promotional businesses (picture display, product detail pages, group-purchase pages). At that time, the Elastic Cloud project based on Docker containers had nearly ten thousand containers running in the production environment and had withstood the test of heavy traffic. For this year's June 18 event, the Elastic Cloud project is even more important, with all application systems and most DB services running on Docker. The Elastic Cloud manages resources automatically, expanding them during peak traffic periods such as 618 and reclaiming them during off-peak periods to improve resource utilization and keep the O&M systems stable. According to official estimates, nearly 150,000 Docker containers will be running on JD's production systems during this promotion. By that count, JD is one of the largest Docker users worldwide.
To find out more, InfoQ spoke with Bao Yongcheng, leader of the Elastic Cloud project. Bao will also share the technical details of JD's 618 practice at CNUTCon, a global container technology conference hosted by InfoQ.
About the interviewee
Bao Yongcheng, project leader of the JD Elastic Computing Group, leads the elastic computing team in deep exploration of the IaaS field and is committed to building a strong virtualization platform for JD. He joined JD in early 2013, focusing on R&D of the JD Elastic Cloud platform and operating a number of medium- and large-scale IaaS clusters (including JD Elastic Cloud, public cloud, and hybrid cloud products). He has hands-on experience in OpenStack development and performance optimization, automated deployment, KVM, Docker, and distributed systems.
InfoQ: We talked about JD's use of Docker at last year's 618. Comparing with last year, can you introduce this year's scale, applications, and adjustments?
Bao Yongcheng: In terms of quantity, the number of online containers peaked at 9,000 last year; by June 17th this year it had exceeded 150,000. In terms of overall layout, compared with last year's 618, the Elastic Cloud has achieved strategic milestones in both scale and full containerization of the business. At the application level, all JD applications are published and their clusters managed 100% through container technology. It is worth pointing out that 5,600 container instances support DB clusters this year, which provides very convenient support for JD's cloud databases. The core architecture of the Elastic Cloud has not changed much. The definition is still simple: Elastic Cloud = software-defined data center + business container cluster scheduling. On this basis, there are two enhancements:
- The stability and performance of a single container have been greatly improved, effectively meeting the core systems' heavy demands on compute and network;
- Physical machines, VMs, and containers are managed in a unified manner and added to a single cluster for scheduling, meeting the different computing resource requirements of different services.
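Scheduling physical machines, VMs, and containers from one cluster amounts to putting a common capacity abstraction over heterogeneous backends. The sketch below illustrates the idea only; the class and function names are hypothetical, not JD's actual API:

```python
from dataclasses import dataclass

# Hypothetical unified view of a schedulable resource: a physical machine,
# a VM, or a container host all expose the same capacity fields.
@dataclass
class Node:
    name: str
    kind: str          # "physical", "vm", or "container"
    free_cpu: int      # free CPU cores
    free_mem_mb: int   # free memory in MB

def pick_node(nodes, cpu, mem_mb):
    """Pick the node with the most free capacity that fits the request,
    regardless of whether it is a physical machine, VM, or container host."""
    candidates = [n for n in nodes if n.free_cpu >= cpu and n.free_mem_mb >= mem_mb]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (n.free_cpu, n.free_mem_mb))

nodes = [
    Node("pm-01", "physical", free_cpu=32, free_mem_mb=128000),
    Node("vm-07", "vm", free_cpu=2, free_mem_mb=4000),
    Node("dk-12", "container", free_cpu=8, free_mem_mb=16000),
]
best = pick_node(nodes, cpu=4, mem_mb=8000)
print(best.name)  # pm-01 fits and has the most free CPU
```

The point of the abstraction is that the placement policy never branches on the resource kind; only the capacity fields matter.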
InfoQ: Can you tell us which services are now running entirely on Docker? What architectural changes were made to meet this challenge?
Bao Yongcheng: The website, transactions, wireless, WeChat/Mobile QQ channels, and most application systems and DB services run on Docker. Because JD's business has undergone microservice governance for many years, the adjustments to the application-layer architecture were very small. The proportion of microservices is already large, so adopting container technology went relatively smoothly. The Elastic Cloud platform stands on the shoulders of giants: years of microservice governance across the business systems.
InfoQ: How do you monitor so many containers at scale? Did you adopt any open source solutions? What should one pay attention to?
Bao Yongcheng: The monitoring system is self-developed and is responsible for collecting and storing massive amounts of data. Its framework is shown in the figure.
(Figure: monitoring system framework)
For monitoring 150,000+ containers, note the following: metric collection and transmission must be designed efficiently, and resource consumption during collection must be controlled. It is also recommended to use caching to speed up alarms and monitoring charts.
InfoQ: How does the elastic compute cloud platform scale up during peak times? Can you walk us through the process from a system perspective?
Bao Yongcheng: The Elastic Cloud has two modes:
- Automatic mode: capacity expansion is triggered automatically based on the elastic conditions preset by each service owner. Expansion includes spawning instances from the service's image, automatic DB authorization, registration with the microservice framework, and adding the instances to the LB pool.
- Manual mode: an elastic event is triggered and a confirmation message is sent to the responsible operations staff; the operator only needs to click the confirmation to approve the expansion. The subsequent process is the same as in automatic mode.
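The two modes above differ only in whether an operator's approval sits between the trigger and the expansion. A sketch of that control flow, with hypothetical names and a CPU threshold standing in for the preset elastic conditions:

```python
def expand(service, count):
    """Placeholder for the expansion steps described above: spawn instances
    from the service image, authorize DB access, register with the
    microservice framework, and add the instances to the LB pool."""
    return [f"{service}-new-{i}" for i in range(count)]

def on_elastic_event(service, cpu_pct, mode, approved=False, threshold=80):
    if cpu_pct <= threshold:
        return []                      # condition not met, no expansion
    if mode == "auto":
        return expand(service, 2)      # expand immediately
    if mode == "manual" and approved:  # operator clicked "confirm"
        return expand(service, 2)
    return []                          # manual mode, still awaiting confirmation

print(on_elastic_event("order", 95, "auto"))          # expands right away
print(on_elastic_event("order", 95, "manual"))        # waits for confirmation
print(on_elastic_event("order", 95, "manual", True))  # expands after approval
```

Keeping the post-approval path identical to the automatic path, as described in the interview, means only the gate differs between modes.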
InfoQ: Can you talk about the key open source projects currently in use? At what layer of the system do they sit, and which versions?
Bao Yongcheng: Mainly OpenStack and Docker, as follows:
- OpenStack (IceHouse) manages computing, network, and storage resources at the JD Data Center Operating System layer.
- Docker (1.3) is used to spawn container instances, with a lot of self-developed content and features added.
InfoQ: With so many containers, what do you think is the biggest challenge?
Bao Yongcheng:

1. Cluster scale: for efficient operation and maintenance, the Elastic Cloud follows a large-cluster architecture. An OpenStack cluster can grow very large, but a single cluster is currently limited to about 6,000 compute nodes, and managing a cluster of that size with native OpenStack alone is very difficult. So we redesigned and reimplemented the databases and MQ that OpenStack relies on; after testing, the new design can support 10,000 compute nodes.

2. Single-instance performance tuning: many core businesses have extremely strict requirements on the response time of a single request, which requires every container instance provided by the Elastic Cloud to perform excellently. We started with CPU and network. For CPU, we adopt a scale-up algorithm to allocate CPU flexibly, so that busy businesses get enough CPU resources in time. For the network, the main focus is fully exploiting the 10-gigabit NIC. We made several improvements to OVS (Open vSwitch): reducing some locks and optimizing peer ports and NIC interrupts. This brings network performance close to that of a physical NIC.
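The idea that busy services should get CPU in time resembles proportional-weight arbitration (for example, cgroup `cpu.shares`). The interview does not describe JD's actual algorithm; this sketch merely computes how CPU would be divided under contention for illustrative, hypothetical weights:

```python
def cpu_split(shares, total_cores):
    """Divide total CPU cores among containers in proportion to their
    weights, mirroring how cgroup cpu.shares arbitrates under contention."""
    total = sum(shares.values())
    return {name: total_cores * w / total for name, w in shares.items()}

# A busy core service gets a higher weight than background jobs.
alloc = cpu_split({"trade": 2048, "batch": 1024, "log": 1024}, total_cores=16)
print(alloc)  # trade gets half of the 16 cores: 8.0
```

Note that `cpu.shares` only takes effect under contention; an idle machine still lets any container use spare cycles, which is what makes flexible allocation safe for density.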
InfoQ: There are only a few Docker deployments of this scale anywhere in the world. What did you gain and what did you lose in the process of containerization?
Bao Yongcheng: Through the software-defined data center and large-scale container cluster scheduling, the JD Elastic Compute Cloud achieves unified management of massive computing resources while meeting performance and efficiency requirements. Self-service deployment efficiency has improved, and higher application deployment density has raised resource utilization, saving a great deal of hardware.
There were also costs, such as kernel upgrades forced by bugs. After full containerization, the versions of the underlying dependent systems are highly consistent, so an underlying bug can easily be amplified at scale. Fortunately, we have always placed great importance on building the kernel team, and kernel release versions, kernel hot patches, and other measures have effectively solved such problems.
InfoQ: What will you be highlighting at CNUTCon 2016?
Bao Yongcheng: I will focus on software-defined data centers, large-scale container cluster scheduling, and Docker container performance tuning.
InfoQ: Thank you, Bao Yongcheng, for joining us. We look forward to seeing you at the CNUTCon 2016 Global Container Technology Conference.
Thanks to Guo Lei for correcting this article.