JD Dongdong architecture evolution
What is Dongdong? Dongdong is to JD.com what Wangwang is to Taobao: both serve the communication between buyers and sellers. Dongdong was born after JD.com opened its platform to third-party sellers. Let's start by looking at what it looked like when it was born.
1.0 Birth (2010-2011)
The technical architecture of version 1.0 was deliberately crude and simple in order to bring the business online quickly. How crude and simple? See the architecture diagram below.
The functionality of 1.0 was very simple: it implemented the basic features of an IM system, namely access, message exchange, and presence status. On top of that came customer-service features: customers requesting a consultation were assigned to online customer-service agents in a round-robin fashion. The open-source Mina framework was used for TCP long-connection access, and the Tomcat Comet mechanism provided the HTTP long-polling service. Message delivery worked by temporarily storing a sent message in Redis, with the receiving end pulling it in a producer-consumer fashion.
This model required each node to poll Redis at high frequency for the session messages belonging to its own connections. The model is simple, and simple means several things: simple to understand, simple to develop, and simple to deploy. A single Tomcat application depending on one shared Redis was enough to realize the core business functions and get the business online quickly.
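As a rough illustration of that pull model (the key naming and the Jedis client here are assumptions, not the original code), each Tomcat node might run a loop like the following for every connected session:

```java
import redis.clients.jedis.Jedis;

// Rough illustration of the 1.0 pull model (assumed key names and client):
// each node polls Redis at a fixed interval for messages belonging to the
// sessions of its own connections.
public class PollingWorker implements Runnable {

    private final String sessionId;

    public PollingWorker(String sessionId) {
        this.sessionId = sessionId;
    }

    @Override
    public void run() {
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            while (!Thread.currentThread().isInterrupted()) {
                // The sending side does RPUSH session:msgs:<sessionId>; we pop from the left.
                String msg = jedis.lpop("session:msgs:" + sessionId);
                if (msg != null) {
                    pushToConnection(msg);   // write to the long connection / Comet response
                } else {
                    sleepQuietly(100);       // ~100 ms poll interval, even for idle sessions
                }
            }
        }
    }

    private void pushToConnection(String msg) { /* write to Mina session or Comet response */ }

    private void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```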
But this simple model has serious flaws, chiefly in efficiency and scalability. The polling interval essentially determines message latency: the faster the polling, the lower the latency, but also the higher the cost. It is a high-power, low-efficiency model, because inactive connections keep doing high-frequency, meaningless polling. How high is the frequency? Basically under 100 ms; you cannot slow polling down to, say, more than 2 seconds, or people notice obvious delay while chatting. And as the number of online users grows, the total polling work grows linearly with it. The model therefore scales poorly and carries little load, and it was bound to hit a performance bottleneck as online users increased.
The backdrop of 1.0 was JD's migration of its technology platform from .NET to Java. It was during this period that I joined JD and took part in the technical transformation and architecture upgrade of the main JD website. Afterwards I took over JD Dongdong, kept improving the product, and led three rounds of technical architecture evolution.
2.0 Growth (2012)
When we first took over, 1.0 was already running online and supporting JD's POP (open platform) business. JD then planned to build its own online customer-service team, based in Chengdu. Both the first-party and the POP customer-service businesses were still in their infancy, and the performance and efficiency flaws of the 1.0 architecture had not yet reached the business tipping point. The in-house customer-service team was also just getting started: there were not enough agents, and their service capacity fell far short of the volume of customer inquiries. For inquiries beyond that capacity, the system at the time returned a uniform message: customer service is busy, please try again later. As a result, at peak hours many customers could not reach customer service no matter how often they retried, which made for a poor experience. So 2.0 focused on improving the business features and the experience, as shown in the figure below.
Customers who cannot be served right away can now queue or leave a message. Beyond plain text, richer content such as files and pictures is supported. In addition, transfer between agents and quick replies were added to improve agents' reception efficiency. In short, the whole of 2.0 was about improving customer-service efficiency and user experience. The efficiency problem we had worried about did not bite during the 2.0 growth years, but the volume kept building and we knew it would eventually explode. At the end of 2012, after Singles' Day, the major architecture upgrade to 3.0 began.
3.0 Outbreak (2013-2014)
After a year of rapid business growth in the 2.0 era, the code base expanded very quickly, and the team ballooned along with it, from the initial four people to nearly 30. With a large team, many developers of different levels work on one system: specifications are hard to unify, modules become heavily coupled, changes require constant communication and create dependencies, and online risk becomes hard to control. The model of a single Tomcat application deployed as multiple instances had finally reached its end, and the theme of this architecture upgrade was servitization.
The first problem of servitization was how to split one large application into sub-service systems. The background at the time was that JD's deployment was still semi-automated and the automatic deployment system had only just started; if services were split too finely along business lines, the deployment workload would have been heavy and hard to manage. So instead of partitioning services by business function, we divided the sub-services by business importance into levels 0, 1, and 2. In addition, an independent set of access services was provided for terminals with different channels and communication modes, as shown in the figure below.
The more detailed application services and architecture layering are shown in the figure below.
This major architecture upgrade mainly considered three aspects: stability, efficiency, and capacity. We did the following:
1. Business classification: isolate core from non-core business
2. Multi-datacenter deployment: traffic diversion, disaster-recovery redundancy, and peak-capacity redundancy
3. Database reads from multiple sources (read replicas)
4. Active/standby write databases, temporarily settling for the fast switchover the business can tolerate
5. External interfaces: failover or fail fast
6. Redis in active/standby mode
7. Large-table migration: MongoDB replaces MySQL for storing message records
8. An improved message delivery model
The first six items are essentially improvements to system stability and availability, completed in successive iterations. The configuration and control functions behind many of the failover behaviors are provided by the control center in the figure above. Item 7 uses MongoDB to store the chat records, the largest store, whose daily message volume kept growing with the business. Item 8 addresses the inefficient message polling of version 1.0; the improved delivery model is shown in the figure below:
The model no longer polls. Instead, each terminal registers the location of its access point once a connection is established; before delivery, the sending side looks up that location and pushes the message to it. Delivery cost is constant, and scaling out is easy: the more users online, the more connections, so you simply add access points. This model still has a minor problem, mainly around handling offline messages; think about it for a moment, and we will come back to it at the end.
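A rough sketch of this register-then-push model (the in-memory registry and method names are illustrative assumptions, not the actual JD implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the register-then-push delivery model.
public class DeliveryRouter {

    // userId -> access-point address, registered when the terminal connects
    private final Map<String, String> accessPointByUser = new ConcurrentHashMap<>();

    // Called by the access layer when a terminal establishes a connection.
    public void register(String userId, String accessPointAddr) {
        accessPointByUser.put(userId, accessPointAddr);
    }

    // Called when a terminal disconnects.
    public void unregister(String userId) {
        accessPointByUser.remove(userId);
    }

    // Look up the receiver's access point and push; fall back to offline storage if absent.
    public void deliver(String toUserId, String message) {
        String accessPoint = accessPointByUser.get(toUserId);
        if (accessPoint != null) {
            pushToAccessPoint(accessPoint, toUserId, message); // constant cost, no polling
        } else {
            storeOffline(toUserId, message);                   // picked up on next login/reconnect
        }
    }

    private void pushToAccessPoint(String accessPoint, String userId, String message) { /* RPC to access point */ }

    private void storeOffline(String userId, String message) { /* persist as offline message */ }
}
```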
After two years of iterative upgrades, 3.0 could have kept supporting growth for a long time purely in terms of business volume. But by the end of 2014 we were no longer facing a volume problem; we were facing a change in the business model, and that led directly to a whole new era.
4.0 Nirvana (2015-present)
In 2014, JD's organizational structure changed greatly, from a single company into a group with multiple subsidiaries. The original mall became one of the subsidiaries, and newly established subsidiaries included JD Finance, JD Smart, JD Home, Paipai, the overseas business division, and so on. Each has a different business scope and business model, but every business needs customer service. How to reuse the Dongdong customer-service system, originally customized for the mall, and let the other subsidiaries' businesses plug in quickly became our new topic.
Paipai, the first to require access, had been acquired from Tencent, so it had a completely different account and order/trading system. Given the time constraints, we stripped out the parts customized for the mall, customized a separate set based on the 3.0 architecture, and deployed it independently, as shown below.
Although the launch was completed before the deadline the business required, it also brought obvious problems:
1. Copied code bases with per-business customization: maintaining multiple sets of source code is costly
2. Independent deployment: at least an active/standby pair of machine rooms plus one grayscale cluster per business wastes a lot of resources
Previously our architecture was organized around a single business; with the new business landscape we began to think in terms of a platform: run multiple businesses on one unified platform, with unified source code, unified deployment, and unified maintenance. Business services were split further, separating out the most basic IM service, a general IM service, and a general customer-service service, with only minimal custom development for each special need. In this deployment model, the services of different businesses run on the same platform but their data is isolated from one another. Services continue to be broken down into finer granularity, forming a service matrix (see the figure below).
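One simple way to realize this kind of per-business data isolation on a shared platform, shown purely as an assumed sketch rather than the actual Dongdong scheme, is to carry a tenant (business) identifier through every call and scope all storage keys by it:

```java
// Assumed sketch of per-business data isolation on a shared platform
// (the tenant id and key scheme are illustrative, not the actual design).
public class TenantScopedMessageStore {

    // Every request carries the id of the business it belongs to
    // (mall, Paipai, finance, ...); all keys are scoped by that id,
    // so businesses share the platform but never see each other's data.
    public void saveMessage(String tenantId, String msgId, String body) {
        String key = "msg:" + tenantId + ":" + msgId;
        // put(key, body)  -- write to the shared storage under the tenant-scoped key
    }

    public String loadMessage(String tenantId, String msgId) {
        String key = "msg:" + tenantId + ":" + msgId;
        // return get(key);
        return null;
    }
}
```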
The deployment then only needs two peer clusters across dual machine rooms plus one small grayscale-release cluster; all the different businesses run on this unified platform cluster, as shown in the figure below.
Finer-grained services mean that each service is simpler to develop, has less code and fewer dependencies, and enjoys better isolation and stability. On the other hand, finer granularity also means more complicated operations, monitoring, and management. It is only in recent years, as the company's internal elastic private cloud, cache cloud, message queue, deployment, monitoring, logging, and other infrastructure systems have matured, that it has become possible to run such a fine-grained microservice architecture at a controllable operations cost. From one application in 1.0, to six or seven applications in 3.0, to 50+ finer-grained applications in 4.0, each application gets a different number of instance processes according to its traffic, and the actual number of instances now exceeds 1,000. To better monitor and manage these processes, a service-oriented operations management system was built, as shown in the figure below.
The unified service operations layer provides practical internal tools and libraries that help us develop more robust microservices, including centralized configuration management, traffic instrumentation and monitoring, database and cache access, and runtime isolation. Below is a schematic of runtime isolation:
Fine-grained microservices give inter-process isolation, while strict development specifications and toolsets help implement asynchronous messaging and asynchronous HTTP, avoiding long synchronous call chains across multiple processes. An internal service-enhancement container, Armor, provides thread-level isolation in an aspect-oriented way and supports per-business degradation and synchronous-to-asynchronous conversion within a process (a rough sketch of this kind of thread isolation follows the list below). All of these tools and library services serve two goals:
1. Make the runtime status of the server process visible
2. Make the runtime state of the server process manageable and changeable
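Thread-level isolation with degradation of the kind described above can be sketched roughly as follows, using a plain bounded thread pool rather than Armor's actual API (which is internal and not shown here); the pool size, timeout, and method names are assumptions:

```java
import java.util.concurrent.*;

// Rough illustration of per-business thread-pool isolation with degradation,
// similar in spirit to what a service-enhancement container provides.
public class IsolatedCommand {

    // Each business gets its own bounded pool, so one slow dependency
    // cannot exhaust the threads of the whole process.
    private static final ExecutorService ORDER_POOL = Executors.newFixedThreadPool(8);

    public String queryOrderSummary(String orderId) {
        Future<String> future = ORDER_POOL.submit(() -> callOrderService(orderId));
        try {
            // Bounded wait: degrade instead of blocking the caller indefinitely.
            return future.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);
            return fallbackSummary(orderId); // business degradation path
        }
    }

    private String callOrderService(String orderId) { /* remote call */ return "..."; }

    private String fallbackSummary(String orderId) { return "order summary unavailable"; }
}
```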
Finally, let us return to the question left open earlier about the shortcoming of the message delivery model. Originally, the access layer detected that a terminal had disconnected and the message could not be delivered, cached the message, and had the terminal pull offline messages after reconnecting. This model performs poorly in the mobile era: mobile networks are unstable, causing frequent disconnects and reconnects, while accurate disconnection detection depends on network timeouts, so detection can be inaccurate and a delivery can be falsely reported as successful. The new model, shown in the figure below, no longer relies on accurate connection detection. Before delivery, the message ID is cached and the message body is persisted; only after the terminal returns an acknowledgement is the message considered delivered. Unacknowledged message IDs are pushed as offline messages after the next login or reconnect. This model never loses a message because of a falsely detected disconnect, but it may deliver duplicates, which the client simply de-duplicates by message ID.
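A minimal sketch of this ack-based delivery with client-side de-duplication, assuming in-memory stores and illustrative method names rather than the real implementation:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the ack-based delivery model described above.
public class AckBasedDelivery {

    // Server side: message IDs delivered but not yet acknowledged by the terminal.
    private final Map<String, String> pendingAck = new ConcurrentHashMap<>(); // msgId -> userId

    public void deliver(String msgId, String toUserId, String body) {
        persistBody(msgId, body);          // message body persisted before delivery
        pendingAck.put(msgId, toUserId);   // message ID cached until acknowledged
        pushToTerminal(toUserId, msgId, body);
    }

    // Terminal confirms receipt; only now is the message considered delivered.
    public void onAck(String msgId) {
        pendingAck.remove(msgId);
    }

    // Unacknowledged messages are re-pushed on re-login/reconnect as offline messages;
    // duplicates are possible and are filtered on the client by message ID.
    public static class DedupClient {
        private final Set<String> seen = ConcurrentHashMap.newKeySet();

        public void onMessage(String msgId, String body) {
            if (seen.add(msgId)) {
                render(body);  // first time we see this ID
            }                  // otherwise it is a duplicate re-push; ignore
        }

        private void render(String body) { /* show in the chat window */ }
    }

    private void persistBody(String msgId, String body) { /* write to durable store */ }
    private void pushToTerminal(String userId, String msgId, String body) { /* push via access point */ }
}
```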
Dongdong was born while JD's technology platform was being migrated to Java, and it has come a long way over the years: from grassroots to professional, from small to large scale, from scattered to unified, from disorderly to standardized. This article has focused on the evolution of Dongdong's architecture over those years. In my view, there is no absolute good or bad in technical architecture; an architecture should always be judged in the context of its time, weighing the time-sensitivity of the business, the size and capability of the team, the surrounding infrastructure, and so on. The best results come when the lifecycle of the architecture evolution matches the lifecycle of the business.
JD peak-traffic system design
Unlike social networking, search, and gaming sites, e-commerce sites see user traffic that is heavily driven by operations and marketing activity. In the US and Europe, Black Friday and Cyber Monday mark the peak of holiday spending. The main sources of peak e-commerce traffic are panic buying, promotions, and malicious attacks, especially large-scale promotions such as JD.com's 618 anniversary sale and Double 11. Under high traffic and high concurrency, how to guarantee the reliability and stability of the whole system is a question many e-commerce R&D teams think hard about.
The design of JD's e-commerce systems focuses on stability, reliability, high concurrency, and scalability. How do we ensure users get a smooth experience and the system shows no anomalies at peak times? Let's first look at some characteristics of JD's systems (Figure 1).
Figure 1. The system architecture is huge and complex
JD runs a wide variety of businesses involving tens of millions of SKUs, which makes the system huge; externally it must connect to suppliers, consumers, and third-party merchants. Internally it covers almost every link of the commodity supply chain except product design and manufacturing: login, trading, back office, supply chain, warehousing and distribution, customer service, and so on. All of this involves thousands of systems, large and small, forming an enormously complex whole. In addition, the functional modules of JD's systems interact strongly and are strongly interrelated, so any modification has to be made with extreme caution. All optimization schemes therefore start from the premise of maintaining system stability.
To relieve peak pressure as much as possible on top of such a complex system, JD's peak-traffic system design proceeds along four lines: performance improvement, traffic control, disaster-recovery degradation, and stress testing with contingency plans.
Performance improvement
Splitting the business systems
We first split the whole business system into several relatively independent subsystems, such as SSO, the trading platform, the POP platform, the order pipeline system, WMS, and warehousing and distribution (Figure 2). Each subsystem can then be subdivided further, step by step, down to units that can be operated and optimized. For example, the trading platform includes price, shopping cart, checkout, payment, and the order center; the website system includes the home page, login, list and channel pages, product detail pages, and search. Next, the key parts of each functional module are carved out for performance optimization.
Figure 2. Business splitting scheme
Take the flash-sale (seckill) system in trading as an example. It was originally rooted in the ordinary trading system, and the drawback was obvious: when traffic surged, not only did the seckill system itself slow down, it also disrupted the normal operation of ordinary trading. So we physically separated it from the other business systems into a relatively independent subsystem. For the seckill scenario we also reduced its dependence on backend storage, optimized the middle-tier storage so that hotspots are spread across deployments, and even supported multi-point deployment of a single SKU, which greatly improved the throughput and reliability of the seckill system.
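One common way to spread a hot SKU across multiple points, given here as an assumed illustration rather than JD's actual scheme, is to shard its stock counter across several slots (or cache nodes) and deduct from a randomly chosen shard:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// Assumed illustration: a hot SKU's stock is split across shards so that
// no single counter (or the node holding it) becomes a hotspot.
public class ShardedStock {

    private final AtomicLongArray shards;

    public ShardedStock(int shardCount, long totalStock) {
        shards = new AtomicLongArray(shardCount);
        long perShard = totalStock / shardCount; // remainder ignored for simplicity
        for (int i = 0; i < shardCount; i++) {
            shards.set(i, perShard);
        }
    }

    // Try to deduct one unit; start at a random shard and probe the others if it is empty.
    public boolean tryDeduct() {
        int start = ThreadLocalRandom.current().nextInt(shards.length());
        for (int i = 0; i < shards.length(); i++) {
            int idx = (start + i) % shards.length();
            long remaining = shards.get(idx);
            while (remaining > 0) {
                if (shards.compareAndSet(idx, remaining, remaining - 1)) {
                    return true;              // got one unit from this shard
                }
                remaining = shards.get(idx);  // lost the race; re-read and retry
            }
        }
        return false; // sold out across all shards
    }
}
```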
Distributed deployment
Distributed trading systems are the future of e-commerce. A distributed system solves two major problems: it improves user experience and it enhances fault tolerance. A well-designed distributed system leaves enough headroom for traffic growth, so when one data center is saturated the remaining traffic can be shifted to other data centers, with the centers backing each other up. At the same time, because users are served from nearby sites, network latency drops and pages respond faster. Google search, for example, is a global service with different IP addresses across Europe, Asia, and the US; when one of them fails, Google can easily switch the service to the nearest working site and keep serving searches. E-commerce is more complicated: the data to be synchronized must be more precise, the volume is larger, the tolerance for delay is lower, and the build-out cycle is longer. JD is working hard on this, starting from read-only systems and moving toward a fully distributed system step by step.
API service
Across systems there are always many shared components. Front-end load balancing goes without saying; middleware is another typical example. How can these components be managed efficiently and uniformly? API servitization is our answer. Ideally, a well-trained team manages these components centrally and exposes them as interface services, hiding the complexity of the underlying software behind a straightforward API. Having specialists handle the complex logic to guarantee availability and scalability greatly reduces the probability of errors while achieving economies of scale.
Redis is a typical example of a common caching component. In the past, each business team maintained its own Redis, which was neither professional enough nor always used properly. Later we centralized its management, developed custom features and upgrades, and offered it to users at every level through an API service. This not only enriched the application scenarios but also improved performance and reliability.
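As an assumed illustration of such API servitization (the interface and class names are hypothetical, not JD's actual cache-cloud API), the owning team can expose a narrow client interface so that callers never touch connection pools, failover, or key conventions directly:

```java
// Hypothetical cache-service API: callers depend only on the narrow interface;
// the owning team hides connection pooling, active/standby failover, and
// monitoring behind it and can change the Redis cluster without touching business code.
public interface CacheClient {
    String get(String key);
    void set(String key, String value, int ttlSeconds);
    boolean delete(String key);
}

class CacheClients {
    public static CacheClient forNamespace(String namespace) {
        // Resolve the namespace to the right cluster via central configuration.
        return new RedisBackedCacheClient(namespace);
    }
}

class RedisBackedCacheClient implements CacheClient {
    private final String namespace;

    RedisBackedCacheClient(String namespace) { this.namespace = namespace; }

    @Override public String get(String key) { /* redis GET namespace:key */ return null; }
    @Override public void set(String key, String value, int ttlSeconds) { /* redis SETEX namespace:key */ }
    @Override public boolean delete(String key) { /* redis DEL namespace:key */ return true; }
}
```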
Architecture and code optimization
A sound e-commerce system architecture is closely tied to a company's R&D capability and technical management maturity, and it directly determines how much peak traffic can be carried and how far the system can grow. Choosing a framework suited to the team both unleashes its efficiency and saves resources. Code optimization can also improve performance: SQL optimization to make better use of indexes; Java/C++ logic optimization to remove unnecessary loops and complex operations; algorithmic optimization for efficiency; cleaner and clearer implementation logic; and so on. But code optimization cannot break through hard limits; there is little point in chasing the extreme, and good enough is good enough.
System virtualization and elasticity
When disk I/O is not the bottleneck, horizontal scaling becomes much easier: with effective performance monitoring, ZooKeeper or ZooKeeper-like software can string the instances together organically. When transaction processing becomes the bottleneck, today's popular virtualization technologies, such as LXC or VMs, allow automatic elastic scaling without human intervention.
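A minimal sketch of wiring instances into such monitoring with ZooKeeper (an assumed setup; the connection string, paths, and timeouts are illustrative): each instance registers an ephemeral node, so a scaling controller always sees the set of live instances and can react when nodes appear or disappear.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: each instance registers an ephemeral node in ZooKeeper so a
// monitoring/scaling controller can watch live capacity. Assumes the parent
// path /services/trade already exists.
public class InstanceRegistrar {

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

        String instanceInfo = "10.0.0.12:8080";  // illustrative host:port of this instance
        // Ephemeral + sequential: the node disappears automatically if the instance dies.
        String path = zk.create("/services/trade/instance-",
                instanceInfo.getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered as " + path);

        // A controller elsewhere watches /services/trade and triggers scale-out
        // (e.g. starting new LXC containers or VMs) when load per live instance
        // crosses a threshold.
        Thread.sleep(Long.MAX_VALUE);  // keep the session (and the ephemeral node) alive
    }
}
```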