Transactions are central to Yanxuan’s e-commerce business. As the business has grown, transaction scenarios have become increasingly customized and differentiated, and more third-party payment partners are being integrated. The central question in designing and operating a distributed transaction architecture is how to keep transaction services secure and stable while still making them extensible and flexible.
InfoQ interviewed Ma Chao, technical manager at NetEase Yanxuan and a speaker at the ArchSummit Global Architect Technology Summit, about Yanxuan Mall’s distributed architecture practice in the transaction flow, covering the iteration of the core data model and the evolution of the service architecture. (Ma Chao will also present “Avenue to Simplicity – Yan Xuan After-sales Service Architecture Evolution Practice” at ArchSummit in Shenzhen in July.)
(To show the full iteration of Yanxuan’s technology, the conversation below has been rewritten in the first person without changing its original meaning.)
Yanxuan defines a transaction as the dynamic process through which buyer and seller reach a contract, rather than simply equating it with payment. In most e-commerce settings, apart from special cases such as cash on delivery, the contract is concluded when an order is successfully paid. The transaction architecture therefore has to support the two core e-commerce steps: placing an order and paying for it.
In the early “slash-and-burn” stage, Yanxuan Mall had low business volume and a small, fairly uniform catalog, and users followed a single path from shopping cart to order to payment. When the settlement page generated an order, a payment record was stored redundantly in the database so that the payment amount and the order settlement amount stayed consistent. A simple payment service then acted as the intermediary to WeChat, Alipay and NetEase Pay, guided the user through payment on the client, and finally synchronized the status of the payment record back to the order. The architecture was simple and direct, and it worked efficiently.
The next stage was one of painful but necessary surgery. As the business grew, Yanxuan’s transaction scenarios became more diverse and differentiated. On the payment side, features such as joint-login account systems brought in more third-party payment institutions, and we found that their integration standards differed widely, and sometimes even their interaction patterns. At the same time, independent business modules such as enterprise purchasing and group buying were growing quickly. Their order life cycles and rules differed and their storage was scattered, so it was hard to map them to the payment records in the original payment service. In the original architecture several functional modules were mixed together, which made it bloated and hard to extend, and online quality could not be guaranteed. We therefore quickly restructured and split the service into a payment system and a cashier system. The scope of the payment system was narrowed to managing the payment status of main-site orders, while integration with external payment institutions, payment collection, refunds and related business were handed to the cashier system. Orders and payment information are now associated through a unique internal transaction serial number.
Yanxuan cashier architecture diagram
The upgrade made flexible configuration possible: different accounts can see different customized cashier pages on different product modules and terminals, which meets the needs of the upstream business teams. The cashier also shields the differences and complexity of third-party integration, and by design unifies payment notification and refund into an asynchronous callback model, which reduces the integration effort for upstream businesses.
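One way to picture how a cashier layer can hide channel differences is an adapter per channel that normalizes each provider’s notification into one internal event, delivered upstream through a single callback. The sketch below is illustrative only: the class names, payload fields and channels are hypothetical and do not reflect Yanxuan’s code or any real payment provider’s API.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Dict


@dataclass(frozen=True)
class PaymentEvent:
    """Channel-independent payment result handed to upstream business code."""
    trade_no: str      # internal transaction serial number
    paid_cents: int
    success: bool


class ChannelAdapter:
    """Each channel converts its own payload into a PaymentEvent."""
    def parse(self, payload: dict) -> PaymentEvent:
        raise NotImplementedError


class ChannelA(ChannelAdapter):
    # Hypothetical channel: amount in cents, status as a string.
    def parse(self, payload: dict) -> PaymentEvent:
        return PaymentEvent(payload["out_trade_no"],
                            int(payload["total_fee"]),
                            payload["result"] == "SUCCESS")


class ChannelB(ChannelAdapter):
    # Hypothetical channel: amount as a decimal string, status as a boolean.
    def parse(self, payload: dict) -> PaymentEvent:
        cents = int(Decimal(payload["amount"]) * 100)
        return PaymentEvent(payload["order_ref"], cents, payload["paid"] is True)


class Cashier:
    def __init__(self, on_result: Callable[[PaymentEvent], None]):
        self._adapters: Dict[str, ChannelAdapter] = {}
        self._on_result = on_result   # one callback shared by every channel

    def register(self, channel: str, adapter: ChannelAdapter) -> None:
        self._adapters[channel] = adapter

    def handle_notification(self, channel: str, payload: dict) -> PaymentEvent:
        event = self._adapters[channel].parse(payload)
        self._on_result(event)        # upstream code only ever sees PaymentEvent
        return event
```

Adding a new channel in this sketch means registering one more adapter; the upstream callback does not change.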
In the middle stage, while the payment flow was being restructured, differences between products also began to affect the ordering and transaction flow, for example product category attributes and freight. Take gift cards: a gift card is a special product that, once purchased and bound, can be used as funds in later transactions. Electronic gift cards need no fulfillment or delivery, but gift cards have to be connected to an additional card-printing service. In addition, whether real-name authentication is required during a gift card purchase depends on the amount. In the initial architecture we special-cased such products at every step of the transaction, which created deep coupling. As virtual goods (point cards, phone credit) and non-standard products (glasses) kept appearing, the technical team upgraded the original architecture and abstracted the concept of a transaction template. A template provides customization of the settlement page, for example whether gift cards can be used, whether coupons or red packets can be used, and whether additional information must be filled in. Orders generated through different transaction templates carry different transaction identifiers, which downstream business modules and the order center use for processing.
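The transaction template idea can be sketched as a small configuration object that the settlement step consults instead of special-casing each product. This is a minimal illustration under assumptions: the field names, template names and rules below are hypothetical and are not taken from Yanxuan’s implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TransactionTemplate:
    trade_tag: str                       # identifier attached to generated orders
    allow_gift_card: bool = True         # may the buyer pay with gift card funds?
    allow_coupon: bool = True            # may the buyer use coupons / red packets?
    extra_fields: Tuple[str, ...] = ()   # extra information the buyer must supply


# Hypothetical templates, for illustration only.
TEMPLATES = {
    "standard": TransactionTemplate(trade_tag="STD"),
    "gift_card": TransactionTemplate(trade_tag="GIFT_CARD",
                                     allow_gift_card=False, allow_coupon=False),
    "glasses": TransactionTemplate(trade_tag="GLASSES",
                                   extra_fields=("prescription",)),
}


def create_order(template_name: str, form: dict) -> dict:
    """Validate the settlement form against the template and tag the order."""
    tpl = TEMPLATES[template_name]
    missing = [name for name in tpl.extra_fields if name not in form]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if form.get("gift_card_cents", 0) and not tpl.allow_gift_card:
        raise ValueError("gift cards cannot be used with this template")
    if form.get("coupon_id") and not tpl.allow_coupon:
        raise ValueError("coupons cannot be used with this template")
    # Downstream modules can branch on trade_tag rather than on product type.
    return {"trade_tag": tpl.trade_tag, "form": dict(form)}
```

For example, `create_order("glasses", {"prescription": "-2.00"})` returns an order tagged `GLASSES`, while the same call without the `prescription` field raises `ValueError`.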
At present, in the “flowering” stage, we are turning gift cards into a platform that provides configurable consumption and reversal strategies, so that, together with the payment system and the cashier system, it forms Yanxuan’s matrix of basic transaction services. We are also considering extending the transaction template so that it connects the order settlement flow with this basic service matrix and supports transaction scenarios under different business models through platform configuration.
Looking ahead, Yanxuan plans to build further on the existing transaction architecture and form a complete Yanxuan trading center platform. The platform will include general transaction capabilities such as the shopping cart service, the ordering service and the payment service, and will also provide full transaction orchestration to give upstream businesses more reliable support. As the Yanxuan model expands offline, we are also actively exploring offline transaction modes, integrating online and offline data to give consumers a better shopping experience.
Throughout the evolution of Yanxuan’s distributed architecture, the system has met its high-availability and high-throughput requirements well. On the other hand, as at most Internet companies, data consistency and resource balancing are common problems in a distributed architecture. In solving them, Yanxuan gradually built up its own middleware.
Let’s start with distributed locks. Dlock is Yanxuan’s distributed lock middleware. It is reentrant, supports blocking and timeouts, and is designed for high availability, high performance, low cost and flexible scalability (the overall architecture is shown in the figure below). It has two main design goals:
- Synchronize access to resources in a distributed environment, covering multi-machine, multi-process, multi-thread and single-thread scenarios.
- Replace database locks, which have high overhead, high cost and poor scalability.
Dlock can run on several kinds of backing infrastructure, such as Redis, Memcached and ZooKeeper, and its capacity can be scaled as needed.
Dlock supports three lock reentrancy policies: thread-level, process-level and distributed reentrancy. These policies cover the distributed locking needs of Yanxuan’s more complex business scenarios.
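To make thread-level reentrancy, blocking and timeouts concrete, here is a minimal sketch of a lock built on a set-if-absent primitive with expiry, the kind of operation that stores such as Redis offer. It is not Dlock’s implementation: `InMemoryStore` is a hypothetical single-process stand-in for the shared store, and all class and method names are assumptions.

```python
import threading
import time
import uuid


class InMemoryStore:
    """Stand-in for a shared store; a real deployment would use a remote one."""
    def __init__(self):
        self._data = {}                  # key -> (owner token, expiry time)
        self._mutex = threading.Lock()

    def set_if_absent(self, key, owner, ttl):
        now = time.monotonic()
        with self._mutex:
            current = self._data.get(key)
            if current is None or current[1] <= now:   # free or expired
                self._data[key] = (owner, now + ttl)
                return True
            return False

    def delete_if_owner(self, key, owner):
        with self._mutex:
            current = self._data.get(key)
            if current is not None and current[0] == owner:
                del self._data[key]
                return True
            return False


class ReentrantLock:
    """Thread-level reentrant lock with blocking acquire and a timeout."""
    def __init__(self, store, key, ttl=10.0):
        self._store, self._key, self._ttl = store, key, ttl
        self._local = threading.local()  # per-thread hold count and token

    def acquire(self, timeout=1.0, poll=0.01):
        count = getattr(self._local, "count", 0)
        if count > 0:                    # re-entry by the holding thread
            self._local.count = count + 1
            return True
        token = uuid.uuid4().hex
        deadline = time.monotonic() + timeout
        while True:                      # block until acquired or timed out
            if self._store.set_if_absent(self._key, token, self._ttl):
                self._local.token, self._local.count = token, 1
                return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)

    def release(self):
        count = getattr(self._local, "count", 0)
        if count == 0:
            raise RuntimeError("lock is not held by this thread")
        self._local.count = count - 1
        if self._local.count == 0:       # only the outermost release frees the key
            self._store.delete_if_owner(self._key, self._local.token)
```

The sketch does not renew the expiry on re-entry, so a holder that runs longer than `ttl` can lose the lock; a production lock would need a renewal or fencing strategy.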
Next, distributed transactions. To keep data consistent across distributed systems during a transaction, Yanxuan developed its own distributed transaction middleware, DTS. DTS follows the BASE model and implements a typical TCC (Try/Confirm/Cancel) transaction. The overall DTS architecture is shown below:
DTS optimizes two-phase commit with what is called last-participant optimization (LPO): the initiator of the distributed transaction does not go through two phases but only one. Once all other participants have completed the first phase, the result of the initiator’s commit directly determines whether the distributed transaction succeeds.
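The single-phase initiator commit (LPO) described above can be sketched in a few lines. This is an illustration of the general TCC pattern, not DTS itself: the `Participant` class, the `run_transaction` function and their signatures are hypothetical.

```python
class Participant:
    """Hypothetical TCC participant: reserve in try_, then confirm or cancel."""
    def __init__(self, name, fail_try=False):
        self.name, self.fail_try, self.state = name, fail_try, "init"

    def try_(self):
        if self.fail_try:
            raise RuntimeError(f"{self.name}: try failed")
        self.state = "tried"

    def confirm(self):
        self.state = "confirmed"

    def cancel(self):
        self.state = "cancelled"


def run_transaction(participants, initiator_commit):
    """Run TCC in which the initiator commits locally in a single phase.

    `initiator_commit` is a callable that performs the initiator's local
    commit and returns True on success. It runs only after every other
    participant has completed Try, and its result decides the outcome.
    """
    tried = []
    try:
        for p in participants:           # phase 1: other participants reserve
            p.try_()
            tried.append(p)
        committed = bool(initiator_commit())   # initiator's one-phase commit
    except Exception:
        committed = False                # simplification: treat errors as failure
    for p in tried:                      # phase 2 follows the initiator's result
        if committed:
            p.confirm()
        else:
            p.cancel()
    return committed
```

Treating an exception from the initiator’s commit as a failure is a simplification; in a real system the commit outcome may be unknown after an error and would have to be resolved by a recovery process before participants are confirmed or cancelled.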
To intercept abnormal traffic such as order brushing and attacks, and to place reasonable limits on normal user traffic, Yanxuan developed Eudemon, a middleware for rate limiting and emergency circuit breaking. Eudemon lifts the abstraction of access to resources such as CPU, memory and DB up to the level of method invocations and page visits, and protects those resources from overload indirectly by limiting method invocations and page visits. The overall architecture of Eudemon is as follows:
Eudemon controls traffic at two levels:
- Interception of brushing and attack traffic: by analyzing the behavioral characteristics of traffic sources, more than 99% of brushing and attack traffic can be intercepted. This keeps black-market traffic from competing with normal traffic for system resources and reduces pressure on the downstream risk control system.
- Flow control and circuit breaking for normal traffic: normal user traffic is limited to a reasonable level and peaks are smoothed out, so that inbound traffic never exceeds what the system can handle, and services are circuit-broken in emergencies.
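Limiting traffic at the level of method invocations can be illustrated with a token bucket wrapped around a function. This is a generic sketch, not Eudemon’s algorithm, which the article does not describe; `TokenBucket`, `rate_limit` and `RateLimited` are hypothetical names, and the injectable `clock` parameter exists only to make the behavior easy to test.

```python
import functools
import threading
import time


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; one token per call."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self._clock = rate, capacity, clock
        self._tokens = float(capacity)
        self._last = clock()
        self._mutex = threading.Lock()

    def try_acquire(self):
        with self._mutex:
            now = self._clock()
            elapsed = now - self._last
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False


class RateLimited(Exception):
    """Raised when a call is rejected because the bucket is empty."""


def rate_limit(bucket):
    """Decorator that rejects calls once the bucket has no tokens left."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not bucket.try_acquire():
                raise RateLimited(fn.__name__)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

A method decorated with `@rate_limit(TokenBucket(rate=100, capacity=200))` would accept bursts of up to 200 calls and a sustained 100 calls per second; rejected calls raise `RateLimited`, which the caller can turn into a degraded response.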
How is resource isolation handled? Systems are moving toward distributed architectures, and the more service nodes there are, the more a single-node failure is amplified by cascading failures. Alongside Eudemon, which focuses on inbound flow control, Yanxuan developed Aegis, a resource isolation middleware that focuses on outbound flow control.
In a distributed call chain, a service depends on multiple downstream services. If one of them becomes unavailable or slow, the current service can be affected, and an avalanche effect may even follow. A typical case we have hit in production is slow SQL: a single slow query causes database load to surge, the DB becomes essentially unavailable for a short time, and the whole service or system then becomes unavailable.
The resource isolation process of Aegis is shown in the figure above. Aegis uses a dynamic isolation mechanism that isolates resources along business dimensions. In a distributed database scenario, for example, it can isolate only the one DB node that has failed rather than the entire distributed database.
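Isolating only the failing DB node, rather than the whole database, can be pictured as a circuit breaker keyed by resource. The sketch below is a generic illustration and not Aegis’s mechanism: `ResourceIsolator`, its thresholds and the node keys are hypothetical.

```python
import time


class CircuitOpen(Exception):
    """Raised when calls to an isolated resource are rejected."""


class ResourceIsolator:
    """Tracks failures per resource key and isolates only the failing key."""
    def __init__(self, failure_threshold=3, cooldown=5.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._clock = clock
        self._failures = {}     # key -> consecutive failure count
        self._opened_at = {}    # key -> time at which the key was isolated

    def call(self, key, fn, *args, **kwargs):
        opened = self._opened_at.get(key)
        if opened is not None and self._clock() - opened < self._cooldown:
            raise CircuitOpen(key)       # fail fast without touching the resource
        # Either the key is healthy, or the cooldown elapsed and this is a trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures[key] = self._failures.get(key, 0) + 1
            if opened is not None or self._failures[key] >= self._threshold:
                self._opened_at[key] = self._clock()   # isolate (or re-isolate)
            raise
        self._failures[key] = 0          # success closes the circuit for this key
        self._opened_at.pop(key, None)
        return result
```

With keys such as `"db-node-1"` and `"db-node-2"`, repeated failures on one node cause only that node’s calls to be rejected during the cooldown, while calls to the other node continue normally.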
Yanxuan’s middleware work has always followed the principle of solving the core technical problems at the lowest cost, in the belief that technology drives business, and with all-round support centered on the business. Middleware is built in a dual-engine “open source + in-house” mode: mainly open source, supplemented by in-house development, with contributions back to open source. Standing on the shoulders of open source giants lowers Yanxuan’s R&D cost, in-house work fills the gaps in open source, and contributing back lets Yanxuan add its technical strength to the open source community.
Big promotions are the best test of an e-commerce platform’s architecture design and technical depth. They are a special kind of high-concurrency scenario: peak traffic is often several times, or even several orders of magnitude, higher than normal, and systems and services that seem stable may expose problems under extreme concurrency.
In a promotion with a short window, the transaction flow is especially important. From shopping cart to order to payment, a failure at any step can undermine the campaign. Looking back at Yanxuan’s past promotions, the main challenges for the trading system have been:
- Rational use of database resources
- Stability and timeliness of payment services
- Defense against external attacks
The database is a key component of the transaction system, which demands a high level of data consistency. We normally use database transactions to guarantee strong consistency. As the business grows, however, these transactions become bloated and their performance degrades under high concurrency. Database resources easily become the bottleneck, which can affect individual data nodes, the whole ordering service, or even the entire transaction flow. To prepare for promotions, Yanxuan breaks large transactions apart, separating core from non-core business. Non-core business is handled asynchronously with compensation mechanisms wherever possible, while the transaction granularity of core business is reduced to the minimum and implemented with distributed locks and distributed transactions. Once split, transactions are a reasonable size. In DDB (NetEase’s distributed database), an appropriate user-based partitioning policy spreads different users’ transactions across different database nodes and avoids hot spots.
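As a rough illustration of user-based partitioning, the function below maps a user ID to a database node with a stable hash. It is a generic sketch and says nothing about how DDB actually routes requests; the function name and node names are hypothetical.

```python
import hashlib
from typing import Sequence


def node_for_user(user_id: str, nodes: Sequence[str]) -> str:
    """Pick a node deterministically so one user's rows stay on one node."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Because the mapping depends only on the user ID, all of one user’s orders land on the same node, so a single-user transaction stays local to that node, while different users are spread across nodes. Simple modulo routing reshuffles most users when the node count changes, which is one reason real systems often add a level of indirection.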
The stability and timeliness of payment is another major challenge. During a promotion, if the payment service jitters or callbacks arrive late, many users can suffer losses in a short time and complaints rise. On top of enhanced monitoring, we strictly isolate and partition resources, keep some spare capacity for maneuvering, and promptly isolate unstable payment channels. An active query and retry strategy is also in place, so that the system can switch to this fallback mode when a callback fails. We also made a small client-side optimization, adding a “payment processing” waiting message to reassure users.
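The fallback from callback to active query can be sketched as follows. This is a minimal illustration under assumptions: `callback_status` and `query_channel` are hypothetical callables standing in for the callback record lookup and the channel’s query interface, and the status strings and backoff values are invented for the example.

```python
import time


def resolve_payment(trade_no, callback_status, query_channel,
                    max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Return 'PAID' or 'FAILED' if known, otherwise 'UNKNOWN'.

    callback_status(trade_no) returns the status delivered by callback,
    or None if no callback has arrived yet.
    query_channel(trade_no) actively asks the channel and returns
    'PAID', 'FAILED' or 'PENDING'.
    """
    status = callback_status(trade_no)
    if status is not None:               # normal path: the callback arrived
        return status
    delay = base_delay
    for attempt in range(max_attempts):  # fallback path: query with backoff
        status = query_channel(trade_no)
        if status in ("PAID", "FAILED"):
            return status
        if attempt < max_attempts - 1:
            sleep(delay)
            delay *= 2                   # exponential backoff between queries
    return "UNKNOWN"                     # leave for later compensation
```

An `UNKNOWN` result is where a compensation job or manual follow-up would take over; the client can show the waiting message while the queries run.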
The trading system can also be threatened by external attacks; malicious behavior such as order brushing and coupon brushing occurs from time to time. If trading interfaces are directly exposed to such attacks, trading resources are skewed toward the attackers, and in serious cases the whole service can be brought down. We therefore built a reliable shield outside the trading system to protect the interests of core users. As is common in the industry, Yanxuan’s defense starts at the entry layer, where malicious requests are identified by their access characteristics as the first line of defense. Above the business services sits an interface-level protection component, which can intervene on key interfaces in the transaction flow through rules written in MVEL syntax. Inside the core transaction business logic, a risk control query interface is generally called to analyze the user precisely. These three layers of protection effectively secure Yanxuan’s core flow.
Finally, Yanxuan’s process and strategy for safeguarding a promotion are divided into three stages.
The first is the evaluation stage, covering resource evaluation and load testing. We break down the promotion’s target metrics into throughput and response-time targets for the interfaces on key transaction paths, evaluate whether existing system capacity meets them, and draw up a scale-out plan accordingly. At the same time, we use module-level and full-link load tests to learn the service’s current metrics, analyze bottlenecks from the test reports, and draw up optimization plans.
Next is the preparation stage, in which we draw up detailed rate-limiting, degradation and disaster-recovery plans for the promotion’s transaction flow. For flash-sale (seckill) scenarios, for example, we prepare at least two inventory management schemes so that we can switch promptly if a service fails. We check every transaction step, from shopping cart to settlement page to cashier, for gaps in monitoring, prepare detection, resolution and compensation plans for each potential failure scenario, and assign each to a specific owner. Once the plans are complete, we validate them together with the test team through full-link and per-module load tests.
Last is the execution stage. Even with thorough preparation, all kinds of unexpected situations can still arise online. When a failure occurs during a promotion, the first principle is to stop the loss promptly, and the rate-limiting, circuit-breaking and other contingency plans prepared earlier may come into play. We then immediately analyze the cause, work out the lowest-cost fix, and release it to production right away.
About the author
Ma Chao is a technical manager at NetEase Yanxuan. He joined NetEase’s Mail Business Division in 2013 and took part in developing a number of NetEase Mail products. In 2015 he took charge of research and development for Yanxuan Mall. Over the past two years he has led several rounds of technical architecture transformation, supporting the rapid growth of Yanxuan’s business, and has led the team in keeping the system stable through many big promotions. He has extensive hands-on experience in high concurrency, distributed systems, business system optimization and middleware development.