Mode of transportation is one of the core factors users consider before traveling. To help users close the loop on their purchase decisions, Hornet’s Nest launched its transportation business. Users can now buy air tickets, train tickets, and other transportation products directly in Hornet’s Nest.

Like most business systems, ours has gone from nothing to rapid growth. Taking the train ticket business system as an example, this article shares, mainly from a technical perspective, some of our experience with system construction and architecture evolution at different stages of an emerging business.

Stage one: Start from scratch

At this stage, the first goal was to support the business quickly and fill the gap in our offering. For that reason, the train ticket business initially adopted a model in which suppliers purchase tickets on users’ behalf. Technically, the priority was to build the core functions in the Hornet’s Nest App: querying remaining tickets, purchasing and paying for tickets, and canceling and refunding tickets.

Technical architecture

Considering the project goals, schedule, cost, staffing, and other factors, we chose LNMP (Linux, Nginx, MySQL, PHP) as the service architecture. Physically, the system is divided into a client layer (App, H5, PC operations back office), an access layer (Nginx), an application layer (PHP programs), a middleware layer (MQ, ElasticSearch), and a storage layer (MySQL, Redis).

External dependencies mainly include the company’s internal second-party systems, such as payment, reconciliation, and the order center, as well as external supplier systems.

As shown in the figure, user-facing functionality falls into two parts: the consumer-facing App and H5, and the operations back office. Each is routed through Nginx to the PHP train application. The program has four main entry points:

  1. Facade module, called by other second-party systems

  2. Admin module, called by the operations back office

  3. Train core module, which handles App and H5 requests

  4. Callback module, which handles callbacks from external suppliers

The four entry points rely on the underlying modules to implement their functions. Outbound calls are of two kinds: calls to the facade modules of second-party systems, which cover dependencies inside the company, and calls to external suppliers. The infrastructure consists of search, message middleware, databases, caches, and so on.

This is a typical monolithic architecture: deployment is simple, the layering is clear, and it is quick and easy to implement, which suits the rapid iteration of an early-stage product. It was also built on the company’s mature PHP stack, so stability and reliability were not a major concern.

This architecture supported the train ticket business for nearly a year, and its simplicity and maintainability contributed a great deal to the business’s rapid growth. As the business progressed, however, the shortcomings of the monolith gradually became apparent:

  • With all functionality concentrated in one codebase, the cost of modifying and refactoring code keeps rising

  • As the R&D team grows, it becomes harder for many people to develop and collaborate on a single system

  • Delivery is slow and the impact of a change is hard to assess. Without a mature automated testing mechanism, this easily leads to a vicious circle of “the more we fix, the more defects we get”

  • Poor scalability: the application can only be scaled out as a whole, not split and scaled by module

  • Poor reliability: a single bug can bring down the whole system

  • Technical innovation is hindered: upgrades are difficult because any change affects the whole system

To solve these problems of the monolithic architecture, we began evolving toward a microservices architecture as the direction for subsequent system construction.

Stage two: architecture transformation and servitization

Starting in 2018, the entire transportation business began to evolve from the LNMP architecture toward servitization.

Architecture shift — from monolithic applications to servitization

As the saying goes, “a craftsman who wants to do good work must first sharpen his tools.” We therefore start with a brief introduction to the core facilities and components we built up while implementing servitization for the transportation business.

Our main shift was from PHP to Java, so technology selection centered on the Java ecosystem.

Development frameworks and components

As shown in the figure above, the development frameworks and components are divided into four layers from bottom to top. This section mainly introduces the top layer, the frameworks and components packaged for the transportation business scenario:

  • Mlang: internal toolkit for the transportation business

  • Ms-client-starter: collects and reports MES technical tracking points for the transportation business

  • Dubbo-filter: tracing of Dubbo calls

  • Mratelimit: API rate-limiting protection component

  • Deploy-starter: traffic-removal tool used during deployment

  • Tul: unified login component

  • Cat-client: unified client component that wraps and enhances calls to CAT

Infrastructure system

Servitization cannot be implemented without the support of infrastructure. On top of what the company already had, we successively built:

  • Agile infrastructure: Based on Kubernetes and Docker

  • Infrastructure monitoring and alerting: Zabbix, Prometheus, Grafana

  • Service alerting: based on ES logs and MES tracking points, plus TAlarm

  • Log system: ELK

  • CI/CD system: based on GitLab + Jenkins + Docker + Kubernetes

  • Configuration center: Apollo

  • Servitization support: Spring Boot 2.x + Dubbo

  • Service governance: dubbo-admin

  • Grayscale control

  • TOMPS: application management platform for the transportation business

  • MPC message center: based on RocketMQ

  • Scheduled task: Based on Elastic-Job

  • APM system: PinPoint, CAT

  • PHP and Java two-way communication support

With the above in place, a relatively complete DevOps and microservice development system has taken initial shape. The overall architecture is as follows:

From top to bottom, it is divided into:

  • Client layer: currently the App, H5, and WeChat mini programs

  • Access layer: traffic goes through the company’s shared Nginx and OpenResty

  • Business application layer: wireless APIs, Dubbo services, the message center, and scheduled tasks, deployed on Kubernetes + Docker

  • Middleware layer: ElasticSearch, RocketMQ, etc.

  • Storage layer: MySQL, Redis, FastDFS, HBase, etc.

In addition, the peripheral support systems include CI/CD, service governance and configuration, the APM system, the log system, and the monitoring and alerting system.

CI/CD system

  • CI is based on Sonar + Maven (dependency check, version check, build and package) + Jenkins

  • CD is based on Jenkins + Docker + Kubernetes

At present, operations permissions for the production environment have not been opened to developers; we plan to open them gradually in the new CD system.

The servitization framework Dubbo

We chose Dubbo as the distributed micro-service framework for the following reasons:

  1. It is a mature, high-performance distributed framework. Many companies use it, and it has proven its performance and stability in production;

  2. It integrates seamlessly with the Spring framework. Our architecture is built on Spring, so adopting Dubbo requires almost no intrusive code changes and is very convenient;

  3. It provides service registration, discovery, routing, load balancing, service degradation, weight adjustment, and other capabilities;

  4. It is open source, so we can customize and extend it as needed and develop on top of it ourselves.

Besides staying in close contact with the Dubbo maintainers and community, we keep enhancing and improving Dubbo ourselves, for example with log tracing based on Dubbo-filter, unified Dubbo configuration management based on the transportation business’s unified application management center, and the construction of a service governance system.

A first attempt at servitization: the ticket-snatching system

The evolution toward servitization must not be a great leap forward. Otherwise, we would merely break the application into pieces and end up with significantly higher operating costs and no benefits.

To make the evolution of the whole system smoother, we chose the ticket-snatching system for our first practical exploration. Ticket snatching is an important part of the train ticket business; it is relatively independent and conflicts little with the existing PHP e-ticket business, which makes it a good scenario for implementing servitization.

We gained some experience in splitting and designing services for the ticket-snatching system, and share the following points.

Function and Boundary

Put simply, ticket snatching means that a user places a ticket-snatching order in advance, and the system keeps trying to buy tickets for the user once they officially go on sale. In essence, it is a way of initiating seat reservation for the user as soon as tickets become available, by continuously checking the remaining tickets for the selected date and train. Once a seat is held, the handling is no different from a normal e-ticket. With this process understood, and changing the original PHP system as little as possible, we divided the functional boundaries between the two systems as follows:

In other words, after a ticket has been snatched successfully and the user has paid, ticket issuing is handed over to the PHP e-ticket system. Likewise, on the reverse side, the ticket-snatching system only implements “full refund when no ticket is snatched” and “refund of the price difference after a ticket is snatched”. Online and offline refunds of issued tickets are handled by the PHP system, which greatly reduces the development work for ticket snatching.

Service design

The design principles for services include isolation, autonomy, single responsibility, bounded context, asynchronous communication, independent deployment, and so on. Most of these are relatively easy to get right, but bounded context largely determines service granularity, a topic that cannot be avoided when splitting services. If the granularity is too coarse, the services suffer from the same problems as a monolith; if it is too fine, it outgrows what the business and team size can support. Based on our actual situation, we split the ticket-snatching system along two dimensions:

1. From the business perspective, the system is divided into the supplier service (synchronization and push), the forward transaction service, the reverse transaction service, and the activity service.

  • Forward transaction service: ticket snatching, payment, cancellation, ticket issuing, query, notification, and other functions

  • Reverse transaction service: reverse orders, ticket returns, refunds, query, notification, etc.

  • Supplier service: asks the resource side to carry out the corresponding operations, such as snatching tickets, canceling, holding seats, and issuing tickets

  • Activity service: daily activities, sharing, ranking statistics, etc.

2. From the system perspective, it is divided into the front-end H5 layer (with front end and back end separated), the API access layer, the RPC service layer, the bridge layer to PHP, asynchronous message processing, scheduled tasks, and the gateway for outbound supplier calls and inbound pushes.

  • Display layer: H5 and mini programs, with front end and back end separated

  • API layer: the unified API gateway for H5 and mini programs, responsible for aggregating back-end services and exposed in HTTP REST style

  • Service layer: the business services described in the previous section, exposed as RPC services

  • Bridge layer: in one direction, a proxy service exposes Dubbo RPC interfaces to the Java side and calls the unified PHP internal gateway over HTTP; in the other direction, a unified gateway (GW) is provided for PHP, through which PHP invokes Java services over HTTP

  • Message layer: asynchronous message handlers, for order status change notifications, coupon processing, etc.

  • Scheduled task layer: various compensation tasks and business polling

Data elements

For a trading system, regardless of language or architecture, the most important thing to consider is data. The data structure largely reflects the business model and shapes the program’s design, development, extension, and maintenance. Let’s briefly go through the core tables involved in the ticket-snatching system:

1. Order creation: after the user selects a train and enters the order form page, they need to select passengers and add contacts. So the first table involved is the passenger table, which reuses the existing PHP e-ticket function.

2. After the user submits the order creation request, the following tables are involved:

  • Order snapshot table: first stores the elements of the user’s order request

  • Order table: the ticket-snatching order created for the user

  • Segment table: covers the multiple trips (similar to flight segments) that a single order may contain

  • Passenger table: the passenger information contained in the ticket-snatching order

  • Activity table: the activity information the order may contain

  • Item table: the tickets, insurance, and other items included in the order

  • Fulfillment table: after the user buys tickets and insurance, the ticket numbers are eventually backfilled here, so we also call it the ticket number information table

3. After the seat reservation result comes back: if seat holding fails, a full refund is required; if it succeeds, a refund of the price difference may be required. We therefore have a refund_order table. Although only refunds are involved, there is also a refund_item table that records the refund details.

Order status

The key to an order system is the definition and flow of order states, which run through the whole order life cycle.

Our experience with the previous system revealed two major pain points. First, the definition of order states was complicated: we tried to use a single status field to drive both the front-end display and the back-end logic, which resulted in as many as 18 values for that one field. Second, the state flow logic was complex: the checks on preconditions and on where the flow goes next involved too many if/else branches, which made the code expensive to maintain.

We therefore use a finite state machine to organize the states and state flow of forward ticket-snatching orders. For more on how we apply state machines, see the article “Application and Optimization of the State Machine in Hornet’s Nest Ticket Trading System”. The following figure is the state flow diagram of a ticket-snatching order:

We split the state into an order status and a payment status, and drive state flow through an event mechanism. Reaching a target state has two preconditions: the original state and the triggering event. States transition according to preset conditions and routes, and the execution of business logic and the triggering of events are separated from the state flow itself, which decouples them and makes the system easier to extend and maintain.

As shown in the figure above, the order status flows as: initialized → order successful → transaction successful → closed. The payment status flows as: initialized → to be paid → paid → closed. In the normal case, after the user places an order successfully, the order status becomes “order successful” and the payment status becomes “to be paid”. After the user pays through the cashier, the order status stays unchanged and the payment status becomes “paid”. The system then tries to hold a seat for the user; once the seat is held, the order status becomes “transaction successful” and the payment status stays unchanged.
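
The event-driven flow described above can be sketched as a small transition table. This is a minimal illustration, not the production code: all type and event names are hypothetical, and the CLOSE_UNPAID transition is an assumption about how an unpaid order is closed, since the article only spells out the normal path.

```java
// Hypothetical names throughout; the article does not show its real classes.
enum OrderStatus { INIT, ORDER_SUCCESS, TRADE_SUCCESS, CLOSED }
enum PayStatus { INIT, TO_BE_PAID, PAID, CLOSED }
enum OrderEvent { ORDER_PLACED, PAY_SUCCEEDED, SEAT_HELD, ORDER_CLOSED }

/** One row of the transition table: (current statuses, event) -> next statuses. */
enum Transition {
    PLACE(OrderStatus.INIT, PayStatus.INIT, OrderEvent.ORDER_PLACED,
            OrderStatus.ORDER_SUCCESS, PayStatus.TO_BE_PAID),
    PAY(OrderStatus.ORDER_SUCCESS, PayStatus.TO_BE_PAID, OrderEvent.PAY_SUCCEEDED,
            OrderStatus.ORDER_SUCCESS, PayStatus.PAID),
    HOLD_SEAT(OrderStatus.ORDER_SUCCESS, PayStatus.PAID, OrderEvent.SEAT_HELD,
            OrderStatus.TRADE_SUCCESS, PayStatus.PAID),
    // Assumption: closing an unpaid order closes its payment as well.
    CLOSE_UNPAID(OrderStatus.ORDER_SUCCESS, PayStatus.TO_BE_PAID, OrderEvent.ORDER_CLOSED,
            OrderStatus.CLOSED, PayStatus.CLOSED);

    final OrderStatus fromOrder;
    final PayStatus fromPay;
    final OrderEvent event;
    final OrderStatus toOrder;
    final PayStatus toPay;

    Transition(OrderStatus fromOrder, PayStatus fromPay, OrderEvent event,
               OrderStatus toOrder, PayStatus toPay) {
        this.fromOrder = fromOrder;
        this.fromPay = fromPay;
        this.event = event;
        this.toOrder = toOrder;
        this.toPay = toPay;
    }
}

final class TicketOrder {
    OrderStatus orderStatus = OrderStatus.INIT;
    PayStatus payStatus = PayStatus.INIT;

    /** Applies the event if the table defines it for the current statuses; otherwise rejects it. */
    void fire(OrderEvent event) {
        for (Transition t : Transition.values()) {
            if (t.fromOrder == orderStatus && t.fromPay == payStatus && t.event == event) {
                orderStatus = t.toOrder;
                payStatus = t.toPay;
                return;
            }
        }
        throw new IllegalStateException(
                "No transition from " + orderStatus + "/" + payStatus + " on " + event);
    }
}
```

Because every legal move is a row in the table, adding a state or event means adding a row rather than another if/else branch, and an event that arrives in the wrong state is rejected instead of silently applied.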

With only these two statuses, the business logic is simple to execute, but they are not enough to give users a rich single status on the front end, so we also record the reason an order was closed. There are currently seven close reasons: not closed, order creation failed, canceled by user, payment timed out, closed by operations staff, order expired, and ticket snatching failed. The status displayed to users is computed from the order status, the payment status, and the close reason.

Idempotent design

Idempotence means that calling an interface once and calling it many times should produce the same result. It is an effective guarantee of high availability and fault tolerance in system design, and it is not limited to distributed systems. In HTTP, for example, GET is inherently idempotent: executing a GET multiple times does not change system data. Repeated POST calls, by contrast, may produce inconsistent results, and PUT and DELETE are idempotent only if the server implements them as the HTTP specification intends.

For our order status specifically, as mentioned above, state machine transitions are triggered by events. The forward events for ticket snatching currently include order placed, payment succeeded, seat held, order closed, and payment closed. Our events are typically triggered by user actions or asynchronously pushed messages, and neither source can rule out repeated requests. The resulting processing goes beyond the system’s own tables: besides updating its own state, it must also synchronize the status to the order center and the order information to the PHP e-ticket system. Without idempotency control, the consequences would be very serious.

There are many ways to ensure idempotency. Taking seat-holding messages as an example, we use two measures:

  1. The protocol requires every seat-holding message to carry a unique serialNo, so the push service can tell whether a message has already been processed.

  2. The service performs its updates as CAS (Compare And Swap), in simple terms a database optimistic lock: UPDATE `order` SET order_status = 2 WHERE order_id = '1234' AND order_status = 1 AND pay_status = 2.
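
The second measure can be illustrated without a database. The sketch below is an in-memory analogue of the conditional UPDATE, assuming a hypothetical OrderStatusStore class; it tracks only order_status and leaves out the pay_status condition for brevity. ConcurrentMap.replace(key, expected, updated) plays the role of the WHERE clause.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory stand-in for the order table: order_id -> order_status. */
final class OrderStatusStore {
    private final ConcurrentMap<String, Integer> orderStatus = new ConcurrentHashMap<>();

    void insert(String orderId, int status) {
        orderStatus.put(orderId, status);
    }

    /**
     * Analogue of: UPDATE ... SET order_status = :to
     *              WHERE order_id = :id AND order_status = :from
     * Returns true only if the stored status still equalled `from`,
     * which corresponds to "1 row affected" in SQL.
     */
    boolean compareAndSetStatus(String orderId, int from, int to) {
        return orderStatus.replace(orderId, from, to);
    }

    Integer status(String orderId) {
        return orderStatus.get(orderId);
    }
}
```

If the same seat-holding event is processed twice, the first call moves the order from status 1 to 2 and returns true; the second finds status 2, changes nothing, and returns false, so the caller can skip the downstream synchronization.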

Summary

Implementing servitization has a cost and requires a certain foundation in people and infrastructure. In the initial stage, it pays to start with a relatively independent new business and to integrate with and reuse the old system well, so that results come quickly. The ticket-snatching system went through product design, development, testing, and launch in less than a month, which confirms this point.

Stage three: servitization rollout and system capability improvement

Completing the ticket-snatching system meant we had taken a step forward, but only a small one; after all, ticket snatching is a cyclical business. Most of the time, e-tickets are what sustain our business. While the old and new systems ran in parallel, the main pain points were:

  1. For historical reasons, the original e-ticket system was tightly bound to specific suppliers and was heavily constrained by them.

  2. Because of the high cost of making it compatible with the ticket-snatching system and other transportation systems, unified link tracing, environment isolation, and monitoring and alerting were very difficult to implement.

  3. The bridge layer between PHP and Java carried too much business traffic, so its performance could not be guaranteed.

Our next goal was therefore to shed the historical burden, complete the servitization of the old system as soon as possible, unify the technology stack, and give the main business stronger system support.

Moving with the business: e-ticket process transformation

Through the e-ticket process transformation, we hoped to reshape the train ticket project, which had originally been built in a rush, and ultimately achieve the following goals:

  1. Establish Hornet’s Nest’s own train ticket business rules, instead of having business functions and processes dictated by supplier rules;

  2. Improve the user experience and functionality: add online seat selection, streamline the search and ordering flow, and speed up refunds;

  3. Improve data metrics and stability: introduce new suppliers to improve reliability, and build an order-dispatching system across suppliers to raise the seat-holding success rate and the ticket-issuing rate;

  4. Technically, complete the migration to Java-based services, laying a foundation for future business.

What we wanted was not just a technical rewrite, but to keep enriching the new system in line with new business demands, so that business and technology goals stay aligned. We therefore pushed service migration and business system construction forward together. The following figure shows the overall architecture of the train ticket system after the e-ticket process transformation:

In the figure, the light blue parts are functions already built during the ticket-snatching phase, and the dark blue modules were added by the e-ticket process transformation. Besides supplier access, forward transactions, and reverse transactions, which are similar to the ticket-snatching system, the new parts include search and basic data systems, and the supplier side gains e-ticket business functions. We also built a new operations back office to keep operations support continuous.

During the project, in addition to the problems already mentioned for ticket snatching, we focused on solving the following.

Search optimization

Let’s look at the systems a user request may pass through in a station-to-station search:

The request first reaches the TWL API layer, then the tsearch query service, then the TJS access service, and finally the supplier. The call chain is fairly long; if every request traversed the full chain, the results would not be encouraging. tsearch therefore caches query results in Redis, which is the key to shortening the chain and improving performance. Caching station-to-station queries poses several difficulties:

  1. The data must be close to real time. The core figure is the number of remaining tickets; if it is stale, the success rate of users’ orders will be very low

  2. The data is scattered across many combinations of departure station, arrival station, and departure date, so the cache hit ratio tends to be low

  3. Supplier interfaces are unstable, with an average response time above 1000 ms

Taking these factors into account, we designed the tsearch station-to-station search process as follows:

As shown in the figure, for each query condition we cache the results from multiple channels, partly to compare which channel is more accurate, and partly to improve system reliability and the cache hit ratio.

We set the Redis expiration time of each channel result to 10 minutes, and treat a cached result as valid for 10 seconds. A query first looks for a valid result; if there is none, it falls back to an invalid (stale) one; if there is no stale result either, the channels are called synchronously, with a limit of 3 seconds. Channels whose cached results are invalid or missing are handed to an asynchronous task for updating. Distributed locks are needed here to prevent concurrent updates of the same channel result. The final cache results are as follows:
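
The lookup order described above (valid result, then stale result, then a synchronous call) can be sketched for a single channel as follows. This is a simplified, single-threaded illustration under stated assumptions: StationSearchCache and its members are hypothetical names, a HashMap stands in for Redis, a list stands in for the asynchronous refresh task, the clock is passed in explicitly, and the 3-second limit and the distributed lock are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class StationSearchCache {
    static final long FRESH_MILLIS = 10_000L;    // a result counts as valid for 10 s
    static final long RETAIN_MILLIS = 600_000L;  // stand-in for the 10 min Redis expiry

    private static final class Entry {
        final String result;
        final long storedAtMillis;

        Entry(String result, long storedAtMillis) {
            this.result = result;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();
    final List<String> refreshQueue = new ArrayList<>();  // stands in for the async update task
    private final Function<String, String> channel;       // stands in for one supplier channel

    StationSearchCache(Function<String, String> channel) {
        this.channel = channel;
    }

    /** Returns a valid result if present, else a stale one (queued for refresh), else calls the channel. */
    String query(String key, long nowMillis) {
        Entry entry = store.get(key);
        if (entry != null && nowMillis - entry.storedAtMillis >= RETAIN_MILLIS) {
            store.remove(key);  // emulate Redis key expiry
            entry = null;
        }
        if (entry == null) {
            return refresh(key, nowMillis);  // nothing cached: synchronous channel call
        }
        if (nowMillis - entry.storedAtMillis >= FRESH_MILLIS) {
            refreshQueue.add(key);  // stale: serve it now, update it later
        }
        return entry.result;
    }

    /** Calls the channel and stores the result; the async task would invoke this for queued keys. */
    String refresh(String key, long nowMillis) {
        String result = channel.apply(key);
        store.put(key, new Entry(result, nowMillis));
        return result;
    }
}
```

Serving a stale result while a refresh is queued keeps response time low when suppliers are slow, at the cost of showing remaining-ticket counts that may be several seconds old.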

The cache hit rate is above 96%, and the average response time (RT) is around 500 ms, which ensures both a good user experience and timely data updates.

Message consumption

Much of our business logic is handled through asynchronous messages, such as order status change messages, seat-holding notifications, and payment messages. Besides normal message consumption, there are special scenarios such as ordered consumption, transactional consumption, and duplicate consumption, all implemented on RocketMQ.

Ordered consumption

This applies to scenarios where processing depends on message order; for example, the order-created message must be processed before the seat-holding message. We implemented this on RocketMQ, which itself supports ordered message consumption. The principle is simple: the producer is restricted to sending messages to a single queue, and the consumer is restricted to reading that queue with a single thread, so a globally single queue with a single consumer guarantees that messages are consumed in order.
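
A common refinement of the single-queue approach, shown here only as an illustration and not as what the article’s system does, is to keep ordering per order rather than globally: all messages for one order ID are routed to the same queue, while different orders may use different queues. The sketch uses plain lists as queues and a hypothetical OrderedRouter class; it does not call any RocketMQ API.

```java
import java.util.ArrayList;
import java.util.List;

final class OrderedRouter {
    private final List<List<String>> queues = new ArrayList<>();

    OrderedRouter(int queueCount) {
        for (int i = 0; i < queueCount; i++) {
            queues.add(new ArrayList<>());
        }
    }

    /** The same order ID always maps to the same queue index. */
    int queueFor(String orderId) {
        return Math.floorMod(orderId.hashCode(), queues.size());
    }

    /** Appends the message to the queue chosen for its order ID. */
    void send(String orderId, String message) {
        queues.get(queueFor(orderId)).add(orderId + ":" + message);
    }

    List<String> queue(int index) {
        return queues.get(index);
    }
}
```

With one consumer thread per queue, "order created" is still seen before "seat held" for any given order, because both land in the same queue in send order.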

Duplicate consumption

RocketMQ guarantees At Least Once delivery, but not Exactly Once. We handle this first by requiring the business side to be idempotent, and in addition by using the database tables message_produce_record and message_consume_record to ensure that messages are delivered accurately and that consumption results are confirmed.
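
The role of a consume-record table can be sketched as follows, assuming each message carries a unique key. DedupingConsumer is a hypothetical name, and the in-memory set stands in for a unique index on the message key in message_consume_record; the article does not describe the table’s actual schema.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class DedupingConsumer {
    // Stands in for a unique index on the message key in the consume-record table.
    private final Set<String> consumeRecord = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    DedupingConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns true if the handler ran, false if this key had already been consumed. */
    boolean onMessage(String messageKey, String body) {
        if (!consumeRecord.add(messageKey)) {
            return false;  // redelivery of an already-consumed message
        }
        try {
            handler.accept(body);
            return true;
        } catch (RuntimeException e) {
            consumeRecord.remove(messageKey);  // forget the key so a later redelivery can retry
            throw e;
        }
    }
}
```

Recording the key before running the handler means a concurrent redelivery is rejected immediately; removing it on failure keeps at-least-once delivery useful as a retry mechanism.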

Transactional consumption

This is based on RocketMQ’s transactional message capability, which supports a two-phase commit: a prepared (half) message is sent first, then a callback executes the local transaction, and finally the message is committed or rolled back. This helps keep the database modification consistent with the sending of the asynchronous message.
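
The two phases can be modelled in a few lines. The sketch below is an in-memory simulation with hypothetical names, not RocketMQ client code: one list holds prepared messages that consumers cannot see, another holds committed ones. The check-back a real broker performs for half messages whose outcome is never reported is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

final class TransactionalSender {
    final List<String> halfMessages = new ArrayList<>();  // prepared, not visible to consumers
    final List<String> delivered = new ArrayList<>();     // committed, visible to consumers

    /**
     * Phase 1: store the half message. Then run the local transaction.
     * Phase 2: commit the message if the transaction succeeded, otherwise roll it back.
     */
    boolean send(String message, BooleanSupplier localTransaction) {
        halfMessages.add(message);
        boolean committed;
        try {
            committed = localTransaction.getAsBoolean();
        } catch (RuntimeException e) {
            committed = false;  // a failed local transaction rolls the message back
        }
        halfMessages.remove(message);
        if (committed) {
            delivered.add(message);
        }
        return committed;
    }
}
```

Consumers only ever see a message whose local transaction succeeded, so a database update and its notification either both take effect or neither does.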

Grayscale rollout

The chief designer of the J-10 fighter jet once said: “Building a plane is not the hardest part; the hardest part is getting it into the sky.” The same is true for us. March is the peak of the spring outing season, and business volume grows day by day. In such a period, we need a complete plan to ensure a smooth business switchover.

Plan design

The grayscale rollout consists of a whitelist phase and a percentage phase. We first ran a whitelist grayscale internally, and after it stabilized we entered a grayscale period with 20% of traffic.

The core of a grayscale rollout is the entry point. Because the front end had also been completely redesigned, we direct users to different pages from the station-to-station search entry, and users then complete their business in either the old or the new system.

Search and ordering process

At the station-to-station search entry, the App calls a grayscale interface to obtain the jump address, which splits traffic at the entry point.
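
A grayscale interface of this kind might decide as in the sketch below. The article does not say how users are bucketed, so the modulo-by-user-ID rule, the GrayRouter class, and both URLs are assumptions made for illustration.

```java
import java.util.Set;

final class GrayRouter {
    // Hypothetical jump addresses; the real ones are not given in the article.
    static final String NEW_URL = "/train/search/new";
    static final String OLD_URL = "/train/search/old";

    private final Set<Long> whitelist;
    private final int percent;  // share of non-whitelisted users sent to the new system, 0..100

    GrayRouter(Set<Long> whitelist, int percent) {
        this.whitelist = whitelist;
        this.percent = percent;
    }

    /** Whitelisted users always go to the new system; others are bucketed by user ID. */
    boolean useNewSystem(long userId) {
        if (whitelist.contains(userId)) {
            return true;
        }
        return Math.floorMod(userId, 100L) < percent;
    }

    /** The jump address the App would receive at the search entry. */
    String entryUrl(long userId) {
        return useNewSystem(userId) ? NEW_URL : OLD_URL;
    }
}
```

Bucketing on a stable user attribute keeps each user on the same side for the whole grayscale period, so an order started in the new system is not continued in the old one.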

Comparison of results

Near-term plans

So far we have only taken the first steps in implementing servitization for the train ticket business line. There are several things we will keep pushing forward and improving:

1. Finer service granularity: our current services are still coarse. As functionality keeps growing, refining the granularity is a focus, for example splitting the transaction service into an order query service, an order creation service, a seat-holding service, and a ticket-issuing service. This is also where DevOps is heading: fine-grained services best meet our needs for rapid development, rapid deployment, and risk control.

2. Service resource isolation: Isolation at service granularity is not sufficient. DB isolation, cache isolation, and MQ isolation are also necessary. As systems continue to scale and data volumes grow, fine-grained isolation of resources is another priority.

3. Grayscale multi-version release: at present, our grayscale strategy only supports running an old and a new version in parallel. In the future, besides verifying multiple versions in parallel, we will also make the grayscale strategy more flexible to meet business-specific customization requirements.

Closing thoughts

Business development is inseparable from technology development, and technology development must in turn take full account of the business’s situation and conditions at the time. The two complement each other. We should guard against over-design even more than against under-design.

A technical architecture evolves; it is not designed in full at the beginning. Following the way the business develops, we need to break the long-term technical plan into stages and reach the goal step by step. We also need to keep in mind the new problems servitization brings, such as a sharp increase in complexity, service fragmentation, consistency, service granularity, longer call chains, idempotency, and performance.

Service governance is harder than service support, and it is something we all need to think about and work on.

Author: Li Zhanping, technical expert in Hornet’s Nest transportation business R&D.

(Photo source: Internet)