About the author:
Shi Songran is a senior development engineer at Didi, responsible for developing and maintaining Didi's business systems in China. This article shares practical experience from building Didi's core business platform with Go.
Content Outline:
1. The application and scale of Go at Didi
2. Didi's experience with Go service governance
3. Two concrete problems we hit with Go
4. Two recommended open-source tools
Main text:
1. The application and development of Go at Didi
Didi's code repository contains more than 1,500 modules with Go code, and more than 1,800 Gophers have committed Go code at Didi. For our middle-platform services alone, more than 2,000 machines run Go services.
1.1 What we do with Go
DUSE is Didi's dispatch engine, which matches drivers with passengers and handles tens of thousands of matching requests per second.
DOS is Didi's order system, responsible for real-time order state transitions and for retrieving Didi's historical orders; it is a retrieval service over tens of billions of records.
DISE is our own schemaless data storage service, with an implementation similar to Bigtable.
DESE is a serverless distributed framework: you only need to write the business functions to stand up a distributed service, similar to Amazon's Lambda.
1.2 The middle-platform business
When you use Didi, you interact with various lines of business; inside Didi these are called front-office services. They share a lot of common logic, such as driver information, order state, the cashier, and accounts. We consolidate this shared business logic into what we call the middle-platform (zhongtai) services. As the foundation supporting the front-office services, their importance is self-evident.
1.3 The challenges
Developing the middle-platform services brings challenges from three directions:
1. High availability: the middle platform supports all the front-office services, so any problem makes them fail collectively; high availability is critical.
2. High concurrency: the middle platform carries all the traffic, so it needs very high capacity and fast response times.
3. Business complexity: the middle platform is itself a business service, so the complexity of the business directly determines how complex the middle-platform system becomes.
1.4 Why Go
The first reason is high execution efficiency.
The second is development efficiency. Go's syntax is concise and clear, hides many low-level details, and makes business development smoother.
The third is Go's active community and rich ecosystem of libraries, which have helped us solve many problems.
Fourth, the learning cost is low. When we first adopted Go it was hard to recruit Go engineers, but engineers coming from other languages can learn Go quickly, get familiar with it, and start writing production Go programs after a short ramp-up.
2. Didi's experience with Go service governance
2.1 A huge business system
Didi's business is rather special: every request involves the state of the driver, the passenger, and the order, and many microservices cooperate to maintain that state. I did some simple statistics: a single Express order involves more than 50 sub-services, over 300 RPC calls, and more than 1,000 log lines, so manual analysis is extremely difficult.
2.2 Service governance problems
Having so many microservices brings many problems: exceptions are hard to localize; it is unclear which parts of the call chain are healthy and which are not; and service optimization and service migration become difficult. Around these three points, let me introduce what Didi does.
Locating exceptions
During the early period of wild growth, many services did not follow development conventions. Didi's logs were a mess, exceptions were hard to localize, and basic information about upstream and downstream calls was missing. A great deal of engineering effort was wasted on locating and analyzing anomalies, and the cost kept growing with the business. We asked ourselves whether we could build distributed tracing for exceptions and normalize the logs.
Log normalization
Log streaming
We borrowed a set of ideas from OpenTracing: TraceID and SpanID. Every log line retains both; the TraceID ties together all the requests of one call chain, and the SpanID records the time spent at each node. On top of that, the log structure specification is a human-friendly, machine-readable format, with a DLTAG stating what each line records. Normalizing the logs is not enough by itself; it only makes the logs easier to stitch together than manual analysis. So we built a unified processing system to collect, compute, store, index, and finally visualize the log data.
How does this system work? Our logs come from the server side and from the app. The log collection service SWAN gathers them and sends them to a message queue. The SRIUS service reads the log messages from the queue, converts them into structured data, and stores them in the ARIUS system, whose underlying engine is Elasticsearch. By building log indexes, ARIUS lets us retrieve log information quickly. Finally, applications built on ARIUS use its queries to reconstruct business call chains and perform service analysis and performance tracing.
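To make the format concrete, here is a minimal sketch of what a trace-aware, DLTAG-prefixed log line could look like. This is our own illustration, not Didi's implementation; the field names and the `_com_request_out` tag are assumptions for the example.

```go
package main

import (
	"log"
	"time"
)

// TraceContext carries the IDs that tie one request's logs together
// across services: TraceID identifies the whole call chain, SpanID
// identifies one hop inside it.
type TraceContext struct {
	TraceID string
	SpanID  string
}

// LogRPC writes one structured, machine-parsable line. The DLTAG
// (the hypothetical "_com_request_out" below) states what kind of
// event the line records, so collectors can index it.
func LogRPC(tc TraceContext, dltag, callee string, cost time.Duration, err error) {
	errno := 0
	if err != nil {
		errno = 1
	}
	log.Printf("%s||traceid=%s||spanid=%s||callee=%s||cost_ms=%d||errno=%d",
		dltag, tc.TraceID, tc.SpanID, callee, cost.Milliseconds(), errno)
}

func main() {
	tc := TraceContext{TraceID: "abc123", SpanID: "span-01"}
	start := time.Now()
	// ... a downstream RPC would happen here ...
	LogRPC(tc, "_com_request_out", "order-service", time.Since(start), nil)
}
```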
This is a service trace generated online by our tracing system, Pulse; you can clearly see the time spent and the protocol of each request.
Pulse solved exception tracing, but it still could not answer questions such as: where is the system's throughput bottleneck, what is the total system capacity, can the new data center be put into service, and is the disaster-recovery plan reliable? The standard industry answer to these questions is stress testing. Unfortunately, Didi's business is special. I coined a term for it: it is not a purely functional business, meaning the same input can produce different output, because a great deal of state sits between input and output: driver status, order status, whether it is raining today, whether it is peak hours, and so on. All of this affects the business result, and reproducing it exactly is very hard. That makes stress testing by traffic replay difficult, and the state involved also makes it hard to estimate whole-system capacity from proportionally scaled offline stress tests.
Since offline stress testing was not feasible, we moved the stress test online; we call this full-link stress testing. The basic idea is to separate stress-test traffic from real traffic with an extra identifier, for example an extra parameter in the Thrift protocol or an extra header in HTTP.
Once the traffic is identified, every business module it passes through needs some extra development work. When a module recognizes stress-test traffic, it must propagate the mark so that all downstream modules can also recognize it. Our cache module gives this traffic a short expiry so that cache resources are released as soon as possible after the test. Finally, in the database we set up a structure identical to production, a parallel set of tables we call shadow tables, and the shadow tables handle only stress-test traffic. The traffic-marking logic plus these per-module changes are what we call the stress-test channel. Didi stress-tests very frequently: besides testing new data centers, services are also stress-tested periodically, so that the expected system capacity is still met while the services change rapidly. Stress testing covers all of our main-flow modules to keep our services stable.
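As a minimal sketch of the idea (not Didi's code), an HTTP service might detect the stress-test mark, propagate it through the request context, and route writes to the shadow table like this. The header name `X-Didi-Stress-Test` and the table names are assumptions for the example.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type ctxKey string

const stressKey ctxKey = "stress_test"

// StressMiddleware tags the request context when the (hypothetical)
// stress-test header is present, so every layer below can see it.
func StressMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Didi-Stress-Test") == "1" {
			r = r.WithContext(context.WithValue(r.Context(), stressKey, true))
		}
		next.ServeHTTP(w, r)
	})
}

// orderTable picks the shadow table for stress traffic so test data
// never mixes with production rows.
func orderTable(ctx context.Context) string {
	if isStress, _ := ctx.Value(stressKey).(bool); isStress {
		return "orders_shadow"
	}
	return "orders"
}

func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "writing to table %s\n", orderTable(r.Context()))
}

func main() {
	http.ListenAndServe(":8080", StressMiddleware(http.HandlerFunc(handler)))
}
```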
With online stress testing solved, how do we generate the load? As mentioned earlier, stress-testing this business with traffic replay is very difficult.
Didi's solution is to mock the ride-hailing behavior of drivers and passengers: an event engine simulates their ride-hailing actions so that the complete order flow is exercised end to end. The load is controlled by the number of simulated drivers and passengers online at the same time.
Concretely, Didi wrote two event engines. One is a Driver Agent that simulates the driver going online, waiting for an order, accepting it, completing it, and feeding the driver's responses back to the system. The other agent simulates passenger behavior, from getting into the car through finishing payment. This avoids the weakness of traditional traffic replay, where mismatched state means problems deep in the system never surface. How do we control the stress-test load? We control the pressure on the whole system through the number of drivers and passengers online at the same time, which closely matches our business scenario and is very intuitive.
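A rough, purely illustrative sketch of the driver-side event engine's shape: each simulated driver is a small state machine driven by a goroutine, and the system load is raised simply by running more of them concurrently. The states, timings, and field names below are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// DriverState models the simulated driver's lifecycle.
type DriverState int

const (
	Offline DriverState = iota
	WaitingForOrder
	OnTrip
)

// DriverAgent is one simulated driver. Pressure on the system is
// controlled simply by how many agents run at the same time.
type DriverAgent struct {
	ID    int
	State DriverState
}

// Run walks the driver through go-online -> wait -> accept order ->
// finish order, the way a real driver app would call the backend.
func (d *DriverAgent) Run(done chan<- int) {
	d.State = WaitingForOrder // would call the "driver online" API here
	time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)

	d.State = OnTrip // would call the "accept order" API here
	time.Sleep(time.Duration(rand.Intn(200)) * time.Millisecond)

	d.State = Offline // would call the "finish order" API here
	done <- d.ID
}

func main() {
	const concurrentDrivers = 5 // raise this number to raise the load
	done := make(chan int)
	for i := 0; i < concurrentDrivers; i++ {
		go (&DriverAgent{ID: i}).Run(done)
	}
	for i := 0; i < concurrentDrivers; i++ {
		fmt.Printf("driver %d finished trip\n", <-done)
	}
}
```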
Through such a large online exercise, we learn detailed system-level performance data, the traffic ceiling of each data center, and where the system bottlenecks are; we can also evaluate whether the disaster-recovery plans actually work. There are downsides, though. The main one is cost: all of our modules and platforms have to maintain the stress-test channel. In practice, the risk can be controlled through basic components and operating procedures, and the cost of stress testing can be reduced by building those basic components once and reusing them.
Through Pulse and stress testing, we found that some services had become system bottlenecks, and we tried to optimize those services or rebuild them in Go.
The first reason is performance: we wanted to migrate them to a platform with better performance.
The second is interface precision. We want interfaces to be precisely typed, whereas some dynamic languages produce interfaces that are erratic and simply unreliable.
Finally, asynchronous business logic. Online services often need asynchronous logic, which is hard to express in a dynamic language with a per-request process model.
The modules we rebuilt might have performance problems, might have logic problems with high error rates, or might be old modules that had accumulated so many patches that they genuinely needed a rework.
2.3 What we hoped for
How Didi migrates its business
When it comes to migration, we want three things.
First, the business should not notice. We hope that while a middle-platform service is migrated, the front-office business perceives nothing, or at most is asked to help observe the service; barely perceptible is acceptable.
Second, the migration must be stable; the service must not go down during the migration.
Third, after migration there must be no functional difference between the old and new modules.
The migration experience
Let's take PHP's typical MVC framework as an example. This is a typical backend service. Ideally we would translate the PHP code directly into Go, keep the API and the functionality exactly the same, and everyone would be satisfied. It turns out it is not that easy. During the original development, people used dynamic-language features: perhaps no distinction between string and numeric types, or PHP's associative arrays mixed with regular arrays. In such cases a literal translation forces the Go code to carry a lot of adapters for the language differences, which seriously pollutes the Go business code.
So, in addition to translating the code to Go, you need to build an extra Proxy layer, or an SDK. We want the Go server to focus on its business logic, not on the details of the interface. If the interface is inconsistent, or the types differ for some special reason, the Proxy layer absorbs the differences between the Go server's interface and the legacy interface, keeping the Go server clean and focused on its own business. The Proxy layer also helps us direct traffic: when we cut traffic over, we do it at the Proxy. Through the Proxy, the client is unaware that we are migrating to the Go server, and the Proxy can also record traffic offline to help test the Go server offline. Suppose we have finally finished the Go code, the tests pass, and we feel ready to go live; then comes the most dangerous moment of the whole process: cutting the traffic over.
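As one concrete illustration of what "absorbing the differences" means (our own example, with assumed field names): the Proxy or SDK layer can soak up PHP's loose typing, such as a field that sometimes arrives as a number and sometimes as a string, so the Go server only ever sees a cleanly typed struct.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FlexibleInt accepts both 42 and "42" from the legacy PHP side, so
// the difference never leaks into the Go business code.
type FlexibleInt int64

func (f *FlexibleInt) UnmarshalJSON(b []byte) error {
	var n int64
	if err := json.Unmarshal(b, &n); err == nil {
		*f = FlexibleInt(n)
		return nil
	}
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	var parsed int64
	if _, err := fmt.Sscanf(s, "%d", &parsed); err != nil {
		return err
	}
	*f = FlexibleInt(parsed)
	return nil
}

// OrderRequest is what the Go server receives after the proxy or
// adapter layer has normalized the legacy payload.
type OrderRequest struct {
	OrderID FlexibleInt `json:"order_id"`
}

func main() {
	for _, payload := range []string{`{"order_id": 123}`, `{"order_id": "123"}`} {
		var req OrderRequest
		if err := json.Unmarshal([]byte(payload), &req); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("order id:", req.OrderID)
	}
}
```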
Drawing on plenty of experience, Didi summarized three steps that keep the service stable during the cutover. The first step is bypass drainage, the second is traffic switching, and the third is online observation.
In the first step we deploy the Go server and have the Proxy mirror 100% of the traffic to it as bypass traffic, which doubles as a stress test of the Go server. The value returned to the client is still the PHP response; in other words, the Proxy calls the Go server asynchronously but never returns its data to the front end. Meanwhile, we diff the two responses in the Proxy to check whether they agree, and we also diff the underlying storage written by the Go server and by PHP to check whether the business logic is consistent. Whatever differs, we fix. Once this has run for a while and the diff volume has dropped to an acceptable level, we move on to the next step: small-percentage traffic switching.
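A simplified sketch of the bypass step (illustrative only; the backend addresses and helper names are placeholders): the Proxy answers the client from the PHP backend, shadows the same request to the Go server asynchronously, and logs any mismatch for later analysis.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

const (
	phpBackend = "http://php-server.internal" // hypothetical addresses
	goBackend  = "http://go-server.internal"
)

// forward replays the original request against one backend and
// returns the raw response body.
func forward(base string, r *http.Request, body []byte) ([]byte, error) {
	req, err := http.NewRequest(r.Method, base+r.RequestURI, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header = r.Header.Clone()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// handler: PHP stays the source of truth for the client; the Go
// server only receives shadow traffic, and mismatches are logged.
func handler(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)

	phpResp, err := forward(phpBackend, r, body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	go func() { // asynchronous bypass call; its result never reaches the client
		goResp, err := forward(goBackend, r, body)
		if err != nil || !bytes.Equal(phpResp, goResp) {
			log.Printf("diff on %s: err=%v", r.RequestURI, err)
		}
	}()
	w.Write(phpResp)
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(handler)))
}
```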
The Proxy then gradually returns the Go server's responses to the client. This is a slow process: 1%, 2%, 10%, 20%, and the cutover may last quite a long time. During this period we ask the client side to watch for anomalies, the Proxy layer keeps diffing the return values, and the underlying storage keeps being checked for consistency. If the process goes smoothly, nothing breaks, and the logic stays consistent, we enter the next step.
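The gradual cutover itself can be as simple as a percentage check in the Proxy, keyed on something stable so a given user consistently sees one backend. A sketch with an assumed key and threshold:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cutPercent is the share of real traffic that receives the Go
// server's answer; it is raised slowly: 1%, 2%, 10%, 20%, ...
var cutPercent = uint32(1)

// useGoResponse buckets a request by a stable key (for example the
// passenger ID) so the same user consistently hits the same backend.
func useGoResponse(key string) bool {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()%100 < cutPercent
}

func main() {
	fmt.Println(useGoResponse("passenger-42"))
}
```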
Similar to the first diagram, but now the PHP server carries the bypass traffic and the Go server carries the primary traffic; the client sees only the Go server's logic. We keep observing online for a period of time, possibly months, to verify that the Go server holds up. If there is no problem, we remove the PHP server at a suitable time; if anything goes wrong, we cut back.
With the cutover covered, let me return to service governance, including the stress testing and traffic migration just discussed. Every middle-platform service wants to integrate these governance components: Pulse, the stress-test channel, service discovery, load balancing, and so on. If each service integrates them one by one, the extra development effort is enormous; none of the components is trivial to integrate, so development cycles stretch out, a lot of effort is wasted, and adoption is hard to push forward. At that point the service-governance team proposed DiRPC, which is essentially a set of standardized SDK components: upstream and downstream interactions are packaged as standard SDKs that provide unified, one-stop service discovery, fault-tolerant scheduling, and monitoring and metrics collection, reducing the cost of developing and operating services.
3. What are we actually discussing?
What exactly are we talking about when we discuss RPC and SDKs? The service-governance team wanted a unified, one-stop service-governance integration scheme: through a one-stop service platform, SDK integration and service governance are completed together, reducing the cost of service development and operations and safeguarding service stability. Besides client-side fault tolerance, service discovery, request instrumentation, and service specifications, DiRPC also implements the stress-test channel, Pulse, and the other logic just described.
How do we do that? We wrapped a base library around the fundamental components for service state, load balancing, and so on, and developed our own clients on top of it. The middle-platform services use these clients to build their business SDKs, which also have to meet certain specifications. That was the ideal; in practice, development ran into several difficulties.
First, it is unrealistic to expect every team to conform fully to the specifications, because there are so many components and it takes too long.
Second, if a module already has an SDK, migrating it to a new SDK is costly and carries stability risk. So on top of DiRPC we built a further layer called DiRPCGen. It is a code-generation tool based on an extension of the Thrift IDL: a business writes one IDL file, and the tool generates the SDK components directly, with all of the DiRPC specifications already implemented, which is very convenient. That took quite a bit of work. At the moment the IDL is compatible with Thrift syntax, with some extensions to support our HTTP protocol and a few other odds and ends. As a result, the migration cost for each service is very low and the benefit is large, so every module is willing to migrate and adoption has gone relatively smoothly.
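We cannot reproduce real DiRPCGen output here, but conceptually the generated SDK gives each caller a typed client with service discovery, timeouts, retries, and tracing already wired in behind the interface. The sketch below is entirely hypothetical; every name in it is invented only to show the shape of such a client from the caller's side.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// OrderClient stands in for a generated client: the generated code
// would hide service discovery, load balancing, retries, and trace
// propagation behind this typed interface.
type OrderClient interface {
	GetOrder(ctx context.Context, orderID int64) (*Order, error)
}

type Order struct {
	ID     int64
	Status string
}

// fakeClient exists only so this sketch compiles and runs.
type fakeClient struct{}

func (fakeClient) GetOrder(ctx context.Context, orderID int64) (*Order, error) {
	return &Order{ID: orderID, Status: "finished"}, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	var client OrderClient = fakeClient{} // real code would use a generated constructor
	order, err := client.GetOrder(ctx, 10086)
	if err != nil {
		panic(err)
	}
	fmt.Printf("order %d: %s\n", order.ID, order.Status)
}
```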
4. Two small problems we hit with Go
4.1 The first problem: double Close on Go's net.Conn
The figure below shows an earlier attempt to build graceful-restart logic into a web service. We wanted the service to exit only after all established connections had been handled, so that a restart would not increase the error rate. How is that done?
We implemented a service-wide counter: +1 when a connection is established, -1 when it is closed. When the service is asked to exit, it checks whether the counter is zero; if it is, it exits, and if not, it waits until it reaches zero. You can see that the underlying mechanism is essentially a WaitGroup. The connection handling itself is delegated to the net/http package; we never operate on our own net.Conn directly.
When we put this logic online, the service panicked because the counter went negative. The assumption had been that a connection is opened exactly once and closed exactly once, so a negative count should be impossible. Unless the net/http package closes a connection more than once? It turns out that Go's net/http may indeed close a connection multiple times while handling it; two such code paths are listed here. Is this a bug?
Should we file an issue? It is not actually a bug. If you read the comments on the net.Conn interface, you will notice that multiple goroutines may invoke methods on a Conn concurrently. This means that when you implement connection operations, you must make them, first, safe under concurrency and, second, safe to call repeatedly. Never assume that the Go runtime will open and close a connection exactly once.
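Here is a minimal sketch of the lesson (an illustration, not Didi's production code): the in-flight connection counter from the graceful-restart logic, with the decrement made idempotent via sync.Once so that a second Close from net/http cannot drive the counter negative.

```go
package main

import (
	"net"
	"net/http"
	"sync"
)

var active sync.WaitGroup

// countedConn wraps net.Conn so the "connection finished" bookkeeping
// runs exactly once, no matter how many times Close is called.
type countedConn struct {
	net.Conn
	once sync.Once
}

func (c *countedConn) Close() error {
	err := c.Conn.Close()
	c.once.Do(active.Done) // idempotent: net/http may close us more than once
	return err
}

// countedListener counts every accepted connection.
type countedListener struct{ net.Listener }

func (l countedListener) Accept() (net.Conn, error) {
	conn, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	active.Add(1)
	return &countedConn{Conn: conn}, nil
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{}
	// Serve blocks; on graceful shutdown we stop accepting first,
	// then wait for in-flight connections before exiting.
	_ = srv.Serve(countedListener{ln})
	active.Wait()
}
```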
4.2 The second problem: GC
We were about to launch a model service with high-dimensional features and a large number of parameters. It was perfectly stable in offline testing, but once online, the number of timed-out requests grew as the traffic grew, even though the average latency stayed moderate. Looking at the 99th-percentile latency, the spikes were severe, with some requests taking on the order of a second. We first checked the machines' CPU and memory, which showed no significant change, and then the network and so on. Having ruled out these external factors, we wondered whether the problem was in the code.
We profiled the code with Go's tooling and quickly found that a large share of CPU was being consumed by the scan function of the GC's tri-color mark algorithm, and that the service held around ten million live (in-use) objects. Although the stop-the-world pauses of the current GC are short, the tri-color marking runs concurrently and can burn a lot of CPU traversing those objects, which indirectly hurts the service's throughput and quality. Once we knew the cause, the optimization was clear: reduce the allocation of unnecessary pointer-bearing object types. What counts as such a type in Go? Besides the familiar pointers, strings, maps, and slices all do. We reduced these allocations by turning strings into fixed-length arrays, turning unnecessary slices into arrays, and so on, to avoid tri-color traversal. This wasted some memory, but it cut the GC cost; the effect of the optimization was obvious and the 99th-percentile latency dropped quickly. The approach is fairly general: if you find that your 99th-percentile latency is high and spiky, check whether this factor is involved.
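A toy illustration of the kind of change that helped (the struct and fields are made up, not the real model code): strings and slices carry pointers to separately allocated heap data, while fixed-length arrays live inline in the struct, so the concurrent mark phase has nothing to traverse.

```go
package main

import (
	"bytes"
	"fmt"
)

// Before: every Feature holds a string header pointing at heap data
// and a slice header pointing at another heap object, so millions of
// features mean millions of pointers for the GC to chase.
type FeatureBefore struct {
	Name   string
	Values []float64
}

// After: fixed-length arrays live inline in the struct. Some memory
// is wasted on unused capacity, but the GC mark phase has no
// pointers to follow here at all.
type FeatureAfter struct {
	Name   [32]byte
	Values [8]float64
}

func main() {
	var f FeatureAfter
	copy(f.Name[:], "pickup_distance")
	f.Values[0] = 1.5
	fmt.Printf("%s %v\n", bytes.TrimRight(f.Name[:], "\x00"), f.Values[0])
}
```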
5. Finally, two open-source libraries to recommend
5.1 The first is Gendry, Didi's open-source database helper
It provides three tools that help you manage database connections, build SQL statements, and map relational data onto structs.
The first component is the connection-pool manager, which manages connection-pool configuration and handles basic setup.
The second is the SQL builder, which helps you assemble SQL statements.
The last, the scanner, is a struct-mapping tool that maps the raw rows you fetch onto objects.
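A short taste of how the builder and scanner fit together. This is a sketch based on Gendry's documented API; the table name, columns, and DSN are placeholders, the connection-pool manager is omitted, and the exact signatures should be checked against the project's README.

```go
package main

import (
	"database/sql"
	"fmt"

	"github.com/didi/gendry/builder"
	"github.com/didi/gendry/scanner"
	_ "github.com/go-sql-driver/mysql"
)

// Driver mirrors one row; the ddb tag is what gendry's scanner uses
// to map columns onto struct fields.
type Driver struct {
	ID   int64  `ddb:"id"`
	Name string `ddb:"name"`
	City string `ddb:"city"`
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/demo") // placeholder DSN
	if err != nil {
		panic(err)
	}

	// The builder turns a where-map into SQL plus bind values.
	where := map[string]interface{}{"city": "beijing"}
	cond, vals, err := builder.BuildSelect("drivers", where, []string{"id", "name", "city"})
	if err != nil {
		panic(err)
	}

	rows, err := db.Query(cond, vals...)
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	// The scanner maps the raw rows onto the struct slice.
	var drivers []Driver
	if err := scanner.Scan(rows, &drivers); err != nil {
		panic(err)
	}
	fmt.Println(drivers)
}
```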
5.2 The second is Jsoniter
It is a JSON encoding/decoding library that is compatible with Go's native encoding/json and roughly six times faster. I highly recommend it: unlike easyjson, it does not require generating extra JSON-handling code; you only need to replace one reference to get the roughly sixfold gain.
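The migration really is a one-line swap, which is most of the appeal. A minimal sketch (the struct and field names are just for the example):

```go
package main

import (
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// One package-level replacement is the whole migration: the rest of
// the code keeps calling json.Marshal / json.Unmarshal as before.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

type Order struct {
	ID     int64  `json:"id"`
	Status string `json:"status"`
}

func main() {
	data, err := json.Marshal(Order{ID: 1, Status: "finished"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))

	var o Order
	if err := json.Unmarshal(data, &o); err != nil {
		panic(err)
	}
	fmt.Println(o)
}
```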
That’s all, thank you!
[Q&A]
Questioner 1
Q: During the bypass stage, with PHP still serving the traffic, how do you keep the other service's state, database, and storage from conflicting?
Shi Songran: The underlying storage of the two systems is isolated. For example, after running for a day we use scripts to diff the data stored at the bottom of the two systems; if the storage differs, some logical difference caused the inconsistency. The two stores are also synchronized periodically by script, with PHP as the source of truth at first and Go later.
Q: When you talked about full-link stress testing, you said at first that the data source was hard to construct. What did you do in the end?
Shi Songran: Every Didi order involves driver and passenger state, which is not part of the interface input but additional system state. Traditional stress testing generally aggregates online traffic along the time dimension and replays it. If Didi took that approach, the traffic would very likely fail outright because the driver and passenger states would be inconsistent. Our approach is to simulate driver and passenger behavior with the event engine, which ensures that the stress test does not fail because of driver, passenger, or order-state information.
Questioner 2
Q: When you migrate from the PHP server to the Go server, you say you need to migrate online traffic and diff the PHP server against the Go server, but in your business users' requests and responses differ depending on the actual situation. Does the PHP server really return the same thing as the Go server?
Shi Songran: Deterministic business can be diffed completely; for non-deterministic business we check whether the interface's result matches expectations rather than requiring identical results. That is why we must keep observing during the small-percentage cutover. If everything could be made exactly identical, we would need neither the small-percentage stage nor the online observation.
Q: I see two agents; how do you make them behave like the real environment?
Shi Songran: I discussed this with the colleagues involved. There are two event engines underneath. Over a period of time they simulate drivers coming online and trigger different functions according to the system's return values; when an order arrives, the event engine selects the corresponding function to handle it, simulating the different choices drivers and passengers make.
Questioner 3
Q: You just mentioned API consistency during the cutover, with a Proxy sitting in the middle. How is that implemented? Does it take a request and send it on to the upstream server, or is it a forwarding proxy implementation? How is the latency of the doubled upstream and downstream requests handled? How much latency does it add, and does the extra load become a system bottleneck?
Shi Songran: The Proxy is a stateless service, so it can scale out horizontally without limit; performance is not a big obstacle. As for the latency caused by the double forwarding you mentioned, that call is asynchronous, so it does not add to the request time.
Q: Will there be timeout problems? How do you handle diffs where the inconsistency is caused by an anomaly?
Shi Songran: Yes, timeouts may also be caused by network problems. That situation certainly occurs, and it is hard to achieve 100% identical results at the interface layer, so during the small-percentage cutover we also have to judge manually whether a difference comes from network jitter and timeouts or from a logic problem. You need to keep observing the small traffic to determine whether a failure was caused by a network exception.
Questioner 4
Q: When the backend service is updated on a large scale, are there brief outages, or is the transition smooth?
Shi Songran: During a restart, our operations tooling applies some extra logic. A middle-platform service restarts one machine at a time rather than being fully taken down and brought back up. In the retry scenario, a request may fail and be retried. During an online update we ask the Proxy, that is, the front-office side, to perform an extra retry to keep the business success rate up.
Q: So could the client appear to get stuck?
Shi Songran: In most cases nothing is perceptible. With retries, if you have 100 machines restarting one at a time, it is very hard to hit two retries on the same request; the probability is tiny.
Q: You need 300 RPCs to complete one order; that is expensive. Is the whole process asynchronous?
Shi Songran: No. The 300 RPCs refer to the whole order lifecycle, not to 300 RPCs triggered by a single request. For example, the driver ending a trip is one RPC, and the passenger placing an order is another; each of those may involve ten or twenty related requests, but it is not a single giant RPC.