1. Introduction
Autonavi has been building out its Go technology stack for some time, mainly in three areas: landing Go applications in production, building Go middleware, and going cloud native. After sustained effort, all three have made good progress. How did Autonavi bring Go into production? What problems came up, and how were they solved? This article shares that experience in the hope that it helps readers who are interested.
2. Why Autonavi adopted Go for its applications
Java is still the mainstream language at Autonavi, and Java applications consume a striking number of machines. Autonavi's business as a whole is also growing quickly, so costs are rising fast. At the language level, Go has considerable advantages over Java in reducing machine load. Reducing machine cost was our first consideration in adopting Go.
Second, Go has developed rapidly in recent years, and the calls to use it have grown louder both across Alibaba Group and inside Autonavi. Running Go applications in production is a good way to verify the stability of Go middleware. Chaos engineering can verify it too, but only the test of a production environment is fully convincing. Verifying and hardening the stability of Go middleware was our second consideration.
Finally, Go is the language most widely used in cloud native infrastructure, so adopting Go early removes a lot of resistance from the later move to cloud native. Autonavi's Serverless/FaaS footprint is already quite large. Paving the way for that cloud native move was our third consideration.
3. Deploy the Go application in heavy traffic scenarios
3.1 Introduction to the rendering gateway
The rendering gateway discussed in this article ranks at the top of our Go migrations in traffic volume, migration difficulty, risk, and benefit. It sits in the access layer and carries about half of Autonavi's total traffic, so its importance is easy to imagine.
Here is a brief introduction to the business the rendering gateway carries, to give you a more concrete picture.
The rendering gateway handles all map rendering requests from the Autonavi mobile app, in-vehicle systems, the open platform, and other sources. Everything you see when using Autonavi, such as buildings, terrain, place names, routes, subway stations, bus stops, and traffic lights, is delivered through the rendering gateway and drawn on the device by the rendering engine. A few screenshots give a sense of what this looks like.
Figure 1 shows the map before a trip, Figure 2 during a trip, Figure 3 the taxi-hailing page, and Figure 4 a hand-drawn map of a scenic spot. The rendering gateway serves many more businesses; these are only examples, and the others are not shown here.
3.2 Refactoring Difficulties
Anyone who has worked on a refactoring project knows its two biggest difficulties: keeping the business logic correct and keeping the service stable.
On business correctness: most services that get refactored are old services, and the biggest problems with old services are complicated historical logic, staff turnover, and missing documentation. These are the roadblocks of any refactoring.
The rendering gateway is no exception. It serves Autonavi's mobile clients, in-vehicle clients, the open platform, taxi hailing, and other business lines, across every historical version. Together with the factors above, this makes business correctness very hard to guarantee.
On service stability: anyone who has built a gateway knows that a gateway by nature does not see frequent business iterations, and that stability is its first requirement. The gateway must remain highly available no matter how the external environment or its dependencies behave. Because the Go middleware had not yet been sufficiently verified under heavy traffic, this difficulty needed careful evaluation. We had to use appropriate methods to verify as many boundary conditions as possible in a simulation environment, so that no problems would surface in production.
3.3 Technical Solutions
In the reconstruction of Autonavi rendering gateway, our overall technical solution is divided into three steps:
3.3.1 Comparison of Online Traffic
How do you verify the business correctness of the new service? We compared online traffic.
We did a lot of research looking for a tool that supports (near) real-time, binary-level comparison, but found none. Because of the nature of the rendering business, most rendering gateway interfaces return binary vector data, so the ideal tool has to support binary-level comparison as well as regular data comparison.
Binary comparison has another benefit: it rules out differences in character sets and library functions, which guarantees the accuracy of the comparison. Some readers may think of logging the traffic and comparing the logs offline, but this approach has many disadvantages.
First, traffic cannot be replayed to a specified machine. Second, it generally relies on a fixed corpus, which is not complete enough to simulate the online environment. Third, character set and library function differences introduced by logging can badly distort the comparison, especially for special characters and above all when the layer 7 protocol is binary. What do you do when no suitable tool is at hand? As the saying goes: cut a road when you meet a mountain, build a bridge when you meet a river.
So we developed our own (near) real-time traffic comparison tool. It ensured the correctness of this refactoring and can also serve other refactoring projects at Autonavi. The technical details, which go down to TCP/IP, are very interesting; readers who want them can skip to section 4.1, "Technical details of traffic Comparison Tool (LN)".
3.3.2 Stress testing in a simulation environment
Anyone who runs a service knows that guaranteeing five nines of availability is not easy. All kinds of situations can arise in a real production environment, so we need ways to verify service stability under various boundary conditions. For a newly refactored service, that means building a simulation environment in which these situations can be verified.
The simulation environment must match production in machine baseline, external dependencies, and external traffic (for example, traffic diverted from production). It should provide not only normal conditions but also abnormal ones.
Abnormal conditions include network disconnection and packet loss. As the saying goes, 20% of the code implements functionality and 80% handles exceptions. In practice, chaos engineering is the main way to construct abnormal conditions. It can simulate anomalies from the operating system level (network disconnection, packet loss, and so on) up to the application layer (message middleware backlog, business exceptions injected through hooks before and after JVM methods, and so on).
In the simulation environment we ran long-duration stress tests at the service's limit, with a corpus drawn from online traffic, under both normal and abnormal conditions, and observed how the service behaved over a long period in order to draw conclusions about its stability and availability.
The indicators we observed include basic indicators, such as CPU usage, disk usage, memory usage, and connection count, and service indicators, such as interface success rate, successful call count, total call count, and TP99. This covers nearly all foreseeable situations and gives strong assurance of the service's stability and high availability.
3.3.3 Smooth gray-release traffic switching
The previous sections covered business correctness and service stability. The next question is how to switch traffic smoothly in a gray release. The key is to follow Alibaba's three release principles strictly: every release must be gray, monitorable, and rollbackable.
In practice, we switched traffic in the following steps:
A. Leave the original Java cluster untouched and provision a new Go cluster. Modify the routing rules so that a set of whitelisted users is served by the Go cluster.
B. Modify the routing rules to move traffic to the Go cluster interface by interface, increasing the gray share slowly. During this period, closely watch machine health, service logs, and indicators. If anything abnormal appears, switch back to the Java cluster.
C. After all interfaces are switched to the Go cluster, the Java cluster and Go cluster coexist for a period of time.
D. Gradually take the Java cluster machines offline.
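The routing side of steps A and B can be sketched as follows. This is only an illustration under stated assumptions: the `GrayRouter` type, its fields, the cluster names, and the idea of bucketing users by a hash of their ID are all hypothetical, since the article does not describe how the routing rules are actually expressed.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Cluster names are illustrative.
const (
	ClusterJava = "java"
	ClusterGo   = "go"
)

// GrayRouter decides which cluster serves a request. It is not safe for
// concurrent modification; a real router would guard its rules with a lock.
type GrayRouter struct {
	Whitelist map[string]bool // users always served by the Go cluster (step A)
	Percent   map[string]int  // per-interface share of traffic sent to Go, 0..100 (step B)
}

// Route returns the cluster for one request.
func (r *GrayRouter) Route(path, userID string) string {
	if r.Whitelist[userID] {
		return ClusterGo
	}
	pct := r.Percent[path] // interfaces without a rule stay on Java
	if pct <= 0 {
		return ClusterJava
	}
	if pct >= 100 {
		return ClusterGo
	}
	// Hash the user ID so that one user consistently lands in the same bucket.
	h := fnv.New32a()
	h.Write([]byte(userID))
	if int(h.Sum32()%100) < pct {
		return ClusterGo
	}
	return ClusterJava
}

// Rollback sends an interface back to the Java cluster entirely.
func (r *GrayRouter) Rollback(path string) {
	delete(r.Percent, path)
}

func main() {
	r := &GrayRouter{
		Whitelist: map[string]bool{"internal-tester": true},
		Percent:   map[string]int{"/render/tile": 10},
	}
	fmt.Println(r.Route("/render/tile", "internal-tester")) // whitelisted user
	fmt.Println(r.Route("/render/tile", "user-42"))         // depends on the hash bucket
	r.Percent["/render/tile"] = 100
	fmt.Println(r.Route("/render/tile", "user-42")) // fully switched interface
	r.Rollback("/render/tile")
	fmt.Println(r.Route("/render/tile", "user-42")) // back on Java
}
```

Keeping the rule per interface is what makes the rollback in step B cheap: reverting one interface does not disturb the interfaces that have already been switched successfully.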
3.4 Main Benefits
The first important benefit is lower cost and higher efficiency. After the Autonavi rendering gateway switched from Java to Go, the number of machines dropped by nearly half. Doing the same work with half the resources greatly reduces cost, improves resource utilization, supports business growth better, and sharply slows the growth of access layer machines driven by rapidly increasing traffic.
The second important benefit is that the project verified the stability of the Go middleware co-built by Autonavi and the Group, and to some extent helped the Group's Go ecosystem mature and flourish. Having passed the test of a heavy-traffic scenario, that middleware's stability is now well verified.
The third important benefit is that it paves the way for making the gateway cloud native. Moving the gateway to Go is only the first step: Go is the language most widely used in cloud native infrastructure, and this step removes the language gap.
The refactoring also left us with a number of useful tools. One example is LN, the self-developed traffic comparison tool, which can safeguard future service refactoring.
4. Technical deep dive
4.1 Technical details of traffic Comparison Tool (LN)
Let me start with a question: what does it take to build a (near) real-time traffic comparison tool? Traffic replication, traffic parsing, traffic replay, and traffic comparison, certainly. In practice there is more to it: the tool forms a closed loop for traffic regression, as shown in the figure below:
4.1.1 Traffic Replication
To support every layer 7 protocol, traffic capture has to start at layer 3 or 4. tcpdump comes to mind immediately, and it is the right tool: the files it produces are the actual traffic, so the replication step is already solved. For (near) real-time capture, two or three tcpdump processes can run over staggered time windows whose start and end overlap, so that no traffic is missed at the boundaries.
Another design consideration was not to put a heavy load on online machines, so that their stability is not affected. This replication mode is very light; the load it adds to an online machine is negligible.
4.1.2 Traffic Upload & Traffic Pull
Internal file services are used for traffic upload and traffic pull.
4.1.3 Traffic Comparison
To keep the comparison strict and rule out interference from differences in character sets and library functions, LN natively supports binary stream comparison.
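A byte-level comparison can be sketched in a few lines. This is not LN's implementation, which the article does not show; `FirstDiff` is a hypothetical helper that also reports where two responses start to differ, which is useful when debugging a mismatch.

```go
package main

import "fmt"

// FirstDiff compares two response bodies byte by byte. It returns -1 when
// they are identical, otherwise the offset of the first differing byte. If
// one body is a strict prefix of the other, the offset is the shorter length.
func FirstDiff(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	// Stand-ins for binary responses from the old and the new service.
	oldResp := []byte{0x00, 0xff, 0x10, 0x80}
	newResp := []byte{0x00, 0xff, 0x11, 0x80}
	fmt.Println(FirstDiff(oldResp, oldResp))
	fmt.Println(FirstDiff(oldResp, newResp))
}
```

Because the comparison never decodes the bytes into text, it cannot be fooled by two services that render the same characters through different character sets.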
4.1.4 Local Replay Debug of Faulty Traffic
When replaying traffic, some comparisons may turn out to be inconsistent. In that case we want to replay only the affected traffic to a specific machine for debugging. LN supports this natively.
4.1.5 Traffic Parsing
Traffic parsing is a lot of fun, and the pure pleasure comes from “playing” with network protocols.
In practice, it comes down to parsing tcpdump files, extracting the TCP payload, and restoring HTTP requests.
There are two key points here: how to get the TCP payload out of the tcpdump file, and how to reassemble layer 4 TCP payloads into layer 7 HTTP requests.
4.1.5.1 Tcpdump File Format
How do we get the TCP payload out of a tcpdump file? If we know the file format, we know where each TCP payload starts and how long it is. So let's look at the tcpdump file format.
First, an overview of a tcpdump file:
The file header has a fixed format and length, as follows:
After opening the tcpdump file, we can skip the fixed-length file header (24 bytes in the classic pcap format, that is, bytes 0 through 23) and then process the packets one by one. Each packet has the following format:
For each packet, we skip the packet record header, the data link layer header, the IP header, and the TCP header, which leaves the offset at the first byte of the TCP payload. Further implementation details (determining the header field values at each layer, the various lengths, the byte order, how request packets are matched to response packets, and so on) are not covered here. This is only the general idea; readers who are interested can dig into the network protocols themselves.
4.1.5.2 TCP Payload Restores HTTP requests
This section describes how to restore HTTP requests from TCP payloads (HTTP here means HTTP/1.0 and HTTP/1.1, not HTTP/2). The full LN implementation restores both the request and its corresponding response from the payloads. Each parsed HTTP request can then be sent again to the old and the new service separately, and the binary response streams compared.
Several payloads are sent over one TCP connection, possibly with duplicates caused by packet loss and retransmission. Several payloads may make up one HTTP request, or the first part of a payload may belong to one HTTP request and the second part to another. What we need to do is read the byte stream formed by the payloads and split it into HTTP requests according to the HTTP message format. HTTP/2 requests cannot be reassembled this way.
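One way to sketch this reassembly in Go is to join the payloads into a single byte stream and let the standard library's `http.ReadRequest` find the message boundaries. This is an illustration, not LN's implementation: `RestoreRequests` and `Restored` are hypothetical names, and the sketch assumes the payloads of one connection direction have already been put in sequence order with retransmitted duplicates removed.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// Restored is one HTTP/1.x request recovered from the TCP byte stream,
// together with its fully read body.
type Restored struct {
	Req  *http.Request
	Body []byte
}

// RestoreRequests treats the payloads as one continuous byte stream and
// splits it into HTTP/1.x requests. A request may span several payloads,
// and one payload may contain the end of one request and the start of the next.
func RestoreRequests(payloads [][]byte) ([]Restored, error) {
	stream := bufio.NewReader(bytes.NewReader(bytes.Join(payloads, nil)))
	var out []Restored
	for {
		req, err := http.ReadRequest(stream)
		if err == io.EOF {
			return out, nil // stream ended cleanly between two requests
		}
		if err != nil {
			return out, err
		}
		// The body must be consumed before the next request can be parsed.
		body, err := io.ReadAll(req.Body)
		if err != nil {
			return out, err
		}
		out = append(out, Restored{Req: req, Body: body})
	}
}

func main() {
	// Payload boundaries deliberately fall in the middle of headers.
	payloads := [][]byte{
		[]byte("POST /a HTTP/1.1\r\nHost: x\r\nContent-Le"),
		[]byte("ngth: 5\r\n\r\nhelloGET /b HTTP/1.1\r\nHo"),
		[]byte("st: x\r\n\r\n"),
	}
	reqs, err := RestoreRequests(payloads)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range reqs {
		fmt.Printf("%s %s body=%q\n", r.Req.Method, r.Req.URL.Path, r.Body)
	}
}
```

The approach relies on HTTP/1.x messages being self-delimiting through `Content-Length` or chunked encoding, which is also why it does not carry over to HTTP/2, where requests are multiplexed as binary frames.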
4.2 Some Go language best practices
4.2.1 sync.Pool in practice
Because Go and Java manage memory differently, the cost of allocating memory also differs.
sync.Pool is Go's tool for reusing memory. It has many advantages, such as a smaller memory footprint, fewer system calls, and less GC pressure. But everything has two sides, and sync.Pool is no exception. Objects stored in a sync.Pool can be reclaimed at any time without notice, so resources such as database connections are not suitable for it.
In summary, sync.Pool reuses memory and reduces machine load, which makes it ideal for temporary objects.
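A typical use is pooling scratch buffers on the request path. The sketch below is a generic illustration rather than code from the gateway; `bufPool` and `Render` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. New is only called when the pool is
// empty, for example at startup or after the GC has reclaimed pooled objects.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// Render builds a small response using a pooled buffer as scratch space.
func Render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold data from its last use
	defer bufPool.Put(buf) // return the buffer once this call is done with it

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies the bytes, so the result outlives the Put
}

func main() {
	fmt.Println(Render("gateway"))
	fmt.Println(Render("mesh"))
}
```

The important detail is that nothing returned to the caller may still point into the pooled buffer after `Put`, because the next `Get` is free to overwrite it.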
4.2.2 Golang byte
Go's byte type is unsigned, while Java's byte type is signed. When migrating a Java service to Go, pay attention to every place where the Java code compares a byte against zero or tests whether it is positive or negative.
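The difference is easy to see on the bit pattern `0xff`, which is 255 as a Go `byte` and -1 as a Java `byte`. The helpers below are hypothetical names used for illustration; they show one way to reproduce Java's view of the same 8 bits when porting a sign check.

```go
package main

import "fmt"

// JavaByte returns the value a Java program would see for the same 8 bits.
// Converting a byte variable to int8 reinterprets the bits as two's complement.
func JavaByte(b byte) int8 {
	return int8(b)
}

// IsNegativeInJava mirrors the Java test `b < 0`. In Go, `b < 0` on a byte
// is always false, so the sign bit has to be checked explicitly.
func IsNegativeInJava(b byte) bool {
	return b&0x80 != 0
}

func main() {
	for _, b := range []byte{0x00, 0x7f, 0x80, 0xff} {
		fmt.Printf("bits 0x%02x: Go byte=%d, Java byte=%d, negative in Java=%v\n",
			b, b, JavaByte(b), IsNegativeInJava(b))
	}
}
```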
4.2.3 Efficient conversion of Golang byte slicing and string
Byte slice to string
// Bytes2String reinterprets b as a string without copying (needs import "unsafe").
func Bytes2String(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}
String to byte slice
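// String2Bytes reinterprets s as a byte slice without copying (needs import "unsafe").
func String2Bytes(s string) []byte {
	x := (*[2]uintptr)(unsafe.Pointer(&s))   // string header: data pointer, length
	h := [3]uintptr{x[0], x[1], x[1]}        // slice header: data pointer, length, capacity
	return *(*[]byte)(unsafe.Pointer(&h))
}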
This conversion is fast because it allocates no new memory and copies nothing. The trade-off is that, in both directions, the byte slice and the string share the same memory, so changing a value in the byte slice changes the string. Users should judge from their business logic whether this is acceptable, and control the lifetime of the data precisely.
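The shared-memory effect can be demonstrated directly. The sketch below reuses the `Bytes2String` conversion and compares it with the ordinary copying conversion `string(b)`; `AliasDemo` is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Bytes2String reinterprets b as a string without copying.
func Bytes2String(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

// AliasDemo converts the same bytes twice, once without and once with a
// copy, then modifies the byte slice and returns both strings.
func AliasDemo() (shared, copied string) {
	b := []byte("abc")
	shared = Bytes2String(b) // points at b's backing array
	copied = string(b)       // owns its own copy of the bytes
	b[0] = 'x'               // visible through shared, not through copied
	return shared, copied
}

func main() {
	shared, copied := AliasDemo()
	fmt.Println(shared, copied)
}
```

The reverse direction deserves even more care: writing through a slice obtained from String2Bytes on a string literal typically crashes the program, because literals usually live in read-only memory. Such slices are best treated as read-only.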
4.2.4 Golang library function rewrite
For a gateway, the CPU-heavy parts are hash functions, encoding and decoding functions, encryption and decryption functions, and serialization and deserialization functions. In practice we rewrote the relevant library functions and made many optimizations to reduce CPU load.
To reduce CPU load, we need to understand how a CPU works, so that we know how to write code that uses it efficiently. Here is a rough overview.
The classic CPU pipeline has five stages:
- Instruction fetch (IF)
- Instruction decode (ID)
- Execute (EXE)
- Memory Access (MEM)
- Register write-back (WB)
Our optimization targets the MEM stage: by making good use of the CPU cache, we minimize the clock cycles spent on memory access and thus reduce CPU load.
NUMA-aware placement, CPU affinity, and similar techniques reduce CPU load in the same way, by minimizing the clock cycles needed to load data.
Golang library functions can be improved in two ways: optimizing the algorithm itself, and improving CPU cache affinity.
We focus on the second, taking the base64 decoding function as an example. The byte slice passed in and the byte slice returned are backed by different arrays, that is, different memory. This costs extra CPU clock cycles in two ways: memory allocation and release, and CPU cache contention caused by accessing two separate memory regions (similar to, but not exactly the same as, false sharing).
What if we reuse the memory that was passed in, decoding while overwriting the same block of memory? Both problems disappear, and the same work is done in fewer clock cycles. Note that because the function's input and output share one block of memory, this places higher demands on developers: the lifetime of data flowing through the program must be controlled precisely, and the code must be polished with great care.
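To make the idea concrete, here is a small hand-written decoder that overwrites its own input. It is a sketch, not the authors' rewritten function, which the article does not show: `DecodeInPlace` is a hypothetical name, it handles only the standard padded alphabet, and it makes no claim about performance. It works because every 4 input bytes produce at most 3 output bytes, so the write position never overtakes the read position.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeTable maps a base64 character to its 6-bit value, or 0xFF if the
// character is not part of the standard alphabet.
var decodeTable = func() [256]byte {
	var t [256]byte
	for i := range t {
		t[i] = 0xFF
	}
	const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
	for i := 0; i < len(alphabet); i++ {
		t[alphabet[i]] = byte(i)
	}
	return t
}()

var errCorrupt = errors.New("corrupt base64 input")

// DecodeInPlace decodes standard padded base64 held in buf, writing the
// result over the beginning of buf itself, and returns that prefix.
// The caller must not use the original encoded content afterwards.
func DecodeInPlace(buf []byte) ([]byte, error) {
	if len(buf)%4 != 0 {
		return nil, errCorrupt
	}
	n := 0 // write position, always <= read position i
	for i := 0; i < len(buf); i += 4 {
		last := i+4 == len(buf)
		pad := 0
		var v [4]byte
		for j := 0; j < 4; j++ {
			c := buf[i+j]
			if c == '=' && last && j >= 2 {
				pad++ // padding is only legal at the end of the final group
				continue
			}
			if pad > 0 || decodeTable[c] == 0xFF {
				return nil, errCorrupt
			}
			v[j] = decodeTable[c]
		}
		// All four input bytes of this group were read above, so writing
		// up to three bytes at buf[n:] cannot clobber unread input.
		out := [3]byte{v[0]<<2 | v[1]>>4, v[1]<<4 | v[2]>>2, v[2]<<6 | v[3]}
		n += copy(buf[n:], out[:3-pad])
	}
	return buf[:n], nil
}

func main() {
	buf := []byte("aGVsbG8=")
	out, err := DecodeInPlace(buf)
	fmt.Printf("%q %v\n", out, err)
}
```

The returned slice aliases the input buffer, which is exactly the lifetime hazard the paragraph above warns about: any code still holding the encoded bytes now sees decoded data.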
5. Future outlook
The gateway's next step is to go cloud native, implemented with Service Mesh. This addresses the drawbacks of the current centralized gateway: decentralization improves the stability of the access layer, reduces the blast radius, strengthens isolation, and allows finer-grained control.
Second, it reduces machine cost. Our current internal stress tests and published stress test results from the industry both indicate that cost will drop further after moving to Mesh, once the overhead of the existing RPC framework itself is taken into account. Data plane proxies are also being optimized continuously, so their performance will keep improving and the load added by the two extra hops on each machine will shrink further.
Furthermore, the capabilities of the network layer are greatly enhanced. A gateway Mesh can drive the adoption of Mesh in upstream services, eventually forming a superset of capabilities across the entire network layer.
The capabilities provided by existing Service Mesh frameworks can be summarized in four words: Connect, Secure, Control, and Observe. These form a superset of what the existing gateway can do, so things that were impossible before become possible. The most obvious benefit is Observe: it greatly improves observability across the whole service chain, which helps a great deal with follow-up stability work and with quickly locating faults anywhere along the chain.
All of this will take a long time. Beyond it, we will pilot and adopt more cloud native technology. Every engineer knows that the road from technology selection to prototype to real business adoption is long. But if the direction is right, distance is nothing to fear.
We sincerely invite you to join us
The author's team is hiring and looks forward to working with enthusiastic engineers on interesting problems. All technology stacks are welcome. If you are interested, please send your resume to [email protected] with the subject line: name - technical direction - from Autonavi Technology.
Happy Hacking!