Prior to 2015, the main programming languages at Toutiao were Python and some C++. With the rapid growth of services and traffic, server pressure kept increasing, leading to frequent problems. Python's nature as an interpreted language and its outdated multi-process service model were seriously challenged. In addition, the server architecture at the time was a typical monolith with heavy coupling, and some independent functions needed to be split out of it.

Go has several natural advantages over other languages:

  1. Simple syntax, quick to get started with

  2. High performance and fast compilation, with solid development efficiency

  3. Native concurrency support; the coroutine model is an excellent server-side model and is also well suited to network calls

  4. Easy deployment: small compiled binaries with almost no dependencies

At that time, Go 1.4 had been released. I had started using Go back at version 1.1 to develop back-end components and to build high-traffic back-end services, so I was confident in the stability of the Go language. On top of the overall service-oriented transformation of the Toutiao backend, we decided to use Go to build Toutiao's back-end microservice architecture.

In June 2015, Toutiao began using Go to rebuild the back-end Feed stream services. During the process, the existing business continued to iterate while services were split out. By June 2016, almost all the back-end Feed stream services had been migrated to Go. Because of rapid business growth and service separation during that period, there is no like-for-like comparison of metrics before and after the rebuild. In practice, though, the overall stability and performance of the services improved dramatically after switching to Go.

For complex inter-service calls, we abstracted the concept of a quintuple: (From, FromCluster, To, ToCluster, Method). Each quintuple uniquely defines a class of RPC call. Using quintuples as the unit, we built a whole microservice architecture.

We developed an internal microservice framework in Go called Kite, which is fully compatible with the Thrift protocol. Based on the quintuple, we integrated service registration and discovery, distributed load balancing, timeout and circuit-breaker management, service degradation, method-level metrics monitoring, and distributed call-chain tracing into the Kite framework. Currently, Kite is used to develop internal Go services, and the overall architecture supports unlimited horizontal scaling.

The implementation details of the Kite framework and the microservice architecture will be covered separately in the future. Here we mainly share the conveniences the Go language brought us, and the lessons learned, while using Go to build a large-scale microservice architecture. The content covers concurrency, performance, monitoring, and some general experience with the Go language.

As an emerging programming language, Go is characterized by its native support for concurrency. Unlike traditional OS-based thread and process implementations, Go's concurrency is user-space concurrency, which makes it very lightweight: it can easily run tens of thousands or even hundreds of thousands of concurrent tasks. Server-side applications developed in Go therefore use the "coroutine model", where each request is handled by a separate coroutine.

Compared with the process/thread model, this supports several orders of magnitude more concurrency; and compared with server models based on event callbacks, Go code follows people's natural flow of logical reasoning, so even large projects developed in Go remain easy to maintain.

Concurrency in Go is an implementation of the CSP concurrency model, whose core idea is: "Do not communicate by sharing memory; instead, share memory by communicating." In Go this is realized as goroutines and channels. Hoare's 1978 CSP paper includes a description of using the CSP idea to solve the following problem.

“The Problem: To print in ascending order all primes less than 10000. Use an array of processes, SIEVE, in which each process inputs a prime from its predecessor and prints it. The process then inputs an ascending stream of numbers from its predecessor and passes them on to its successor, suppressing any that are multiples of the original prime.”

To find all primes less than 10,000, the method used here is the sieve: starting from 2, mark all numbers divisible by each prime found; once there is nothing left to mark, everything unmarked is prime. Taking the primes within 10 as an example, the following shows how CSP solves this problem.

As the figure above shows, each stage of filtering is an independent concurrent handler, and adjacent handlers pass data to each other to communicate. The primes within 10 are obtained by four concurrent handlers, and the corresponding Go implementation is as follows:

This example illustrates two features of development using the Go language:

  1. Concurrency in Go is simple, and processing efficiency can be improved by increasing concurrency.

  2. Coroutines can share data by communicating with each other.

When concurrency is a native feature of a language, it is used everywhere in practice to handle business logic, especially logic involving network I/O such as RPC calls and database access. Below is an abstract representation of how a microservice processes a request:

When a Request reaches GW, GW must aggregate the results of five downstream services to respond to it. Assuming there are no data dependencies among the five downstream calls, GW issues five RPC requests simultaneously and waits for all five results. To avoid waiting too long, a wait timeout is introduced: when a timeout occurs, a signal is sent to the requests still being processed to avoid resource leaks. From this practice, two abstract models emerge.

  • Wait

  • Cancel

Wait and Cancel are concurrency controls used everywhere when developing services in Go; they come into play whenever concurrency is used. In the example above, after GW starts five coroutines and issues five parallel RPC calls, the main coroutine enters the Wait state, waiting for the results of those five calls: this is the Wait mode. In the Cancel mode, the total timeout for the request elapses before all five RPC calls return, so all outstanding RPC requests must be cancelled and their coroutines terminated early. Wait mode is the more widely used; Cancel mode appears mainly in timeout control and resource reclamation.

In the Go language, sync.WaitGroup and context.Context implement these two modes respectively.

Reasonable timeout control is critical when building a reliable large-scale microservice architecture. Improper or ineffective timeout settings can cause an avalanche of failures along the entire invocation chain.

In the figure, the dependent service G responds slowly for some reason, so requests from upstream services block on calls to G. If those upstream services lack reasonable timeout control, the blocked requests cannot be released, the upstream services themselves are affected, and in turn every service along the invocation chain is affected.

In Go, the server follows the "coroutine model": one coroutine handles one request. If request handling blocks because a dependent service responds slowly, a large number of coroutines can pile up in a short time. Each coroutine consumes a different amount of memory depending on its processing logic, so when coroutines surge, the server process quickly consumes a large amount of memory.

Coroutine pile-up and memory spikes burden the Go scheduler and runtime GC, further reducing the service's processing capacity, a vicious cycle that can make the entire service unavailable. This problem, which we call coroutine inflation, occurred many times during our development of microservices in Go.

Is there a good way to solve this problem? The common cause is that a network call blocks for too long; even with network timeouts set reasonably, sometimes the timeout does not take effect. To analyze how to use timeout control in Go, let us first walk through the steps of a network call.

The first step is establishing a TCP connection; a connection timeout is usually set to ensure the connection phase does not block indefinitely.

The second step is writing the serialized Request data to the socket. To ensure writes do not block forever, Go provides SetWriteDeadline to bound the time spent writing to the socket. Depending on the size of the Request, multiple socket writes may be required, and serialization may be interleaved with writing for efficiency; the Thrift implementation therefore resets the timeout before each socket write.

The third step is reading the returned result from the socket. As with writing, Go provides SetReadDeadline, and since the data may be read in several passes, the timeout is reset before each read.

Analyzing the process above, we find that the total RPC time consists of three parts: connection timeout, write timeout, and read timeout. Moreover, reads and writes can each occur multiple times, which makes the total time uncontrollable. To address this, the concept of concurrent timeout control was introduced into the Kite framework and integrated into its client-side call library.

The concurrent timeout control model is shown in the figure above. It introduces a "Concurrent Ctrl" module, part of the microservice circuit-breaker functionality, which controls the maximum number of concurrent requests a client may make. The overall flow of concurrent timeout control is as follows.

First, the client initiates an RPC request and the Concurrent Ctrl module decides whether the request is allowed to proceed. If it is, a coroutine is started to execute the RPC call, and a timeout timer is initialized. The main coroutine then listens for both the RPC completion signal and the timer signal. If the RPC completion event arrives first, the RPC succeeded; if the timer fires first, the RPC call has timed out. This model guarantees that no RPC ever exceeds the predefined time, achieving precise timeout control.

Go 1.7 introduced context into the standard library, and it quickly became near-standard practice for concurrency and timeout control; Go 1.8 then added context support to more standard library packages, including database/sql.

Go already has significant performance advantages over traditional Web server programming languages. Still, performance analysis tools are often needed to track down problems and optimize services, whether because of incorrect usage or strict latency requirements. The Go toolchain ships with a variety of performance analysis tools for developers:

  • CPU usage analysis

  • Memory usage analysis

  • Coroutine stack inspection

  • GC log inspection

  • Trace analysis tool

Below are screenshots of the various analysis methods.

In the course of developing with Go, we have summarized some approaches to writing high-performance Go services:

  1. Pay attention to lock usage; try to lock data rather than whole procedures

  2. Use CAS operations where CAS is applicable

  3. Make targeted optimizations for hot code paths

  4. Don't ignore the impact of GC, especially for high-performance, low-latency services

  5. Reasonable object reuse can yield very good optimization results

  6. Avoid using reflection in high-performance services

  7. In some scenarios, try tuning the "GOGC" parameter

  8. Upgrade to new Go versions once they are stable, because old versions will never improve

The following describes a real-world example of online service performance optimization.

This is a basic storage service providing SetData and GetDataByRange methods to store data in batches and retrieve it in batches by time interval. To improve performance, data is stored in a KV database with the user ID plus a time period as the key and all data within that time range as the value. Therefore, when new data needs to be stored, the existing data must be read from the database, the new data spliced into the corresponding time interval, and the result saved back to the database.

For read requests, the list of keys is computed from the requested time interval, and the data is read from the database in a loop.

In this case, the service's peak response times were high, seriously affecting overall performance. After profiling the service at peak with the methods above, we drew the following conclusions:

Problem:

  • GC pressure is high and consumes significant CPU

  • Deserialization takes up a lot of CPU

Optimization idea:

  1. GC pressure comes mainly from frequent memory allocation and release, so we decided to reduce memory and object allocations

  2. Thrift was being used for serialization; benchmarking showed Msgpack to be a comparatively more efficient serialization method, so we switched to it

Analysis of the service interface shows that data decompression and deserialization are the most frequent operations, consistent with the profiling results. Looking closely at the decompression and deserialization process, we see that deserialization requires an io.Reader, while the decompressor itself implements io.Reader. In Go, the io.Reader interface is defined as follows:
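From the standard library's io package:

```go
type Reader interface {
	Read(p []byte) (n int, err error)
}
```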

This interface defines a Read method; any object implementing it can be read a certain number of bytes at a time. Therefore, only a relatively small memory Buffer is needed to pipeline decompression directly into deserialization, instead of decompressing all the data before deserializing, which saves a large amount of memory.

To avoid frequent Buffer allocation and release, sync.Pool is used to implement an object pool for reuse.
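A minimal sketch of such a Buffer pool with sync.Pool (the `handle` function is a placeholder for real request processing, not our service code):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses bytes.Buffer objects across requests, cutting down on
// allocations and therefore on GC pressure.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func handle(payload []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()             // a recycled buffer may still hold old data
	defer bufPool.Put(buf)  // return the buffer for the next request

	buf.Write(payload) // stand-in for real decompress/deserialize work
	return buf.String()
}

func main() {
	fmt.Println(handle([]byte("hello")))
}
```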

In addition, for the history-fetch interface, reading the data for multiple keys was changed from a sequential loop to concurrent reads from the database. After these optimizations, the service's peak PCT99 latency dropped from 100 ms to 15 ms.

The above is a typical Go service optimization case, which can be summarized in two points:

  1. Improve concurrency at the business level

  2. Reduce memory and object usage

During the optimization, the pprof tool was used to find the performance bottlenecks; a pipelined data-processing pattern based on the io.Reader interface was then adopted, optimizing the overall performance of the service.

The Go runtime package provides multiple interfaces for developers to obtain the state of the current process. The Kite framework integrates monitoring of coroutine count, coroutine state, GC pause time, GC frequency, stack memory usage, and more. These metrics are collected in real time for every running service, and alarm thresholds are set on metrics such as coroutine count and GC pause time. We are also experimenting with taking snapshots of running services' stacks and runtime state to track down process restarts that cannot be reproduced.
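A sketch of how a few of these metrics can be read from the runtime package (the metric selection mirrors the list above; the `snapshot` function name is ours, not the Kite API):

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot collects some of the process metrics mentioned above:
// goroutine count, current heap usage, and completed GC cycles.
func snapshot() (goroutines int, heapAlloc uint64, numGC uint32) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // also exposes PauseNs, PauseTotalNs, etc.
	return runtime.NumGoroutine(), ms.HeapAlloc, ms.NumGC
}

func main() {
	g, heap, gcs := snapshot()
	fmt.Printf("goroutines=%d heap=%dB gc_cycles=%d\n", g, heap, gcs)
}
```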

Compared to traditional Web programming languages, Go does change the way you think while programming. Each Go service is an independent process, and if handling any request panics, the whole process exits. Therefore, when starting a coroutine, you must consider whether to use recover to keep one failure from affecting other coroutines. In Web server development, you often want to string together the entire processing of a request, which traditionally relies on thread-local variables; Go has no such concept, so a context must be passed along through function calls.
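A common pattern for containing such panics, sketched below (`safeGo` is our name for the helper, not a Kite API):

```go
package main

import (
	"fmt"
	"log"
)

// safeGo runs fn in a new goroutine and recovers any panic, so one bad
// request cannot bring down the whole process.
func safeGo(fn func()) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("goroutine panic recovered: %v", r)
			}
		}()
		fn()
	}()
}

func main() {
	done := make(chan struct{})
	safeGo(func() {
		defer close(done)
		panic("boom") // without recover, this would kill the process
	})
	<-done
	fmt.Println("process still alive")
}
```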

Finally, in Go projects, where concurrency is the norm, access to shared resources requires extra attention, and handling critical-section logic adds mental overhead. These differences in programming mindset require an adjustment from developers used to traditional Web back-end development.

Engineering is an aspect of the Go language that is seldom discussed. In fact, Go's official rationale for its creation notes that in most languages, once a codebase becomes huge, managing the code and analyzing its dependencies become extremely difficult; the code itself becomes the biggest pain point, and many huge projects end up too scary to touch. Go, by contrast, has a simple syntax design, a C-like style, and few ways to do any one thing; some aspects of code style are even enforced by the Go toolchain. Moreover, the Go standard library ships with a source-code analysis package that can easily turn a project's code into an AST.

Here is a graphic representation of the engineering nature of the Go language:

To assemble the square, Go offers only one way, and every unit is consistent; in Python the pieces can be put together in many different ways.

Toutiao has used the Go language to build a large-scale microservice architecture. Drawing on Go's language features, this article focused on our practices around concurrency, timeout control, and performance in building microservices. The Go language is excellent not only in service performance but also for containerized deployment, and a large portion of our services already run on internal private-cloud platforms. Together with our microservice components, we are evolving toward a Cloud Native architecture.

For more technical practices, check out the Toutiao technology blog: techblog.toutiao.com

About the author

Xiang Chao, senior R&D engineer at Toutiao. He joined Toutiao in 2015 and was responsible for the service-oriented transformation of the backend, promoted the use of the Go language internally, and developed the internal microservice framework Kite, integrating service governance, load balancing, and other microservice functions, driving the adoption of a large-scale Go microservice architecture at Toutiao. He previously worked at Xiaomi.