Background
Ant Financial has high requirements on the stability and performance of Service Mesh, and the internal MOSN is widely used in production. On the cloud and in the open-source community, Dubbo and Spring Cloud are also widely used in production in the RPC domain, and we support Dubbo and Spring Cloud traffic proxying on top of MOSN. In the process of supporting the Dubbo protocol, we found that the Mesh traffic proxy suffered a very large performance loss, and the large merchants landing Mesh also have high performance requirements. This article therefore focuses on the performance optimization of MOSN for dubbo-go-hessian2 and the Dubbo protocol.
Overview of Performance Optimization
Based on actual service deployment scenarios, ordinary Linux machines are used rather than high-performance machines. The setup and load-test parameters are as follows:
- Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 4 cores, 16G memory;
- Pod configuration: 2C 1G, JVM parameters: -server -Xms1024m -Xmx1024m;
- Network latency: 0.23ms. Two Linux machines are deployed, one running Server + MOSN and the other running the load-testing program RPC-Perfomance.
After 3 rounds of performance optimization, the optimized version of MOSN yields the following performance benefits (load tests with random 512-byte and 1K payloads):
- 512 bytes: overall TPS of MOSN + Dubbo service calls increases by 55-82.8%, RT decreases by about 45%, and memory occupation is 40M;
- 1K: overall TPS of MOSN + Dubbo service calls increases by 51.1-69.3%, RT decreases by about 41%, and memory occupation is 41M.
Performance optimization tool pprof
Before optimizing, we first need to find the performance bottlenecks. Once found, the next difficulty is how to replace the slow code with efficient code. Because Ant Financial Service Mesh is implemented in Go, our first choice is Go's pprof tool, briefly introduced here. If a Go program with http.Server imports _ "net/http/pprof" in its main package, Go mounts the corresponding handlers for us; see the godoc for details.
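As a minimal sketch of wiring this up (the port mirrors the MOSN default mentioned below; the rest is illustrative, not MOSN's actual startup code):

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the pprof endpoints; 34902 mirrors MOSN's default diagnostic port.
	go func() {
		http.ListenAndServe("0.0.0.0:34902", nil)
	}()
	// ... application logic ...
	select {}
}
```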
Because MOSN exposes an HTTP service on port 34902 by default, you can easily obtain a MOSN performance diagnostic file by running the following command:
```
# samples the CPU for 60 seconds; produces pprof.mosn.samples.cpu.001.pb.gz
go tool pprof -seconds 60 http://benchmark-server-ip:34902/debug/pprof/profile
```
Then open the diagnostic file with pprof for easy viewing in the browser. The flame graph profiled during the load test is shown in Figure 1-1:
```
# -http=:8000 makes pprof listen on port 8000 for web-browser analysis
# mosnd is the MOSN binary executable, used to resolve code symbols
# pprof.mosn.samples.cpu.001.pb.gz is the CPU diagnostic file
go tool pprof -http=:8000 mosnd pprof.mosn.samples.cpu.001.pb.gz
```
Once you have the diagnostic data, you can switch to the browser's Flame Graph view (built into Go 1.11 and above). The X axis of the flame graph represents CPU time consumed, and the Y axis represents the method call stack. Before starting optimization, we used go tool pprof to diagnose the rough performance bottlenecks (load testing the server-side MOSN directly):
- When MOSN proxies a Dubbo request, the CPU hotspot is in streamConnection.Dispatch;
- When MOSN forwards a Dubbo request, the CPU hotspot is in downStream.
Click any horizontal bar in the flame graph to view the time consumption and stack details of that block (see Figures 1-2 and 1-3):
Performance Optimization
This article focuses on recording which optimization cases improved throughput by 50%+ and reduced RT, so those cases are analyzed directly below. Before that, let's take Dispatch as an example to see why it consumes so much performance. Run the following command in a terminal to view per-line CPU consumption (the code has been trimmed):
```
go tool pprof mosnd pprof.mosn.samples.cpu.001.pb.gz
(pprof) list Dispatch
Total: 1.75mins
     370ms     37.15s (flat, cum) 35.46% of Total
      10ms       10ms    123:func (conn *streamConnection) Dispatch(buffer types.IoBuffer) {
      40ms      630ms    125:	log.DefaultLogger.Tracef("stream connection dispatch data string = %v", buffer.String())
         .          .    126:
         .          .    127:	// get sub protocol codec
         .      250ms    128:	requestList := conn.codec.SplitFrame(buffer.Bytes())
      20ms       20ms    129:	for _, request := range requestList {
      10ms      160ms    134:		headers := make(map[string]string)
         .          .    135:		// support dynamic route
      50ms      920ms    136:		headers[strings.ToLower(protocol.MosnHeaderHostKey)] = conn.connection.RemoteAddr().String()
         .          .    149:
         .          .    150:		// get stream id
      10ms      440ms    151:		streamID := conn.codec.GetStreamID(request)
         .          .    156:		// request route
         .       50ms    157:		requestRouteCodec, ok := conn.codec.(xprotocol.RequestRouting)
         .          .    158:		if ok {
         .     20.11s    159:			routeHeaders := requestRouteCodec.GetMetas(request)
         .          .    165:		}
         .          .    166:
         .          .    167:		// tracing
      10ms       80ms    168:		tracingCodec, ok := conn.codec.(xprotocol.Tracing)
         .          .    169:		var span types.Span
         .          .    170:		if ok {
      10ms      1.91s    171:			serviceName := tracingCodec.GetServiceName(request)
         .      2.17s    172:			methodName := tracingCodec.GetMethodName(request)
         .          .    176:
         .          .    177:			if trace.IsEnabled() {
         .       50ms    179:				tracer := trace.Tracer(protocol.Xprotocol)
         .          .    180:				if tracer != nil {
      20ms      1.66s    181:					span = tracer.Start(conn.context, headers, time.Now())
         .          .    182:				}
         .          .    183:			}
         .          .    184:		}
         .          .    185:
         .      110ms    186:		reqBuf := networkbuffer.NewIoBufferBytes(request)
         .          .    188:		// append sub protocol header
      10ms      950ms    189:		headers[types.HeaderXprotocolSubProtocol] = string(conn.subProtocol)
      10ms      4.96s    190:		conn.OnReceive(ctx, streamID, protocol.CommonHeader(headers), reqBuf, span, isHearbeat)
      30ms       60ms    191:		buffer.Drain(requestLen)
         .          .    192:	}
         .          .    193:}
```
From the list Dispatch output above, the hot spots are mainly on lines 159, 171, 172, 181, and 190, which correspond to decoding Dubbo parameters, repeatedly parsing parameters, tracer, deserialization, logging, and so on.
1. Optimize Dubbo decoding GetMetas
By decoding Dubbo's body, we can obtain information such as the target interface of the call and the group of the called method, but getting there requires skipping all the business method parameters. We currently use the open-source hessian-go library, whose string and map parsing performance is poor; improving the decoding performance of the hessian library is explained later in this article.
Optimization idea:
On the ingress side of MOSN (where MOSN forwards requests directly to the local Java server process), we can infer the interface and group used by the caller from the path and version of the request. As long as the correct dataId can be constructed, MOSN can forward blindly without decoding the body, extracting a performance improvement.
At service registration time, we can build mappings from the path, version, and group of the published service to the interface and group. When MOSN forwards a Dubbo request, it can read the cache under a read lock and skip decoding the body, accelerating MOSN's performance.
Therefore, we built the following cache implementation (an array + linked-list data structure), which can be seen in the optimization code diff:
```go
// metadata.go
// DubboPubMetadata dubbo pub cache metadata
var DubboPubMetadata = &Metadata{}

// DubboSubMetadata dubbo sub cache metadata
var DubboSubMetadata = &Metadata{}

// Metadata caches service pub or sub metadata.
// speed up dubbo decode or encode performance.
// please do not use outside of the dubbo framework.
type Metadata struct {
	data map[string]*Node
	mu   sync.RWMutex // protect data internal
}

// Find the cached pub or sub metadata.
// caller should check that matched is true
func (m *Metadata) Find(path, version string) (node *Node, matched bool) {
	// we found nothing
	if m.data == nil {
		return nil, false
	}

	m.mu.RLocker().Lock()
	// for performance
	// m.mu.RLocker().Unlock() should be called.

	// we check head node first
	head := m.data[path]
	if head == nil || head.count <= 0 {
		m.mu.RLocker().Unlock()
		return nil, false
	}

	node = head.Next
	// just only once, just return
	// for the dubbo framework, that's what we expected.
	if head.count == 1 {
		m.mu.RLocker().Unlock()
		return node, true
	}

	var count int
	var found *Node

	for ; node != nil; node = node.Next {
		if node.Version == version {
			if found == nil {
				found = node
			}
			count++
		}
	}

	m.mu.RLocker().Unlock()
	return found, count == 1
}

// Register pub or sub metadata
func (m *Metadata) Register(path string, node *Node) {
	m.mu.Lock()
	// for performance
	// m.mu.Unlock() should be called.

	if m.data == nil {
		m.data = make(map[string]*Node, 4)
	}

	// we check head node first
	head := m.data[path]
	if head == nil {
		head = &Node{
			count: 1,
		}
		// update head
		m.data[path] = head
	}

	insert := &Node{
		Service: node.Service,
		Version: node.Version,
		Group:   node.Group,
	}

	next := head.Next
	if next == nil {
		// first insert, just insert to head
		head.Next = insert
		// record last element
		head.last = insert

		m.mu.Unlock()
		return
	}

	// we check already exist first
	for ; next != nil; next = next.Next {
		// we found it
		if next.Version == node.Version && next.Group == node.Group {
			// release lock and do nothing
			m.mu.Unlock()
			return
		}
	}

	head.count++
	// append node to the end of the list
	head.last.Next = insert
	// update last element
	head.last = insert

	m.mu.Unlock()
}
```
The cache built during service registration can then be hit when the MOSN stream is decoded, so the interface and group information is obtained without decoding the parameters; see the optimization code diff:
```go
// decoder.go
// for better performance.
// If the ingress scenario is not using group,
// we can skip parsing attachment to improve performance
if listener == IngressDubbo {
	if node, matched = DubboPubMetadata.Find(path, version); matched {
		meta[ServiceNameHeader] = node.Service
		meta[GroupNameHeader] = node.Group
	}
} else if listener == EgressDubbo {
	// for better performance.
	// If the egress scenario is not using group,
	// we can skip parsing attachment to improve performance
	if node, matched = DubboSubMetadata.Find(path, version); matched {
		meta[ServiceNameHeader] = node.Service
		meta[GroupNameHeader] = node.Group
	}
}
```
On the egress side of MOSN (where MOSN directly forwards requests from the local Java client process), MOSN likewise infers the interface and group used by the caller from the path and version of the request. Constructing the correct dataId allows blind forwarding without decoding the body, extracting the same performance improvement.
2. Optimize Dubbo decoding parameters
When decoding Dubbo parameter values, MOSN used hessian's regular-expression matching, which consumes a lot of performance. Let's look at the benchmark comparison before and after the optimization: performance improved 50x!
```
go test -bench=BenchmarkCountArgCount -run=^$ -benchmem
BenchmarkCountArgCountByRegex-12         200000      6236 ns/op    1472 B/op      24 allocs/op
BenchmarkCountArgCountOptimized-12     10000000       124 ns/op       0 B/op       0 allocs/op
```
Optimization idea:
Instead of regular expressions, we can identify the number of parameter types with simple string parsing, since Dubbo encodes the parameter types as a string. The parsing is not complicated: objects are prefixed with L and terminated with ;, arrays are prefixed with [, and each primitive type is a single character.
```go
func getArgumentCount(desc string) int {
	len := len(desc)
	if len == 0 {
		return 0
	}

	var args, next = 0, false
	for _, ch := range desc {
		// is array ?
		if ch == '[' {
			continue
		}

		// is object ?
		if next && ch != ';' {
			continue
		}

		switch ch {
		case 'V', // void
			'Z', // boolean
			'B', // byte
			'C', // char
			'D', // double
			'F', // float
			'I', // int
			'J', // long
			'S': // short
			args++
		default:
			// we found object
			if ch == 'L' {
				args++
				next = true
				// end of object ?
			} else if ch == ';' {
				next = false
			}
		}
	}
	return args
}
```
3. Optimize hessian-go string decoding performance
As Figure 1-2 shows, hessian-go has a high CPU sampling ratio when decoding strings. When decoding a Dubbo request, we parse the Dubbo framework version, call path, interface version, and method name, all of which are strings, so hessian-go's string parsing directly affects RPC performance.
Let's first run the string-decoding benchmark comparison before and after: performance improved by 56.11%! This corresponds to roughly a 5% improvement in RPC performance.
```
BenchmarkDecodeStringOriginal-12     1967202       613 ns/op     272 B/op     6 allocs/op
BenchmarkDecodeStringOptimized-12    4477216       269 ns/op     224 B/op     5 allocs/op
```
Optimization idea:
Decode the UTF-8 bytes directly, which gives the highest performance; the previous implementation decoded bytes into runes and then converted the runes into a string, which hurt performance. We also add bulk string chunk copying to reduce the number of read calls, and use an unsafe (no-copy) conversion from bytes to string to avoid extra checks. Because the optimization diff is large, see the optimization code PR.
Go's internal slicerunetostring (which converts []rune to string) also goes through a byte array first, which gave me some ideas for this optimization.
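To illustrate the idea, here is a minimal sketch of the zero-copy byte-to-string conversion; the helper name is hypothetical and the actual PR differs in detail:

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying.
// Only safe when b is never mutated afterwards.
func bytesToString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	buf := []byte("com.alipay.demo.HelloService")
	fmt.Println(bytesToString(buf)) // no extra allocation for the string data
}
```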
4. Optimize the hessian library codec object
Although most of the Dubbo body decoding has been eliminated, MOSN must still decode the frame version, request path, and interface version from the request header when processing a Dubbo request. However, a serialization object was created for every decode, which is very expensive because hessian allocates a 4K buffer and resets it every time a reader is created.
```
      10ms       10ms     75:func unSerialize(serializeId int, data []byte, parseCtl unserializeCtl) *dubboAttr {
      10ms      140ms     82:	attr := &dubboAttr{}
      80ms      2.56s     83:	decoder := hessian.NewDecoderWithSkip(data[:])
ROUTINE ======================== bufio.NewReaderSize in /usr/local/go/src/bufio/bufio.go
      50ms      2.44s (flat, cum)  2.33% of Total
         .      220ms     55:	r := new(Reader)
      50ms      2.22s     56:	r.reset(make([]byte, size), rd)
         .          .     57:	return r
         .          .     58:}
```
We can pool this memory instead. The before-and-after benchmark comparison shows a performance improvement of 85.4%! Benchmark results:
```
BenchmarkNewDecoder-12              1487685       803 ns/op    4528 B/op     9 allocs/op
BenchmarkNewDecoderOptimized-12    10564024       117 ns/op     128 B/op     3 allocs/op
```
Optimization idea:
Pool the hessian decoder object used by each codec: add NewCheapDecoderWithSkip and support resetting and reusing the decoder.
```go
var decodePool = &sync.Pool{
	New: func() interface{} {
		return hessian.NewCheapDecoderWithSkip([]byte{})
	},
}

// take a decoder from the pool
decoder := decodePool.Get().(*hessian.Decoder)
// fill in the data to decode
decoder.Reset(data[:])
// ... decode ...
// put the decoder back into the pool
decodePool.Put(decoder)
```
5. Optimize repeated decoding of service and methodName values
When xprotocol implements xprotocol.Tracing to obtain the service name and method name, each getter triggers another full parse, so the request is parsed twice, resulting in high overhead.
```
      10ms      1.91s    171:			serviceName := tracingCodec.GetServiceName(request)
         .      2.17s    172:			methodName := tracingCodec.GetMethodName(request)
```
Optimization idea:
Since GetMetas has already parsed the headers once, the already-parsed headers can be passed in and reused instead of being parsed again. The interface is also refactored to return both values in one call, eliminating a second parse.
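A sketch of what the refactored interface could look like (the method and parameter names are illustrative, not MOSN's actual API):

```go
package sketch

// Tracing sketches the refactored interface: a single call returns both
// names, and headers already parsed by GetMetas are passed in so the
// request payload does not need to be decoded a second time.
type Tracing interface {
	GetServiceAndMethodName(request []byte, parsedHeaders map[string]string) (service, method string)
}
```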
6. Optimize streamId type conversion
We compared the performance of converting back and forth between byte arrays and streamId in Go.
Optimization idea:
In production code, avoid fmt.Sprintf and fmt.Printf for type conversion and for printing information; use strconv for the conversion instead.
```
         .      430ms    147:	reqIDStr := fmt.Sprintf("%d", reqID)
      60ms      4.10s    168:	fmt.Printf("src=%s, len=%d, reqid:%v\n", streamID, reqIDStrLen, reqIDStr)
```
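As a minimal sketch of the strconv replacement for hot lines like the above (the values are illustrative):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	var reqID uint64 = 12345

	// Slow: fmt.Sprintf reflects on its arguments and allocates.
	slow := fmt.Sprintf("%d", reqID)

	// Fast: strconv formats the integer directly.
	fast := strconv.FormatUint(reqID, 10)

	// Faster still with a reusable buffer: append without a fresh allocation.
	buf := make([]byte, 0, 20)
	buf = strconv.AppendUint(buf, reqID, 10)

	fmt.Println(slow, fast, string(buf))
}
```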
7. Optimize expensive system calls
When MOSN decodes a Dubbo request, it inserts the remote host's address into the headers, and fetching the remote IP inside the for loop incurs high system-call overhead.
Optimization idea:
```
      50ms      920ms    136:		headers[strings.ToLower(protocol.MosnHeaderHostKey)] = conn.connection.RemoteAddr().String()
```
When retrieving the remote address, cache the remote IP value in streamConnection whenever possible instead of calling RemoteAddr every time.
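A minimal sketch of the caching idea (the field and method names are illustrative, not MOSN's actual code):

```go
package sketch

import "net"

// streamConnection caches the remote address string so the lookup and
// string conversion happen once per connection rather than once per request.
type streamConnection struct {
	conn       net.Conn
	remoteAddr string // cached on first use
}

func (s *streamConnection) remoteAddrString() string {
	if s.remoteAddr == "" {
		s.remoteAddr = s.conn.RemoteAddr().String()
	}
	return s.remoteAddr
}
```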
8. Optimize slice and map to avoid expansion and rehashing
When MOSN processes a Dubbo request, it builds the dataId from the interface, version, and group and then matches the cluster, creating slice and map objects with default sizes. Performance diagnosis showed that the resulting continuous slice allocation and map growth/rehashing is quite costly.
Optimization idea:
When using slices and maps, estimate the capacity as much as possible and use make(type, capacity) to specify the initial size, as illustrated below.
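A minimal example of pre-sizing (the capacities and values are assumptions for a typical request):

```go
package main

import "fmt"

func main() {
	// Pre-size the slice: dataId is built from interface, version, and group,
	// so the element count is known in advance.
	parts := make([]string, 0, 3)
	parts = append(parts, "com.demo.HelloService", "1.0.0", "groupA")

	// Pre-size the map to a typical header count to avoid rehashing on growth.
	headers := make(map[string]string, 8)
	headers["service"] = parts[0]
	headers["version"] = parts[1]

	fmt.Println(parts, len(headers))
}
```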
9. Optimize Trace log level output
A lot of code in MOSN logs at Trace level and builds many parameter values while processing logic.
Optimization idea:
Before emitting Trace output, check the log level first. If there are multiple Trace calls, write as much as possible into a buffer first and then write the buffer contents to the log, calling the Trace logging method as few times as possible.
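A sketch of the guard-then-batch pattern (the level check and logger are stand-ins, not MOSN's actual logging API):

```go
package main

import (
	"log"
	"strings"
)

const traceEnabled = false // stand-in for a real logger's level check

func main() {
	streamID, serviceName := "1", "com.demo.HelloService"

	// Check the level before building any log arguments, and batch
	// multiple fragments into one buffer so the logger is called once.
	if traceEnabled {
		var buf strings.Builder
		buf.WriteString("stream=")
		buf.WriteString(streamID)
		buf.WriteString(" service=")
		buf.WriteString(serviceName)
		log.Print(buf.String())
	}
}
```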
10. Optimize Tracer, Log, and Metrics
During big-promotion periods, the machines face high performance requirements. Performance diagnosis showed that the Tracer, the MOSN log, and cloud Metrics log writes (IO operations) consume a lot of performance.
Optimization idea:
Allow these features to be switched via API: deliver the configuration through the configuration center, or add a big-promotion downgrade switch:
```
/api/v1/downgrade/on
/api/v1/downgrade/off
```
11. Optimize route header parsing
Before routing, MOSN performs many header map accesses, such as the LDC and ANTVIP logic checks. The commercial and open-source MOSN do not need this logic, which also consumes some overhead.
Optimization idea:
If running on the cloud, the internal-only MOSN logic is skipped.
12. Optimize featuregate calls
Featuregate is used to decide which routing logic the internal and commercial versions use when MOSN processes a request. Calling through featuregate is expensive, requiring frequent type conversions and multi-layer map lookups.
Optimization idea:
Record the corresponding featuregate switch in a bool variable, and only call featuregate when it has not been initialized yet.
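A minimal sketch of the bool-caching pattern (the featuregate lookup is a hypothetical stand-in):

```go
package sketch

import "sync"

var (
	featureOnce    sync.Once
	featureEnabled bool
)

// isFeatureEnabled resolves the featuregate switch once and serves a plain
// bool afterwards, avoiding repeated type conversions and map lookups.
func isFeatureEnabled() bool {
	featureOnce.Do(func() {
		featureEnabled = lookupFeatureGate("demo.feature") // hypothetical lookup
	})
	return featureEnabled
}

// lookupFeatureGate stands in for the real featuregate call.
func lookupFeatureGate(name string) bool { return false }
```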
Thoughts on future performance optimization
After several rounds of performance optimization, the current flame graph shows that the remaining hot spots are all in connection reads and writes, so the remaining optimization space is relatively small. But the following directions may still bring benefits:
- Reduce the number of connection reads and writes (syscalls);
- Optimize the IO thread model to reduce goroutine and context-switching overhead, etc.
In conclusion, here is the optimized flame graph: most remaining hot spots are in system calls and network reads and writes; see Figure 1-4.
About the author
Attainments, open-source Apache Dubbo PMC member. Currently working in the Middleware team of Ant Financial, focusing on RPC and Service Mesh. Author of "In-depth Understanding of Apache Dubbo and Practice". GitHub: github.com/zonghaishan…
Other
The pprof tool is extremely powerful and can diagnose CPU, memory, goroutines, tracer, deadlocks, and more. Further reading:
- blog.golang.org/pprof
- www.cnblogs.com/Dr-wei/p/11…
- www.youtube.com/watch?v=N3P…
Financial Class Distributed Architecture (Antfin_SOFA)