This article is part of the Bytedance Infrastructure Practices series, in which the technical teams and experts of Bytedance's infrastructure department share practical experience and lessons from building and evolving infrastructure, and exchange ideas and grow together with the technical community. Replicating online traffic into an offline environment is a common requirement, widely used in functional testing and load testing. This article introduces the evolution and system design of the traffic replication system at Bytedance, and we hope it brings you some new ideas and insights.
1. Background
A/B testing (and diff testing) is a common verification method in the Internet industry. For example, Google verifies the effect of ads and recommendations through A/B tests, and Twitter built Diffy to apply diff verification to the quality assurance of API interfaces. A/B testing usually takes one of two forms. In the first, multiple service versions run online and the access layer splits traffic between A and B; however, for scenarios such as advertising, a faulty model can directly cause losses. In the second, online traffic is replicated into an internal environment, which is completely safe for production; Twitter's diff verification service Diffy follows this model. Today, recommendation, advertising, and many other business lines inside Bytedance run A/B experiments by replaying online traffic in real time. To meet these business needs, we built ByteCopy, an online traffic recording and replay system that handles the businesses' massive traffic and their new requirements for traffic recording and playback.
This article introduces the evolution of ByteCopy from the perspectives of business scenarios, system architecture, and problem analysis.
2. Building a first-generation traffic replication system based on TCPCopy
2.1 Service Requirements
For deployed target services (HTTP and RPC), the traffic replication system should be able to copy and forward the traffic they receive online. The replication process should be flexible to control and enabled only when test traffic is needed.
2.2 System Selection
Traffic replication itself falls into two main categories: in-path replication and bypass replication. We analyze the advantages and disadvantages of each below.
2.2.1 In-Path Replication
In-path replication means copying traffic within the call chain itself. One form is replication inside the business logic: for example, during an API or RPC call, the service code records the request/response itself. The other form is replication inside the processing logic of a framework such as Dubbo or a Service Mesh (a minimal sketch of the in-process form is shown after the list below).
Advantages
- Fine-grained, customized replication can be implemented based on service logic. For example, only traffic that hits certain features can be copied, maximizing the proportion of useful traffic collected at the source.
Disadvantages
- The service logic and the replication logic are tightly coupled and affect each other.
- Every request incurs extra replication processing, which has a performance impact on the business path.
Applicable scenarios
- In-path replication suits traffic that needs fine-grained filtering or is tied to service logic. For example, in a Service Mesh, traffic can be replicated according to request coloring (dyeing) tags.
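To make the in-path approach concrete, here is a minimal Go sketch of in-process replication for an HTTP service. It assumes a hypothetical shadow endpoint (`shadowURL`) and uses an `X-Copy-Traffic` header as a stand-in for the business-level filter; it is an illustration of the idea, not ByteCopy's actual implementation.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

// shadowURL is a hypothetical test endpoint that receives copied traffic.
const shadowURL = "http://test-service.internal:8080"

var shadowClient = &http.Client{Timeout: 2 * time.Second}

// withTrafficCopy wraps a handler and asynchronously replays selected
// requests against the shadow endpoint, so the production path is not blocked.
func withTrafficCopy(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "read body failed", http.StatusBadRequest)
			return
		}
		r.Body.Close()
		// Restore the body for the real handler.
		r.Body = io.NopCloser(bytes.NewReader(body))

		// Copy only requests matching a business condition, e.g. a header flag.
		if r.Header.Get("X-Copy-Traffic") == "1" {
			go func(method, path string, payload []byte) {
				req, err := http.NewRequest(method, shadowURL+path, bytes.NewReader(payload))
				if err != nil {
					return
				}
				resp, err := shadowClient.Do(req)
				if err != nil {
					log.Printf("shadow copy failed: %v", err)
					return
				}
				resp.Body.Close()
			}(r.Method, r.URL.Path, body)
		}

		next.ServeHTTP(w, r)
	})
}
```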
2.2.2 Bypass Replication
In contrast to in-path replication, bypass replication is transparent to the service. Typically, a third-party process listens in the network protocol stack and copies the traffic.
Advantages
- Decoupled from the service: the replication module can be deployed and upgraded independently, and the service owner does not need to care about how replication is implemented. Copying traffic at the bottom of the protocol stack also yields better performance.
Disadvantages
- After packets are captured from the NIC at layer 4, they must be reassembled and parsed, which consumes extra computing resources.
- In most cases, packets must be fully captured and parsed before they can be filtered, and sampling cannot be customized in combination with service logic.
Open-source solution: TCPCopy
Although Linux provides low-level packet capture libraries such as libpcap, we chose the open-source TCPCopy as the core of the traffic replication system in order to deliver on business requirements quickly. We will not introduce TCPCopy in depth here; a simple architecture diagram is attached below, in which tcpcopy and intercept are TCPCopy's two components. Readers interested in the details can look them up on their own.
Key advantages of TCPCopy:
- Protocol-agnostic transparent forwarding: any TCP-based application-layer protocol is supported, such as MySQL, Kafka, and Redis
- Real-time forwarding with low latency
- The original source IP and port of each request are preserved, so the test server can use them for statistics
At the same time, it also has the following shortcomings:
- Multiple downstream servers cannot be added dynamically
- Because forwarding is transparent and no protocol parsing is done, data anomalies cannot be detected; for example, if some TCP packets are lost, the test server receives incomplete data. Application-layer data also cannot be filtered or modified
- The core components are not multi-threaded, so processing capacity has a bottleneck
- iptables must be modified to drop the response packets of downstream services, which is risky in production or shared test environments
To meet Bytedance's requirements, we introduced additional components into the overall architecture to compensate for TCPCopy's shortcomings.
2.3 System Architecture
To address TCPCopy's shortcomings, we made several optimizations on top of the basic scheme of forwarding traffic directly through TCPCopy.
First, an additional layer-7 proxy is introduced between TCPCopy and the service under test to forward the data. The proxy verifies the integrity of each request, so incomplete requests are not forwarded to the test service and do not interfere with testing.
In addition, we deploy the layer-7 proxy and TCPCopy's intercept component on a pool of servers dedicated to traffic forwarding. For each forwarding task, only the iptables rules of these servers need to be modified; the service under test simply runs normally on its test machine with no extra configuration. This minimizes the risks associated with changing iptables.
To better integrate with the company's technology ecosystem and support richer test scenarios, we implemented the layer-7 proxy ourselves and added the following capabilities (a simplified sketch follows the list):
- Integration with the company's service discovery framework: an instance under test only needs to register under a specified service name to receive the traffic sent by the proxy, so traffic can be forwarded to multiple instances simultaneously, and instances can be added or removed dynamically
- Traffic filtering: received traffic can be filtered by specified rules before forwarding. For example, traffic containing write operations can be filtered out to avoid polluting storage
- Flow control: forwarded traffic can be rate-limited, or amplified by resending received requests, which supports simple load-test scenarios
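The sketch below illustrates, under stated assumptions, what such a layer-7 proxy could look like in Go: a reverse proxy that drops write requests and rate-limits forwarding. It uses `golang.org/x/time/rate` and a single hard-coded target (`targetAddr`) for brevity; the real proxy resolves targets through the service discovery framework and its filtering rules are configurable.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"

	"golang.org/x/time/rate"
)

// Hypothetical address of one instance under test; in the real system the
// targets come from the service discovery framework.
const targetAddr = "http://test-instance.internal:8080"

// Allow at most 500 forwarded requests per second (burst 100) to protect
// the instance under test; the real proxy makes this configurable.
var limiter = rate.NewLimiter(500, 100)

// isWrite identifies traffic that would pollute storage in the test environment.
func isWrite(r *http.Request) bool {
	switch r.Method {
	case http.MethodPost, http.MethodPut, http.MethodDelete, http.MethodPatch:
		return true
	}
	return false
}

func main() {
	target, err := url.Parse(targetAddr)
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if isWrite(r) {
			w.WriteHeader(http.StatusNoContent) // filtered out, not forwarded
			return
		}
		if !limiter.Allow() {
			w.WriteHeader(http.StatusTooManyRequests) // over the speed limit
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```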
Finally, to make traffic replication easy to use, we packaged the two TCPCopy components and our layer-7 proxy into a single platform. To run a replication test, the user only needs to specify the IP address and port of the traffic source, the service under test, and the duration of the replication task.
The overall online architecture is roughly as shown in the figure below. After a user submits a task, the platform handles scheduling, configuring, and deploying each component, and forwards online traffic to the user's instances under test.
2.5 Existing problems
As the scale grew and more user scenarios were proposed, the architecture gradually exposed some problems.
2.5.1 TCPCopy Has Performance Problems
TCPCopy's implementation is not multi-threaded, so its actual forwarding capacity is limited and some high-bandwidth test scenarios cannot be supported well.
2.5.2 The existing implementation cannot support scenarios such as response recording
TCPCopy is positioned as a tool for copying and forwarding requests: it copies only the request half of online traffic, not the responses. We received requests from users who want to analyze online traffic and need to collect both requests and responses at the same time, which TCPCopy cannot support.
3. Developing ByteCopy to support massive traffic and complex business scenarios
As mentioned above, the first-generation traffic replication system had some performance and flexibility problems, and new business requirements kept arriving, such as support for the MySQL protocol and the storage and replay of historical traffic. Since the existing TCPCopy-based architecture was hard to extend, we decided to discard it and build the next generation of Bytedance's traffic replication system: ByteCopy (as in "copy every byte online").
Based on the above evolution, we can roughly divide layer-7 traffic replication into the following three modules according to responsibility:
- Traffic collection
- Traffic parsing
- Traffic application
We introduce the three modules in turn below.
3.1 Traffic Collection
The traffic collection module is launched in different ways depending on the platform the service is deployed on. For example, in Kubernetes it is started by the Mesh Agent; it uses libpcap to listen on specific ports, reassembles TCP segments in user space, and sends them to Kafka in batches.
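As a rough illustration of this capture path, the following Go sketch uses gopacket/libpcap to capture TCP payloads on one port and flush them in batches. It omits user-space TCP reassembly and the actual Kafka producer (the `flush` function is a placeholder), and the device, filter, and batch size are assumed values.

```go
package main

import (
	"log"
	"time"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

// Hypothetical capture settings; in the real module the port comes from the
// service registry and the batches are written to Kafka.
const (
	device    = "eth0"
	bpfFilter = "tcp port 8080"
	batchSize = 128
)

func main() {
	// Open the NIC in promiscuous mode with a short read timeout.
	handle, err := pcap.OpenLive(device, 65535, true, 500*time.Millisecond)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Only capture traffic for the monitored port to limit overhead.
	if err := handle.SetBPFFilter(bpfFilter); err != nil {
		log.Fatal(err)
	}

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	batch := make([][]byte, 0, batchSize)
	for pkt := range src.Packets() {
		tcpLayer := pkt.TransportLayer()
		if tcpLayer == nil || len(tcpLayer.LayerPayload()) == 0 {
			continue
		}
		batch = append(batch, tcpLayer.LayerPayload())
		if len(batch) == batchSize {
			flush(batch) // in ByteCopy, reassembled data would be sent to Kafka here
			batch = batch[:0]
		}
	}
}

// flush is a stand-in for batching the payloads out to Kafka.
func flush(batch [][]byte) {
	log.Printf("flushing %d payload segments", len(batch))
}
```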
By default, the collection module captures packets only on the IP address and port that the collected service listens on. To also support egress capture (that is, collecting the calls a service makes to its downstream dependencies), the collection module integrates with the company's service discovery framework: it looks up the IP addresses and ports of all instances of the downstream services and captures that traffic as well. (This scheme has some problems, which we discuss later.)
Because the collection process runs in the same Docker container or on the same physical machine as the application process, the business is sensitive to the collection module's resource usage. Besides optimizing at the code level, we also use cgroups to impose a hard limit on its resource consumption.
In addition, the traffic collection platform is a multi-tenant design, and multiple users may have different collection requirements for the same service. For example, user A wants to collect traffic from 5% of the instances in the env1 environment, while user B wants traffic from one instance in env1 and one instance in env2. If the two requests were handled independently, we would end up with a redundant deployment of "5% + 1 instance" in env1 plus 1 instance in env2. Our approach is to decouple the user-facing collection specifications from the actual deployment of the collection module: when a user submits a specification, it is first merged with the existing specifications to obtain a minimal deployment plan, and the deployment is then updated accordingly.
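A toy version of this spec-merging step might look like the following Go sketch. The `CollectionSpec` type and the max-per-dimension merge policy are simplifications for illustration; the real merging logic has to account for instance counts per environment and other constraints.

```go
package main

import "fmt"

// CollectionSpec is a simplified view of one user's request: how much of an
// environment's traffic should be collected, expressed as a sampled
// percentage and/or a fixed number of instances. Names are illustrative.
type CollectionSpec struct {
	Percent   float64 // e.g. 5 means "5% of instances"
	Instances int     // e.g. 1 means "one dedicated instance"
}

// mergeSpecs folds all user requests per environment into one deployment plan
// by taking the maximum requirement in each dimension, instead of deploying
// each user's request independently.
func mergeSpecs(requests map[string][]CollectionSpec) map[string]CollectionSpec {
	plan := make(map[string]CollectionSpec)
	for env, specs := range requests {
		merged := CollectionSpec{}
		for _, s := range specs {
			if s.Percent > merged.Percent {
				merged.Percent = s.Percent
			}
			if s.Instances > merged.Instances {
				merged.Instances = s.Instances
			}
		}
		plan[env] = merged
	}
	return plan
}

func main() {
	requests := map[string][]CollectionSpec{
		"env1": {{Percent: 5}, {Instances: 1}}, // user A wants 5%, user B wants 1 instance
		"env2": {{Instances: 1}},               // user B also wants 1 instance in env2
	}
	fmt.Println(mergeSpecs(requests))
}
```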
3.2 Traffic Parsing
The raw traffic collected at the source is layer-4 data. To support more complex functions such as filtering, multi-way output, historical traffic storage, traffic query, and traffic visualization, we need to parse the traffic from layer 4 up to layer 7. The protocols most commonly used by Bytedance's internal services are Thrift and HTTP, both of which can be parsed straightforwardly according to their specifications.
One difficulty in traffic parsing is determining message boundaries. Unlike HTTP/2 and other pipelined or multiplexed transports, Thrift and HTTP/1.x transmit strictly alternating request/response pairs on a single connection, so a complete request or response can be delimited simply by detecting the switch between the request direction and the response direction.
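The following Go sketch shows this boundary-detection idea under simplified assumptions: reassembled segments are already tagged with their direction, and every switch from the response direction back to the request direction closes one request/response exchange. Real parsing in ByteCopy also decodes the Thrift/HTTP payloads; this only shows the pairing step.

```go
package main

import "fmt"

// Direction of a reassembled TCP segment relative to the captured service.
type Direction int

const (
	ClientToServer Direction = iota
	ServerToClient
)

// Segment is a reassembled chunk of layer-4 payload with its direction.
type Segment struct {
	Dir  Direction
	Data []byte
}

// RRPair is one request/response exchange recovered from a connection.
type RRPair struct {
	Request  []byte
	Response []byte
}

// pairByDirection exploits the strict request/response alternation of
// HTTP/1.x (without pipelining) and Thrift on a single connection: every
// switch from server->client back to client->server closes one exchange.
func pairByDirection(segs []Segment) []RRPair {
	var pairs []RRPair
	var cur RRPair
	inResponse := false
	for _, s := range segs {
		switch s.Dir {
		case ClientToServer:
			if inResponse {
				// Direction switched back to the client: the previous
				// exchange is complete, start a new one.
				pairs = append(pairs, cur)
				cur = RRPair{}
				inResponse = false
			}
			cur.Request = append(cur.Request, s.Data...)
		case ServerToClient:
			inResponse = true
			cur.Response = append(cur.Response, s.Data...)
		}
	}
	if len(cur.Request) > 0 {
		pairs = append(pairs, cur)
	}
	return pairs
}

func main() {
	segs := []Segment{
		{ClientToServer, []byte("GET /a HTTP/1.1\r\n\r\n")},
		{ServerToClient, []byte("HTTP/1.1 200 OK\r\n\r\n")},
		{ClientToServer, []byte("GET /b HTTP/1.1\r\n\r\n")},
		{ServerToClient, []byte("HTTP/1.1 404 Not Found\r\n\r\n")},
	}
	for i, p := range pairByDirection(segs) {
		fmt.Printf("exchange %d: %d request bytes, %d response bytes\n",
			i, len(p.Request), len(p.Response))
	}
}
```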
3.3 Traffic Application
Different users have different purposes for the traffic collected online. For example, a load-testing platform may want to persist the traffic to Kafka first and then exploit Kafka's high throughput; some engineers simply want to forward a copy of online traffic into their own development environment to test new features; some want the QPS pushed to a certain level for load testing; others want to record the specific traffic that triggered an online coredump so they can debug it offline. For these different scenarios, we implemented several traffic output forms.
The following sections will focus on forwarding and storage.
3.3.1 Forwarding
The structure is shown in the figure above. Each emitter registers itself in ZooKeeper; the scheduler watches the emitter nodes, filters and scores them for each task based on their labels and statistics, and assigns the task to the most suitable node. A natural question is why we did not simply use a stateless design in which every emitter instance forwards traffic equally; the sharding scheme is mainly based on the following considerations:
- If every task were spread evenly across all instances, each instance would have to maintain connections to every downstream endpoint, and the goroutine, connection, memory, and other resource footprint under a massive number of tasks would be unacceptable
- By converging the processing of a single task onto a few instances, the request density toward each endpoint increases, so the peer will not close connections due to long idle times and connections can actually be reused
Because the emitter is performance-sensitive, we made many optimizations: using fasthttp's goroutine pool to avoid frequent allocation, pooling the reader/writer objects attached to connections, dynamically adjusting the number of worker goroutines per endpoint to match the user-specified QPS (avoiding wasted goroutines and preventing idle long-lived connections from degenerating into short ones), and staying completely lock-free, with synchronization and data transfer handled via channels and select, among others.
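As a simplified model of the per-endpoint sending path (not the actual emitter code), the Go sketch below feeds a fixed pool of worker goroutines from a buffered channel and paces sends with a shared ticker so the aggregate rate approximates a target QPS; `send` stands in for writing on a pooled long-lived connection.

```go
package main

import (
	"log"
	"time"
)

// Request is the replayed traffic payload to be sent to one endpoint.
type Request []byte

// endpointWorkerPool drives a single downstream endpoint: a buffered channel
// feeds the workers, and a shared ticker spaces sends so the aggregate rate
// across workers approximates targetQPS.
func endpointWorkerPool(reqs <-chan Request, workers, targetQPS int, stop <-chan struct{}) {
	tick := time.NewTicker(time.Second / time.Duration(targetQPS))
	for i := 0; i < workers; i++ {
		go func() {
			for {
				select {
				case <-stop:
					return
				case <-tick.C:
					// Each tick permits exactly one send across the pool.
					select {
					case r := <-reqs:
						send(r)
					case <-stop:
						return
					}
				}
			}
		}()
	}
}

// send is a placeholder for writing the request on a pooled connection.
func send(r Request) {
	log.Printf("sent %d bytes", len(r))
}

func main() {
	reqs := make(chan Request, 1024)
	stop := make(chan struct{})
	endpointWorkerPool(reqs, 4, 200, stop)

	for i := 0; i < 10; i++ {
		reqs <- Request("replayed request")
	}
	time.Sleep(200 * time.Millisecond)
	close(stop)
}
```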
3.3.2 Storage
Storage is divided into two layers, a data layer and an index layer, written with a double-write model; a scheduled task reconciles the index layer against the data layer to guarantee eventual consistency between the two. Storage must support both replay and query semantics. The data layer is abstracted as a large-capacity, key-ordered storage model that supports KV lookups; the index layer is a mapping from multiple indexes to keys, through which traffic query and replay are served. The underlying implementations of both layers are therefore pluggable: any engine that conforms to these models and implements the model-defined API can be used (an interface sketch is shown below).
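A minimal sketch of these two contracts in Go might look like the following; the method names and signatures are illustrative assumptions, not ByteCopy's real API.

```go
package storage

// Key identifies one stored traffic record; records for the same recording
// task share a prefix so a replay can scan them in order.
type Key []byte

// DataLayer is the pluggable large-capacity, key-ordered KV model described
// above; an HBase- or Bytable-backed engine only needs to satisfy this contract.
type DataLayer interface {
	Put(key Key, record []byte) error
	Get(key Key) ([]byte, error)
	// Scan returns records in key order within [start, end), which is what
	// ordered replay relies on.
	Scan(start, end Key) (Iterator, error)
}

// Iterator walks scan results in key order.
type Iterator interface {
	Next() bool
	Key() Key
	Value() []byte
	Close() error
}

// IndexLayer maps query dimensions (source service, target service, method,
// traceid, ...) to data-layer keys; the ES-backed implementation supports
// compound conditions.
type IndexLayer interface {
	Index(fields map[string]string, key Key) error
	Query(conditions map[string]string) ([]Key, error)
}
```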
The ideal data structure for the data layer is the LSM tree, which has excellent write performance. In traffic replay scenarios, records must be scanned in key order; since an LSM tree keeps data locally ordered by key, it can make full use of the page cache and sequential disk reads to deliver high replay performance. The mature open-source distributed LSM-tree product in the industry is HBase, and Bytedance has a comparable in-house product, Bytable. We implemented the data layer on both engines, and after a series of performance benchmarks we chose Bytable as the data layer implementation.
The index layer is built on Elasticsearch, so it supports compound-condition queries from users. We pre-build several query indexes, such as source service, target service, method name, and traceid. Traffic query currently has two main uses: one is to serve as the data source for service mocking, which removes unnecessary external dependencies in functional or diff testing; the other is traffic visualization, where users can inspect the content of a particular request by request time, traceid, and so on. Other uses remain to be explored.
3.4 Service Scenarios
3.4.1 Support for general DIFF capabilities
Diff validation is a key tool for Internet companies to keep product quality high under rapid iteration. Twitter's Diffy project achieves it by recording and replaying online traffic, but its applicable scenarios are limited: replay runs directly against an A/B environment in production, so write traffic cannot be supported. Alibaba's Doom platform solves replay isolation for write scenarios, but it is implemented inside the application via AOP and is strongly tied to the Java ecosystem.
Building on ByteCopy's non-invasive traffic replication, storage, and replay capabilities, combined with our own ByteMock component, we provide a non-invasive diff solution for business teams and solve the write-isolation problem.
For a production call chain (A, B, C), ByteCopy collects the request and response at every hop. During diff verification of service A, ByteCopy replays requests to A while service B is mocked with ByteMock, which replays the response recorded for the same trace (relying on ByteCopy's traffic storage for accurate playback). Since B is a mock, even write requests can be replayed without any effect on production.
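For illustration, a diff check over one recorded/replayed pair could be as simple as the following Go sketch, assuming JSON response bodies and a caller-provided list of noisy fields (such as timestamps) to ignore; the production diff logic is naturally richer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// diffResponses compares the response recorded online with the response
// produced by replaying the same request against the candidate version of
// service A (with service B mocked). Both are assumed to be JSON bodies;
// noisy fields such as timestamps are stripped before comparison.
func diffResponses(recorded, replayed []byte, ignore []string) (bool, error) {
	var a, b map[string]interface{}
	if err := json.Unmarshal(recorded, &a); err != nil {
		return false, err
	}
	if err := json.Unmarshal(replayed, &b); err != nil {
		return false, err
	}
	for _, field := range ignore {
		delete(a, field)
		delete(b, field)
	}
	return reflect.DeepEqual(a, b), nil
}

func main() {
	recorded := []byte(`{"items":[1,2,3],"ts":1690000000}`)
	replayed := []byte(`{"items":[1,2,3],"ts":1690000123}`)
	same, err := diffResponses(recorded, replayed, []string{"ts"})
	if err != nil {
		panic(err)
	}
	fmt.Println("responses equivalent:", same)
}
```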
4. Future outlook
4.1 More accurate traffic collection
As mentioned earlier, when collecting egress traffic we capture packets for the IP:port of every instance of the downstream dependency services. In a real production environment, multiple services with the same downstream dependencies may be deployed on the same server, and with only layer-4 data it is impossible to tell which service the captured packets came from, which wastes resources in capture, processing, and forwarding. A NIC-based packet-capture scheme cannot solve this problem well, so we are exploring other collection schemes, such as process-level traffic collection based on eBPF.
4.2 Redefining the Traffic Diversion Playback System
At present, our replication and replay system only collects traffic passively, according to each user's configuration. To get test traffic promptly, users generally choose real-time replication, yet not every scenario actually requires real-time traffic. We plan to gradually transform the system from a tool that forwards and replays traffic on demand into a platform for online traffic replication.
On top of the traffic storage capability, the platform will proactively record traffic for services with testing needs, off-peak and on a schedule, maintaining a continuously refreshed traffic pool. Users who need traffic can then fetch it directly from the pool, which avoids replication jobs competing with online business for computing resources and also makes traffic more readily available.
4.3 Traffic Storage Optimization in specific Scenarios
As the upper-layer applications built on traffic recording and replay mature, we are considering moving toward always-on traffic replication so that more businesses can try it easily. This will inevitably bring new challenges to traffic storage in terms of data volume, storage form, and query performance. We hope to build, on top of the existing storage architecture, a traffic storage solution that supports massive data, point lookups (by traceid), time-range scans, and other complex, high-performance query patterns. We are also working closely with the security team to ensure core traffic data is desensitized when stored, and we continue to strengthen security audits on how stored traffic is used.
5. Summary
So far, ByteCopy has supported most business lines across a variety of scenarios. We keep working to enrich ByteCopy's functionality and improve its stability and throughput, and we are actively building companion R&D components such as ByteMock. Combined with production traffic, ByteCopy has unlocked more usage scenarios for development and testing and helped business teams build interesting products.
Bytedance infrastructure team
Bytedance's infrastructure team supports the smooth operation of Bytedance products with hundreds of millions of users, including Douyin, Toutiao, Xigua Video, and Huoshan Video, and provides the guarantees and driving force for the rapid and stable growth of Bytedance and its businesses.
Within the company, the infrastructure team is mainly responsible for building Bytedance's private cloud, managing clusters of tens of thousands of servers, running the co-located deployment of tens of thousands of compute/storage and online/offline workloads, and supporting the stable storage of exabytes of data.
Culturally, the team embraces open source and innovative hardware and software architectures. Open positions are listed at job.bytedance.com. If you are interested, please contact [email protected].