Author: Zhong Zhenchi / Editor: Zhang Handong
Preface
This article is an expanded write-up of the talk "High-Performance Rust Tracing Library Design" given by PingCAP Observability R&D engineer Zhong Zhenchi at Rust China Conf 2020. It describes how TiKV, a distributed KV database with very demanding performance requirements, traces the timing of all requests with less than a 5% performance impact. A video of the talk is available at www.bilibili.com/video/BV1Yy… .
Background
The observability of a system usually consists of three dimensions: logging, metrics, and tracing. Their relationship is shown below:
- Logs: discrete error and status information.
- Metrics: recording and presentation of aggregated data.
- Tracing: a series of events belonging to a single request.
TiKV implements complete logging and metrics systems, but the absence of tracing leads to the following difficulties when diagnosing TiKV and TiDB problems:
- Observed data is not correlated: only engineers familiar with the monitoring metrics of each operation on the request path can fully trace and diagnose a problem.
- Request jitter is hard to trace: a TiKV node usually serves workloads of different patterns at the same time; the performance jitter of sporadic requests is not reflected in monitoring metrics such as AVG/P99/MAX, so its cause cannot be diagnosed.
Tracing can effectively solve the problems in the scenarios above. The following describes in detail how high-performance tracing is implemented in TiKV. Tracing is currently an experimental feature in TiKV and requires a specific code branch to be enabled; if you are interested, follow the GitHub issue Introduce Tracing Framework (#8981).
Basic concepts
Tracing presents the execution path of a request in the system. For example, the figure below traces the entire execution of a SQL statement from TiDB to TiKV:

INSERT INTO t VALUES (1), (2), (3);

From it we can intuitively see the following information:
- TiDB processes this request through the compile, plan, and execute steps
- TiDB invokes TiKV’s Prewrite RPC and Commit RPC in the Execute phase
- The request takes 5ms
Each box in the diagram represents an event, called a Span. Each Span contains:
- The name of the event
- Event start and end timestamps
Spans have a hierarchy and can form parent-child relationships or sequential relationships, as shown in the figure below:
Implementation
Unless otherwise specified, all performance test results in this article were obtained on the following platform:

- CPU: Intel Core i7-8700
- Linux distribution: Ubuntu 20.04
- Linux kernel: 5.4
- Memory: 32 GB
- Disk: NVMe SSD
TiKV is written in Rust. The Rust ecosystem has several off-the-shelf tracing libraries, such as tokio-tracing, rustracing, and opentelemetry, all of which are compatible with the OpenTracing specification, but their performance is not ideal: introducing them reduces TiKV's performance by more than 50%. TiKV's current implementation keeps the performance impact below 5%, mainly because tracing and collecting a single Span takes only 20 ns:
The following explains in detail how TiKV can trace and collect a single Span in 20 ns.
Timing
Timing is a high-frequency operation in tracing: each Span requires two timestamps, representing the start and end of the event, so the performance of taking timestamps greatly affects the performance of tracing.
The timing method chosen by a tracing library usually needs to meet the following requirements:
- The timestamps obtained increase monotonically
- High performance
- High precision
std::Instant
Rust natively provides two ways to obtain the time:
std::SystemTime::now()
std::Instant::now()
The first obtains the current system time, which can be affected by manual adjustment by the user or by the NTP service. The timestamp it returns provides no monotonicity guarantee, so it cannot be used.
Most tracing libraries in the Rust community take the second approach, which yields monotonically increasing, nanosecond-precision timestamps. However, its performance is less than ideal: obtaining two timestamps takes 50 ns, which is one of the reasons for the low performance of the community tracing libraries.
Coarse Time
If we look for a timing scheme purely from the performance perspective, coarse time can be used, which trades some precision for high performance. On Linux, coarse time is obtained through the clock_gettime system call with CLOCK_MONOTONIC_COARSE as the clock source. The Rust community also provides the coarsetime crate for reading coarse time:
coarsetime::Instant::now()
Coarse time has high performance: in our test environment, two calls take only 10 ns in total. Its precision depends on the Linux jiffies configuration and is 4 ms by default.
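For reference, here is roughly how coarse time can be read directly through the clock_gettime system call on Linux using the libc crate; this is a minimal illustration of the mechanism, not TiKV's code:

```rust
use libc::{clock_gettime, CLOCK_MONOTONIC_COARSE};

// Read the coarse monotonic clock directly via the clock_gettime syscall.
fn coarse_now_ns() -> u64 {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    // CLOCK_MONOTONIC_COARSE trades precision (jiffies, 4 ms by default)
    // for a much cheaper read than CLOCK_MONOTONIC.
    unsafe { clock_gettime(CLOCK_MONOTONIC_COARSE, &mut ts) };
    ts.tv_sec as u64 * 1_000_000_000 + ts.tv_nsec as u64
}
```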
Low-precision timing can produce confusing results when tracing short requests. As the figure below shows, a considerable amount of detail is lost from an observability point of view:
Of course, coarse time is still a good choice for timing in most cases. On the one hand, it is readily available out of the box on Linux; on the other hand, 4 ms precision is acceptable for most applications.
However, as developers of a tracing library, we do not want to limit users' scenarios. For example, for a KvGet request, 4 ms is already enough to count as abnormal jitter in demanding scenarios, so supporting microsecond or even nanosecond tracing is necessary. At the same time, performance, as the core starting point, must not be sacrificed. Fortunately, there is a solution to this problem: the TSC.
TSC
TiKV uses the Time Stamp Counter (TSC) register for high-precision, high-performance timing. The TSC register has existed in x86 CPUs for a long time, dating back to the Pentium processors introduced in 1993. It records the number of CPU clock cycles since the CPU was powered on or reset; at a constant clock rate, it can be used for high-precision timing after measurement and conversion.
The TSC can simultaneously meet the requirements of monotonic increase, high precision, and high performance. In our test environment, fetching the TSC twice takes only 15 ns. In practice, however, as processors have evolved, the TSC register has accumulated quite a few historical quirks that affect its correctness and need to be corrected.
TSC rate
The TSC increases at a rate determined by the CPU frequency. Modern CPUs may dynamically scale their frequency to save power, which makes the TSC increase at an unstable rate:
In addition, some CPUs do not increment the TSC while in a sleep state:
More modern x86 CPUs provide features that guarantee a stable TSC rate. On Linux, you can check whether the TSC rate is stable by looking at the CPU flags in /proc/cpuinfo (a small check sketch follows the list below):
- constant_tsc: the TSC increases at a fixed nominal frequency rather than the instantaneous frequency
- nonstop_tsc: the TSC keeps increasing while the CPU is in a sleep state
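As a small illustration (not TiKV's code), such a check could be done from Rust by reading /proc/cpuinfo:

```rust
use std::fs;

// Returns true if both TSC stability flags are reported for this machine.
fn tsc_flags_present() -> bool {
    fs::read_to_string("/proc/cpuinfo")
        .map(|info| info.contains("constant_tsc") && info.contains("nonstop_tsc"))
        .unwrap_or(false)
}
```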
The TSC rate stability guarantees above only hold for a single CPU core; with multiple cores, TSC synchronization must also be handled.
TSC multi-core synchronization
x86 CPUs do not guarantee that the TSC registers of all cores are consistent, which causes problems for timing. The figure below shows the TSC values measured on a 2020 laptop with a recent x86-64 CPU. Among its 16 cores, the TSC value of CPU 0 deviates from the others.
In tracing, a complete timing operation reads two timestamps, one for the start and one for the end of the event. Because of operating system thread scheduling, the two reads may happen on different cores. If we naively compute elapsed time from the difference of the raw TSC values, the result will be wrong when the TSCs of the cores are out of sync.
Here’s an example:
- At time t1, the thread runs on Core 1 and reads a larger TSC value, tsc1
- The operating system schedules the thread from Core 1 to Core 2
- At time t2, the thread runs on Core 2 and reads a smaller TSC value, tsc2

The TSC delta computed at this point is even negative and cannot be converted into an elapsed time.
To solve this problem, TiKV synchronizes the raw TSC values of the cores: it computes a TSC offset for each core and uses the synchronized TSC values to calculate time. The algorithm is to sample the TSC and the physical time twice on each core, fit a line with the physical time on the x-axis and that core's TSC on the y-axis, and take the intercept; the differences between the intercepts are the per-core TSC offsets, as shown in the figure below:
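As a rough sketch of this calibration (not TiKV's actual code), the intercept for a single core could be computed as follows, using the set_affinity and tsc_with_cpuid helpers shown later in this section and an anchor Instant as the physical-time x-axis:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

fn tsc_intercept_on_core(core: usize, anchor: Instant) -> f64 {
    // Pin the thread so both samples are taken on the same core.
    set_affinity(core).expect("failed to pin thread to core");

    // Two (physical time, TSC) samples on this core.
    let t1 = anchor.elapsed().as_nanos() as f64;
    let (tsc1, _) = tsc_with_cpuid();
    sleep(Duration::from_millis(10));
    let t2 = anchor.elapsed().as_nanos() as f64;
    let (tsc2, _) = tsc_with_cpuid();

    // Fit the line tsc = slope * t + intercept through the two samples;
    // the differences between the cores' intercepts are their TSC offsets.
    let slope = (tsc2 - tsc1) as f64 / (t2 - t1);
    tsc1 as f64 - slope * t1
}
```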
When calculating the initial TSC offsets, we must ensure that both TSC samples are taken on the same core. On Linux, a thread can be pinned to a core through the sched_setaffinity system call, which sets the thread's CPU affinity:
fn set_affinity(cpuid: usize) -> Result<(), std::io::Error> {
    use libc::{cpu_set_t, sched_setaffinity, CPU_SET};
    use std::mem::{size_of, zeroed};

    let mut set = unsafe { zeroed::<cpu_set_t>() };
    unsafe { CPU_SET(cpuid, &mut set) };

    // Set the current thread's core affinity.
    if unsafe {
        sched_setaffinity(
            0, // Defaults to the current thread
            size_of::<cpu_set_t>(),
            &set as *const _,
        )
    } != 0
    {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(())
    }
}
With each core's TSC offset in hand, during timing we only need to obtain the CPU the current thread is executing on together with the raw TSC value, and the synchronized TSC value can then be computed. Note that the current CPU and the current TSC value must be obtained in a single instruction to avoid errors caused by thread scheduling in between. This can be done with the RDTSCP instruction, which atomically gives us the raw TSC value and the CPU ID (a sketch combining the two appears after the code below).
The Rust code is as follows:
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn tsc_with_cpuid() -> (u64, usize) {
    #[cfg(target_arch = "x86")]
    use core::arch::x86::__rdtscp;
    #[cfg(target_arch = "x86_64")]
    use core::arch::x86_64::__rdtscp;

    let mut aux = std::mem::MaybeUninit::<u32>::uninit();
    let tsc = unsafe { __rdtscp(aux.as_mut_ptr()) };
    let aux = unsafe { aux.assume_init() };

    // IA32_TSC_AUX is encoded by the Linux kernel in the following format:
    //
    //  31         12 11      0
    //  [  node id   ][ cpu id ]
    (tsc, (aux & 0xfff) as usize)
}
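Putting the two helpers together, a minimal sketch (not TiKV's actual code) of a synchronized TSC reading, given a table of per-core offsets produced by the calibration above, could look like this:

```rust
// `offsets[cpuid]` is the per-core TSC offset computed during calibration.
// Because RDTSCP returns the raw TSC and the CPU ID in a single instruction,
// the offset we subtract is guaranteed to belong to the core we read from.
fn synchronized_tsc(offsets: &[i64]) -> u64 {
    let (tsc, cpuid) = tsc_with_cpuid();
    (tsc as i64 - offsets[cpuid]) as u64
}
```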
The high-precision timing logic described above has been extracted into a separate Rust library, minstant, which can be used directly by other projects with similar needs.
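minstant is designed to mirror the std::time::Instant interface, so a usage sketch looks like ordinary std timing with the import swapped:

```rust
// Measure how long a closure takes using minstant's TSC-based Instant.
fn measure<T>(f: impl FnOnce() -> T) -> (T, std::time::Duration) {
    let begin = minstant::Instant::now();
    let value = f();
    (value, begin.elapsed())
}
```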
Span collection
Spans may be generated on different threads and finally need to be aggregated into a single trace, so a mechanism for collecting Spans across threads is required. Span collection is also a common performance bottleneck in tracing libraries.
Thread-safe Span collection is generally done in the following ways:
Arc<Mutex<Vec<Span>>>
std::sync::mpsc::Receiver<Span>
crossbeam::channel::Receiver<Span>
The crossbeam channel is the best among these common collection methods, taking about 40 ns to send and collect a single Span. To improve performance further, TiKV collects Spans in a different way: Spans on the same thread are collected only locally, without contention, and the batches of Spans already gathered on each thread are then aggregated into a global collector.
Local Span
TiKV maintains a thread-local structure, LocalSpanLine, for each thread, which is responsible for creating and storing LocalSpans. Another thread-local structure, LocalCollector, drives the LocalSpanLine and collects LocalSpans. The relationship among the three and their respective responsibilities are shown below.
Because LocalSpan, LocalSpanLine, and LocalCollector are all thread-local, their interactions require no inter-thread synchronization or mutual exclusion and do not hurt CPU caching, so performance is very high. Collecting a LocalSpan is a simple Vec::push, costing only 4 ns on average. A sketch of the idea follows.
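The types and function names below are illustrative only and are not TiKV's actual implementation:

```rust
use std::cell::RefCell;

// Spans produced on a thread go into a thread-local Vec with no
// synchronization; only the finished batch is handed to the global collector.
struct LocalSpan {
    name: &'static str,
    begin_tsc: u64,
    end_tsc: u64,
}

thread_local! {
    static LOCAL_SPANS: RefCell<Vec<LocalSpan>> = RefCell::new(Vec::new());
}

fn push_local_span(span: LocalSpan) {
    // An uncontended Vec::push; this is the ~4 ns fast path.
    LOCAL_SPANS.with(|spans| spans.borrow_mut().push(span));
}

fn drain_local_spans() -> Vec<LocalSpan> {
    // Drained in one batch when the local collection finishes.
    LOCAL_SPANS.with(|spans| std::mem::take(&mut *spans.borrow_mut()))
}
```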
In addition, when constructing Span relationships, the thread-local property makes it easy to implement an implicit context mechanism: users do not need to pass the tracing context around by hand or modify function signatures, which greatly reduces intrusion into existing code.
Let’s dive into the implementation details of the LocalSpan generation and collection.
First, LocalSpanLine maintains a container, SpanQueue, which holds in-progress or completed LocalSpans. "In progress" means that the start time of the event represented by the LocalSpan is known while its end time is not yet. These LocalSpans are stored in a Vec inside the SpanQueue.
In addition, as mentioned earlier, we use an implicit context to build the parent-child relationships between LocalSpans. This process relies on a variable maintained by SpanQueue, next_parent_id.
We’ll expand on this process in more detail with some examples.
Suppose an event such as foo is generated at 09:00 and lasts until 09:03:
09:00 foo +
09:01 |
09:02 |
09:03 +
In the initial state, the SpanQueue is empty and next_parent_id is root. When foo occurs at 09:00, the SpanQueue performs the following steps:
- Add a record with the event name foo, start time 09:00, and the end time left blank
- Assign the current value of next_parent_id (root) to foo's parent
- Update next_parent_id to foo
- Return the index value 0 to the caller, to be used later to signal the end of the event and backfill the end time
When foo ends at 09:03, the user submits the index, notifying the SpanQueue that foo has finished, and the SpanQueue starts the backfill (a code sketch of this bookkeeping follows these steps):
- Use the index to locate the record of the event foo
- Backfill the end time as 09:03
- Update next_parent_id to that record's parent
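The type and field names in the following sketch are illustrative and are not the actual minitrace-rust internals:

```rust
struct SpanRecord {
    parent_id: usize, // 0 stands for the root
    name: &'static str,
    begin: u64,
    end: u64, // 0 until backfilled
}

struct SpanQueue {
    spans: Vec<SpanRecord>,
    next_parent_id: usize,
}

impl SpanQueue {
    fn start_span(&mut self, name: &'static str, now: u64) -> usize {
        let index = self.spans.len();
        self.spans.push(SpanRecord {
            parent_id: self.next_parent_id, // current parent becomes this span's parent
            name,
            begin: now,
            end: 0,
        });
        self.next_parent_id = index + 1; // spans started later hang under this one
        index
    }

    fn finish_span(&mut self, index: usize, now: u64) {
        self.spans[index].end = now; // backfill the end time
        self.next_parent_id = self.spans[index].parent_id; // restore the parent
    }
}
```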
This single-event recording process is simple and effective; recording multiple events is just a repetition of it. For example, in the following procedure, the foo event contains two child events, bar and baz:
09:00 foo +
09:01 | bar +
09:02 | |
09:03 | +
09:04 |
09:05 | baz +
09:06 | |
09:07 | +
09:08 +
As mentioned above, the SpanQueue records not only the start and end times of events but also the parent-child relationships between them. In this example, when foo occurs, the contents of the SpanQueue are the same as before. When bar occurs, the SpanQueue sets bar's parent to the current value of next_parent_id, namely foo, and updates next_parent_id to bar:
When bar ends, the end time of bar's record and the next_parent_id variable are updated following the backfill steps described above:
By repeating these steps, the SpanQueue efficiently records all three events:
These records are concatenated to form the following Trace tree:
Normal Span
While LocalSpan recording is efficient, its thread-local implementation lacks flexibility. For example, in asynchronous scenarios, where the creation and termination of some Spans happen on different threads, the thread-local implementation no longer works.
To address this, TiKV retains the thread-safe Span recording approach described at the beginning of this section, i.e., collecting single Spans one at a time via a crossbeam channel; such Spans are called NormalSpans below.
From an implementation point of view, a NormalSpan's information is not recorded in a thread-local container but is held by the corresponding variable itself, so it can easily be moved across threads. Meanwhile, the parent-child relationships between NormalSpans are no longer built implicitly via thread-locals but are specified manually by the user.
However, NormalSpan and LocalSpan are not completely isolated. TiKV connects them through the following interaction: a batch of LocalSpans collected by a LocalCollector can be mounted onto a NormalSpan as a subtree, as shown in the figure below. The number of mounts is unlimited: by allowing many-to-many mounting, TiKV supports tracing batched scenarios to a degree that most tracing libraries in the community do not cover.
These implementations form the fast and slow paths of Span collection. Together they record the execution path information of a request:
- LocalSpan cannot cross threads, but it is recorded efficiently; collecting LocalSpans in batches and then mounting them onto normal Spans keeps the tracing overhead very low.
- A normal Span is recorded relatively slowly, but it can be passed across threads and is more flexible to use.
Usage
The high-performance tracing logic in TiKV has been extracted into an independent library, minitrace-rust, which can be used directly in various projects as follows:
- When the request arrives, the corresponding root Span is created;
- On the request execution path, use the interface provided by minitrace-rust to record the occurrence of events.
- When the request completes, all spans generated on the execution path are collected.
Root Span creation and collection
The root Span is typically created at the beginning of a request. Used in minitrace-rust as follows:
for req in listener.incoming() {
    let (root_span, collector) = Span::root("http request");
    let guard = root_span.enter();
    my_request_handler(req);
}
Span relies on a Guard to automatically end the Span when it goes out of scope, so the user does not have to mark the end of the Span manually. Besides the root Span, Span::root(event) also returns a Collector. The Collector corresponds one-to-one with the root Span. When the request completes, the Collector's collect method can be called to finish gathering all Spans generated on the execution path, as shown below.
let (root_span, collector) = Span::root("http request");
let guard = root_span.enter();
handle_http_request(req);
drop((guard, root_span));
let spans = collector.collect();
Event recording
We recommend using the trace and trace_async macros provided by minitrace-rust for function-level event recording. Recording a function in this way captures the following execution information:
- The time when the call occurred
- The time when the call returned
- A reference to the direct (or indirect) caller
- References to the child functions called directly (or indirectly)
For example, to trace two synchronous functions foo and bar, we can add trace(event) as an attribute on the two functions to record their execution, as shown below.
#[trace("foo")]
fn foo() - >u32 {
bar();
42
}
#[trace("bar")]
fn bar() {}Copy the code
The information finally recorded includes the start and end times of the two functions, as well as the call relationship between them: foo called bar.
For asynchronous functions, the steps are slightly different. First, trace must be replaced with trace_async, as shown below.
#[trace_async("foo async")]
async fn foo_aysnc() - >u32 {
bar_async().await;
42
}
#[trace_async("bar async")]
async fn bar_async() {
yield_now().await;
}
Copy the code
One more crucial step is required: wrap the Task with the Future adapter in_span provided by minitrace-rust, to bind the Future to a Span.
Task, in Rust's asynchronous context, refers specifically to a Future that is spawned onto an executor, also known as the root Future. For example, foo_async below is a Task:
executor::spawn(
foo_async()
);
If you were to trace a Task such as foo_async and bind it to a Span created by Span::from_local_parent(event), the relevant application code would look like this.
executor::spawn(
    foo_async().in_span(Span::from_local_parent("Task: foo_async")),
);
The following is the result of the Task trace:
Conclusion
As a foundational KV database, the performance requirements TiKV faces when adding observability features are naturally different from those of ordinary business programs, which makes the work very challenging. Besides tracing, TiKV and TiDB, the SQL database layered on top of it, have other challenging observability requirements. PingCAP's Observability team focuses on solving this class of observability problems. Interested readers are welcome to join us by sending a resume to [email protected], or join the Slack channel #sig-diagnosis for technical discussion.
Source: Rust Magazine, February issue (Rust_Magazine)