Since 2017, the ByteDance Feishu team has been using Rust to develop Feishu's cross-platform SDK, providing a high-quality shared base library for Android, iOS, Windows, macOS, Linux and other platforms. At the same time, many Rust engineers at Feishu have been contributing to the Rust community in their own small ways. This article shares the experience and insights of a Feishu Rust engineer who contributed to Rust's binary_search.

An overview of Rust's binary_search

The release of Rust 1.52 on May 6, 2021 includes a PR of mine that improves the best-case time complexity of the standard library's slice::binary_search_by() to O(1). PR #74024 was submitted in early July 2020 and officially merged on March 6, 2021; it collected more than 70 comments and took more than half a year. Although the PR itself does not change much code, I learned a lot along the way, so I wrote this article to summarize the process, hoping it helps you too.

First, let's look at an example of the slice::binary_search_by() method to see what it does.

let s = [0, 1, 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55];

let seek = 13;
assert_eq!(s.binary_search_by(|probe| probe.cmp(&seek)), Ok(9));
let seek = 4;
assert_eq!(s.binary_search_by(|probe| probe.cmp(&seek)), Err(7));
let seek = 100;
assert_eq!(s.binary_search_by(|probe| probe.cmp(&seek)), Err(13));
let seek = 1;
let r = s.binary_search_by(|probe| probe.cmp(&seek));
assert!(match r { Ok(1..=4) => true, _ => false });

This method binary-searches the given sorted slice for the target value seek. If it is found, Ok(pos) is returned, where pos is the index of the target value; if not, Err(pos) is returned, where pos is the index at which seek could be inserted while keeping the slice sorted. The related slice::binary_search() and slice::binary_search_by_key() methods both delegate to slice::binary_search_by(), so I won't repeat them here.
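
A small aside on how the Err(pos) case is typically used: the following sketch (my own example, not code from the PR) inserts a new value at the returned position so the vector stays sorted.

fn main() {
    let mut v = vec![0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55];
    let x = 4;
    // Ok(pos): `x` already exists at `pos`; Err(pos): inserting at `pos`
    // keeps `v` sorted. Either way, `pos` is a valid insertion point.
    let pos = v.binary_search(&x).unwrap_or_else(|pos| pos);
    v.insert(pos, x);
    assert_eq!(v, [0, 1, 1, 2, 3, 4, 5, 8, 13, 21, 34, 55]);
}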

Optimization (1)

However, the pre-1.52 implementation had a subtle issue: if the slice contained multiple consecutive elements equal to the target, it always kept narrowing until it landed on the last one, so the best-case time complexity was still O(log n) rather than O(1). O(1) here means returning as soon as any match is found. Here is the code before 1.52:

#[stable(feature = "rust1", since = "1.0.0")]
#[inline]
pub fn binary_search_by<'a, F>(&'a self, mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    let s = self;
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // SAFETY: the call is made safe by the following invariants:
        // - `mid >= 0`: by definition
        // - `mid < size`: `mid = size / 2 + size / 4 + size / 8 ...`
        let cmp = f(unsafe { s.get_unchecked(mid) });
        base = if cmp == Greater { base } else { mid };
        size -= half;
    }
    // SAFETY: base is always in [0, size) because base <= mid.
    let cmp = f(unsafe { s.get_unchecked(base) });
    if cmp == Equal {
        Ok(base)
    } else {
        Err(base + (cmp == Less) as usize)
    }
}

Having found the problem, the fix seems easy: can't we just check cmp == Equal inside the while loop, return Ok immediately if it is equal, and otherwise keep halving?

while size > 1 {
    let half = size / 2;
    let mid = base + half;
    // SAFETY:
    // mid is always in [0, size), that means mid is >= 0 and < size.
    // mid >= 0: by definition
    // mid < size: mid = size / 2 + size / 4 + size / 8 ...
    let cmp = f(unsafe { s.get_unchecked(mid) });
    if cmp == Equal {
        return Ok(mid);
    } else if cmp == Less {
        base = mid;
    }
    size -= half;
}

Duplicate code is omitted here (and later) for brevity.

Well, it looked logical and the unit tests passed, so I submitted a PR; let's call it optimization (1). A few days later dtolnay reviewed my PR and replied: "Would you be able to put together a benchmark assessing the worst case impact? The new implementation does potentially 50% more conditional branches in the hot loop."

Indeed, the Rust standard library is extremely performance-sensitive, and we need enough cases to benchmark the old implementation against the new one to avoid performance regressions as much as possible. Here is the benchmark data I produced at the time:

// Standard library (old) implementation
test slice::binary_search_l1           ... bench:  59 ns/iter (+/- 4)
test slice::binary_search_l1_with_dups ... bench:  59 ns/iter (+/- 3)
test slice::binary_search_l2           ... bench:  76 ns/iter (+/- 5)
test slice::binary_search_l2_with_dups ... bench:  77 ns/iter (+/- 17)
test slice::binary_search_l3           ... bench: 183 ns/iter (+/- 23)
test slice::binary_search_l3_with_dups ... bench: 185 ns/iter (+/- 19)

// New implementation (optimization 1)
test slice::binary_search_l1           ... bench:  58 ns/iter (+/- 2)
test slice::binary_search_l1_with_dups ... bench:  37 ns/iter (+/- 4)
test slice::binary_search_l2           ... bench:  76 ns/iter (+/- 3)
test slice::binary_search_l2_with_dups ... bench:  57 ns/iter (+/- 6)
test slice::binary_search_l3           ... bench: 200 ns/iter (+/- 30)
test slice::binary_search_l3_with_dups ... bench: 157 ns/iter (+/- 6)

You can see that in the with_dups cases (many duplicate elements), the new implementation improves significantly, but in the normal l3 case performance is noticeably worse. As dtolnay said, the new implementation potentially does 50% more conditional branches in the hot loop. What exactly are conditional branches? Why do they matter so much in a hot loop? This brings us to the first topic of this article: branch prediction.

Branch Prediction

Branch prediction is a technique modern CPUs use to guess in advance which branch will be taken next, so that instructions can keep flowing through the pipeline. CPUs typically have a dedicated branch predictor built in. The famous Stack Overflow question "Why is processing a sorted array faster than processing an unsorted array?" explains what branch prediction is and how it affects performance. You could write a whole article about branch prediction alone; here I will just summarize what I learned.
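
To make the effect concrete, here is a small self-contained sketch of my own, in the spirit of that Stack Overflow question: the same loop tends to run faster on sorted data because the branch inside it becomes predictable. (With full optimizations the compiler may turn this loop into branchless or SIMD code, which shrinks the gap; the numbers vary by machine.)

use std::time::Instant;

fn sum_above(data: &[u8], threshold: u8) -> u64 {
    let mut sum = 0u64;
    for &x in data {
        // This `if` compiles to a conditional branch; how predictable it is
        // depends entirely on the order of the data.
        if x > threshold {
            sum += x as u64;
        }
    }
    sum
}

fn main() {
    // Pseudo-random bytes from a tiny xorshift, to avoid external crates.
    let mut state: u32 = 0x1234_5678;
    let mut data: Vec<u8> = (0..10_000_000)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            (state >> 24) as u8
        })
        .collect();

    let t = Instant::now();
    println!("unsorted: sum = {}, took {:?}", sum_above(&data, 128), t.elapsed());

    data.sort_unstable();
    let t = Instant::now();
    println!("sorted:   sum = {}, took {:?}", sum_above(&data, 128), t.elapsed());
}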

Branches

First of all, what counts as a branch? At the source level, the obvious branches are if/else/else if, goto, and switch/match statements. They are compiled into jump instructions in assembly. For example, x86 has a whole family of jump instructions starting with j:

Instruction  Meaning
jmp          Always jump
je           Jump if cmp is equal
jne          Jump if cmp is not equal
jg           Signed > (greater)
jge          Signed >=
jl           Signed < (less)
jle          Signed <=
ja           Unsigned > (above)
jae          Unsigned >=
jb           Unsigned < (below)
jbe          Unsigned <=
jecxz        Jump if ecx is 0
jc           Jump if carry: used for unsigned overflow, or multiprecision add
jo           Jump if there was signed overflow

For example, if we write a small piece of code containing if/else logic, the compiled assembly will include a jump instruction:

#![allow(unused_assignments)]
pub fn main() {
    // Receive a parameter entered by the user on the command line
    let mut a: usize = std::env::args().nth(1).unwrap().parse().unwrap_or_default();
    if a > 42 {
        a = 1;
    } else {
        a = 0;
    }
}

The logic is written this way mainly to keep the assembly as simple as possible. Below is the corresponding assembly, keeping only the part related to the if/else; the full code is here: rust.godbolt.org/z/ahcKcK1Er.

.LBB99_7:
        cmp     qword ptr [rsp + 56], 42 ; if a > 42
        jbe     .LBB99_10
        mov     qword ptr [rsp + 56], 1  ; a = 1
        jmp     .LBB99_11
.LBB99_10:
        mov     qword ptr [rsp + 56], 0  ; a = 0
.LBB99_11:
        add     rsp, 200
        ret

I added a few comments (after the semicolons in the assembly) indicating which Rust statement each instruction corresponds to. You can see that the jbe instruction decides whether to jump to .LBB99_10 or to fall through to the mov below it.

Knowing what a branch is still doesn't tell us why the CPU needs to predict them. For that we need another concept: instruction pipelining.

Instruction pipelining

Instruction pipelining is a technique for increasing instruction throughput within a single processor, and it has been around since at least the 1970s.

A CPU typically processes an instruction in several steps:

  • Instruction Fetch
  • Instruction Decode
  • Execute
  • Register write-back

This is much like a factory making a product that requires several processes. Imagine how slow it would be if the factory pushed one item through the entire process before starting on the next. That is exactly why industrial assembly lines were introduced: the steps are split up and run in parallel, so the first item being in step (2) does not stop the second item from entering step (1). Likewise, the CPU pipelines instruction execution by assigning the different stages to different logic units: while the first instruction is in the decode stage, the second instruction can already be in the fetch stage.

The pipeline diagram on Wikipedia's Instruction pipelining article illustrates this well.

With these concepts in mind, why does the CPU need to do branch prediction?

Why do we need branch prediction?

The instruction pipeline executes instructions like a factory assembly line, and that includes the jump instructions mentioned above. The problem with a conditional jump is that it cannot know whether to jump until the preceding logic has finished: only after the cmp completes can jbe decide whether to jump to .LBB99_10 or fall through. But the pipeline does not want to stall and wait, because that would waste clock cycles. So a technique was developed to get around this problem: branch prediction.

The CPU's branch predictor guesses in advance which way a jump is likely to go and speculatively feeds the predicted branch into the pipeline. If the prediction is correct, the CPU simply keeps executing the subsequent instructions; if it fails (a branch misprediction), the speculatively executed work has to be thrown away and the pipeline switches to the correct branch. So branch prediction can actually hurt performance if mispredictions happen too often. Branch predictors have evolved a lot over the years, and there are various techniques for improving their accuracy; for more, see the Branch predictor article on Wikipedia.

Avoiding branches in hot loops

Although modern CPUs have branch predictors, we should still try to avoid branches at the software level, especially in hot loops. The most common optimization is to avoid if/else statements inside loops, known as branchless code. The standard library has plenty of examples of branchless code used to optimize performance, such as the count() method of std::iter::Filter.

pub struct Filter<I, P> {
    iter: I,
    predicate: P,
}

impl<I: Iterator, P> Iterator for Filter<I, P>
where
    P: FnMut(&I::Item) -> bool,
{
    type Item = I::Item;

    // this special case allows the compiler to make `.filter(_).count()`
    // branchless. Barring perfect branch prediction (which is unattainable in
    // the general case), this will be much faster in >90% of cases (containing
    // virtually all real workloads) and only a tiny bit slower in the rest.
    //
    // Having this specialization thus allows us to write `.filter(p).count()`
    // where we would otherwise write `.map(|x| p(x) as usize).sum()`, which is
    // less readable and also less backwards-compatible to Rust before 1.10.
    //
    // Using the branchless version will also simplify the LLVM byte code, thus
    // leaving more budget for LLVM optimizations.
    #[inline]
    fn count(self) -> usize {
        #[inline]
        fn to_usize<T>(mut predicate: impl FnMut(&T) -> bool) -> impl FnMut(T) -> usize {
            move |x| predicate(&x) as usize
        }

        self.iter.map(to_usize(self.predicate)).sum()
    }
}

The standard library's Filter type overrides count() in its Iterator implementation. Imagine what we might write if we were not aware of branch prediction:

// Bad: a branch inside the hot loop
#[inline]
fn count(mut self) -> usize {
    let mut sum = 0;
    self.iter.for_each(|x| {
        if (self.predicate)(&x) {
            sum += 1;
        }
    });
    sum
}

This implementation has an if statement inside the loop, which forces the CPU to make a huge number of branch predictions; whether each branch is taken depends on the data and is essentially random, so the predictor cannot learn much from history and performance suffers. The standard library's implementation is completely branchless: not only does it perform much better, it also makes it easier for LLVM to apply further optimizations.
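
To get a feel for the same trick outside the standard library, here is a tiny sketch of my own: the condition is materialized as a 0/1 value and summed, so the loop body contains no data-dependent branch.

// Branchy: an `if` in the loop body that the predictor has to guess.
fn count_above_branchy(data: &[i32], threshold: i32) -> usize {
    let mut count = 0;
    for &x in data {
        if x > threshold {
            count += 1;
        }
    }
    count
}

// Branchless: `x > threshold` becomes 0 or 1 and is simply summed.
fn count_above_branchless(data: &[i32], threshold: i32) -> usize {
    data.iter().map(|&x| (x > threshold) as usize).sum()
}

fn main() {
    let data = [3, 7, 1, 9, 4, 6];
    assert_eq!(count_above_branchy(&data, 4), 3);
    assert_eq!(count_above_branchless(&data, 4), 3);
}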

  • Writing branchless code to optimize performance is a topic of its own, and there are plenty of resources about it online. Be aware, though, that branchless code often sacrifices readability, so don't use it blindly.

  • Branchless code has another benefit: it helps defend against timing side-channel attacks. Reference: en.wikipedia.org/wiki/Timing… .

Back to our PR. The reason optimization (1) performs worse than the old standard library implementation in some cases is indeed related to branch prediction, because our code has one more branch to predict.

For comparison, I posted screenshots of the assembly below; focus on the parts I highlighted. It is easy to see that the new implementation has one extra jne jump instruction, which means one more branch for the CPU to predict.

Note: with -O enabled the assembly would be optimized and much more concise; -O was left off here so that each line of source code can be matched against its assembly.

  • Assembly of the standard library implementation (screenshot)

  • Assembly of optimization (1) (screenshot)

Note that jmp is an unconditional jump and needs no branch prediction. If you are interested, you can see my comparison on Godbolt: rust.godbolt.org/z/8dGbY8Pe1. This site is fantastic, highly recommended!

Optimization (2)

So I suspect one reason the original authors of the standard library's binary_search_by() did not aim for O(1) best-case complexity is precisely to avoid the extra branch prediction: to get O(1) you have to return early, and to return early you need an extra branch. So what can we do? Tesuji offered an idea: since we cannot avoid the branch, let's at least help the CPU predict it better. I adopted his suggestion (commit here); let's call it optimization (2):

pub fn binary_search_by<'a, F>(&'a self, mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    let mut left = 0;
    let mut right = self.len();
    while left < right {
        // never overflow because `slice::len()` max is `isize::MAX`.
        let mid = (left + right) / 2;
        // SAFETY: the call is made safe by the following invariants:
        // - `mid >= 0`
        // - `mid < size`: `mid` is limited by `[left; right)` bound.
        let cmp = f(unsafe { self.get_unchecked(mid) });
        if cmp == Less {
            left = mid + 1;
        } else if cmp == Greater {
            right = mid;
        } else {
            return Ok(mid);
        }
    }
    Err(left)
}

The optimization (2) code is noticeably easier to understand than both the standard library version and optimization (1). Let's look at the generated assembly.

As you can see, there are still two jne instructions, so performance in the no-duplicates cases may still not match the standard library, but it is certainly better than optimization (1). A few days later m-ou-se of the library team responded. She had also run benchmarks and found that for primitive types such as u32 the l1-sized inputs were still slightly slower than the standard library, but for types that are more expensive to compare (such as String), the new implementation beat the standard library in every case. After much discussion, m-ou-se decided to start a crater run to see how the PR affects the crates on crates.io. In the end, the library team meeting agreed to merge the PR.
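
To illustrate that point, here is a rough, self-contained sketch of my own (not the libtest benchmarks used in the PR): it reimplements the old loop and the early-return loop as free functions and times them on a sorted Vec<String> with many duplicate keys. The exact numbers will vary, but expensive comparisons plus duplicates are exactly the situation that favors returning early.

use std::cmp::Ordering;
use std::time::Instant;

// Rough reconstruction of the pre-1.52 loop (no early return).
fn old_search<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        base = if f(&s[mid]) == Ordering::Greater { base } else { mid };
        size -= half;
    }
    let cmp = f(&s[base]);
    if cmp == Ordering::Equal {
        Ok(base)
    } else {
        Err(base + (cmp == Ordering::Less) as usize)
    }
}

// The early-return variant (same shape as the final version shown later).
fn new_search<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut size = s.len();
    let mut left = 0;
    let mut right = size;
    while left < right {
        let mid = left + size / 2;
        let cmp = f(&s[mid]);
        if cmp == Ordering::Less {
            left = mid + 1;
        } else if cmp == Ordering::Greater {
            right = mid;
        } else {
            return Ok(mid);
        }
        size = right - left;
    }
    Err(left)
}

fn main() {
    // Sorted Vec<String> where every key appears 100 times.
    let data: Vec<String> = (0..1_000_000).map(|i| format!("key-{:07}", i / 100)).collect();
    let needle = "key-0005000".to_string();

    let t = Instant::now();
    let mut hits = 0usize;
    for _ in 0..100_000 {
        hits += old_search(&data, |probe| probe.cmp(&needle)).is_ok() as usize;
    }
    println!("old: {:?} ({} hits)", t.elapsed(), hits);

    let t = Instant::now();
    let mut hits = 0usize;
    for _ in 0..100_000 {
        hits += new_search(&data, |probe| probe.cmp(&needle)).is_ok() as usize;
    }
    println!("new: {:?} ({} hits)", t.elapsed(), hits);
}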

About the crater run:

Crater builds and tests roughly every crate on crates.io with the compiler under test (here, a toolchain including my PR) to see how the change affects the ecosystem. With more than 50,000 crates on crates.io, a crater run typically takes about a week. When my crater run finished, it was only because it showed no significant breakage that the team was willing to merge.

From m-ou-se:

“We discussed this PR in a recent library team meeting, in which we agreed that the proposed behaviour (stopping on Equal) is preferrable over optimal efficiency in some specific niche cases. Especially considering how small most of the differences are in the benchmarks above.”

“The breakage in the crater report looks reasonably small. Also, now that partition_point is getting stabilized, there’s a good alternative for those who want the old behaviour of binary_search_by. So we should go ahead and start on getting this merged. :)”
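
A quick example of the partition_point alternative mentioned in the quote (my own illustration): it returns the index of the first element for which the predicate is false, so it can recover the exact boundaries of a run of equal elements, which covers uses that relied on the old deterministic position.

fn main() {
    let s = [0, 1, 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55];
    let seek = 1;
    // First index whose element is >= seek (start of the run of 1s):
    let lower = s.partition_point(|&x| x < seek);
    // First index whose element is > seek (one past the end of the run):
    let upper = s.partition_point(|&x| x <= seek);
    assert_eq!((lower, upper), (1, 5));
    // The old binary_search_by always landed on the last match, i.e. upper - 1.
    assert_eq!(upper - 1, 4);
}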

Integer overflow problem

However, scottmcm pointed out another problem:

// never overflow because `slice::len()` max is `isize::MAX`.
let mid = (left + right) / 2;

This line of code may overflow for zero-sized types (ZSTs)! Let's analyze why.

The return type of slice::len() is usize, but for non-zero-sized types (non-ZSTs) the length of a slice can be at most isize::MAX. So, as the comment says, left + right is at most isize::MAX + isize::MAX, which does not exceed usize::MAX, and the addition cannot overflow. That is not true for ZSTs. If every element of the slice is zero-sized (for example, ()), the slice length can be as large as usize::MAX. For [(); usize::MAX].binary_search(&()) we find a match in O(1) and return immediately, but if we write b.binary_search_by(|_| Ordering::Less), the addition left + right does overflow.
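
Here is a tiny sketch of my own that shows the overflow concretely, by reproducing the state of the second iteration of the old (left + right) / 2 formula on a slice of length usize::MAX:

fn main() {
    // After the first iteration of `binary_search_by(|_| Ordering::Less)` on a
    // slice of length `usize::MAX`, `left` has already moved past the middle:
    let right: usize = usize::MAX;
    let left: usize = usize::MAX / 2 + 1;
    // `left + right` would panic in a debug build (and wrap in release);
    // `checked_add` lets us observe the overflow without panicking.
    assert!(left.checked_add(right).is_none());

    // The fixed formula stays within bounds:
    let size = right - left;
    let mid = left + size / 2;
    assert!(left <= mid && mid < right);
}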

Why is slice::len() at most isize::MAX for non-ZSTs?

The simplest, most direct reason is that we cannot even construct an array or slice of a non-ZST with length usize::MAX; the compiler rejects it at compile time. Take the simplest 1-byte type, bool: the size of [bool; usize::MAX] would be std::mem::size_of::<bool>() * usize::MAX bytes, which is far too large for the address space of any machine.

fn main() {
    assert_eq!(std::mem::size_of::<bool>(), 1);
    // error: values of the type `[bool; 18446744073709551615]` are too big 
    // for the current architecture
    let _s = [true; usize::MAX];
}

But it is fine for ZSTs, because std::mem::size_of::<()>() * usize::MAX is still zero.

fn main() {
    assert_eq!(std::mem::size_of::<()>(), 0);
    let s = [(); usize::MAX];
    assert_eq!(s.len(), usize::MAX);
}

In fact, even std::mem::size_of::<bool>() * isize::MAX is still a huge number. The deeper reason is that a Rust pointer may only be offset by at most isize::MAX bytes; see std::pointer::offset() and the documentation of std::slice::from_raw_parts(). For ZSTs the compiler knows no addressing is needed at all, so the maximum length can be usize::MAX.

The final version

Once the integer overflow problem was identified, the fix was simple. This is what I committed in the end, along with a unit test for the overflow case.

pub fn binary_search_by<'a, F>(&'a self, mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    let mut size = self.len();
    let mut left = 0;
    let mut right = size;
    while left < right {
        let mid = left + size / 2;

        // SAFETY: the call is made safe by the following invariants:
        // - `mid >= 0`
        // - `mid < size`: `mid` is limited by `[left; right)` bound.
        let cmp = f(unsafe { self.get_unchecked(mid) });
        // The reason why we use if/else control flow rather than match
        // is because match reorders comparison operations, which is
        // perf sensitive.
        // This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra.
        if cmp == Less {
            left = mid + 1;
        } else if cmp == Greater {
            right = mid;
        } else {
            return Ok(mid);
        }

        size = right - left;
    }
    Err(left)
}

#[test]
fn test_binary_search_by_overflow() {
    let b = [(); usize::MAX];
    assert_eq!(b.binary_search_by(|_| Ordering::Equal), Ok(usize::MAX / 2));
    assert_eq!(b.binary_search_by(|_| Ordering::Greater), Err(0));
    assert_eq!(b.binary_search_by(|_| Ordering::Less), Err(usize::MAX));
}

Compared with optimization (2), the only change is replacing let mid = (left + right) / 2; with let mid = left + size / 2;, where size = right - left. Since left + size / 2 can never exceed right, the addition cannot overflow even when the slice length is usize::MAX.

Why if/else instead of match? Comparing the assembly of the two versions, the match version not only produced more instructions but also reordered the cmp operations, which appeared to hurt performance. In theory the two versions should compile to the same assembly; I haven't dug into why the match version comes out worse, and interested readers are welcome to investigate.
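
For reference, here is a sketch of what the match-based variant of the loop looks like (my own reconstruction of the rejected alternative, written as a free function; it is not the code that was merged):

use std::cmp::Ordering::{self, Equal, Greater, Less};

pub fn binary_search_by_match<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut size = s.len();
    let mut left = 0;
    let mut right = size;
    while left < right {
        let mid = left + size / 2;
        // The only difference from the merged version: a `match` on the
        // comparison result instead of an if/else chain.
        match f(&s[mid]) {
            Less => left = mid + 1,
            Greater => right = mid,
            Equal => return Ok(mid),
        }
        size = right - left;
    }
    Err(left)
}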

Conclusion

Calm on the surface, turbulent underneath: a seemingly simple PR turned out to involve quite a lot. There is no end to learning. I gained a great deal through this PR, and I hope it encourages more developers to get involved. The Rust community is open and inclusive: whenever you find something that can be improved, you can submit a PR or an issue to the Rust repository. Not only will Rust keep getting better, you will also grow and gain a lot yourself!

We're hiring

  • Feishu, ByteDance's enterprise collaboration platform, is a one-stop platform for enterprise communication and collaboration that integrates video conferencing, online documents, mobile office and collaboration tools. Feishu is growing rapidly, and we have R&D centers in Beijing, Shenzhen and other cities. There are plenty of openings in front-end, mobile, Rust, server, testing and product roles. We look forward to you joining us and doing challenging things together (see: future.feishu.cn/recruit).

  • We also welcome technical discussion with the Feishu team. If you are interested, join the Feishu technology exchange group.