- My Favorite Rust Function Signature
- The original link: www.brandonsmith.ninja/blog/favori…
- By Brandon Smith
- github.com/suhanyujie/…
- Translator: suhanyujie
- Translation blog: Suhanyujie
- P.S.: My translation skills are limited; please point out any mistakes.
- Tag: Rust, parser
My favorite Rust function signature
Lately, I’ve been interested in writing parsers, and Rust has proven to be a great language for writing parsers. During my explorations, the following came to mind:
fn tokenize<'a>(code: &'a str) -> impl Iterator<Item=&'a str> {
...
}
It really deepened my love for Rust.
What does this function do?
For those unfamiliar with parsers, tokenization is the first step in the parser. It takes a raw string as input, like the following line:
let a = "foo";
and converts it into a list of meaningful tokens, like this:
["let", "a", "=", "\"foo\"", ";"]
This stage is not complicated, but it simplifies the mental model for the next stage: building an “abstract syntax tree.” It takes whitespace out of the picture, bundles up segments such as strings and numbers into single tokens, and generally makes the code in the next pass simpler.
The downside is that, if you do this as its own pass, your parser now has to go over all of the source code twice. That may not be the end of the world: tokenization is not the most expensive operation. It isn’t ideal, however, so some parsers combine the two passes into one, gaining performance at the expense of readability.
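As a concrete illustration of the tokenization step, here is a minimal sketch of a tokenizer with the signature shown earlier. The article does not show its implementation, so the Tokens struct and its rules (identifiers and numbers, double-quoted strings without escape handling, and single-character symbols) are assumptions made for this example.

```rust
/// Hypothetical iterator type: holds the part of the input not yet tokenized.
struct Tokens<'a> {
    rest: &'a str,
}

impl<'a> Iterator for Tokens<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        // Copy the reference out, then skip leading whitespace.
        let rest: &'a str = self.rest;
        let rest = rest.trim_start();
        // An empty remainder means there are no more tokens.
        let first = rest.chars().next()?;

        // Decide how many bytes the next token spans.
        let len = if first == '"' {
            // String literal: run to the closing quote (no escape handling).
            match rest[1..].find('"') {
                Some(i) => i + 2,
                None => rest.len(), // unterminated: take the remainder
            }
        } else if first.is_alphanumeric() || first == '_' {
            // Identifier, keyword, or number.
            rest.find(|c: char| !(c.is_alphanumeric() || c == '_'))
                .unwrap_or(rest.len())
        } else {
            // Any other character is a one-character token.
            first.len_utf8()
        };

        let (token, remainder) = rest.split_at(len);
        self.rest = remainder;
        Some(token)
    }
}

fn tokenize<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    Tokens { rest: code }
}
```

Collecting tokenize("let a = \"foo\";") into a Vec yields the five tokens listed above.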
What does the Rust version of the parser look like?
I’ll copy the function signature here again as a reference:
fn tokenize<'a>(code: &'a str) -> impl Iterator<Item=&'a str> {
...
}
There’s a lot going on here, so let’s break it down.
In Rust, &str is a “string slice.” It is essentially a character pointer plus a length. The contents of the slice are guaranteed to be in valid memory. &'a str is a string slice with a lifetime. The lifetime 'a describes a limited span of time during which the reference (and the full contents of the slice) are guaranteed to be in valid, live memory. More on this later.
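The “pointer plus length” description can be checked directly. The helpers below (str_ref_size, slice_parts, first_bytes) are hypothetical names introduced only for this sketch.

```rust
use std::mem::size_of;

/// Size in bytes of a `&str` value itself (not of the text it points to).
fn str_ref_size() -> usize {
    size_of::<&str>()
}

/// Returns the two parts that make up a `&str`:
/// the address of its first byte and its length in bytes.
fn slice_parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

/// Borrows the first `n` bytes of `s`. The result carries the same
/// lifetime `'a` as the input because it points into the same memory.
/// Panics if `n` is out of range or not on a character boundary.
fn first_bytes<'a>(s: &'a str, n: usize) -> &'a str {
    &s[..n]
}
```

A &str occupies two machine words, and a sub-slice taken from the start of a string shares that string's starting address rather than copying any text.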
Iterator<Item=&'a str> is an iterator over elements of type &'a str. This is a trait, though, not a concrete type. Rust needs a concrete type with a fixed size when you’re defining something like a function, but luckily we can say impl Iterator<Item=&'a str>, which tells Rust, “fill in some type that implements Iterator<Item=&'a str>, to be inferred at compile-time”. This is very helpful because in Rust there are lots and lots of different concrete types for Iterator; applying something like a map() or a filter() returns a whole new concrete type. So this way, we don’t have to worry about keeping the function signature up to date as we work on the logic.
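To see why impl Iterator keeps the signature stable, compare two bodies behind the same return type. Both functions are stand-ins that only split on whitespace; they are assumptions for this sketch, not the author's tokenizer.

```rust
/// Stand-in tokenizer (assumption): splits on whitespace only.
/// The hidden concrete return type is `std::str::SplitWhitespace<'a>`.
fn tokenize<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    code.split_whitespace()
}

/// Same signature, different body. Adding `filter` changes the hidden
/// concrete type to a `Filter<SplitWhitespace<'a>, _>`, but callers
/// only ever see "some iterator over `&'a str`".
fn tokenize_no_semicolons<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    code.split_whitespace().filter(|tok| *tok != ";")
}
```

Callers can collect, map, or loop over either result in the same way, so changing the body never forces a change at the call site.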
What are the advantages?
OK, now we have a function that takes a reference to a string slice and returns an iterator over string slices. What’s so special about that? There are two main reasons.
Iterators let you treat one pass as if it were two
Remember how I mentioned that you traditionally have to choose between a separate tokenization pass and a single pass with all the logic interleaved? With iterators, you get the best of both worlds.
When this function returns, it hasn’t actually iterated over the string yet. It hasn’t allocated any kind of collection in memory. It returns a structure that is ready to iterate over the input string slice and produce a sequence of new slices. When this value is later passed through map(), filter(), or any other Iterator transformation, the stages of the process are interleaved, and the loops effectively collapse into a single one. By doing so, we get the clean abstraction of a tokenizing “pass” without the run-time overhead of a second loop.
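The interleaving can be observed by logging when each stage runs. This is a sketch under assumptions: tokenize is a whitespace-splitting stand-in, and trace_pipeline is a hypothetical helper written for this example.

```rust
use std::cell::RefCell;

/// Stand-in tokenizer (assumption): splits on whitespace only.
fn tokenize<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    code.split_whitespace()
}

/// Builds a two-stage pipeline and records the order in which work happens.
fn trace_pipeline(code: &str) -> Vec<String> {
    let log = RefCell::new(Vec::new());

    let pipeline = tokenize(code)
        .inspect(|tok| log.borrow_mut().push(format!("tokenized {}", tok)))
        .map(|tok| tok.to_uppercase())
        .inspect(|up| log.borrow_mut().push(format!("mapped {}", up)));

    // Building the pipeline does no iteration, so nothing is logged yet.
    assert!(log.borrow().is_empty());

    // Consuming the iterator drives both stages in a single loop.
    pipeline.for_each(drop);

    log.into_inner()
}
```

For the input "let a", the log alternates between the two stages (tokenized let, mapped LET, tokenized a, mapped A) instead of finishing all tokenizing before any mapping starts.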
Other languages have iterators too. Rust’s may be especially powerful and ergonomic, but they are not a feature unique to Rust. The next part, however, very much is.
Lifetimes let you share references without fear
The tokenize() function does not allocate any memory for a collection of tokens. That’s great, but what’s less obvious is that it also doesn’t allocate any memory for the tokens themselves! Each token’s string slice is a pointer into a piece of the original string.
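One way to confirm that tokens are views into the source rather than copies is to compare addresses. Here tokenize is again a whitespace-splitting stand-in, and points_into is a hypothetical helper for this sketch.

```rust
/// Stand-in tokenizer (assumption): splits on whitespace only.
fn tokenize<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    code.split_whitespace()
}

/// Returns true if `token` lies entirely inside the memory of `source`.
fn points_into(source: &str, token: &str) -> bool {
    let start = source.as_ptr() as usize;
    let end = start + source.len();
    let tok_start = token.as_ptr() as usize;
    let tok_end = tok_start + token.len();
    start <= tok_start && tok_end <= end
}
```

Every token produced from a given source string passes this check, while a separately allocated String with the same text does not.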
Of course, this can also be done in C/C++, but there is a risk: if these tokens are accessed after the original string has been freed, the result is a memory error.
For example, suppose you open a file, load the source code from it, and store the result in a local variable. Then you tokenize() it and send the tokens somewhere outside of the function that owns the variable. Voilà, you have a use-after-free error.
One way to prevent this is to copy each token into a new string, stored on the heap, so that it can be safely passed around after the original string is gone. But doing so comes at a cost: creating, copying, and eventually disposing of those new strings takes time (and memory). Code further down the line must also be aware that it is responsible for freeing those strings, or they will leak.
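In Rust, the copying approach looks roughly like the sketch below. The names owned_tokens and load_and_tokenize are hypothetical, the hard-coded string stands in for file contents, and tokenize is a whitespace-splitting stand-in. Unlike in C, the copies are freed automatically when the Vec is dropped, but the allocation and copying cost remains.

```rust
/// Stand-in tokenizer (assumption): splits on whitespace only.
fn tokenize<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    code.split_whitespace()
}

/// Copies every token into its own heap-allocated `String`.
fn owned_tokens(code: &str) -> Vec<String> {
    tokenize(code).map(|tok| tok.to_owned()).collect()
}

/// Hypothetical loader: the source lives only inside this function.
fn load_and_tokenize() -> Vec<String> {
    // Stands in for text read from a file.
    let source = String::from("let a = 1 ;");
    // The tokens are copies, so they may outlive `source`,
    // which is dropped when this function returns.
    owned_tokens(&source)
}
```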
This is where the magic of lifetimes comes into play.
Rust prevents the situation above entirely. Without lifetime information, though, a &str coming into a function from elsewhere would have to be assumed to be static, that is, alive for the entire execution of the program. This is the status given to, for example, a string literal written by hand in your Rust code. In the context of the function, Rust would not know how long the reference stays valid, so it would have to be pessimistic.
However, that little 'a says: “these things all live for the same span of time.” We assert that the source string lives at least as long as the tokens that reference it. By doing so, Rust can reason about whether the resulting token references are valid at any given point, so they don’t have to be assumed to be static! We can do whatever we want with these tokens, and the compiler will guarantee that the references always point to something valid, even if the source code is loaded dynamically at runtime (from a file or elsewhere). If the compiler later tells us that the tokens do need to outlive the source string, we can copy them at that point (“take ownership”). If the compiler doesn’t force us to do this, we know the references are safe, so we can keep using the most efficient approach without fear.
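A small sketch of what this guarantee looks like in practice follows. The function longest_token is hypothetical, and tokenize is again a whitespace-splitting stand-in. The commented-out function shows the kind of misuse the compiler refuses to accept.

```rust
/// Stand-in tokenizer (assumption): splits on whitespace only.
fn tokenize<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    code.split_whitespace()
}

/// The returned token borrows from `source`, and the signature says so:
/// the caller may keep the token only as long as `source` stays alive.
fn longest_token<'a>(source: &'a str) -> Option<&'a str> {
    tokenize(source).max_by_key(|tok| tok.len())
}

// Rejected at compile time: the returned slice would point into
// `source`, which is dropped when the function returns.
//
// fn dangling_token() -> Option<&'static str> {
//     let source = String::from("let a = 1 ;");
//     longest_token(&source)
// }
```

As long as the caller owns the source string, the borrowed token can be used freely with no copy. If it has to outlive the source, calling to_owned() on it is the step down to an owned String.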
What we’ve effectively done is written the most optimistic possible function (in terms of memory safety), with no downsides, because the Rust compiler will tell us if we’re misusing it and force us to then “step down” to whatever level of extra accommodation is needed.
Conclusion
I’ve been using (and loving) Rust for about a year and a half. There is a lot to like about it, but this function immediately showed me what sets it apart from other languages. In any other language, you can’t do this both a) this safely and b) this efficiently. This is the power of Rust.