• Learning Parser Combinators With Rust
  • Original author: Bodil
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: suhanyujie

This article teaches the fundamentals of parser combinators to people who already program in Rust. It assumes no other prior knowledge, and will explain everything that isn't directly related to Rust, as well as a few of the more unexpected aspects of using Rust for this purpose. It won't teach you Rust if you don't already know it, and in that case it probably won't teach you parser combinators very well either. If you want to learn Rust, I recommend the book The Rust Programming Language.

Beginner’s Monologue

There may come a time in many programmers’ careers when they realize they need a parser.

A novice programmer might ask, "What is a parser?"

An intermediate programmer will say, "That's easy, I'll just write a regular expression."

A senior programmer will say, "Stand back, I know lex and yacc."

The novice has the right idea.

It's not that regular expressions are bad. (But please don't try to write a complex parser as one regular expression.) Nor is there no joy in using powerful tools like parser and lexer generators, which have been iterated on and refined to a very high degree. But learning parsers from zero is fun. If you head straight for regular expressions or parser generators, you'll miss out on a lot of great things, because they are only abstractions over the real problem at hand. As somebody once said, in the beginner's mind there are many possibilities; in the expert's mind there are few.

In this article, you'll learn how to build a parser from scratch using a technique common in functional programming languages known as parser combinators. They have the pleasant advantage that once you've grasped the basic idea and the basic principles, you build your own abstractions on top of the basic combinators (and those will be the only abstractions here), all of which you'll have understood before you use them.

How to get the most out of this article

It is highly recommended that you start a fresh Rust project and type the code snippets into src/lib.rs as you read (you can copy the snippets straight off the page, but it's better to type them by hand, as doing so ensures you read the code in its entirety). This article introduces each piece of code you'll need in order. Note that it sometimes introduces a modified version of a function you wrote earlier, in which case you should replace the old version with the new one.

The code is based on rustc 1.34.0 using the 2018 edition of Rust. You should be able to use a newer compiler; just make sure you're on the 2018 edition (check that Cargo.toml contains edition = "2018"). The code needs no external dependencies.

As you might expect, to run the tests introduced in this article, you use cargo test.

XML text

We will write a parser for the simplified version of XML. It looks something like this:

<parent-element>
  <single-element attribute="value" />
</parent-element>

XML elements open with the symbol < and an identifier, which consists of letters, numbers, or -. This is followed by some whitespace and an optional list of attributes: another identifier as defined before, followed by a = and a string in double quotes. Finally, there is either a /> to close the tag, signifying a single element with no children, or a > signifying there is a sequence of child elements to follow, terminated by a closing tag starting with </, containing the same identifier, and ending with >.

That's all we're going to support. No namespaces, no text nodes, no other node types, and most definitely no schema validation. We're not even going to bother supporting escaped quotes in those strings: they start at the first double quote and end at the next one, and that's it. If you want to use double quotes inside your actual strings, you can take that awkward requirement somewhere else.

We will parse these elements into a structure like this:

#[derive(Clone, Debug, PartialEq, Eq)]
struct Element {
    name: String,
    attributes: Vec<(String, String)>,
    children: Vec<Element>,
}

Nothing fancy here: just a String for the name (that's the identifier at the start of each tag), attributes as pairs of Strings (identifier and value), and a list of child elements that look exactly the same as the parent.
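To make that concrete, here is the value we'd expect our finished parser to produce for the sample document above (the struct is repeated so the snippet stands alone, and the sample_element helper is just for illustration, not part of the article's code):

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
struct Element {
    name: String,
    attributes: Vec<(String, String)>,
    children: Vec<Element>,
}

// The value we'd expect for:
// <parent-element>
//   <single-element attribute="value" />
// </parent-element>
fn sample_element() -> Element {
    Element {
        name: "parent-element".to_string(),
        attributes: vec![],
        children: vec![Element {
            name: "single-element".to_string(),
            attributes: vec![("attribute".to_string(), "value".to_string())],
            children: vec![],
        }],
    }
}
```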

(If you're typing in the code, be sure to include those derive attributes. You'll need them later.)

Define the parser

So, it’s time to start writing parsers.

Parsing is the process of deriving structure from a stream of data. A parser is the thing that teases that structure out.

In the discipline we're about to explore, a parser, in its simplest form, is a function that takes some input and returns either the parsed output along with the remainder of the input, or an error saying "I couldn't parse this."

In short, that's also what a parser looks like in more complicated settings. You might complicate the input, the output and the error, and you'll need to if you want good error messages, but the parser stays the same: it either consumes the input and produces the parsed result along with the rest of the input, or it tells you it couldn't parse the input and gives you a message to let you know.

We mark it as a function type

Fn(Input) -> Result<(Input, Output), Error>

More concretely, in our case we want to fill in the types so it looks something like the following, because what we're about to do is convert a string into an Element structure. At this point we don't want to get into sophisticated error reporting, so we'll just return the string we couldn't parse as the error:

Fn(&str) -> Result<(&str, Element), &str>

We use a string slice because it's an efficient pointer into a piece of a string. We can subslice it however we please, consuming the input by slicing off the part we've processed and returning the remainder along with the result.
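If it helps to see that shape with the types named, one way to write it as an alias is sketched below (the ParseResult alias and the any_char example are illustrative additions, not code from the article):

```rust
// A possible alias for the parser result shape described above:
// on success, the remaining input plus a value of type T; on
// failure, the input that couldn't be parsed.
type ParseResult<'a, T> = Result<(&'a str, T), &'a str>;

// Any Fn(&str) -> ParseResult<T> is then a parser. For instance,
// a parser that consumes whatever the first character is:
fn any_char(input: &str) -> ParseResult<'_, char> {
    match input.chars().next() {
        Some(c) => Ok((&input[c.len_utf8()..], c)),
        None => Err(input),
    }
}
```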

It might have been cleaner to use &[u8] (a slice of bytes, assuming we limit ourselves to ASCII characters) as the input type, especially because string slices behave a little differently from most other slices; notably, you can't index them with a single number like input[0], you have to use a range like input[0..1]. On the other hand, string slices have a lot of useful methods for parsing strings that byte slices don't.

In fact, for the most part we're going to rely on those methods rather than on indexing, because of Unicode. All Rust strings are UTF-8, and UTF-8 byte indices don't always correspond to individual characters. It's best to let the standard library deal with that for us.
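To illustrate the point (a standalone sketch, not part of the article's parser), here is the safe way to take the first character as a slice: ask the char for its UTF-8 length instead of assuming it's one byte long.

```rust
// Slice off the first character without ever cutting a multi-byte
// UTF-8 sequence in half.
fn first_char_slice(input: &str) -> Option<&str> {
    input.chars().next().map(|c| &input[..c.len_utf8()])
}
```

Indexing by bytes, like &input[..1], would panic at runtime on a string starting with a multi-byte character such as "ü", while this version returns the whole character.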

Our first parser

Let's try writing a parser that just looks at the first character in the string and decides whether or not it's the letter a.

fn the_letter_a(input: &str) -> Result<(&str, ()), &str> {
    match input.chars().next() {
        Some('a') => Ok((&input['a'.len_utf8()..], ())),
        _ => Err(input),
    }
}

First, let's look at the input and output types: we take a string slice as input and, as discussed, we return a Result of either (&str, ()) or an &str error. The interesting part is the (&str, ()) part: as we discussed, we're supposed to return a tuple containing the next bit of input to parse along with the result. The &str is the next input, and the result is just the unit type (), because if this parser succeeds, it can only have one possible result (we found the letter a), and in this case we don't particularly need to return the letter a itself; we only need to indicate that we succeeded in finding it.

So let's look at the code of the parser itself. We start by getting the first character of the input: input.chars().next(). We weren't kidding about leaning on the standard library to avoid Unicode headaches: we call the chars() iterator it provides over the characters of the string, and pull the first item off it. That's an item of type char wrapped in an Option, so Option<char>, where None means we tried to pull a char off an empty string.

To make things worse, a char isn't even necessarily what you think of as a character in Unicode. That would most likely be what Unicode calls a "grapheme cluster," which can be composed of several chars, which in turn represent "scalar values," about two levels down from grapheme clusters. However, that way lies madness, and for our purposes we're unlikely to even see chars outside the ASCII set, so let's leave it at that.

We pattern match on Some('a'), which is the specific result we're looking for, and if that matches, we return the success value Ok((&input['a'.len_utf8()..], ())). That is, we remove the part we just parsed (the 'a') from the string slice and return the rest, along with the parsed value, which is just the unit (). Ever mindful of the Unicode monster, we ask the standard library for the length of 'a' in UTF-8 before slicing; it's 1, but never presume about the Unicode monster.

If we get any other Some(char), or if we get None, we return an error. As mentioned before, our error type for now is just the string slice at the point where parsing failed, which is the input we were passed. It doesn't start with an a, so that's our error. It's not a great error, but at least it's marginally better than "something went horribly wrong somewhere."

We don't actually need this parser to parse XML, but the first thing we'll need to do is look for that opening <, so we're going to need something very similar. We'll also need to parse >, / and =, so maybe we can make a function that builds a parser for whatever character we want to parse?

Parser builder

Let's write a function that builds a parser for a static string of any length, not just a single character. That's even easier, because a string slice is already a valid UTF-8 string, so we don't have to think about the Unicode issues here.

fn match_literal(expected: &'static str) -> impl Fn(&str) -> Result<(&str, ()), &str> {
    move |input| match input.get(0..expected.len()) {
        Some(next) if next == expected => {
            Ok((&input[expected.len()..], ()))
        }
        _ => Err(input),
    }
}

It looks a little different now.

First, let's look at the types. Our function no longer looks like a parser: it takes the expected string as an argument and returns something that looks like a parser. It's a function that returns a function; in other words, a higher order function. Basically, we're writing a function that manufactures functions like our the_letter_a from before.

So, instead of performing the logic in the function body, we return a closure; the closure is where the logic lives, and it matches our parser's function signature from before.

The pattern match looks the same, except we can't match on the string literal directly, because we don't know what it is at compile time, so we use a match condition, if next == expected, instead. Otherwise it's exactly the same as before; it's just that the logic lives inside the body of a closure.

Testing the parser

We’ll write a test to make sure we’re doing it right.

#[test]
fn literal_parser() {
    let parse_joe = match_literal("Hello Joe!");
    assert_eq!(
        Ok(("", ())),
        parse_joe("Hello Joe!")
    );
    assert_eq!(
        Ok((" Hello Robert!", ())),
        parse_joe("Hello Joe! Hello Robert!")
    );
    assert_eq!(
        Err("Hello Mike!"),
        parse_joe("Hello Mike!")
    );
}

First, we build the parser: match_literal("Hello Joe!"). This parser should consume the string "Hello Joe!" and return the remainder of the input, or it should fail and return the whole input.

In the first case, we feed it the exact string it expects as an argument, and we see that it returns an empty string along with the () value that means "we parsed the string as expected, and you don't really need this value returned to you."

In the second case, we feed it the string "Hello Joe! Hello Robert!", and we do indeed see it parse the string "Hello Joe!" and return the remainder of the input: " Hello Robert!" (leading space and all).

In the third case, we feed it some incorrect input, "Hello Mike!", and note that it does indeed reject the input and return an error. Not that Mike is incorrect as a general rule; he's just not what this parser was looking for.

A parser for something less specific

So now we can parse <, >, = and even </ and />. We're practically there already!

The next bit after the opening < is the element name. We can't do that with a simple string comparison, but we could do it with a regular expression… except it would be a regular expression simple enough to replicate in plain code, and we don't need to pull in the regex crate for this. Let's see how far we get writing our own parser using only Rust's standard library.

Recalling the rule for element name identifiers, it goes like this: one alphabetical character, followed by zero or more of either an alphabetical character, a number, or a dash -.

fn identifier(input: &str) -> Result<(&str, String), &str> {
    let mut matched = String::new();
    let mut chars = input.chars();

    match chars.next() {
        Some(next) if next.is_alphabetic() => matched.push(next),
        _ => return Err(input),
    }

    while let Some(next) = chars.next() {
        if next.is_alphanumeric() || next == '-' {
            matched.push(next);
        } else {
            break;
        }
    }

    let next_index = matched.len();
    Ok((&input[next_index..], matched))
}

As always, let's look at the type first. This time we're not writing a function that builds a parser; we're writing the parser itself, as we did at the very start. The notable difference here is that instead of a result type of (), we return a String in the tuple along with the remaining, unparsed input. That String will contain the identifier we just parsed.

With that in mind, first we create an empty String and call it matched. That's going to be our result value. We also get an iterator over the characters of the input string, which we're going to walk through one at a time.

The first step is to check whether the prefix starts with a letter. We pull the first character off the iterator and check whether it's a letter: next.is_alphabetic(). The Rust standard library is of course here to help us with Unicode; this will match letters in any alphabet, not just ASCII. If it's a letter, we push it into our matched String. If it's not, clearly we're not looking at an element identifier, so we immediately return an error.

For the second step, we keep pulling characters off the iterator and pushing them onto the String we're building, until we find one that fails the test: it isn't is_alphanumeric() (which is like is_alphabetic() except it also matches numbers in any alphabet) and it isn't a dash -.

The first time we see something that doesn't match those conditions, it means we're done parsing, so we break out of the loop and return the String we've built, remembering to slice off the bit we've consumed from the input. Likewise if the iterator runs out, it means we've reached the end of the input.

It's worth noting that we don't return an error when we see something that isn't alphanumeric or a dash. Once we've matched that first letter, we already have enough to make a valid identifier, and it's perfectly normal for there to be more things to parse in the input string after our identifier, so we just stop parsing and return what we've got. Only if we can't find even that first letter do we return an error, because in that case there's definitely no identifier here.

Remember when we were parsing an XML document into an Element structure?

struct Element {
    name: String,
    attributes: Vec<(String, String)>,
    children: Vec<Element>,
}

We've actually just finished the first part of our parser: the one that parses the name field. The String our parser returns is exactly that, and it also happens to be the right parser for the first half of every attribute.

Let’s start testing it.

#[test]
fn identifier_parser() {
    assert_eq!(
        Ok(("", "i-am-an-identifier".to_string())),
        identifier("i-am-an-identifier")
    );
    assert_eq!(
        Ok((" entirely an identifier", "not".to_string())),
        identifier("not entirely an identifier")
    );
    assert_eq!(
        Err("!not at all an identifier"),
        identifier("!not at all an identifier")
    );
}

We see the first case where the string i-am-an-identifier is fully parsed, leaving only the empty string. In the second case, the parser returns “not” as the identifier and the rest of the string as the rest of the input. In the third case, the parser completely fails because the first character it finds is not a letter.

Combinators

Now we can parse the opening <, and we can parse the following identifier, but we need to parse both, in order, to be able to make progress here. So the next step is to write another parser builder function: one that takes two parsers as input and returns a new parser that parses both of them in order. In other words, a parser combinator, because it combines two parsers into a new one. Let's see if we can do that.

fn pair<P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Fn(&str) -> Result<(&str, (R1, R2)), &str>
where
    P1: Fn(&str) -> Result<(&str, R1), &str>,
    P2: Fn(&str) -> Result<(&str, R2), &str>,
{
    move |input| match parser1(input) {
        Ok((next_input, result1)) => match parser2(next_input) {
            Ok((final_input, result2)) => Ok((final_input, (result1, result2))),
            Err(err) => Err(err),
        },
        Err(err) => Err(err),
    }
}

It’s a little complicated here, but you should know what to do next: Start by looking at the types.

First of all, we have four type variables: P1, P2, R1 and R2. That's Parser 1, Parser 2, Result 1 and Result 2. P1 and P2 are functions, and you'll notice that they follow the established pattern of parser functions: just like the return value, they take an &str as input and return either the remaining input along with a parse result, or an error.

But look at the result types of the two functions: P1 is a parser which produces an R1 if successful, and P2 likewise produces an R2. And the result of the final parser, the one returned by our function, is (R1, R2). So the job of this parser is to first run parser P1 on the input, keep its result, then run P2 on the input that P1 returned as the remainder, and if both of those work out, we combine the two results into the tuple (R1, R2).

Looking at the code, that's exactly what it does, too. We run the first parser on the input, then the second, then we combine the two results into a tuple and return that. If either parser fails, we return immediately with the error it gave us.
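As a side note (this is my own sketch, not code from the article), the nested match in pair can be flattened using Result's and_then and map, since both error arms just pass the error straight through:

```rust
// Alternative formulation of `pair` using Result combinators
// instead of nested `match`; the behaviour is identical.
fn pair<P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Fn(&str) -> Result<(&str, (R1, R2)), &str>
where
    P1: Fn(&str) -> Result<(&str, R1), &str>,
    P2: Fn(&str) -> Result<(&str, R2), &str>,
{
    move |input| {
        // Run parser1; on success, feed the remaining input to parser2
        // and pair up the two results. Errors short-circuit.
        parser1(input).and_then(|(next_input, result1)| {
            parser2(next_input)
                .map(|(final_input, result2)| (final_input, (result1, result2)))
        })
    }
}

// A literal parser (same as `match_literal` above), repeated here so
// the snippet can be run on its own.
fn match_literal(expected: &'static str) -> impl Fn(&str) -> Result<(&str, ()), &str> {
    move |input| match input.get(0..expected.len()) {
        Some(next) if next == expected => Ok((&input[expected.len()..], ())),
        _ => Err(input),
    }
}
```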

With this, we can combine our two previous parsers, match_literal and identifier, to actually parse the first little bit of our first XML tag. Let's write a test to see if it works.

#[test]
fn pair_combinator() {
    let tag_opener = pair(match_literal("<"), identifier);
    assert_eq!(
        Ok(("/>", ((), "my-first-element".to_string()))),
        tag_opener("<my-first-element/>")
    );
    assert_eq!(Err("oops"), tag_opener("oops"));
    assert_eq!(Err("!oops"), tag_opener("<!oops"));
}

It seems to work! But look at that result type: ((), String). Obviously we only care about the right-hand value, the String. That's going to be the case rather a lot: some of our parsers only match patterns in the input without producing values, so their outputs can be safely ignored. To accommodate this pattern, we're going to use our pair combinator to write two other combinators: left, which discards the result of the first parser and only returns the second; and its mirror image, right, which is the one we'd have wanted to use in our test above instead of pair, since it discards the () on the left-hand side of the pair and keeps only our String.
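Those combinators are built in the next part of the series; as a preview, here is one possible sketch (my guess at the shape, not the series' final code) of right built directly on pair, with the earlier parsers repeated so the snippet stands alone:

```rust
// `pair`, `match_literal` and `identifier` repeated from above so
// this snippet is self-contained.
fn pair<P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Fn(&str) -> Result<(&str, (R1, R2)), &str>
where
    P1: Fn(&str) -> Result<(&str, R1), &str>,
    P2: Fn(&str) -> Result<(&str, R2), &str>,
{
    move |input| match parser1(input) {
        Ok((next_input, result1)) => match parser2(next_input) {
            Ok((final_input, result2)) => Ok((final_input, (result1, result2))),
            Err(err) => Err(err),
        },
        Err(err) => Err(err),
    }
}

fn match_literal(expected: &'static str) -> impl Fn(&str) -> Result<(&str, ()), &str> {
    move |input| match input.get(0..expected.len()) {
        Some(next) if next == expected => Ok((&input[expected.len()..], ())),
        _ => Err(input),
    }
}

fn identifier(input: &str) -> Result<(&str, String), &str> {
    let mut matched = String::new();
    let mut chars = input.chars();
    match chars.next() {
        Some(next) if next.is_alphabetic() => matched.push(next),
        _ => return Err(input),
    }
    while let Some(next) = chars.next() {
        if next.is_alphanumeric() || next == '-' {
            matched.push(next);
        } else {
            break;
        }
    }
    let next_index = matched.len();
    Ok((&input[next_index..], matched))
}

// `right` runs both parsers in order but keeps only the second
// result, discarding the first (e.g. the `()` from a literal).
fn right<P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Fn(&str) -> Result<(&str, R2), &str>
where
    P1: Fn(&str) -> Result<(&str, R1), &str>,
    P2: Fn(&str) -> Result<(&str, R2), &str>,
{
    let parser = pair(parser1, parser2);
    move |input| parser(input).map(|(next_input, (_left, r2))| (next_input, r2))
}
```

With this, the test above could use right(match_literal("<"), identifier) instead of pair, producing Ok(("/>", "my-first-element".to_string())) with no () in sight.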

  • Learn through Rust parser combinator – Part 1
  • Learn through Rust parser combinator – Part 2
  • Learn through Rust parser combinator – Part 3
  • Learn through Rust parser combinator – Part 4

License

This work is copyright Bodil Stokke and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license. To view a copy of the license, visit creativecommons.org/licenses/by…

footnotes

1: He's not really your uncle. 2: Please don't be that person at parties.

If you find any mistakes in this translation or other areas that could be improved, you are welcome to submit a revision PR to the Nuggets Translation Project, for which you can earn reward points. The permanent link at the beginning of this article is the Markdown link to this article on GitHub.


The Nuggets Translation Project is a community that translates quality English technical articles from across the Internet and shares them on Juejin (Nuggets). Its content covers Android, iOS, front end, back end, blockchain, products, design, artificial intelligence, and other fields. For more high-quality translations, please follow the Nuggets Translation Project, its official Weibo, and its Zhihu column.