Building an LL(1) SQL parser in Go

In this article, I’ll show you how to construct an LL(1) parser in Go and apply it to parsing SQL queries. I hope it gives you a working feel for the parsing algorithm and a simple application of it.


Abstract

The purpose of this article is to provide a brief introduction to the construction of an LL(1) parser in Go, which in this case is used to parse SQL queries.

For simplicity, we won’t deal with sub-selects, functions, complex nested expressions, and other features supported by full SQL dialects; those features would quickly outgrow the strategy we’re about to use.

One minute of theory

A parser consists of two parts:

  • Lexical analysis: “Tokeniser”

  • Syntax analysis: Creation of AST

Lexical analysis

Let’s define it by example, by “tokenising” the following query:

SELECT id, name FROM 'users.csv'

We want to identify the tokens that make up this query. The tokeniser’s output looks like this:

[]string{"SELECT", "id", ",", "name", "FROM", "'users.csv'"}
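To make that concrete, here’s a naive stand-alone tokeniser. It’s a sketch of mine, purely illustrative and not part of the parser we’ll build (which folds tokenising into its peek()/pop() functions instead):

package main

import (
  "fmt"
  "strings"
)

// naiveTokenise pads commas with spaces and splits on whitespace.
// It breaks on quoted strings containing spaces, among other
// things, which is why the real parser tokenises more carefully.
func naiveTokenise(sql string) []string {
  return strings.Fields(strings.ReplaceAll(sql, ",", " , "))
}

func main() {
  fmt.Println(naiveTokenise("SELECT id, name FROM 'users.csv'"))
  // Output: [SELECT id , name FROM 'users.csv']
}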

Syntax analysis

This step is where we look at the tokens, make sure they make sense, and parse them into a structure that represents the query in a way that’s more convenient for the application using it (for example, to execute the query, or to syntax-highlight it). After this step, we get something like this:

query{
  Type:      "Select",
  TableName: "users.csv",
  Fields:    ["id", "name"],
}
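The query package itself isn’t shown in this article; here’s a minimal sketch of it, with field and constant names inferred from the snippets that follow (the real package in the linked repo may differ):

package query

// Type identifies the kind of statement being parsed.
type Type string

const (
  UnknownType Type = ""
  Select      Type = "SELECT"
  Update      Type = "UPDATE"
  Delete      Type = "DELETE"
)

// Query is the structure the parser builds up. A full parser would
// also carry WHERE conditions, SET updates and so on.
type Query struct {
  Type      Type
  TableName string
  Fields    []string
}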

There are many reasons why parsing can fail, so it might be convenient to perform both steps at the same time and stop immediately if an error occurs.

Strategy

We’ll define a parser that looks like this:

type parser struct {
  sql   string      // The query to parse
  i     int         // Where we are in the query
  query query.Query // The "query struct" we'll build
  step  step        // What's this? Read on...
}

// Main function that returns the "query struct" or an error
func (p *parser) Parse() (query.Query, error) {}

// A "look-ahead" function that returns the next token to parse
func (p *parser) peek() string {}

// Same as peek(), but advancing our "i" index
func (p *parser) pop() string {}
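The step type isn’t defined in that snippet. Here’s a plausible sketch of it, covering only the steps this article mentions (the full parser needs many more):

// step enumerates the parser's states, i.e. the "nodes" described
// in the finite state machine section below.
type step int

const (
  stepType step = iota
  stepSelectField
  stepSelectComma
  stepSelectFromTable
  stepUpdateTable
  stepUpdateSet
  stepUpdateField
  stepUpdateEquals
  stepUpdateValue
  stepUpdateComma
)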

Intuitively, the first thing we do is peek() at the first token. In basic SQL syntax there are only a few valid initial tokens: SELECT, UPDATE, DELETE and so on; anything else is already an error. The code looks like this:

switch strings.ToUpper(p.peek()) {
case "SELECT":
  p.query.Type = "SELECT" // start building the "query struct"
  p.pop()
  // TODO continue with SELECT query parsing...
case "UPDATE":
  // TODO handle UPDATE
// TODO other cases...
default:
  return p.query, fmt.Errorf("invalid query type")
}

We could basically fill in the TODOs and make this run! However, the attentive reader will notice that the code to parse an entire SELECT query can quickly get messy, and we have many query types to parse. So we need some structure.

Finite state machine

FSMs are a fascinating topic, but we’re not here to talk about them, so we won’t go deep. Let’s focus only on what we need.

During parsing, only a few tokens are valid at any given point, and we’ll call these points “nodes”. After finding one of those tokens we move to a new node, where a different set of tokens is valid, and so on until parsing of the query is complete. We can visualise these node relationships as a directed graph.

These node transitions can alternatively be laid out in a simple table: one row per node, listing the tokens valid there and the node each one leads to.
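As an illustration, here’s a sketch of the UPDATE path’s transitions, reconstructed from the switch statement below rather than copied from the original table:

  step              valid tokens      next step
  ----------------  ----------------  ------------------------
  stepType          UPDATE            stepUpdateTable
  stepUpdateTable   <table name>      stepUpdateSet
  stepUpdateSet     SET               stepUpdateField
  stepUpdateField   <field name>      stepUpdateEquals
  stepUpdateEquals  =                 stepUpdateValue
  stepUpdateValue   <quoted value>    stepUpdateComma (or end)
  stepUpdateComma   ,                 stepUpdateField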

We can translate that table directly into one very large switch statement, using the parser.step field we defined earlier:

func (p *parser) Parse() (query.Query, error) {
  p.step = stepType // initial step
  for p.i < len(p.sql) {
    nextToken := p.peek()
    switch p.step {
    case stepType:
      switch nextToken {
      case "UPDATE":
        p.query.Type = "UPDATE"
        p.step = stepUpdateTable
      // TODO cases of other query types
      }
    case stepUpdateSet:
      // ...
    case stepUpdateField:
      // ...
    case stepUpdateComma:
      // ...
    }
    p.pop()
  }
  return p.query, nil
}
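To give a feel for how the remaining cases get filled in, here’s a hedged sketch of the stepUpdateTable case; names follow the snippets above, and the real implementation in the linked repo may differ:

case stepUpdateTable:
  tableName := p.peek()
  if len(tableName) == 0 {
    return p.query, fmt.Errorf("at UPDATE: expected table name")
  }
  p.query.TableName = tableName
  p.step = stepUpdateSet // the SET keyword must come next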

All right! Note that some steps may conditionally loop back to previous steps, such as the comma in a SELECT field list (sketched below). This strategy works well for basic parsers. However, as the grammar grows more complex, the number of states increases dramatically, and writing them all out becomes tedious. I recommend testing as you write the code; see below for more on that.
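Under the same assumptions as before, that loop-back might look like this:

case stepSelectComma:
  switch p.peek() {
  case ",":
    p.step = stepSelectField // loop back: parse another field
  case "FROM":
    p.step = stepSelectFromTable // done with fields
  default:
    return p.query, fmt.Errorf("at SELECT: expected comma or FROM")
  }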

peek() implementation

Remember, we need to implement both peek() and pop(). Since they’re nearly identical, we’ll use a shared helper to keep the code clean. In addition, pop() should advance the index further to skip past whitespace.

func (p *parser) peek() string {
  peeked, _ := p.peekWithLength()
  return peeked
}

func (p *parser) pop() string {
  peeked, len := p.peekWithLength()
  p.i += len
  p.popWhitespace()
  return peeked
}

func (p *parser) popWhitespace() {
  for ; p.i < len(p.sql) && p.sql[p.i] == ' '; p.i++ {
  }
}

Here is the list of reserved tokens we want to recognise:

var reservedWords = []string{
  "(", ")", ">=", "<=", "!=", ",", "=", ">", "<",
  "SELECT", "INSERT INTO", "VALUES", "UPDATE",
  "DELETE FROM", "WHERE", "FROM", "SET",
}

In addition, we might encounter quoted strings or plain identifiers (such as field names). Here is a complete implementation of peekWithLength():

func (p *parser) peekWithLength() (string, int) {
  if p.i >= len(p.sql) {
    return "", 0
  }
  for _, rWord := range reservedWords {
    token := p.sql[p.i:min(len(p.sql), p.i+len(rWord))]
    upToken := strings.ToUpper(token)
    if upToken == rWord {
      return upToken, len(upToken)
    }
  }
  if p.sql[p.i] == '\'' { // Quoted string
    return p.peekQuotedStringWithLength()
  }
  return p.peekIdentifierWithLength()
}
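One detail: min for integers only became a Go built-in in Go 1.21, so on older toolchains the code above needs a small helper:

func min(a, b int) int {
  if a < b {
    return a
  }
  return b
}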

The remaining functions are straightforward and left as an exercise for the reader. If you’re curious, the full source code is at the GitHub link at the end of this article.
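If you’d rather not work them out or jump to the repo just yet, here are plausible sketches of the two remaining helpers. Treat them as assumptions on my part, not the repo’s exact code:

// Returns the string between single quotes, honouring
// backslash-escaped quotes. The length includes both quotes, so
// pop() advances past the whole literal.
func (p *parser) peekQuotedStringWithLength() (string, int) {
  if p.i > len(p.sql)-2 || p.sql[p.i] != '\'' {
    return "", 0
  }
  for i := p.i + 1; i < len(p.sql); i++ {
    if p.sql[i] == '\'' && p.sql[i-1] != '\\' {
      return p.sql[p.i+1 : i], i - p.i + 1
    }
  }
  return "", 0
}

// An identifier is a run of letters, digits, underscores or '*'
// (to allow SELECT *).
func (p *parser) peekIdentifierWithLength() (string, int) {
  for i := p.i; i < len(p.sql); i++ {
    c := p.sql[i]
    isIdentChar := c == '_' || c == '*' ||
      ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') ||
      ('0' <= c && c <= '9')
    if !isIdentChar {
      return p.sql[p.i:i], i - p.i
    }
  }
  return p.sql[p.i:], len(p.sql) - p.i
}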

Final validation

The parser might reach the end of the string before getting a complete query definition. It’s a good idea to implement a parser.validate() function that looks at the generated “query” struct and returns an error if it is incomplete or otherwise wrong.
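Here’s a sketch of what such a function could look like, reusing the error messages the test cases below expect; the real validation rules depend on which statements you support:

func (p *parser) validate() error {
  if p.query.Type == query.UnknownType {
    return fmt.Errorf("query type cannot be empty")
  }
  if p.query.TableName == "" {
    return fmt.Errorf("table name cannot be empty")
  }
  if p.query.Type == query.Select && len(p.query.Fields) == 0 {
    return fmt.Errorf("at SELECT: expected field to SELECT")
  }
  return nil
}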

Testing

Go’s table-driven test pattern is a perfect fit for our situation:

type testCase struct {
  Name     string      // description of the test
  SQL      string      // input sql e.g. "SELECT a FROM 'b'"
  Expected query.Query // expected resulting "query" struct
  Err      error       // expected error result
}

Test example:

ts := []testCase{
  {
    Name:     "empty query fails",
    SQL:      "",
    Expected: query.Query{},
    Err:      fmt.Errorf("query type cannot be empty"),
  },
  {
    Name:     "SELECT without FROM fails",
    SQL:      "SELECT",
    Expected: query.Query{Type: query.Select},
    Err:      fmt.Errorf("table name cannot be empty"),
  },
  ...

And run the test cases like this:

for _, tc := range ts {
  t.Run(tc.Name, func(t *testing.T) {
    actual, err := Parse(tc.SQL)
    if tc.Err != nil && err == nil {
      t.Errorf("Error should have been %v", tc.Err)
    }
    if tc.Err == nil && err != nil {
      t.Errorf("Error should have been nil but was %v", err)
    }
    if tc.Err != nil && err != nil {
      require.Equal(t, tc.Err, err, "Unexpected error")
    }
    if len(actual) > 0 {
      require.Equal(t, tc.Expected, actual[0], "Query didn't match expectation")
    }
  })
}

I use testify’s require here because it prints a diff when the query structs don’t match.

Going deeper

This experiment is perfect for:

  • Learning how the LL(1) parsing algorithm works

  • Parsing simple custom syntaxes with zero dependencies

However, this approach becomes tedious quickly and has hard limitations. Consider how you would parse an arbitrarily complex compound expression such as sqrt(a) = (1 * (2 + 3)).

For a more powerful parsing model, look into parser combinators, or parser generators such as goyacc, a popular yacc implementation for Go.

 

The full parser source code is here:

http://github.com/marianogappa/sqlparser
