Preface

The Go compiler is usually abbreviated as the lowercase gc (Go compiler), which must be distinguished from the uppercase GC (garbage collection). The execution of the Go compiler can be broken down into several phases: lexical analysis, syntax analysis, abstract syntax tree construction, type checking, variable capture, function inlining, escape analysis, closure rewriting, walking and compiling functions, SSA generation, and machine code generation, as shown in the figure:

This article mainly explains the steps and details of lexical analysis.

1. Lexical analysis and Token

During the lexical parsing phase, the Go compiler scans the input Go source files and converts them to tokens.

A token is the smallest lexical unit with independent meaning in a programming language. Tokens include not only keywords but also user-defined identifiers, operators, delimiters, comments, and so on. Each token has three attributes: the token value, which indicates the kind of lexical unit; the source text of the token as it appears in the source code; and the position where the token appears. Among all tokens, comments and semicolons are two special cases. Ordinary comments generally do not affect the semantics of a program, so they can usually be ignored. In Go, semicolon tokens are automatically inserted at the end of lines; since semicolons are the lexical units that separate statements, this automatic insertion leads to small grammatical consequences, such as the opening brace not being allowed on its own line in Go.
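As a minimal illustration of that brace rule (a sketch written for this article, not code from the compiler), automatic semicolon insertion is what makes the commented-out form below a syntax error:

package main

func main() {
   x := 1

   // Valid: the opening brace is on the same line as the condition.
   if x > 0 {
   }

   // Invalid: a semicolon is automatically inserted after "x > 0" at the end
   // of the line, so putting the brace on its own line fails to compile:
   //
   //   if x > 0
   //   {
   //   }
}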

1.1 The concept of a Token

A Go program is mainly composed of tokens such as identifiers, keywords, operators, and delimiters.

Identifiers are the character sequences used to name variables, methods, functions, and so on. An identifier consists of letters, underscores, and digits, and its first character cannot be a digit (it may be an underscore). In plain terms, any name you can define yourself is an identifier. Note that the dollar sign $ is not a letter, so identifiers cannot contain a dollar sign. (Thanks to a careful reader for pointing out that identifiers may start with an underscore; I had missed that detail.)

Keywords are identifiers endowed with special meaning by the Go language; they can also be called reserved words. Keywords introduce special syntax structures and cannot be used as ordinary identifiers. Here are the 25 keywords defined by the Go language:

break        default      func         interface    select
case         defer        go           map          struct
chan         else         goto         package      switch
const        fallthrough  if           range        type
continue     for          import       return       var

The reason Go deliberately keeps its keyword set this small is to simplify parsing during compilation.
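The go/token package exposes helpers for exactly this classification. A small usage sketch (the identifier strings are arbitrary examples; IsKeyword and IsIdentifier exist in newer Go releases):

package main

import (
   "fmt"
   "go/token"
)

func main() {
   // token.Lookup maps an identifier to its keyword token, or to IDENT otherwise.
   fmt.Println(token.Lookup("func"))  // func
   fmt.Println(token.Lookup("myVar")) // IDENT

   // Predicates for the same classification.
   fmt.Println(token.IsKeyword("go"))     // true
   fmt.Println(token.IsIdentifier("_x1")) // true
   fmt.Println(token.IsIdentifier("1x"))  // false
}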

In addition to identifiers and keywords, tokens also include operators and delimiters. Here are the 47 such symbols defined by the Go language:

+     &      +=     &=      &&    ==    !=    (    )
-     |      -=     |=      ||    <     <=    [    ]
*     ^      *=     ^=      <-    >     >=    {    }
/     <<     /=     <<=     ++    =     :=    ,    ;
%     >>     %=     >>=     --    !     ...   .    :
      &^            &^=

Of course, besides user-defined identifiers, the 25 keywords, and the 47 operators and delimiters, a program also contains literals, comments, and whitespace. The first step in parsing a Go program is to parse these tokens.

1.2 Token definition

In the go/token package, Token is defined as an enumerated value; tokens with different values represent different kinds of lexical tokens:

// Token is the set of lexical tokens of the Go programming language.
type Token int

All tokens are divided into four categories: special types of tokens, tokens corresponding to base literals, operator tokens, and keywords.

There are three special types of tokens: error, end of file, and comment:

// The list of tokens.
const (
	// Special tokens
	ILLEGAL Token = iota
	EOF
	COMMENT

If an unrecognized token is found, that is, the source file contains a lexical error, ILLEGAL is returned, which simplifies error handling during lexical analysis. EOF means the end of the .go file has been reached. COMMENT represents a comment in the source code.

Next are the token types corresponding to basic literals. The basic literals defined by the Go language specification are mainly integer, floating-point, and imaginary values, plus character and string values. Note that the Boolean values true and false are not basic literals in the Go specification; however, to simplify lexical parsing, the go/token package handles identifiers such as true and false within the literal group (they are scanned as IDENT).

Here is a list of value class tokens:

    literal_beg
    // Identifiers and basic type literals
    // (these tokens stand for classes of literals)
    IDENT  // main
    INT    // 12345
    FLOAT  // 123.45
    IMAG   // 123.45i
    CHAR   // 'a'
    STRING // "abc"
    literal_end

literal_beg and literal_end are private token values used to mark the range of literal tokens, so checking whether a token's value lies between literal_beg and literal_end tells you whether it is a literal-class token. IDENT covers identifiers such as method names, type names, variable names, and constant names; keywords do not belong to this class. The bool literals true and false are also scanned as IDENT.
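This range check is exactly how the category predicates of go/token are implemented. A simplified sketch of those methods, mirroring the package internals quoted above (not standalone code, and details differ slightly between Go versions):

// Simplified from go/token: classification is done by comparing the token
// value against the private xxx_beg / xxx_end marker values.
func (tok Token) IsLiteral() bool  { return literal_beg < tok && tok < literal_end }
func (tok Token) IsOperator() bool { return operator_beg < tok && tok < operator_end }
func (tok Token) IsKeyword() bool  { return keyword_beg < tok && tok < keyword_end }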

The operator and delimiter types have the largest number of tokens. Here is a list of tokens:

    operator_beg
    // Operators and delimiters
    ADD // +
    SUB // -
    MUL // *
    QUO // /
    REM // %

    AND     // &
    OR      // |
    XOR     // ^
    SHL     // <<
    SHR     // >>
    AND_NOT // &^

    ADD_ASSIGN // +=
    SUB_ASSIGN // -=
    MUL_ASSIGN // *=
    QUO_ASSIGN // /=
    REM_ASSIGN // %=

    AND_ASSIGN     // &=
    OR_ASSIGN      // |=
    XOR_ASSIGN     // ^=
    SHL_ASSIGN     // <<=
    SHR_ASSIGN     // >>=
    AND_NOT_ASSIGN // &^=

    LAND  // &&
    LOR   // ||
    ARROW // <-
    INC   // ++
    DEC   // --

    EQL    // ==
    LSS    // <
    GTR    // >
    ASSIGN // =
    NOT    // !

    NEQ      // !=
    LEQ      // <=
    GEQ      // >=
    DEFINE   // :=
    ELLIPSIS // ...

    LPAREN // (
    LBRACK // [
    LBRACE // {
    COMMA  // ,
    PERIOD // .

    RPAREN    // )
    RBRACK    // ]
    RBRACE    // }
    SEMICOLON // ;
    COLON     // :
    operator_end

Operators mainly include the ordinary arithmetic symbols such as addition, subtraction, multiplication, and division, as well as binary operations such as logical, bitwise, and comparison operators (and their combinations with assignment). Besides the binary operators there are a few unary operators, such as the positive and negative signs, the address-of operator, and the channel receive operator. The delimiters are mainly parentheses, brackets, and braces, plus the comma, period, semicolon, and colon.

The keywords of Go correspond to 25 tokens of keyword type:

    keyword_beg
    // Keywords
    BREAK
    CASE
    CHAN
    CONST
    CONTINUE

    DEFAULT
    DEFER
    ELSE
    FALLTHROUGH
    FOR

    FUNC
    GO
    GOTO
    IF
    IMPORT

    INTERFACE
    MAP
    PACKAGE
    RANGE
    RETURN

    SELECT
    STRUCT
    SWITCH
    TYPE
    VAR
    keyword_end
)

From a purely lexical point of view, keywords are no different from ordinary identifiers. However, these 25 keywords generally begin different syntax structures, so defining them as dedicated tokens simplifies parsing.
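Inside go/token, this identifier-to-keyword mapping is built once into a map. A simplified sketch of that mechanism (it is filled from the tokens string table shown in the next block; details may vary between Go versions):

// Simplified from go/token: keyword lookup via a map built from the tokens string table.
var keywords map[string]Token

func init() {
   keywords = make(map[string]Token, keyword_end-(keyword_beg+1))
   for i := keyword_beg + 1; i < keyword_end; i++ {
      keywords[tokens[i]] = i
   }
}

// Lookup maps an identifier to its keyword token, or to IDENT if it is not a keyword.
func Lookup(ident string) Token {
   if tok, isKeyword := keywords[ident]; isKeyword {
      return tok
   }
   return IDENT
}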

The string form of every token is also stored in an array indexed by the token value; together with the xxx_beg and xxx_end markers above, this makes it easy to classify tokens and print them:

var tokens = [...]string{
   ILLEGAL: "ILLEGAL",

   EOF:     "EOF",
   COMMENT: "COMMENT",

   IDENT:  "IDENT",
   INT:    "INT",
   FLOAT:  "FLOAT",
   IMAG:   "IMAG",
   CHAR:   "CHAR",
   STRING: "STRING",

   ADD: "+",
   SUB: "-",
   MUL: "*",
   QUO: "/",
   REM: "%",

   // ... many more ...
   // ...
   
   SELECT: "select",
   STRUCT: "struct",
   SWITCH: "switch",
   TYPE:   "type",
   VAR:    "var",

   TILDE: "~",
}

Tokens matter to a programming language the way the 26 letters matter to English: they are the building blocks of more complex code, so it is worth becoming familiar with their features and classification.
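Because a Token value indexes into this table, printing a token directly yields either its symbolic name or its source text. A quick usage sketch:

package main

import (
   "fmt"
   "go/token"
)

func main() {
   // Printing a Token uses its String method, which reads the tokens table above.
   fmt.Println(token.ADD)    // +
   fmt.Println(token.IDENT)  // IDENT
   fmt.Println(token.RETURN) // return

   // The category predicates rely on the xxx_beg / xxx_end ranges internally.
   fmt.Println(token.ADD.IsOperator())   // true
   fmt.Println(token.RETURN.IsKeyword()) // true
   fmt.Println(token.INT.IsLiteral())    // true
}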

1.3 FileSet and File

A Go program is organized as packages made up of multiple files, and multiple packages are then linked into an executable, so the set of files belonging to one package can be regarded as the basic compilation unit of Go. For this reason, the go/token package also defines FileSet and File objects to describe file sets and files.

Here are the structure definitions of FileSet and File (both located in src/go/token/position.go):

type FileSet struct {
   mutex sync.RWMutex // protects the file set
   base  int          // base offset for the next file
   files []*File      // list of files in the order added to the set
   last  *File        // cache of last file looked up
}

type File struct {
   set  *FileSet
   name string // file name as provided to AddFile
   base int    // Pos value range for this file is [base...base+size]
   size int    // file size as provided to AddFile

   // lines and infos are protected by mutex
   mutex sync.Mutex
   lines []int // lines contains the offset of the first character for each line (the first entry is always 0)
   infos []lineInfo
}

The lineInfo type structures involved:

type lineInfo struct {
   Offset       int
   Filename     string
   Line, Column int
}

Let's go through the main fields of FileSet and File.

FileSet:

  • files is a slice that holds all the files added to the file set
  • base is the offset reserved for the next file. For example, if the files slice currently holds a single File, then the FileSet's base is 1 + file.size + 1: the first 1 is because base counting starts at 1 rather than 0, and the second 1 is because EOF also gets a position at the end of the file. The exact calculation is explained later.

File:

  • name is the file name
  • base is the base offset of the file within the file set
  • size is the length of the file's source in bytes
  • lines is an int slice holding the offset of the first character of each line in the file; it is what allows a token's line and column numbers to be computed
  • infos is a slice of lineInfo entries, which record alternative file/line/column information used when positions need to be adjusted (for example by //line comments)

This should give you a rough picture of FileSet and File: a FileSet behaves like one large array that stores the bytes of its files in sequence, and each File occupies the interval [base, base+size] of that array, where base is the position of the file's first byte in the array and size is the file's size.
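Here is a tiny sketch of that bookkeeping using the go/token API (the file name "a.go" and the size 10 are arbitrary):

package main

import (
   "fmt"
   "go/token"
)

func main() {
   fset := token.NewFileSet()
   fmt.Println(fset.Base()) // 1 — the initial base

   // Add a 10-byte file: it occupies Pos values [1, 1+10], and the base
   // reserved for the next file becomes 1 + 10 + 1 = 12 (the extra 1 is for EOF).
   fset.AddFile("a.go", fset.Base(), 10)
   fmt.Println(fset.Base()) // 12
}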

The mapping between FileSet and File objects is shown in the following figure:

You can see an extra Pos in the diagram whose range spans the whole large array. Pos is short for Position, and it is simply an index into that array.

Each File is described by its name, base, and size. base is the Pos index of the file's start within the FileSet, so base and base+size delimit the start and end of the File within the FileSet's array. Inside a File, positions are located by offset, and offset + File.base converts a file-local offset into a Pos. Conversely, because a Pos is a global offset within the FileSet, a Pos can also be used to find the corresponding File and the offset within it. More on that later.

The position of every token produced by lexical analysis is defined by a Pos, and from a Pos and its FileSet the corresponding File can easily be found; the line and column numbers are then computed from that File and the offset. (In the implementation, File only stores the starting offset of each line, not the original source data.) The underlying type of Pos is int, with semantics similar to a pointer, so the value 0 is defined as NoPos, meaning an invalid Pos.
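Here is a small sketch of these conversions (the file name and the offset 14 are tied to the example source in the code; SetLinesForContent is used only to fill the line table that the scanner would normally populate while scanning):

package main

import (
   "fmt"
   "go/token"
)

func main() {
   src := []byte("package main\n\nfunc main() {}\n")
   fset := token.NewFileSet()
   f := fset.AddFile("demo.go", fset.Base(), len(src))

   // Populate the line table without running the scanner.
   f.SetLinesForContent(src)

   // File.Pos turns a file-local offset into a set-wide Pos, and
   // FileSet.Position turns the Pos back into file/line/column.
   pos := f.Pos(14)                // offset of "func" in the source above
   fmt.Println(fset.Position(pos)) // demo.go:3:1
   fmt.Println(f.Offset(pos))      // 14
}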

1.4 Trying out the scanner

A lexical analyzer typically reads the input one character at a time and, based on the character just read plus a state machine, decides the type of the token being read. Sometimes a single character is not enough to make that decision, and one or more following characters must be read ahead to help the lexical analyzer decide.

The Go standard library's go/scanner package provides Scanner for token scanning; it performs lexical analysis on top of the FileSet and File abstractions. It has three important methods: Init(), next(), and Scan(). Init initializes the scanner, next reads the next character, and Scan is the core method that scans the code and returns tokens. Scan has three return values, representing the token's position, the token value, and the token's source text.

We can take the test case in src/go/scanner/example_test.go and modify it slightly:

package main

import (
   "fmt"
   "go/scanner"
   "go/token"
)

func main() {
   var src = []byte("println(\"hello world\")\n")
   var src1 = []byte("println(\"thank you\")")
   src = append(src, src1...)
   var fset = token.NewFileSet()
   var file = fset.AddFile("hello.go", fset.Base(), len(src))
   var s scanner.Scanner
   s.Init(file, src, nil, scanner.ScanComments)

   for {
      pos, tok, lit := s.Scan()
      if tok == token.EOF {
         break
      }
      fmt.Printf("%s\t%s\t%q\t\n", fset.Position(pos), tok, lit)
   }
}

Here src is the code to analyze. A file set is then created with token.NewFileSet(); token position information must be resolved through this file set, and it is also needed to create the File argument required by the scanner's Init method.

The fset.AddFile call then adds a new file named "hello.go" to the fset file set, with the same length as the code src being analyzed.

A scanner.Scanner object is then created and initialized by calling Init. The first argument to Init is the file object just added to fset, the second is the code to analyze, the third (nil) means there is no custom error handler, and the last argument, scanner.ScanComments, means comment tokens are not ignored.

Because the code contains multiple tokens, we call s.Scan() in a for loop to parse them one after another. If token.EOF is returned, the end of the file has been reached; otherwise the scan result is printed. Before printing, the pos value returned by the scanner is converted into a more detailed Position containing the file name and line number by calling fset.Position(pos).

The output of the above program is as follows:

hello.go:1:1	IDENT	"println"	
hello.go:1:8	(	""	
hello.go:1:9	STRING	"\"hello world\""	
hello.go:1:22	)	""	
hello.go:1:23	;	"\n"	
hello.go:2:1	IDENT	"println"	
hello.go:2:8	(	""	
hello.go:2:9	STRING	"\"thank you\""	
hello.go:2:20	)	""	
hello.go:2:21	;	"\n"

The first column of the output shows the file, line, and column of the token, the middle column shows the token type, and the last column shows the token's literal value.

1.5 A deeper look

If you have never touched compilers before, you may still be a little confused. For example, during lexical analysis, what is the relationship between all those offsets (offset, rdOffset, lineOffset)? How does Pos change? And how exactly is a Pos converted back to an offset? The previous section used fset.Position(pos) to turn a Pos into a Position with a file name, line, and column; this section looks at how that works.

Debugging the program above shows the scanner's implementation details directly; if stepping through it yourself is too much trouble, the results are summarized below.

First look at the scanner structure:

// A Scanner holds the scanner's internal state while processing
// a given text. It can be allocated as part of another data
// structure but must be initialized via Init before use.
//
type Scanner struct {
   // immutable state
   file *token.File  // source file handle
   dir  string       // directory portion of file.Name()
   src  []byte       // source
   err  ErrorHandler // error reporting; or nil
   mode Mode         // scanning mode

   // The core variable used by the lexical analyzer
   ch         rune // Current character
   offset     int  // The character offset
   rdOffset   int  // can be understood as the offset of ch
   lineOffset int  // The current row offset
   insertSemi bool // Insert a semicolon before a line break

   // public state - ok to modify
   ErrorCount int // number of errors encountered
}

Below are the values of ch, offset, rdOffset, and lineOffset, together with the pos, tok, and lit returned by Scan, recorded as each token of the example program is scanned:

println:        ch=40
                offset=7
                rdOffset=8
                lineOffset=0
                pos=1
                tok=IDENT(4)
                lit="println"

(               ch=34
                offset=8
                rdOffset=9
                lineOffset=0
                pos=8
                tok=LPAREN(49)
                lit=""

"hello world":  ch=41
                offset=21
                rdOffset=22
                lineOffset=0
                pos=9
                tok=STRING(9)
                lit=""hello world""

)               ch=10
                offset=22
                rdOffset=23
                lineOffset=0
                pos=22
                tok=RPAREN(54)
                lit=""
                
\n              ch=112
                offset=22
                rdOffset=24
                lineOffset=23
                pos=23
                tok=SEMICOLON(57)
                lit="\n"
                
println:        ch=40
                offset=30
                rdOffset=31
                lineOffset=23
                pos=24
                tok=IDENT(4)
                lit="println"

...

\n              ch=-1
                offset=43
                rdOffset=43
                lineOffset=23
                pos=44
                tok=SEMICOLON(57)
                lit="\n"

End of file     ch=-1
                offset=43
                rdOffset=43
                lineOffset=23
                pos=44
                tok=EOF(1)
                lit=""

Several details can be found:

  • After the current token has been scanned, ch already points to the next character
  • rdOffset is always one ahead of offset until the whole file has been parsed
  • Several tokens print lit as "": for single-character tokens such as (, ), {, and }, the tok value alone already says what the token is, so there is no need to fill in lit. For identifiers and basic literals, lit is definitely needed to carry the value; for example, if the token read is an INT, lit is what tells you whether it is 0, 1000, or some other valid integer.

That covers FileSet, File, offset, and Pos.

Now let's see how a Pos is converted into line and column numbers.

The line and column information for the file’s tokens is stored in the Position structure:

type Position struct {
   Filename string // filename, if any
   Offset   int    // offset, starting at 0
   Line     int    // line number, starting at 1
   Column   int    // column number, starting at 1 (byte count)
}

The fset.Position(pos) method converts the pos passed in into a Position:

func (s *FileSet) Position(p Pos) (pos Position) {
   return s.PositionFor(p, true)
}

func (s *FileSet) PositionFor(p Pos, adjusted bool) (pos Position) {
   if p != NoPos {
      if f := s.file(p); f != nil {
         return f.position(p, adjusted)
      }
   }
   return
}

func (f *File) position(p Pos, adjusted bool) (pos Position) {
   // If the relation between Pos and offset was not clear before, this line should make it clear
   offset := int(p) - f.base
   pos.Offset = offset
   // The core work happens in unpack
   pos.Filename, pos.Line, pos.Column = f.unpack(offset, adjusted)
   return
}

In the position method you can see that the line and column numbers are not computed from Pos directly: Pos is first converted into an offset, and the offset is what gets used.

Read on:

func (f *File) unpack(offset int, adjusted bool) (filename string, line, column int) {
   f.mutex.Lock()
   defer f.mutex.Unlock()
   filename = f.name
   if i := searchInts(f.lines, offset); i >= 0 {
      line, column = i+1, offset-f.lines[i]+1
   }
   if adjusted && len(f.infos) > 0 {
      // omitted
   }
   return
}
Copy the code

The adjusted parameter only matters for //line comments, so we can ignore it in the normal case. What remains shows that searchInts is what determines the line number (and, from it, the column). The inputs to searchInts are f.lines and offset. f.lines was introduced earlier; let's look at its definition again:

lines []int // lines contains the offset of the first character for each line (the first entry is always 0)

lines stores the offset of the first character of each line, and from that we can guess how the line and column numbers of a token are computed. Suppose lines is [0, 10, 20, 30, 40] and offset is 22. Since lines[2] <= offset < lines[3], the token must be on line 2+1 = 3, and its column is offset - lines[2] + 1 = 3. (Doesn't finding the line index this way smell like a binary search?)

See if searchInts implements this:

func searchInts(a []int, x int) int {
   i, j := 0, len(a)
   for i < j {
      h := i + (j-i)/2 // avoid overflow when computing h
      // i ≤ h < j
      if a[h] <= x {
         i = h + 1
      } else {
         j = h
      }
   }
   return i - 1
}

That’s true. Ha ha ha ha, I guessed right!

1.6 Going further

Finally, let's walk through the overall execution flow of the program above and some of the logic of Scan. The program code is pasted again here so you don't have to scroll back.

func main() {
   var src = ...
   var fset = token.NewFileSet()
   var file = fset.AddFile("hello.go", fset.Base(), len(src))
   var s scanner.Scanner
   s.Init(file, src, nil, scanner.ScanComments)

   for {
      pos, tok, lit := s.Scan()
      if tok == token.EOF {
         break
      }
      // Printf ...
   }
}

1.6.1 NewFileSet

NewFileSet creates a FileSet:

// NewFileSet creates a new file set.
func NewFileSet() *FileSet {
   return &FileSet{
      base: 1, // 0 == NoPos
   }
}

The initial base of a FileSet is 1, not 0; the zero value of base has the alias NoPos, and a Pos equal to NoPos means there is no file or line information at all.

const NoPos Pos = 0

1.6.2 AddFile

AddFile adds a file to a FileSet:

func (s *FileSet) AddFile(filename string, base, size int) *File {
   s.mutex.Lock()
   defer s.mutex.Unlock()
   if base < 0 {
      base = s.base
   }
   // omit some base/size checks
   // Create a File and initialize its fields. For our call
   // fset.AddFile("hello.go", fset.Base(), len(src)) the base value is 1,
   // because it comes from the FileSet's base
   f := &File{set: s, name: filename, base: base, size: size, lines: []int{0}}
   // The starting 1 is the base value; the extra +1 is the EOF placeholder
   base += size + 1 // +1 because EOF also has a position
   // omit some base/size checks
   // add the file to the file set
   s.base = base
   s.files = append(s.files, f)
   s.last = f
   return f
}

At this point, the preparation for the scan is complete.

1.6.3 Init

Init initializes the scanner:

func (s *Scanner) Init(file *token.File, src []byte, err ErrorHandler, mode Mode) {
   if file.Size() != len(src) {
      panic(fmt.Sprintf("file size (%d) does not match src len (%d)", file.Size(), len(src)))
   }
   s.file = file
   s.dir, _ = filepath.Split(file.Name())
   s.src = src
   s.err = err
   s.mode = mode

   s.ch = ' '
   s.offset = 0
   s.rdOffset = 0
   s.lineOffset = 0
   s.insertSemi = false
   s.ErrorCount = 0

   // Read the first character to be parsed
   s.next()
   if s.ch == bom {
      s.next() // ignore BOM at file beginning
   }
}

1.6.4 next

next reads the next character into s.ch:

// Because Go supports UTF-8 encoded source, next reads the next character, not the next byte.
// The method has no return value; instead it updates the scanner's fields (ch, offset, and so on).
func (s *Scanner) next() {
   if s.rdOffset < len(s.src) {
      s.offset = s.rdOffset
      if s.ch == '\n' {
         s.lineOffset = s.offset
         s.file.AddLine(s.offset)
      }
      // Read the character as a rune so that UTF-8 is handled correctly
      r, w := rune(s.src[s.rdOffset]), 1
      switch {
      case r == 0:
         s.error(s.offset, "illegal character NUL")
      case r >= utf8.RuneSelf:
         // Not ASCII:
         // utf8.DecodeRune decodes the bytes into a rune r of byte length w
         r, w = utf8.DecodeRune(s.src[s.rdOffset:])
         if r == utf8.RuneError && w == 1 {
            s.error(s.offset, "illegal UTF-8 encoding")
         } else if r == bom && s.offset > 0 {
            s.error(s.offset, "illegal byte order mark")
         }
      }
      s.rdOffset += w
      s.ch = r
   } else {
      s.offset = len(s.src)
      if s.ch == '\n' {
         s.lineOffset = s.offset
         s.file.AddLine(s.offset)
      }
      // When the end of the file is reached, ch is set to -1
      s.ch = -1 // EOF
   }
}

1.6.5 Scan

The Scan method is the core of the lexical analyzer and, as mentioned above, works as a state machine. Each call to Scan returns one token.

func (s *Scanner) Scan() (pos token.Pos, tok token.Token, lit string) {
scanAgain:
   // Skip whitespace
   s.skipWhitespace()

   // Start position of the current token
   pos = s.file.Pos(s.offset)

   // insertSemi decides whether a semicolon is inserted before a newline
   insertSemi := false
   // Determine the current token type based on the current character ch
   switch ch := s.ch; {
   // If the current character is a letter (a-z, A-Z, or a Unicode letter),
   // parse it as an identifier token. The identifier may also be a keyword,
   // so token.Lookup checks whether it is one.
   case isLetter(ch):
      // Scan the whole identifier
      lit = s.scanIdentifier()
      if len(lit) > 1 {
         // Check whether the identifier is a keyword
         tok = token.Lookup(lit)
         switch tok {
         case token.IDENT, token.BREAK, token.CONTINUE, token.FALLTHROUGH, token.RETURN:
            insertSemi = true
         }
      } else {
         insertSemi = true
         tok = token.IDENT
      }
   // If the current character is a digit (or a '.' followed by a digit),
   // scanNumber decides whether the token is an INT or FLOAT
   case isDecimal(ch) || ch == '.' && isDecimal(rune(s.peek())):
      insertSemi = true
      tok, lit = s.scanNumber()
   default:
      // Advance s.ch to the next character; several token types depend on the next character
      s.next()
      // Note: ch here is still the character read before the call to s.next() above
      switch ch {
      case -1:
         if s.insertSemi {
            s.insertSemi = false // EOF consumes insertSemi
            return pos, token.SEMICOLON, "\n"
         }
         tok = token.EOF
      case '\n':
         s.insertSemi = false // the newline consumes insertSemi
         return pos, token.SEMICOLON, "\n"
      // The current token is a string literal of the form "abc..."
      case '"':
         insertSemi = true
         tok = token.STRING
         lit = s.scanString()
      // The current token is a character literal
      case '\'':
         insertSemi = true
         tok = token.CHAR
         lit = s.scanRune()
      // The current token is a raw string literal of the form `abc...`
      case '`':
         insertSemi = true
         tok = token.STRING
         lit = s.scanRawString()
      // For ':' the token type depends on the next character: if it is '=',
      // the token is ":=", otherwise just ':' — this is why next() runs first
      case ':':
         tok = s.switch2(token.COLON, token.DEFINE)
      // A '.' followed by a digit was already handled by the number case above;
      // here '.' is either PERIOD or, if followed by two more dots, ELLIPSIS
      case '.':
         tok = token.PERIOD
         if s.ch == '.' && s.peek() == '.' {
            s.next()
            s.next() // consume last '.'
            tok = token.ELLIPSIS
         }
      case ',':
         tok = token.COMMA
      case ';':
         tok = token.SEMICOLON
         lit = ";"
      case '(':
         tok = token.LPAREN
      case ')':
         insertSemi = true
         tok = token.RPAREN
      case '[':
         tok = token.LBRACK
      case ']':
         insertSemi = true
         tok = token.RBRACK
      case '{':
         tok = token.LBRACE
      case '}':
         insertSemi = true
         tok = token.RBRACE
      // If ch is '+', the possible results are '+', '+=' and '++';
      // the exact token type depends on s.ch
      case '+':
         tok = s.switch3(token.ADD, token.ADD_ASSIGN, '+', token.INC)
         if tok == token.INC {
            insertSemi = true
         }
      case '-':
         tok = s.switch3(token.SUB, token.SUB_ASSIGN, '-', token.DEC)
         if tok == token.DEC {
            insertSemi = true
         }
      case '*':
         tok = s.switch2(token.MUL, token.MUL_ASSIGN)
      case '/':
         // If ch is '/' and s.ch is '/' or '*', the current token is a comment
         if s.ch == '/' || s.ch == '*' {
            if s.insertSemi && s.findLineEnd() {
               // reset position to the beginning of the comment
               s.ch = '/'
               s.offset = s.file.Offset(pos)
               s.rdOffset = s.offset + 1
               s.insertSemi = false
               return pos, token.SEMICOLON, "\n"
            }
            comment := s.scanComment()
            if s.mode&ScanComments == 0 {
               s.insertSemi = false
               goto scanAgain
            }
            tok = token.COMMENT
            lit = comment
         } else {
            // If it is not a comment, the result is the division operator or '/='
            tok = s.switch2(token.QUO, token.QUO_ASSIGN)
         }
      // The possible results are the modulo operator or '%='
      case '%':
         tok = s.switch2(token.REM, token.REM_ASSIGN)
      case '^':
         tok = s.switch2(token.XOR, token.XOR_ASSIGN)
      case '<':
         if s.ch == '-' {
            s.next()
            tok = token.ARROW
         } else {
            tok = s.switch4(token.LSS, token.LEQ, '<', token.SHL, token.SHL_ASSIGN)
         }
      case '>':
         tok = s.switch4(token.GTR, token.GEQ, '>', token.SHR, token.SHR_ASSIGN)
      case '=':
         tok = s.switch2(token.ASSIGN, token.EQL)
      case '!':
         tok = s.switch2(token.NOT, token.NEQ)
      case '&':
         if s.ch == '^' {
            s.next()
            tok = s.switch2(token.AND_NOT, token.AND_NOT_ASSIGN)
         } else {
            tok = s.switch3(token.AND, token.AND_ASSIGN, '&', token.LAND)
         }
      case '|':
         tok = s.switch3(token.OR, token.OR_ASSIGN, '|', token.LOR)
      default:
         if ch != bom {
            s.errorf(s.file.Offset(pos), "illegal character %#U", ch)
         }
         insertSemi = s.insertSemi // preserve insertSemi info
         tok = token.ILLEGAL
         lit = string(ch)
      }
   }
   if s.mode&dontInsertSemis == 0 {
      s.insertSemi = insertSemi
   }

   return
}
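One detail worth calling out: Scan leans on the helpers switch2, switch3, and switch4 to disambiguate operators that share a prefix. A simplified sketch of these helpers, mirroring go/scanner (details may differ between Go versions):

// switch2: e.g. ':' vs ":=" — tok1 is chosen when the next character is '='.
func (s *Scanner) switch2(tok0, tok1 token.Token) token.Token {
   if s.ch == '=' {
      s.next()
      return tok1
   }
   return tok0
}

// switch3: e.g. '+' vs "+=" vs "++".
func (s *Scanner) switch3(tok0, tok1 token.Token, ch2 rune, tok2 token.Token) token.Token {
   if s.ch == '=' {
      s.next()
      return tok1
   }
   if s.ch == ch2 {
      s.next()
      return tok2
   }
   return tok0
}

// switch4: e.g. '<' vs "<=" vs "<<" vs "<<=".
func (s *Scanner) switch4(tok0, tok1 token.Token, ch2 rune, tok2, tok3 token.Token) token.Token {
   if s.ch == '=' {
      s.next()
      return tok1
   }
   if s.ch == ch2 {
      s.next()
      if s.ch == '=' {
         s.next()
         return tok3
      }
      return tok2
   }
   return tok0
}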

1.7 Summary

Lexical analysis can be understood simply as the process of translating source code into a sequence of tokens according to a set of rules. For example, translating the following source:

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Hello")
}

Then the output Token sequence is:

PACKAGE  IDENT

IMPORT  LPAREN
    QUOTE IDENT QUOTE
RPAREN

FUNC  IDENT LPAREN RPAREN  LBRACE
    IDENT DOT IDENT LPAREN QUOTE IDENT QUOTE RPAREN
RBRACE

Of course, such a plain token sequence by itself is not much help; the subsequent stages of analysis keep building on top of these tokens.
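If you want to reproduce a token stream like this yourself, the scanner from section 1.4 can do it. A sketch (note that the real scanner also reports STRING literals and the automatically inserted SEMICOLON tokens, so its output is a little richer than the hand-written sequence above):

package main

import (
   "fmt"
   "go/scanner"
   "go/token"
)

func main() {
   src := []byte("package main\n\nimport (\n\t\"fmt\"\n)\n\nfunc main() {\n\tfmt.Println(\"Hello\")\n}\n")
   fset := token.NewFileSet()
   file := fset.AddFile("hello.go", fset.Base(), len(src))

   var s scanner.Scanner
   s.Init(file, src, nil, 0) // mode 0: comments are skipped

   for {
      _, tok, _ := s.Scan()
      if tok == token.EOF {
         break
      }
      fmt.Print(tok, " ")
   }
   fmt.Println()
}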

1.8 Extensions

The lexical analysis code shown above lives under the src/go directory. There is a similar set of lexical analysis code under src/cmd/compile/internal; the implementation details differ somewhat, but the principle is the same.

So why are there two compiler front ends? The code under src/cmd/compile/internal is used by the Go compiler itself, while the set under src/go serves applications that need front-end analysis, such as the officially cited gofmt:

We know that gofmt formats code by parsing it into an AST and then turning the AST back into source code.
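A minimal sketch of that round trip using go/parser and go/format (the badly formatted source string is an arbitrary example; this shows only the core idea, not gofmt's actual implementation):

package main

import (
   "fmt"
   "go/format"
   "go/parser"
   "go/token"
   "os"
)

func main() {
   // Deliberately badly formatted source.
   src := []byte("package main\nimport \"fmt\"\nfunc   main( ){fmt.Println( \"hi\" )}\n")

   fset := token.NewFileSet()
   f, err := parser.ParseFile(fset, "demo.go", src, parser.ParseComments)
   if err != nil {
      panic(err)
   }

   // Printing the AST back out produces gofmt-style formatted source.
   if err := format.Node(os.Stdout, fset, f); err != nil {
      panic(err)
   }
   fmt.Println()
}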

At this point the exploration of lexical analysis is over; next up are syntax analysis, the abstract syntax tree, type checking, and so on. Typing this many words in one sitting is tiring. (Escapes)

Reference:

Gitee.com/amell/go-as…

Blog.csdn.net/zhaoruixian…