Preface
The previous article introduced the core technique of lexical analysis, the deterministic finite automaton (DFA), together with the usage and working principles of two common lexical analyzers. With that background, reading the source code of Go's lexical analysis will be much easier.
This article mainly covers the following:
- Go's compilation entry file, and what it does
- Where lexical analysis sits in the Go compilation process, and what the detailed process looks like
- Writing a test Go source file, running lexical analysis on it, and examining the results
Source code analysis
Go compiler entry
To get a clearer picture of how Go's compilation process reaches the lexical analysis step, let's first look at where the Go compiler's entry file is and roughly what it does.
The Go compilation entry file is src/cmd/compile/main.go -> gc.Main(archInit)
The gc.Main(archInit) function is quite long. What it mainly does is read the arguments passed on the command line and update the compilation options and configuration accordingly. Further down you will see this line of code:
lines := parseFiles(flag.Args())
This is the entry point for lexical and syntax analysis. It performs lexical and syntax analysis on the given files, producing a syntax tree for each, which is then built into an abstract syntax tree and type-checked; those later steps will be covered in a future article.
Open parseFiles(flag.Args()) and you can see the following (I omitted the latter part of the code to focus on lexical analysis):
func parseFiles(filenames []string) uint {
    noders := make([]*noder, 0, len(filenames))
    // Limit the number of simultaneously open files.
    sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

    for _, filename := range filenames {
        p := &noder{
            basemap: make(map[*syntax.PosBase]*src.PosBase),
            err:     make(chan syntax.Error),
        }
        noders = append(noders, p)

        go func(filename string) {
            sem <- struct{}{}
            defer func() { <-sem }()
            defer close(p.err)
            base := syntax.NewFileBase(filename)

            f, err := os.Open(filename)
            if err != nil {
                p.error(syntax.Error{Msg: err.Error()})
                return
            }
            defer f.Close()

            p.file, _ = syntax.Parse(base, f, p.error, p.pragma, syntax.CheckBranches) // errors are tracked via p.error
        }(filename)
    }
    ......
}
As we know, during Go compilation each source file is eventually parsed into its own syntax tree. As the first few lines of the code above show, parseFiles spawns multiple goroutines to process the source files, but it limits how many files can be open at the same time:
sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)
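Seen in isolation, this is the standard Go idiom of using a buffered channel as a counting semaphore. Here is a minimal, standalone sketch of the same bounded-concurrency pattern (the file names and the per-file work are made up; this is not compiler code):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    // The buffered channel acts as a counting semaphore: at most cap(sem)
    // goroutines can hold a slot at the same time.
    sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)
    var wg sync.WaitGroup

    files := []string{"a.go", "b.go", "c.go"} // hypothetical file names
    for _, name := range files {
        wg.Add(1)
        go func(name string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot; blocks while the buffer is full
            defer func() { <-sem }() // release the slot when this goroutine returns
            fmt.Println("processing", name)
        }(name)
    }
    wg.Wait()
}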
The source files are then traversed in the for loop, and a goroutine is started for each file to perform its lexical and syntax analysis (the go func body). Inside the go func, the information about the source file is initialized first, mainly recording the file's name, line, and column; this is used to report the exact location of any error found during lexical or syntax analysis. The relevant structures are:
type PosBase struct {
    pos       Pos
    filename  string
    line, col uint32
}

type Pos struct {
    base      *PosBase
    line, col uint32
}
The go func then opens the source file and initializes the syntax parser. The reason the syntax parser is initialized here is that, in Go's compilation, lexical analysis and syntax analysis are bundled together: initializing the syntax parser also initializes the lexical analyzer. We can follow the syntax.Parse function called inside the go func:
func Parse(base *PosBase, src io.Reader, errh ErrorHandler, pragh PragmaHandler, mode Mode) (_ *File, first error) {
    defer func() {
        if p := recover(); p != nil {
            if err, ok := p.(Error); ok {
                first = err
                return
            }
            panic(p)
        }
    }()

    var p parser
    p.init(base, src, errh, pragh, mode) // initialization
    p.next()                             // the lexical analyzer reads the first token from the source file
    return p.fileOrNil(), p.first        // the syntax parser parses the token stream
}
You can see that the parser's initialization is invoked:
p.init(base, src, errh, pragh, mode)
Going into p.init, we see the line of code that initializes the lexical analyzer:
p.scanner.init(...) // the arguments passed here initialize the lexical analyzer
So the syntax parser calls the lexical analyzer's init method through p.scanner; this works because the scanner struct is embedded in the parser struct. Since this article focuses on lexical analysis, I won't go through the meaning of the parser's fields here; the parser struct will be described in detail in the article on syntax analysis.
// The parser struct
type parser struct {
    file  *PosBase
    errh  ErrorHandler
    mode  Mode
    pragh PragmaHandler

    scanner // embedded lexical analyzer

    base   *PosBase
    first  error
    errcnt int
    pragma Pragma

    fnest  int
    xnest  int
    indent []byte
}
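If you are not familiar with struct embedding, the following toy example (made-up names, not compiler code) shows why the parser can call the embedded scanner's methods either directly or through the explicit p.scanner form:

package main

import "fmt"

// scannerLike stands in for the real scanner; parserLike embeds it the same
// way parser embeds scanner in the compiler source.
type scannerLike struct{ pos int }

func (s *scannerLike) next() { s.pos++ }

type parserLike struct {
    scannerLike // embedded: its fields and methods are promoted to parserLike
}

func main() {
    var p parserLike
    p.next()             // promoted method: actually calls scannerLike.next()
    p.scannerLike.next() // the explicit form, analogous to p.scanner.init(...) above
    fmt.Println(p.pos)   // 2
}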
Now that the relationship between syntax analysis and lexical analysis is clear, let's look at the concrete process of lexical analysis.
Lexical analysis process
The code location for lexical analysis is:
src/cmd/compile/internal/syntax/scanner.go
Lexers are implemented with a structure that looks like this:
type scanner struct {
    source      // source is also a struct; it records information about the source file being scanned, such as its contents as a byte array and the currently scanned character and its position
    mode   uint // controls whether comments are parsed
    nlsemi bool // if set, '\n' and EOF translate to ';'

    // current token, valid after calling next()
    line, col uint     // position of the currently matched token, starting at 0
    blank     bool     // line is blank up to col
    tok       token    // the token matched for the current string (all token types supported by Go are listed in tokens.go)
    lit       string   // the source text of the current token; e.g. for "if" in the source, tok is _If and lit is "if"
    bad       bool     // if there is a syntax error, lit may be incorrect
    kind      LitKind  // if the token is a literal, this identifies which kind of value it is, such as IntLit, FloatLit, RuneLit
    op        Operator // similar to kind: identifies the operator for operator tokens (if any)
    prec      int      // valid if tok is _Operator, _AssignOp, or _IncOp
}

type source struct {
    in   io.Reader
    errh func(line, col uint, msg string)

    buf       []byte // byte array of the source file contents
    ioerr     error  // file read error, if any
    b, r, e   int    // buffer indices (see comment above)
    line, col uint   // position of the currently scanned character
    ch        rune   // the character currently scanned
    chw       int    // width of ch
}
Now that you know what each field in the lexical parser structure means, let’s look at what types of tokens are available in Go
Token
A token is the smallest lexical unit with independent meaning in a programming language. Tokens mainly include keywords, user-defined identifiers, operators, delimiters, comments, and so on. They can all be found in src/cmd/compile/internal/syntax/tokens.go; part of them is listed below (these tokens are defined as constants):
const (
    _    token = iota
    _EOF // EOF

    // names and literals
    _Name    // name
    _Literal // literal

    // operators and operations
    // _Operator is excluding '*' (_Star)
    _Operator // op
    _AssignOp // op=
    _IncOp    // opop
    _Assign   // =
    ...

    // delimiters
    _Lparen // (
    _Lbrack // [
    _Lbrace // {
    _Rparen // )
    _Rbrack // ]
    _Rbrace // }
    ...

    // keywords
    _Break    // break
    _Case     // case
    _Chan     // chan
    _Const    // const
    _Continue // continue
    _Default  // default
    _Defer    // defer
    ...

    // empty line comment to exclude it from .String
    tokenCount //
)
The three most important attributes of the lexical unit corresponding to each token are: the token type, the text of the token as it appears in the source code, and the position where the token appears. Comments and semicolons are two special tokens; ordinary comments generally do not affect the semantics of the program, so they can often be ignored.
All tokens are divided into four categories:
- Special tokens, such as _EOF
- Tokens for basic literal values, such as IntLit, FloatLit, ImagLit
- Operators, such as Add (+), Sub (-), Mul (*)
- Keywords, such as _Break (break), _Case (case)
Lexical analysis implementation
The lexical analyzer has two core methods: nextch() and next().
As we know, lexical analysis reads the source file character by character; nextch() is the method that continuously reads the file's contents from left to right, one character at a time.
The following is part of the nextch() function, which fetches the next unprocessed character and updates the scanning position:
func (s *source) nextch() {
redo:
    s.col += uint(s.chw)
    if s.ch == '\n' {
        s.line++
        s.col = 0
    }

    // fast common case: at least one ASCII character
    if s.ch = rune(s.buf[s.r]); s.ch < sentinel {
        s.r++
        s.chw = 1
        if s.ch == 0 {
            s.error("invalid NUL character")
            goto redo
        }
        return
    }

    // slower general case: add more bytes to buffer if we don't have a full rune
    for s.e-s.r < utf8.UTFMax && !utf8.FullRune(s.buf[s.r:s.e]) && s.ioerr == nil {
        s.fill()
    }

    // EOF
    if s.r == s.e {
        if s.ioerr != io.EOF {
            // ensure we never start with a '/' (e.g., rooted path) in the error message
            s.error("I/O error: " + s.ioerr.Error())
            s.ioerr = nil
        }
        s.ch = -1
        s.chw = 0
        return
    }
    ...
}
The next() function then applies the deterministic-finite-automaton approach introduced in the previous article to the scanned characters: it splits the input into strings and matches them to the corresponding tokens. Part of the core code of next() is shown below:
func (s *scanner) next() {
    nlsemi := s.nlsemi
    s.nlsemi = false

redo:
    // skip white space
    s.stop()
    startLine, startCol := s.pos()
    for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
        s.nextch()
    }

    // token start
    s.line, s.col = s.pos()
    s.blank = s.line > startLine || startCol == colbase
    s.start()
    if isLetter(s.ch) || s.ch >= utf8.RuneSelf && s.atIdentChar(true) {
        s.nextch()
        s.ident()
        return
    }

    switch s.ch {
    case -1:
        if nlsemi {
            s.lit = "EOF"
            s.tok = _Semi
            break
        }
        s.tok = _EOF

    case '\n':
        s.nextch()
        s.lit = "newline"
        s.tok = _Semi

    case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
        s.number(false)

    case '"':
        s.stdString()
    ......
}
A complete description of what these two methods do:
- Each time, the lexical analyzer fetches the latest unparsed character via the nextch() function
- Based on the scanned character, the next() function decides what kind of token it is looking at. For example, if the current character is 'a', it tries to match an identifier by calling the s.ident() method (which also determines whether the identifier is a keyword)
- If the scanned character is a digit, it tries to match a basic literal type (such as IntLit, FloatLit, ImagLit)
- next() recognizes one token and hands it to the syntax parser, which then fetches the following token by calling the lexical analyzer's next() function again. In other words, the lexical analyzer does not translate the whole source file into tokens in one pass and then hand them over; instead, the syntax parser pulls one token at a time, whenever it needs one (see the sketch below)
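As a minimal illustration of that pull model, here is a self-contained sketch using made-up types (deliberately far simpler than the real parser and scanner): the parser embeds a scanner and asks it for exactly one more token each time the current token is consumed.

package main

import "fmt"

// tokenKind is a made-up token type for this sketch.
type tokenKind int

const (
    tEOF tokenKind = iota
    tName
    tDefine
    tNumber
)

// miniScanner stands in for the real scanner: next() produces one token at a time.
type miniScanner struct {
    toks []tokenKind
    pos  int
    tok  tokenKind
}

func (s *miniScanner) next() {
    if s.pos < len(s.toks) {
        s.tok = s.toks[s.pos]
        s.pos++
        return
    }
    s.tok = tEOF
}

// miniParser embeds the scanner and pulls a new token only when it has
// consumed the current one, instead of tokenizing the whole input up front.
type miniParser struct {
    miniScanner
}

func (p *miniParser) want(tok tokenKind) {
    if p.tok != tok {
        fmt.Printf("syntax error: got %v, want %v\n", p.tok, tok)
    }
    p.next() // pull exactly one more token from the scanner
}

func main() {
    p := &miniParser{miniScanner{toks: []tokenKind{tName, tDefine, tNumber}}}
    p.next()        // load the first token
    p.want(tName)   // an identifier
    p.want(tDefine) // :=
    p.want(tNumber) // a numeric literal
}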
We can also see this loop in the next() function:
for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
    s.nextch()
}
It skips over spaces, tabs, newlines, and carriage returns in the source file.
To see how the scanner decides whether it is looking at an identifier, a number, or a string literal, look at the internal implementations of ident(), number(), and stdString(); a much simplified sketch of the idea behind ident() follows.
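The following is a heavily simplified, hypothetical sketch of that idea (the helper name and keyword table are made up; the real ident() is more involved): consume identifier characters, then check the accumulated text against a keyword table.

package main

import (
    "fmt"
    "unicode"
)

// keywords is a tiny stand-in for the compiler's keyword table.
var keywords = map[string]string{"if": "_If", "func": "_Func", "return": "_Return"}

// scanIdent reads an identifier starting at src[i] and returns the token
// name, its literal text, and the index of the first unconsumed character.
func scanIdent(src []rune, i int) (tok, lit string, next int) {
    start := i
    for i < len(src) && (unicode.IsLetter(src[i]) || unicode.IsDigit(src[i]) || src[i] == '_') {
        i++
    }
    lit = string(src[start:i])
    if kw, ok := keywords[lit]; ok {
        return kw, lit, i // a keyword gets its own token
    }
    return "_Name", lit, i // everything else is a plain identifier
}

func main() {
    src := []rune("if abc")
    tok, lit, next := scanIdent(src, 0)
    fmt.Println(tok, lit) // _If if
    tok, lit, _ = scanIdent(src, next+1) // skip the space
    fmt.Println(tok, lit) // _Name abc
}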
To tie everything together, here is a flow chart of the lexical analysis process, starting from the Go compiler's entry point.
If reading the source code alone hasn't given you a clear picture of the lexical analyzer, the following sections use the test file and the standard library provided by Go to actually run a lexical analyzer and see how it works.
Test the lexical analysis process
There are two ways to test lexical analysis: you can directly compile and run the lexical analyzer test file provided by Go, or you can use the standard library that Go provides.
Lexical analyzer test file: src/cmd/compile/internal/syntax/scanner_test.go
Lexical analyzer standard library: src/go/scanner/scanner.go
Below I'll write a source file, hand it to the lexical analyzer, and look at how it is parsed and what the results are.
Test the lexical analyzer with the test file
We can directly compile and run the TestScanner method in src/cmd/compile/internal/syntax/scanner_test.go. Its source code (with comments added) is as follows:
func TestScanner(t *testing.T) {
    if testing.Short() {
        t.Skip("skipping test in short mode")
    }

    filename := *src_ // can be changed via -src flag
    // Here you can use an absolute path to the source file you want to parse
    src, err := os.Open("/Users/shulv/studySpace/GolangProject/src/data_structure_algorithm/SourceCode/Token/aa.go")
    if err != nil {
        t.Fatal(err)
    }
    defer src.Close()

    var s scanner
    s.init(src, errh, 0) // initialize the lexical analyzer
    for {
        s.next() // fetch a token (next() calls nextch() internally to read characters until a token is matched)
        if s.tok == _EOF {
            break
        }
        if !testing.Verbose() {
            continue
        }
        switch s.tok { // the token that was obtained
        case _Name, _Literal: // an identifier or a basic literal
            // print the file name, line, column, token, and the token's text in the source file
            fmt.Printf("%s:%d:%d: %s => %s\n", filename, s.line, s.col, s.tok, s.lit)
        case _Operator:
            fmt.Printf("%s:%d:%d: %s => %s (prec = %d)\n", filename, s.line, s.col, s.tok, s.op, s.prec)
        default:
            fmt.Printf("%s:%d:%d: %s\n", filename, s.line, s.col, s.tok)
        }
    }
}
The test function first opens your source file and passes its contents to the lexical analyzer's initialization function. It then fetches one token after another in an infinite loop by calling next(), until the _EOF token is reached.
The file I want the lexical analyzer to parse looks like this:
package Token

import "fmt"

func testScanner() {
    a := 666
    if a == 666 {
        fmt.Println("Learning Scanner")
    }
}
Then run the test method with the following commands (the scanner struct has more fields than are printed here; you can print them out to see additional information):
# cd /usr/local/go/src/cmd/compile/internal/syntax
# go test -v -run="TestScanner"

Command output:

=== RUN TestScanner
parser.go:1:1: package
parser.go:1:9: name => Token
parser.go:1:14: ;
parser.go:3:1: import
parser.go:3:8: literal => "fmt"
parser.go:3:13: ;
parser.go:5:1: func
parser.go:5:6: name => testScanner
parser.go:5:17: (
parser.go:5:18: )
parser.go:5:21: {
parser.go:6:2: name => a
parser.go:6:4: :=
parser.go:6:7: literal => 666
parser.go:6:10: ;
parser.go:7:2: if
parser.go:7:5: name => a
parser.go:7:7: op => == (prec = 3)
parser.go:7:10: literal => 666
parser.go:7:14: {
parser.go:8:3: name => fmt
parser.go:8:6: .
parser.go:8:7: name => Println
parser.go:8:14: (
parser.go:8:15: literal => "Learning Scanner"
parser.go:8:33: )
parser.go:8:34: ;
parser.go:9:2: }
parser.go:9:3: ;
parser.go:10:1: }
parser.go:10:2: ;
--- PASS: TestScanner (0.00s)
PASS
ok cmd/compile/internal/syntax 0.007s
Test the lexical analyzer with the standard library
The other way to test is through the standard library provided by Go. Below I'll show how to use the methods in the standard library to test the lexical analyzer. You write a bit of code yourself that calls the standard library to carry out a lexical analysis, for example:
package Token

import (
    "fmt"
    "go/scanner"
    "go/token"
)

func TestScanner1() {
    src := []byte("cos(x)+2i*sin(x) //Comment") // the content I want to parse (you could also use the byte contents of a file)
    // Initialize the scanner
    var s scanner.Scanner
    fset := token.NewFileSet()                      // initialize a file set (explained below)
    file := fset.AddFile("", fset.Base(), len(src)) // add a file to the file set
    s.Init(file, src, nil, scanner.ScanComments)    // the third parameter is the mode; passing ScanComments means comments are scanned too, which usually isn't needed

    // Scan
    for {
        pos, tok, lit := s.Scan() // equivalent to the next() function
        if tok == token.EOF {
            break
        }
        fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit) // fset.Position(pos): get the position information
    }
}
Executing the above code produces the following result:
1:1 IDENT "cos"
1:4 ( ""
1:5 IDENT "x"
1:6 ) ""
1:7 + ""
1:8 IMAG "2i"
1:10 * ""
1:11 IDENT "sin"
1:14 ( ""
1:15 IDENT "x"
1:16 ) ""
1:18 ; "\n"
1:18 COMMENT "//Comment"
You will notice that the API used here is completely different from the one used in the test file. That is because the standard library implements a separate lexical analyzer rather than reusing the Go compiler's scanner code. My understanding is that the code inside the Go compiler cannot be exposed as a public API for external use; to stay safe, its methods have to remain private.
If you look at the lexical analyzer implementation in the standard library, it is not exactly the same as the one in the Go compiler, but the core ideas are the same (character scanning and token recognition). The difference lies in how the files to be parsed are handled. As we know, in Go multiple files make up a package, and packages are then linked into an executable, so the set of files belonging to a single package can be regarded as Go's basic compilation unit. For this reason, the lexical analyzer provided by the Go standard library also defines FileSet and File objects to describe file sets and files:
type FileSet struct {
    mutex sync.RWMutex // protects the file set
    base  int          // base offset for the next file
    files []*File      // list of files in the order added to the set
    last  *File        // cache of last file looked up
}

type File struct {
    set  *FileSet
    name string // file name as provided to AddFile
    base int    // Pos value range for this file is [base...base+size]
    size int    // file size as provided to AddFile

    // lines and infos are protected by mutex
    mutex sync.Mutex
    lines []int // lines contains the offset of the first character for each line (the first entry is always 0)
    infos []lineInfo
}
They play a role similar to the source struct embedded in the compiler's scanner struct. The difference is that, as we saw, the Go compiler creates multiple goroutines to compile multiple files concurrently, while the standard library keeps the files to be parsed in a file set: the FileSet struct holds a slice of all the files to be parsed.
FileSet and File, and how token position information is calculated
FileSet and File
The mapping between FileSet and File is shown in the figure below:
Image credit: Go-ast-book
The Pos type in the figure represents an index into the array. Each File element in the FileSet corresponds to one interval of the underlying array; the intervals of different files do not intersect, and there may be padding space between adjacent files.
Each File mainly consists of the file name, base, and size. base corresponds to the File's starting Pos index within the FileSet, so base and base+size define the start and end of the File within the FileSet array. Inside each File, an individual position can be addressed with an offset, and offset+File.base converts an in-file offset into a global Pos within the FileSet. Conversely, because Pos is a global offset of the FileSet, it can also be used to find the corresponding File and the offset within that File.
The position of each token produced by lexical analysis is defined by a Pos, and through a Pos and the corresponding FileSet, the corresponding File can easily be found. The line and column number are then computed from the source file corresponding to that File and the offset (in the implementation, File only stores the starting offset of each line; it does not keep the original source data). The underlying type of Pos is int, with semantics similar to a pointer, and the value 0 is defined as NoPos, meaning an invalid Pos.
Source: go-ast-book
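The standard library's go/token package (which src/go/scanner builds on) exposes exactly this arithmetic, so a small example can make it concrete (the file name and contents below are made up):

package main

import (
    "fmt"
    "go/token"
)

func main() {
    src := []byte("package main\n\nfunc main() {}\n") // made-up file contents

    fset := token.NewFileSet()
    f := fset.AddFile("demo.go", fset.Base(), len(src)) // reserves the Pos range [base, base+size]
    f.SetLinesForContent(src)                           // record the starting offset of every line

    offset := 14         // byte offset of "func" within the file
    pos := f.Pos(offset) // global Pos = f.Base() + offset
    fmt.Println(int(pos) == f.Base()+offset) // true

    p := fset.Position(pos) // resolve the global Pos back to file/line/column
    fmt.Printf("%s:%d:%d\n", p.Filename, p.Line, p.Column) // demo.go:3:1
}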
From the relationship between FileSet and File, you can see that the lexical analyzer in the Go standard library uses file sets in this way to parse multiple source files.
Conclusion
This article started from the entry file of the Go compiler, gradually walked through the source code of the lexical analysis part of Go's compilation, and then actually ran and used a lexical analyzer through the test file and the lexical analyzer standard library provided by Go. After reading it, you should have a clear picture of Go's lexical analysis process.
The lexical analysis part is relatively simple and involves little of the hard core content; the genuinely difficult parts are syntax analysis and the abstract syntax tree. If you're interested, please stay tuned.