Preface
The Go compiler is usually written as lowercase gc (Go compiler), which needs to be distinguished from uppercase GC (garbage collection). The execution of the Go compiler can be broken down into multiple phases, including lexical analysis, syntax analysis, abstract syntax tree construction, type checking, variable capture, function inlining, escape analysis, closure rewriting, traversing and compiling functions, SSA generation, and machine code generation, as shown in the figure:
This article mainly explains the steps and details of lexical analysis.
1. Lexical analysis and Token
During the lexical parsing phase, the Go compiler scans the input Go source files and converts them to tokens.
A Token is the smallest lexical unit with independent meaning in a programming language. Tokens include not only keywords but also user-defined identifiers, operators, delimiters, comments, and so on. Each Token has three attributes: the Token value itself, which indicates the type of the lexical unit; the source text of the Token as it appears in the source code; and the position where the Token appears. Among all tokens, comments and semicolons are special. Ordinary comments generally do not affect the semantics of a program, so they can be ignored most of the time. In Go, semicolon tokens, which are the lexical units that separate statements, are automatically inserted at the end of lines. This automatic semicolon insertion leads to small grammatical differences, such as the rule that an opening brace cannot appear on its own line in Go.
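For example, the following small program (my own illustration, not taken from the article's source) compiles only because the opening brace stays on the same line as the function signature:

package main

import "fmt"

// If the "{" below were moved to its own line, the lexer would automatically insert a
// semicolon after "()", and the compiler would reject the program with an error such as
// "unexpected semicolon or newline before {".
func main() {
	fmt.Println("automatic semicolon insertion keeps braces on the same line")
}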
1.1 The Token concept
A Go program is mainly composed of tokens: identifiers, keywords, operators, and delimiters.
Identifiers are character sequences used in Go to name variables, methods, functions, and so on. An identifier consists of letters, underscores, and digits, and its first character cannot be a digit (an underscore is allowed). In plain terms, any name you can define yourself is an identifier. Note that the dollar sign $ is not a letter, so an identifier cannot contain a dollar sign. (Thanks to a careful reader for pointing out that identifiers may start with an underscore; I had missed that detail.)
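The go/token package exposes a helper, token.IsIdentifier (available in newer Go releases), that encodes exactly these rules; here is a small sketch with made-up names:

package main

import (
	"fmt"
	"go/token"
)

func main() {
	// IsIdentifier reports whether the string is a valid Go identifier:
	// letters, digits, and underscores, not starting with a digit, and not a keyword.
	fmt.Println(token.IsIdentifier("_count")) // true: starting with an underscore is allowed
	fmt.Println(token.IsIdentifier("user1"))  // true
	fmt.Println(token.IsIdentifier("1user"))  // false: cannot start with a digit
	fmt.Println(token.IsIdentifier("a$b"))    // false: '$' is not a letter
	fmt.Println(token.IsIdentifier("func"))   // false: keywords are not identifiers
}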
Keywords are identifiers endowed with special meaning by the Go language; they can also be called reserved words. Keywords introduce special syntactic structures and cannot be used as ordinary identifiers. Go defines 25 keywords:
break default func interface select
case defer go map struct
chan else goto package switch
const fallthrough if range type
continue for import return var
Go deliberately keeps the number of keywords small in order to simplify parsing during compilation.
In addition to identifiers and keywords, tokens also include operators and delimiters. Here are the 47 operator and delimiter symbols defined by the Go language:
+    &     +=    &=     &&    ==    !=    (    )
-    |     -=    |=     ||    <     <=    [    ]
*    ^     *=    ^=     <-    >     >=    {    }
/    <<    /=    <<=    ++    =     :=    ,    ;
%    >>    %=    >>=    --    !     ...   .    :
     &^          &^=
Of course, in addition to user-defined identifiers, the 25 keywords, and the 47 operators and delimiters, a program also contains literals, comments, and whitespace. The first step in parsing a Go program is to parse these tokens.
1.2 Token definition
In the go/token package, a Token is defined as an enumerated value; tokens with different values represent different types of lexical units:
// Token is the set of lexical tokens of the Go programming language.
type Token int
All tokens are divided into four categories: special tokens, tokens corresponding to basic literals, operator tokens, and keywords.
There are three special types of tokens: error, end of file, and comment:
// The list of tokens.
const (
// Special tokens
ILLEGAL Token = iota
EOF
COMMENT
ILLEGAL is returned when an unrecognized Token is encountered, that is, when the source file contains a lexical error; returning a single error token simplifies lexical analysis. EOF means the end of the .go file has been reached. COMMENT represents a comment in the source code.
Next come the Token types corresponding to basic literals. The basic literals defined by the Go language specification are mainly integer, floating-point, and imaginary values, plus character (rune) and string values. Note that the Boolean values true and false are not basic literals in the Go specification; to simplify lexical analysis, however, the go/token package treats identifiers such as true and false as literal-class tokens.
Here is the list of literal tokens:
literal_beg
// Identifiers and basic type literals
// (these tokens stand for classes of literals)
IDENT // main
INT // 12345
FLOAT // 123.45
IMAG // 123.45i
CHAR // 'a'
STRING // "abc"
literal_end
literal_beg and literal_end are unexported constants that mark the range of literal tokens; checking whether a Token's value lies between literal_beg and literal_end confirms that it is a literal token. IDENT represents identifiers such as method names, type names, variable names, and constant names; keywords do not belong to this type. The bool literals true and false are also classified as IDENT.
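This classification is easy to observe with token.Lookup and the Token.IsLiteral method; a small sketch:

package main

import (
	"fmt"
	"go/token"
)

func main() {
	// true and false are not keywords, so Lookup maps them to plain identifiers.
	fmt.Println(token.Lookup("true"))   // IDENT
	fmt.Println(token.Lookup("return")) // return (a keyword token)

	// IsLiteral reports whether a token's value lies between literal_beg and literal_end.
	fmt.Println(token.IDENT.IsLiteral()) // true
	fmt.Println(token.INT.IsLiteral())   // true
	fmt.Println(token.ADD.IsLiteral())   // false
}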
The operator and delimiter types have the largest number of tokens. Here is a list of tokens:
operator_beg
// Operators and delimiters
ADD // +
SUB // -
MUL // *
QUO // /
REM // %
AND // &
OR // |
XOR // ^
SHL // <<
SHR // >>
AND_NOT // &^
ADD_ASSIGN // +=
SUB_ASSIGN // -=
MUL_ASSIGN // *=
QUO_ASSIGN // /=
REM_ASSIGN // %=
AND_ASSIGN // &=
OR_ASSIGN // |=
XOR_ASSIGN // ^=
SHL_ASSIGN // <<=
SHR_ASSIGN // >>=
AND_NOT_ASSIGN // &^=
LAND // &&
LOR // ||
ARROW // <-
INC // ++
DEC // --
EQL // ==
LSS // <
GTR // >
ASSIGN // =
NOT // !
NEQ // !=
LEQ // <=
GEQ // >=
DEFINE // :=
ELLIPSIS // ...
LPAREN // (
LBRACK // [
LBRACE // {
COMMA // ,
PERIOD // .
RPAREN // )
RBRACK // ]
RBRACE // }
SEMICOLON // ;
COLON // :
operator_end
Operators mainly include ordinary arithmetic symbols such as addition, subtraction, multiplication, and division, as well as binary operations such as logical operations, bitwise operations, and comparisons (each of which also has a combined assignment form). Besides binary operators there are a few unary operators, such as the positive and negative signs, the address-of operator, and the channel receive operator. The delimiters are mainly parentheses, square brackets, and curly braces, along with commas, periods, semicolons, and colons.
The keywords of Go correspond to 25 tokens of keyword type:
keyword_beg
// Keywords
BREAK
CASE
CHAN
CONST
CONTINUE
DEFAULT
DEFER
ELSE
FALLTHROUGH
FOR
FUNC
GO
GOTO
IF
IMPORT
INTERFACE
MAP
PACKAGE
RANGE
RETURN
SELECT
STRUCT
SWITCH
TYPE
VAR
keyword_end
)
From a lexical perspective, keywords are no different from ordinary identifiers. However, the 25 keywords generally begin different syntactic structures, so defining them as special tokens simplifies parsing.
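The keyword range check is exposed directly by go/token; a small sketch:

package main

import (
	"fmt"
	"go/token"
)

func main() {
	fmt.Println(token.IsKeyword("range")) // true
	fmt.Println(token.IsKeyword("Range")) // false: keywords are case sensitive
	fmt.Println(token.FUNC.IsKeyword())   // true: FUNC lies between keyword_beg and keyword_end
	fmt.Println(token.IDENT.IsKeyword())  // false
}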
All token strings are also stored in a string slice indexed by the Token value, which is used to print a Token's source form; together with the xxx_beg and xxx_end markers above, this makes it easy to determine a Token's category:
var tokens = [...]string{
ILLEGAL: "ILLEGAL",
EOF: "EOF",
COMMENT: "COMMENT",
IDENT: "IDENT",
INT: "INT",
FLOAT: "FLOAT",
IMAG: "IMAG",
CHAR: "CHAR",
STRING: "STRING",
ADD: "+",
SUB: "-",
MUL: "*",
QUO: "/",
REM: "%".// Too much...
// ...
SELECT: "select",
STRUCT: "struct",
SWITCH: "switch",
TYPE: "type",
VAR: "var",
TILDE: "~",}Copy the code
Tokens are to a programming language what the 26 letters are to English: the building blocks of more complex code. It is therefore worth becoming familiar with their features and classification.
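This slice is what backs the Token.String method, so printing a Token yields either its symbol or its name; for example:

package main

import (
	"fmt"
	"go/token"
)

func main() {
	fmt.Println(token.ADD)           // "+"      (operators print their symbol)
	fmt.Println(token.RETURN)        // "return" (keywords print the keyword itself)
	fmt.Println(token.IDENT)         // "IDENT"  (literal-class tokens print their name)
	fmt.Printf("%d\n", token.RETURN) // the underlying integer value of the enum
}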
1.3 FileSet and File
A Go program is organized as packages made up of multiple files, and multiple packages are then linked into an executable, so the files belonging to a single package can be regarded as the basic compilation unit of Go. The go/token package therefore also defines FileSet and File objects to describe file sets and files.
Here are the struct definitions of FileSet and File (both located in src/go/token/position.go):
type FileSet struct {
mutex sync.RWMutex // protects the file set
base int // base offset for the next file
files []*File // list of files in the order added to the set
last *File // cache of last file looked up
}
type File struct {
set *FileSet
name string // file name as provided to AddFile
base int // Pos value range for this file is [base...base+size]
size int // file size as provided to AddFile
// lines and infos are protected by mutex
mutex sync.Mutex
lines []int // lines contains the offset of the first character for each line (the first entry is always 0)
infos []lineInfo
}
Here is the lineInfo struct they reference:
type lineInfo struct {
Offset int
Filename string
Line, Column int
}
Let's go through the main fields of FileSet and File.
FileSet:
- files is a slice holding all the files that have been added to the file set
- base is the base offset for the next file to be added. For example, if the FileSet's files slice contains only one File, the FileSet's base is 1 + File.size + 1: the first 1 is because base counting starts from 1 rather than 0, and the extra 1 is because one position is reserved for the EOF that marks the end of the file's lexical analysis. The detailed calculation is explained later.
File:
- name is the file name
- base is the base offset of this File within the FileSet; the Pos range of the file is [base, base+size]
- size is the length of the file's source in bytes
- lines is an int slice holding the offset of the first character of each line in the file; it is what allows the line and column numbers of a Token to be computed
- infos is a slice of lineInfo entries, which record alternative file/line information (for example from //line directives)
This should give you a general picture of FileSet and File: a FileSet behaves like one large array that stores the bytes of all its files in sequence, and each File occupies the range [base, base+size] of that array, where base is the position of the file's first byte in the large array and size is the file's size.
The mapping between FileSet and File objects is shown in the following figure:
You can see that the diagram also shows Pos, whose range covers the entire large array. Pos is short for Position: it is an index into that whole array.
Each File is described by its file name, base, and size. base corresponds to the Pos index of the File within the FileSet, so base and base+size define the start and end positions of the File in the FileSet array. Inside each File, locations are indexed by offset, and offset + File.base converts an offset inside a File to a Pos value. Conversely, because Pos is a global offset within the FileSet, a Pos value can be used to find the containing File and then the offset within that File. More on this later.
The position of each Token found during lexical analysis is defined by a Pos value, and from a Pos and the corresponding FileSet the containing File can easily be found. The line and column numbers are then computed from that File and the offset. (In the implementation, a File only stores the starting offset of each line; it does not keep the original source code.) The underlying type of Pos is int, with semantics similar to a pointer, so the value 0 is defined as NoPos, meaning an invalid Pos.
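The round trip between offset, Pos, and Position can be seen with a few calls on go/token directly; a minimal sketch (file name and source are arbitrary):

package main

import (
	"fmt"
	"go/token"
)

func main() {
	src := []byte("package main\n\nfunc main() {}\n")

	fset := token.NewFileSet()
	f := fset.AddFile("demo.go", fset.Base(), len(src))
	// Register the line offsets; the scanner normally does this itself via AddLine.
	f.SetLinesForContent(src)

	offset := 14                    // byte offset of "func" within src
	pos := f.Pos(offset)            // offset -> global Pos (offset + file base)
	fmt.Println(pos)                // 15, because the file's base is 1
	fmt.Println(fset.Position(pos)) // demo.go:3:1
	fmt.Println(f.Offset(pos))      // 14, the inverse conversion: Pos -> offset
}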
1.4 A first attempt at scanning
A lexical analyzer usually reads the input one character at a time and uses the character just read, together with a state machine, to determine the type of the Token currently being read. Sometimes one character is not enough information to make this decision, and the next character (or several more) must be read ahead to help the analyzer decide.
The Go standard library's go/scanner package provides the Scanner type for Token scanning; it performs lexical analysis on top of the FileSet and File abstractions. It has three important methods: Init, next, and Scan. Init initializes the scanner, next reads the next character, and Scan is the core method that scans the source and returns the next Token. Scan has three return values: the position of the Token, the Token value, and the Token's source text.
We can take the example in src/go/scanner/example_test.go and modify it slightly:
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	var src = []byte("println(\"hello world\")\n")
	var src1 = []byte("println(\"thank you\")")
	src = append(src, src1...)
	var fset = token.NewFileSet()
	var file = fset.AddFile("hello.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, src, nil, scanner.ScanComments)
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\t\n", fset.Position(pos), tok, lit)
	}
}
Here src is the code to be analyzed. A file set is then created with token.NewFileSet(); Token position information must be resolved through this file set, and it is also needed to create the File argument passed to the scanner's Init method.
The fset.AddFile call then adds a new file named "hello.go" to the fset file set; its length is the same as the code src being analyzed.
A scanner.Scanner object is then created, and the scanner is initialized by calling Init. The first argument to Init is the file object just added to fset, the second is the code to analyze, the third argument nil means there is no custom error handler, and the last argument scanner.ScanComments means comment tokens are returned rather than skipped.
Because the code being parsed contains multiple tokens, we call s.Scan() in a for loop to parse each Token in turn. When token.EOF is returned, the end of the file has been reached; otherwise the scan result is printed. Before printing, the pos value returned by the scanner is converted into a more detailed Position, with file name and line number, using the fset.Position(pos) method.
The output of the above program is as follows:
hello.go:1:1 IDENT "println"
hello.go:1:8 ( ""
hello.go:1:9 STRING "\"hello world\""
hello.go:1:22 ) ""
hello.go:1:23 ; "\n"
hello.go:2:1 IDENT "println"
hello.go:2:8 ( ""
hello.go:2:9 STRING "\"thank you\""
hello.go:2:20 ) ""
hello.go:2:21 ; "\n"
The first column of the output shows the file, line, and column of the Token, the middle column shows the Token itself, and the last column shows the Token's source text.
1.5 A deeper look
If you have not been exposed to compilers before, you may still be a bit confused. For example, in lexical analysis, what is the relationship between all these offsets (offset, rdOffset, lineOffset)? How does Pos change? And how exactly does the fset.Position(pos) method convert a Pos into a more detailed Position with a file name and line/column numbers?
You can debug the program above directly to see the scanner's implementation details; if that is too much trouble, the debugging results are summarized below.
First look at the scanner structure:
// A Scanner holds the scanner's internal state while processing
// a given text. It can be allocated as part of another data
// structure but must be initialized via Init before use.
//
type Scanner struct {
// immutable state
file *token.File // source file handle
dir string // directory portion of file.Name()
src []byte // source
err ErrorHandler // error reporting; or nil
mode Mode // scanning mode
// The core variable used by the lexical analyzer
ch rune // current character
offset int // offset of the current character (ch)
rdOffset int // reading offset (position of the character after ch)
lineOffset int // offset of the start of the current line
insertSemi bool // Insert a semicolon before a line break
// public state - ok to modify
ErrorCount int // number of errors encountered
}
Here are the values of the scanner's core fields (ch, offset, rdOffset, lineOffset), together with the pos, tok, and lit returned by Scan, recorded after each Token of the example program is scanned:
println:        ch=40 ('(')   offset=7   rdOffset=8   lineOffset=0   pos=1   tok=IDENT(4)       lit="println"
(:              ch=34 ('"')   offset=8   rdOffset=9   lineOffset=0   pos=8   tok=LPAREN(49)     lit=""
"hello world":  ch=41 (')')   offset=21  rdOffset=22  lineOffset=0   pos=9   tok=STRING(9)      lit="\"hello world\""
):              ch=10 ('\n')  offset=22  rdOffset=23  lineOffset=0   pos=22  tok=RPAREN(54)     lit=""
\n:             ch=112 ('p')  offset=22  rdOffset=24  lineOffset=23  pos=23  tok=SEMICOLON(57)  lit="\n"
println:        ch=40 ('(')   offset=30  rdOffset=31  lineOffset=23  pos=24  tok=IDENT(4)       lit="println"
... (the remaining tokens of the second line are omitted) ...
\n (at EOF):    ch=-1         offset=43  rdOffset=43  lineOffset=23  pos=44  tok=SEMICOLON(57)  lit="\n"
end of file:    ch=-1         offset=43  rdOffset=43  lineOffset=23  pos=44  tok=EOF(1)         lit=""
Several details can be found:
- After the current Token has been scanned, ch already holds the next character
- rdOffset always stays one character ahead of offset until the end of the file is reached
- Several tokens print lit as "". The reason is that for single-character tokens such as (, ), { and }, the tok value alone already says what the token is, so there is no need to fill in lit. For identifiers and basic literals, on the other hand, lit is what carries the value: for example, if the Token read is an INT, lit tells you whether it is 0, 1000, or some other valid integer value.
Having seen FileSet, File, offset, and pos in action, let's now look at how a Pos is converted into line and column numbers.
The line and column information for the file’s tokens is stored in the Position structure:
type Position struct {
Filename string // filename, if any
Offset int // offset, starting at 0
Line int // line number, starting at 1
Column int // column number, starting at 1 (byte count)
}
The fset.Position(pos) method converts the pos that is passed in into a Position:
func (s *FileSet) Position(p Pos) (pos Position) {
	return s.PositionFor(p, true)
}

func (s *FileSet) PositionFor(p Pos, adjusted bool) (pos Position) {
	if p != NoPos {
		if f := s.file(p); f != nil {
			return f.position(p, adjusted)
		}
	}
	return
}

func (f *File) position(p Pos, adjusted bool) (pos Position) {
	// If the relationship between Pos and offset was not clear before, this line makes it clear:
	// offset = Pos - base
	offset := int(p) - f.base
	pos.Offset = offset
	// The core function
	pos.Filename, pos.Line, pos.Column = f.unpack(offset, adjusted)
	return
}
In the position method we can see that the line and column numbers are not actually computed from Pos directly, but from the offset derived from Pos.
Read on:
func (f *File) unpack(offset int, adjusted bool) (filename string, line, column int) {
f.mutex.Lock()
defer f.mutex.Unlock()
filename = f.name
if i := searchInts(f.lines, offset); i >= 0 {
line, column = i+1, offset-f.lines[i]+1
}
if adjusted && len(f.infos) > 0 {
// omitted
}
return
}
The adjusted parameter only matters when //line directives are present, so we ignore it and look at the normal case: searchInts finds the index of the line containing offset, from which the line and column numbers are derived. The inputs to searchInts are f.lines and offset. f.lines was introduced earlier; recall its definition:
lines []int // lines contains the offset of the first character for each line (the first entry is always 0)
lines stores the offset of the first character of each line, and from it we can guess how the line and column numbers are computed. Suppose lines is [0, 10, 20, 30, 40] and offset is 22. If lines[i] <= offset < lines[i+1], then the Token is on line i+1 and its column is offset - lines[i] + 1; here i = 2, so the Token is on line 3, column 22 - 20 + 1 = 3. (Doesn't finding the line index feel like a binary search?)
See if searchInts implements this:
func searchInts(a []int, x int) int {
i, j := 0, len(a)
for i < j {
h := i + (j-i)/2 // avoid overflow when computing h
// i ≤ h < j
if a[h] <= x {
i = h + 1
} else {
j = h
}
}
return i - 1
}
That’s true. Ha ha ha ha, I guessed right!
1.6 Going further
Finally, let's walk through the whole execution flow of the program above and some of the logic of Scan. The program code is pasted again here so you don't have to scroll back.
func main() {
	var src = ...
	var fset = token.NewFileSet()
	var file = fset.AddFile("hello.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, src, nil, scanner.ScanComments)
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		// Printf ...
	}
}
1.6.1 NewFileSet()
NewFileSet creates a FileSet:
// NewFileSet creates a new file set.
func NewFileSet() *FileSet {
	return &FileSet{
		base: 1, // 0 == NoPos
	}
}
The initial base of a FileSet is 1, not 0; the value 0 is given the name NoPos. A Pos equal to NoPos means there is no file or line information at all.
const NoPos Pos = 0
1.6.2 AddFile
AddFile adds a file to a FileSet:
func (s *FileSet) AddFile(filename string, base, size int) *File {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	if base < 0 {
		base = s.base
	}
	// ... some base/size checks omitted ...
	// Create the File and initialize its fields. In our example the base value passed in is 1,
	// because it comes from fset.Base() in the call
	//     var file = fset.AddFile("hello.go", fset.Base(), len(src))
	f := &File{set: s, name: filename, base: base, size: size, lines: []int{0}}
	// Compute the base for the next file; the extra 1 reserves a position for EOF
	base += size + 1 // +1 because EOF also has a position
	// ... some base/size checks omitted ...
	// add the file to the file set
	s.base = base
	s.files = append(s.files, f)
	s.last = f
	return f
}
At this point, the preparation for the scan is complete.
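To see the base arithmetic concretely, here is a small sketch (the file names and sizes are made up):

package main

import (
	"fmt"
	"go/token"
)

func main() {
	fset := token.NewFileSet()
	fmt.Println(fset.Base()) // 1: the set starts at 1 because 0 is reserved for NoPos

	f1 := fset.AddFile("a.go", fset.Base(), 100)
	fmt.Println(f1.Base())   // 1
	fmt.Println(fset.Base()) // 102 = 1 + 100 + 1, the extra 1 reserving a position for EOF

	f2 := fset.AddFile("b.go", fset.Base(), 50)
	fmt.Println(f2.Base())   // 102
	fmt.Println(fset.Base()) // 153 = 102 + 50 + 1
}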
1.6.3 Init
Init initializes the scanner:
func (s *Scanner) Init(file *token.File, src []byte, err ErrorHandler, mode Mode) {
	if file.Size() != len(src) {
		panic(fmt.Sprintf("file size (%d) does not match src len (%d)", file.Size(), len(src)))
	}
	s.file = file
	s.dir, _ = filepath.Split(file.Name())
	s.src = src
	s.err = err
	s.mode = mode

	s.ch = ' '
	s.offset = 0
	s.rdOffset = 0
	s.lineOffset = 0
	s.insertSemi = false
	s.ErrorCount = 0

	// Read the first character to be parsed
	s.next()
	if s.ch == bom {
		s.next() // ignore BOM at file beginning
	}
}
1.6.4 Next
next reads the next character into s.ch:
// Since Go supports UTF-8 encoded source files, next reads the next character, not the next byte.
// It does not return a value; instead it updates the scanner's fields (ch, offset, and so on).
func (s *Scanner) next() {
	if s.rdOffset < len(s.src) {
		s.offset = s.rdOffset
		if s.ch == '\n' {
			s.lineOffset = s.offset
			s.file.AddLine(s.offset)
		}
		// Decode the next character as a rune to support UTF-8
		r, w := rune(s.src[s.rdOffset]), 1
		switch {
		case r == 0:
			s.error(s.offset, "illegal character NUL")
		case r >= utf8.RuneSelf:
			// not ASCII
			// utf8.DecodeRune decodes the bytes into a rune r and its byte length w
			r, w = utf8.DecodeRune(s.src[s.rdOffset:])
			if r == utf8.RuneError && w == 1 {
				s.error(s.offset, "illegal UTF-8 encoding")
			} else if r == bom && s.offset > 0 {
				s.error(s.offset, "illegal byte order mark")
			}
		}
		s.rdOffset += w
		s.ch = r
	} else {
		s.offset = len(s.src)
		if s.ch == '\n' {
			s.lineOffset = s.offset
			s.file.AddLine(s.offset)
		}
		// At the end of the file, ch is set to -1
		s.ch = -1 // EOF
	}
}
1.6.5 Scan
The Scan method is the core implementation of the lexical analyzer, which, as mentioned above, is a state machine. For each call to Scan, a Token is returned.
func (s *Scanner) Scan() (pos token.Pos, tok token.Token, lit string) {
scanAgain:
	// Skip whitespace
	s.skipWhitespace()

	// Starting position of the current Token
	pos = s.file.Pos(s.offset)

	// insertSemi records whether a semicolon should be inserted before the next newline
	insertSemi := false

	// Determine the current token type based on the current character ch
	switch ch := s.ch; {
	// If the current character is a letter (a-z, A-Z, '_' or a UTF-8 letter), parse it as an identifier token.
	// The identifier may also be a keyword, so token.Lookup is used to check whether it is one.
	case isLetter(ch):
		// Scan the whole identifier
		lit = s.scanIdentifier()
		if len(lit) > 1 {
			// Check whether the identifier is a keyword
			tok = token.Lookup(lit)
			switch tok {
			case token.IDENT, token.BREAK, token.CONTINUE, token.FALLTHROUGH, token.RETURN:
				insertSemi = true
			}
		} else {
			insertSemi = true
			tok = token.IDENT
		}
	// If the current character is a digit, scan a number; scanNumber decides between INT and FLOAT
	case isDecimal(ch) || ch == '.' && isDecimal(rune(s.peek())):
		insertSemi = true
		tok, lit = s.scanNumber()
	default:
		// Advance s.ch to the next character; the following character helps decide the current token type
		s.next()
		// Here ch is the first character of the token, and s.ch is the character after it
		switch ch {
		case -1:
			if s.insertSemi {
				s.insertSemi = false // EOF consumed insertSemi
				return pos, token.SEMICOLON, "\n"
			}
			tok = token.EOF
		case '\n':
			s.insertSemi = false // newline consumed insertSemi
			return pos, token.SEMICOLON, "\n"
		// The current token is a string of the form "abc..."
		case '"':
			insertSemi = true
			tok = token.STRING
			lit = s.scanString()
		// The current token is a character literal
		case '\'':
			insertSemi = true
			tok = token.CHAR
			lit = s.scanRune()
		// The current token is a raw string of the form `abc...`
		case '`':
			insertSemi = true
			tok = token.STRING
			lit = s.scanRawString()
		// When ':' is encountered, the token type depends on the next character: if it is '=',
		// the token is ":=", otherwise just ':'. This explains why next was called first.
		case ':':
			tok = s.switch2(token.COLON, token.DEFINE)
		// A '.' here can only be a period or the ellipsis "...", because ".5"-style decimals
		// were already handled by the isDecimal case above
		case '.':
			tok = token.PERIOD
			if s.ch == '.' && s.peek() == '.' {
				s.next()
				s.next() // consume last '.'
				tok = token.ELLIPSIS
			}
		case ',':
			tok = token.COMMA
		case ';':
			tok = token.SEMICOLON
			lit = ";"
		case '(':
			tok = token.LPAREN
		case ')':
			insertSemi = true
			tok = token.RPAREN
		case '[':
			tok = token.LBRACK
		case ']':
			insertSemi = true
			tok = token.RBRACK
		case '{':
			tok = token.LBRACE
		case '}':
			insertSemi = true
			tok = token.RBRACE
		// If ch is '+', the possible results are '+', '+=', and '++'; the exact token depends on s.ch
		case '+':
			tok = s.switch3(token.ADD, token.ADD_ASSIGN, '+', token.INC)
			if tok == token.INC {
				insertSemi = true
			}
		case '-':
			tok = s.switch3(token.SUB, token.SUB_ASSIGN, '-', token.DEC)
			if tok == token.DEC {
				insertSemi = true
			}
		case '*':
			tok = s.switch2(token.MUL, token.MUL_ASSIGN)
		case '/':
			// If ch is '/' and s.ch is '/' or '*', the current token is a comment
			if s.ch == '/' || s.ch == '*' {
				if s.insertSemi && s.findLineEnd() {
					// reset position to the beginning of the comment
					s.ch = '/'
					s.offset = s.file.Offset(pos)
					s.rdOffset = s.offset + 1
					s.insertSemi = false
					return pos, token.SEMICOLON, "\n"
				}
				comment := s.scanComment()
				if s.mode&ScanComments == 0 {
					// skip comments if they are not wanted
					s.insertSemi = false
					goto scanAgain
				}
				tok = token.COMMENT
				lit = comment
			} else {
				// Not a comment: the result is the division operator or '/='
				tok = s.switch2(token.QUO, token.QUO_ASSIGN)
			}
		// The possible result is the modulo operator or the '%=' operator
		case '%':
			tok = s.switch2(token.REM, token.REM_ASSIGN)
		case '^':
			tok = s.switch2(token.XOR, token.XOR_ASSIGN)
		case '<':
			if s.ch == '-' {
				s.next()
				tok = token.ARROW
			} else {
				tok = s.switch4(token.LSS, token.LEQ, '<', token.SHL, token.SHL_ASSIGN)
			}
		case '>':
			tok = s.switch4(token.GTR, token.GEQ, '>', token.SHR, token.SHR_ASSIGN)
		case '=':
			tok = s.switch2(token.ASSIGN, token.EQL)
		case '!':
			tok = s.switch2(token.NOT, token.NEQ)
		case '&':
			if s.ch == '^' {
				s.next()
				tok = s.switch2(token.AND_NOT, token.AND_NOT_ASSIGN)
			} else {
				tok = s.switch3(token.AND, token.AND_ASSIGN, '&', token.LAND)
			}
		case '|':
			tok = s.switch3(token.OR, token.OR_ASSIGN, '|', token.LOR)
		default:
			if ch != bom {
				s.errorf(s.file.Offset(pos), "illegal character %#U", ch)
			}
			insertSemi = s.insertSemi // preserve insertSemi info
			tok = token.ILLEGAL
			lit = string(ch)
		}
	}
	if s.mode&dontInsertSemis == 0 {
		s.insertSemi = insertSemi
	}

	return
}
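The Scan code above relies on the helper methods switch2, switch3, and switch4 to resolve operators that share a common prefix by peeking at s.ch. Their logic, lightly simplified from the go/scanner source, looks like this:

// switch2 handles pairs such as ':' vs ":=" and '*' vs "*=":
// if the next character is '=', return tok1, otherwise tok0.
func (s *Scanner) switch2(tok0, tok1 token.Token) token.Token {
	if s.ch == '=' {
		s.next()
		return tok1
	}
	return tok0
}

// switch3 additionally recognizes a doubled character, e.g. '+' / "+=" / "++".
func (s *Scanner) switch3(tok0, tok1 token.Token, ch2 rune, tok2 token.Token) token.Token {
	if s.ch == '=' {
		s.next()
		return tok1
	}
	if s.ch == ch2 {
		s.next()
		return tok2
	}
	return tok0
}

// switch4 also covers the doubled character followed by '=', e.g. '<' / "<=" / "<<" / "<<=".
func (s *Scanner) switch4(tok0, tok1 token.Token, ch2 rune, tok2, tok3 token.Token) token.Token {
	if s.ch == '=' {
		s.next()
		return tok1
	}
	if s.ch == ch2 {
		s.next()
		if s.ch == '=' {
			s.next()
			return tok3
		}
		return tok2
	}
	return tok0
}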
1.7 Summary
Lexical analysis can be simply understood as the process of translating source code into a sequence of tokens according to certain rules. For example, translating the following source code:
package main
import (
"fmt"
)
func main() {
	fmt.Println("Hello")
}
Then the output Token sequence is:
PACKAGE IDENT
IMPORT LPAREN
QUOTE IDENT QUOTE
RPAREN
FUNC IDENT LPAREN RPAREN LBRACE
IDENT DOT IDENT LPAREN QUOTE IDENT QUOTE RPAREN
RBRACE
Of course, a plain Token sequence like this is not very useful by itself; the subsequent analysis phases continue to build on these tokens.
1.8 Extension
The lexical analysis code shown above lives in the src/go directory. There is a similar set of lexical analysis code in src/cmd/compile/internal; the implementation details differ, but the principle is the same.
So why are there two sets of compiler front-end code? The code in src/cmd/compile/internal is used by the Go compiler itself, while the set in src/go is for tools that need front-end analysis, the official gofmt being the usual example:
We know that gofmt formats code by parsing it into an AST and then turning the AST back into source code.
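A minimal sketch of that parse-and-reprint round trip using the standard go/format package, which wraps go/parser and go/printer (the input snippet is arbitrary):

package main

import (
	"fmt"
	"go/format"
)

func main() {
	// Badly formatted input; format.Source parses it into an AST and prints it back
	// in canonical gofmt style.
	src := []byte("package main\nfunc   main( ){fmt.Println(\"Hello\")}")

	out, err := format.Source(src)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}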
That concludes this exploration of lexical analysis; syntax analysis, abstract syntax trees, type checking and so on can come in later articles. It's tiring to type so many words in one sitting. (escapes)