This article is taken from my official account “Sun Wukong, Don’t Talk Nonsense”.
Start with the execution of JavaScript
For common compiled languages (such as Java), the compilation steps are: lexical analysis -> parsing -> semantic checking -> code optimization and bytecode generation.
JavaScript is a little different. Its pipeline is: lexical analysis -> syntax analysis -> syntax tree, after which the code is interpreted and executed.
This article focuses on the lexical analysis part of JavaScript mentioned in the flow above.
Overview of lexical analysis
What is lexical analysis? Lexical analysis is the process of converting an input character stream into a token stream. A token is the smallest meaningful unit of the language, as defined by the JavaScript lexical grammar. The term is sometimes rendered as “word”; this article uses “token” throughout.
The conversion from characters to tokens is flat rather than structured: any run of characters that matches a token rule becomes a token (a well-designed lexical grammar generally contains no conflicts). In other words, given a source string as input, the lexical rules turn it into a sequence of tokens.
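As a rough illustration of turning a character stream into a token stream, here is a minimal sketch. The names RULES and tokenize are hypothetical, and the rules cover only a small subset of the real JavaScript lexical grammar (no string literals, regular expressions, or Unicode identifiers).

```javascript
// Simplified token rules, tried in order at the current position.
// This is an illustrative subset, not the full JavaScript lexical grammar.
const RULES = [
  ["WhiteSpace", /[\t\v\f \u00A0\uFEFF]+/y],
  ["LineTerminator", /\r\n|[\n\r\u2028\u2029]/y],
  ["Comment", /\/\/[^\n\r\u2028\u2029]*|\/\*[\s\S]*?\*\//y],
  ["NumericLiteral", /\d+\.?\d*|\.\d+/y],
  ["IdentifierName", /[$_A-Za-z][$\w]*/y],
  ["Punctuator", /[=+\-*\/;(){}.,]/y],
];

function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    for (const [type, pattern] of RULES) {
      pattern.lastIndex = pos; // sticky regex: match exactly at pos
      const match = pattern.exec(source);
      if (match) {
        tokens.push({ type, value: match[0] });
        pos = pattern.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError("Unexpected character at position " + pos);
    }
  }
  return tokens;
}

// Drop whitespace to see the meaningful pieces of the input.
const tokenValues = tokenize("var a = 1; // init")
  .filter((token) => token.type !== "WhiteSpace")
  .map((token) => token.value);
console.log(tokenValues);
```

Note that the rules are applied without any notion of nesting or structure: each token is recognized purely from the characters at the current position.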
Since there are rules for lexical analysis, let’s look at what those rules look like.
First, classify the input elements of JavaScript source code. They fall into four categories: whitespace, line terminators, comments, and tokens.
As you can see, JavaScript differs somewhat from the lexical analysis of a typical language in that line terminators and comments are part of the lexical rules. In JavaScript, line terminators and comments can also affect parsing.
Let’s take a closer look at these rules and try to interpret them.
Whitespace
When it comes to whitespace, you probably think of the space character, but JavaScript supports more than that.
- <HT> (or <TAB>) is U+0009, the horizontal tab used for indentation, written \t in a string literal.
- <VT> is U+000B, the vertical tab, written \v. It is hard to type on a keyboard, so it is rarely used.
- <FF> is U+000C, form feed (page break), written \f in a string literal. Printing source programs is rare nowadays, so this character rarely appears in JavaScript source code.
- <SP> is U+0020, the ordinary space.
- <NBSP> is U+00A0, the no-break space, a variant of SP. In typesetting it prevents a line break at that position, but otherwise it behaves exactly like a normal space. Most JavaScript editing environments treat it as a normal space (source code editors generally do not wrap lines automatically anyway). In HTML, it is the character produced by the &nbsp; entity that many people like to use.
- <ZWNBSP> (formerly called BOM) is U+FEFF, which has counted as whitespace since ES5. It is the zero-width no-break space in Unicode. In UTF-encoded files, an extra U+FEFF is often inserted at the beginning of the file, and a program that parses the file can guess which UTF encoding is used from how U+FEFF is represented. This character is also called the “byte order mark”.
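To see that these characters really are interchangeable as whitespace, here is a small sketch. It builds a function body in which every gap is one of the less common whitespace characters; the variable names are only for illustration.

```javascript
// Every gap between tokens is a different whitespace character:
// tab, vertical tab, form feed, no-break space, and U+FEFF.
const body = "return\t1\v+\f2\u00A0*\uFEFF3";
const result = new Function(body)(); // evaluates 1 + 2 * 3

// String.prototype.trim removes the same set of whitespace characters.
const trimmed = "\uFEFF\u00A0 x\t\v\f".trim();

console.log(result, JSON.stringify(trimmed));
```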
Line terminators
Now let’s look at line terminators. JavaScript recognizes only four characters as line terminators:
- <LF> is U+000A, the most common line feed character, written \n in a string literal.
- <CR> is U+000D, the true “carriage return”, written \r in a string literal. In some Windows-style text editors a line break is the two characters \r\n.
- <LS> is U+2028, the line separator in Unicode.
- <PS> is U+2029, the paragraph separator in Unicode.
Most line terminators are discarded once the lexer has scanned them, but they affect two important syntactic features of JavaScript: automatic semicolon insertion and the “no LineTerminator here” rule. Keep this in mind; a later article covers it in detail.
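A short sketch of how a line terminator changes parsing: a return statement is one of the places where no line terminator is allowed before the operand, so a semicolon is inserted automatically. The variable names here are hypothetical.

```javascript
// "return" followed by a line terminator becomes "return;" so the
// value on the next line is never returned. All four line terminators
// behave the same way.
const terminators = ["\n", "\r", "\u2028", "\u2029"];
const brokenResults = terminators.map(
  (terminator) => new Function("return" + terminator + "42")()
);

// With an ordinary space, the value is returned as expected.
const sameLineResult = new Function("return 42")();

console.log(brokenResults, sameLineResult);
```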
Comments
JavaScript comments are divided into single-line comments and multi-line comments:
/* MultiLineCommentChars */
// SingleLineCommentChars
Inside a multi-line comment any character may appear freely, except that a * must not be followed by a slash, because */ ends the comment. A single-line comment may contain any character except the four line terminators.
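Comments can affect parsing too. As a sketch: a multi-line comment that contains a line terminator is treated as a line terminator by the parser, so it can trigger automatic semicolon insertion. The variable names are only for illustration.

```javascript
// A comment that stays on one line behaves like whitespace.
const inlineComment = new Function("return /* inline */ 42;")();

// A multi-line comment containing a line break counts as a line
// terminator, so "return" is terminated before the value.
const multiLineComment = new Function(
  "return /* line one\n line two */ 42;"
)();

console.log(inlineComment, multiLineComment);
```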
Tokens
Identifier names (for example, variable names)
Identifier names can start with the dollar sign “$”, the underscore “_”, or a Unicode letter. After the first character, identifier names may also contain Unicode combining marks, digits, and connector punctuation.
Any character of an identifier name can be written as a Unicode escape sequence such as \u0061, with no restriction on which character is escaped.
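A few identifier names that follow these rules, as a small sketch (all names are made up for illustration):

```javascript
// Identifiers may start with $, _, or a Unicode letter.
const $price = 1;
const _count = 2;
const π = 3;

// A Unicode escape inside an identifier: \u0061 is "a",
// so this declares a constant named abc.
const \u0061bc = 4;

const sum = $price + _count + π + abc;
console.log(sum);
```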
Restrictions
There is one restriction, of course: an identifier name cannot be a reserved word. JavaScript has many reserved words. Among them are the keywords:
await break case catch class const continue debugger default delete do else export extends finally for function if import in instanceof new return super switch this throw try typeof var void while with yield
NullLiteral (null) and BooleanLiteral (true, false) are also reserved words and cannot be used as identifier names.
In addition to the above, there are some additional keywords reserved for future use:
enum implements package protected interface private public
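One way to check the restriction is to try declaring a variable with a given name and see whether the source parses. This is only a sketch: canDeclare is a hypothetical helper, and it should only be given trusted input because it builds code from a string.

```javascript
// Returns true if "var <name>;" parses, false on a SyntaxError.
function canDeclare(name) {
  try {
    new Function("var " + name + ";");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const checks = ["total", "class", "null", "true", "enum"].map(
  (name) => [name, canDeclare(name)]
);
console.log(checks);
```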
Punctuators
Here is the full list of punctuators:
{ ( ) [ ] . ... ; , < > <= >= == != === !== + - * % ** ++ -- << >> >>> & | ^ ! ~ && || ? : = += -= *= %= **= <<= >>= >>>= &= |= ^= => / /= }
Numeric literals
Now let’s look at the question posed by today’s title.
The JavaScript specification allows numeric literals to be written in four forms: decimal numbers, and binary, octal, and hexadecimal integers.
Decimal
A decimal number may contain a decimal point, and the digits either before or after the point may be omitted, but not both.
.01
24.
24.01
There is a question here, which is also the question of our title. Let’s look at some code:
24.toString()
In this case, because the digits after the decimal point may be omitted, the lexer reads 24 together with the following dot as a single numeric token. The identifier toString then follows a number directly, which is a syntax error. To make the dot a token of its own, add a space or a second decimal point, like this:
24 .toString()
24..toString()
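The behaviour can be checked without crashing the script by parsing each variant from a string. In this sketch, parses is a hypothetical helper that reports whether a piece of source is syntactically valid.

```javascript
// Returns true if the source parses as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const attempts = {
  "24.toString()": parses("24.toString()"),   // dot belongs to the number
  "24 .toString()": parses("24 .toString()"), // space ends the number
  "24..toString()": parses("24..toString()"), // first dot ends the number
};
console.log(attempts);

// The working forms produce the expected string.
const text = 24..toString();
```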
Numeric literals also support scientific notation (only an integer is allowed after the e), for example:
10.24e+2
10.24e-2
10.24e2
Other bases
A literal that starts with 0x, 0b, or 0o is an integer in the corresponding base:
0xFA
0o73
0b10000
None of these bases supports a fractional part or scientific notation.
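For reference, here are the values that the literal forms discussed above evaluate to, collected in one object as a quick sketch (the property names are only for illustration):

```javascript
const values = {
  leadingPoint: .01,      // digits before the point omitted
  trailingPoint: 24.,     // digits after the point omitted
  exponent: 10.24e2,      // scientific notation
  hexadecimal: 0xFA,
  octal: 0o73,
  binary: 0b10000,
};
console.log(values);
```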
String literals
StringLiteral in JavaScript supports both single and double quotation marks.
" DoubleStringCharacters "
' SingleStringCharacters '
Single-quoted and double-quoted strings differ only in how they are written: inside a double-quoted literal the double quotation mark must be escaped, and inside a single-quoted literal the single quotation mark must be escaped. The other characters that must be escaped are the backslash \ and line terminators (since ES2019, U+2028 and U+2029 may also appear unescaped).
As for single-character escapes (a backslash followed by one character), the meaningful ones are: \' \" \\ \b \f \n \r \t \v
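A quick sketch of what the single-character escapes produce, shown as character codes (the variable names are only for illustration):

```javascript
// An escaped quote is the same character regardless of quote style.
const sameQuote = 'It\'s' === "It's";

// Code points of the control-character escapes.
const escapeCodes = ["\b", "\t", "\n", "\v", "\f", "\r"].map(
  (character) => character.charCodeAt(0)
);

// An escaped backslash is a single character.
const backslashLength = "\\".length;

console.log(sameQuote, escapeCodes, backslashLength);
```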
String templates
Syntactically, a string template is a single unit in which the ${ } substitutions sit side by side with the text.
Lexically, however, a string template containing ${ } is split into several tokens. For example:
`a${b}c${d}e`
JavaScript treats it as:
`a${
b
}c${
d
}e`
It is broken down into five tokens:
- `a${ is called the template head
- }c${ is called the template middle
- }e` is called the template tail
- b and d are ordinary identifiers
In fact, the lexical analysis here is deeply coupled to the syntactic analysis.
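The split can be imitated with a small sketch. The function splitTemplate is hypothetical and deliberately naive: it assumes the substitutions contain no nested braces, strings, or escapes, which is exactly the part that a real engine needs the parser's help for.

```javascript
// Splits a simple template's source text into head, substitution
// contents, middle pieces, and tail. Assumes no nested braces.
function splitTemplate(source) {
  const pieces = [];
  let start = 0;
  while (true) {
    const open = source.indexOf("${", start);
    if (open === -1) {
      pieces.push(source.slice(start)); // tail (or the whole template)
      break;
    }
    const close = source.indexOf("}", open + 2);
    if (close === -1) throw new SyntaxError("Unterminated substitution");
    pieces.push(source.slice(start, open + 2)); // head or middle
    pieces.push(source.slice(open + 2, close)); // substitution content
    start = close;
  }
  return pieces;
}

const pieces = splitTemplate("`a${b}c${d}e`");
console.log(pieces);
```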
Conclusion
We have covered the lexical part of JavaScript: whitespace, line terminators, comments, identifier names, punctuators, numeric literals, string literals, and string templates. These lexical rules answer the question in the title, and knowing them is very helpful when debugging code.
JavaScript in-depth series:
What happens in JS when you write “var a=1;”?
Why does 24.toString report an error?
Everything you need to know about JavaScript scope
There are a lot of things you don’t know about “this” in JS
JavaScript is an object-oriented language. Who is for it and who is against it?
Deep and shallow copy in JavaScript
JavaScript and the Event Loop
From Iterator to Async/Await
Explore the detailed implementation of JavaScript Promises