This is the 9th day of my participation in the August More Text Challenge
motivation
Prepare to write a javascript parser, parse javascript into an abstract syntax tree (AST), prepare to write this is mostly out of interest, to write code first need to choose a language to implement, originally is also relatively ideal to use c++, but now the time is not mature, I am not familiar with the c++ language, I am more familiar with javascript than c++ to write a parser, here does not rely on other javascript libraries, basically start from scratch.
Even in interest or entertainment, also want to do it well
Set up the project
Install the current stable version of Node, create a folder, and run the following command in the folder
npm init -y
Copy the code
When you’re done, a package.json file will be automatically created in the project directory
{
"name": "implementation_parser_with_js"."type":"module"."version": "1.0.0"."description": ""."main": "index.js"."scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"keywords": []."author": ""."license": "ISC"
}
Copy the code
Implement lexer first
The first step of compilation is called Lexer. The function of Lexer is to convert text input into multiple tokens. The next step is to implement this part of code function. We will mark each identified token with a type
const input = "";
function lexer(str){}console.log(lexer(input));
Copy the code
So we’re going to implement a lexer and we’re going to start with simple, input is just reading code, and we’re going to start with simple, with an empty string “” as input.
const input = "";
function lexer(str){
let c = str[0];
if(c === undefined) {return {
type: 'EOF'}}}console.log(lexer(input));
Copy the code
The lexer function reads the code and parses it one by one. When the string is parsed to an empty string, it means that the file has been read. The token type returned here is EOF.
Analytic numerical type
Again, for simplicity’s sake, we assume input=”7″ and then check if the current character is 7. If it is numeric 7, this returns a token.
const input = "Seven";
function* lexer(str){
for (let cursor = 0; cursor <= str.length; cursor++) {
const char = str[cursor];
if(char === undefined) {yield {
type: 'EOF'}}else if(char === '7') {yield{
type:'number'.value:7}}}}for(const token of lexer(input)){
console.log(token)
}
// console.log(lexer(input));
Copy the code
The token whose type is number and value is 7 is displayed, indicating that a token of numeric type is identified.
{ type: 'number'.value: 7 }
{ type: 'EOF' }
Copy the code
contrastif else if
和 if if
We usually have a couple of options for branching statements. The first one is the first one we encountered in javascript when we first encountered the branching statement if… else if … It can also be used for if listing, and of course it can also be used for switch, which is not discussed here. The structure is complicated, and it is not recommended. else if … Again, use the if list to branch.
function* lexer(str){
for (let cursor = 0; cursor <= str.length; cursor++) {
const char = str[cursor];
if( trace("checking undefine",(char === undefined))) {yield {
type: 'EOF'}}if (trace("checking 7",(char === '7'))) {yield{
type:'number'.value:7}}}}for(const token of lexer(input)){
console.log(token)
}
Copy the code
Simply to implement a method language output program to go to some branch output content
function trace(name,v){
console.log(name);
return v;
}
Copy the code
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
{ type: 'EOF' }
checking 7
Copy the code
For the if · and if forms, output is acknowledged twice each time. This is because yield checks each if statement, whereas if… Else if the value check is placed first as follows
const input = "77777";
function trace(name,v){
console.log(name);
return v;
}
function* lexer(str){
for (let cursor = 0; cursor <= str.length; cursor++) {
const char = str[cursor];
if (trace("checking 7",(char === '7'))) {yield{
type:'number'.value:7}}else if( trace("checking undefine",(char === undefined))) {yield {
type: 'EOF'}}}}for(const token of lexer(input)){
console.log(token)
}
Copy the code
The advantage of doing this is obvious, this time for undefine and 7 check, only one check
checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
checking undefines
{ type: 'EOF' }
Copy the code
Exception handling
This should throw an exception, in this case of type SyntaxError, during parsing when there is no criticism
const input = "78";
function* lexer(str){
for (let cursor = 0; cursor <= str.length; cursor++) {
const char = str[cursor];
if( char === '7') {yield{
type:'number'.value:7}}else if (char === undefined) {yield {
type: 'EOF'}}else{
throw new SyntaxError(`unexpected character "${char}"`)}}}for(const token of lexer(input)){
console.log(token)
}
Copy the code
SyntaxError: unexpected character "8"
Copy the code
And usually we have to give
const input = "777";
function trace(name,v){
console.log(name);
return v;
}
function* lexer(str){
for (let cursor = 0; cursor <= str.length; cursor++) {
let char = str[cursor];
function number(){
let value = ""
for(; cursor <= str.length; cursor++){
char = str[cursor];
if(char === '7') {//TODO
value +=7
}else{
break; }}return {
type:'number',
value,
}
}
if( char === '7') {yield number()
}else if (char === undefined) {yield {
type: 'EOF'.// begin:cursor,
// end:cursor+1,}}else{
throw new SyntaxError(`unexpected character "${char}" at ${cursor+1}`)}}}for(const token of lexer(input)){
console.log(token)
}
Copy the code
Sort code
We cleaned up the code, extracted the number method, and then we not only restricted to numeric tokens but also string, object, and so on, but also treated numeric tokens as a whole 777. So we added an inner loop to the number function, which returns values of the same type as a token. By now, it’s getting a little more complicated,
- First remove the cursor and CHAR variables from the for loop to initialize them externally
- Loop through number, append it to value if the type is still 7(numeric)
cursor
increase - Adds location information for exception statements that throw syntax
const input = "777";
function trace(name,v){
console.log(name);
return v;
}
function* lexer(str){
let cursor = 0;
let char = undefined;
function number(){
let value = ""
for(; cursor <= str.length; cursor++){
char = str[cursor];
if(char === '7') {//TODO
value += char
}else{
break; }}return {
type:'number',
value,
}
}
for(; cursor < str.length; ) {const token = number()
if( token ){
yield token
// }else if (char === undefined){
// yield {
// type: 'EOF',
/ /}
}else{
throw new SyntaxError(`unexpected character "${char}" at ${cursor+1}`)}}}for(const token of lexer(input)){
console.log(token)
}
Copy the code