This is the 9th day of my participation in the August More Text Challenge

motivation

Prepare to write a javascript parser, parse javascript into an abstract syntax tree (AST), prepare to write this is mostly out of interest, to write code first need to choose a language to implement, originally is also relatively ideal to use c++, but now the time is not mature, I am not familiar with the c++ language, I am more familiar with javascript than c++ to write a parser, here does not rely on other javascript libraries, basically start from scratch.

Even in interest or entertainment, also want to do it well

Set up the project

Install the current stable version of Node, create a folder, and run the following command in the folder

npm init -y
Copy the code

When you’re done, a package.json file will be automatically created in the project directory

{
  "name": "implementation_parser_with_js"."type":"module"."version": "1.0.0"."description": ""."main": "index.js"."scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": []."author": ""."license": "ISC"
}
Copy the code

Implement lexer first

The first step of compilation is called Lexer. The function of Lexer is to convert text input into multiple tokens. The next step is to implement this part of code function. We will mark each identified token with a type

const input = "";

function lexer(str){}console.log(lexer(input));
Copy the code

So we’re going to implement a lexer and we’re going to start with simple, input is just reading code, and we’re going to start with simple, with an empty string “” as input.

const input = "";

function lexer(str){
    let c = str[0];
    if(c === undefined) {return {
            type: 'EOF'}}}console.log(lexer(input));
Copy the code

The lexer function reads the code and parses it one by one. When the string is parsed to an empty string, it means that the file has been read. The token type returned here is EOF.

Analytic numerical type

Again, for simplicity’s sake, we assume input=”7″ and then check if the current character is 7. If it is numeric 7, this returns a token.

const input = "Seven";

function* lexer(str){

    for (let cursor = 0; cursor <= str.length; cursor++) {
        const char = str[cursor];
        if(char === undefined) {yield {
                type: 'EOF'}}else if(char === '7') {yield{
                type:'number'.value:7}}}}for(const token of lexer(input)){
    console.log(token)
}

// console.log(lexer(input));
Copy the code

The token whose type is number and value is 7 is displayed, indicating that a token of numeric type is identified.

{ type: 'number'.value: 7 }
{ type: 'EOF' }
Copy the code

contrastif else ifif if

We usually have a couple of options for branching statements. The first one is the first one we encountered in javascript when we first encountered the branching statement if… else if … It can also be used for if listing, and of course it can also be used for switch, which is not discussed here. The structure is complicated, and it is not recommended. else if … Again, use the if list to branch.


function* lexer(str){

    for (let cursor = 0; cursor <= str.length; cursor++) {
        const char = str[cursor];
        if( trace("checking undefine",(char === undefined))) {yield {
                type: 'EOF'}}if (trace("checking 7",(char === '7'))) {yield{
                type:'number'.value:7}}}}for(const token of lexer(input)){
    console.log(token)
}

Copy the code

Simply to implement a method language output program to go to some branch output content

function trace(name,v){
    console.log(name);
    return v;
}
Copy the code
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
checking 7
{ type: 'number'.value: 7 }
checking undefine
{ type: 'EOF' }
checking 7
Copy the code

For the if · and if forms, output is acknowledged twice each time. This is because yield checks each if statement, whereas if… Else if the value check is placed first as follows


const input = "77777";

function trace(name,v){
    console.log(name);
    return v;
}

function* lexer(str){

    for (let cursor = 0; cursor <= str.length; cursor++) {
        const char = str[cursor];
        if (trace("checking 7",(char === '7'))) {yield{
                type:'number'.value:7}}else if( trace("checking undefine",(char === undefined))) {yield {
                type: 'EOF'}}}}for(const token of lexer(input)){
    console.log(token)
}

Copy the code

The advantage of doing this is obvious, this time for undefine and 7 check, only one check

checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
{ type: 'number'.value: 7 }
checking 7
checking undefines
{ type: 'EOF' }
Copy the code

Exception handling

This should throw an exception, in this case of type SyntaxError, during parsing when there is no criticism

const input = "78";


function* lexer(str){

    for (let cursor = 0; cursor <= str.length; cursor++) {
        const char = str[cursor];
        if( char === '7') {yield{
                type:'number'.value:7}}else if (char === undefined) {yield {
                type: 'EOF'}}else{
            throw new SyntaxError(`unexpected character "${char}"`)}}}for(const token of lexer(input)){
    console.log(token)
}

Copy the code
SyntaxError: unexpected character "8"
Copy the code

And usually we have to give

const input = "777";

function trace(name,v){
    console.log(name);
    return v;
}



function* lexer(str){

    for (let cursor = 0; cursor <= str.length; cursor++) {
        let char = str[cursor];

        function number(){
            let value = ""

            for(; cursor <= str.length; cursor++){
                char = str[cursor];
                if(char === '7') {//TODO
                    value +=7
                }else{
                    break; }}return {
                type:'number',
                value,
            }
        }

        if( char === '7') {yield number()
        }else if (char === undefined) {yield {
                type: 'EOF'.// begin:cursor,
                // end:cursor+1,}}else{
            throw new SyntaxError(`unexpected character "${char}" at ${cursor+1}`)}}}for(const token of lexer(input)){
    console.log(token)
}

Copy the code

Sort code

We cleaned up the code, extracted the number method, and then we not only restricted to numeric tokens but also string, object, and so on, but also treated numeric tokens as a whole 777. So we added an inner loop to the number function, which returns values of the same type as a token. By now, it’s getting a little more complicated,

  • First remove the cursor and CHAR variables from the for loop to initialize them externally
  • Loop through number, append it to value if the type is still 7(numeric)cursorincrease
  • Adds location information for exception statements that throw syntax

const input = "777";

function trace(name,v){
    console.log(name);
    return v;
}

function* lexer(str){

    let cursor = 0;
    let char = undefined;

    function number(){
        let value = ""

        for(; cursor <= str.length; cursor++){
            
            char = str[cursor];

            if(char === '7') {//TODO
                value += char
            }else{
                break; }}return {
            type:'number',
            value,
        }
    }

    for(; cursor < str.length; ) {const token = number()
        if( token ){
            yield token
        // }else if (char === undefined){
            
        // yield {
        // type: 'EOF',
        / /}
        }else{
            throw new SyntaxError(`unexpected character "${char}" at ${cursor+1}`)}}}for(const token of lexer(input)){
    console.log(token)
}
Copy the code