• I Used Python to Create My Own Programming Language
  • Author: Umangshrestha
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: jaredliw
  • Proofreader: KimYangOfCat, Greycodee

I created my own programming language in Python

Computers can only understand machine code. Ultimately, a programming language is just a string of words designed to make it easier for humans to write what they want computers to do. The real magic is done by compilers and interpreters, which bridge the gap. The interpreter reads the code line by line and converts it to machine code. In this article, we will design an interpreter that can perform arithmetic operations. We’re not going to reinvent the wheel. The article will use PLY (Python lex-yacc), a lexical parser developed by David M. Beazley. PLY can be downloaded in the following ways:

$ pip install ply
Copy the code

We’ll take a quick look at the basics of creating an interpreter. For more information, see this GitHub repository.

Token

A tag is the smallest unit of character that provides meaningful information to the interpreter. The tag contains a pair of names and attribute values.

Let’s start by creating a list of tag names. This is a necessary step.

tokens = (
    # data type
    "NUM"."FLOAT".# Arithmetic operation
    "PLUS"."MINUS"."MUL"."DIV".# brackets
    "LPAREN"."RPAREN".)Copy the code

Lexer

The process of converting statements into tags is called tokenization or lexical analysis. The program that performs lexical analysis is a lexical analyzer.

# regular representation of the tag
t_PLUS   = r"+"
t_MINUS  = r"-"
t_MUL    = r"*"
t_DIV    = r"/"
t_LPAREN = r"("
t_RPAREN = r")"
t_POW    = r"^"
# ignore Spaces and tabs
t_ignore = " \t"


Add actions for each rule
def t_FLOAT(t) :
    r"""\d+.\d+"""
    t.value = float(t.value)
    return t


def t_NUM(t) :
    r"""\d+"""
    t.value = int(t.value)
    return t


# Error handling of undefined rule characters
def t_error(t) :
    The value of t.value here contains the remaining unmarked input
    print(f"keyword not found: {t.value[0]}\nline {t.lineno}")
    t.lexer.skip(1)


If \n is encountered, set it to a new line
def t_newline(t) :
    r"""\n+"""
    t.lexer.lineno += t.value.count("\n")
Copy the code

To import the lexical analyzer, we will use:

import ply.lex as lex

T_ is a special prefix that represents the rule that defines the tag. Each lexical rule is made with regular expressions, compatible with the RE module in Python. Regular expressions can scan the input according to rules and search for matching string of symbols. The grammar defined by regular expressions is called a regular grammar. Languages defined by regular grammars are called regular languages.

With the rules defined, we’ll build the lexical analyzer.

data = 'A = 2 +(10-8)/1.0'

lexer = lex.lex()
lexer.input(data)

while tok := lexer.token():
    print(tok)
Copy the code

To pass the input string, we use lexer.input(data). Lexer.token () will return the next LexToken instance, finally returning None. According to the above rules, the tag for code 2 + (10-8)/1.0 will be:

The purple character represents the name of the tag, followed by the tag’s content.

Backus-naur Form (BNF)

Most programming languages can be written using context-free grammars. It’s more complex than regular language. For context-free grammars, we use context-free grammars, which are a set of rules that describe all possible grammars in a language. BNF is a way of defining a syntax that describes the syntax of a programming language. Let’s look at an example:

Symbol: alternative1 | alternative2...

According to the production, the left side of: is replaced by one of the values on the right. On the right side of the value defined by | space (which can be understood as the symbol of alternative1 alternative2 or… Etc.). For our arithmetic interpreter, the syntax specification is as follows:

expression : expression '+' expression
           | expression '-' expression
           | expression '/' expression
           | expression '*' expression
           | expression '^' expression
           | +expression
           | -expression
           | ( expression )
           | NUM
           | FLOAT
Copy the code

The input tokens are symbols such as NUM, FLOAT, +, -, *, /, and so on, called terminals (characters that cannot continue to decompose or produce other symbols). An expression consists of a terminal and a rule set. For example, expression is called a non-terminal. For more information about BNF, see here.

Parser

We will use YACC (Yet Another Compiler Compiler) as the parser generator. Import modules: import ply.yacc as yacc.

from operator import (add, sub, mul, truediv, pow)

# list of operators supported by our interpreter
ops = {
    "+": add,
    "-": sub,
    "*": mul,
    "/": truediv,
    "^": pow,}def p_expression(p) :
    """expression : expression PLUS expression | expression MINUS expression | expression DIV expression | expression MUL expression | expression POW expression"""
    if (p[2], p[3]) = = ("/".0) :If you divide by 0, INF (infinite) is used as the value
        p[0] = float("INF")
    else:
        p[0] = ops[p[2]](p[1], p[3])


def p_expression_uplus_or_expr(p) :
    """expression : PLUS expression %prec UPLUS | LPAREN expression RPAREN"""
    p[0] = p[2]


def p_expression_uminus(p) :
    """expression : MINUS expression %prec UMINUS"""
    p[0] = -p[2]


def p_expression_num(p) :
    """expression : NUM | FLOAT"""
    p[0] = p[1]


# Rules for syntax errors
def p_error(p) :
    print(f"Syntax error in {p.value}")
Copy the code

In the docstring, we will add the appropriate syntax specification. The elements in the p list correspond to syntax symbols as follows:

expression : expression PLUS expression
p[0]         p[1]       p[2] p[3]
Copy the code

In the previous section, %prec UPLUS and %prec UMINUS were used to represent custom operations. %prec is the abbreviation of precedence. UPLUS and UMINUS are not used in symbols. (In this article, the two custom operations represent unary plus and UMINUS. In fact, UPLUS and UMINUS are just names, you can take whatever you want.) After that, we can add rules based on expressions. YACC allows you to assign priority to each token. We can set it up using the following methods:

precedence = (
    ("left"."PLUS"."MINUS"),
    ("left"."MUL"."DIV"),
    ("left"."POW"),
    ("right"."UPLUS"."UMINUS"))Copy the code

In a priority declaration, the tags are arranged in order of priority from lowest to highest. PLUS and MINUS have the same priority and are left associative (operations are performed from left to right). MUL and DIV have higher priority than PLUS and MINUS, and also have left associativity. The same is true for POW, but with a higher priority. UPLUS and UMINUS are right associative (operations are performed from right to left).

To parse the input we will use:

parser = yacc.yacc()
result = parser.parse(data)
print(result)
Copy the code

The complete code is as follows:

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Import module #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
from logging import (basicConfig, INFO, getLogger)
from operator import (add, sub, mul, truediv, pow)

import ply.lex as lex
import ply.yacc as yacc

# list of operators supported by our interpreter
ops = {
    "+": add,
    "-": sub,
    "*": mul,
    "/": truediv,
    "^": pow,}# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# tag set #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
tokens = (
    # data type
    "NUM"."FLOAT".# Arithmetic operation
    "PLUS"."MINUS"."MUL"."DIV"."POW".# brackets
    "LPAREN"."RPAREN".)# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# regular expression # for the tag
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
t_PLUS   = r"+"
t_MINUS  = r"-"
t_MUL    = r"*"
t_DIV    = r"/"
t_LPAREN = r"("
t_RPAREN = r")"
t_POW    = r"^"
# ignore Spaces and tabs
t_ignore = " \t"


Add actions for each rule
def t_FLOAT(t) :
    r"""\d+.\d+"""
    t.value = float(t.value)
    return t


def t_NUM(t) :
    r"""\d+"""
    t.value = int(t.value)
    return t


# Error handling of undefined rule characters
def t_error(t) :
    The value of t.value here contains the remaining unmarked input
    print(f"keyword not found: {t.value[0]}\nline {t.lineno}")
    t.lexer.skip(1)


If you see \n, set it to a new line
def t_newline(t) :
    r"""\n+"""
    t.lexer.lineno += t.value.count("\n")


# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Set symbol priority #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
precedence = (
    ("left"."PLUS"."MINUS"),
    ("left"."MUL"."DIV"),
    ("left"."POW"),
    ("right"."UPLUS"."UMINUS"))# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Write BNF rules #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
def p_expression(p) :
    """expression : expression PLUS expression | expression MINUS expression | expression DIV expression | expression MUL expression | expression POW expression"""
    if (p[2], p[3]) = = ("/".0) :If you divide by 0, INF (infinite) is used as the value
        p[0] = float("INF")
    else:
        p[0] = ops[p[2]](p[1], p[3])


def p_expression_uplus_or_expr(p) :
    """expression : PLUS expression %prec UPLUS | LPAREN expression RPAREN"""
    p[0] = p[2]


def p_expression_uminus(p) :
    """expression : MINUS expression %prec UMINUS"""
    p[0] = -p[2]


def p_expression_num(p) :
    """expression : NUM | FLOAT"""
    p[0] = p[1]


# Rules for syntax errors
def p_error(p) :
    print(f"Syntax error in {p.value}")


# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# main program #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
if __name__ == "__main__":
    basicConfig(level=INFO, filename="logs.txt")

    lexer = lex.lex()
    parser = yacc.yacc()

    while True:
        try:
            result = parser.parse(
                input("> > >"),
                debug=getLogger())
            print(result)
        except AttributeError:
            print("invalid syntax")
Copy the code

conclusion

Due to the sheer volume of the topic, this article doesn’t explain everything completely, but I hope you can get a good understanding of the surface knowledge it covers. I will post other articles on this topic soon. Thank you and have a nice day.

If you find any mistakes in your translation or other areas that need to be improved, you are welcome to the Nuggets Translation Program to revise and PR your translation, and you can also get the corresponding reward points. The permanent link to this article at the beginning of this article is the MarkDown link to this article on GitHub.


The Nuggets Translation Project is a community that translates quality Internet technical articles from English sharing articles on nuggets. The content covers Android, iOS, front-end, back-end, blockchain, products, design, artificial intelligence and other fields. If you want to see more high-quality translation, please continue to pay attention to the Translation plan of Digging Gold, the official Weibo, Zhihu column.