Write a compiler from scratch in PHP

Writing a compiler in PHP, are you joking? No, I’m serious.

Although several articles about writing a compiler in PHP can be found on foreign websites, they are all purely illustrative. What we are going to do today is write a real compiler (or parser) for the RQL language.

So, what is RQL? RQL stands for Resource Query Language. How does it relate to GraphQL? Not much, really. GraphQL is also a resource query language, but it has many additional features, such as introspection and data types. RQL is not that complicated; it only does queries.

Why write this compiler or parser? Because RQL can enhance RESTful API queries, so that a large number of a project's interfaces can be served by common code.

What does RQL syntax look like? It is pretty simple: everything is written as a function call.

For example, the condition a = b is written in RQL as eq(a,b). For more details, please refer to:

Byteferry. Making. IO/RQL parser /…
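To make the function-style syntax concrete, here is a minimal sketch that rewrites a single flat predicate such as eq(a,b) into its infix form. The function name `rqlPredicateToInfix` and the small operator table are hypothetical and exist only for this illustration; the parser built in this series works quite differently (lexer, tokens, AST), and this sketch handles neither nesting nor quoted strings.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: rewrite one flat predicate like "eq(a,b)" as "a = b".
 * No nesting, no quoting, and only a handful of operators are supported.
 */
function rqlPredicateToInfix(string $rql): string
{
    // Illustrative subset of operator names and their infix equivalents.
    $operators = ['eq' => '=', 'ne' => '<>', 'gt' => '>', 'ge' => '>=', 'lt' => '<', 'le' => '<='];

    // Match name(left,right) where both arguments are plain words.
    if (1 !== preg_match('/^(\w+)\((\w+),(\w+)\)$/', trim($rql), $m)) {
        throw new InvalidArgumentException("Not a flat predicate: {$rql}");
    }
    [, $name, $left, $right] = $m;

    if (!isset($operators[$name])) {
        throw new InvalidArgumentException("Unknown operator: {$name}");
    }

    return "{$left} {$operators[$name]} {$right}";
}

echo rqlPredicateToInfix('eq(a,b)'), PHP_EOL;    // a = b
echo rqlPredicateToInfix('ge(age,18)'), PHP_EOL; // age >= 18
```

A single regular expression like this stops working as soon as predicates are nested, which is exactly why the rest of the article builds a proper lexer and parser.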

So what are the specific requirements?

The RQL is parsed into parameters that the service layer of an MVC framework can use, so that the corresponding function can be called through the ApiBridge.

Very simple. However, we agreed on a few rules:

Design patterns must be used. No function may exceed 50 lines or three levels of nested if statements. Variables are named with underscores (all databases use underscores, and this also distinguishes variables from other identifiers); everything else (functions, classes, etc.) is named in camelCase. The final code must pass Travis CI continuous integration with at least 95% code coverage. The goal is to be a genuinely popular open source project that is easy to maintain in the future. We definitely do not want to end up like Tp, where a single function runs to nearly 200 lines and test coverage is only a little over 20 percent.

Now it's time for the real thing. We are going to need what we learned in the compiler principles course at university.

A compiler has the following components. The first is the symbol table, which defines the symbols of the language you compile.

The second is the lexer, the third is the list lexer (ListLexer), the fourth is the Token, and then of course there is the AST, or Abstract Syntax Tree. I guess your head is already spinning.

So let's get to the code, starting with the symbol table.




<?php

/*
 * This file is part of the ByteFerry/Rql-Parser package.
 *
 * (c) BardoQi <67158925@qq.com>
 *
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
 */


declare(strict_types=1);

namespace ByteFerry\RqlParser\Lexer;

use ReflectionClass;
use ByteFerry\RqlParser\AstBuilder as Ast;
/**
 * Class Symbols
 *
 * @package ByteFerry\RqlParser
 */
class Symbols
{

    /**
     * The keys of the symbol list of the Lexer
     *
     * @var string
     */
    public const  T_WORD                =  'T_WORD';
    public const  T_STRING              =  'T_STRING';
    public const  T_OPEN_PARENTHESIS    =  'T_OPEN_PARENTHESIS';   // (
    public const  T_CLOSE_PARENTHESIS   =  'T_CLOSE_PARENTHESIS';  // )
    public const  T_PLUS                =  'T_PLUS';               // +
    public const  T_COMMA               =  'T_COMMA';              // ,
    public const  T_MINUS               =  'T_MINUS';              // -
    public const  T_COLON               =  'T_COLON';              // :

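    /**
     * The symbol list of the Lexer
     *
     * Of course we use regular expressions to get the tokens, because that gives the simplest code.
     * Each expression is a named group, so the Lexer can tell which symbol matched.
     *
     * @var array
     */
    public static $symbol_expressions = [
        'T_WORD'                => '(?<T_WORD>\w+_*?\w*?)',                    // word
        'T_OPEN_PARENTHESIS'    => '(?<T_OPEN_PARENTHESIS>\({1})',             // (
        'T_CLOSE_PARENTHESIS'   => '(?<T_CLOSE_PARENTHESIS>\){1})',            // )
        // NOTE: the name of the second group was lost when this listing was published, so it is left unnamed here.
        'T_STRING'              => '(?<T_STRING>".*?")|(\.{1})',               // ".*"
        'T_COLON'               => '(?<T_COLON>:{1})',                         // :
        'T_COMMA'               => '(?<T_COMMA>,{1})',                         // ,
        'T_PLUS'                => '(?<T_PLUS>\+{1})',                         // +
        'T_MINUS'               => '(?<T_MINUS>\-{1})',                        // -
    ];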

    /**
     * We use the rules to ensure the RQL is correct.
     * We put the rules here only to make maintenance easier.
     * This is where the grammar check rules are defined: which tokens may follow each token.
     *
     * @var array
     */
    public static $rules = [
        'T_WORD'              => ['T_OPEN_PARENTHESIS', 'T_CLOSE_PARENTHESIS', 'T_COMMA', 'T_COLON'],
        'T_STRING'            => ['T_OPEN_PARENTHESIS', 'T_CLOSE_PARENTHESIS', 'T_COMMA', 'T_COLON'],
        'T_OPEN_PARENTHESIS'  => ['T_WORD', 'T_STRING', 'T_PLUS', 'T_MINUS', 'T_CLOSE_PARENTHESIS'],
        'T_CLOSE_PARENTHESIS' => ['T_CLOSE_PARENTHESIS', 'T_COMMA'],
        'T_COLON'             => ['T_WORD', 'T_STRING'],
        'T_COMMA'             => ['T_WORD', 'T_STRING', 'T_OPEN_PARENTHESIS', 'T_PLUS', 'T_MINUS'],
        'T_PLUS'              => ['T_WORD'],
        'T_MINUS'             => ['T_WORD'],
    ];

    /**
     * List of operator aliases
     * This is where the aliases of RQL keywords are mapped.
     *
     * @var array
     */
    public static $type_alias = [
        'plus'   => 'increment',
        'minus'  => 'decrement',
        'cols'   => 'columns',
        'only'   => 'columns',
        'field'  => 'columns',
        'select' => 'columns',
        'aggr'   => 'aggregate',
        'mean'   => 'avg',
        'nin'    => 'out',
    ];

    /**
     * Mapping the type to the node type
     *
     * @var array
     */
    public static $type_mappings = [
        'aggr'      => 'N_COLUMN',
        'aggregate' => 'N_COLUMN',
        'all'       => 'N_QUERY',
        'and'       => 'N_LOGIC',
        'any'       => 'N_QUERY',
        'arr'       => 'N_ARRAY',
        'avg'       => 'N_AGGREGATE',
        'between'   => 'N_PREDICATE',
        'cols'      => 'N_COLUMN',
        'columns'   => 'N_COLUMN',
        'count'     => 'N_QUERY',
        'create'    => 'N_QUERY',
        'data'      => 'N_DATA',
        'decrement' => 'N_QUERY',
        'delete'    => 'N_QUERY',
        'distinct'  => 'N_COLUMN',
        'empty'     => 'N_CONSTANT',
        'eq'        => 'N_PREDICATE',
        'except'    => 'N_COLUMN',
        'exists'    => 'N_QUERY',
        'false'     => 'N_CONSTANT',
        'filter'    => 'N_FILTER',
        'first'     => 'N_QUERY',
        'ge'        => 'N_PREDICATE',
        'gt'        => 'N_PREDICATE',
        'having'    => 'N_FILTER',
        'in'        => 'N_PREDICATE',
        'increment' => 'N_QUERY',
        'is'        => 'N_PREDICATE',
        'le'        => 'N_PREDICATE',
        'like'      => 'N_PREDICATE',
        'limit'     => 'N_LIMIT',
        'lt'        => 'N_PREDICATE',
        'max'       => 'N_AGGREGATE',
        'mean'      => 'N_AGGREGATE',
        'min'       => 'N_AGGREGATE',
        'minus'     => 'N_QUERY',
        'ne'        => 'N_PREDICATE',
        'nin'       => 'N_PREDICATE',
        'not'       => 'N_LOGIC',
        'null'      => 'N_CONSTANT',
        'one'       => 'N_QUERY',
        'only'      => 'N_COLUMN',
        'or'        => 'N_LOGIC',
        'out'       => 'N_PREDICATE',
        'plus'      => 'N_QUERY',
        'search'    => 'N_SEARCH',
        'select'    => 'N_COLUMN',
        'sort'      => 'N_SORT',
        'sum'       => 'N_AGGREGATE',
        'true'      => 'N_CONSTANT',
        'update'    => 'N_QUERY',
        'values'    => 'N_COLUMN',
    ];

    /**
     * Mapping the node type to the class
     *
     * @var array
     */
    public static $class_mapping = [
        'N_AGGREGATE' =>    Ast\AggregateNode::class,
        'N_ARRAY'=>         Ast\ArrayNode::class,
        'N_COLUMN' =>       Ast\ColumnsNode::class,
        'N_CONSTANT' =>     Ast\ConstantNode::class,
        'N_DATA' =>         Ast\DataNode::class,
        'N_FILTER' =>       Ast\FilterNode::class,
        'N_LIMIT' =>        Ast\LimitNode::class,
        'N_LOGIC' =>        Ast\LogicNode::class,
        'N_PREDICATE' =>    Ast\PredicateNode::class,
        'N_QUERY' =>        Ast\QueryNode::class,
        'N_SEARCH' =>       Ast\SearchNode::class,
        'N_SORT' =>         Ast\SortNode::class,
    ];

    /**
     * The mapping between the RQL operators and the actual operators is defined here.
     *
     * @var array
     */
    public static $operators = [
        'eq'       => '=',
        'ne'       => '<>',
        'gt'       => '>',
        'ge'       => '>=',
        'lt'       => '<',
        'le'       => '<=',
        'is'       => 'is',
        'in'       => 'in',
        'out'      => 'not in',
        'like'     => 'like',
        'between'  => 'between',
        'contains' => 'contains',
    ];

    /**
     * Query type mapping
     * This defines whether each RQL query type reads or writes.
     *
     * @var array
     */
    public static $query_type_mapping = [
        'all'       => 'Q_READ',
        'any'       => 'Q_READ',
        'count'     => 'Q_READ',
        'create'    => 'Q_WRITE',
        'decrement' => 'Q_WRITE',
        'delete'    => 'Q_WRITE',
        'exists'    => 'Q_READ',
        'first'     => 'Q_READ',
        'increment' => 'Q_WRITE',
        'one'       => 'Q_READ',
        'update'    => 'Q_WRITE',
    ];

    /**
     * The following two static functions give the Lexer access to the symbols.
     *
     * @return array
     * @throws \ReflectionException
     */
    public static function getSymbolsKey()
    {
        $reflect = new ReflectionClass(__CLASS__);
        return $reflect->getConstants();
    }

    /**
     * @return string
     */
    public static function makeExpression(){
        $expression = '/';
        $expression .= implode('|', self::$symbol_expressions);
        return $expression . '/';
    }
}
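The `$rules` table in the Symbols class lists, for each token type, which token types are allowed to follow it. How the package applies the table is not shown in this part, so the following is only a minimal sketch of one way such a table could be checked, assuming the tokens are already available as a plain list of type names. The constant `FOLLOW_RULES` and the function `findRuleViolation` are hypothetical names introduced for this illustration.

```php
<?php

declare(strict_types=1);

// Follow rules copied from Symbols::$rules: key = token type, value = types allowed next.
const FOLLOW_RULES = [
    'T_WORD'              => ['T_OPEN_PARENTHESIS', 'T_CLOSE_PARENTHESIS', 'T_COMMA', 'T_COLON'],
    'T_STRING'            => ['T_OPEN_PARENTHESIS', 'T_CLOSE_PARENTHESIS', 'T_COMMA', 'T_COLON'],
    'T_OPEN_PARENTHESIS'  => ['T_WORD', 'T_STRING', 'T_PLUS', 'T_MINUS', 'T_CLOSE_PARENTHESIS'],
    'T_CLOSE_PARENTHESIS' => ['T_CLOSE_PARENTHESIS', 'T_COMMA'],
    'T_COLON'             => ['T_WORD', 'T_STRING'],
    'T_COMMA'             => ['T_WORD', 'T_STRING', 'T_OPEN_PARENTHESIS', 'T_PLUS', 'T_MINUS'],
    'T_PLUS'              => ['T_WORD'],
    'T_MINUS'             => ['T_WORD'],
];

/**
 * Hypothetical checker: return the index of the first token type that is not
 * allowed after its predecessor, or null when the whole sequence is valid.
 *
 * @param string[] $types
 */
function findRuleViolation(array $types): ?int
{
    for ($i = 1, $count = count($types); $i < $count; $i++) {
        $allowed = FOLLOW_RULES[$types[$i - 1]] ?? [];
        if (!in_array($types[$i], $allowed, true)) {
            return $i;
        }
    }
    return null;
}

// Token types for eq(a,b)
$valid = ['T_WORD', 'T_OPEN_PARENTHESIS', 'T_WORD', 'T_COMMA', 'T_WORD', 'T_CLOSE_PARENTHESIS'];
// Token types for eq(a b): two words in a row
$invalid = ['T_WORD', 'T_OPEN_PARENTHESIS', 'T_WORD', 'T_WORD', 'T_CLOSE_PARENTHESIS'];

var_dump(findRuleViolation($valid));   // NULL
var_dump(findRuleViolation($invalid)); // int(3)
```

Keeping the rules as data means a grammar change only touches the table, not the checking code.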

As you can see, the Symbols class is not very idiomatic: it is not just a symbol table but also a collection of mappings. Why do it this way? The reason is simple: all configuration and mappings are concentrated in one place, which makes later upgrades and modifications easier. Next, we will write a parser that uses the Lexer class to convert the input RQL into an array of tokens.
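To see how the mapping tables can work together, here is a minimal sketch that resolves an operator name first through the alias table and then through the type table. It uses small excerpts of `$type_alias` and `$type_mappings`; the function `resolveNodeType` is hypothetical, and the lookup order (alias first, then node type) is an assumption, not something this part of the series confirms.

```php
<?php

declare(strict_types=1);

// Excerpts of the tables in Symbols; the full lists are in the class itself.
$type_alias = ['select' => 'columns', 'mean' => 'avg', 'nin' => 'out'];
$type_mappings = [
    'columns' => 'N_COLUMN',
    'avg'     => 'N_AGGREGATE',
    'out'     => 'N_PREDICATE',
    'eq'      => 'N_PREDICATE',
];

/**
 * Hypothetical resolver: replace an alias by its canonical name, then look up
 * the node type. Returns null for words that are not operators at all.
 */
function resolveNodeType(string $word, array $aliases, array $types): ?string
{
    $canonical = $aliases[$word] ?? $word;
    return $types[$canonical] ?? null;
}

echo resolveNodeType('select', $type_alias, $type_mappings), PHP_EOL; // N_COLUMN
echo resolveNodeType('nin', $type_alias, $type_mappings), PHP_EOL;    // N_PREDICATE
echo resolveNodeType('eq', $type_alias, $type_mappings), PHP_EOL;     // N_PREDICATE
var_dump(resolveNodeType('user_id', $type_alias, $type_mappings));    // NULL
```

Because every table lives in one class, adding a new alias is a one-line change in one place, which is the maintenance benefit described above.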

Our parser class code looks like this:


<?php

declare(strict_types=1);
/*
 * This file is part of the ByteFerry/Rql-Parser package.
 *
 * (c) BardoQi <67158925@qq.com>
 *
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
 */

namespace ByteFerry\RqlParser;

use ByteFerry\RqlParser\AstBuilder\NodeVisitor;        // The node visitor class
use ByteFerry\RqlParser\Lexer\Lexer;                   // The Lexer class
use ByteFerry\RqlParser\AstBuilder\NodeInterface;      // The node interface
use ByteFerry\RqlParser\Lexer\Token;                   // The Token class
use ByteFerry\RqlParser\Lexer\ListLexer;               // The ListLexer class
use ByteFerry\RqlParser\AstBuilder\ParamaterRegister;  // The parameter registry class

/**
 * Class Parser
 *
 * @package ByteFerry\RqlParser
 */
class Parser
{
    /**
     * @var NodeInterface[]
     */
    protected $node_list = [];

    /**
     * If you don't know the visitor pattern, you might want to read a book on design patterns.
     *
     * @param \ByteFerry\RqlParser\Lexer\ListLexer $ListLexer
     *
     * @return \ByteFerry\RqlParser\AstBuilder\NodeInterface[]
     */
    protected function load(ListLexer $ListLexer){
        $ListLexer->rewind();   // Reset the cursor to the first token in the ListLexer
        /** @var Token $token */
        $token = $ListLexer->current();  // Read the current token, i.e. the first one
        // Start consuming the tokens one by one
        for (; false !== $token; $token = $ListLexer->consume()) {

            $symbol = $token->getSymbol();   // Get the symbol of the token
            /** @var NodeInterface $node */
            $node = NodeVisitor::visit($symbol);  // Use the visitor pattern to get the real node object

            $node->load($ListLexer);  // Let the node load itself from the ListLexer
            $this->node_list[] = $node;  // Store the parsed node in the array
        }
        return $this->node_list;  // Return the node list
    }

    /**
     * This function is fairly simple: it returns an instance of a different class depending on the type.
     *
     * @param bool $is_fragmaent
     *
     * @return QueryInterface
     */
    protected static function getOutputObject($is_fragmaent = false){
        if (false === $is_fragmaent) {
            return Query::of();
        }
        return Fragment::of();
    }

    /**
     * This is the entry point to everything.
     *
     * @param string $string       the RQL string passed in
     * @param bool   $is_fragmaent whether the RQL passed in is a fragment or a complete query
     *
     * @return array
     * @throws \ByteFerry\RqlParser\Exceptions\RegexException
     */
    public static function parse($string, $is_fragmaent = false)
    {

        /** @var ListLexer $tokens */   // First, the Lexer converts the RQL string into a list of tokens
        $tokens = Lexer::of()->tokenise($string);

        $instance = new static();  // Create an instance of the current class

        ParamaterRegister::newInstance();   // Initialize the parameter registry

        /** @var NodeInterface[] $node_list */
        $node_list = $instance->load($tokens);  // Convert $tokens to a list of nodes using the load method
        $ir_list = [];
        /** @var NodeInterface $node */
        foreach ($node_list as $node) {
            $ir_list[] = $node->build();   // Build each node into the precompiled (intermediate) list
        }

        $queries = [];
        foreach ($ir_list as $ir) {
             $query = self::getOutputObject($is_fragmaent);   // Convert it to an array of Query objects
             $queries[] = $query->from($ir);
        }

        return $queries;   // So far, even the most complex RQL takes less than 3 ms, which is quite efficient
    }
}

As you can see, this parser is just the top layer above the abstract syntax tree. It only does what it knows how to do, which is to call the corresponding classes and let them do the work.
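The implementation of `NodeVisitor::visit()` is not shown in this part, so the following is only a guess at the general shape of such a dispatch step: look up the node type for a symbol, then instantiate the class mapped to that type. All names here (`SketchNode`, `SketchPredicateNode`, `SketchLogicNode`, `SketchNodeVisitor`) are hypothetical stand-ins and not the package's classes.

```php
<?php

declare(strict_types=1);

// Hypothetical node interface; the package has its own NodeInterface.
interface SketchNode
{
    public function describe(): string;
}

final class SketchPredicateNode implements SketchNode
{
    private string $symbol;

    public function __construct(string $symbol)
    {
        $this->symbol = $symbol;
    }

    public function describe(): string
    {
        return "predicate({$this->symbol})";
    }
}

final class SketchLogicNode implements SketchNode
{
    private string $symbol;

    public function __construct(string $symbol)
    {
        $this->symbol = $symbol;
    }

    public function describe(): string
    {
        return "logic({$this->symbol})";
    }
}

final class SketchNodeVisitor
{
    // Tiny excerpts in the spirit of $type_mappings and $class_mapping.
    private const TYPE_MAPPINGS = ['eq' => 'N_PREDICATE', 'gt' => 'N_PREDICATE', 'and' => 'N_LOGIC'];
    private const CLASS_MAPPING = [
        'N_PREDICATE' => SketchPredicateNode::class,
        'N_LOGIC'     => SketchLogicNode::class,
    ];

    public static function visit(string $symbol): SketchNode
    {
        $type = self::TYPE_MAPPINGS[$symbol] ?? null;
        if (null === $type) {
            throw new InvalidArgumentException("Unknown symbol: {$symbol}");
        }
        $class = self::CLASS_MAPPING[$type];

        return new $class($symbol);
    }
}

echo SketchNodeVisitor::visit('eq')->describe(), PHP_EOL;  // predicate(eq)
echo SketchNodeVisitor::visit('and')->describe(), PHP_EOL; // logic(and)
```

The caller never names a concrete node class, so the loop in the parser stays the same no matter how many node types are added.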

Next, the Lexer class comes into play.



<?php

/*
 * This file is part of the ByteFerry/Rql-Parser package.
 *
 * (c) BardoQi <67158925@qq.com>
 *
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
 */

declare(strict_types=1);

namespace ByteFerry\RqlParser\Lexer;

use ByteFerry\RqlParser\Abstracts\BaseObject;
use ByteFerry\RqlParser\Exceptions\ParseException;

/**
 * Class Lexer
 *
 * @package ByteFerry\RqlParser
 */
class Lexer extends BaseObject
{

    /**
     * @var array
     */
    protected $symbol_keys;

    /**
     * @var int|string
     */
    protected $previous_type = -1;


    /**
     * @var ListLexer|null
     */
    protected $listLexer = null;  // The ListLexer that collects the tokens

    /**
     * Lexer constructor
     *
     * We need to get the array of symbol keys first!
     */
    public function __construct(){
        $this->symbol_keys = Symbols::getSymbolsKey();  // Load the symbol constants into an array
    }

    /**
     * The match data is under the matching key, with an offset != -1
     *
     * @param $match
     *
     * @return array
     */
    protected function getMatch($match){
        foreach ($this->symbol_keys as $key) {  // Find the group that actually matched
            if (isset($match[$key]) && (-1 !== $match[$key][1])) {
                return [$key => $match[$key]];  // Convert to a usable format
            }
        }
        return [];
    }

    /**
     * @param $match
     *
     * @return mixed
     */
    protected function addToken($match){

        $key = key($match);  // $match has the form [$key => $match[$key]]

        [$symbol, $offset] = $match[$key];  // Extract the symbol and its offset
        // The token is added to the listLexer via addItem; it carries previous_type, recorded this early to avoid further processing later
        $this->listLexer->addItem(Token::from($key, $symbol, $this->previous_type));

        /**
         * set the next_token_type for last token
         */
        $this->listLexer->setNextType($key);   // For the last token, record what type the next token is

        $this->previous_type = $key;  // Set previous_type to the current key

        return $offset + strlen($symbol);  // Return the new offset so the for loop knows whether it is finished
    }


    /**
     * We called this from the Parser class, so what does it do?
     *
     * @param $rql_str
     *
     * @return \ByteFerry\RqlParser\Lexer\ListLexer
     * @throws \ByteFerry\RqlParser\Exceptions\ParseException
     */
    public function tokenise($rql_str){
        // First, create a ListLexer instance
        $this->listLexer = ListLexer::of();
        /**
         * using all the regular expressions
         */
        $math_expression = Symbols::makeExpression();  // Combine all the regular expressions into a single expression

        $rql_str = trim($rql_str);   // Trim surrounding whitespace

        $end_pos = strlen($rql_str);   // Get the length

        for ($offset = 0; $offset < $end_pos;) {  // Loop, matching the expression
            preg_match($math_expression, $rql_str, $result, PREG_OFFSET_CAPTURE, $offset);
            if (preg_last_error() !== PREG_NO_ERROR) {  // If there is an error, throw an exception: it is important to be programmer friendly and not become a second Tp
                throw new ParseException(array_flip(get_defined_constants(true)['pcre'])[preg_last_error()]);
            }

            /**
             * get the result from matches
             */
            $match = $this->getMatch($result);  // Format the matching result

            /**
             * update the offset
             */
            $offset = $this->addToken($match);  // Add the token to this class's token list
        }
        if (0 !== $this->listLexer->getLevel()) {   // If the parentheses do not match, that is a syntax error, so throw an exception
            throw new ParseException('The bracket are not paired.');
        }

        return $this->listLexer;  // Return the listLexer
    }
}

As we can see, the Lexer does only one thing: it converts the parts of the string matched by the regular expression into tokens and writes them to the ListLexer for further processing. (to be continued)
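The core technique of the Lexer, one combined regular expression made of named groups and matched repeatedly with `PREG_OFFSET_CAPTURE`, can be tried in isolation. The following is a minimal sketch under simplifying assumptions: it uses a reduced symbol table with simpler patterns, returns plain arrays instead of Token objects, and does no grammar or bracket checking. The names `SYMBOL_EXPRESSIONS` and `sketchTokenise` are hypothetical.

```php
<?php

declare(strict_types=1);

// A reduced symbol table: one named group per token type.
const SYMBOL_EXPRESSIONS = [
    'T_WORD'              => '(?<T_WORD>\w+)',
    'T_OPEN_PARENTHESIS'  => '(?<T_OPEN_PARENTHESIS>\()',
    'T_CLOSE_PARENTHESIS' => '(?<T_CLOSE_PARENTHESIS>\))',
    'T_STRING'            => '(?<T_STRING>".*?")',
    'T_COMMA'             => '(?<T_COMMA>,)',
];

/**
 * Hypothetical tokenizer: returns a list of [type, text] pairs.
 *
 * @return array<int, array{0: string, 1: string}>
 */
function sketchTokenise(string $rql): array
{
    // Join all groups into one alternation, as Symbols::makeExpression() does.
    $expression = '/' . implode('|', SYMBOL_EXPRESSIONS) . '/';
    $tokens = [];
    $length = strlen($rql);

    for ($offset = 0; $offset < $length;) {
        if (1 !== preg_match($expression, $rql, $result, PREG_OFFSET_CAPTURE, $offset)) {
            break; // Nothing more to match (or a regex error)
        }
        foreach (array_keys(SYMBOL_EXPRESSIONS) as $type) {
            // Groups that did not take part in the match are missing or have offset -1.
            if (isset($result[$type]) && -1 !== $result[$type][1]) {
                [$text, $position] = $result[$type];
                $tokens[] = [$type, $text];
                $offset = $position + strlen($text); // Continue right after this token
                break;
            }
        }
    }

    return $tokens;
}

foreach (sketchTokenise('eq(name,"Tom")') as [$type, $text]) {
    echo $type, ' ', $text, PHP_EOL;
}
// T_WORD eq
// T_OPEN_PARENTHESIS (
// T_WORD name
// T_COMMA ,
// T_STRING "Tom"
// T_CLOSE_PARENTHESIS )
```

The named groups are what make a single `preg_match` call sufficient: the group name that carries a real offset tells us the token type, so no second pass over the matched text is needed.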

Read on:

Write a compiler from scratch in PHP (2)

Write a compiler from scratch in PHP (3)

Write a compiler from scratch in PHP (4)

Write a compiler from scratch in PHP (5)

