This article is excerpted from "Yukihiro Matsumoto: Design and Implementation of Programming Languages"

1-1 The meaning of creating your own programming language

By actually creating a new programming language, you can learn how programming languages are designed and implemented. With the spread of open source, the barrier to creating a new language has dropped dramatically. Creating a programming language not only raises your value as a technologist; it is also tremendous fun.

Everyone knows me as the author of the programming language Ruby, but I am really a programming language junkie, as passionate about programming languages as anyone. Ruby is the biggest product of that passion, though it might better be called a byproduct of it. Its popularity looks remarkable, but I owe that more to luck than to my own strength. Ruby has been around for more than twenty years, and without all the events and encounters along the way, it would not be where it is today.

Enter the world of creating programming languages

Have you ever created a programming language? Programming languages are familiar to anyone with programming experience, yet people tend to assume that languages are simply already there; perhaps it has never occurred to them to create a new one themselves. That is understandable.

Unlike the languages people speak (natural languages), every programming language in the world was created by someone, somewhere. They do not occur naturally; they are designed and implemented with clear intent and purpose. If it hadn't been for the people who created programming languages (the language authors), we would still be programming in assembly language today.

Programming languages have been around since people started programming, and the history of programming is the history of programming languages.

The goal of this book is to create your own programming language. Some readers may be thinking, "What's the point of inventing a programming language now?" I'll answer that question later; first, let's take a look back at the history of programming languages.

The history of programming languages created by individuals

Early programming languages were created by people who worked with computers professionally, mostly at corporate research institutes (FORTRAN, PL/1), universities (LISP), and standards committees (ALGOL, COBOL). Designing and developing programming languages was, in other words, a job for professionals. That tradition began to change with the spread of personal computers in the 1970s. Some computer hobbyists, having acquired machines of their own, started programming out of interest and even went on to develop new programming languages.

One of the most representative examples is BASIC. BASIC was originally created as a teaching language at Dartmouth College in the United States. Its simple syntax, and the minimal code needed to get basic functionality working, made it popular, and it was widely used by programming enthusiasts in the 1970s.

These hobbyists also began developing their own versions of BASIC. At the time, personal computers 1 had at most a few kilobytes of memory, and their BASICs were small-scale implementations that could run on machines with that little memory. Some of these small BASICs were less than one kilobyte in size and ran in around four kilobytes of memory, which is astonishing compared with the memory appetite of today's language processors.

1 Personal computers of this kind were commonly called "micoms" at the time, an abbreviation of "microcomputer" or, by another account, "my computer."

The era of microcomputer magazines

Small-language (Tiny language) processors, typified by these personally developed BASICs, were soon released in various forms. Some were published in computer magazines as dump listings; others were distributed on sonosheets, with the program data recorded as audio. I suspect few people know about sonosheets anymore: a sonosheet is a thin record made of plastic, and hardly anyone even uses the word "record" these days. Computer enthusiasts of the time are said to have connected record players to their computers to read in the data, instead of the tape recorder, then the most common external storage device.

During the heyday of computer magazines (then called microcomputer magazines) in the 1970s and 1980s, the following four magazines competed fiercely in Japan.

  • RAM (Kosaido Publishing)
  • Micom (Dempa Shimbunsha)
  • I/O (Kohgakusha)
  • ASCII (ASCII Corporation)

Of those four magazines, only I/O is still published, though it is not what it once was. As someone who knew the scene in those days, my heart fills with emotion just thinking about it.

Later, Micom BASIC Magazine was spun off from Micom, and a great deal happened after that, but if I go on reminiscing like an old-timer there will be no end to it, so let's stop there. Ask today's programmers in their 30s and 40s, though, and many will talk happily about that era.

Microcomputer magazines of the time carried not only BASIC but several other small-scale languages as well, such as GAME and TL/1. These are fascinating reflections of their time; I cover them at the end of this section, so be sure to read that part.

The circumstances behind personally created programming languages

Why did individually created programming languages flourish in the late 1970s and early 1980s? I think the biggest reason was how hard it was to get hold of a development environment.

The microcomputer widely used in the late 1970s was the TK-80 (Figure 1-1), a single-board machine with its circuit board exposed; many units were semi-finished kits that buyers had to solder together themselves. A machine like that could hardly come with a development environment; to run any software, you first had to key in machine language yourself.

Figure 1-1 TK-80

It was not until the late 1970s that "finished computers" like the PC-8001 and MZ-80 appeared. Even these, however, came with at most a BASIC development environment, so you could hardly choose your development language freely. Commercial language processors did exist, but with a C compiler priced at 198,000 yen, they were not easily affordable for ordinary people. Out of such circumstances grew the enthusiasm to create a programming language of one's own.

Today, getting a development environment for a language is no longer any trouble. All sorts of programming languages and development environments are released as open source software, and even non-open-source ones can often be obtained for free over the web.

Wouldn’t it make sense to create your own programming language now? If the answer to this question is “yes,” then the book ends at the beginning of chapter 1.

I think (and for the sake of the book) that the answer to that question is no. Even now, creating your own new programming language makes sense, and it makes a lot of sense.

And many of today’s most widely used programming languages were designed and developed by individuals with easily accessible development environments. If personal development programming languages really didn’t make sense, languages like Ruby, Perl, Python, and Clojure wouldn’t exist.

Admittedly, languages such as Java, JavaScript, Erlang, and Haskell were developed as part of business or research, and would likely have emerged in some form regardless.

Why create a new programming language

So what exactly drives individuals to design and develop programming languages today? Looking back at my own experience and at the views of other language authors, I can think of several reasons.

  • Improve your programming skills
  • Improve design ability
  • Build your personal brand
  • To be free

First, implementing a programming language can be called a comprehensive art of computer science. Lexical analysis and syntax analysis, the foundations of a language processor, also apply elsewhere, for example to implementing data protocols in network communication.

Implementing the libraries that provide a language's functionality, and the data structures inside them, is computer science in its purest form. Programming languages in particular are used so widely that it is hard to predict what they will be used for, which makes libraries and data structures harder to implement, but also more interesting.

In addition, a programming language is an interface between people and computers. Designing such an interface requires looking deeply at how people think and what expectations lie latent in their subconscious. Observation of this kind is also useful outside programming languages: in application programming interface (API) design, user interface (UI) design, and even user experience (UX) design.

Improve your personal brand

It may come as a surprise to some, but there are actually quite a few people in the IT industry who are interested in programming languages. That should be no surprise, really, since programming is so closely tied to programming languages. Events and conferences about programming languages tend to draw crowds; that is how attractive languages are. And so many people will find a new language online and start trying it out. Ruby, for example, attracted an astonishing 200 subscribers to its mailing list within two weeks of its release online in 1995.

But while many people are willing to try out new programming languages, few go so far as to design and implement a practical one, something beyond a so-called toy language. That is precisely why, I guarantee, you will be respected just for designing a useful programming language.

In this era of open source, a presence in the technical community matters a great deal to a technologist's survival. You can gain a certain standing simply by open-sourcing your software, but the special aura of having created a programming language raises your personal brand further still.

Fun first

Above all, designing and implementing a programming language is more interesting than almost anything else. It really is, just as challenging engineering rooted in computer science always is. It is also fascinating that the design of a programming language shapes how the programmers who use it think, and can even influence their minds.

Generally speaking, a programming language is something handed to you from elsewhere, with an air of being untouchable. If you create your own, that problem disappears entirely. You can design it however you like, and feel free to change it whenever you are unsatisfied or have a better idea. In a sense, this is the ultimate freedom.

Programming is, in a sense, the pursuit of freedom. By programming ourselves, we gain a freedom that we don’t enjoy when we simply use someone else’s software. That, at least for me, is an important motivation for programming. For me, creating programming languages is a means to a higher degree of freedom and a source of fun and joy.

Why don’t more people create new programming languages

While creating your own programming language has many benefits, not everyone does it. As mentioned above, quite a few people are interested in programming languages, but few set out to create one. Even "quite a few," as a share of the whole population, is small enough to fall within the margin of error, so perhaps it is no surprise that those motivated to create new languages are rarer still.

I myself had been fascinated by programming languages for years, but it was not until I went to university to major in computer science that I noticed that not everyone shared this interest. Until then I had grown up deep in the countryside, not surrounded by people who liked programming, so I had no way of knowing. Whether that was lucky or unlucky for me, I cannot say.

"Am I different from the others?" When I realized this, I was shocked. Microcomputer magazines of the day were full of articles about TL/1 and other programming languages, so I had assumed that people interested in programming (like me) would naturally be fascinated by programming languages. They weren't.

Needless to say, people who are not interested in programming languages have a hard time getting to the point of designing and implementing them themselves.

I have thought long and hard about the cause of this. As a language designer I have encouraged others to give it a try at programming-language events, but the results have always been disappointing. Of course, everything is hard at the beginning, and starting something new takes courage. But even allowing for that, the response was poor.

There's no need to overthink it

After asking around, I realized why people don't try. First, even people interested in programming languages tend to carry a psychological barrier: the assumption that languages are simply given, not something you design and develop yourself. And the few people free of that barrier still feel that implementing a language must be difficult. In other words, they find programming languages interesting and would like to make one, but don't know how to go about it.

Come to think of it, a surprising number of books have been published on implementing programming languages, but most are university textbooks and not easy to follow. They are also full of arcane compiler-theory terms such as "grammar classes" and "Follow sets."

But remember, the goal is to create your own programming language for fun, not to acquire every piece of knowledge needed to implement one. If you convince yourself that you cannot start without the proper background, your enthusiasm will wear off first.

The first thing you need to do great things is enthusiasm. Once you have the passion to create a programming language, you should start as soon as possible and slowly acquire the knowledge you need later.

This book focuses on the basics of creating a simple language processor and how to use the tools. It does not cover the more difficult parts of programming language implementation. Rather than a theoretical background, I’d like to focus on how to design programming languages.

Tiny languages introduced in microcomputer magazines

GAME

GAME (General Algorithmic Micro Expressions) is a Tiny language derived from BASIC. Its most distinctive feature is that every keyword is a symbol and every statement is an assignment.

For example, assigning a value to "?" outputs that value; conversely, assigning "?" to a variable reads a numeric value from input. String input and output use "$". Assigning a line number to "#" acts as a GOTO statement, and assigning a line number to "!" acts as a GOSUB statement (a subroutine call).

In addition, a bare string statement such as "ABC" prints that string, and a slash (/) prints a newline.

It is an interesting programming language; sample code is shown in Figure 1-a. It looks like BASIC and yet unlike BASIC, so get a feel for it.

GAME is a very compact language: its interpreter, written in 8080 assembly language, is less than 1 KB in size. Moreover, the GAME compiler written in GAME itself, developed by Satoshi Nakajima (then a high-school student), was only about 200 lines of code. I don't know which to marvel at more, GAME's expressive power or Nakajima's skill.

TL/1

Among the other Tiny languages published in ASCII magazine at the same time was TL/1 (Tiny Language/1), named after IBM's PL/1. Unlike GAME, which used symbols under the influence of BASIC, TL/1 had a Pascal-like syntax that felt more "normal." In addition, TL/1's language processor was a compiler, making it much faster than the interpreted GAME (though, as mentioned above, GAME had a compiler too).

TL/1 is characterized by its Pascal-like syntax and by having only one variable type: the 1-byte integer. Readers may wonder how anyone could code with that, but the dominant CPUs of the day were 8-bit, so for TL/1 it was not so outlandish. Other languages, including GAME, provided 16-bit integer types while running on the same 8-bit CPUs.

What about numbers greater than 255, which cannot be represented in one byte? The answer: split them by byte and represent them with a combination of variables. For example, a 16-bit integer is kept in two variables, and overflow during arithmetic is handled by watching the carry flag (Figure 1-b(a)). On the 8-bit CPUs of the time, 16-bit integers sufficed for most processing (16 bits could address the entire address space), which is enough for a Tiny language. If you are interested, you can check the carry flag explicitly and use more variables for 24-bit or 32-bit arithmetic.
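The same idea can be sketched in C. TL/1 reads the CPU's carry flag directly (the ADC in Figure 1-b(a)); in C we instead make the carry visible by widening the intermediate result. The names here are mine:

#include <stdint.h>

/* A 16-bit value kept in two 1-byte variables, as a TL/1 program would. */
uint8_t lo, hi;

/* Add an 8-bit value to the 16-bit pair, propagating the carry by hand. */
void add16(uint8_t n)
{
    uint16_t t = (uint16_t)lo + n;  /* widen so the carry bit is visible */
    lo = (uint8_t)t;                /* low byte of the sum */
    hi += (uint8_t)(t >> 8);        /* add the carry (0 or 1) to the high byte */
}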

Handling pointers and strings

Also, Pointers can’t be represented in just 1 byte. The meM array is used here, that is, the contents of the specified address are accessed using the 16-bit address represented by hi,lo in the following expression.

mem(hi, lo)

And the following assigns the value v to that address.

mem(hi, lo) = v
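In modern C terms, the mem array amounts to indexing one big byte array with an address assembled from two 1-byte halves. A model for illustration only; the function names are mine:

#include <stdint.h>

static uint8_t memory[1 << 16];   /* a full 16-bit address space */

/* mem(hi, lo): read the byte at address hi*256+lo */
uint8_t mem_read(uint8_t hi, uint8_t lo)
{
    return memory[(uint16_t)(hi * 256 + lo)];
}

/* mem(hi, lo) = v: write the byte at address hi*256+lo */
void mem_write(uint8_t hi, uint8_t lo, uint8_t v)
{
    memory[(uint16_t)(hi * 256 + lo)] = v;
}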

At the time, personal computers (microcomputers) only had 32 KB of memory at most, so 16-bit address access was sufficient.

Then there are strings. We could of course treat a string as an array of bytes and operate on it byte by byte, but that is too cumbersome, so TL/1 provided a WRITE statement for data output.

For example, a Hello World program in TL/1 is shown in Figure 1-b(b). TL/1 variables are supposed to be 1-byte integers, yet here a string appears. In fact, WRITE is a separate piece of syntax added precisely so that strings could be handled.

Statements other than WRITE cannot handle strings, so normal string processing is impossible; only 1-byte integers can be manipulated. That may seem strange now, but even in Pascal and FORTRAN, which were no Tiny languages, input and output received special treatment; it was probably common practice at the time.

100 ----------------- Comment -----------------
110 | If what follows the line number is not blank, the line is treated as a comment
120 ----------------------------------------------
200 / "A FOR loop is written: variable=initial,final ... @=(variable+step)" /
210 A=1,10
220 ?(6)=A
230 @=(A+1)
240
300 / "IF statement example" /
310 B=1,2
320 ;=B=1 "B=1" /
330 ;=B=2 "B=2" /
340 @=(B+1)
350
400 / "Numeric input and calculation" /
410 "A=?" A=?
415 "B=?" B=?
420 "A+B=" ?=A "+" ?=B "=" ?=A+B /
430 "A*B=" ?=A "*" ?=B "=" ?=A*B /
440
500 / "Arrays and character output" /
505 -------- Let the array address be $1000
510 D=$1000
520 C=0,69
525 -------- Write as a 2-byte array
530 D(C)=(C+$20)*256+C+$20
540 @=(C+1)
560 C=0,139
570 -------- Read as a 1-byte array
580 $=D:C)
590 @=(C+1)
600
700 / "GOTO and GOSUB" /
710 I=1
720 I=I+1
730 !=1000
740 ;=I=10 #=760
750 #=720
760
900 / "Program end" /
910 #=-1
920
1000 / "Subroutine" /
1010 ?(8)=I*I
1020 ]

Figure 1-a Sample code for the GAME language

% (a)
BEGIN
  A := 255
  B := A + 2    % overflows
  C := 0 ADC 0  % add with carry
END

% (b)
BEGIN
  WRITE(0: "hello, world", CRLF)
END

Figure 1-b Sample code for the TL/1 language ((a) 16-bit addition, (b) Hello World)

Time Machine

I was planning to modify mruby

This section was the first installment of the magazine series, which began in April 2014. In it I spoke enthusiastically about the design of programming languages.

When the series began, I had not yet decided what kind of programming language I wanted to design and develop. At the time I was planning to modify mruby, a language processor I had written myself, so in the serialized version I explained how to obtain the mruby source code and how its directory structure was laid out. That walkthrough is of no practical use here, however, so it has been omitted from the book.

Although the book does not cover that part, mruby's simplicity makes it a good textbook for implementing a programming language. Those who want to read the mruby source can start at www.mruby.org/. The source code can also be downloaded from GitHub (github.com/mruby/mruby).

If you find any problems or have questions or comments along the way, please report them through GitHub's issue tracker. Since mruby development is oriented toward an international audience, describing problems in English is recommended. You can also reach me in English or Japanese via my Twitter account (@yukihiro_matz).

1-2 Language processor structure

This section briefly explains the relationship between a programming language and a language processor, and the structure of a language processor, in preparation for designing a programming language. We start with a calculator program, then introduce a practical language processor using mruby's implementation as an example.

We are going to create a programming language, yet few of us know exactly what that entails. That is because most people only ever study existing languages and never think about designing one.

Languages and language processors

Programming languages have multiple layers of structure. At the largest scale, a programming language divides into the "language," the set of conventions for communication, and the "language processor," the software that processes that language so it runs on a computer. When people say "programming language," they often conflate the two.

A language is made up of grammar and vocabulary. The grammar is the set of rules specifying how a valid program in the language must be expressed; the vocabulary is the collection of functions a program written in the language can call, which grows gradually as libraries are added. When we speak of vocabulary in the context of language design, we mean the functionality built into the language from the start.

I don't know if you've noticed, but no software is involved in defining grammar and vocabulary. Designing "the strongest language I can imagine" requires no computer at all. In fact, when I was a high-school student in the countryside, with little real programming skill, I filled a notepad with programs in a language I hoped to design and develop someday. On a visit back to my hometown I looked for that notepad, but I couldn't find it; I must have thrown it away, which is a pity to think about. I can't remember the language itself, but it seems to have been strongly influenced by Pascal and Lisp.

A language processor is software that makes grammar and vocabulary actually run on a computer. What makes a programming language a real language, not just an idea, is a language processor. A programming language that does not work is not strictly a programming language.

The structure of a language processor

To build a language processor, you first need to know what a language and a language processor are made of. For convenience, we will use an existing language processor to introduce the structure. Without getting into technical detail, let's get an overview.

A language processor is computer science in miniature, and a very interesting piece of software. University students majoring in computing generally learn how to build one; the language processor can fairly be called one of the foundations of computer science. That also explains why books on language processors abound.

However, many books that teach how to make your own programming language devote almost all their pages to building the language processor and say almost nothing about language design. Perhaps for those books, "how to make a programming language" simply means "how to make a language processor," and there is nothing wrong with that as far as it goes. Their purpose is to teach how a language processor is built; whether you actually go on to make one is not their concern.

This book focuses on language design. However, there is no practical point in writing down my “ideal language” in a notepad as I once did, so I’ll start by explaining the basics of language processors.

First, let’s look at what a language processor is made of.

The composition of a language processor

A language processor can be roughly divided into a "compiler" that interprets the syntax, "libraries" that correspond to the vocabulary, and a "runtime (system)" that actually runs the software. The proportions of these three components vary with the language and the nature of the processor (Figure 1-2).

Figure 1-2 Components of a language processor

Early, simple languages like Tiny BASIC had little syntax; their compilers did almost nothing, and most of the processing happened at run time. A processor of this kind is called an interpreter (Figure 1-3).

Figure 1-3 BASIC language processor

An example of the "interpreted" style, with compiler and runtime integrated. In many cases the libraries are not separated either

But purely interpretive languages of this kind are increasingly rare. Most language processors today compile the program into internal code and then execute that internal code at run time; Ruby is of course one of them. This "compiler + runtime" combination is still often called "interpreted," because it looks as though the source code is being executed without transformation.

At the other extreme, languages like C, which pursue efficiency close to the machine, have no runtime; only the compiler part that interprets the syntax stands out. Such language processors are called "compiled" (Figure 1-4). Confusingly, the components of a language processor share names with these classifications of processors. In a language like C, the executable produced by the transformation can be run directly, so no runtime is needed to run it; some runtime-like work, such as memory management, is handled by the libraries and by the operating system's system calls.

Figure 1-4 C processor

An example of the "compiled" style, which outputs an executable file. Because the executable runs directly, there is almost nothing corresponding to a runtime; libraries take over some of the runtime's work (memory management, etc.)

There are language processors like Ruby's that "look interpreted but have a compiler working inside," and processors like Java's that "look compiled but have an interpreter (virtual machine) working inside." Java is a "hybrid": the compiler translates programs into virtual machine code (JVM bytecode), which the virtual machine (JVM) executes (Figure 1-5).

Figure 1-5 Java language processor

Example of a virtual machine compiler. The compiler outputs the machine code (bytecode) of the virtual machine, which is executed by the runtime (virtual machine)

Java has since grown even more complex, with techniques such as the Just-In-Time (JIT) compiler, which translates bytecode into native machine code at run time to improve execution efficiency.

Compiler composition

Next, let’s take a look at the internals of each component of a language processor.

The first is the compiler. The compiler’s job is to convert the source code of a programming language into an executable form.

Many compilers divide the conversion into several stages: "lexical analysis," "syntax analysis," "code generation," and "optimization," ordered roughly from closest to the source code to furthest from it. This is only a general classification, though; not every compiler runs every stage.

(1) Lexical analysis

Lexical analysis is, simply put, the process of converting source code from a sequence of characters into a sequence of meaningful words (tokens). Organizing the source code, which is just a string, into somewhat meaningful words makes all subsequent processing easier. For example, the Ruby program

puts "Hello\n"Copy the code

is converted by lexical analysis into

identifier(puts) string("Hello\n")

I will explain the meaning of word sequences later in the introduction to parsing.

Each time the parser calls the lexical analysis function to request the next word, the function takes characters from the source code one by one, assembles them into a word, and returns it.
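To see what such a function looks like when written by hand, here is a minimal sketch in C for the calculator. The token codes and the value-passing convention mirror the lex example in Figure 1-6 below; this is an illustration, not mruby's actual lexer:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

enum { NUM = 256, ADD, SUB, MUL, DIV, NL, END };  /* token codes */
double yylval_double;                             /* value of the last NUM */

/* Return the next token, consuming characters one by one. */
int next_token(FILE *src)
{
    int c = fgetc(src);
    while (c == ' ' || c == '\t') c = fgetc(src);   /* skip blanks */
    if (c == EOF) return END;
    if (isdigit(c)) {                /* collect digits into one NUM token */
        char buf[64]; int n = 0;
        do { buf[n++] = c; c = fgetc(src); } while (isdigit(c) && n < 63);
        ungetc(c, src);              /* the extra character belongs to the next token */
        buf[n] = '\0';
        yylval_double = strtod(buf, NULL);
        return NUM;
    }
    switch (c) {
    case '+':  return ADD;
    case '-':  return SUB;
    case '*':  return MUL;
    case '/':  return DIV;
    case '\n': return NL;
    default:
        fprintf(stderr, "lexical error.\n");
        exit(1);
    }
}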

Lexical analysis functions can be generated automatically from word-formation rules using the lex tool. For example, a lex program that generates a lexical analyzer for numbers and the four arithmetic operations is shown in Figure 1-6.

%%
"+"             return ADD;
"-"             return SUB;
"*"             return MUL;
"/"             return DIV;
"\n"            return NL;

([1-9][0-9]*)|0|([0-9]+\.[0-9]*) {
    double temp;
    sscanf(yytext, "%lf", &temp);
    yylval.double_value = temp;
    return NUM;
}

[ \t]           ;

.               { fprintf(stderr, "lexical error.\n"); exit(1); }
%%

Figure 1-6 Lex program calc.l for a calculator

Looking at the rule for numbers, you can see that regular expressions can be used to write the patterns that form words. This example has only operators, numbers, and whitespace, but rules for all sorts of words can be added along the same lines. Running the lex program (saved here as calc.l) through lex produces a C file named lex.yy.c; compiling that file gives you the yylex() function, which performs lexical analysis.

Lexical analysis, then, is easy to implement with lex, yet mruby does not use it. The reason is that in Ruby, depending on state determined during parsing, the same sequence of characters can yield different words. Lex can in fact express stateful lexical analysis, but writing the function by hand is not too hard, and I wanted to try, so I didn't use lex. I was very young when I wrote Ruby.

(2) Syntax analysis

Syntax analysis (parsing) is the process of checking whether the sequence of words produced by lexical analysis is grammatically correct, and of processing it according to the grammar.

There are several approaches to parsing. The best known and simplest is to use a parser-generation tool such as yacc (Yet Another Compiler Compiler), the so-called "compiler compiler." mruby uses yacc too; to be exact, Bison, the GNU version of yacc. Besides yacc there are other compiler compilers, such as ANTLR and BNFC, but they are not covered here.

With yacc, you write the syntax the compiler should interpret according to yacc's notation. For example, Figure 1-7 shows the syntax for calculator input.

%{
#include <stdio.h>

static void yyerror(const char *s)
{
    fputs(s, stderr);
    fputs("\n", stderr);
}

static int yywrap(void)
{
    return 1;
}
%}

%union {
    double double_value;
}

%type <double_value> expr
%token <double_value> NUM
%token ADD SUB MUL DIV NL

%%

program : statement
        | program statement
        ;

statement : expr NL
        ;

expr : NUM
     | expr ADD NUM
     | expr SUB NUM
     | expr MUL NUM
     | expr DIV NUM
     ;

%%

#include "lex.yy.c"

int main()
{
    yyparse();
}

Figure 1-7 Calculator syntax analysis calc.y

The section from the beginning up to the first %% is the definition part, where the words and their types are declared. The part between %{ and %} is inserted verbatim into the generated C program, so header includes and the like go there.

The part between the two %% markers is the calculator's syntax definition, written in a style based on Backus-Naur Form (BNF), a notation for defining syntax. The part after the second %% is also inserted verbatim into the C program; definitions of functions called from the actions go there, among other things.

Look at the syntax of the calculator

Now let's look at the calculator's syntax. For this first example I use a very simple grammar: like an ordinary desk calculator, it has no operator precedence.

Let's start with the first rule. BNF is written as a set of rules, and by default the first rule is the starting point. The first rule is shown below.

program : statement
        | program statement
        ;

This rule says that correct calculator input is a program, where "a program is a statement, or a program followed by a statement." The ":" reads as "is defined as," the "|" as "or," and words placed side by side mean that what they denote appears in sequence. Notice that the word program also appears on the right-hand side, forming a recursion; that is perfectly fine. In yacc, loops are written as recursions like this one.

Let’s move on to the next rule.

statement : expr NL
        ;

This rule means "a statement is an expr followed by NL." NL is not defined here; it is the word the lexical analyzer passes along when it encounters a newline.

Now let’s look at the definition of expr.

expr : NUM
     | expr ADD NUM
     | expr SUB NUM
     | expr MUL NUM
     | expr DIV NUM
     ;

This rule means "an expr is NUM (a numeric word), or an expr followed by an operator and a number." This, too, is a recursive loop. So "1" is a number and therefore an expr; "1+1" is an expr ("1") followed by the operator "+" and a number, and therefore an expr; and by the same reasoning "1+2+3" is an expr, as traced below. Are you getting a feel for how BNF works?
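Here is how the rules match the input 1+2+3, one step at a time (my own trace of the reductions, in the order yacc would perform them):

1 + 2 + 3
NUM(1) ADD NUM(2) ADD NUM(3)     after lexical analysis
expr ADD NUM(2) ADD NUM(3)       by expr : NUM
expr ADD NUM(3)                  by expr : expr ADD NUM
expr                             by expr : expr ADD NUM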

Let's try running the calculator program we have written so far (Figure 1-8). Process the program in Figure 1-6 with lex and the program in Figure 1-7 with yacc, then compile the generated C source file y.tab.c, and the calculator's syntax checker is complete. The calculation part is not yet implemented, so only syntax is checked: if the input is syntactically correct, nothing happens; if it is incorrect, "syntax error" is displayed and execution stops.

Figure 1-8 Compilation and running of a calculator program
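For reference, a plausible build-and-run session would look something like the following; the exact commands and messages vary by environment (with the GNU tools you would use flex and bison -y instead):

$ lex calc.l             # generates lex.yy.c
$ yacc calc.y            # generates y.tab.c, which #includes lex.yy.c
$ cc -o calc y.tab.c
$ ./calc
1+2
2*3-1
1++2
syntax error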

Implementing the calculator program

A calculator that cannot calculate is pointless, so let's make it actually do arithmetic. In yacc we can write actions that run when a rule is matched. In other words, by adding actions that compute and display results to the yacc code in Figure 1-7, the calculator is complete. Specifically, replace the statement and expr rules in Figure 1-7 with the code in Figure 1-9.

statement : expr NL { fprintf(stdout, "%g\n", $1); }
        ;

expr : NUM
     | expr ADD NUM { $$ = $1 + $3; }
     | expr SUB NUM { $$ = $1 - $3; }
     | expr MUL NUM { $$ = $1 * $3; }
     | expr DIV NUM { $$ = $1 / $3; }
     ;

Figure 1-9 Action of a calculator program

In this calculator example, calculation and display happen directly in the actions; such a processor is called a "pure interpreter." Real compilers rarely do this, because things like loops and user-defined functions cannot be handled that way. In mruby, for instance, the actions build a tree structure representing the syntax and pass it on to the code generation stage that follows. As an example of such a tree, Figure 1-10 shows an mruby if statement converted to an S-expression (in reality, linked structures).

# The Ruby program below
if cond
  puts "true"
else
  puts "false"
end

is converted into the following S-expression:

(if (lvar cond) (fcall "puts" "true") (fcall "puts" "false"))

Figure 1-10 mruby syntax tree structure

(3) Code generation

In the code generation stage, the tree structure produced by parsing is traversed, and machine code for the virtual machine is generated along the way.

Since the rise of the Java virtual machine, such "machine code" has generally been called "bytecode." The Java virtual machine's code really is organized in bytes, so the name makes sense (its ancestor Smalltalk also used byte-organized bytecode). But mruby's machine code is organized in 32-bit units, so "wordcode" might be more accurate. Because "bytecode" would be inaccurate and "wordcode" is hardly ever used, within mruby the instruction sequence is simply called iseq (instruction sequence). The final result of code generation, the program information consisting of an iseq plus attached information such as symbols, is called an irep (internal representation).

mruby's code generation does no difficult analysis. If we really wanted to, we could even generate code directly in the actions of the parsing stage. For various reasons, though, mruby splits code generation into several phases, using the S-expression-like tree structure as intermediate code.

Why the code generation process is split in mruby

The first reason is maintainability. Code can indeed be generated in the parser's actions, and doing so makes the program somewhat smaller (though not by much), but fusing parsing with code generation makes the program more complex and bugs harder to find.

The actions run in whatever order the rule patterns happen to match, so their execution order is harder to predict, and debugging harder, than in an ordinary program. For maintainability, it is wiser to keep the actions simple and have them build nothing but a syntax tree.

The price is increased memory usage, a real concern for mruby's embedded targets. Fortunately, the compiler part (parsing plus code generation) can be separated from the runtime: by converting Ruby programs to irep form in advance, no compilation is needed at run time. This saves memory even in low-memory environments, so memory usage need not worry us too much.

When the tree structure produced by parsing passes through code generation, an irep like the one in Figure 1-11 results. An irep is really a binary structure that is hard for humans to read, so here it has been converted to an understandable form.

Figure 1-11 Code generation result (irep)

(4) Optimization

Depending on the implementation, a compiler may optimize before or after code generation. In mruby's case, only minimal optimization is done during code generation, because the nature of the Ruby language makes it difficult to optimize.

This kind of optimization, known as peephole optimization, looks only at the instructions immediately preceding the one being generated for optimization opportunities. Table 1-1 shows some of the optimizations the mruby compiler performs; a sketch of the idea in code follows the table.

Table 1-1 Optimizations in mruby

Category                        Original instructions    After optimization
Remove meaningless assignment   R1=R1                    (deleted)
Remove round-trip assignment    R2=R1; R1=R2             R2=R1
Fold assignments                R2=R1; R3=R2             R3=R1
Fold assignments                R2=1; R3=R2              R3=1
Remove duplicate return         return; return           return
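To make the mechanism concrete, here is a minimal sketch of an instruction emitter applying some of the rules in Table 1-1. The buffer type, the opcode numbers, and the bit-layout macros are hypothetical stand-ins, not mruby's actual code generator:

#include <stdint.h>

/* Hypothetical instruction layout: opcode in the low 7 bits. */
#define GET_OPCODE(i) ((i) & 0x7f)
#define GETARG_A(i)   (((i) >> 23) & 0x1ff)
#define GETARG_B(i)   (((i) >> 14) & 0x1ff)

enum { OP_MOVE = 1, OP_RETURN = 2 };   /* illustrative opcode numbers */

typedef struct {
    uint32_t iseq[1024];   /* instruction buffer */
    int len;               /* instructions emitted so far */
} codegen_t;

/* Emit one instruction, peeking only at the previous one for peephole rules. */
static void genop(codegen_t *s, uint32_t ins)
{
    if (s->len > 0) {
        uint32_t prev = s->iseq[s->len - 1];
        if (GET_OPCODE(ins) == OP_MOVE) {
            /* R1=R1: a meaningless move, emit nothing */
            if (GETARG_A(ins) == GETARG_B(ins)) return;
            /* R2=R1; R1=R2: the move back is redundant */
            if (GET_OPCODE(prev) == OP_MOVE &&
                GETARG_A(prev) == GETARG_B(ins) &&
                GETARG_B(prev) == GETARG_A(ins)) return;
        }
        /* return; return: keep only the first */
        if (GET_OPCODE(ins) == OP_RETURN && GET_OPCODE(prev) == OP_RETURN)
            return;
    }
    s->iseq[s->len++] = ins;
}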

Processing after compilation

mruby supports two kinds of processing after compilation. One is to run the compiled result directly on mruby's virtual CPU, which also uses the runtime and libraries for things like object management. The other is to write the compiled result to an external file; you can then build a program that links the compiled result in directly and executes the Ruby program with the compiler removed entirely, a very effective approach for memory-constrained embedded systems. A sketch of the latter follows.
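As an illustration of the second approach, mruby's compiler command mrbc can dump a compiled irep as a C byte array that the host program links in. A minimal sketch; the file name app.rb and the symbol name app_bytecode are my own choices:

/* Compiled ahead of time with, for example:
 *   mrbc -Bapp_bytecode app.rb
 * which emits a C file defining app_bytecode[]. */
#include <mruby.h>
#include <mruby/irep.h>

extern const uint8_t app_bytecode[];   /* generated by mrbc -B */

int main(void)
{
    mrb_state *mrb = mrb_open();       /* create an interpreter state */
    if (!mrb) return 1;
    mrb_load_irep(mrb, app_bytecode);  /* run the precompiled irep */
    mrb_close(mrb);
    return 0;
}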

Summary

This section introduced the composition of a language processor but did not touch on language design. I am not satisfied with that, but to avoid going into too much detail I will leave it there.

Time Machine

Explaining language processors is difficult

This section appeared in the May 2014 issue of the magazine and introduced the use of yacc, which any book on language processors will mention. I was a little embarrassed to trot out the well-worn calculator program as the example.

On the plus side, this section covers mruby, a practical language processor, in addition to the calculator. I included it because a calculator program cannot be used to explain code generation and optimization.

That said, I regret that the mruby coverage is limited to pointing out that such things exist. The dilemma: going deep into the mruby implementation would make the text too difficult, yet staying shallow leaves me unsatisfied. It is a hard balance to strike.

The yacc notation described in this section will come into play later, when the Streem implementation is introduced.

1-3 Virtual machines

This section introduces the implementation of the virtual machine (VM), the core of a programming language processor. After covering four techniques used in implementing virtual machines, we will look at the instructions the mruby virtual machine actually has.

As we saw in Section 1-2, it is the runtime that runs the compiled results of source code. There are several ways to implement this at runtime, and the virtual machine described in this section is one of them.

Run with a software-implemented CPU

The word virtual machine has many different meanings. In this section, it refers to “a computer implemented with software (without actual hardware)”.

This differs from what "virtual machine" means in contexts such as virtualization software and cloud computing. There, a virtual machine wraps the actual hardware in software so that multiple systems can run side by side and systems can migrate between machines. Wikipedia classifies that kind as a "system virtual machine" and the kind described in this section as a "process virtual machine."

Ruby up through version 1.8 had no (process) virtual machine. Instead it ran programs by traversing the compiler-generated syntax tree, a pointer-linked structure mirroring the Ruby program's syntax (Figure 1-12). This approach is simple, but paying the cost of a pointer access for every instruction adds up. It is one reason Ruby was called slow before Ruby 1.9 appeared.

int vm(node *node)
{
    while (node) {
        switch (node->type) {
        case NODE_ASSIGN:
            /* assignment processing */
            ...
            break;
        case NODE_CALL:
            /* method call processing */
            ...
            break;
        ...
        }
        /* move on to the next node */
        node = node->next;   /* <- this is slow */
    }
}

Figure 1-12 Syntax tree interpreter (outline)

Why was Ruby slow before

I should explain why such a simple structure runs slowly. Everyone knows that hard disk access is far slower than memory access, but what about the speed of memory itself? When writing everyday code, we rarely give memory speed a thought.

In reality, though, the distance between the CPU and memory is surprisingly great. Reading data at a given address over the memory bus is much slower than the CPU's execution speed. While accessing memory, the CPU can only wait for the data to arrive, and that waiting time hurts execution speed.

To reduce such waits, CPUs have a built-in "memory cache," or simply cache: a small amount of high-speed memory embedded in the CPU circuitry. By reading data from main memory into the cache in advance, reads and writes to memory become reads and writes to the cache, reducing memory wait time and improving processing speed.

Because the cache must fit inside the CPU, its capacity is severely limited, and very little data can be read ahead 2. To use the cache effectively, the memory that will be accessed next must already have been brought in, which is difficult in general. It works only when memory accesses have locality: because the region a program touches at any one time is small and tightly clustered, a region brought into the cache gets read and written many times.

Modern CPUs split the cache into multiple levels to increase capacity. Even so, the cache remains far smaller than main memory, and the problem of getting the right data in before it is needed remains unsolved.

Making good use of the cache in a virtual machine

Unfortunately, a syntax tree interpreter like the one in Figure 1-12 is about the worst case for cache behavior. The nodes of a syntax tree are individual structures whose addresses are not necessarily adjacent or contiguous, which makes it difficult to have the next memory region in the cache ahead of time.

If the syntax tree is instead converted into an instruction sequence stored in contiguous memory, locality of access improves, and thanks to the cache, performance improves greatly.

The virtual machine introduced in Ruby 1.9, YARV, took this approach to improve performance. YARV stands for Yet Another Ruby VM; the name reflects the fact that when its development began, several virtual machines for running Ruby were already in the works. YARV started as an experimental project, but it was the only one of those VMs that could run every feature of the Ruby language, and in the end it replaced Ruby's own execution engine.

Advantages and disadvantages of virtual machines

Java is probably the best-known language that uses a virtual machine, but the technique did not first appear with Java; it has been around since the late 1960s. Smalltalk, which emerged in the early 1970s, was known in part for its early use of bytecode. Going further back, Euler, a language based on ALGOL by Pascal's creator Niklaus Wirth, is said to have been implemented with a virtual machine. Alan Kay, the father of Smalltalk, has said that Smalltalk's virtual machine implementation was inspired by Euler's.

Pascal brings to mind UCSD Pascal. Developed at the University of California, San Diego, UCSD Pascal compiled Pascal programs to a bytecode called P-code and ran that. This made UCSD Pascal easy to port across a wide range of operating systems and CPUs, and it became widely used as a highly portable compiler.

As this suggests, the greatest advantage of a virtual machine is portability. Code generation, which must produce machine language for each different CPU, is among the most complex parts of a compiler, and redeveloping it for every new CPU is a burden for language processor developers.

Today architectures such as x86 and ARM dominate, and there are far fewer CPUs than there once were, but in the 1960s and 1970s new architectures appeared so frequently that even the same family of computers from the same company used different CPUs depending on the model. Virtual machines played a large role in easing that burden.

Furthermore, a virtual machine can be designed to fit its target language, so the instruction set can be limited to what the language needs. Compared with a general-purpose CPU, the specification can be smaller and development easier.

But virtual machines are not all upside. A VM running on an emulated virtual CPU has serious performance problems compared with executing directly on hardware; a language processor using a virtual machine can be several times, even hundreds of times, slower. Techniques such as JIT compilation can recover some of this loss.

Virtual machine implementation techniques

A real CPU implemented in hardware and a virtual machine implemented in software differ in performance characteristics. Let's look at the implementation techniques that bear on virtual machine performance. The representative ones are listed below.

(1) RISC and CISC

(2) Stack and register

(3) Instruction format

(4) Direct jump

RISC, short for Reduced Instruction Set Computer, is an architecture that raises CPU performance by reducing the number of instructions and simplifying the circuitry. Representative CPUs among the architectures popular in the 1980s were MIPS and SPARC; the ARM processors widespread in mobile devices are also RISC.

CISC, as opposed to RISC, stands for Complex Instruction Set Computer; loosely, it means "any CPU that is not RISC." CISC is complicated to implement because each instruction performs a large amount of processing and there are many kinds of instructions.

The RISC-versus-CISC debate, however, belongs to the last century; for today's hardware CPUs the distinction has lost much of its meaning, since pure RISC CPUs have fallen out of favor and are now rarely seen. Even so, SPARC survived and was adopted in machines such as Japan's K supercomputer.

Even ARM, the most successful of the RISC lineage, keeps adding instructions, drifting in the CISC direction. Meanwhile Intel's x86, the representative CISC architecture, achieves high speed by keeping its complex instruction set 3 on the surface, for compatibility with past versions, while internally converting instructions into RISC-like micro-operations (μops).

Some time ago it was even reported that the x86 mov instruction is so featureful that it is Turing complete on its own. In other words, it is theoretically possible to write any algorithm using nothing but mov.

CISC has advantages in virtual machines

For virtual machines, though, RISC versus CISC takes on a different meaning. When the machine is implemented in software, the cost of instruction fetch (IF) processing cannot be ignored: the fewer instructions needed to do the same work, the better. A good virtual machine instruction set is therefore CISC-like, with coarse-grained instructions throughout.

VM instructions should be as high-level as possible, keeping programs small. Some virtual machines pursue compactness with compound instructions that fuse frequently paired instructions into one, a technique known as "instruction fusion" or "superoperators."

Stack and register

The two main schools of virtual machine architecture are the stack VM and the register VM. A stack VM's instructions, in principle, operate on data on a stack (Figure 1-13), whereas a register VM's instructions carry register numbers and, in principle, operate on registers (Figure 1-14).

push 1 ← ① push 1 onto the stack
push 2 ← ② push 2 onto the stack
add    ← ③ add the two values on the stack and push the result back

The state of the stack when each instruction is executed

Figure 1-13 Stack VM instructions and their structure

load R1 1    ← ① set the value of register 1 to 1
load R2 2    ← ② set the value of register 2 to 2
add R1 R1 R2 ← ③ add the values of registers 1 and 2 and store the result in register 1

Figure 1-14 Register VM instructions

Compared with a register VM, a stack VM is simpler and its programs are relatively small. However, because all instructions exchange data through the stack, instructions depend heavily on their order, and optimizations such as reordering instructions are hard to apply.

A register VM's instructions carry register numbers, so its programs are relatively large. Note, though, that program size does not necessarily correlate with fetch cost, a point we will return to later. Because registers are named explicitly, instructions depend less on order, leaving more room for optimization; that said, few small-scale language implementations optimize heavily, so in practice this matters less.

So which is better, a stack VM or a register VM? The jury is still out; plenty of virtual machines use each architecture. Table 1-2 shows the two architectures across the VMs of various languages. Interestingly, even the same language may sit on either architecture depending on the implementation, sometimes a stack VM, sometimes a register VM.

Table 1-2 VM architectures of various languages

Language   Virtual machine            Architecture
Java       JVM                        Stack VM
Java       Dalvik (Android)           Register VM
Ruby       YARV (Ruby 1.9 and later)  Stack VM
Ruby       mruby                      Register VM
Lua        lua                        Register VM
Python     CPython                    Stack VM

Instruction format

With the advent of Smalltalk, the machine language (instruction sequences) interpreted by a virtual machine came to be called bytecode, because Smalltalk's instructions were expressed in byte units. The word was later popularized by Java, which inherited the byte-unit format.

Not every virtual machine has a byte-unit instruction set, though; YARV and mruby, for example, express instructions as 32-bit integers. On many CPUs a 32-bit integer is the easiest length to process and is often called a "word," so "wordcode" might be the technically correct name for such instruction sequences. But the term is awkward and never caught on; people gradually gave up on it, and such code too is sometimes simply called bytecode.

Advantages and disadvantages of wordcode

Bytecode and wordcode each have their strengths. A bytecode program is much more compact than code that must spend 32 bits on every instruction. On the other hand, one byte can represent only 256 states, so operands (the arguments of instructions) must be stored in the bytes following the opcode, increasing the number of fetches needed to pull data out of the instruction sequence. As noted earlier, fetch processing is expensive in a software-implemented VM, so wordcode has the performance advantage.

Wordcode also benefits from "address alignment." On some CPUs, directly accessing an address that is not a multiple of a certain size causes a fault; the data must instead be fetched from an aligned address (one rounded to that multiple) and the needed part extracted. Even where unaligned access does not fault, aligned addresses can be accessed much faster than unaligned ones (because of the internal splitting of accesses just described).

Alignment to multiples of 2 is called 16-bit alignment; to multiples of 4, 32-bit alignment. Wordcode instructions all satisfy the alignment rules; bytecode instructions do not. Depending on the CPU and the address, the average fetch cost per bytecode instruction can therefore be high.

In short, bytecode yields shorter instruction sequences and wins on memory usage, while wordcode wins on performance measures such as the number of fetches and the time they take.

A look at mruby's instructions

Let's look at a real virtual machine's instructions, namely mruby's, shown in Figure 1-15.

Figure 1-15 mruby instruction structure

An mruby instruction indicates its type in its trailing (low-order) seven bits. Seven bits means up to 128 instruction types can be defined; in fact, mruby defines a total of 81 instructions, including five reserved ones.

The instruction length is 32 bits, of which 7 bits determine the instruction type, leaving 25 bits for operands. mruby instructions fall into four types according to how the operand part is divided up.

Type 1: 3 operands

An instruction of type 1 has operands A, B, and C: A is 9 bits, B is 9 bits, and C is 7 bits. So operands A and B can hold values up to 511, and C up to 127. Operands A and B are mostly used to specify registers. For example, OP_MOVE, the register-to-register move instruction of this type, is written

OP_MOVE A B

and means: copy the contents of the register specified by operand B into the register specified by operand A. OP_MOVE does not use operand C.

An instruction that does use operand C is OP_SEND, which calls a method.

OP_SEND A B C

This calls the method indicated by symbol B 4 (to be exact, the B-th symbol in the symbol table) on the object stored in the register specified by operand A (call it register A). The registers from A+1 onward, C of them, hold the method's arguments, and the return value of the call is stored back in register A.

Symbol is the value that the language processor uses internally to identify method names. Different strings are assigned different values.

As with the OP_MOVE directive, some operands are not used in several instructions of type 1. Although this space is wasted, it is acceptable for ease of access and efficiency.

Type 2: 2 operands

Instruction type 2 replaces operands B and C with one large (16-bit) operand, which comes in an unsigned form (Bx) and a signed form (sBx), used differently depending on the instruction. An instruction that uses Bx is OP_GETIV, shown below.

OP_GETIV A Bx

This stores into register A the instance variable of self named by the Bx-th symbol in the symbol table.

The instructions that use sBx are jumps, of the following form.

OP_JMP sBx

This instruction transfers control to an offset of sBx instructions from the address of the next instruction. Because sBx is signed, the jump can go both forward and backward. OP_JMP does not use operand A; an example of a conditional jump that does use operand A is shown below.

OP_JMPIF A sBx

This means: if register A is true, jump by sBx.

Type 3: 1 operand

Instruction type 3 consolidates the whole operand part into a single 25-bit operand (Ax). The only type 3 instruction is OP_ENTER.

OP_ENTER Ax

OP_ENTER checks the method's arguments against the bit pattern specified by Ax, dividing 23 of the 25 bits into fields of 5/5/1/5/5/1/1 bits to describe the parameters. Table 1-3 gives the meaning of each field.

Table 1-3 OP_ENTER parameters

  Bits  Meaning
  5     Number of required parameters
  5     Number of optional parameters
  1     Whether there is a rest argument (* argument)
  5     Number of required parameters at the end
  5     Number of keyword arguments (not used yet)
  1     Whether there is a keyword rest argument (** argument) (not used yet)
  1     Whether there is a block argument

(The 25-bit Ax operand is split into these fields in order from the beginning.)
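As a sketch of how such a bit pattern might be taken apart, here is the split of Table 1-3 written out in Ruby. Only the field widths come from the table; the field order and exact bit positions are my assumption for illustration.

def unpack_enter(ax)
  {
    required:     (ax >> 18) & 0x1f, # 5 bits: required parameters
    optional:     (ax >> 13) & 0x1f, # 5 bits: optional parameters
    rest:         (ax >> 12) & 0x01, # 1 bit:  rest argument (*args)?
    post:         (ax >> 7)  & 0x1f, # 5 bits: required parameters at the end
    keyword:      (ax >> 2)  & 0x1f, # 5 bits: keyword arguments (unused yet)
    keyword_rest: (ax >> 1)  & 0x01, # 1 bit:  keyword rest (**args, unused yet)
    block:        ax & 0x01          # 1 bit:  block argument?
  }
end

# A method taking 2 required arguments, 1 optional argument and a block:
ax = (2 << 18) | (1 << 13) | 1
p unpack_enter(ax)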

Type 4: a variant of type 1

Instruction type 4 divides the 16 bits of the B and C operands into a 14-bit Bz operand and a 2-bit Cz operand. The only type 4 instruction is OP_LAMBDA.

mruby comes with macros for extracting operands from instructions. These macros do not check the instruction type, so developers must take care to apply the right macro to the right instruction. The macros for extracting operands from mruby instructions are shown in Table 1-4.

Table 1-4 Macros for extracting operands from mruby instructions

  Macro           Meaning
  GET_OPCODE(i)   The type of the instruction
  GETARG_A(i)     Operand A (9 bits)
  GETARG_B(i)     Operand B (9 bits)
  GETARG_C(i)     Operand C (7 bits)
  GETARG_Bx(i)    Operand Bx (16 bits)
  GETARG_sBx(i)   Operand sBx (signed 16 bits)
  GETARG_Ax(i)    Operand Ax (25 bits)
  GETARG_b(i)     Operand Bz (14 bits)
  GETARG_c(i)     Operand Cz (2 bits)
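The macros themselves are C, but the bit arithmetic is easy to mirror in Ruby. The following sketch assumes the 32-bit layout described above (opcode in the low 7 bits, then C, B, and A toward the high end); treat the exact shifts as illustrative.

def get_opcode(i)
  i & 0x7f               # instruction type: low 7 bits
end

def getarg_a(i)
  (i >> 23) & 0x1ff      # operand A: 9 bits at the top
end

def getarg_b(i)
  (i >> 14) & 0x1ff      # operand B: the next 9 bits
end

def getarg_c(i)
  (i >> 7) & 0x7f        # operand C: 7 bits just above the opcode
end

def getarg_bx(i)
  (i >> 7) & 0xffff      # operand Bx: B and C fused into 16 bits
end

# Pack a hypothetical OP_MOVE (opcode 0 here) with A=1, B=2, then decode it:
i = (1 << 23) | (2 << 14) | 0
p [get_opcode(i), getarg_a(i), getarg_b(i)]  # => [0, 1, 2]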

The interpreter loop

Once the source code has been translated into such a sequence of instructions for the virtual machine, the basic structure of the virtual machine itself is easy to implement.

Figure 1-16 shows the interpreter loop, which is the core part of the VM, in pseudo-code.

typedef uint32_t code;

int vm_loop(code *pc) {
  code i;

  for (;;) {
    switch (GET_OPCODE(i = *pc)) {
    case OP_MOVE:
      stack[GETARG_A(i)] = stack[GETARG_B(i)];
      break;
    case OP_SEND:
      ...
      break;
    ...
    }
    pc++;    /* advance to the next instruction */
  }
}

Figure 1-16 VM structure (using a switch statement)

Surprisingly simple, isn't it? Adding instructions only means adding cases to the switch statement.

However, even if the basic structure is easy to implement, a practical language still involves many things not covered here: how to implement the runtime stack, how to build the mechanisms for method calls and exception handling, and so on. There is still a huge gap between theory and practice.

Direct threading

A practical virtual machine usually puts speed first, so we also want to make the interpreter loop more efficient. One of the best-known techniques for doing so is direct threading, which relies on an extension provided by the GNU Compiler Collection (GCC).

GCC can take the address of a label and jump to it: &&label yields the address of a label, and goto *address jumps there. With this feature we can build the virtual machine out of jumps instead of a switch statement.

Figure 1-17 shows the interpreter loop implemented with direct threading.

typedef uint32_t code;

#define NEXT i = *++pc; goto *optable[GET_OPCODE(i)]
#define JUMP i = *pc; goto *optable[GET_OPCODE(i)]

int vm_loop(code *pc) {
  code i;
  static void *optable[] = { &&L_OP_MOVE, &&L_OP_SEND, ... };

  JUMP;

L_OP_MOVE:
  stack[GETARG_A(i)] = stack[GETARG_B(i)];
  NEXT;
L_OP_SEND:
  ...
  NEXT;
...
}

Figure 1-17 Direct threading

In fact, almost all virtual machine implementations that use direct threading, including mruby's, provide a compile-time option to choose between the switch statement and direct threading, because taking the address of a label is just a GCC extension and is not guaranteed to be available everywhere. Figure 1-18 shows the loop with the dispatch method switched by macros.

typedef uint32_t code;

/* the extension is available only on GCC-compatible compilers */
#if defined __GNUC__ || defined __clang__ || defined __INTEL_COMPILER
# define DIRECT_THREADED
#endif

#ifdef DIRECT_THREADED
# define INIT_DISPATCH JUMP;
# define CASE(op) L_ ## op:
# define NEXT i = *++pc; goto *optable[GET_OPCODE(i)]
# define JUMP i = *pc; goto *optable[GET_OPCODE(i)]
# define END_DISPATCH
#else
# define INIT_DISPATCH for (;;) { i = *pc; switch (GET_OPCODE(i)) {
# define CASE(op) case op:
# define NEXT pc++; break
# define JUMP break
# define END_DISPATCH }}
#endif

int vm_loop(code *pc) {
  code i;
#ifdef DIRECT_THREADED
  static void *optable[] = { &&L_OP_MOVE, &&L_OP_SEND, ... };
#endif

  INIT_DISPATCH {
    CASE(OP_MOVE) {
      stack[GETARG_A(i)] = stack[GETARG_B(i)];
    } NEXT;
    CASE(OP_SEND) {
      ...
    } NEXT;
    ...
  }
  END_DISPATCH;
}

Figure 1-18 Switching dispatch methods with macros

With this technique, the virtual machine still runs, via the switch statement, on compilers without the GCC extension, while on compilers that have it, direct threading yields a faster virtual machine.

Summary

This section covered the implementation of the virtual machine, the core of the runtime, while leaving the basics of the language processor sketchy. From next month I will focus on language design.

5 The book says "next month" because it is compiled from a magazine serial.

Time Machine

Although I would also like to use virtual machines in Streem…

This section appeared in the June 2014 issue. Following the introduction to YACC in the previous section, it explains the implementation of virtual machines. I used mruby as the example because it is simple enough that the overall picture of a virtual machine implementation can be grasped. The most important reason is that I planned to implement the virtual machines of other languages (future languages) on top of mruby's virtual machine.

In fact, Streem's implementation uses a simple interpreter that traverses the syntax tree directly, so nothing in this section is used in Streem; but the material is still valuable for virtual machine implementations, so I chose to keep it in this book. I would have liked to replace Streem's simple interpreter with the virtual machine described here, but I didn't have the time. It's not the first or second time that time management has been my biggest obstacle…

1-4 Introduction to Programming Language Design (Part 1)

Now that we have an overview of language implementation, let’s think about language design. As an example, in this section we’ll look back at Ruby’s early design. Ruby was developed as an object-oriented language that supports scripting.

Let’s say you want to create a new programming language, not just for fun, but in the hope that it will one day become a “popular language” used around the world. So, how do you do it?

Ways to create a popular language

More than performance and functionality, language specifications determine the popularity of a new programming language. However, almost no book or web page will tell you how to design a language.

But if you think about it, very few people have ever designed a serious language. There are textbooks on building your own programming language, but they cover implementation techniques, and the languages they present are mere examples, mostly existing languages or subsets of them. Language design is outside their scope, perhaps because their authors have never designed a popular language themselves.

And no wonder. Not many programming languages have ever reached worldwide use. Counting every popular language in history, there are probably no more than a few hundred; that depends, of course, on how you define popularity. That means there are only a few hundred designers of such languages in the world, and some of them are no longer alive.

As one of the few language designers, I feel it is my mission to introduce you to the secrets of language design. That is the real purpose of this book.

Question in mind

The following questions often go through the mind of aspiring language designers when they start designing new languages.

  • Is a new language really needed?
  • What will this language be used for?
  • Who are the target users?
  • What features should it have?

There’s no need to worry about these questions, because if you do, it won’t help you design a good language.

But let’s think about these questions here. The first problem, for example, was that all algorithms could be written in Alan’s complete language. Existing programming languages have all proved to be Turing complete, so there is no need for new languages from a software development (= writing algorithms) point of view.

The reality, however, is that new languages have been created over the past 50 years, not because existing languages can’t write algorithms, but because new languages are easier to write or more fun to write. The reason you question whether you really need a new language is precisely because you already have the idea in your mind to create one. With such an idea in mind, there is no need to worry about whether it is necessary.

It’s enough to use it for yourself

To the questions "What will the language be used for?" and "Who are the target users?" I feel I need to add something.

As a longtime programming language aficionado, I have learned a great many programming languages, and after Ruby became famous I talked with many language designers: Bjarne Stroustrup (C++), Larry Wall (Perl), Guido van Rossum (Python), Rasmus Lerdorf (PHP), and others. One thing I learned from these conversations is that languages not designed by their designers for their own use have mostly failed to catch on.

If you don’t plan to use it yourself, you won’t be able to design for details and maintain the passion to make your language popular. Many languages take more than a decade to become popular, so to create a popular language, attention to detail and passion are essential. In other words, the target audience for a popular language is first the designer himself, and then users with similar characteristics. “What the language does” depends on what the designer wants to do.

Once you’ve decided on your target audience and language purpose, there’s no need to worry about the last question, “What features?” But there are some tricks behind this, which we’ll explain later.

The opportunity to develop Ruby

No amount of fine talk is convincing, so let’s take a look at Ruby. I’ve been working with Ruby for over 20 years, and I can say a lot. Here we focus on the early days of development that determined the direction of language design.

Let’s start with Ruby’s development background.

Ruby's development began in 1993, but my interest in programming languages goes back to the early 1980s, when I was a high school student in Tottori Prefecture; from then on I was drawn to Pascal, Lisp, and Smalltalk.

I didn’t have my own computer at the time and couldn’t freely program, but somehow I became interested in programming languages. I’m more interested in the language as a means of writing programs than in what programs I write.

However, I suffered a lot because I lived in the countryside and could not find much material or literature to study. At that time, the Internet was not widespread and there were hardly any computer books in the school library, which gave me a headache.

To get information about programming languages, I had to look in computer magazines or go to a nearby bookstore to read something that looked like a college textbook (books were expensive and I couldn’t afford them at the time), so I’ve always been grateful to that bookstore I frequented.

When I went to college and had a library full of books, magazines and papers, I felt like I was living in heaven. This is how I mastered the knowledge of programming languages, which later played a very important role in my language design. Just as there is no writer who does not read books, and no professional chess player who does not know the old score, it is important to have a broad knowledge of the existing language when designing a new one.

There was a lot of free time in 1993

It’s 1993. By that time, I had graduated from college and was a professional programmer. My job was to develop software according to the business requirements of the company. Before that, I developed internal systems for the company, desktops on UNIX workstations, mail systems for attachments, etc. This is not unusual on Windows and Mac systems today, but it was not available on UNIX workstations at the time. Even if there are similar ones, they don’t support Japanese, so we have to develop them ourselves.

However, after the bubble economy burst, the company became depressed as a whole, and the internal system could not bring economic benefits, so the company decided to stop developing new functions and continue to use the functions that had been developed.

The development team was disbanded, with only a few remaining as maintenance staff. Fortunately or unfortunately, I was one of those few. But since development had stopped, I didn’t have much to do. Once in a while, someone calls to say the computer isn’t working, and I just say, “Please reboot.” That’s how it went in those days, sitting on the bench.

Book planning becomes an opportunity

It’s not all bad, though. Despite the recession, I didn’t have to work as much overtime and lost my bonus, making less money than DURING the bubble economy (when I was newly married and money was tight), BUT LUCKILY I didn’t get fired, so I didn’t have to look for a job. With computers in front of them, things were few and unimportant, so nobody cared. I was full of time and energy, and I wanted to do something. I developed a few useful little programs during that time, and then, by accident, I decided to realize a long-held dream of creating a programming language.

Japanese companies usually offer overtime pay. –

This “serendipity” goes something like this. At that time, a senior in my department was planning a book. When he started writing, he approached me and said, “I decided to write a book about learning object orientation by creating programming languages. Can you help me write the part about programming languages?”

As a programming language junkie, I was interested in this project, so I said yes. But the plan failed to make it through an editorial meeting and was soon aborted. Creating a programming language had been a dream of mine for years, and I had finally built up the momentum and didn’t want to stop there. In the past, I just had a dream and could not imagine what the language would be like when it was finished, so I had no motivation to do it. Now when I finally got excited, it would be a pity to stop.

It was this “drive” that started Ruby’s 20-year history. At the time, I never dreamed that Ruby would grow into a language widely spoken around the world.

sRuby’s early design

We’ve already discussed some of the questions that pop up in my mind when trying to create a new language, and although I can definitely say that I’m over them now, twenty years later, I was young enough to hesitate a little. After a little thought, I decided to create my own language. Now that I think about it, it was that choice that determined everything that followed.

I was a C programmer, working mostly with C and shell scripting languages. C is used to develop medium-scale systems at work, while shell scripts are used to develop small-scale programs in daily use. I was (and still am) neither unhappy with C, nor did I find it appealing to create a new language that added object-oriented functionality to C. This was probably due to the fact that C++ already existed, and the fact that I had designed a C-based object-oriented language for my college graduation project (though not to my satisfaction).

Not satisfied with shell scripts

I’m not so happy with shell scripts. I was using Bash at the time, and a simple language like that would have sufficed if I had just arranged the command line with some simple control structures. However, as the program gets more and more complex, it’s easy to have situations that you don’t understand, and that doesn’t make me happy. Shell scripts also have no formal data structures, which I find frustrating. In short, shell scripts are simply command line input with some logical control, but ultimately are nothing more than a “simple language,” and that’s the problem with it.

There was Perl in the realm of scripting languages that was closer to shell scripting, but Perl also had a “simple language” feel to it, in my opinion. I also resent the fact that Perl has only scalar (strings and numbers), array, and hash data structures. Some complex data structures cannot be expressed directly.

This was the age of Perl 4, and the object-oriented features of Perl 5 were nothing more than anecdotal, but the rumored object-oriented features of Perl 5 didn’t sound very good. I feel that languages with richer data structures are better than Perl. Also, I’ve been fascinated with object-oriented programming since high school, so I wanted programming languages that could not only handle structures, but actually support object-oriented programming.

Python was too ordinary

There was also a language called Python. Information about it was scarce at the time, and after much digging I found that its object-oriented features had been added as an afterthought and that the language felt too ordinary, so I didn't like it much. I know how arrogant that sounds, but when it comes to "the ideal language," I'm a programming language junkie who can't keep quiet.

Some may ask what "too ordinary" means. It means that Python doesn't support regular expressions at the language level, and its string manipulation isn't powerful enough to give it a scripting feel (I'm speaking of the Python of 20 years ago).

Indentation-based code blocks are one of Python's hallmarks, an interesting experiment but also one of its drawbacks. For example, when you generate code automatically from a template, the program won't work unless the indentation comes out exactly right. And because code blocks are represented by indentation, the language is forced to draw a sharp line between expressions and statements, among other consequences.

Seen this way, Python struck me as not much different from an ordinary Lisp dialect whose syntax is a little easier to read. Now that I think about it, I completely ignored the community and the class libraries; I didn't realize how important they were at the time.

Make scripting languages object-oriented

However, looking at other languages made clear what I wanted to do: a shell-like language, more general-purpose than Perl, able to define its own data structures, with object-oriented capabilities. It should support object-oriented programming more seamlessly than Python, and it had to support the features scripting needs, including string manipulation, as well as libraries.

Scripting was becoming more and more important in those years, as the rise of Perl and Python showed, but the need for object-oriented programming in scripting had not yet been widely recognized.

At the time, object-oriented languages were generally regarded as technology for university research or large, complex systems, not for small, simple programming like scripting. But there were signs of change.

Perl was finally planning object-oriented functionality. Python was already an object-oriented language, but since the functionality had been added later, not all data were objects; Python was, in effect, a procedural language that also supported object-oriented programming. If there were a scripting language that was object-oriented from the start and also supported procedural style, it would be very useful, at least to me.

Then the energy came. Arrogance, one of the three great virtues of a programmer, was on full display in me, and I decided that if I was going to do this, I should aim for something as good as Perl and Python. Blind confidence is a terrible thing, but such confidence is often a source of motivation.

7 Larry Wall, the designer of Perl, says programmers have three great virtues: laziness, impatience, and arrogance. Of course, you wouldn't normally call these qualities virtues.

Start developing Ruby

So I started developing Ruby, beginning with the name. Names are important. Perl's name comes from the word "pearl," so I decided to follow Perl and pick a gemstone name for my language. Most gemstones, like diamond and emerald, have long names; the short candidates I came up with were coral and ruby. Ruby is short and beautiful, so I chose it. I didn't think much about it at the time, but since a language's name gets spoken constantly, it had better be easy to pronounce and easy to remember.

If you decide to develop your own language, put real effort into finding a good name. A name that clearly expresses the language's character is best, but a name entirely unrelated to it, like Ruby, is also fine. Recently, common-word names suffer from low "googleability" (searchability); that problem didn't exist when Ruby's development began in 1993.

Pondering how to represent block structure

The next decision was to use the end keyword to delimit code blocks. C, C++, and Java all use curly braces ({}) to group multiple statements into a block. One problem with this is that it is easy to forget to add the braces when a single statement grows into multiple statements (Figure 1-19). Pascal used begin and end instead of braces, but since it too distinguished single statements from compound statements, it had the same problem.

/* with braces */
if (cond) {
  statement1();
}

/* adding a second statement works as expected */
if (cond) {
  statement1();
  statement2();
}

/* braces omitted for a single statement */
if (cond)
  statement1();

/* forgot to add the braces: no syntax error,
   but statement2() always runs */
if (cond)
  statement1();
  statement2();

Figure 1-19 Single statement and multiple statements

I don’t like single-statement and multi-statement problems, so I want to eliminate them in my own language, and THERE are three ways to do that.

(1) A Perl method that does not allow the omission of braces in a single statement

(2) The Python way of indenting code blocks

(3) Eiffel mode of ending code blocks with end without distinguishing between single and multiple statements (Figure 1-20)

if cond
  statement1();
end

if cond
  statement1();
  statement2();
end

if cond
  statement1();
elsif cond2
  statement2();
else
  statement3();
end

Figure 1-20 The Eiffel style (comb-shaped block structure)

The auto-indent problem

I've used the Emacs text editor for years and am very familiar with its language modes; the feature I like best is auto-indentation. You type some code and the editor indents it for you; it feels like writing the code together with the editor.

With the Python approach of (2), the indentation itself expresses the block structure, so there is no room for auto-indentation (though entering a colon at the end of a line does deepen the indent). Moreover, Python's indentation blocks force a sharp distinction between statements and expressions, which I'm not fond of, influenced as I am by Lisp, which makes no such distinction. In the end I did not adopt Python-style indentation blocks.

Eiffel, on the other hand, had a great influence on me in my school days, when I read the book Object-Oriented Software Construction. Under its influence I designed, as my graduation project, a language semantically similar to Eiffel (but syntactically similar to C). I can't call that a success, so this time I decided to borrow from Eiffel syntactically (but not semantically) and see how it went.

Making my own Emacs language mode

Again, what worried me here was auto-indentation. In the Emacs language modes of the day, the dominant practice was that languages marking blocks with symbols, like C, got automatic indentation, while Pascal and other languages that marked blocks with keywords made do with keyboard shortcuts to increase or decrease the indent, losing that pleasant auto-indent feeling.

So I spent a few days wrestling with Emacs Lisp, doing a crude analysis of Ruby syntax with regular expressions, and built a prototype Ruby mode that could auto-indent even with end-delimited syntax. This proved that auto-indentation could be implemented for a language that uses end and has Eiffel-like syntax, so I could safely adopt end in Ruby. Conversely, had that auto-indenting Ruby mode not worked out, Ruby's syntax would not be what it is today.

The choice of end-delimited block structure had another unintended benefit. A large part of Ruby is implemented in C, so I constantly switch between writing C and writing Ruby. Since the two have completely different code styles, you can tell at a glance which language you're in, which lowers the brain's cost of switching modes. It's a small thing, but it helps keep programming motivation up. Also, if you think of Perl, Python, and Ruby as competing scripting languages, each having a distinctive block construct (braces in Perl, indentation in Python, end in Ruby) may even have helped them all survive.

Else if or elsif or elif

As an aside, if you solve the multiple-statement problem in the way described above, you can't write else if the way C does: C's else if is really an else followed by a single if statement without braces (Figure 1-21). In Ruby syntax you write it as shown in Figure 1-22.

/* (a) as written */
if (cond) {
  ...
} else if (cond2) {
  ...
}

/* (b) as interpreted */
if (cond) {
  ...
} else {
  if (cond2) {
    ...
  }
}

Figure 1-21 The else if of C

# without elsif, the ends pile up
if cond
  ...
else
  if cond2
    ...
  end
end

# with elsif
if cond
  ...
elsif cond2
  ...
end

Figure 1-22 The else if of Ruby

As the code in Figure 1-22 shows, elsif is better. Incidentally, Perl and Ruby use elsif, while shell scripts and Python (and the C preprocessor) use elif. An interesting difference.

Python is said to have inherited elif from shell scripts and the C preprocessor, which in turn inherited it from the old ALGOL family. It is also said to echo the shell habit of marking the end of a construct by spelling its opening keyword backwards, as in fi and esac.

I'm afraid I don't know why Perl chose elsif, but Ruby uses it for two reasons:

  • elsif is pronounced the same as else if and is shorter (elseif would be longer, and elif changes the pronunciation)
  • Eiffel, the language from which Ruby borrows the most basic syntax, uses elsif

So this keyword, too, has its historical reasons.

Going a step further: Perl's syntax is basically the same as C's, and because braces cannot be omitted, else if is not supported, as Figure 1-21 explains. But if you explicitly add the else-if combination to the grammar, you can support else if as well. A long time ago I modified the Perl source code on a whim; a few minutes of tweaking the YACC description gave Perl else if support. It's hard to understand why the Perl community still hasn't done it.

Beginning the implementation

With the basic policy and syntax set, it was time to implement. Fortunately I still had the source code of a toy language I had been tinkering with, so I decided to use it as the base for development.

Ruby development began in February 1993, after which the basic parts of the parser and runtime were roughly completed, and the first Ruby program (a Hello World program) started running half a year later in August.

To be honest, that was one of the hardest periods in Ruby's development. Programmers feel the joy of programming only when they see their code run, and back then there was no Ruby to run anything on: I wrote and wrote without ever reaching a runnable state, and I nearly lost the motivation to continue.

A parser alone can only check syntax. To run even "Hello World" you need a String class, because "Hello World" is a string object. To write the String class you need an object system with the Object class at its root; to print the string you need objects that manage IO; and so on, one thing leading to another. It's a miracle that I, an impatient programmer, endured those six months.

Popularity lies in the details

The program that prints Hello World was working, but Ruby was not yet usable on its own; what existed so far was a textbook example. To reach the goal of a popular language, the next steps would be crucial.

How do you give a language its own character? How does it gain popularity? How was Ruby's early design thought through? I have much more to say.

But the preamble has run long, and I'm out of space for this installment. Section 1-5 continues from here with the second half of the Ruby design story.

Time Machine

Learn about the history of common languages

This section appeared in the July 2014 issue, the beginning of the discussion of language design. It tells the background and history of Ruby's development, answering questions like "Why did I want to do it?", "Where did I hit setbacks?", and "Why did I adopt this syntax?"

Although it all happened long ago, few people can actually tell the background of a popular language and the reasons behind its design decisions, so I think this section and the next are a highlight of the book.

Then again, the content itself is mere trivia. For the sake of future generations, I sincerely hope you draw some lessons from these stories, such as:

  • Design is decision
  • Even something as basic as grammar has all sorts of considerations
  • Design can go wrong without careful consideration
  • Even careful consideration can lead to mistakes

1-5 Introduction to Programming Language Design (Part II)

Section 1-4 covered the birth of Ruby. This section picks up where it left off and covers Ruby's language design: how variables are named, how inheritance was thought through, how errors are handled, and how the iterators took shape.

In the previous section Ruby took its first steps as a programming language by settling its basic syntactic structure, but left there it would have been just another mediocre language of the kind found everywhere. At this point Ruby's syntax defined little more than blocks closed with end and elsif for else if; the details still had to be refined.

Design principles

At this stage, in addition to meeting the functional requirements of being an object-oriented language, Ruby had several goals in mind.

  • Out of the realm of "simple languages"
  • Easy to write, easy to read
  • Concise

"Out of the realm of simple languages" means no slapdash language specifications. At the time, especially in the scripting world, many languages put getting the job done first and (seemingly) cut corners on the spec: attaching a symbol to variable names where none is needed, just because it eases implementation, or making user-defined functions be called differently from built-in ones.

"Easy to write, easy to read" is a bit abstract. A program is not written once and finished; it is pored over and revised repeatedly during debugging and beyond. For the same operation, shorter code is easier to grasp, so concise code is the ideal, but only up to a point.

There are languages that are exceptionally concise yet often incomprehensible when you return to the code later. Such languages are sometimes called "write-once languages": you write the code and abandon it, because deciphering it is harder than rewriting it from scratch. Easy-to-write and easy-to-read can only be balanced against each other; such is language design, as always.

On the other hand, when writing code it is annoying to be forced to write anything, however small, that has nothing to do with what you are essentially trying to do; you want to concentrate on the essence of the software. Without compromising understanding, the notation should be as concise as possible, with the irrelevant cut away.

The variable name

Perl was one of Ruby’s early reference languages. Perl variable names begin with symbols, which are described in Table 1-5.

Table 1-5 Perl's rules for variable names

  Variable name  Meaning
  $foo           Scalar (string or number)
  @foo           Array (array of scalars)
  %foo           Hash (associative array)
  $foo[0]        Access to an array element
  $foo{n}        Access to a hash element

The interesting part is how arrays are accessed. $foo[0] takes the 0th element of the array @foo, but because the element taken out is a scalar, the leading symbol is $. In other words, the leading symbol indicates the type of the expression. In this sense Perl was (surprisingly) a language with explicit static data types.

Later, Perl introduced the concept of references, which made everything, including arrays and hashes, representable as scalars, so the static-typing principle lost its importance.

But when we see a variable name, what we most want to know is not its type but its scope. In some languages (C++, for example), coding conventions call for a specific prefix on global or member variables. Meanwhile, coding rules that embed type information in variable names, such as the Hungarian notation once used at Microsoft, have lately fallen completely out of fashion, which suggests that explicit type information is not all that necessary.

So Ruby attaches scope symbols to variable names (Table 1-6): $ for global variables, @ for instance variables. But if the most common kinds, local variables and constants (class names and so on), also carried symbols, we would fall into the same trap as Perl.

Table 1-6 Ruby’s rules for variable names

species

symbol

The sample

The global variable

$

$foo

The instance variables

@

@foo

A local variable

Lowercase letters

foo

constant

The capital letters

Foo

Make local variables more concise

After much deliberation I made it a rule that local variables begin with a lowercase letter and constants with an uppercase letter, so that programs wouldn't be covered in ugly symbols. And if you overuse global variables, ugly $ signs spread across the program, which naturally discourages them and promotes good coding style.

The advantage of encoding scope in the variable name is that you no longer need variable declarations: what a variable is travels with it in compact form. Variable declarations exist to give the compiler information such as a variable's scope and type; they have nothing to do with the essence of the processing. I didn't want to write such things if I could avoid it, nor hunt for declarations just to read a program. That's why I settled on this rule, and why Ruby has no variable declarations: a variable comes into existence at its first assignment.
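Here is a small, self-contained example of these rules at work; note that nothing is declared anywhere, and each name's prefix (or case) announces its scope:

$count = 0          # $  -> global variable

class Counter
  Limit = 10        # uppercase initial -> constant

  def increment
    @count ||= 0    # @  -> instance variable, born at first assignment
    step = 1        # lowercase -> local variable, likewise undeclared
    @count += step
    $count += step
  end
end

c = Counter.new
c.increment
p $count            # => 1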

To be safe, let me add that I am not denying the merits of declarations, especially type declarations. The ability of statically typed languages to catch type mismatches at compile time, before the program even runs, strikes me as remarkable. I simply want to concentrate on the essentials and not write type declarations, so for now I prefer dynamic typing.

Add object-oriented functionality to scripting languages

Another thing that was in mind from the beginning when designing Ruby was to make the language a true object-oriented language.

Object-oriented languages in use at the time included Smalltalk and C++, and in university research, Flavors, of the Lisp family. There was also said to be a language called Eiffel, used mainly overseas in industries such as finance, but its language processor was available only commercially and was hard to obtain in Japan.

For these reasons, object-oriented programming felt like something remote, generally considered unnecessary for everyday programming, above all for small, simple programs such as text-processing scripts.

So none of the scripting languages of the day started out with object-oriented capabilities. Where features supporting object orientation existed at all, they had been added later, and most lacked a sense of unity.

However, what I had read about Smalltalk in high school made me think that object-oriented programming was the ideal type of programming, and I believed that object-oriented programming would work in the scripting world, so I naturally wanted to design the language in that direction from the beginning.

Single inheritance versus multiple inheritance

What bothered me here was the design of the inheritance functionality. As readers may know, languages that support object-oriented programming divide inheritance into single inheritance and multiple inheritance. Inheritance means taking over functionality from an existing class and adding new functionality to a new class. When there is only one base class (the parent class), we speak of single inheritance; when there are several, of multiple inheritance.

Single inheritance is a subset of multiple inheritance: wherever multiple inheritance is available, single inheritance comes for free. Multiple inheritance was developed in the Lisp family of object-oriented languages, and C++ later adopted it, though it is unclear how much it was actually used back in 1993.

According to The Design and Evolution of C++, C++ introduced multiple inheritance in version 2.0, in 1989. In 1993 the feature was still new and probably little used.

However, multiple inheritance has problems that single inheritance does not. With single inheritance, the inheritance relationships between classes form simple chains, and the class hierarchy as a whole is a tree (Figure 1-23). Multiple inheritance allows several parent classes, so the relationships form a network, a directed acyclic graph (DAG) structure, and an inherited parent may itself have several parents. Left unattended, the relationships between classes quickly become complicated.

Figure 1-23 Single inheritance (left) and multiple inheritance (right)

With single inheritance the relationships between classes form a single chain, so there is no need to worry about precedence: methods are searched in order from the bottom (subclass) to the top (superclass).

But when the class relationships form the DAG of multiple inheritance, the search order is not necessarily unique (Figure 1-24). There are depth-first and breadth-first searches, and besides these two, many languages that support multiple inheritance (CLOS, Python, and others) use C3 linearization.

Figure 1-24 DAG search sequence

Whichever method is chosen, there are cases that are hard to explain intuitively; inheritance that complicated is inherently hard to understand.

So is it fine to drop multiple inheritance and settle for simple single inheritance? The class relationships of single inheritance are certainly simple and easy to understand, but that does not mean single inheritance is free of problems.

The problem with single inheritance

The problem with single inheritance is that class attributes such as methods cannot be shared beyond the lines of inheritance. Without a common parent class, attributes cannot be shared; the code can only be copied. And the DRY principle tells us that copying code is a vice and bad practice.

9 DRY stands for "Don't Repeat Yourself," a software design principle about avoiding repetition in software development.

Let's look at a practical example. Smalltalk has Stream classes that manage input and output, including ReadStream for reading, WriteStream for writing, and ReadWriteStream for both reading and writing.

In a language with multiple inheritance, ReadWriteStream would naturally inherit from both ReadStream and WriteStream (Figure 1-25a), an ideal example of multiple inheritance. Smalltalk, which does not support multiple inheritance, instead makes ReadWriteStream a subclass of WriteStream and copies in ReadStream's code (Figure 1-25b). Worst of all, whenever ReadStream changes, ReadWriteStream, which copied its code, must change in step, or bugs creep in.

Figure 1-25 ReadWriteStream

Mix-in

The inspiration for solving this problem came from mix-ins, a technique from Flavors, an object-oriented language of the Lisp family. Flavors supports multiple inheritance, but to ease the problems described above it places the following restrictions on the second and subsequent parent classes:

  • They are not instantiated
  • They take no ordinary class as a parent (though they may use another mix-in)

Under these rules, the network of multiple inheritance becomes a tree formed by each class's first parent, with mix-ins attached like small branches. Figure 1-26 shows the structure of Figure 1-25 implemented with this technique. It looks quite different from raw multiple inheritance, but it keeps the simplicity of single inheritance without copying code.

Ruby's modules

Mix-ins are a good idea, but in Flavors they are merely a convention for using multiple inheritance, with nothing to enforce them. So I considered enforcing mix-ins at the language level: having two kinds of classes, ordinary classes for normal inheritance, and special classes usable only as mix-ins, bound by the mix-in rules that forbid instantiation and inheritance from ordinary classes.

Ruby's modules were born from this idea. What the module statement defines satisfies exactly the mix-in constraints described above (Readable and Writable in Figure 1-26 correspond to modules). With this device we avoid the drawbacks of multiple inheritance and keep complexity down.

Figure 1-26 Mix-in

Around the same time as Ruby, other languages (for example Self, once of Sun Microsystems) provided similar constructs under names such as traits or mixins.
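In today's Ruby, the structure of Figure 1-26 comes out roughly as below. The method bodies are invented for illustration; the point is that Readable and Writable are modules, so ReadWriteStream shares their code without multiple inheritance.

module Readable
  def read_line            # shared reading code lives in one place
    "a line read by #{self.class}"
  end
end

module Writable
  def write_line(s)        # shared writing code likewise
    puts s
  end
end

class ReadStream
  include Readable
end

class WriteStream
  include Writable
end

class ReadWriteStream
  include Readable         # mixed in, not inherited
  include Writable
end

rw = ReadWriteStream.new
rw.write_line(rw.read_line)   # prints "a line read by ReadWriteStream"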

Error handling

The hardest part of software development is error handling. Software does not always run as expected: the file you want to open doesn't exist, the network connection drops, memory runs out. In a language like C you must check, after calling a function, that it finished normally (Figure 1-27).

As the code in Figure 1-27 makes clear, the intent "open a file" takes one line to express, while the error handling around it is cumbersome. This violates the Ruby design principle of expressing intent concisely and had to be solved no matter what.

FILE *f = fopen(path, "r");
if (f == NULL) {        /* the file could not be opened */
  switch (errno) {      /* details of the error are stored in errno */
  case ENOENT:          /* the file does not exist */
    ...
    break;
  case EACCES:          /* access to the file was denied */
    ...
    break;
  ...
  }
}

Figure 1-27 Error handling in C

The first thing I considered was the error-handling structure of the Icon language, developed at the University of Arizona. In Icon every function call either succeeds (with a return value) or fails, and when a call fails, the calling function fails too, much like the exception-handling structure of languages such as C++ and Java. What makes Icon special is that failure also plays the role of a false Boolean value.

That is, when the following line of code fails for some reason, the caller’s function also fails, causing processing to break.

line := read()

The following code indicates that write() is executed when read() succeeds, otherwise nothing is done.

if line := read() then
   write(line)

The following code loops as long as read() succeeds, writing each value it reads; the loop ends as soon as read() or write() fails.

while write(read())

This structure lets you write exception handling naturally, with no special syntax, which is very attractive. But failures can pass unnoticed too easily, the distance from "normal" languages is large enough to keep people away, and the runtime efficiency was worrying, so I did not adopt it. Had I adopted it, Ruby would be a very different language.

If you want to design a popular language, whenever you adopt a design that differs from other languages you must weigh whether it will become a highlight of the language. If you are truly committed to the design, there is no need to give in; but adopting "weird syntax" without that much conviction or careful thought can store up trouble for the future.

Pay attention to the keywords of exceptions

Having given up on Icon-style handling, I still wanted Ruby to have some form of exception handling, so I went with the "normal" exception-handling constructs of languages like C++ (Java didn't exist yet), though I was a little particular about the keywords.

C++ uses try~catch for exception handling, but I don't like those keywords. try suggests "giving it a go," yet since every method call is subject to exceptions, singling out a region as "trying" felt wrong. And the word catch doesn't evoke exception handling for me.

So I borrowed the word rescue from Eiffel, the language I had already referenced when designing block structure. Rescue, as in coming to someone's aid, fits the image of recovering from a failure nicely.

ensure, the keyword for post-processing that runs whether or not an exception occurred (some languages use finally), is likewise borrowed from Eiffel. In Eiffel itself, though, ensure has nothing to do with exception handling: it states the postconditions that must hold after a method runs, as used in DbC (Design by Contract).
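In Ruby the borrowed keywords end up looking like this; a small sketch (the file path is made up):

def read_first_line(path)
  f = File.open(path)
  f.gets
rescue Errno::ENOENT      # "rescue" the failure: the file did not exist
  nil
ensure                    # post-processing that runs no matter what
  f&.close
end

p read_first_line("/no/such/file")  # => nil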

The code block

The last topic is code blocks. As Ruby's designer, I was surprised to hear people call code blocks one of Ruby's biggest features.

MIT developed a language called CLU: in one sentence, a precursor of the object-oriented languages, a language of abstract data types.

CLU has some striking features, one of which is iterators. An iterator is “a function that abstracts the loop.”

Iterators can be called as follows.

for i:int in times(100) do
...
end

This code calls an iterator function named times. Iterator functions are special functions that can only be called from a for statement. Figure 1-28 shows the times function implemented in CLU.

times=iter(last:int) yields(int)
   n:int := 0
   while n < last
     yield(n)
     n := n + 1
   end
 end times

Figure 1-28 Times function implemented in CLU language

Inside an iterator function, each call to yield assigns the value passed to yield to the variable named in the for statement and runs the block that begins with do.

The same processing in C would use a for statement or a function pointer. With a for statement, details such as loop variables and access to internal structures cannot be hidden. With a function pointer the details can be hidden, but since C has no closures, passing along state such as variables is a hassle.

CLU iterators have neither problem, which makes them ideal for loop abstraction.

The syntax around iterators

I considered bringing CLU's iterators into Ruby as they were, but on reflection decided against it. CLU iterators abstract loops admirably, but the same mechanism could be useful outside loops as well, and CLU's syntax forecloses that.

Smalltalk and Lisp have many functions and methods that take a function (in Smalltalk, a code block) as an argument, used for loops and much else. In Smalltalk, for example, you can multiply each element of an array by 2 to obtain a new array:

#(1 2 3) collect: [:a | a * 2]

If this were written in CLU-style syntax it might look like the following, which is not intuitive (this merely mimics CLU; it isn't actually possible there).

for a in [1, 2, 3].collect() do a * 2 end

Isn’t there a better way to write it? I was trying to figure out how to write this sentence as I put my eldest daughter, who was a newborn, to bed at night.

The syntax that comes to mind is something like this.

do [1, 2, 3].collect using a * 2 end

This syntax is obviously influenced by CLU. The keyword using was borrowed from a PC language called Actors, which has nothing to do with the actor model of concurrent programming. But even written this way, it is not as easy to understand as Smalltalk's blocks or Lisp's anonymous functions.

After much deliberation, the syntax I arrived at is shown below.

[1, 2, 3].collect {|a| a * 2}

This syntax is very close to Smalltalk's; delimiting the block variable with | is also Smalltalk's influence. The syntax kept improving afterwards: when there is no block variable, the pair of | can be omitted, and to stay consistent with other statements ending in end, do~end can also be used to write a code block.
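In today's notation, those refinements look like this:

p [1, 2, 3].collect { |a| a * 2 }   # => [2, 4, 6]

[1, 2, 3].each do |a|               # do~end form, consistent with other
  puts a * 2                        # statements that close with end
end

3.times { puts "hi" }               # no block variable, so no | pair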

An expanding range of uses

Code blocks, designed with CLU as the model for loop abstraction, have come to be used in Ruby in all sorts of ways, such as the following (a few are sketched in code after the list).

  • Loop abstraction (the original purpose)
  • Specifying conditions (select and the like)
  • Specifying callback code (GUIs and the like)
  • Specifying the part run by a thread or by fork
  • Delimiting a scope (DSLs and the like)
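A few of these uses in code (the file path is made up; the rest is the standard library):

# specifying a condition
p [1, 2, 3, 4].select { |n| n.even? }       # => [2, 4]

# the part run by a thread
t = Thread.new { puts "runs in another thread" }
t.join

# delimiting a scope: the file is closed when the block ends
File.open("/tmp/demo.txt", "w") { |f| f.puts "scoped" }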

It’s amazing how much code blocks are used in almost every field.

That said, a code block is merely special syntax for a higher-order function (a function that takes a function as an argument), a staple of Lisp, Smalltalk, and the functional languages, so these uses are natural. What surprised me is that the limitations that came from designing blocks specifically for loop abstraction turned out not to hinder them at all.

The limitations just mentioned are:

  • There was no notation for writing a function object directly (until the lambda notation arrived in version 1.9)
  • Because block passing is built into the method-call syntax, a method call can take only one code block

In fact, it’s these limitations that make code easier to understand and design in most cases. What a blessing in disguise.

According to one survey, of the abundant higher-order functions in the standard library of OCaml, one of the functional languages, 98% take only one function parameter. Perhaps for the same reason, Ruby's restriction to one code block per method call has not been a problem.

Secrets of language design

Seeing this, I believe that readers should also know some language design secrets.

First, investigate thoroughly what problems existing languages have and what solutions exist. Accumulating experience in solving these small problems helps you arrive at a good design.

Second, keep the pursuit of individuality within bounds. As with Icon's exception handling, idiosyncratic syntax can put beginners off, and beyond a certain point a conservative attitude actually helps a language's popularity. But being too conservative makes the language technically dull and unappealing, so the balance is delicate.

Third, decide things yourself. We all know that design by committee does not end well: as people compromise toward agreement, the benefits of a distinctive design are lost. So if you consult others, take only their advice. Unless final responsibility rests with one person, good design cannot be done.

When I was designing code blocks, my sounding boards were a baby and a teddy bear; that, too, is a method. It may feel silly, but even if the other party never answers, explaining helps you think further and sometimes leads to a better idea.

Summary

In sections 1-4 and 1-5, I introduced you to some of the early ideas of Ruby design. What do you think?

Even a subtle detail of a language gets decided only after the designer has researched and weighed all sorts of things; I hope you could feel some of that.

All design, not just language, is a trade-off. There is no perfect design, there is only a better choice under certain conditions. It is up to the designer to make this choice as widely applicable and as close to perfect as possible.

But designing a language is a long process. Even Ruby, a relatively young language, has been under development for more than 20 years. As computer performance improves and environments change, Ruby faces new trade-offs. For example, when Ruby was born multicore machines were rare, so threads that make effective use of multicore CPUs were not required; today's personal computers, however, are all dual-core or quad-core.

In response to this change in environment, I am rethinking language design and implementation every day.

Time Machine

The secrets of language design apply to all design

This section appeared in the August 2014 issue. Picking up where the last one left off, it goes behind the scenes of language design: Ruby's variable naming rules, the design of its object-oriented functionality, exception handling, and more.

In general, when introducing a language, people tend to explain what the language is like rather than why it is that way. This time, as Ruby's developer, I dug into design background I wouldn't normally touch (and talked quite a lot). I remember being very happy while writing it.

When I wrote this section, I used "デザイン" and "設計", which both mean "design," without particularly distinguishing them.10 But in Japanese the two words carry different nuances.

デザイン in Japanese leans toward deciding how a thing looks, while 設計 means working out its structure. What I'm good at, and what my job is, is deciding (設計) what features and syntax a programming language should have. But lately, caring about the beauty of the language itself, I have started using the phrase 言語デザイン (language design). Don't you think "language designer" sounds rather cool?

In the second half of this section I presented, a bit boastfully, some secrets of language design. These principles are common not only to language design but to all software design. Though I have little experience outside programming, I believe they extend beyond programming to every field that involves decisions, including design and requirements definition.

10 There are two ways to express the word "design" in Japanese. One is the loanword "デザイン", based on the pronunciation of the English word; the other is "設計".