Source | Communications of the ACM

Written by John L. Hennessy and David A. Patterson

The Heart of the Machine

John Hennessy, chairman of the board of Google’s parent company Alphabet, and David Patterson, a Google distinguished engineer specializing in machine learning and artificial intelligence, are the 2017 Turing Award recipients. They are better known as the authors of Computer Architecture: A Quantitative Approach, widely regarded as the bible of computer systems architecture.

This article, “A New Golden Age for Computer Architecture,” was published in the February 2019 issue of Communications of the ACM. It offers a thorough introduction to the development of computer chips and to future architectural trends, and it is worth reading for anyone who wants to understand hardware architecture.

The Turing Lecture we delivered on June 4, 2018 began with a review of the development of computer architecture since the 1960s. Beyond that review, we presented current challenges and future opportunities. We also foresee a new golden age of computer architecture in the coming decade, much like the research we did in the 1980s that earned us the Turing Award: an age of improvements in the cost, energy, security, and performance of computing.

“Those who cannot remember the past are condemned to repeat it.” – George Santayana, 1905

Software talks to hardware through a vocabulary called the instruction set architecture (ISA). In the early 1960s, IBM had four incompatible lines of computers, each with its own ISA, software stack, I/O system, and niche market (small business, large enterprise, scientific, and real-time applications, respectively). IBM engineers, including ACM Turing Award winner Fred Brooks, Jr., thought they could create a single ISA that would efficiently unify all four ISAs.

They needed a technical solution that would let both cheap machines with 8-bit data paths and fast machines with 64-bit data paths share a single ISA. The data path is the “brawn” of the processor: it performs the arithmetic but is relatively easy to widen or narrow. The greatest challenge for computer designers, then and now, was the “brains” of the processor, the control hardware. Inspired by software programming, computing pioneer and Turing Award winner Maurice Wilkes proposed a way to simplify control. Control could be specified as a two-dimensional array he called a “control store.” Each column of the array corresponds to one control line, each row is a microinstruction, and writing microinstructions is called microprogramming. A control store holds an ISA interpreter written in microinstructions, so executing a single regular instruction takes several microinstructions. The control store is implemented in memory, which is much cheaper than logic gates.
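To make the idea concrete, here is a minimal sketch of a microcoded interpreter in Python. The toy accumulator ISA, the microinstruction names, and the contents of the control store are invented purely for illustration; they are not IBM’s actual microcode.

```python
# A minimal sketch of Wilkes-style microprogramming with a toy accumulator ISA.
# Each ISA opcode is interpreted by a short microprogram held in a "control
# store"; each microinstruction drives one simple data-path step.

REGS = {"ACC": 0, "MAR": 0, "MDR": 0}
MEM = [0] * 16

# Control store: one microprogram (list of micro-ops) per ISA opcode.
CONTROL_STORE = {
    "LOAD":  ["addr_to_mar", "read_mem", "mdr_to_acc"],
    "STORE": ["addr_to_mar", "acc_to_mdr", "write_mem"],
    "ADD":   ["addr_to_mar", "read_mem", "add_mdr_to_acc"],
}

def micro_step(op, operand):
    """Execute one microinstruction against the toy data path."""
    if op == "addr_to_mar":
        REGS["MAR"] = operand
    elif op == "read_mem":
        REGS["MDR"] = MEM[REGS["MAR"]]
    elif op == "write_mem":
        MEM[REGS["MAR"]] = REGS["MDR"]
    elif op == "mdr_to_acc":
        REGS["ACC"] = REGS["MDR"]
    elif op == "acc_to_mdr":
        REGS["MDR"] = REGS["ACC"]
    elif op == "add_mdr_to_acc":
        REGS["ACC"] += REGS["MDR"]

def execute(instruction):
    """Interpret one ISA-level instruction by running its microprogram."""
    opcode, operand = instruction
    for micro_op in CONTROL_STORE[opcode]:
        micro_step(micro_op, operand)

MEM[0], MEM[1] = 2, 3
for insn in [("LOAD", 0), ("ADD", 1), ("STORE", 2)]:
    execute(insn)
print(MEM[2])  # 5: the ISA-level program ran entirely via microinstructions
```

The structure is the point: the ISA-level program never drives the data path directly; each regular instruction is interpreted by a short microprogram fetched from the control store.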

Table 1 lists the four models of the new System/360 ISA that IBM announced on April 7, 1964. The data paths vary by a factor of 8, memory capacity by a factor of 16, clock rate by nearly 4, performance by 50, and cost by nearly 6. The most expensive computers had the widest control stores because more complicated data paths used more control lines. The least expensive computers had narrower control stores because their hardware was simpler, but they needed more microinstructions, since they took more clock cycles to execute a System/360 instruction.

Table 1: Characteristics of the four models of the IBM System/360 series; IPS indicates the number of instructions executed per second.

Emboldened by microprogramming, IBM bet the company’s future on the new ISA, hoping it would revolutionize computing. IBM won the bet and dominated its markets; mainframe descendants of that computer family still generate $10 billion a year in revenue 55 years after its launch.

As we have seen time and again, although the marketplace is an imperfect judge of technical matters, given the close ties between architecture and commercial computers, it is the market that ultimately decides the success of architectural innovations, which often require heavy engineering investment.

Integrated circuits, CISC, 432, 8086, IBM PC

When computers began using integrated circuits, Moore’s Law meant control stores could become much larger. Larger control stores, in turn, allowed much more complex ISAs. Consider that the control store of the VAX-11/780, introduced by Digital Equipment Corp. in 1977, was 5,120 words by 96 bits, while its predecessor’s was only 256 words by 56 bits.

Some manufacturers chose to open up microprogramming so that selected customers could add custom features, an option they called writable control store (WCS). The most famous WCS computer was the Alto, which Turing Award winners Chuck Thacker and Butler Lampson, together with their colleagues, created for the Xerox Palo Alto Research Center in 1973. It was in effect the first personal computer (PC), with the first bit-mapped display and the first Ethernet local-area network. The device controllers for the new display and network were microprograms stored in a 4,096-word by 32-bit WCS.

Microprocessors of the 1970s, such as Intel’s 8080, were still in the 8-bit era and were programmed mainly in assembly language. Rival designers tried to outdo one another by adding new instructions, showing off their advances with assembly language examples.

Gordon Moore believed Intel’s next ISA would last the lifetime of the company, so he hired many clever computer science Ph.D.s and sent them to Portland to invent the next great ISA. Intel’s original computer architecture project, called the 8800, was ambitious for any era, and certainly the most aggressive of the 1980s. It had 32-bit capability-based addressing, an object-oriented architecture, variable-bit-length instructions, and its own operating system written in the then-new programming language Ada.

Unfortunately, the ambitious project stalled a few years later, forcing Intel to start an emergency replacement effort in Santa Clara to deliver a 16-bit microprocessor in 1979. Intel gave the new team 52 weeks to develop the new “8086” ISA and to design and build the chip. Given the tight schedule, the team essentially extended the 8080’s 8-bit registers and instruction set to 16 bits, scheduling the ISA work as just 10 person-weeks spread over three regular calendar weeks. The team completed the 8086 on time, but it attracted little fanfare when announced.

Luckily for Intel, IBM was developing a personal computer to compete with the Apple II and needed a 16-bit microprocessor. IBM had been interested in the Motorola 68000, whose ISA resembled the IBM 360, but it lagged behind IBM’s aggressive schedule, so IBM switched to an 8-bit bus version of the 8086. IBM announced the PC on August 12, 1981, hoping to sell 250,000 units by 1986. Instead, the company sold 100 million units worldwide, handing a very bright future to the emergency replacement Intel ISA.

Intel’s original 8800 project, renamed the iAPX-432, finally launched in 1981, but it required multiple chips and had severe performance problems. It was canceled in 1986, the year after Intel extended the 16-bit 8086 ISA in the 80386 by expanding its registers from 16 bits to 32 bits. Moore was thus right that the next ISA would last as long as Intel, but the market chose the emergency 8086 rather than the purpose-built 432. As the designers of the Motorola 68000 and the iAPX-432 both learned, markets can be impatient.

From complex instruction set to reduced instruction set computer

In the early 1980s, several groups investigated the complex instruction set computers (CISC) enabled by the large microprograms in these larger control stores. With Unix demonstrating that even operating systems could be written in high-level languages, the key question became “What instructions will compilers generate?” rather than “What assembly language will programmers use?” Significantly raising the hardware/software interface created an opportunity for architectural innovation.

Turing Award winner John Cocke and his colleagues developed simpler ISAs and compilers for minicomputers. As an experiment, they retargeted their research compilers to use only the simple register-register operations and load-store data transfers of the IBM 360 ISA, avoiding the more complicated instructions. They found that programs ran three times faster using this simple subset. Emer and Clark found that 20% of the VAX instructions needed 60% of the microcode but represented only 0.2% of the execution time.

David Patterson spent a sabbatical at DEC working on reducing bugs in VAX microcode. He argued that if microprocessor makers were going to follow the CISC ISA designs of the larger computers, they would need a way to repair microcode bugs. He wrote a paper on the idea, but the journal Computer rejected it. Reviewers felt it was a terrible idea to build microprocessors with ISAs so complicated that they needed to be repaired in the field. That rejection made him question the value of CISC ISAs for microprocessors. Ironically, modern CISC microprocessors do include microcode repair mechanisms, but the main result of the rejection was to inspire him to work on a less complex ISA for microprocessors: the reduced instruction set computer (RISC).

These observations, together with the shift to high-level programming languages, created the opportunity to move from CISC to RISC. First, RISC instructions were simplified, so there was no need for a microcoded interpreter: RISC instructions are typically as simple as microinstructions, and the hardware can execute them directly. Second, the fast memory formerly used for the microcode interpreter of a CISC ISA was repurposed as a cache of RISC instructions. (A cache is a small, fast memory that buffers recently executed instructions, because such instructions are likely to be reused soon.) Third, register allocators based on Gregory Chaitin’s graph-coloring scheme made it much easier for compilers to use registers efficiently, which benefits these register-register ISAs. Finally, Moore’s Law meant that by the 1980s there were enough transistors on a single chip to hold a full 32-bit data path along with instruction and data caches.
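As a concrete illustration of the third point, here is a minimal sketch of graph-coloring register allocation in Python. The interference graph and the greedy coloring order are toy simplifications; Chaitin-style allocators add coalescing, spill-cost heuristics, and iterative simplification that are omitted here.

```python
# A minimal sketch of graph-coloring register allocation on a toy
# interference graph; not Chaitin's full algorithm.

def allocate_registers(interference, k):
    """Greedily assign one of k registers to each variable, or 'SPILL'."""
    assignment = {}
    # Color variables in a fixed order; interfering neighbors must differ.
    for var in sorted(interference):
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(k) if r not in taken]
        assignment[var] = free[0] if free else "SPILL"
    return assignment

# Variables that are live at the same time "interfere" and need distinct registers.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(allocate_registers(graph, k=3))  # {'a': 0, 'b': 1, 'c': 2, 'd': 0}
```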

In today’s post-PC era, x86 shipments are down about 10% a year from their 2011 peak, while RISC processor chip shipments have ballooned to 20 billion.

Figure 1, for example, shows the RISC-I and MIPS microprocessors, developed at UC Berkeley in 1982 and Stanford in 1983, respectively, which demonstrated the advantages of RISC. These chips were eventually presented at the leading circuit conference, the IEEE International Solid-State Circuits Conference, in 1984. It was a remarkable moment: a few graduate students at Berkeley and Stanford had built microprocessors that were arguably ahead of what industry could build.

Figure 1: The RISC-I microprocessor from UC Berkeley and the MIPS microprocessor from Stanford University.

These academic chips inspired many companies to build RISC microprocessors, which were the fastest chips for the next 15 years. The following formula explains processor performance:

Time/Program = (Instructions/Program) × (Clock cycles/Instruction) × (Time/Clock cycle)

DEC engineers later showed that the more complex CISC ISA executed about 75% as many instructions per program as RISC (the first term), but in a similar technology CISC took five to six more clock cycles per instruction (the second term), making RISC microprocessors about three times faster.
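To see how the two terms combine, plug illustrative numbers into the formula above; the cycles-per-instruction ratio of roughly 4 is our own assumption, chosen only to make the arithmetic concrete:

Time_CISC / Time_RISC ≈ (instruction-count ratio) × (cycles-per-instruction ratio) ≈ 0.75 × 4 ≈ 3

so the RISC design ends up roughly three times faster even though it executes more instructions per program.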

Such formulas were not found in the computer architecture books of the 1980s, which led us to write Computer Architecture: A Quantitative Approach, published in 1989. The subtitle states the main theme of the book: use measurements and benchmarks to evaluate trade-offs quantitatively instead of relying, as in the past, on the architect’s intuition and experience. The quantitative approach we used was also inspired by Turing Award winner Donald Knuth’s book on algorithms.

VLIW, EPIC, Itanium

The next ISA innovation was supposed to succeed both RISC and CISC. The very long instruction word (VLIW) and its cousin, the explicitly parallel instruction computer (EPIC), use wide instructions in which multiple independent operations are bundled together in each instruction. VLIW and EPIC advocates argued that if a single instruction could specify, say, six independent operations (two data transfers, two integer operations, and two floating-point operations), and compiler technology could efficiently assign operations to the six instruction slots, then the hardware could be simpler. Like the RISC approach, VLIW and EPIC shift work from the hardware to the compiler.

Working together, Intel and HP designed a 64-bit processor based on EPIC ideas to replace the 32-bit x86. Intel and HP had high hopes for the first EPIC processor, the Itanium, but reality did not match its developers’ early claims. While the EPIC approach worked well for highly structured floating-point programs, it struggled to reach high performance on integer programs with less predictable cache misses and harder-to-predict branches. As Donald Knuth later noted, “The Itanium approach ... was supposed to be so terrific, until it turned out that the wished-for compilers were basically impossible to write.” Pundits noted Itanium’s delays and underperformance and, recalling the Titanic, rechristened it “Itanic.” The market again ran out of patience, and the 64-bit version of x86, not Itanium, became the successor to the 32-bit x86.

The good news is that VLIW still suits narrower application domains with small programs and simpler branches that omit caches, including digital signal processing.

RISC vs. CISC in the PC and post-PC era

AMD and Intel used 500-person design teams and superior semiconductor technology to close the performance gap between x86 and RISC. Again inspired by the performance advantage of pipelining simple instructions over complex ones, their instruction decoders translate complex x86 instructions into RISC-like internal microinstructions at run time. AMD and Intel then pipeline the execution of these RISC microinstructions. Every idea RISC designers used to improve performance, including separate instruction and data caches, second-level caches on the chip, deep pipelines, and fetching and executing several instructions simultaneously, was incorporated into x86 designs. AMD and Intel shipped roughly 350 million microprocessors at the peak of the PC era in 2011. The high volumes and low margins of the PC industry also meant lower prices than RISC computers.

With hundreds of millions of PCs sold worldwide each year, PC software became a giant market. Whereas software providers in the Unix market offered different versions of their software for each commercial RISC ISA (Alpha, HP-PA, MIPS, Power, and SPARC), the PC market enjoyed a single ISA, so software developers shipped “shrink-wrapped” software that was binary compatible only with the x86 ISA. By 2000, a much larger software base, similar performance, and lower prices let x86 dominate the desktop and small-server markets.

Apple helped launch the post-PC era in 2007. Instead of buying microprocessors, smartphone companies build their own systems on a chip (SoC) using designs from other companies, including ARM’s RISC processors. Mobile-device designers value die area and energy efficiency as much as performance, which disadvantages CISC ISAs. In addition, the arrival of the Internet of Things has vastly increased both the number of processors and the trade-offs required among die size, power, cost, and performance. This trend raises the importance of design time and cost, further disadvantaging CISC processors. In today’s post-PC era, x86 shipments have fallen almost 10% a year since peaking in 2011, while chips with RISC processors have soared to 20 billion. Today, 99% of 32-bit and 64-bit processors are RISC.

To close this historical review: the marketplace settled the RISC-CISC debate. CISC won the later stages of the PC era, but RISC is winning the post-PC era. There have been no new CISC ISAs in decades. To our surprise, 35 years after their introduction, the consensus choice of ISA for general-purpose processors today is still RISC.

Current challenges in processor architecture

“If a problem has no solution, it may not be a problem, but a fact; not solved by us, but solved by the passage of time.” – Shimon Peres

Although the previous sections focused on instruction set architecture (ISA) design, most computer architects do not design new ISAs; they implement existing ISAs in the available implementation technology. Since the late 1970s, the technology of choice has been metal-oxide-semiconductor (MOS) integrated circuits, first n-type MOS (nMOS) and then complementary MOS (CMOS). The astonishing rate of improvement in MOS technology, captured in Moore’s predictions, has been the driving force that let architects design ever more aggressive methods of achieving better performance for a given ISA. In his original 1965 prediction, Moore said transistor density would double every year; in 1975, he revised it to doubling every two years. That prediction eventually became known as Moore’s Law. Because transistor density grows quadratically while speed grows only linearly, architects have used the extra transistors to improve performance.

The end of Moore’s Law and Dennard scaling

Although Moore’s Law held for decades (see Figure 2), it began to slow around 2000. By 2018, actual transistor counts had fallen a factor of 15 below what Moore’s Law predicts. Current projections are that the gap will keep widening as CMOS technology approaches its fundamental limits.

Figure 2. Number of transistors per Intel microprocessor vs. Moore’s Law

Accompanying Moore’s Law was Dennard scaling, a projection by Robert Dennard. He observed that as transistor density rose, power consumption per transistor would fall, so the power per square millimeter of silicon would stay nearly constant. Because the computational capability of each square millimeter of silicon increased with each generation of technology, computers would become more energy efficient. Dennard scaling began slowing dramatically in 2007 and had essentially ended by around 2012 (see Figure 3).

Figure 3. Transistors per chip and power consumption per square millimeter.

Between 1986 and 2002, exploiting instruction-level parallelism (ILP) was the primary architectural method for improving performance and, together with faster transistors, delivered annual performance gains of about 50%. The end of Dennard scaling meant that engineers had to find more efficient ways to exploit parallelism.

To understand why increasing ILP made chips much less energy efficient, consider a current processor core from ARM, Intel, or AMD. Assume it has a 15-stage pipeline and issues four instructions per clock cycle. At any moment it therefore has up to 60 instructions in flight, including roughly 15 branches, since branches represent about 25% of executed instructions. To keep the pipeline full, branches are predicted and code is placed into the pipeline speculatively, based on those predictions. Speculation is the source of both ILP’s performance and its energy inefficiency. When branch prediction is perfect, speculation improves performance at little extra energy cost, and may even save energy; but when branches are mispredicted, the processor must throw away the incorrectly executed instructions, and all their computational work and energy is wasted. The processor’s internal state must also be restored to the state before the mispredicted branch, consuming additional time and energy.

To see how challenging such a design is, consider the difficulty of correctly predicting the outcomes of 15 branches. If a processor is to limit its wasted work to 10%, it must predict each branch correctly 99.3% of the time. Few general-purpose programs have branches that can be predicted so accurately.
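The 99.3% figure is simple arithmetic: for all 15 in-flight branches to be predicted correctly about 90% of the time, each individual branch must be predicted with accuracy of at least 0.90^(1/15) ≈ 0.993, since 0.993^15 ≈ 0.90.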

To see how this adds up, consider the data in Figure 4, which shows the fraction of instructions that were effectively executed but turned out to be useless because the processor speculated incorrectly. On the SPEC integer benchmarks, 19% of the instructions on an Intel Core i7 are wasted, but the wasted energy is even worse, because the processor must spend additional energy to restore its state when it speculates incorrectly. Measurements like these led many to conclude that architects needed a different approach to achieve performance improvements. The multicore era was born.

Figure 4. Intel Core i7 wasted instructions as a percentage of completed instructions on various SPEC integer benchmarks.

Multicore shifted the responsibility for identifying parallelism and deciding how to exploit it to the programmer and the language system. Multicore does not solve the challenge of energy-efficient computing that was aggravated by the end of Dennard scaling: every active core burns power whether or not it contributes effectively to the computation. A primary hurdle is Amdahl’s Law, which states that the speedup from a parallel computer is limited by the fraction of the computation that is sequential. Figure 5 shows the importance of this law: it plots how much faster an application runs on up to 64 cores, relative to a single core, for different fractions of the execution that are serial (that is, run on only one processor). For example, if only 1% of the time is serial, a 64-processor configuration gives a speedup of about 35, yet the energy required is proportional to 64 processors, so roughly 45% of the energy is wasted.
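For reference, Amdahl’s Law written in the same style as the performance formula earlier, where s is the serial fraction of the execution and N is the number of processors:

Speedup = 1 / (s + (1 - s)/N)

This is the relationship plotted in Figure 5.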

Figure 5. The effect of Amdahl’s Law on speedup when different fractions of the execution time are serial.

Real programs are of course more complex, with some parts able to use varying numbers of processors at any given moment. Still, the need to communicate and synchronize periodically means most applications have portions that can make effective use of only a fraction of the processors. Although Amdahl’s Law is more than 50 years old, it remains a formidable obstacle.

With the end of Dennard scaling, increasing the number of cores on a chip means increasing the power. Some of the electrical energy entering a processor inevitably turns into heat, which must be removed. Multicore processors are therefore limited by the thermal dissipation power (TDP), the average amount of power that the package and cooling system can remove. While some high-end data centers use more advanced packaging and cooling technology, no computer user wants a small heat exchanger on the desk or a radiator on their back to cool a phone. The limit of TDP led directly to the era of “dark silicon,” in which processors slow their clock rate and turn off idle cores to prevent overheating. Another way to view this approach is that some chips can reallocate their precious power budget from idle cores to active ones.

The end of Dennard scaling, the fading of Moore’s Law, and Amdahl’s Law in full force mean that inefficiency limits performance improvement to a few percent per year (see Figure 6). Achieving higher rates of improvement, as in the 1980s and 1990s, will require new architectural approaches that use integrated circuits far more efficiently. We next turn to computer security, another major shortcoming of modern computers, before returning to approaches that can work.

Figure 6: Computer performance improvement with integer programs (SPECintCPU).

Neglected computer security

In the 1970s, processor architects paid considerable attention to computer security, with concepts such as protection rings and capabilities. These architects understood that most bugs would be in software, but they believed architectural support could help. Most operating systems, however, did not use these features; the operating systems assumed they were running in benign environments (such as personal computers), so features with significant costs went unused. In the software community, many believed that microkernels and formal verification techniques would provide effective means of building highly secure software. Unfortunately, the scale of our software systems and the drive for performance meant such techniques could not keep pace. The result is that large software systems are still riddled with security flaws, and their impact is amplified by the vast amount of personal information online and by the use of cloud computing.

The end of Dennard scaling means that architects must find more efficient ways to exploit parallelism.

Although computer architects were slow to recognize the importance of security, they did begin providing hardware support for virtual machines and encryption. Unfortunately, speculative execution introduced a previously unknown but significant security flaw into many processors. In particular, the Meltdown and Spectre vulnerabilities showed that microarchitecture can leak information that was supposed to be protected. Both exploit what are called side-channel attacks. In 2018, researchers showed how a Spectre variant can leak information over a network without the attacker even loading code onto the target processor. Although this attack, called NetSpectre, leaks information slowly, the fact that it threatens every machine on the same local-area network creates many new challenges. Two more vulnerabilities in the virtual-machine architecture were subsequently disclosed. One of them, Foreshadow, affects the Intel SGX security mechanism designed to protect the highest-risk data, such as encryption keys. New vulnerabilities are being discovered monthly.

Side-channel attacks are not new, but in most earlier cases it was a software flaw that let them succeed. In attacks such as Meltdown and Spectre, a flaw in the hardware implementation exposes protected information. This is a fundamental challenge to how processor architects define what a correct implementation of an ISA is, because the standard definition says nothing about the performance effects of executing an instruction sequence, only about the ISA-visible architectural state of the execution. Architects need to rethink their definition of a correct ISA implementation to prevent such security flaws. At the same time, they should rethink the attention they pay to computer security and how they can work with software designers to build more secure systems. Architects, and everyone else, depend too much on information systems to allow security to be treated as anything less than a first-class design concern.

Future opportunities in computer architecture

“Our opportunity lies in the unsolved problems.” – John Gardner, 1965

The inherent inefficiency of general-purpose processors, together with the end of Dennard scaling and Moore’s Law, makes it highly unlikely that processor architects and designers can sustain significant performance improvements in general-purpose processors. Given the importance of better performance in enabling new software capabilities, we must ask: what other approaches might be effective?

There are two clear opportunities, and a third created by combining them. First, existing software construction techniques make extensive use of high-level languages with dynamic typing and storage management. Unfortunately, such languages are typically interpreted and execute very inefficiently. Leiserson et al. used a small example, matrix multiplication, to illustrate this inefficiency. As Figure 7 shows, simply rewriting the code from Python (a typical high-level, dynamically typed language) into C improves performance 46-fold.

Running parallel loops on many cores raises performance by roughly another factor of 7. Optimizing the memory layout to exploit caches yields a further factor of nearly 19, and hardware extensions for single-instruction multiple-data (SIMD) parallelism, which perform 16 32-bit operations per instruction, add a final factor of more than 8. All told, the final, highly optimized version runs more than 62,000 times faster than the original Python version on a multicore Intel processor. This is of course a small example, one for which we would expect programmers to use an optimized library. Although it exaggerates the usual performance gap, many programs could plausibly be sped up by factors of 100 to 1,000.
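To make the first step of that progression concrete, here is a minimal sketch comparing a pure-Python triple loop with an optimized library routine. The matrix size, timing harness, and use of NumPy are our own choices for illustration; this is not Leiserson et al.’s benchmark, and the measured ratio will vary by machine.

```python
# A rough illustration of the gap between interpreted Python loops and an
# optimized matrix-multiply routine. Sizes and measurements are illustrative.
import time
import numpy as np

n = 256
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float(j * n + i) for j in range(n)] for i in range(n)]

def matmul_python(X, Y):
    """Naive triple-loop matrix multiply in pure Python."""
    size = len(X)
    C = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for k in range(size):        # i-k-j order reuses X[i][k] across the inner loop
            xik = X[i][k]
            row_y = Y[k]
            row_c = C[i]
            for j in range(size):
                row_c[j] += xik * row_y[j]
    return C

start = time.perf_counter()
matmul_python(A, B)
t_python = time.perf_counter() - start

An, Bn = np.array(A), np.array(B)
start = time.perf_counter()
An @ Bn                              # BLAS-backed: cache-blocked, SIMD, multithreaded
t_numpy = time.perf_counter() - start

print(f"pure Python: {t_python:.2f}s  numpy: {t_numpy:.5f}s  "
      f"ratio ~{t_python / t_numpy:.0f}x")
```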

Figure 7. Potential speedup of matrix multiplication in Python through four optimizations.

An interesting research direction is whether new compiler technology, possibly assisted by architectural enhancements, could close some of this performance gap. Although efficiently compiling and executing a high-level scripting language like Python is difficult, the potential payoff is enormous: even capturing a quarter of that potential would make Python programs run roughly a hundred times faster. This simple example illustrates the large gap between modern languages, which emphasize programmer productivity, and traditional approaches, which emphasize performance.

Domain-specific architectures. A more hardware-centric approach is to design architectures tailored to a specific problem domain, giving them significant performance (and efficiency) gains for that domain; hence the name “domain-specific architectures” (DSAs), a class of processors that are programmable and typically Turing-complete but tailored to a particular class of applications. In this sense they differ from application-specific integrated circuits (ASICs), which are used for a single function with code that rarely changes. DSAs are often called accelerators, since they speed up parts of an application compared with executing the whole application on a general-purpose CPU. DSAs can also achieve better performance because they are tailored more closely to the needs of the application. Examples of DSAs include graphics processing units (GPUs), neural network processors used for deep learning, and processors for software-defined networks (SDNs). DSAs are faster and more energy efficient for four reasons:

First and most important, DSAs exploit a more efficient form of parallelism for the specific domain. For example, single-instruction multiple-data (SIMD) parallelism is more efficient than multiple-instruction multiple-data (MIMD) parallelism because it fetches only one instruction stream and its processing units operate in lockstep. Although SIMD is less flexible than MIMD, it is a good match for many DSAs. DSAs may also exploit ILP with a VLIW approach rather than speculative out-of-order mechanisms. As mentioned earlier, VLIW processors are a poor match for general-purpose code but are more effective in limited domains because their control mechanisms are much simpler. In particular, most high-end general-purpose processors are out-of-order superscalars that require complex control logic both to initiate and to complete instructions. In contrast, VLIW does the necessary analysis and scheduling at compile time, which works well for explicitly parallel programs.

Second, DSAs can make more effective use of the memory hierarchy. As Horowitz noted, memory accesses are far more expensive than arithmetic operations. For example, accessing a 32-kilobyte cache takes about 200 times as much energy as a 32-bit addition. This enormous difference makes optimizing memory accesses critical to achieving high energy efficiency. General-purpose processors run code in which memory accesses typically exhibit spatial and temporal locality but are otherwise hard to predict at compile time. CPUs therefore use multilevel caches to increase bandwidth and hide the latency of relatively slow off-chip DRAM. These multilevel caches typically consume about half of the processor’s energy, but they avoid most accesses to off-chip DRAM, which require roughly 10 times the energy of a last-level cache access.

Caching has two major drawbacks:

  • When data sets are very large, caches work poorly because temporal and spatial locality are low.
  • When caches work well, the locality is very high, which means that, by definition, most of the cache is idle most of the time.

In applications where the memory access patterns are well defined and discoverable at compile time, which is true of typical DSLs, programmers and compilers can optimize memory use better than dynamically allocated caches can. DSAs therefore usually use hierarchies of memories whose data movement is explicitly controlled by software, similar to the operation of vector processors. For suitable applications, such user-controlled memories consume much less energy than caches.

Third, DSAs can use less precision when it is adequate. General-purpose CPUs usually support 32-bit and 64-bit integer and floating-point data. For many applications in machine learning and graphics, that is more accuracy than needed. For example, in deep neural networks (DNNs), inference typically uses 4-, 8-, or 16-bit integers, improving both data throughput and computational throughput. Likewise, for DNN training, floating point is useful, but 32 bits are enough and 16 bits often work.
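As a small illustration of what “using less precision” means in practice, here is a sketch of symmetric 8-bit quantization of a weight matrix. The scheme, matrix size, and scale choice are ours for illustration; real accelerators and frameworks use more elaborate (for example, per-channel) schemes.

```python
# Illustrative symmetric int8 quantization of FP32 weights.
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0           # map the largest magnitude to 127
w_int8 = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
w_restored = w_int8.astype(np.float32) * scale  # dequantize for comparison

print("max abs error:", np.abs(weights - w_restored).max())
print("int8 bytes:", w_int8.nbytes, "vs fp32 bytes:", weights.nbytes)
```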

Finally, DSAs benefit from target programs written in domain-specific languages (DSLs), which expose more parallelism, improve the structure and representation of memory accesses, and make it easier to map the application efficiently onto a domain-specific processor.

Domain-specific languages

DSAs require targeting high-level operations to the architecture, but trying to extract that structure and information from a general-purpose language such as Python, Java, C, or Fortran is simply too difficult. Domain-specific languages (DSLs) enable this process and make it possible to program DSAs efficiently. For example, DSLs can make vector, dense-matrix, and sparse-matrix operations explicit, allowing the DSL compiler to map the operations onto the processor efficiently. Familiar examples of DSLs include Matlab, which focuses on matrix operations; TensorFlow, which is used to program DNNs; P4, which is used to program SDNs; and Halide, which specifies high-level transformations in image processing.

The challenge for DSLs is to remain architecture-independent enough that software written in a DSL can be ported to different architectures while still mapping efficiently onto the underlying DSA. For example, the XLA system translates TensorFlow into graphs that can target heterogeneous processors such as GPUs or TPUs. Balancing portability across DSAs with efficiency is an interesting research challenge for language designers, compiler writers, and DSA architects.

An example DSA: the TPU

As an example DSA, consider the Google TPU v1, which was designed to accelerate neural network inference. In production since 2015, the TPU supports a wide variety of Google services, from search to language translation and image recognition, as well as DeepMind research such as AlphaGo and AlphaZero. The goal was to improve the performance and energy efficiency of deep neural network inference by a factor of 10.

As Figure 8 shows, the organization of the TPU is radically different from that of a general-purpose processor. Its main computational unit is the matrix unit, a systolic array that delivers 256 × 256 (65,536) multiply-accumulate operations every clock cycle. The TPU’s combination of 8-bit precision, systolic structure, and SIMD control means it performs roughly 100 times as many multiply-accumulates per clock cycle as a typical general-purpose single-core CPU.

Instead of caches, the TPU uses 24 megabytes of local memory, roughly double the capacity of a 2015-era general-purpose CPU with the same power dissipation. Finally, the activation memory and the weight memory (including a FIFO structure that holds weights) are connected through user-controlled high-bandwidth memory channels. Using a weighted arithmetic mean over six common inference problems from Google data centers, the TPU is 29 times faster than a typical CPU. Because the TPU requires less than half the power, it is roughly 80 times more energy efficient than a typical CPU on this workload.
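A minimal numerical sketch of the matrix unit’s arithmetic contract: 8-bit operands multiplied and summed into wider integer accumulators. This models only the numerical result of a systolic matrix multiply, not the hardware’s pipelined dataflow, and the sizes are illustrative.

```python
# Models the arithmetic of an 8-bit matrix unit: int8 operands multiplied and
# accumulated into int32, as in TPU-style inference. Not a hardware model.
import numpy as np

def mac_array_matmul(activations_i8, weights_i8):
    """Multiply int8 activations by int8 weights, accumulating in int32."""
    assert activations_i8.dtype == np.int8 and weights_i8.dtype == np.int8
    return activations_i8.astype(np.int32) @ weights_i8.astype(np.int32)

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(4, 256), dtype=np.int8)    # activations
W = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)  # weights
acc = mac_array_matmul(A, W)                                  # int32 accumulators
print(acc.shape, acc.dtype)                                   # (4, 256) int32
```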

Figure 8: Functional organization diagram of the Google Tensor Processing Unit (TPU V1).

Summary

We have considered two different approaches to improving program performance by improving efficiency in the use of hardware: first, by improving the performance of modern high-level languages; second, by building domain-specific architectures that greatly improve performance and efficiency compared with general-purpose CPUs. DSLs are another example of improving the hardware/software interface to enable architectural innovations such as DSAs. Achieving significant gains through these approaches will require a vertically integrated design team that understands applications, domain-specific languages and their compiler technology, computer architecture and organization, and the underlying implementation technology. The need to vertically integrate and make design decisions across levels of abstraction was characteristic of the early days of computing, before the industry became horizontally structured. In this new era, vertical integration has become more important, and teams that can examine and make complex trade-offs and optimizations will have the advantage.

This opportunity has already led to a surge of architectural innovation and has attracted many competing architectural design philosophies:

  • GPUs: Nvidia GPUs use many cores, each with large register files, many hardware threads, and caches;
  • TPUs: Google TPUs rely on large two-dimensional systolic arrays and on-chip memory under hardware control;
  • FPGAs: Microsoft deploys field-programmable gate arrays (FPGAs) in its data centers and tailors them to neural network applications;
  • CPUs: Intel offers CPUs with many cores enhanced by large multilevel caches and one-dimensional SIMD instructions, the kind of FPGAs used by Microsoft, and a new neural network processor that is closer to a TPU than to a CPU.

Beyond these giants, dozens of startups are pursuing their own ideas. To meet growing demand, architects are interconnecting hundreds to thousands of such chips to form neural-network supercomputers.

This avalanche of DNN architectures makes for interesting times in computer architecture. It is hard to predict in 2019 which of these directions will win (if any ever does), but the marketplace will surely settle the competition, just as it settled past architectural debates.

Open architecture

Inspired by the success of open source software, the second opportunity for computer architecture is open ISAs. To create a “Linux of processors,” the field needs industry-standard open ISAs so that the community can create open source cores, in addition to individual companies owning proprietary ones. If many organizations design processors using the same ISA, the greater competition may drive even faster innovation. The goal is to provide processors for chips that cost anywhere from a few cents to $100.

The first example is RISC-V (pronounced “RISC five”), the fifth RISC architecture developed at UC Berkeley. RISC-V has a community that maintains the architecture under the stewardship of the RISC-V Foundation. Being open allows the ISA to evolve in public, with hardware and software experts collaborating before decisions are finalized. An added benefit of an open foundation is that the ISA is unlikely to grow primarily for marketing reasons, which is sometimes the only explanation for extensions to proprietary instruction sets.

RISC-V is a modular instruction set. A small base of instructions runs the full open source software stack, followed by optional standard extensions that designers can include or omit depending on their needs. The base includes 32-bit-address and 64-bit-address versions. RISC-V can grow only through optional extensions; the software stack still runs fine even if architects do not adopt a new extension. Proprietary architectures generally require upward binary compatibility, meaning that when a processor company adds a new feature, all future processors must also include it. Not so for RISC-V: all enhancements are optional and can be dropped if an application does not need them. Here are the standard extensions so far, using initials that stand for their full names:

  • M. Integer multiply/divide;
  • A. Atomic memory operations;
  • F/D. Single/double-precision floating point;
  • C. Compressed instructions.

Fewer instructions. RISC-V has far fewer instructions. There are 50 in the base, similar to the original RISC-I. The remaining standard extensions (M, A, F, and D) add 53 instructions, plus C adds another 34, for a total of 137. ARMv8 has over 500 instructions.

Fewer instruction formats. RISC-V has far fewer instruction formats: only six, while ARMv8 has at least 14.

Simplicity reduces the effort required both to design the processor and to verify the hardware. Because RISC-V targets everything from data-center chips to IoT devices, design verification can be a significant part of development cost.

RISC-V is a clean-slate design begun 25 years after the first RISC architectures, so its designers could learn from the mistakes of their predecessors. Unlike the first-generation RISC architectures, it avoids microarchitecture-dependent and technology-dependent features (such as delayed branches and delayed loads) as well as innovations (such as register windows) that were later superseded by advances in compiler technology.

Finally, RISC-V supports DSA by reserving a large opcode space for custom accelerators.

Security experts don’t believe in invisible security, so open implementations are attractive, and open implementations require open architectures.

In addition to RISC-V, Nvidia announced in 2017 a free and open architecture it calls the Nvidia Deep Learning Accelerator (NVDLA), a scalable, configurable DSA for machine-learning inference. Configuration options include the data type (int8, int16, or fp16) and the size of the two-dimensional multiply matrix. Die sizes scale from 0.5 mm^2 to 3 mm^2, and power from 20 milliwatts to 300 milliwatts. The ISA, the software stack, and the implementation are all open.

Open, simple architectures work in synergy with security. First, security experts do not believe in security through obscurity, so open implementations are attractive, and open implementations require an open architecture. Equally important is increasing the number of people and organizations that can innovate around secure architectures. Proprietary architectures limit participation to their own employees, but open architectures allow the best minds in academia and industry to help improve security. Finally, the simplicity of RISC-V makes its implementations easier to check. Moreover, open architectures, implementations, and software stacks, plus the plasticity of FPGAs, mean that architects can deploy and evaluate novel solutions online and iterate on them weekly rather than annually. While FPGAs are about 10 times slower than custom chips, that performance is still fast enough to support online users while subjecting security innovations to real attacks. We expect open architectures to become the exemplar for hardware/software co-design by architects and security experts.

Agile Hardware Development

The Manifesto for Agile Software Development (2001) by Beck et al. revolutionized software development, overcoming the frequent failures of the elaborate planning and documentation of waterfall development. Small programming teams quickly developed working but incomplete prototypes and got customer feedback before starting the next iteration. The Scrum variant of agile development brings together teams of five to 10 programmers doing sprints of two to four weeks per iteration.

Once again inspired by a software success, the third opportunity is agile hardware development. The good news for architects is that modern electronic computer-aided design (ECAD) tools raise the level of abstraction, enabling agile development, and this higher level of abstraction also increases reuse across designs.

A four-week sprint seems implausible for hardware, given the months between when a design is submitted and when the chip comes back. Figure 9 outlines how an agile development method can work by changing the prototype at the appropriate level. The innermost level is a software simulator, the easiest and quickest place to make changes, if a simulator can satisfy an iteration. The next level is FPGAs, which can run hundreds of times faster than a detailed software simulator. FPGAs can run operating systems and full benchmarks, such as those from the Standard Performance Evaluation Corporation, allowing much more precise evaluation of prototypes. Amazon offers FPGAs in the cloud, so architects can use them without having to buy hardware and set up a lab. To document die area and power, the next level uses ECAD tools to generate a chip layout. Even after the tools run, some manual steps are needed to refine the results before a new processor is ready to be manufactured; processor designers call this next level “tape-in.” These first four levels all support four-week sprints.

Figure 9: Agile hardware development methodology.

For research purposes, we could stop at tape-in, since the area, energy, and performance estimates are highly accurate. But that would be like stopping about 100 meters from the finish line of a long-distance race because the runner can already predict the final time: despite all the hard work of preparation, the runner would miss the thrill and satisfaction of actually crossing the line. One advantage hardware engineers have over software engineers is that they build physical things. Getting chips back to measure, run real programs on, and show to friends and family is one of the great joys of hardware design.

Many researchers assume they must stop short because fabricating chips is too expensive. In fact, when designs are small, they are surprisingly cheap: architects can order 100 1-mm^2 chips for as little as $14,000. At 28 nanometers, a 1-mm^2 die holds millions of transistors, enough area for both a RISC-V processor and an NVDLA accelerator. The outermost level is expensive only if the designer intends to build a large chip; an architect can demonstrate many novel ideas with small chips.

Conclusion

“The darkest hour is that before the dawn.” – Thomas Fuller, 1650

To benefit from the lessons of history, architects must appreciate that software innovations can also inspire architecture, that raising the level of abstraction of the hardware/software interface creates opportunities for innovation, and that the marketplace ultimately settles computer architecture debates. The iAPX-432 and Itanium illustrate how architecture investment can exceed returns, while the S/360, the 8086, and ARM have delivered surprisingly high returns for decades and show no sign of stopping.

The end of Dennard scaling and Moore’s Law, and the slowing of performance gains in standard microprocessors, are not problems that must be solved but facts to be accepted, and facts that offer breathtaking opportunities. High-level, domain-specific languages and architectures, freedom from the chains of proprietary instruction sets, and the public’s growing demand for security will usher in a new golden age for computer architects. With the help of open source ecosystems, agilely developed chips will convincingly demonstrate their advances and thereby accelerate commercial adoption. The ISA philosophy of the general-purpose processors in these chips will most likely remain RISC, which has stood the test of time. We expect improvement as rapid as in the last golden age, but this time in cost, energy, and security as well as in performance.

The next decade will see a Cambrian explosion of novel computer architectures, making it an exciting time for architects in industry and academia alike.

The original:

Cacm.acm.org/magazines/2…

(This article has been reproduced with authorization. Translation: Mp.weixin.qq.com/s/epFvsCcYV…)
