The author of this article is Li Xiangbin, a graduate student at the State Key Laboratory of Complex Systems, Institute of Automation, Chinese Academy of Sciences, whose research focuses on robotics and artificial intelligence.

Google I/O is an annual developer conference held by Google that focuses on building applications with Google and open web technologies. The conference has been held every year since 2008 and is now in its ninth year.

At this year’s conference, Google announced the following eight products: Google Assistant, a voice assistant; Google Home, a wireless speaker and voice-command rival to Amazon Echo; the messaging app Allo; the video-calling app Duo; the VR platform Daydream; Android Wear 2.0, which supports standalone apps; Android Instant Apps, which let users run apps without installing them; and Google Play on Chrome OS, which brings Android apps to Chromebooks.

These eight products are concentrated mainly in software.

(Google I/O 2016 live image via:webpronews.com)

At the end of the Google I/O 2016 keynote, Google CEO Sundar Pichai mentioned one piece of the company’s work in AI and machine learning: the Tensor Processing Unit, or TPU. Pichai gave only a few performance figures at the conference and later described some usage scenarios on the Google blog; he did not elaborate on the processor’s architecture or inner workings. So we may need to start from some common processor architectures and try to guess at, and explore, what this dedicated machine learning chip really looks like.

 

(Tensor Processing Unit)

First, let’s look at what we are most familiar with: the Central Processing Unit, or CPU. It is a very-large-scale integrated circuit and a general-purpose chip, meaning it can do many different kinds of work. The processor in the computer we use every day is essentially a CPU, and watching movies, listening to music, or running code all pose no problem for it.

| Let’s look at the structure of the CPU

A CPU mainly consists of an Arithmetic and Logic Unit (ALU) and a Control Unit (CU). In addition, it contains a number of registers, cache memory, and the buses that carry data, control, and status signals between them. In short, a CPU is made up of arithmetic-logic components, register components, and control components.

 

(simplified image of CPU structure via:blog.csdn.net)

The arithmetic-logic components perform arithmetic operations, shifts, and address calculations and conversions. The registers mainly store the data and instructions produced during computation. The control unit is responsible for decoding instructions and issuing the control signals needed to carry out each operation an instruction requires.

We can use the following diagram to illustrate the general process of executing an instruction in the CPU:

(CPU execution instruction diagram via:blog.csdn.net)

The CPU fetches the instruction at the address held in the program counter, sends it over the instruction bus to the decoder, and passes the decoded instruction to the timing generator and operation controller. The arithmetic unit then performs the computation and writes the result back to the data cache register over the data bus.
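
To make this fetch-decode-execute cycle concrete, here is a minimal sketch of a toy processor loop in Python; the instruction format, opcodes, and register names are invented purely for illustration and do not correspond to any real instruction set.

```python
# A toy fetch-decode-execute loop, purely illustrative.
# "Instructions" are (opcode, dst, src1, src2) tuples; registers live in a dict.

def run(program, registers):
    pc = 0                                   # program counter
    while pc < len(program):
        instr = program[pc]                  # fetch
        op, dst, a, b = instr                # decode
        if op == "ADD":                      # execute
            registers[dst] = registers[a] + registers[b]
        elif op == "MUL":
            registers[dst] = registers[a] * registers[b]
        else:
            raise ValueError("unknown opcode: " + op)
        pc += 1                              # stored program, executed sequentially
    return registers

regs = run([("ADD", "r2", "r0", "r1"), ("MUL", "r3", "r2", "r2")],
           {"r0": 2, "r1": 3, "r2": 0, "r3": 0})
print(regs)  # {'r0': 2, 'r1': 3, 'r2': 5, 'r3': 25}
```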

From the CPU’s structure and execution process, we can see that the CPU follows the von Neumann architecture, whose core idea is: store the program, execute it sequentially.

From the description above, the CPU is like an orderly housekeeper: it does whatever we ask, one step at a time. But as Moore’s Law pushes forward and the demand for processing more data at higher speeds keeps growing, this step-by-step approach becomes less and less satisfactory. So people asked: could we put many processing units on a single chip and let them work together, so that efficiency rises dramatically? That is how the GPU came to be.

| The birth of the GPU

As its name suggests, the GPU (Graphics Processing Unit) is a microprocessor that performs graphics computation on personal computers, workstations, game consoles, and some mobile devices such as tablets and smartphones. When processing image data, every pixel in the image must be handled, which adds up to a considerable amount of computation. The demand for acceleration is therefore most intense in image processing, and the GPU emerged to meet it.

(schematic diagram of CPU and GPU structure comparison via:baike.baidu.com)

Comparing the structures of the CPU and GPU, we can see that the CPU has many functional modules and can adapt to complex computing environments: most of its transistors are used to build control circuitry (branch prediction, for example) and cache, and only a small fraction actually performs computation. The GPU’s structure is comparatively simple; its control logic is simple and its demand for cache is small, so most of its transistors can be devoted to specialized arithmetic circuits and many parallel pipelines. This gives the GPU a leap in computing speed and much stronger floating-point throughput. At the time of writing, the most advanced CPUs have only 4 or 6 cores and simulate 8 or 12 processing threads, while an ordinary GPU contains hundreds of processing units, and high-end ones even more, which is a natural advantage for the large amounts of repetitive processing in multimedia computation.

It is as if the CPU draws a picture with a single pen, while the GPU draws different parts of the picture with many pens at once; efficiency naturally improves by leaps and bounds.
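
As a loose software analogy (not GPU code), the sketch below contrasts a per-pixel Python loop with a NumPy vectorized version of the same operation; the vectorized form expresses exactly the “apply one operation to every pixel” pattern that a GPU spreads across its many processing units. The image here is random data made up for the example.

```python
import numpy as np

# A small fake grayscale "image"; the data is random and purely illustrative.
image = np.random.rand(256, 256)

# One pixel at a time, the way a single sequential core would work.
def brighten_loop(img, gain=1.2):
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = min(img[i, j] * gain, 1.0)
    return out

# The same operation applied to every pixel "at once": the data-parallel
# pattern that maps naturally onto a GPU's hundreds of processing units.
def brighten_vectorized(img, gain=1.2):
    return np.minimum(img * gain, 1.0)

assert np.allclose(brighten_loop(image), brighten_vectorized(image))
```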

(comparison of floating point computing performance between Intel CPU and nvidia GPU via:blog.sina.com.cn)

Although the GPU was born for image processing, the discussion above shows that its structure contains nothing dedicated specifically to images; it is essentially an optimization and adjustment of the CPU’s structure. As a result, the GPU now plays a prominent role not only in image processing but also in scientific computing, password cracking, numerical analysis, massive data processing (sorting, MapReduce, and so on), financial analysis, and other fields that need large-scale parallel computation. In that sense, the GPU can also be regarded as a fairly general-purpose chip.

| The FPGA arrives on the scene

As computing needs become more and more specialized, people want chips that fit their particular workloads more closely. But hardware, once fabricated, cannot be changed, so people began to wonder: could we build a chip whose hardware itself is programmable? In other words:

one moment we need hardware better suited to image processing, and the next moment hardware better suited to scientific computing, yet we do not want to solder up two separate boards. This is where the FPGA comes in.

FPGA is short for Field Programmable Gate Array. As a semi-custom circuit in the ASIC domain, the FPGA both remedies the inflexibility of fully custom circuits and overcomes the limited gate counts of earlier programmable logic devices.

An FPGA design is described in a hardware description language (Verilog or VHDL); logic synthesis and place-and-route tools then turn that description into a configuration that can quickly be loaded onto the FPGA for testing. The logic blocks inside the FPGA can be wired together through editable interconnects as needed, as if a breadboard had been placed inside the chip. Because the logic blocks and interconnects of a finished FPGA can be reconfigured to the designer’s needs, the FPGA can implement whatever logic function is required.
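
As a rough illustration of what “describing a logic circuit” means, here is a half-adder written as Boolean equations, the kind of combinational logic an HDL would express; Python is used only as a stand-in for Verilog or VHDL, and the exhaustive check plays the role of a tiny testbench.

```python
# A half-adder as two Boolean equations; Python stands in for an HDL here.
def half_adder(a: int, b: int):
    s = a ^ b        # sum bit: XOR of the inputs
    carry = a & b    # carry bit: AND of the inputs
    return s, carry

# Exhaustively check the truth table, as a testbench would.
for a in (0, 1):
    for b in (0, 1):
        s, carry = half_adder(a, b)
        print(f"a={a} b={b} -> sum={s} carry={carry}")
```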

 

(FPGA structure diagram via: DPS-AZ.CZ/VyVOj)

The FPGA’s programmability made it very popular as soon as it launched, and it replaced many ASICs (Application-Specific Integrated Circuits). It is worth explaining what an ASIC is: an integrated circuit designed and manufactured to the requirements of a specific user and a specific electronic system, customized to the product at hand. The reason for spelling this out is that the TPU introduced below is also a kind of ASIC.

FPGAs and ASICs each have their drawbacks. An FPGA is generally slower than an ASIC, cannot accommodate the most complex designs, and consumes more power. An ASIC, meanwhile, is expensive to produce, so if shipment volumes are small it is not cost-effective. But once demand for a workload is strong enough that an ASIC can ship in volume, the ASIC becomes the natural choice, and I think that is precisely the starting point from which Google created the Tensor Processing Unit. With that, the TPU stepped onto the stage.

As machine learning algorithms are applied in more and more fields and show superior performance, in products such as Street View, smart email reply, and voice search, hardware support for them becomes increasingly necessary. At present, most machine learning and image processing algorithms run on GPUs and FPGAs. But as discussed above, these are still relatively general-purpose chips, so in efficiency and power consumption they cannot be matched as closely as possible to machine learning algorithms. Google has always believed that great software truly shines with the help of great hardware, so it asked whether it could build a chip dedicated to machine learning algorithms, and the TPU was born.

(TPU interface card via:cloudplatform.googleblog.com)

| Google wants a chip dedicated to machine learning algorithms: the TPU

As the name suggests, the TPU is inspired by TensorFlow, Google’s open-source deep learning framework; for now, the TPU is a chip used only inside Google.
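
For readers unfamiliar with TensorFlow, a tensor is simply a multi-dimensional array, and a TensorFlow program is a graph of operations on tensors; accelerating exactly those operations is what the TPU is built for. Below is a minimal sketch in the TensorFlow 1.x-style API of the time; the values are arbitrary and the example has nothing TPU-specific about it.

```python
import numpy as np
import tensorflow as tf

# Two small tensors (multi-dimensional arrays); the values are arbitrary.
a = tf.constant(np.arange(6, dtype=np.float32).reshape(2, 3))
b = tf.constant(np.ones((3, 2), dtype=np.float32))

# A tensor operation: matrix multiplication, the kind of op that
# dominates machine learning workloads.
c = tf.matmul(a, b)

with tf.Session() as sess:   # TF 1.x-style graph execution
    print(sess.run(c))
```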

Google has in fact been running TPUs in its internal data centers for more than a year, and the figures it reports are striking: the TPU advances hardware performance by roughly seven years, about three generations of Moore’s Law. On the performance side, the two biggest limits on processor speed are heat and logic-gate delay, with heat the more important of the two. Most of today’s processors are built on CMOS technology, which dissipates energy on every clock cycle, so the higher the clock speed, the hotter the chip runs. Below is a plot of CPU clock frequency against power consumption, in which the increase is roughly exponential.

 

(CPU clock frequency versus power via:electronics.stackexchange.com)
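
The standard first-order relation behind curves like this (a textbook approximation, not something stated in Google’s materials) is the dynamic switching power of a CMOS circuit:

```latex
% Dynamic switching power of a CMOS circuit (first-order approximation)
%   \alpha : activity factor (fraction of gates switching per cycle)
%   C      : switched capacitance
%   V      : supply voltage
%   f      : clock frequency
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} f
```

Since raising the clock frequency usually also requires raising the supply voltage, power grows much faster than linearly with frequency, which is why heat becomes the dominant limit.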

From the TPU’s appearance, we can see a large metal plate protruding from its middle; it is there to dissipate the large amount of heat the chip produces, so that the TPU can keep running at high speed.

The TPU’s high performance also comes from its tolerance for reduced computational precision, which means fewer transistors are needed per operation. With the same transistor budget, more operations can be run per unit of time, which makes it possible to use more complex and powerful machine learning models and to get smarter results faster. The connectors visible on the TPU board also tell us how it is deployed: Google currently plugs the board into the hard-drive slots of its data center racks.
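
As a rough illustration of what “reduced precision” means in practice (a generic sketch, not Google’s actual quantization scheme), the code below maps 32-bit floating-point values onto 8-bit integers and back, trading a small amount of accuracy for a quarter of the bits per value:

```python
import numpy as np

# Some made-up float32 "weights" in [-1, 1].
weights = np.random.uniform(-1.0, 1.0, size=8).astype(np.float32)

# Simple linear quantization to signed 8-bit integers.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)   # 8 bits per value instead of 32

# Dequantize to see how much accuracy was traded away.
restored = q.astype(np.float32) * scale
print("max absolute error:", np.abs(weights - restored).max())
```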

I also suspect the TPU’s high performance comes from keeping data local. On a GPU, fetching instructions and data from memory takes a great deal of time, yet for much of a machine learning workload there is no need to reach out to a global cache. A design that keeps data closer to the compute units therefore also speeds up the TPU.
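
As a small, general illustration of why locality matters (it says nothing about TPU internals), the snippet below times summing the same matrix row by row versus column by column; the row-wise traversal walks memory contiguously and is typically much faster because it stays in cache.

```python
import time
import numpy as np

x = np.random.rand(4000, 4000)   # stored row-major (C order)

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Contiguous access: each row is adjacent in memory, so the cache is used well.
row_time = timed(lambda: sum(float(x[i, :].sum()) for i in range(x.shape[0])))

# Strided access: each column hops across memory, causing far more cache misses.
col_time = timed(lambda: sum(float(x[:, j].sum()) for j in range(x.shape[1])))

print(f"row-wise: {row_time:.3f}s  column-wise: {col_time:.3f}s")
```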

 

(The TPU-equipped server rack used in the AlphaGo vs. Lee Sedol match; I’m not sure why, but the Go illustration attached to its side is kind of cute. Via: googleblog.com)

Over the past year in Google’s data centers, the TPU has in fact done quite a lot of work: it powers RankBrain, the machine learning system that helps Google process search results and return more relevant ones; it is used in Street View to improve the accuracy of maps and navigation; and, of course, it runs AlphaGo, the computer program that plays Go. There is an interesting detail here. The Nature paper describing AlphaGo says it ran only on CPUs and GPUs: the single-machine version on 48 CPUs and 8 GPUs, and the distributed version, which takes advantage of more machines, on 1,202 CPUs and 176 GPUs with 40 search threads. That was the configuration used in the match against Fan Hui, which is why Lee Sedol was so confident about the man-machine match after watching AlphaGo’s games against Fan Hui. But in just a few months Google switched AlphaGo’s hardware platform to the TPU, and things got much tougher.

So, beyond running machine learning algorithms better and faster, what else might Google want from unveiling the TPU? At the risk of bold speculation, I think Google may be playing a much bigger game.

Google says its goal is to lead the industry in machine learning and put that power of innovation into everyone’s hands, and to make TensorFlow and Cloud Machine Learning easier for users to adopt. Specialized hardware like the TPU is just one small step along that road, much as Microsoft built the Holographic Processing Unit (HPU) for its HoloLens augmented reality headset. Nor is it only about getting ahead of public-cloud market leader Amazon Web Services (AWS). Over time Google will release more machine learning APIs; it has already launched its Cloud Machine Learning platform services and a vision API. It seems reasonable to believe that Google’s larger goal is to lead machine learning in both technology and market.

Note: The title image and the first image are from wingatewire.com. This is an original article from Leifeng.com (WeChat public account: Leifeng.com); please contact us for authorization before reprinting and keep the attribution intact, without deleting or modifying the article.


