FPGA Acceleration: Exploration and Practice for Data Center and Cloud Services

This post was published by Columneditor on the Tencent Cloud+ Community.

Zhang Heng is an FPGA expert at Tencent Cloud. He is currently responsible for FPGA cloud development in Tencent's Architecture Platform Department and explores FPGA acceleration in the data center, including image processing, deep learning, and SDN.

To accelerate innovation in cloud computing, establish a trusted cloud computing system, standardize the cloud computing industry, promote market development, and raise the level of industry technology and services, the China Academy of Information and Communications Technology and the China Communications Standards Association held the Trusted Cloud Conference 2018 on August 14-15, 2018 at the Beijing International Convention Center.

Cloud computing has been developing for more than ten years and has gradually grown into a large industry; moving to the cloud is no longer difficult for enterprises. However, a steady stream of data leakage incidents has sounded an alarm for the booming cloud computing industry. Enterprises have begun to recognize the risks of cloud computing and to understand that any kind of cloud deployment may be attacked. Cloud computing has significant advantages, but it also carries potential risks. The Trusted Cloud Conference 2018 invited many prominent industry figures to explore the path of innovation for trusted cloud and cloud computing.

The following is the full text of the speech “FPGA Acceleration: Exploration and Practice for Data Center and Cloud Services”, given by Zhang Heng, FPGA expert at Tencent Cloud, at the Trusted Cloud Conference 2018:

Zhang Heng: Good afternoon, distinguished guests. I’m Zhang Heng from Tencent. Today’s topic is “FPGA Acceleration: Exploration and Practice for Data Center and Cloud Services”. The previous speakers described the application and development of high-performance computing from the perspective of academia and standardization. Today I will mainly look at real scenarios, and at how we apply FPGA acceleration, from the perspective of industry.

Today’s talk has three parts: why FPGA acceleration has emerged, FPGA acceleration in the data center, and FPGA cloud services.

Previous speakers have described the rapid growth of data in data centers. We also see data in the cloud growing at around 30% per year, while the rapid development of AI has brought further demand for high-performance computing. On the one hand, data is growing exponentially; on the other, the computing resources needed to process that data are growing with it. Traditionally, computing has been done on CPUs, but in the post-Moore era CPU development has stagnated: the performance doubling every two years that we used to see is now almost impossible. Solving the computing performance problem requires higher-performance chips, which is how GPUs, FPGAs, and ASICs have come into view.

The requirements on high-performance computing chips can be summarized as two: first, high throughput, to keep up with the growth of data; second, low latency, to respond to connected devices in real time and improve user experience. The development of 5G and the Internet of Things in particular brings both data growth and low-latency requirements.

What characteristics do the computing chips mentioned above, from CPU to FPGA to ASIC, have, and what essentially allows them to achieve high performance? Across these chips, the more flexible the programming, the lower the transistor efficiency. A CPU is software-programmable and highly versatile, and can run all kinds of software algorithms. An ASIC is a dedicated hardware circuit that is not programmable; it can only accelerate the algorithm it was built for, which is exactly why it achieves high performance: every transistor on it serves that algorithm.

A GPU is also software-programmable. Compared with a CPU, a GPU has many more computing units, so for algorithms that can be parallelized and require a large amount of computation, a GPU is more efficient than a CPU. An FPGA is a programmable hardware circuit. Like an ASIC, it is a hardware circuit built for an algorithm, but because it is programmable, a hardware circuit can be built for each algorithm to accelerate it. Overall, computing capability comes from Moore’s Law on the one hand and from the hardware architecture of computing chips on the other. Although Moore’s Law is coming to an end, we can still improve overall computing performance through innovation in chip hardware architecture.

Looking at the industry as a whole: among ASICs, Google built the TPU to accelerate AI algorithms, and it is now in its third generation. Intel acquired Nervana and Mobileye, whose chips also target AI acceleration. In China, Cambricon and Horizon Robotics have launched their own AI chips.

As for FPGAs, Microsoft equips each server with an FPGA to accelerate data center applications, and in China, Baidu, Alibaba, and Tencent all use FPGAs for application acceleration. So we see an industry-wide trend toward heterogeneous computing, in which software and hardware are combined and computing moves from general-purpose to dedicated.

Having covered the characteristics of the various computing chips, let’s look at the advantages of FPGA acceleration in the data center. The far right of the slide shows the underlying resources of an FPGA chip: on-chip RAM for caching, computing resources, and logic resources. With these resources, a hardware circuit can be built for each algorithm to accelerate it.

Its advantages are as follows:

1. High performance and low latency: the hardware architecture can be customized for each application algorithm.

2. Flexible and extensible: FPGAs are programmable and have rich IO pins, so in the data center they can follow the evolution not only of computing algorithms but also of storage and network algorithms.

3. Low power consumption, low cost, and high reliability, which make deployment, operation, and maintenance in the data center easier.

4. Combination of software and hardware: not every function of an algorithm needs to be accelerated on the FPGA. Functions suited to the CPU can stay on the CPU, and functions suited to the FPGA can be accelerated there. Combining CPU and FPGA lets each play to its strengths and yields the best overall system.

While each kind of computing chip has its own advantages, we also see the chips converging: an FPGA can absorb the advantages of an ASIC, and a CPU can absorb the advantages of an FPGA. FPGAs now integrate ASIC computing cores to improve overall computing performance. GPUs have likewise added ASIC-style Tensor Cores for matrix computation, as in the V100, to deliver higher AI computing performance. Will CPU, GPU, FPGA, and ASIC all be fused into one chip in the future? Let’s wait and see.

Next, let me introduce how FPGA acceleration is applied in real scenarios inside Tencent. We started using FPGAs for acceleration in 2014. At first we mainly used them to accelerate image transcoding for QQ Album and WeChat Moments, and later we also used them to accelerate AI algorithms. The image transcoding and AI acceleration work won Tencent’s Outstanding Research and Development Award. In 2017 we were the first vendor in China to release an FPGA cloud server. We also exhibited at FPGA 2018, the top FPGA conference.

Let’s look at how FPGAs accelerate image transcoding for QQ. As you know, Tencent runs social platforms such as QQ and WeChat, where a huge number of images are sent, received, and shared every day. If the JPEG images that users upload were delivered unchanged whenever other users download and browse them, it would put great pressure on Tencent’s CDN bandwidth, and it would be unnecessary. First, users browse on different terminals: some on PCs with large screens, some on mobile phones with relatively small screens. There is no need to send a large image to a small-screen terminal; sending an image that fits the screen reduces CDN bandwidth. Second, besides JPEG there are other image formats, such as WebP and HEVC, that produce smaller files than JPEG. So when users upload JPEG images, we can convert them to HEVC or WebP for other users to download. Processing images at this scale involves conversion between formats as well as resizing to multiple sizes, cropping, sharpening, rotation, and other operations.

Before FPGA acceleration, images uploaded from PC or mobile terminals passed through the access layer of Tencent’s backend and were then transcoded. The results, in multiple formats and sizes, were stored in the distributed storage system. When a user browsed on a terminal, the image in the matching format and size was fetched from distributed storage and shown to the user. Why transcode at upload time rather than when a user actually requests the image? Mainly because users expect an image to appear as soon as they tap it, which requires very low latency. Low latency is a user experience requirement.

With FPGA transcoding, we can take full advantage of the FPGA’s low latency. The pipeline no longer needs to store images in many formats and sizes: a single copy in a single format is kept in the distributed storage system, and when a user browses, the image is transcoded in real time for that user’s terminal and sent to the user. This relieves the pressure on distributed storage.
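The on-demand flow can be illustrated with a minimal Python sketch of the decision made per view request: which format and width to transcode the single stored copy into. This is only an illustration under stated assumptions, not Tencent's implementation. The `Terminal` type, the `choose_variant` function, and the format preference order are all hypothetical, and the actual transcoding (done on the FPGA in the talk) is left out.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical preference order: the formats the talk describes as smaller
# than JPEG come first; JPEG is the fallback.
FORMAT_PREFERENCE = ("hevc", "webp", "jpeg")


@dataclass(frozen=True)
class Terminal:
    screen_width: int                   # display width in pixels
    supported_formats: Tuple[str, ...]  # formats this client can decode


def choose_variant(master_width: int, terminal: Terminal) -> Tuple[str, int]:
    """Pick the output format and width for one view request."""
    # Use the first preferred format the client supports, else fall back to JPEG.
    fmt = next(
        (f for f in FORMAT_PREFERENCE if f in terminal.supported_formats),
        "jpeg",
    )
    # Shrink to the screen width, but never upscale the stored copy.
    width = min(master_width, terminal.screen_width)
    return fmt, width


phone = Terminal(screen_width=1080, supported_formats=("webp", "jpeg"))
desktop = Terminal(screen_width=2560, supported_formats=("jpeg",))
print(choose_variant(4000, phone))
print(choose_variant(4000, desktop))
```

Only the single stored copy and the request context are needed at view time, which is why low transcoding latency is the precondition for dropping the pre-generated variants.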

How is this done in the FPGA? We implemented multiple image codec cores in the FPGA so that images are processed with both pipeline parallelism and data parallelism, which improves transcoding performance and achieves high throughput and low latency. Compared with the CPU, we reduced latency by 3 times and increased throughput by 6 times.

In recent years, the term the industry has heard most is artificial intelligence. Tencent’s strategy here is “AI in All”: combining AI technologies with all kinds of application scenarios, such as medical imaging, information security, and speech translation. How do we use FPGAs for acceleration in the information security scenario? A huge number of UGC images are uploaded to QQ and WeChat all the time. A small fraction of them are prohibited images; the vast majority are normal. To catch the few prohibited images, we use AI together with the high performance and low latency of FPGAs. Our processing logic is as follows. Hundreds of millions of images are uploaded to the processing system every day. A fast AI model first filters out the normal images. The few suspicious images that remain go to a second-stage, more accurate AI model, which decides whether they are malicious. The two AI models plus FPGA acceleration together deliver high-performance processing.
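The two-stage logic described above is a cascade: a cheap model clears most inputs, and only the remainder pays for the expensive model. The sketch below shows that control flow in Python. The function name, the threshold values, and the scoring callables are hypothetical placeholders; the real models and their FPGA execution are not shown in the talk.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def cascade_filter(
    images: Iterable[T],
    fast_score: Callable[[T], float],      # cheap model: rough suspicion score
    accurate_score: Callable[[T], float],  # expensive model: final score
    fast_threshold: float = 0.1,           # assumed value, tune for recall
    final_threshold: float = 0.5,          # assumed value
) -> Tuple[List[T], Dict[str, int]]:
    """Return the flagged images and counters for each stage."""
    flagged: List[T] = []
    stats = {"total": 0, "second_stage": 0}
    for image in images:
        stats["total"] += 1
        # Stage 1: the fast model clears the clearly normal images.
        if fast_score(image) < fast_threshold:
            continue
        # Stage 2: only suspicious images reach the accurate model.
        stats["second_stage"] += 1
        if accurate_score(image) >= final_threshold:
            flagged.append(image)
    return flagged, stats


# Toy data: (name, fast score, accurate score) stand in for real model outputs.
samples = [("a", 0.01, 0.0), ("b", 0.60, 0.9), ("c", 0.30, 0.2)]
flagged, stats = cascade_filter(samples, lambda s: s[1], lambda s: s[2])
print([name for name, _, _ in flagged], stats)
```

The first threshold is usually set low so that the fast stage rarely clears a prohibited image; the accurate stage then removes the false alarms.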

How is an AI algorithm accelerated in the FPGA? The basic operators of AI algorithms, including convolution, pooling, normalization, and activation functions, are implemented in the FPGA, and with these operators the algorithm can be processed in real time. As data streams in, all computing units in the FPGA work on the data of the same layer at the same time, which keeps latency low. Taking the GoogLeNet model as an example and comparing CPU, GPU, and FPGA, the FPGA reaches its maximum throughput right from the start. The GPU has to assemble larger batches to reach high throughput, but the larger the batch size, the higher the latency. As a result, at the same throughput as the GPU, FPGA latency is 10 times lower, and overall TCO can be reduced by 50%.
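The batch-size trade-off can be made concrete with a toy model in Python. It assumes, purely for illustration, that a batched device spends a fixed overhead plus a per-item cost on each batch, and that the first request in a batch waits for the rest of the batch to arrive. The function names and all numbers are invented for the example and are not measurements from the talk.

```python
def batch_latency_ms(batch_size: int, arrival_rate_per_s: float,
                     fixed_ms: float, per_item_ms: float) -> float:
    """Worst-case latency of the first request in a batch (toy model)."""
    # Time spent waiting for the remaining batch_size - 1 requests to arrive.
    wait_ms = (batch_size - 1) * 1000.0 / arrival_rate_per_s
    # Time to process the whole batch once it is full.
    compute_ms = fixed_ms + per_item_ms * batch_size
    return wait_ms + compute_ms


def throughput_per_s(batch_size: int, fixed_ms: float, per_item_ms: float) -> float:
    """Items processed per second when batches run back to back (toy model)."""
    compute_ms = fixed_ms + per_item_ms * batch_size
    return batch_size * 1000.0 / compute_ms


# Invented parameters: 1000 requests/s, 4 ms fixed overhead, 1 ms per item.
for batch in (1, 4, 16):
    print(batch,
          round(throughput_per_s(batch, 4.0, 1.0), 1),
          round(batch_latency_ms(batch, 1000.0, 4.0, 1.0), 1))
```

In this model the fixed overhead is amortized over larger batches, so throughput rises with batch size while latency rises too. A device that reaches full throughput at batch size 1 avoids that trade-off, which is the point the comparison above makes.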

**What are the specific advantages of FPGA in accelerating AI?** To sum up, there are three:

1. Flexible and extensible. Because FPGAs are programmable, they can keep up with the rapid evolution of AI algorithms, supporting DNN, CNN, LSTM, and decision trees. They support arbitrary precision, so data can be represented with any bit width. They can also support model compression, sparse networks, and similar model techniques.

2. High performance and low latency. FPGAs can provide real-time AI processing, which matters especially for future “device-cloud” scenarios, where latency requirements will be even stricter. As mentioned above, an FPGA can match the throughput of a GPU while delivering lower inference latency.

3. Continuous improvement of the development environment. Developing for FPGAs in Verilog is difficult for users, so how do we lower the barrier to using FPGAs in AI scenarios? On the one hand, we keep optimizing the basic AI operators in the FPGA to provide a more complete operator library. On the other hand, we provide users with a compiler. The user only needs to run the AI model through the compiler, which converts it into instructions the FPGA recognizes, and these instructions drive the FPGA acceleration. This makes FPGAs easier to use.
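To give a feel for the compiler idea in the third point, here is a deliberately small Python sketch that lowers a list of layer descriptions into a flat instruction list, rejecting any operator that is not in the operator library. The layer format, the opcode names, and `compile_model` are hypothetical; the talk does not describe the real instruction set or compiler interface.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical operator library, mirroring the basic operators named in the
# talk: convolution, pooling, normalization, and an activation function.
SUPPORTED_OPS = {"conv", "pool", "norm", "relu"}

Instruction = Tuple[str, Dict[str, Any]]


def compile_model(layers: List[Dict[str, Any]]) -> List[Instruction]:
    """Lower layer descriptions to (OPCODE, parameters) instructions."""
    program: List[Instruction] = []
    for index, layer in enumerate(layers):
        op = layer["op"]
        if op not in SUPPORTED_OPS:
            raise ValueError(
                f"layer {index}: operator {op!r} is not in the operator library"
            )
        # Everything except the operator name becomes instruction parameters.
        params = {key: value for key, value in layer.items() if key != "op"}
        program.append((op.upper(), params))
    return program


model = [
    {"op": "conv", "kernel": 3, "channels": 64},
    {"op": "norm"},
    {"op": "relu"},
    {"op": "pool", "size": 2},
]
for instruction in compile_model(model):
    print(instruction)
```

The user works only at the level of the model description; supporting a new algorithm means extending the operator library and the compiler, not writing Verilog.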

The third part covers the development of FPGA cloud services.

First, the FPGA industry chain has multiple links: chip manufacturers, hardware manufacturers, IP developers, and solution integrators. Hardware manufacturers mainly produce boards, IP developers provide IP, and solution integrators package hardware boards and IP into complete solutions. An FPGA cloud service needs to integrate all of these resources into a final service for users. By comparison, in the GPU industry chain the GPU vendor provides everything from chip manufacturing to boards to the final programming framework. The FPGA cloud service aims to turn the previously fragmented usage model into a platform, and so lower the barrier to using FPGAs.

Now for the value of an FPGA cloud service, starting with the pain points it addresses. Among traditional suppliers, chip manufacturers such as Xilinx and Altera used to supply large customers directly and reach small and medium customers through distributors. They had no direct contact with small and medium customers, and FPGA sales grew slowly. IP developers sell IP without other services, but IP is only a component, not a final industry solution. And because IP developers worry about leaks of their intellectual property, customers usually have to sign an NDA and pay, so the delivery cycle is long and the process complicated, which in turn limits IP sales. For solution integrators, the traditional approach is for engineers to visit customers with hardware, give on-site demonstrations, and then leave the hardware with the customer for verification and testing. The whole sales cycle is long, the process is tedious, and maintaining the hardware is troublesome. These are the pain points of traditional suppliers.

Users have pain points too. Those who want to develop on FPGAs must first get FPGA boards produced, and only then can they do FPGA software development on them. The combined hardware manufacturing and software development cycle is very long, so both the cost of deciding to use FPGAs and the cost of trial and error are very high. Users who buy solutions instead of developing their own become dependent on solution integrators, and the solutions are expensive and slow to upgrade. In short, FPGA has no mature development ecosystem and the development barrier is high, which in turn holds back the growth of the ecosystem.

All of these problems mean that we need an FPGA cloud platform to connect every link of the FPGA chain, including hardware manufacturers, solution integrators, IP developers, and chip manufacturers, and to serve users by building industry solutions on the cloud platform. This lowers the barrier for users to adopt FPGAs and makes them much easier to use.

What value does such an FPGA cloud platform bring to traditional suppliers and users? For chip manufacturers, it solves the problem of supporting small and medium customers, lets them focus on developing the FPGA ecosystem, and brings new user growth through the new model. IP developers can offer online verification and testing with short lead times, reach more users through the cloud, and increase sales. Solution integrators no longer need to sell hardware; they only need to offer their solution on FPGA cloud servers for users to buy. On-demand purchasing shortens the sales cycle, and the cloud platform vendor takes care of the hardware.

For users who want to develop on FPGAs, the platform shortens the development cycle. A cloud platform is also a very open, technically competitive place: if your solution is not the best, or someone else’s is better, users will not choose yours. So solutions on cloud platforms generally use the latest technology, which improves users’ overall production efficiency. For users who buy solutions, purchasing directly on the cloud platform and combining the solution with their own production environment in the cloud shortens the verification cycle and lowers the cost of trial and error and the cost of decision-making. The elastic scaling of the cloud itself adds further value.

Tencent’s FPGA cloud service, released in January 2017, was the first FPGA cloud server in China. Since the release, we have focused on in-house development and on bringing in more third-party solution providers to offer more industry solutions, including image processing, explicit-image detection, and gene sequencing, so that users can use industry solutions directly. On the hardware side, the boards use the KU115 and VU9P, with Intel Stratix 10 coming soon. For FPGA developers who want to do their own development on FPGA cloud servers, we provide an FPGA development platform: an HDK that integrates the PCIe data path and the DDR controller, plus a driver SDK on the CPU software side. Users only need to focus on their own logic development and software application development, which shortens the whole development cycle.

So far we have discussed the FPGA cloud service itself. While working with many industries through the service, we have also built application-specific acceleration for scenarios in those industries that need high-performance computing. Take gene sequencing: as the cost of sequencing gradually falls, data analysis accounts for a growing share of the overall cost. With the explosive growth of genetic data in recent years, data analysis has hit a computational bottleneck. We therefore use FPGAs to accelerate some of the time-consuming algorithms in gene sequencing, which increases computing speed and reduces cost. The figure on the right shows a second-generation sequencing scenario in which the BWA and GATK algorithms are applied in the standard WGS pipeline. Processing a whole human genome takes 30 hours on CPU and 2.8 hours on CPU+FPGA, a speedup of about 10 times. This industry solution has been productized in Tencent’s gene products and is available to users.
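The speedup figure follows directly from the two run times quoted above. A short Python check, with hypothetical helper names, makes the arithmetic explicit:

```python
def speedup(baseline_hours: float, accelerated_hours: float) -> float:
    """How many times faster the accelerated run is."""
    return baseline_hours / accelerated_hours


def time_saved_fraction(baseline_hours: float, accelerated_hours: float) -> float:
    """Fraction of the baseline run time that is eliminated."""
    return 1.0 - accelerated_hours / baseline_hours


# Figures from the talk: 30 hours on CPU, 2.8 hours on CPU+FPGA.
print(round(speedup(30, 2.8), 1))                    # 10.7
print(round(time_saved_fraction(30, 2.8) * 100, 1))  # 90.7
```

So the "10 times" in the talk is a rounded figure: the ratio is roughly 10.7, and about 91% of the CPU-only run time is saved.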

Finally, I would like to share some thoughts on FPGA cloud services and the FPGA industry. FPGA cloud service is a new thing. Although I have described the advantages of FPGA acceleration in various application scenarios, there are also many difficulties.

First, all cloud platform vendors now offer FPGA cloud platforms, but there is no unified standard for the platform itself, so each vendor develops its FPGA platform according to its own ideas. The result is severe fragmentation. Vendors who want to offer FPGA industry solutions have to adapt to each cloud platform, and the cost of migrating a solution between FPGA cloud platforms is very high. Industry standards for FPGA cloud may well emerge, and I very much look forward to them.

Second, the development barrier is high and there are few industry solutions. FPGA development uses hardware description languages, which are very low-level and offer little abstraction.

Third, the FPGA cloud ecosystem is immature. There is not yet a virtuous cycle from developers to industry solutions to customers and back to more developers. As a result, industry solutions are still developed by each cloud platform itself, and the strength of the wider industry has not been brought into play.

Tencent’s current plan for its FPGA cloud service is as follows. 1. Platform construction: upgrade the hardware platform, launch an IP market, and launch more industry solutions, including AI. 2. Ecosystem development: we hope to connect developers and users, establish an evaluation system, and drive the iteration of FPGA solutions.

Finally, FPGAs have a great deal to offer on the device side. Because FPGAs have rich IO pins and low latency, the combination of the device side and the cloud side has great room for development.

This article is published by the Tencent Cloud+ Community with the author’s authorization. Cloud.tencent.com/developer/a…
