Introduction to the principle and implementation of Android hardware acceleration

In mobile client development, especially for Android applications, the term “hardware acceleration” comes up often. Because the operating system encapsulates the underlying software and hardware so thoroughly, application developers know little about how hardware acceleration works underneath or why that matters. This leads to misunderstandings, for example that hardware acceleration renders pages with some special algorithm, or that it speeds up rendering by raising the computing rate of the CPU/GPU.

This article introduces hardware acceleration from the underlying hardware principles up to the upper-layer code implementation. The implementation analysis is based on Android 6.0.

What hardware acceleration means for app development

For app developers, a basic understanding of how hardware acceleration works and how the upper-layer APIs are implemented makes it possible to take full advantage of it and improve page performance. On Android, for example, there are usually two ways to implement a rounded-rectangle button: use a PNG image, or implement it in code (XML/Java). A simple comparison of the two approaches follows.

Approach	Principle	Characteristics
Using a PNG image (BitmapDrawable)	The PNG is decoded into a Bitmap, which is passed to the lower layers and rendered by the GPU	Image decoding consumes CPU resources; the Bitmap occupies a lot of memory and draws slowly
Using XML or Java code (ShapeDrawable)	The shape information is passed directly to the lower layers and rendered by the GPU	Consumes less CPU, occupies less memory, and draws quickly

Page rendering background knowledge

When a page is rendered, the elements drawn are eventually converted into a pixel matrix (that is, a multi-dimensional array, similar to a Bitmap on Android) before they can be shown on the display.
Pages are made up of a variety of basic elements, such as circles, rounded rectangles, line segments, text, vector graphics (often built from Bezier curves), bitmaps, and so on.
Drawing these elements, especially during animation, often involves interpolation, scaling, rotation, transparency changes, animated transitions, frosted-glass effects, and even 3D transformations, physical motion (such as the parabolas common in games), and multimedia decoding (mainly on desktop machines; mobile devices generally do not use the GPU for decoding).
The drawing process usually involves relatively simple logic but a large amount of floating-point computation.

Comparison between CPU and GPU structures

The Central Processing Unit (CPU) is the core component of a computing device and executes program code; software developers are familiar with it. The Graphics Processing Unit (GPU) mainly handles graphics computation, and it is the core component of what is usually called a “graphics card”.

The following compares the CPU and GPU structures. In the structure diagram:

The yellow Control block is the controller, which coordinates the operation of the whole CPU, including fetching instructions and controlling the other modules.
The green ALU (Arithmetic Logic Unit) performs mathematical and logical operations.
The orange Cache and DRAM blocks are the cache and RAM respectively, used to store information.

As the structure diagram shows, the CPU's controller is relatively complex and the CPU has few ALUs. The CPU is therefore good at all kinds of complex logic but not at mathematical computation, especially floating-point arithmetic.
- Take the 8086 as an example: most of its more than 100 assembly instructions are logical instructions, and its mathematical instructions are mainly 16-bit addition, subtraction, multiplication, division, and shifts. An integer or logical operation typically takes 1 to 3 machine cycles, while a floating-point operation, which has to be converted into integer calculations, can take hundreds of machine cycles.
- Simpler CPUs even have only an addition instruction: subtraction is done by adding the two's complement, multiplication by repeated addition, and division by a subtraction loop.
- Modern CPUs typically include a hardware floating-point unit (FPU), but it is mainly suited to small volumes of data.
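
The point about simpler CPUs can be made concrete with a small sketch. The class below is purely illustrative (AdderOnlyAlu and its method names are made up for this example and do not model any real instruction set): it builds subtraction, multiplication, and division from addition, bitwise NOT, and comparison only. Multiplication and division assume non-negative operands.

```java
final class AdderOnlyAlu {
    // The only arithmetic primitive this toy ALU is allowed to use.
    static int add(int a, int b) {
        return a + b;
    }

    // a - b == a + (~b + 1): add the two's complement of b.
    static int subtract(int a, int b) {
        return add(a, add(~b, 1));
    }

    // Multiplication as repeated addition (assumes times >= 0).
    static int multiply(int value, int times) {
        int product = 0;
        for (int i = 0; i < times; i = add(i, 1)) {
            product = add(product, value);
        }
        return product;
    }

    // Division as a subtraction loop (assumes dividend >= 0 and divisor > 0).
    static int divide(int dividend, int divisor) {
        int quotient = 0;
        while (dividend >= divisor) {
            dividend = subtract(dividend, divisor);
            quotient = add(quotient, 1);
        }
        return quotient;
    }

    public static void main(String[] args) {
        System.out.println(subtract(9, 4));  // 5
        System.out.println(multiply(6, 7));  // 42
        System.out.println(divide(17, 5));   // 3
    }
}
```

Note how the cost of multiply and divide grows with the size of the operands, which is why such operations are slow on a core that only has an adder.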
The CPU is a serial structure. To sum 100 numbers, for example, a single CPU core can add only two numbers at a time and accumulates the result step by step.
Unlike CPUs, GPUs are designed to do a large amount of mathematical computation. As the structure diagram shows, the GPU's controller is relatively simple, but the GPU contains a large number of ALUs. These ALUs are laid out in parallel and include many floating-point units.
The main principle of hardware acceleration is that the underlying software converts the graphics computation the CPU is not good at into GPU-specific instructions, which the GPU then executes.

Extension: The GPU in many computers has its own dedicated video memory. If there is no dedicated video memory, shared memory is used instead: a region of main memory is set aside as video memory. Video memory can hold GPU instructions and other information.

Example of parallel structure: cascade adder

To make this easier to understand, here is an example at the level of the underlying circuit. The figure below shows an adder as it corresponds to an actual digital circuit.

A and B are the inputs and C is the output; A, B, and C are all buses. On a 32-bit CPU, for example, each bus consists of 32 wires, and each wire represents a binary 0 or 1 by its voltage level.
Clock is the clock signal line; a specific voltage signal is applied to it once per fixed clock cycle. When a clock signal arrives, the sum of A and B is output to C.

Now we have to calculate the sum of eight integers.

For a serial structure like the CPU, the code is simple: a for loop adds up the numbers one by one. The serial structure has only one adder and needs seven addition operations. Each partial sum has to be fed back to the adder's input for the next addition, so the whole process takes at least a dozen machine cycles.

For a parallel structure, a common design is the cascade adder shown below, in which all the clock lines are tied together. Once the eight values to be added are ready at inputs A1 to B4, the sum is complete after three clock cycles. The larger the data volume and the more cascade levels there are, the more obvious the advantage of the parallel structure becomes.
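
The difference between the two structures can be sketched in software. The class below is an illustrative model, not a hardware description (CascadeAdderDemo and its methods are hypothetical names): serialSum counts the additions a single adder performs one after another, while cascadeSum counts cascade levels, on the assumption that all additions within one level happen in the same clock cycle. It assumes a non-empty input whose length is a power of two.

```java
final class CascadeAdderDemo {
    // Serial structure: one adder, values fed through it one at a time.
    // Returns {sum, number of sequential additions}.
    static long[] serialSum(int[] values) {
        long sum = values[0];
        long steps = 0;
        for (int i = 1; i < values.length; i++) {
            sum += values[i];
            steps++;
        }
        return new long[] {sum, steps};
    }

    // Cascade structure: each level adds neighbouring pairs. The additions in
    // one level are independent, so in hardware they share one clock cycle.
    // Returns {sum, number of levels}.
    static long[] cascadeSum(int[] values) {
        long[] level = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            level[i] = values[i];
        }
        long levels = 0;
        int width = level.length;
        while (width > 1) {
            width /= 2;
            for (int i = 0; i < width; i++) {
                level[i] = level[2 * i] + level[2 * i + 1];
            }
            levels++;
        }
        return new long[] {level[0], levels};
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5, 6, 7, 8};
        long[] serial = serialSum(values);
        long[] cascade = cascadeSum(values);
        System.out.println(serial[0] + " in " + serial[1] + " steps");    // 36 in 7 steps
        System.out.println(cascade[0] + " in " + cascade[1] + " levels"); // 36 in 3 levels
    }
}
```

The serial step count grows linearly with the number of inputs, while the number of cascade levels grows only with its base-2 logarithm.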

Because of circuit limitations, it is not easy to increase computing speed by raising the clock frequency and shortening the clock period. A parallel structure achieves faster computation by enlarging the circuit and processing in parallel. However, complex logic is hard to implement on a parallel structure, because taking the outputs of multiple branches into account at the same time and keeping them synchronized is complicated (a bit like multithreaded programming).

Example of GPU parallel computing

Suppose we have the following image processing task: add 1 to the value of every pixel. The GPU's parallel approach is simple and brute-force: if resources permit, one GPU thread can be started per pixel to perform the addition. The larger the amount of mathematical computation, the more obvious the performance advantage of this parallel approach.
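
As a rough analogy only, the same data-parallel idea can be written in plain Java with parallel streams. This runs on CPU threads, not on a GPU, and is not how Android dispatches work to the GPU; the class name PixelIncrementDemo is made up for illustration. What it shares with the GPU approach is that every pixel is an independent task with no ordering between them.

```java
import java.util.stream.IntStream;

final class PixelIncrementDemo {
    // Serial version: one loop visits every pixel in order.
    static void addOneSerial(int[] pixels) {
        for (int i = 0; i < pixels.length; i++) {
            pixels[i] += 1;
        }
    }

    // Data-parallel version: each index is an independent task, so the
    // runtime is free to spread the work across the available CPU cores.
    static void addOneParallel(int[] pixels) {
        IntStream.range(0, pixels.length)
                .parallel()
                .forEach(i -> pixels[i] += 1);
    }

    public static void main(String[] args) {
        int[] pixels = new int[1920 * 1080];
        addOneParallel(pixels);
        System.out.println(pixels[0] + " " + pixels[pixels.length - 1]); // 1 1
    }
}
```

Because no pixel depends on any other, no locking is needed; this independence is what makes the task a good fit for a GPU.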

Hardware acceleration in Android

On Android, the interfaces of most applications are built from regular Views (games, video, image-processing apps, and the like are exceptions and may use OpenGL ES directly). The sections below analyze and compare software rendering and hardware-accelerated rendering of Views, based on the Java-layer code of stock Android 6.0.

DisplayList

A DisplayList is a basic drawing element that holds the primitive attributes of the element (position, size, angle, transparency, and so on) and corresponds to the drawXxx() methods of Canvas (figure below).

Information transfer process: Canvas (Java API) -> OpenGL (C/C++ Lib) -> Driver -> GPU.

On Android 4.1 and above, DisplayList supports properties. If only certain properties of a View change (such as Scale, Alpha, or Translate), it is enough to push the updated properties to the GPU, without generating a new DisplayList.

RenderNode

A RenderNode contains several DisplayLists. Usually one RenderNode corresponds to one View and contains all the DisplayLists of that View and its child Views.
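
The split between recorded drawing commands and properties can be illustrated with a toy model. ToyRenderNode below is a hypothetical class written for this article; it is not the Android RenderNode API and only mimics the idea that changing a property such as alpha leaves the recorded commands untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a render node; NOT the Android RenderNode API.
final class ToyRenderNode {
    private final List<String> displayList = new ArrayList<>(); // recorded draw commands
    private float alpha = 1.0f;   // property stored next to the display list
    private int recordCount = 0;  // how many times the display list was rebuilt

    // Re-record the drawing commands (the expensive path, done by the CPU).
    void record(List<String> drawCommands) {
        displayList.clear();
        displayList.addAll(drawCommands);
        recordCount++;
    }

    // Change a property only; the recorded commands are left untouched.
    void setAlpha(float newAlpha) {
        alpha = newAlpha;
    }

    // What a renderer would replay for one frame.
    String describeFrame() {
        return "alpha=" + alpha + " " + displayList;
    }

    int recordCount() {
        return recordCount;
    }

    public static void main(String[] args) {
        ToyRenderNode node = new ToyRenderNode();
        node.record(Arrays.asList("drawRoundRect", "drawText"));
        node.setAlpha(0.5f);
        System.out.println(node.describeFrame()); // alpha=0.5 [drawRoundRect, drawText]
        System.out.println(node.recordCount());   // 1
    }
}
```

In this model an alpha animation calls setAlpha once per frame while recordCount stays at 1, which is the saving that property support is meant to provide.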

Android Drawing Process (Android 6.0)

Below is a complete flow chart of Android View drawing, worked out mainly by reading the source code and debugging. Dotted arrows represent recursive calls.

The path from ViewRootImpl.performTraversals to PhoneWindow.DecorView.drawChild is the fixed process for every traversal of the View tree. First, flags determine whether a layout pass is needed, and layout is performed if so. Then drawing starts with operations such as creating a Canvas.
- If hardware acceleration is not supported or is disabled, software drawing is used and the generated Canvas is an instance of Canvas.class;
- If hardware acceleration is supported, the generated Canvas is an instance of DisplayListCanvas.class;
- The isHardwareAccelerated() methods of the two return false and true respectively, and a View uses this value to determine whether hardware acceleration is in use.
The recursive path in View, draw(canvas, parent, drawingTime) - draw(canvas) - onDraw - dispatchDraw - drawChild (hereafter the Draw path), calls the Canvas.drawXxx() methods. In software rendering these calls do the actual drawing; with hardware acceleration they build the DisplayList.
The recursive path in View, updateDisplayListIfDirty - dispatchGetDisplayList - recreateChildDisplayList (hereafter the DisplayList path), is taken only under hardware acceleration. It updates DisplayList properties while traversing the View tree and quickly skips Views that do not need to rebuild their DisplayList.

In Android 6.0, the DisplayList-related APIs are still marked “@hide” and cannot be accessed, which suggests that they are not yet mature and may be opened up in a later version.
In the hardware-accelerated case, the DisplayList is built once the draw process has finished, and ThreadedRenderer.nSyncAndDrawFrame() is then called so that the GPU draws the DisplayList onto the screen.

Pure Software Drawing VS Hardware Acceleration (Android 6.0)

The following table compares the drawing process with and without hardware acceleration in several concrete scenarios, and analyzes the acceleration effect.

Rendering scenario	Pure software drawing	Hardware acceleration	Analysis of the acceleration effect
Page initialization	Draw all Views	Create all DisplayLists	The GPU takes over the complex computation
Calling setText() on a TextView with a transparent background in a complex page, with its size and position unchanged	Redraw all Views in the dirty area	The TextView and each of its ancestor Views rebuild their DisplayLists	Overlapping sibling nodes no longer need to be redrawn by the CPU; the GPU handles them itself
A TextView plays an Alpha/Translation/Scale animation frame by frame	Every frame redraws all Views in the dirty area	Frame 1 behaves as in scenario 2; every later frame only updates the RenderNode properties of the TextView	Per-frame refresh performance improves greatly, and the animation becomes smoother
Changing the transparency of a TextView	Redraw all Views in the dirty area	Call RenderNode.setAlpha() directly to update	Before acceleration, the page must be traversed and many Views redrawn. After acceleration, only DecorView.updateDisplayListIfDirty is triggered with no further traversal, so the CPU time is negligible

In scenario 1, the View tree is traversed along the Draw path whether or not acceleration is enabled. With hardware acceleration, the Draw path does no actual drawing and only builds DisplayLists; the complex drawing computation is handed to the GPU, which gives a large speedup.
In scenario 2, the size and position of the TextView are the same before and after the call, so no relayout is triggered.
- With software drawing, the area occupied by the TextView is the dirty area. Because the TextView has transparent regions, most Views that overlap the dirty area are redrawn while the View tree is traversed, including overlapping sibling nodes and their parent nodes (see the section on software drawing refresh logic later). Views that do not need to be redrawn return immediately after a check in View's draw(canvas, parent, drawingTime) method.
- With hardware acceleration, the View tree still has to be traversed, but only the TextView and its ancestors at each level rebuild their DisplayLists along the Draw path. All other Views go straight down the DisplayList path, and the rest of the work is left to the GPU. The more complex the page, the larger the performance gap.
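
The hardware-accelerated behaviour in scenario 2 can be mimicked with a toy tree. ToyView below is a hypothetical class that borrows method names from the article for readability; it is not Android code, and the dirty-flag handling is a simplified assumption. Invalidating one view marks it and its ancestors dirty, and the traversal rebuilds only those views.

```java
import java.util.ArrayList;
import java.util.List;

// Toy view node; the names mirror the article but this is not Android code.
final class ToyView {
    final String name;
    final List<ToyView> children = new ArrayList<>();
    ToyView parent;
    boolean dirty = true; // a new view has no display list yet

    ToyView(String name) {
        this.name = name;
    }

    ToyView add(ToyView child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // Mark this view and every ancestor as needing a new display list.
    void invalidate() {
        for (ToyView v = this; v != null; v = v.parent) {
            v.dirty = true;
        }
    }

    // Walk the tree; rebuild only dirty views and collect their names.
    void updateDisplayListIfDirty(List<String> rebuilt) {
        if (!dirty) {
            return; // clean subtree: the old display list is reused
        }
        rebuilt.add(name);
        dirty = false;
        for (ToyView child : children) {
            child.updateDisplayListIfDirty(rebuilt);
        }
    }

    public static void main(String[] args) {
        ToyView root = new ToyView("DecorView");
        ToyView panel = new ToyView("Panel");
        ToyView text = new ToyView("TextView");
        ToyView image = new ToyView("ImageView");
        root.add(panel);
        panel.add(text).add(image);

        List<String> first = new ArrayList<>();
        root.updateDisplayListIfDirty(first);
        System.out.println(first);  // [DecorView, Panel, TextView, ImageView]

        text.invalidate();          // e.g. setText() on the TextView
        List<String> second = new ArrayList<>();
        root.updateDisplayListIfDirty(second);
        System.out.println(second); // [DecorView, Panel, TextView]
    }
}
```

The sibling ImageView is skipped on the second pass even if it overlaps the TextView on screen, which matches the second row of the table above.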
In scenario 3, software drawing has to do a lot of drawing work for every frame, which easily makes the animation stutter. With hardware acceleration, the animation simply updates DisplayList properties along the DisplayList path, and the animation becomes much smoother.
In scenario 4, the performance difference is even more pronounced. For a simple change of transparency, software drawing still has to do a lot of work. Hardware acceleration typically updates the RenderNode properties directly, without triggering invalidate and without traversing the View tree (except for a few Views that handle alpha specially and return true from onSetAlpha(); see the code below).

    public class View {
        // ...
        public void setAlpha(@FloatRange(from = 0.0, to = 1.0) float alpha) {
            ensureTransformationInfo();
            if (mTransformationInfo.mAlpha != alpha) {
                mTransformationInfo.mAlpha = alpha;
                if (onSetAlpha((int) (alpha * 255))) {
                    // The View handles alpha itself, so a full invalidate is needed.
                    // ...
                    invalidate(true);
                } else {
                    // Default case: only the RenderNode property is updated.
                    // ...
                    mRenderNode.setAlpha(getFinalAlpha());
                    // ...
                }
            }
        }

        protected boolean onSetAlpha(int alpha) {
            return false;
        }
        // ...
    }

Introduction to software drawing refresh logic

Reading the source code and running experiments yields the following refresh logic for software drawing under normal circumstances:

By default, a ViewGroup's clipChildren property is true, meaning that each child View cannot draw outside the bounds of its parent View. If the clipChildren property of a page's root layout is set to false, child Views can draw beyond the parent View's area.
When a View triggers invalidate and no animation or layout is triggered:
- For a completely opaque View, the View itself sets the flag PFLAG_DIRTY and its parent Views set the flag PFLAG_DIRTY_OPAQUE. In the draw(canvas) method, only the View itself is redrawn.
- For a View that may have transparent regions, both the View itself and its parent Views set the flag PFLAG_DIRTY.
  - When clipChildren is true, the dirty area is converted into a Rect in ViewRoot and checked level by level during the refresh. A View that overlaps the dirty area is redrawn. If a View lies outside its parent View and overlaps the dirty area, but the parent View does not overlap the dirty area, the child View is not redrawn.
  - When clipChildren is false, ViewGroup.invalidateChildInParent() expands the dirty area to cover the ViewGroup's own entire area, so that all Views overlapping this area are redrawn.
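
The two clipChildren cases can be sketched with simple rectangle tests. Everything below is a simplified model written for this article: ToyRect and DirtyRegionDemo are hypothetical names, not Android classes, and modelling the expansion of the dirty area as a union with the parent's bounds is an assumption of this sketch rather than something confirmed from the source code.

```java
// Minimal rectangle type for this sketch (not the Android Rect class).
final class ToyRect {
    final int left, top, right, bottom;

    ToyRect(int left, int top, int right, int bottom) {
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }

    boolean intersects(ToyRect o) {
        return left < o.right && o.left < right && top < o.bottom && o.top < bottom;
    }

    ToyRect union(ToyRect o) {
        return new ToyRect(Math.min(left, o.left), Math.min(top, o.top),
                Math.max(right, o.right), Math.max(bottom, o.bottom));
    }
}

final class DirtyRegionDemo {
    // clipChildren == true: the parent is checked first, then the child.
    static boolean childRedrawn(ToyRect dirty, ToyRect parent, ToyRect child) {
        return parent.intersects(dirty) && child.intersects(dirty);
    }

    // clipChildren == false: the dirty area grows to cover the whole parent
    // (modelled here as a union, which is an assumption of this sketch).
    static ToyRect expandDirty(ToyRect dirty, ToyRect parent) {
        return dirty.union(parent);
    }

    public static void main(String[] args) {
        ToyRect parent = new ToyRect(0, 0, 100, 100);
        ToyRect child = new ToyRect(90, 90, 200, 200); // sticks out of its parent

        // Dirty area touches the child only outside the parent: no redraw.
        System.out.println(childRedrawn(new ToyRect(150, 150, 180, 180), parent, child)); // false
        // Dirty area overlaps both the parent and the child: redraw.
        System.out.println(childRedrawn(new ToyRect(50, 50, 95, 95), parent, child));     // true

        ToyRect expanded = expandDirty(new ToyRect(40, 40, 60, 60), parent);
        System.out.println(expanded.right + "x" + expanded.bottom);                       // 100x100
    }
}
```

The first check reproduces the case where a child outside its parent is not redrawn; the expansion shows why setting clipChildren to false can cause far more Views to be redrawn under software drawing.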

Conclusion

That completes the introduction to hardware acceleration. Here is a brief summary:

CPUs are better at complex logic control, while GPUs are better at mathematical computation thanks to their many ALUs and parallel structure.
A page is made up of a variety of basic elements (DisplayLists) whose rendering requires a large amount of floating-point computation.
With hardware acceleration, the CPU controls the complex drawing logic and builds or updates DisplayLists, while the GPU performs the graphics computation and renders the DisplayLists.
With hardware acceleration, when the interface is refreshed, and especially while an animation is playing, the CPU rebuilds or updates only the DisplayLists that need it, which further improves rendering efficiency.
To achieve the same visual effect, prefer the simpler DisplayList for better performance (a Shape instead of a Bitmap, for example).
