What software engineers need to know about the Apple M1

For more technical articles, follow the author’s WeChat official account: Code worker notes

The Apple M1 is Apple's new-generation SoC (System on Chip: a single chip that integrates the CPU, GPU, neural network computing engine, I/O controllers, and so on), based on the ARM instruction set. It is the first personal-computer processor manufactured on a 5 nm process. Its features and parameters are summarized below:

I. 8-core CPU

4 high-performance CPU cores (codename “FireStorm”)

L1 instruction cache: 192 KB
- 3 times the size of competing Arm designs and 6 times that of Intel designs
L1 data cache: 128 KB
- A load takes only 3 clock cycles
- For comparison: AMD has 32 KB at 4 clock cycles; Intel’s latest Sunny Cove has 48 KB at 5 clock cycles
L2 shared cache: 12 MB
“P cluster” clock frequency: 0.6-3.204 GHz; power: <= 13.8 W
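To put the cache figures above in more familiar units, the following is a minimal sketch that converts the 3-cycle load latency to nanoseconds at the peak P-cluster clock and derives the competitor L1 instruction cache sizes implied by the "3 times" and "6 times" ratios. The function name `cycles_to_ns` is illustrative only, and the calculation assumes the core is running at its maximum listed frequency.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Convert a latency in clock cycles to nanoseconds at a given clock (GHz)."""
    return cycles / freq_ghz


# Figures taken from the list above
L1I_KB = 192          # Firestorm L1 instruction cache
LOAD_CYCLES = 3       # Firestorm L1 data cache load latency
PEAK_GHZ = 3.204      # peak P-cluster clock

load_ns = cycles_to_ns(LOAD_CYCLES, PEAK_GHZ)
implied_arm_l1i_kb = L1I_KB / 3     # "3 times" the Arm rival
implied_intel_l1i_kb = L1I_KB / 6   # "6 times" the Intel rival

print(f"L1D load latency at peak clock: {load_ns:.3f} ns")
print(f"Implied rival L1I sizes: Arm {implied_arm_l1i_kb:.0f} KB, "
      f"Intel {implied_intel_l1i_kb:.0f} KB")
```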

The following microarchitecture data was obtained by software measurement [2][3] and may differ from the actual hardware

Instruction decoder width: 8

The widest in the industry
AMD Zen 1 through Zen 3 and Intel x86 cores are 4-wide
Samsung’s M3 is 6-wide; current Arm Cortex cores are 4-wide, and the Cortex-X1 will be 5-wide

Execution units (ports)

Integer units:
- 1: alu + flags + branch + adr + msr/mrs nzcv + mrs
- 2: alu + flags + branch + adr + msr/mrs nzcv + ptrauth
- 3: alu + flags + mov-from-simd/fp?
- 4: alu + mov-from-simd/fp?
- 5: alu + mul + div
- 6: alu + mul + madd + crc + bfm/extr
Load and store units, up to 128-bit loads and stores, including address generation with shifts up to LSL #3:
- 7: store + amx
- 8: load/store + amx
- 9: load
- 10: load
Floating-point and SIMD units (FP/SIMD units):
- 11: fp/simd
- 12: fp/simd
- 13: fp/simd + fcsel + to-gpr
- 14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + frecpe + frsqrte + fjcvtzs + ursqrte + urecpe + sha
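The port list above can be treated as a lookup table: the number of ports that accept an operation is an upper bound on how many such operations can start per cycle. The sketch below encodes the list as a Python dictionary. The names `PORTS` and `ports_for` are illustrative, the load/store port 8 is split into separate "load" and "store" capabilities, and entries marked "?" in the source are kept with the question mark because they are unconfirmed.

```python
# Port capabilities transcribed from the list above (measured, not official)
PORTS = {
    1: {"alu", "flags", "branch", "adr", "msr/mrs nzcv", "mrs"},
    2: {"alu", "flags", "branch", "adr", "msr/mrs nzcv", "ptrauth"},
    3: {"alu", "flags", "mov-from-simd/fp?"},
    4: {"alu", "mov-from-simd/fp?"},
    5: {"alu", "mul", "div"},
    6: {"alu", "mul", "madd", "crc", "bfm/extr"},
    7: {"store", "amx"},
    8: {"load", "store", "amx"},
    9: {"load"},
    10: {"load"},
    11: {"fp/simd"},
    12: {"fp/simd"},
    13: {"fp/simd", "fcsel", "to-gpr"},
    14: {"fp/simd", "fcsel", "to-gpr", "fcmp/e", "fdiv", "frecpe",
         "frsqrte", "fjcvtzs", "ursqrte", "urecpe", "sha"},
}


def ports_for(op):
    """Return the sorted list of port numbers that can execute `op`."""
    return sorted(port for port, ops in PORTS.items() if op in ops)


for op in ("alu", "mul", "load", "store", "fp/simd", "fdiv"):
    ports = ports_for(op)
    print(f"{op:8s} -> ports {ports} (at most {len(ports)} per cycle)")
```

Read this way, simple integer ALU work can use six ports, while an operation such as fdiv is limited to a single port.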

Instruction fusion

adds/subs/ands/cmp/tst + b.cc (complete fusion when fused instructions read no more than 4 registers per 6 instructions)
aese + aesmc (always fused if operands match pattern “A, B ; A, A”)
aesd + aesimc (always fused if operands match pattern “A, B ; A, A”)
pmull + eor (usually fused if operands match pattern “A, B, C ; A, A, D” or “A, B, C ; A, D, A”)
amx + amx (excluding loads and stores – probably fuses to something like a STP)
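The operand pattern “A, B ; A, A” quoted above for the AES pairs can be checked mechanically. The following is a toy sketch, not a model of the real hardware: it assumes the two instructions are adjacent, represents each as a hypothetical `(opcode, destination, source)` tuple, and only tests the register pattern.

```python
# Opcode pairs listed above as "always fused" when the operands match
AES_FUSION_PAIRS = {("aese", "aesmc"), ("aesd", "aesimc")}


def fuses_aes(first, second):
    """Check the "A, B ; A, A" pattern for two adjacent instructions.

    Each instruction is a (opcode, dst, src) tuple, e.g. ("aese", "v0", "v1").
    """
    op1, dst1, _src1 = first
    op2, dst2, src2 = second
    # The second instruction must read and write the first one's destination
    return (op1, op2) in AES_FUSION_PAIRS and dst2 == dst1 and src2 == dst1


# aese v0, v1 ; aesmc v0, v0  matches the pattern
print(fuses_aes(("aese", "v0", "v1"), ("aesmc", "v0", "v0")))
# aese v0, v1 ; aesmc v2, v0  writes a different register, so it does not
print(fuses_aes(("aese", "v0", "v1"), ("aesmc", "v2", "v0")))
```

For compiler and assembly writers, the practical consequence is to keep each pair adjacent and to reuse the same register for the second instruction.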

Instruction elimination: instructions that are resolved without being executed

mov x0, 0 (handled by renaming)
mov x0, x1 (usually handled by renaming)
movi v0.16b, #0 (handled by renaming)
mov v0.16b, v1.16b (usually handled by renaming)
mov imm/movz/movn (handled by renamer at a max of 2 per 8 instructions, includes all tested “mov”)
nop (never issues)

Other parameters

Retires per cycle: 8
ROB (in-flight renames): ~623
- Intel’s Sunny Cove and Willow Cove cores are currently second, at 352
- AMD’s latest Zen 3 has 256
- Arm’s latest Cortex-X1 has 224
Integer physical register file size: ~380
FP/SIMD physical register file size: ~434
Fetch window tracking slots (in-flight I-cache lines or branches): ~144
Load buffers: ~129
Store buffers: ~108

Note: The physical register file stores the operands of the micro-ops (uops) in the out-of-order pipeline. Each uop in the pipeline obtains its operands by referencing entries in the physical register file
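The note above, together with the moves listed earlier as "handled by renaming", can be illustrated with a toy register renamer. This is a simplified sketch under stated assumptions, not Apple's design: the class `Renamer` and its methods are hypothetical, only the registers x0 to x30 are modeled, the file size of 380 is the approximate measured value from the list, and the reference counting needed to free physical registers is omitted.

```python
class Renamer:
    """Toy rename table mapping architectural registers to physical ones."""

    def __init__(self, num_arch=31, num_phys=380):
        # Initially each architectural register owns one physical register
        self.table = {f"x{i}": i for i in range(num_arch)}
        self.free = list(range(num_arch, num_phys))

    def write(self, dst):
        """A normal instruction writing `dst` allocates a fresh physical register."""
        phys = self.free.pop(0)
        self.table[dst] = phys
        return phys

    def mov(self, dst, src):
        """Move elimination: point `dst` at `src`'s physical register.

        No physical register is allocated and no execution port is needed.
        """
        self.table[dst] = self.table[src]
        return self.table[dst]


r = Renamer()
p = r.write("x1")      # e.g. add x1, x2, x3 gets a new physical register
r.mov("x0", "x1")      # mov x0, x1 only updates the rename table
print(r.table["x0"] == r.table["x1"] == p)
print(len(r.free))
```

Because the move only edits the table, later uops that read x0 fetch their operand from the same physical register entry that holds x1.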

4 energy-efficient CPU cores (codename “Icestorm”)

L1 instruction cache: 128 KB
L1 data cache: 64 KB
L2 cache: 4 MB
“E cluster” clock frequency: 0.6-2.064 GHz; power: <= 1.3 W

II. Other components

8-core GPU (Developed by Apple)

Each GPU core contains 16 execution units (EUs)
Each EU contains 8 ALUs
A total of 128 EUs and 1,024 ALUs
Up to about 25,000 threads can be executed simultaneously
Floating-point performance: 2.6 TFLOPS
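The GPU totals above can be cross-checked with a little arithmetic. The sketch below starts from the 8 cores, 128 EUs, and 8 ALUs per EU given in the list. Two values are assumptions that do not appear in this article: a GPU clock of about 1.278 GHz, and one fused multiply-add (counted as 2 floating-point operations) per ALU per cycle.

```python
GPU_CORES = 8
TOTAL_EUS = 128
ALUS_PER_EU = 8

# Assumptions, not from this article:
GPU_CLOCK_GHZ = 1.278        # assumed peak GPU clock
FLOPS_PER_ALU_PER_CYCLE = 2  # one fused multiply-add counts as 2 operations

eus_per_core = TOTAL_EUS // GPU_CORES
total_alus = TOTAL_EUS * ALUS_PER_EU
tflops = total_alus * FLOPS_PER_ALU_PER_CYCLE * GPU_CLOCK_GHZ / 1000

print(f"EUs per core: {eus_per_core}")
print(f"Total ALUs:   {total_alus}")
print(f"Peak FP32:    {tflops:.2f} TFLOPS")
```

Under these assumptions the result lands on the quoted 2.6 TFLOPS, which suggests the headline figure is a theoretical peak rather than a measured throughput.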

16-core neural network computing engine (Neural Engine)

11 trillion operations per second

RAM

Unified Memory Architecture (UMA)
- All components, such as the CPU and GPU, access the same physical memory
- This eliminates memory copies between components and improves efficiency
8 GB or 16 GB of 4266 MT/s LPDDR4X SDRAM
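The 4266 MT/s transfer rate above translates into a theoretical peak bandwidth once a bus width is known. The following rough calculation assumes a 128-bit memory bus, a figure that does not appear in this article; the function name is illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


TRANSFER_RATE_MT_S = 4266   # from the list above
BUS_WIDTH_BITS = 128        # assumption: not stated in this article

bandwidth = peak_bandwidth_gb_s(TRANSFER_RATE_MT_S, BUS_WIDTH_BITS)
print(f"Theoretical peak memory bandwidth: {bandwidth:.1f} GB/s")
```

Since the memory is unified, this bandwidth is shared by the CPU, GPU, and the other components rather than reserved for any one of them.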

III. Compatibility with programs compiled for Intel

Rosetta 2 dynamic binary translation

This technology allows M1 devices to run software compiled for Intel x86 CPUs
Translated code runs slower than native code; in compute-intensive programs it generally reaches a little over 70% of native performance. See the benchmark data in [4] for details.
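Because translated code is slower, a program may want to know whether it is currently running under Rosetta 2. On macOS the `sysctl.proc_translated` key reports this. The sketch below reads it from Python through `ctypes`; the function name is hypothetical, it returns None on other operating systems, and it has not been validated on every macOS version, so treat it as a starting point.

```python
import ctypes
import ctypes.util
import errno
import platform


def is_translated_by_rosetta():
    """Return True if running under Rosetta 2, False if native.

    Returns None when not on macOS or when the query fails unexpectedly.
    """
    if platform.system() != "Darwin":
        return None

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.sysctlbyname.restype = ctypes.c_int

    value = ctypes.c_int(0)
    size = ctypes.c_size_t(ctypes.sizeof(value))
    rc = libc.sysctlbyname(b"sysctl.proc_translated",
                           ctypes.byref(value), ctypes.byref(size),
                           None, ctypes.c_size_t(0))
    if rc != 0:
        # A missing key (older macOS without Rosetta 2) means a native process
        return False if ctypes.get_errno() == errno.ENOENT else None
    return value.value == 1


print(is_translated_by_rosetta())
```

A build script could use such a check to warn when an x86 interpreter or toolchain is being used on an M1 machine by accident.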

IV. References

[1] en.m.wikipedia.org/wiki/Apple_…
[2] www.anandtech.com/show/16226/…
[3] dougallj.github.io/applecpu/fi…
[4] www.anandtech.com/show/16226/…