For more technical articles, follow the author's WeChat official account: Code worker notes
The Apple M1 is Apple's new-generation SoC (System on Chip, integrating the CPU, GPU, neural network computing engine, I/O controllers, etc.) based on the ARM instruction set. It is the first personal-computer processor manufactured on a 5 nm process. Its features and parameters are summarized below:
I. 8-core CPU
4 high-performance CPU cores (codename “FireStorm”)
- L1 instruction cache: 192KB
- 3 times the size of competing Arm designs and 6 times that of competing Intel designs
- L1 data cache: 128 KB
- A load takes only 3 clock cycles
- For comparison: AMD has 32 KB with 4 clock cycles; Intel's latest Sunny Cove has 48 KB with 5 clock cycles
- L2 shared cache: 12 MB
- “P cluster” clock frequency 0.6-3.204 GHz, power <= 13.8 W
The following microarchitecture data comes from software-based measurements [2][3] and may differ from the actual hardware
Instruction decoder width is 8
- The widest in the industry
- AMD Zen 1 through Zen 3 and Intel x86 cores are 4-wide
- Samsung's cores have been 6-wide since the M3; Arm's currently shipping Cortex cores are 4-wide, and the upcoming Cortex-X1 is expected to be 5-wide
Instruction execution units (ports)
- Integer units:
- 1: alu + flags + branch + adr + msr/mrs nzcv + mrs
- 2: alu + flags + branch + adr + msr/mrs nzcv + ptrauth
- 3: alu + flags + mov-from-simd/fp?
- 4: alu + mov-from-simd/fp?
- 5: alu + mul + div
- 6: alu + mul + madd + crc + bfm/extr
- Load and store units, up to 128-bit loads and stores, including address generation with shifts up to LSL #3:
- 7: store + amx
- 8: load/store + amx
- 9: load
- 10: load
- Floating-point and SIMD units (FP/SIMD units):
- 11: fp/simd
- 12: fp/simd
- 13: fp/simd + fcsel + to-gpr
- 14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + frecpe + frsqrte + fjcvtzs + ursqrte + urecpe + sha
Instruction fusion
- adds/subs/ands/cmp/tst + b.cc (complete fusion when fused instructions read no more than 4 registers per 6 instructions)
- aese + aesmc (always fused if operands match pattern “A, B ; A, A”)
- aesd + aesimc (always fused if operands match pattern “A, B ; A, A”)
- pmull + eor (usually fused if operands match pattern “A, B, C ; A, A, D” or “A, B, C ; A, D, A”)
- amx + amx (excluding loads and stores – probably fuses to something like a STP)
Instruction elimination: instructions that are resolved without being issued to an execution unit
- mov x0, 0 (handled by renaming)
- mov x0, x1 (usually handled by renaming)
- movi v0.16b, #0 (handled by renaming)
- mov v0.16b, v1.16b (usually handled by renaming)
- mov imm/movz/movn (handled by renamer at a max of 2 per 8 instructions, includes all tested “mov”)
- nop (never issues)
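To illustrate what "handled by renaming" means, here is a toy sketch of a rename table. It is only an illustration under simplifying assumptions, not a description of the actual Firestorm hardware: the class `RenameTable`, its fields, and the helper `demo` are hypothetical names, physical registers are never freed, and the per-cycle limits listed above are not modeled.

```python
class RenameTable:
    """Toy register renamer: maps architectural registers to physical ones."""

    def __init__(self, arch_regs, num_phys):
        # Each architectural register starts with its own physical register.
        self.mapping = {reg: i for i, reg in enumerate(arch_regs)}
        # Remaining physical registers are free for new results.
        self.free = list(range(len(arch_regs), num_phys))
        # Uops that really have to be sent to an execution unit.
        self.issued = []

    def rename(self, op, dst=None, srcs=()):
        if op == "nop":
            return  # a nop never issues
        if op == "mov" and len(srcs) == 1:
            # Move elimination: dst simply points at the physical register
            # that already holds the source value; nothing is executed.
            self.mapping[dst] = self.mapping[srcs[0]]
            return
        # Ordinary instruction: read source mappings, allocate a new
        # physical register for the result, and issue the uop.
        phys_srcs = [self.mapping[s] for s in srcs]
        phys_dst = self.free.pop(0)
        self.mapping[dst] = phys_dst
        self.issued.append((op, phys_dst, phys_srcs))


def demo():
    rt = RenameTable(["x0", "x1", "x2"], num_phys=8)
    rt.rename("add", "x1", ["x1", "x2"])  # issues
    rt.rename("mov", "x0", ["x1"])        # eliminated by renaming
    rt.rename("nop")                      # never issues
    rt.rename("add", "x2", ["x0", "x1"])  # issues, both sources share one register
    return rt
```

In this sketch only the two `add` instructions reach `issued`; after the `mov`, `x0` and `x1` map to the same physical register, so the move costs no execution port.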
Other parameters
- Retires per cycle: 8
- ROB (in-flight renames): ~623
- Intel's Sunny Cove and Willow Cove cores currently rank second: 352
- AMD's latest Zen 3: 256
- Arm's latest Cortex-X1: 224
- Integer physical register file size: ~380
- FP/SIMD physical register file size: ~434
- Fetch window tracking slots (in-flight I-cache lines or branches): ~144
- Load buffers: ~129
- Store buffers: ~108
Note: The physical register file stores the operands of the uops (micro-operations) in the out-of-order pipeline. Each uop in the pipeline obtains its operands by referencing entries in the physical register file.
4 energy-efficient CPU cores (codename “Icestorm”)
- L1 instruction cache: 128 KB
- L1 data cache: 64 KB
- L2 cache: 4MB
- “E cluster” clock frequency 0.6-2.064 GHz, power <= 1.3 W
II. Other components
8-core GPU (Developed by Apple)
- Each GPU core contains 16 execution units (EUs)
- Each execution unit contains 8 ALUs
- A total of 128 EUs and 1024 ALUs
- Up to 25,000 threads can be executed simultaneously
- Peak floating-point throughput: 2.6 TFLOPS
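The GPU figures above can be cross-checked with simple arithmetic. The sketch below relies on two assumptions that are not stated in this article: each ALU completes one fused multiply-add per cycle (counted as 2 floating-point operations), and the peak GPU clock is about 1.278 GHz.

```python
EXECUTION_UNITS = 128        # total EUs, from the list above
ALUS_PER_EU = 8              # from the list above
FLOPS_PER_ALU_PER_CYCLE = 2  # assumption: one fused multiply-add = 2 FLOPs
GPU_CLOCK_GHZ = 1.278        # assumption: peak GPU clock, not given in this article


def total_alus(eus=EXECUTION_UNITS, alus_per_eu=ALUS_PER_EU):
    """Total ALU count across the whole GPU."""
    return eus * alus_per_eu


def peak_tflops(alus, clock_ghz=GPU_CLOCK_GHZ,
                flops_per_cycle=FLOPS_PER_ALU_PER_CYCLE):
    """Theoretical peak throughput in TFLOPS (ALUs x FLOPs/cycle x GHz / 1000)."""
    return alus * flops_per_cycle * clock_ghz / 1000.0


alus = total_alus()
print(alus, round(peak_tflops(alus), 2))
```

Under these assumptions the result lands close to the quoted 2.6 TFLOPS, which suggests the ALU count and the throughput figure are consistent with each other.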
16-core neural network computing engine
- 11 trillion operations per second
RAM
- Unified Memory Architecture (UMA)
- All components, such as the CPU and GPU, can access the same physical memory
- This eliminates memory copies between components and improves efficiency
- 8 GB or 16 GB of 4266 MT/s LPDDR4X SDRAM
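The transfer rate above translates into a theoretical peak bandwidth once a bus width is fixed. The following sketch assumes a 128-bit memory bus; that width is not stated in this article and is used here only as an assumption.

```python
TRANSFER_RATE_MT_S = 4266  # mega-transfers per second, from the list above
BUS_WIDTH_BITS = 128       # assumption: total memory bus width


def peak_bandwidth_gb_s(rate_mt_s=TRANSFER_RATE_MT_S, bus_bits=BUS_WIDTH_BITS):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return rate_mt_s * 1e6 * bytes_per_transfer / 1e9


print(round(peak_bandwidth_gb_s(), 2))
```

With these inputs the peak works out to roughly 68 GB/s, shared by the CPU, GPU, and the other components on the unified memory.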
III. Compatibility with programs compiled for Intel
Rosetta 2 dynamic binary translation
- This technology allows M1 devices to run software compiled for Intel x86 CPUs
- In terms of performance, translated code is slower than native code in compute-intensive programs, generally reaching 70% or more of native performance. See the evaluation data in [4] for details.
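A program can check whether it is currently running under Rosetta 2 translation. The sketch below queries the macOS `sysctl.proc_translated` key through `sysctlbyname`; the function name `is_translated_by_rosetta` is a hypothetical helper, and on non-macOS systems it simply reports that the answer is unknown.

```python
import ctypes
import errno
import platform


def is_translated_by_rosetta():
    """Return True if translated, False if native, None if unknown."""
    if platform.system() != "Darwin":
        return None  # the sysctl key only exists on macOS
    libc = ctypes.CDLL(None, use_errno=True)
    value = ctypes.c_int(0)
    size = ctypes.c_size_t(ctypes.sizeof(value))
    ret = libc.sysctlbyname(
        b"sysctl.proc_translated",
        ctypes.byref(value),
        ctypes.byref(size),
        None,
        ctypes.c_size_t(0),
    )
    if ret != 0:
        # A missing key (for example on an older system) is treated as native.
        if ctypes.get_errno() == errno.ENOENT:
            return False
        return None
    return value.value == 1


print(is_translated_by_rosetta())
```

This can be useful when benchmarking, to confirm whether a measurement reflects native or translated execution.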
IV. References
- [1] en.m.wikipedia.org/wiki/Apple_…
- [2] www.anandtech.com/show/16226/…
- [3] dougallj.github.io/applecpu/fi…
- [4] www.anandtech.com/show/16226/…