
Apple M1 is a new generation of SoC (System on Chip, which integrates the CPU, GPU, neural network computing engine, I/O controllers, etc.) launched by Apple and based on the Arm instruction set. It is the first personal-computer chip manufactured on a 5 nm process. Its features and parameters are summarized as follows:

I. 8-core CPU

4 high-performance CPU cores (codename “Firestorm”)

  • L1 instruction cache: 192 KB
    • 3 times and 6 times the size of the competing Arm and Intel designs, respectively
  • L1 data cache: 128 KB
    • A load takes only 3 clock cycles
    • For comparison: AMD has 32 KB with 4 clock cycles; Intel’s latest Sunny Cove has 48 KB with 5 clock cycles
  • Shared L2 cache: 12 MB
  • “P cluster” clock frequency: 0.6-3.204 GHz, power <= 13.8 W
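
To put the cache figures above in absolute terms, here is a small back-of-the-envelope sketch. It uses only numbers from the list; the one assumption is that the core is running at its peak 3.204 GHz clock when the load is issued, which will not always hold in practice.

```python
# Figures taken from the list above.
L1I_KB = 192
L1D_LOAD_CYCLES = 3
PEAK_CLOCK_GHZ = 3.204

# "3 times and 6 times" the rivals implies these L1 instruction cache sizes.
implied_arm_l1i_kb = L1I_KB / 3
implied_intel_l1i_kb = L1I_KB / 6

# cycles / (cycles per nanosecond) = nanoseconds.
# Assumption: the core is running at its peak clock.
l1d_load_ns = L1D_LOAD_CYCLES / PEAK_CLOCK_GHZ

print(f"Implied Arm L1I:   {implied_arm_l1i_kb:.0f} KB")
print(f"Implied Intel L1I: {implied_intel_l1i_kb:.0f} KB")
print(f"L1D load latency at peak clock: {l1d_load_ns:.3f} ns")
```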

The following microarchitecture data was obtained through software-based measurement [2][3] and may differ from the actual hardware.

Instruction decoder width: 8

  • The widest in the industry
  • AMD Zen 1 through Zen 3 and Intel x86 cores are 4-wide
  • Samsung’s cores have been 6-wide since the M3; Arm’s Cortex cores are currently 4-wide, and the Cortex-X1 is expected to be 5-wide

Instruction execution units (ports)

  • Integer units:
    • 1: alu + flags + branch + adr + msr/mrs nzcv + mrs
    • 2: alu + flags + branch + adr + msr/mrs nzcv + ptrauth
    • 3: alu + flags + mov-from-simd/fp?
    • 4: alu + mov-from-simd/fp?
    • 5: alu + mul + div
    • 6: alu + mul + madd + crc + bfm/extr
  • Load and store units, up to 128-bit loads and stores, including address generation with shifts up to LSL #3:
    • 7: store + amx
    • 8: load/store + amx
    • 9: load
    • 10: load
  • Floating-point and SIMD units (FP/SIMD units):
    • 11: fp/simd
    • 12: fp/simd
    • 13: fp/simd + fcsel + to-gpr
    • 14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + frecpe + frsqrte + fjcvtzs + ursqrte + urecpe + sha

Instruction fusion

  • adds/subs/ands/cmp/tst + b.cc (complete fusion when fused instructions read no more than 4 registers per 6 instructions)
  • aese + aesmc (always fused if operands match pattern “A, B ; A, A”)
  • aesd + aesimc (always fused if operands match pattern “A, B ; A, A”)
  • pmull + eor (usually fused if operands match pattern “A, B, C ; A, A, D” or “A, B, C ; A, D, A”)
  • amx + amx (excluding loads and stores – probably fuses to something like a STP)
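
The operand patterns above can be read as simple rules on adjacent instruction pairs. The following is a minimal sketch of such a check, not a model of the real hardware: `Insn` and `may_fuse` are hypothetical names, registers are written without arrangement suffixes (`v0` rather than `v0.16b`), and the register-read limit and the amx case are left out.

```python
from typing import NamedTuple, Tuple


class Insn(NamedTuple):
    """Hypothetical instruction record: mnemonic plus bare register names."""
    mnemonic: str
    operands: Tuple[str, ...]


FLAG_SETTERS = {"adds", "subs", "ands", "cmp", "tst"}
AES_PAIRS = {("aese", "aesmc"), ("aesd", "aesimc")}


def may_fuse(first: Insn, second: Insn) -> bool:
    """Return True if the pair matches one of the patterns listed above."""
    # Flag-setting ALU op followed by a conditional branch (b.eq, b.ne, ...).
    if first.mnemonic in FLAG_SETTERS and second.mnemonic.startswith("b."):
        return True
    # aese A, B ; aesmc A, A   and   aesd A, B ; aesimc A, A
    if (first.mnemonic, second.mnemonic) in AES_PAIRS:
        dest = first.operands[0]
        return second.operands == (dest, dest)
    # pmull A, B, C ; eor A, A, D   or   eor A, D, A
    if (first.mnemonic, second.mnemonic) == ("pmull", "eor"):
        dest = first.operands[0]
        return second.operands[0] == dest and dest in second.operands[1:]
    return False


print(may_fuse(Insn("cmp", ("x0", "x1")), Insn("b.ne", ("loop",))))
print(may_fuse(Insn("aese", ("v0", "v1")), Insn("aesmc", ("v0", "v0"))))
print(may_fuse(Insn("aese", ("v0", "v1")), Insn("aesmc", ("v2", "v0"))))
```

The first two pairs match a listed pattern; the third does not, because the aesmc destination differs from the aese destination.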

Instruction elimination: instructions that are handled without being sent to an execution unit

  • mov x0, 0 (handled by renaming)
  • mov x0, x1 (usually handled by renaming)
  • movi v0.16b, #0 (handled by renaming)
  • mov v0.16b, v1.16b (usually handled by renaming)
  • mov imm/movz/movn (handled by renamer at a max of 2 per 8 instructions, includes all tested “mov”)
  • nop (never issues)
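
"Handled by renaming" means the instruction is resolved by updating the register mapping rather than by executing anything. The toy model below illustrates the idea under stated assumptions: the `Renamer` class is hypothetical, physical register 0 is assumed to be hard-wired to zero, and freeing of physical registers (which real hardware needs) is omitted. How the M1 actually implements this is not documented in the sources cited here.

```python
ZERO_PHYS = 0  # assumption: physical register 0 always holds zero


class Renamer:
    """Toy renamer: maps architectural register names to physical indices."""

    def __init__(self, num_physical):
        self.free = list(range(1, num_physical))  # index 0 is reserved
        self.table = {}    # architectural name -> physical index
        self.issued = []   # uops that actually need an execution port

    def _allocate(self, reg):
        phys = self.free.pop(0)
        self.table[reg] = phys
        return phys

    def _lookup(self, reg):
        # First use of a register: give it a physical register.
        if reg not in self.table:
            self._allocate(reg)
        return self.table[reg]

    def rename(self, mnemonic, dest=None, *sources):
        if mnemonic == "nop":
            return  # never issues
        if mnemonic == "mov" and sources == ("#0",):
            self.table[dest] = ZERO_PHYS  # zeroing: point at the zero register
            return
        if mnemonic == "mov" and len(sources) == 1 and not sources[0].startswith("#"):
            self.table[dest] = self._lookup(sources[0])  # alias, no uop issued
            return
        src_phys = tuple(self._lookup(s) for s in sources if not s.startswith("#"))
        dest_phys = self._allocate(dest)
        self.issued.append((mnemonic, dest_phys, src_phys))


r = Renamer(num_physical=16)
r.rename("add", "x1", "x2", "x3")  # issues: needs an ALU
r.rename("mov", "x0", "x1")        # eliminated: x0 aliases x1
r.rename("mov", "x4", "#0")        # eliminated: x4 points at the zero register
r.rename("nop")                    # eliminated
r.rename("add", "x5", "x0", "x4")  # issues, reading the aliased registers
print(r.issued)
```

Of the five instructions, only the two adds reach `issued`; the moves and the nop change the mapping table at most.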

Other parameters

  • Retires per cycle: 8
  • ROB (in-flight renames): ~623
    • Intel’s Sunny Cove and Willow Cove cores currently rank second at 352
    • AMD’s latest Zen 3 is at 256
    • Arm’s latest Cortex-X1 is at 224
  • Integer physical register file size: ~380
  • FP/SIMD physical register file size: ~434
  • Fetch window tracking slots (in-flight I-cache lines or branches): ~144
  • Load buffers: ~129
  • Store buffers: ~108

Note: The physical register file stores the operands of the micro-ops (uops) in the out-of-order pipeline. Each uop in the pipeline obtains its operands by referencing entries in the physical register file.

4 energy-efficient CPU cores (codename “Icestorm”)

  • L1 instruction cache: 128 KB
  • L1 data cache: 64 KB
  • L2 cache: 4 MB
  • “E cluster” clock frequency: 0.6-2.064 GHz, power <= 1.3 W

II. Other components

8-core GPU (developed by Apple)

  • Each GPU core contains 16 execution units (EUs)
  • Each EU contains 8 ALUs
  • A total of 128 EUs and 1024 ALUs
  • Up to about 25,000 threads can be executed simultaneously
  • Floating-point throughput: 2.6 TFLOPS
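
The totals above can be tied to the 2.6 TFLOPS figure with a short calculation. This is a sketch under two assumptions that are not stated in the list: each ALU completes one fused multiply-add (counted as 2 floating-point operations) per cycle, and the GPU runs at about 1.278 GHz.

```python
# From the list above.
EXECUTION_UNITS = 128
ALUS_PER_EU = 8

# Assumptions (not stated in the list above).
FLOPS_PER_ALU_PER_CYCLE = 2  # one fused multiply-add counted as 2 FLOPs
GPU_CLOCK_GHZ = 1.278

total_alus = EXECUTION_UNITS * ALUS_PER_EU
# ALUs * FLOPs/cycle * Gcycles/s gives GFLOPS; divide by 1000 for TFLOPS.
tflops = total_alus * FLOPS_PER_ALU_PER_CYCLE * GPU_CLOCK_GHZ / 1000

print(f"Total ALUs: {total_alus}")
print(f"Peak throughput: {tflops:.2f} TFLOPS")
```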

16-core neural network computing engine

  • 11 trillion operations per second

RAM

  • Unified Memory Architecture (UMA)
    • All components, such as the CPU and GPU, can access the same physical memory
    • This eliminates memory copies between components and improves efficiency
  • 8 GB or 16 GB of 4266 MT/s LPDDR4X SDRAM
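
The transfer rate above translates into a peak bandwidth once a bus width is fixed. The sketch below assumes a 128-bit memory bus, which is not stated in the list, so treat the result as an estimate that depends on that assumption.

```python
TRANSFER_RATE_MT_S = 4266  # from the list above
BUS_WIDTH_BITS = 128       # assumption: not stated in the list above

bytes_per_transfer = BUS_WIDTH_BITS // 8
# transfers/s * bytes/transfer = bytes/s; divide by 1e9 for GB/s.
peak_gb_s = TRANSFER_RATE_MT_S * 1e6 * bytes_per_transfer / 1e9

print(f"Peak memory bandwidth: {peak_gb_s:.2f} GB/s")
```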

III. Compatibility with programs compiled for Intel

Rosetta 2 dynamic binary translation

  • This technology allows M1 devices to run software compiled for Intel x86 CPUs
  • Translated code runs slower than native code, especially in compute-intensive programs, where it generally reaches 70% or more of native performance. See the evaluation in [4] for detailed figures.
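
When comparing native and translated performance, it helps to confirm which mode a process is actually running in. macOS exposes this through the `sysctl.proc_translated` key; the sketch below queries it from Python. The helper names are hypothetical, and the function returns `None` whenever the key or the `sysctl` tool is unavailable (for example on other operating systems).

```python
import subprocess
from typing import Optional


def parse_proc_translated(output: str) -> Optional[bool]:
    """Interpret the sysctl output: "1" = translated, "0" = native."""
    value = output.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    return None


def running_under_rosetta() -> Optional[bool]:
    """Return True/False on macOS, or None if the answer cannot be determined."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return None  # sysctl is not available on this system
    if result.returncode != 0:
        return None  # the key does not exist here
    return parse_proc_translated(result.stdout)


print(running_under_rosetta())
```

Note that a Python interpreter built for x86 and launched on an M1 will itself be translated, so the result describes the interpreter process.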

IV. References

  • [1] en.m.wikipedia.org/wiki/Apple_…
  • [2] www.anandtech.com/show/16226/…
  • [3] dougallj.github.io/applecpu/fi…
  • [4] www.anandtech.com/show/16226/…