Assembly language execution process
First of all, a computer is a very simple machine. All the complex calculations it performs come down to high and low voltage levels, which in our logic become 1 and 0. The computer understands nothing but 0 and 1.
In the earliest days there were paper-tape computers: a hole punched in the tape meant 1, no hole meant 0, and the machine read the tape to get its program.
There is a well-known story from that era: a programmer checked his paper tape over and over, yet the output was still wrong. Eventually it turned out that a small insect had blocked one of the holes, which is said to be where the English word "bug" for a program defect comes from.
So the earliest programmers wrote in machine language, the 0s and 1s that computers recognize directly, e.g. 1001101 00110100 … (a made-up example).
But machine language is almost unreadable to humans and painful to write, so people gave the instructions human-friendly nicknames, and assembly language came into being.
For example:
Suppose 01001000 means "move" (again, I made that up); people gave it the alias mov.
Suppose 10110011 means "add" (also made up); people gave it the alias add.
Such names are much easier to remember.
So: Assembly language is essentially a mnemonic for machine language.
Assembly language execution process:
Power on -> the CPU reads the program from memory (as electrical signal input) -> the clock generator keeps oscillating between on and off -> each tick drives the CPU forward step by step (how many steps an instruction takes depends on the clock cycles it requires) -> the calculation completes -> the result is written back (as electrical signals) -> and finally written to the graphics card for output (standard output, or the display).
The composition of the CPU
PC: Program Counter, holds the address of the current instruction
Registers: very fast storage that temporarily holds the data the CPU is currently computing with
ALU: Arithmetic & Logic Unit
CU: Control Unit
MMU: Memory Management Unit
Hyper-threading
We often hear terms like "four cores, eight threads". So-called hyper-threading is actually quite simple: one ALU is paired with multiple sets of PC and registers.
The benefit of this is:
Normally, when the CPU performs a context switch, it has to move the old thread's data out and load the new thread's data in, and this switching consumes CPU resources.
With hyper-threading none of that is necessary: the ALU simply switches to another set of PC and registers.
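As a quick, hedged check from Java (not from the original article): the JVM reports the number of logical processors (hardware threads) it sees, which on a "four cores, eight threads" machine is typically 8.

```java
public class LogicalProcessors {
    public static void main(String[] args) {
        // Number of logical processors visible to the JVM
        // (on a 4-core / 8-thread CPU this is usually 8).
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}
```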
The cache
Why a cache: because the CPU is far faster than main memory; the speed ratio of CPU to memory is roughly 100:1.
The hierarchy of memory
The speed of CPU access to different levels of memory
| Memory | Access time |
| --- | --- |
| Registers | < 1 ns |
| L1 cache | about 1 ns |
| L2 cache | about 3 ns |
| L3 cache | about 15 ns |
| Main memory | about 80 ns |
Cache architecture of multi-core CPUs
Each core has its own L1 and L2 caches, the cores within one CPU share the L3 cache, and multiple CPUs share main memory.
Reading memory in blocks
The locality principle of programs
It shows up in two forms: temporal locality and spatial locality. Temporal locality means that if an instruction is executed once, it is likely to be executed again soon; likewise, if a piece of data is accessed, it is likely to be accessed again shortly afterwards. Spatial locality means that once a program accesses a storage location, nearby storage locations are likely to be accessed soon as well.
When memory is read in blocks, adjacent storage locations are brought in together with the one requested. Combined with the locality principle, this reduces the number of memory accesses and improves efficiency.
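A hedged sketch (not from the original article) that makes spatial locality visible: traverse the same two-dimensional array row by row and then column by column. The row-major loop walks adjacent elements that share cache lines, so on most machines it runs noticeably faster.

```java
/**
 * Illustrative sketch of spatial locality (not from the original article).
 * Row-major traversal walks adjacent elements in the same cache line;
 * column-major traversal jumps across cache lines (and across row arrays).
 */
public class LocalityDemo {
    static final int N = 2048;
    static final long[][] matrix = new long[N][N];

    public static void main(String[] args) {
        long sum = 0;

        long start = System.nanoTime();
        for (int i = 0; i < N; i++) {          // row-major: cache friendly
            for (int j = 0; j < N; j++) {
                sum += matrix[i][j];
            }
        }
        System.out.println("row-major:    " + (System.nanoTime() - start) / 100_0000 + " ms");

        start = System.nanoTime();
        for (int j = 0; j < N; j++) {          // column-major: many more cache misses
            for (int i = 0; i < N; i++) {
                sum += matrix[i][j];
            }
        }
        System.out.println("column-major: " + (System.nanoTime() - start) / 100_0000 + " ms");
        System.out.println(sum);               // keep sum alive so the loops are not optimized away
    }
}
```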
Cache line
The larger the cache line, the better it exploits spatial locality, but the longer each transfer takes.
The smaller the cache line, the less it exploits spatial locality, but the faster each transfer is.
In practice Intel CPUs settled on a compromise of 64 bytes; other CPUs are not necessarily 64 bytes.
Cache coherence protocol: here is an introduction to the protocol used by Intel CPU caches, MESI: www.cnblogs.com/z00377750/p…
Cache line alignment
- False sharing: suppose two adjacent variables X and Y lie in the same cache line. If two threads modify X and Y respectively, each write invalidates the cache line in the other thread's cache and forces the data to be read again from memory. This is false sharing.
Some particularly hot values see high-contention access from multiple threads. To make sure false sharing does not occur, cache line alignment (padding) can be applied in code.
- Without cache line alignment
```java
/**
 * Cache line alignment experiment - without cache line alignment.
 * arr[0] and arr[1] are in the same cache line.
 */
public class T01_CacheLinePadding {
    public static volatile long[] arr = new long[2];

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) arr[0] = i;
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) arr[1] = i;
        });
        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Elapsed time in milliseconds
        System.out.println((System.nanoTime() - start) / 100_0000);
    }
}
```
- With cache line alignment
```java
/**
 * Cache line alignment experiment - with cache line alignment (padding).
 * arr[0] and arr[8] are in different cache lines.
 */
public class T02_CacheLinePadding {
    public static volatile long[] arr = new long[16];

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) arr[0] = i;
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) arr[8] = i;
        });
        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Elapsed time in milliseconds
        System.out.println((System.nanoTime() - start) / 100_0000);
    }
}
```
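On most machines T02 finishes noticeably faster than T01: arr[0] and arr[8] are 8 longs (64 bytes) apart, so the two threads no longer write into the same cache line and stop invalidating each other's caches.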
- Starting with JDK 8, the @Contended annotation is provided to ensure that annotated fields do not end up in the same cache line. It only takes effect when used together with the VM parameter -XX:-RestrictContended.
```java
import sun.misc.Contended;

/**
 * JDK 8 provides the @Contended annotation for cache line alignment.
 * Requires the VM parameter -XX:-RestrictContended to take effect.
 */
public class T03_Contended {
    @Contended
    public static volatile long x;
    @Contended
    public static volatile long y;

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) x = i;
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) y = i;
        });
        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Elapsed time in milliseconds
        System.out.println((System.nanoTime() - start) / 100_0000);
    }
}
```
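Assuming the file layout matches the class name above, the experiment can be compiled and run roughly as `javac T03_Contended.java` followed by `java -XX:-RestrictContended T03_Contended`; without the flag, @Contended on application classes is ignored, and the padding (and the speedup) disappears.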
Out-of-order execution
The essence of out-of-order execution in the CPU is to improve efficiency: while one instruction is waiting (for example on a memory access), later instructions that do not depend on it can already be executed.
Why a DCL (double-checked locking) singleton needs volatile
Source code for creating an object:
```java
class T {
    int m = 8;
}

T t = new T();
```
At the machine level there are three steps to creating the object:
1. Allocate a block of memory and give the fields their default values.
2. Execute the constructor (assign the real initial values, e.g. m = 8).
3. Assign the address of the object to the reference t.
volatile can prohibit instruction reordering. Without volatile, steps 2 and 3 may be reordered, so another thread can observe a reference to an object whose constructor has not yet run, i.e. a half-initialized object.
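A sketch of the classic double-checked locking singleton this section refers to (the class name Singleton is illustrative); the volatile on INSTANCE is exactly what rules out the reordering of steps 2 and 3:

```java
public class Singleton {
    // volatile forbids reordering of "run the constructor" and "publish the reference"
    private static volatile Singleton INSTANCE;

    private Singleton() {}

    public static Singleton getInstance() {
        if (INSTANCE == null) {                 // first check, no locking
            synchronized (Singleton.class) {
                if (INSTANCE == null) {         // second check, under the lock
                    INSTANCE = new Singleton();
                }
            }
        }
        return INSTANCE;
    }
}
```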
How to disable instruction reordering
CPU level: Memory barrier/bus lock
Memory barriers: barrier instructions inserted before and after memory operations; operations on either side of a barrier cannot be reordered across it.
Intel CPUs provide the primitives lfence, sfence and mfence; the same problem can also be solved with a bus lock (the lock instruction prefix).
- sfence: store operations before the sfence instruction must complete before any store after it.
- lfence: load operations before the lfence instruction must complete before any load after it.
- mfence: load and store operations before the mfence instruction must complete before any load or store after it.
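As a hedged aside, JDK 8's sun.misc.Unsafe exposes fence methods that correspond roughly to these CPU primitives; a minimal sketch of calling them (obtaining Unsafe reflectively is the standard trick):

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

/**
 * Sketch (JDK 8): JVM-level analogues of lfence/sfence/mfence on sun.misc.Unsafe.
 * Obtaining the Unsafe instance requires reflection on the "theUnsafe" field.
 */
public class FenceDemo {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        unsafe.loadFence();  // roughly an lfence: orders loads
        unsafe.storeFence(); // roughly an sfence: orders stores
        unsafe.fullFence();  // roughly an mfence: orders loads and stores
    }
}
```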
JVM level: 8 happens-before principles and 4 memory barriers (LL, SS, SL, LS)
Happens-before principles (rules that the JVM's reordering must respect)
- Program order rule: within a single thread, operations earlier in program order happen-before operations that come later (all earlier writes in the same thread are visible to subsequent operations).
- Monitor lock rule: an unlock of a lock happens-before every subsequent lock of that same lock. (If thread 1 unlocks monitor A and thread 2 then locks A, all writes thread 1 made before the unlock are visible to thread 2; thread 1 and thread 2 may be the same thread.)
- Volatile variable rule: a write to a volatile variable happens-before every subsequent read of that variable. (If thread 1 writes volatile v and thread 2 later reads v, the write to v and all writes before it are visible to thread 2; thread 1 and thread 2 may be the same thread. See the sketch after this list.)
- Thread start rule: a call to Thread.start() happens-before every action in the started thread. (If thread A starts thread B via threadB.start(), the changes A made to shared variables before the start are visible to B once it begins running. Note: changes A makes after B has started are not guaranteed to be visible to B.)
- Thread termination rule: every action in a thread happens-before any other thread detects that it has terminated, for example via the return of Thread.join() or Thread.isAlive() returning false. (All variables written by thread t1 are visible to another thread t2 once t2's call to t1.join() returns or t1.isAlive() returns false.)
- Thread interruption rule: a call to Thread.interrupt() happens-before the interrupted thread's code detects the interrupt. (If thread t1 writes some variables and then interrupts thread t2, t2 can see all of t1's operations once it detects the interrupt.)
- Object finalization rule: the completion of an object's initialization happens-before the start of its finalize() method. (By the time finalize() runs, everything done during the object's initialization is visible.)
- Transitivity: if A happens-before B and B happens-before C, then A happens-before C.
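A hedged sketch of the volatile rule in action (class and field names are illustrative): because flag is volatile, the write to data before flag = true happens-before the read of data in the thread that sees flag == true.

```java
public class VolatileVisibility {
    static int data = 0;                  // plain field
    static volatile boolean flag = false; // volatile flag

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            data = 42;      // 1: plain write
            flag = true;    // 2: volatile write, happens-before any later read of flag
        });
        Thread reader = new Thread(() -> {
            while (!flag) { /* spin until the volatile write is observed */ }
            // Guaranteed to print 42: the volatile rule plus program order
            // makes the earlier write to data visible here.
            System.out.println(data);
        });
        reader.start();
        writer.start();
    }
}
```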
JSR-133 memory barriers
- LoadLoad barrier: for a sequence Load1; LoadLoad; Load2, it ensures that the data read by Load1 is loaded before Load2 and any subsequent loads are performed.
- StoreStore barrier: for a sequence Store1; StoreStore; Store2, it ensures that Store1's write is visible to other processors before Store2 and any subsequent writes are executed.
- LoadStore barrier: for a sequence Load1; LoadStore; Store2, it ensures that the data read by Load1 is loaded before Store2 and any subsequent writes are flushed out.
- StoreLoad barrier: for a sequence Store1; StoreLoad; Load2, it ensures that Store1's write is visible to other processors before Load2 and any subsequent reads are executed.
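As a hedged sketch of how these barriers are typically described as being placed around volatile accesses in HotSpot (conceptual only; the JIT may emit different instructions on a given CPU):

```java
/**
 * Hedged sketch: conceptual barrier placement commonly described for
 * volatile accesses in HotSpot (the actual JIT output may differ per CPU).
 */
public class VolatileBarrierSketch {
    static volatile long v;
    static long plain;

    static void write(long x) {
        plain = x;          // plain store
        // StoreStore barrier: the plain store above completes before the volatile store
        v = x;              // volatile store
        // StoreLoad barrier: the volatile store is visible before any later load
    }

    static long read() {
        long r = v;         // volatile load
        // LoadLoad barrier: later loads cannot move before the volatile load
        // LoadStore barrier: later stores cannot move before the volatile load
        return r + plain;   // plain load that must stay after the volatile load
    }

    public static void main(String[] args) {
        write(1);
        System.out.println(read());
    }
}
```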
As-if-serial: no matter how the hardware reorders instructions, the result observed within a single thread must be the same as if the instructions had executed in program order.
Write combining (merged writes)
Write Combining Buffer, usually 4 bytes
Because the ALU is so fast, while a write to L1 is in progress the data is also collected in a WC buffer; once the buffer is full, its contents are written straight out to L2.
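A hedged sketch of the classic experiment that illustrates write combining (class name, array sizes and iteration counts are illustrative, not from the original article): writing to six separate arrays in one loop versus splitting the same writes into two loops of three each; on CPUs with only a handful of WC buffers the split version is often faster despite looping twice.

```java
/**
 * Illustrative write-combining experiment (a sketch, not from the original article).
 * caseOne writes 6 distinct streams per iteration; caseTwo splits the same work
 * into two passes of 3 streams, which tends to fit the CPU's few WC buffers better.
 */
public class WriteCombiningDemo {
    private static final int ITERATIONS = 50_000_000;
    private static final int ITEMS = 1 << 22;
    private static final int MASK = ITEMS - 1;
    private static final byte[] a = new byte[ITEMS], b = new byte[ITEMS], c = new byte[ITEMS];
    private static final byte[] d = new byte[ITEMS], e = new byte[ITEMS], f = new byte[ITEMS];

    static long caseOne() {
        long start = System.nanoTime();
        for (int i = ITERATIONS; --i != 0; ) {
            int slot = i & MASK;
            byte v = (byte) i;
            a[slot] = v; b[slot] = v; c[slot] = v;   // 6 distinct streams per iteration
            d[slot] = v; e[slot] = v; f[slot] = v;
        }
        return System.nanoTime() - start;
    }

    static long caseTwo() {
        long start = System.nanoTime();
        for (int i = ITERATIONS; --i != 0; ) {       // first pass: 3 streams
            int slot = i & MASK;
            byte v = (byte) i;
            a[slot] = v; b[slot] = v; c[slot] = v;
        }
        for (int i = ITERATIONS; --i != 0; ) {       // second pass: the other 3 streams
            int slot = i & MASK;
            byte v = (byte) i;
            d[slot] = v; e[slot] = v; f[slot] = v;
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int run = 0; run < 3; run++) {
            System.out.println("caseOne: " + caseOne() / 100_0000
                    + " ms, caseTwo: " + caseTwo() / 100_0000 + " ms");
        }
    }
}
```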
NUMA
- UMA: Uniform Memory Access. Disadvantages: it is hard to scale; as the number of CPUs increases, contention for memory gets worse and a large share of CPU time is wasted fighting over memory access (around 4 CPUs is considered a reasonable limit).
- NUMA: Non-Uniform Memory Access. In a NUMA architecture, a group of CPUs is placed together with a portion of memory, which can be thought of as that group's local memory and can be accessed very efficiently.
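As a practical aside (not from the original article): HotSpot has a NUMA-aware allocator that can be enabled with the VM flag -XX:+UseNUMA, so that a thread's objects are preferentially allocated from the memory node local to the CPU it runs on.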
Computer startup process
- Power on; the BIOS or UEFI firmware runs, performs the power-on self-test, and loads the bootloader from a fixed location on the hard disk (the first sector);
- Read configurable information from the CMOS.