This article has been included in GitHub: org_Hejianhui/JavaStudy.

Preface

In this era of multithreaded, multiprocessor, and distributed programming, concurrency is an unavoidable problem. Since it cannot be avoided, we might as well embrace it and understand it, spending some time exploring concurrency from the underlying principles of the operating system, through basic Java programming, to distributed environments. Let’s start with the principles.

Computer system hierarchy

The hierarchy of early computer systems

The first computers were programmed in machine language, the first-generation programming language.

Assembly language programming came next: programs written in assembly language are translated into machine code by an assembler.

The hierarchy of modern (traditional) computer systems

Modern computers are programmed in high-level languages

  • Third-generation programming languages (3GL) are procedural: the code describes the implementation, i.e. “how to do it”.
  • Fourth-generation programming languages (4GL) are non-procedural: the code only states “what to do”, without describing the concrete algorithmic details (a loose illustration follows below).
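As a loose, single-language illustration of “how” versus “what” (SQL is the classic real 4GL; Java streams only imitate the declarative flavor), compare two ways of summing a list:

import java.util.List;

public class HowVsWhat {
    public static void main(String[] args) {
        List<Integer> nums = List.of(1, 2, 3, 4, 5);

        // 3GL style: describe HOW -- explicit loop and explicit accumulator
        int sum = 0;
        for (int n : nums) {
            sum += n;
        }

        // declarative style: state WHAT -- "the sum of the numbers"
        int sum2 = nums.stream().mapToInt(Integer::intValue).sum();

        System.out.println(sum + " " + sum2);   // 15 15
    }
}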

A language processing system includes the various language processors (compiler, assembler, linker) and the runtime system (library functions, debugging, optimization, and other facilities).

The operating system includes the human-computer interaction interface and the kernel routines that provide service functions.

As you can see, the development of languages is a process of continuous “abstraction”, and accordingly, computer systems keep gaining new abstraction levels.

Transformation of abstraction layer in computer system

Functional transformation: each upper layer is an abstraction of the layer below it, and each lower layer is the implementation of the layer above, providing its supporting environment!

Different users of a computer system

  • The end user works at the topmost level of abstraction provided by the application
  • System administrators work at an abstraction layer provided by the operating system
  • Application programmers work at the abstraction layer made up of language processing systems (mainly compilers and assemblers)
  • The language processing system is built on top of the operating system
  • Systems programmers (who implement system software) work at the ISA level and must know ISA well

The object programs produced by compilers and assemblers consist of machine-level code.

The operating system controls the hardware directly through machine instructions. The ISA sits at the interface between software and hardware: it is the abstraction of the hardware on which all software functionality is built.

Instruction Set Architecture (ISA)

ISA stands for Instruction Set Architecture, sometimes simply called the instruction system.

  • The ISA is a specification; it defines how software uses the hardware:
    • the set of executable instructions, including instruction formats, the types of operations, and the conventions for each operation’s operands;
    • the types of operands an instruction can accept;
    • the structure of the register set that can hold operands, including each register’s name, number, length, and purpose;
    • the size of the storage space that can hold operands, and how that space is addressed;
    • whether operands are stored in big-endian or little-endian byte order (illustrated in the sketch after this list);
    • how instructions obtain their operands, i.e., the addressing modes;
    • how the execution of instructions is controlled, including the program counter (PC), condition codes, etc.
  • The ISA is an essential abstraction layer in a general-purpose computer system:
    • without it, software could not use the computer hardware!
    • without it, a computer could not be called a general-purpose computer
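To see the big-endian/little-endian distinction from the list above in running code, here is a small sketch using the standard java.nio.ByteBuffer API (the value 0x12345678 is an arbitrary example):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 0x12345678;

        ByteBuffer big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
        big.putInt(value);      // stored as 12 34 56 78 -- most significant byte first

        ByteBuffer little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        little.putInt(value);   // stored as 78 56 34 12 -- least significant byte first

        for (int i = 0; i < 4; i++) System.out.printf("%02x ", big.get(i));
        System.out.println();
        for (int i = 0; i < 4; i++) System.out.printf("%02x ", little.get(i));
        System.out.println();
    }
}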

 

The relationship between the ISA and computer organization (microarchitecture)

The ISA is an abstraction of computer organization. Different ISAs specify different instruction sets.

  • For example, IA-32, MIPS, and ARM are different ISAs; a computer’s organization must implement the functions its ISA specifies
  • The same ISA can be implemented by different computer organizations, e.g., different general-purpose register (GPR) designs, control logic, arithmetic circuits, etc.
  • For example, a multiplication instruction may be implemented by the ALU or by a dedicated multiplier

The prototype of the modern computer

Modern computer models are based on the von Neumann computer model

In 1946, the Institute for Advanced Study (IAS) at Princeton began designing a “stored-program” computer, known as the IAS computer.

  • The most important idea of von Neumann’s structure is “stored-program”.
  • Working Mode:
    • Any work to be done by the computer is first written as a program; the program and its raw data are then loaded into main memory and executed. Once started, the computer completes the task automatically, fetching and executing instructions one by one without operator intervention.
    • A Von Neumann structured computer is also known as a Von Neumann Machine.
    • Almost all modern general-purpose computers use the von Neumann structure, so the IAS computer is the prototype of modern computers.

When the computer runs, it first fetches the first instruction from memory; the controller decodes it and, as the instruction requires, fetches data from memory to perform the specified arithmetic and logical operations, then writes the result back to memory at the given address. Next it fetches the second instruction and completes the specified operation under the controller’s direction, and so on, until a halt instruction is encountered.

Storing programs and data, fetching instructions step by step in program order, and automatically completing the operation each instruction prescribes: this is the computer’s most basic working model. The principle was proposed in 1945 by the Hungarian-American mathematician John von Neumann, so it is called the von Neumann computer model.

What is the von Neumann structure?

  • There is main memory for storing programs and data
  • A unit that automatically fetches instructions one by one
  • A unit that actually executes instructions (i.e., performs operations)
  • A program consists of instructions
  • Instructions describe how data is to be processed
  • A unit that inputs programs and raw data into the computer
  • A unit that outputs the results of operations

The main idea of von Neumann’s structure

  • A computer should consist of five basic parts: the arithmetic unit, the control unit, memory, input devices, and output devices.
  • The functions of the basic components are:
    • Control unit: the nerve center of the whole computer. It interprets the control information specified by the program, issues control accordingly, schedules programs, data, and addresses, coordinates the work of all parts of the computer, and manages access to memory and peripherals.
    • Datapath: performs the various arithmetic and logical operations on data, i.e., it is where data is processed.
    • Memory: stores programs, data, signals, commands, and other information, and supplies them when needed.
    • Input system: input devices, together with output devices, make up the external devices (peripherals for short). Input devices feed programs, raw data, text, characters, control commands, or field-collected data into the computer. Common input devices include the keyboard, mouse, photoelectric readers, tape drives, disk drives, and CD-ROM drives.
    • Output system: output devices, like input devices, are an essential part of the computer. They convey to the outside world the intermediate or final results, the various data, symbols, and text inside the machine, or various control signals. Common microcomputer output devices include CRT display terminals, printers, laser printers, plotters, tape drives, and CD-ROM drives.
  • Internally, both instructions and data are represented in binary. Each instruction consists of two parts: an operation code (opcode) and an address code. The opcode indicates the type of operation; the address code indicates where the operand lives. A program consists of a sequence of instructions (see the decoding sketch after this list).
  • Use “stored program” working mode.
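A sketch of the opcode/address-code split mentioned above, assuming a hypothetical 16-bit instruction word (not any real ISA) with a 4-bit opcode and a 12-bit operand address:

public class InstructionDecodeDemo {
    public static void main(String[] args) {
        int instruction = 0x3ABC;                  // hypothetical machine word

        int opcode  = (instruction >>> 12) & 0xF;  // bits 15..12: which operation to perform
        int address = instruction & 0xFFF;         // bits 11..0: where the operand lives

        System.out.printf("opcode=%X address=%03X%n", opcode, address);  // opcode=3 address=ABC
    }
}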

Modern computer structural models

An abstract, simplified model based on von Neumann’s theory underlies the hardware design of modern computers. The hardware structure contains many components, but the core is just two parts: the CPU and memory. So we will focus on those two parts.

  • CPU: central processing unit
  • PC: program counter
  • MAR: memory address register
  • ALU: arithmetic logic unit
  • IR: instruction register
  • MDR: memory data register
  • GPRs: general-purpose register group (several general-purpose registers; in early machines, an accumulator)

CPU instruction structure

CPU internal structure

  • The control unit
  • The operation unit
  • The storage unit

The control unit

The control unit is the command center of the whole CPU. It consists of the instruction register (IR), the instruction decoder (ID), and the operation controller (OC), and it coordinates the orderly work of the entire computer. Following the user’s pre-written program, it fetches each instruction in turn from memory into the instruction register IR, determines through decoding (analysis) what operation should be performed, and then, via the operation controller OC, sends micro-operation control signals to the relevant components according to the required timing. The operation controller OC mainly comprises the beat pulse generator, control matrix, clock pulse generator, reset circuit, and power-off protection control logic.

The operation unit

The arithmetic logic unit (ALU) is the core of the operation unit. It performs both arithmetic operations (including addition, subtraction, multiplication, and their compound operations) and logical operations (including shifts, logical tests, and comparisons of two values). Unlike the control unit, the operation unit acts on command: every operation it carries out is directed by control signals from the control unit, so it is an execution component.

Storage unit

The storage unit, consisting of the on-chip cache and the register group, is where data is temporarily held inside the CPU: data waiting to be processed, or results already produced, are kept here. Registers are internal CPU components with very high read/write speed, so data transfers between registers are very fast. Using registers reduces the number of CPU accesses to memory and thus raises CPU speed. The register group is divided into special-purpose registers, whose functions are fixed and which each store specific data, and general-purpose registers, which are versatile and whose use can be specified by the programmer.

The following table lists the key CPU technologies, when they appeared, and representative processor series. Each technology builds on the ones before it, and their development all revolves around one core goal: never let the CPU sit idle.

Key technology | Year | Description
Instruction cache (L1) | 1982 | Pre-reads multiple instructions
Data cache (L1) | 1985 | Prefetches a certain length of data
Pipelining | 1989 | One instruction split into stages handled by multiple units (i486)
Multiple pipelines | 1993 | Multiple units and multiple pipelines processing in parallel (Pentium)
Out-of-order execution + branch prediction | 1995 | Makes full use of different components working cooperatively (Pentium Pro)
Hyper-threading | 2002 | Multiple front-end components sharing one execution engine (Pentium 4)
Multi-core processors | 2006 | Dropped hyper-threading, reduced clock frequency, moved to multiple cores (Core)
Multi-core hyper-threading | 2008 | Reintroduced hyper-threading (Core iX series)

CPU cache structure

To improve execution efficiency and reduce interaction between the CPU and memory (such interaction stalls the CPU), modern CPUs generally integrate a multi-level cache, commonly a three-level cache architecture.

  • L1 cache, split into a data cache and an instruction cache, exclusive to each logical core
  • L2 cache, exclusive to a physical core, shared by its logical cores
  • L3 cache, shared by all physical cores

  • Storage capacity: memory > L3 > L2 > L1 > registers
  • Access speed: registers > L1 > L2 > L3 > memory

Note: a cache is made up of minimal storage blocks called cache lines, typically 64 bytes each.

What does a cache line imply? For example, if your L1 cache is 512 KB and a cache line is 64 bytes, then L1 holds 512 * 1024 / 64 = 8192 cache lines.

The process by which the CPU reads data from memory

  1. Fetching the value of a register X takes the CPU a single step: read it directly.
  2. Fetching a value from the L1 cache takes one to three steps (or more): lock the cache line, read the data, and unlock it; if the line is already locked, the CPU has to wait.
  3. Fetching a value from the L2 cache starts with an L1 lookup; on an L1 miss, the CPU locks the line in L2, copies the data into L1, and then reads it from L1.
  4. Fetching from the L3 cache works the same way, except the data is copied from L3 to L2, from L2 to L1, and then from L1 into the CPU.
  5. Fetching from main memory is the most involved: notify the memory controller to occupy bus bandwidth, notify memory to lock, issue the memory read request, wait for the response, save the response into L3 (then L2, then L1), read it from L1 into the CPU, and finally release the bus lock.
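The lookup order above can be mirrored in a toy software model. This is only an analogy with made-up names (Level maps, a load method); real hardware does all of this in circuitry, and the locking and bus arbitration are omitted here:

import java.util.HashMap;
import java.util.Map;

public class ToyCacheHierarchy {
    // each level maps an address to a cached value;
    // real caches map line-aligned addresses to whole 64-byte lines
    private final Map<Long, Long> l1 = new HashMap<>();
    private final Map<Long, Long> l2 = new HashMap<>();
    private final Map<Long, Long> l3 = new HashMap<>();
    private final Map<Long, Long> memory = new HashMap<>();

    long load(long address) {
        if (l1.containsKey(address)) { System.out.println("L1 hit"); return l1.get(address); }
        if (l2.containsKey(address)) { System.out.println("L2 hit, copy to L1"); long v = l2.get(address); l1.put(address, v); return v; }
        if (l3.containsKey(address)) { System.out.println("L3 hit, copy to L2 and L1"); long v = l3.get(address); l2.put(address, v); l1.put(address, v); return v; }
        // miss everywhere: go to main memory and fill every level on the way back
        System.out.println("memory access, fill L3/L2/L1");
        long v = memory.getOrDefault(address, 0L);
        l3.put(address, v); l2.put(address, v); l1.put(address, v);
        return v;
    }

    public static void main(String[] args) {
        ToyCacheHierarchy c = new ToyCacheHierarchy();
        c.memory.put(42L, 7L);
        c.load(42L);   // memory access, fill L3/L2/L1
        c.load(42L);   // L1 hit
    }
}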

Why does the CPU have a cache

Under Moore’s Law, CPU performance has doubled roughly every 18 months, while memory and hard disks have come nowhere near keeping up, which makes sufficiently fast memory and disks extremely expensive. Yet high-speed CPU computation needs data delivered at high speed. To bridge the gap, CPU manufacturers built a small amount of cache into the CPU to reconcile the mismatch between I/O speed and CPU speed.

When a CPU accesses a storage device, both data and instructions tend to cluster in contiguous regions; this is known as locality.

  • Temporal Locality: If an information item is being accessed, it is likely that it will be accessed again in the near future.

Examples: loops, recursion, repeated calls to the same method, etc.

  • Spatial Locality: if a memory location is referenced, locations near it are likely to be referenced in the near future as well.

Examples include sequentially executed code, two consecutively created objects, arrays, and so on.

A demonstration of spatial locality:

public class TwoDimensionalArraySum {
    private static final int RUNS = 100;
    private static final int DIMENSION_1 = 1024 * 1024;
    private static final int DIMENSION_2 = 6;
    private static long[][] longs;

    public static void main(String[] args) throws Exception {
        /* Initialize the array */
        longs = new long[DIMENSION_1][];
        for (int i = 0; i < DIMENSION_1; i++) {
            longs[i] = new long[DIMENSION_2];
            for (int j = 0; j < DIMENSION_2; j++) {
                longs[i][j] = 1L;
            }
        }
        System.out.println("Array initialization completed....");

        // Row-first traversal: walks each row's contiguous elements in order
        long sum = 0L;
        long start = System.currentTimeMillis();
        for (int r = 0; r < RUNS; r++) {
            for (int i = 0; i < DIMENSION_1; i++) {       // DIMENSION_1 = 1024 * 1024
                for (int j = 0; j < DIMENSION_2; j++) {   // DIMENSION_2 = 6
                    sum += longs[i][j];
                }
            }
        }
        System.out.println("spend time1:" + (System.currentTimeMillis() - start));
        System.out.println("sum1:" + sum);

        // Column-first traversal: jumps to a different row on every access
        sum = 0L;
        start = System.currentTimeMillis();
        for (int r = 0; r < RUNS; r++) {
            for (int j = 0; j < DIMENSION_2; j++) {           // DIMENSION_2 = 6
                for (int i = 0; i < DIMENSION_1; i++) {       // DIMENSION_1 = 1024 * 1024
                    sum += longs[i][j];
                }
            }
        }
        System.out.println("spend time2:" + (System.currentTimeMillis() - start));
        System.out.println("sum2:" + sum);
    }
}

The computation flow of a CPU with caches

  1. Programs and data are loaded into main memory
  2. Instructions and data are loaded into the CPU’s cache
  3. The CPU executes instructions and writes the results to the cache
  4. Data in the cache is written back to main memory
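Steps 2 to 4 are exactly where concurrency trouble begins: one core can compute against its own cached copy while another core still sees the old value. A minimal Java sketch of the resulting visibility problem; depending on JVM and hardware this program may hang forever, and declaring the flag volatile makes it terminate reliably:

public class VisibilityDemo {
    static boolean stop = false;   // try instead: static volatile boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) {
                // busy-wait; without volatile, the JIT may read stop once and never again
            }
            System.out.println("worker observed stop=true");
        });
        worker.start();

        Thread.sleep(100);
        stop = true;               // this write may remain invisible to the worker
        System.out.println("main set stop=true");
        worker.join();
    }
}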

CPU running security level

A CPU has four running (privilege) levels:

  • ring0
  • ring1
  • ring2
  • ring3

Linux and Windows use only two of these levels: ring0 and ring3. Operating-system (kernel) code runs at ring0, while third-party programs run at ring3. If a third-party program needs to call an internal operating-system function, the CPU must switch from ring3 to ring0 to execute that system function; this user/kernel switch is what makes such calls expensive. As a quick illustration, here is what happens to the CPU when the JVM creates a thread:

  • Step 1: the CPU switches from ring3 to ring0 to create the thread
  • Step 2: after creation, the CPU switches back from ring0 to ring3
  • Step 3: the thread executes the JVM program
  • Step 4: when the thread finishes, destroying it again requires switching to ring0
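All four steps hide behind a single Java call. On HotSpot under Linux, Thread.start() bottoms out in a native pthread_create call, which is exactly the ring3-to-ring0 transition described above:

public class ThreadCreateDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() ->
                System.out.println("running in " + Thread.currentThread().getName()));

        t.start();   // native call into the kernel: ring3 -> ring0, create the thread, back to ring3
        t.join();    // wait for the thread to finish (its destruction again involves ring0)
    }
}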

Operating system memory management

Execution space protection

The operating system has the twin concepts of user space and kernel space; the purpose, again, is safe isolation and stable program execution. Take the 4 GB memory space of a 32-bit operating system as an example.

Linux reserves a number of page frames for kernel code and data structures that are never swapped out to disk. Linear addresses from 0x00000000 to 0xC0000000 (PAGE_OFFSET) can be referenced by both user code and kernel code (user space). Linear addresses from 0xC0000000 (PAGE_OFFSET) to 0xFFFFFFFF can only be accessed by kernel code (kernel space). The kernel code and its data structures must reside in this 1 GB address space, but the larger consumer of that space is the virtual mapping of physical addresses.

This means that of the 4 GB memory space, only 3 GB is usable by user applications. A process or thread runs in either user mode or kernel mode: user programs run in user mode, while system calls run in kernel mode. The stacks used in the two modes also differ: user mode uses the ordinary stack in user space, while kernel mode uses a fixed-size stack in kernel space (generally one memory page). So each process or thread actually has two stacks, one for user mode and one for kernel mode.

Threads, the basic unit of CPU scheduling, are correspondingly divided into two models:

  1. Kernel-level thread model (KLT)
  2. User-level thread model (ULT)

Kernel thread model

Kernel-level threads (KLT): the system kernel manages the threads and keeps each thread’s state and context information, so a blocking thread does not block its whole process. On a multiprocessor system, multiple threads run in parallel on multiple processors. Thread creation, scheduling, and management are done by the kernel, which is slower than for ULTs but still faster than managing whole processes.

User thread model

User-level threads (ULT): implemented by the user application, independent of the operating system kernel. The application provides the functions to create, synchronize, schedule, and manage its threads. No user/kernel mode switch is needed, so it is fast. The kernel knows nothing about ULTs, and if one thread blocks, the whole process (including all of its threads) blocks.

At this point, think about what thread model does the JVM use?

Processes and threads

What is a process?

A modern operating system creates a process for a program when it runs; for example, when you start a Java program, the operating system creates a Java process. The process is the smallest unit of OS resource allocation.

What is a thread?

The thread is the smallest unit of CPU scheduling in the OS, also known as a lightweight process. Multiple threads can be created within one process; each has its own program counter, stack, local variables, and other private state, while all of them can access shared memory variables. The CPU switches between these threads at high speed, giving the user the impression that they execute simultaneously; this is concurrency, which resembles, but is not the same as, parallelism. A small sketch follows.
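A minimal sketch of that sharing: each thread gets its own stack variable, while the static field lives in shared memory, where unsynchronized access can already lose updates:

public class SharedMemoryDemo {
    static int shared = 0;   // lives in shared memory, visible to every thread

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            int local = 0;   // lives on this thread's own stack
            for (int i = 0; i < 100_000; i++) {
                local++;
                shared++;    // unsynchronized read-modify-write: updates can be lost
            }
            System.out.println(Thread.currentThread().getName() + " local=" + local);
        };

        Thread t1 = new Thread(task, "t1");
        Thread t2 = new Thread(task, "t2");
        t1.start(); t2.start();
        t1.join();  t2.join();

        // local is always 100000 per thread; shared is often less than 200000
        System.out.println("shared=" + shared);
    }
}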

Thread context switching process.

Virtual machine instruction set architecture

There are two main types of virtual machine instruction set architectures:

  1. Stack instruction set architecture
  2. Register instruction set architecture

Wikipedia has a detailed entry on instruction set architectures: zh.wikipedia.org/wiki/指令集架构

Stack instruction set architecture

  1. Simpler to design and implement, suitable for resource-constrained systems;
  2. It avoids the register-allocation problem by using zero-address instructions;
  3. Most instructions in the instruction stream are zero-address instructions whose execution depends on the operand stack; the instruction set is smaller, making the compiler easy to implement;
  4. It needs no particular hardware support, so portability is better and cross-platform implementation is easier.

Register instruction set architecture

  1. Typical examples are the x86 binary instruction set used by traditional PCs, and the Android Dalvik virtual machine.
  2. The instruction set architecture depends entirely on the hardware, so portability is poor.
  3. Performance is excellent and execution is more efficient.
  4. Fewer instructions are needed to complete an operation.
  5. Register-based instruction sets are mostly made up of one-address, two-address, and three-address instructions, while stack-based instruction sets are dominated by zero-address instructions.

Java is a typical stack-based instruction set architecture, as is Python (CPython). The sketch below makes the zero-address style concrete.
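To see the zero-address style in practice, compile a trivial method and disassemble it with javap -c; the method body comes out roughly as the four stack instructions shown in the comments:

public class StackIsaDemo {
    static int add(int a, int b) {
        return a + b;
    }
    // javac StackIsaDemo.java && javap -c StackIsaDemo prints roughly:
    //   iload_0    // push local variable a onto the operand stack
    //   iload_1    // push local variable b
    //   iadd       // pop both, push their sum -- no operand addresses in the instruction
    //   ireturn    // pop the top of the stack and return it
}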
