memory

register

Registers: We can think of the CPU as the “brain” of the computer. The things we are actively thinking about are like the registers in the CPU. Registers are part of the CPU itself rather than a separate memory device, and they hold only a very limited amount of information, but they are extremely fast and run in sync with the CPU.

CPU Cache

CPU Cache: The memory in our brain is like the CPU’s high-speed cache (the CPU Cache, or simply “the Cache”). The CPU Cache is built from a type of chip called SRAM (Static Random-Access Memory).

SRAM: SRAM is called “static” memory because the data in it stays put for as long as power is supplied; once the power is cut, the data is lost. In SRAM, one bit of data requires six to eight transistors, so its storage density is low: the amount of data that fits in the same physical space is limited. However, because SRAM’s access circuitry is simple, its access speed is very fast.

In a CPU there are usually three levels of cache: L1, L2, and L3. Each CPU core has its own L1 cache; the core talks to L1 directly, and L1 is much smaller than L2. L1 is usually split into an instruction cache and a data cache, storing the instructions and data used by the CPU separately, while L2 makes no such distinction. -> The idea of storing instructions and data separately comes from the Harvard architecture. The L1 Cache is typically embedded inside the CPU core itself.

Each CPU core also has its own L2 Cache, though it often sits outside the core itself, so L2 accesses are slightly slower than L1. The L3 Cache, typically shared by multiple CPU cores, is larger and slower still.

The CPU’s L1 Cache is like our short-term memory, the L2/L3 Cache is like our long-term memory, and main memory is our own desk or bookshelf. When the information we need is not in our own memory, we take a book from the desk or shelf and open it. In this process, data is loaded from memory into the CPU’s Cache and registers, and then processed by the “brain,” which is the CPU.

memory

Memory uses a type of chip called DRAM. Its storage cell is simpler than SRAM’s (one bit needs only one transistor and one capacitor), so it has higher density and more capacity, and DRAM chips are much cheaper than SRAM chips. It is called “dynamic” memory because DRAM must be refreshed constantly to keep its data: the data is held in a capacitor that continuously leaks charge, so the cell has to be recharged at regular intervals or the data is lost. DRAM’s data-access and refresh circuitry is more complex than SRAM’s, so its access latency is longer.

The hard disk

Continuing the analogy, external storage devices (hard disks such as SSDs and HDDs) are the public library. The library holds far more books (data).

The hierarchy of memory

The entire memory hierarchy is built around the performance/price trade-off between SRAM and DRAM: SRAM is more expensive and faster, while DRAM is cheaper and offers more capacity. SRAM is like the memory in our brain, and DRAM is like our own desk.

The L1 Cache is limited not only by cost but also by physics. Besides being expensive, its access speed depends on its physical distance from the CPU core: the bigger a cache gets, the farther some of it sits from the CPU. So speed cannot be bought simply by spending more money.

Smaller devices are faster, and the CPU does not deal with every storage device directly; each level of storage deals only with its neighbors. For example, the CPU Cache is loaded from memory and written back to memory; data is never written from the Cache straight to the hard disk, nor loaded from the hard disk straight into the CPU Cache. Instead, data goes from the hard disk to memory first, and from memory into the Cache.

In a real computer, the faster the device, the smaller the capacity.

Locality principle

On a typical CPU, for example, the L1 Cache is 256KB, the L2 Cache is 1MB, and the L3 Cache is 12MB.

Can we enjoy the speed of the CPU Cache while also getting the huge capacity and low price of memory and hard disks? The answer lies in the locality principle, which comes in two forms: temporal locality and spatial locality.

  • Temporal locality: if a piece of data is accessed, it is likely to be accessed again within a short period of time. Reading data from the database on the hard disk into a cache in memory takes advantage of temporal locality.
  • Spatial locality: if a piece of data is accessed, its neighbors are likely to be accessed soon. This is like a program that, after accessing the first element of an array, will most likely loop on to its next element.

With temporal and spatial locality, instead of keeping all of our data in memory or all of it on HDDs, we can put frequently accessed data in expensive but fast storage and infrequently accessed data in slower but larger storage. This combination of memory, SSDs, and HDDs lets us meet real data storage, management, and access requirements at the lowest cost.

Locality principle + different levels of memory combination

Assume Amazon has 600 million items, each requiring 4MB of storage; then a total of 2400TB (= 600 million x 4MB) of storage space is needed.

If we put all of that data in memory, it would cost about $36 million (= 2400TB / 1MB x $0.015). But not every one of those 600 million items is accessed often.

If we put only the top 1% of hot items (6 million items) in memory and keep the rest on mechanical HDDs, the storage cost drops to about $456,000 (= $36 million x 1% + 2400TB / 1MB x $0.00004), roughly 1.3% of the original cost.

We are using temporal locality here. We load the data a user has accessed into memory, and when memory runs out, we evict the data that has gone unaccessed the longest; this is exactly the LRU (Least Recently Used) cache algorithm we use all the time. Popular items that are accessed frequently stay in memory, while unpopular, rarely accessed items live only on the HDDs and are read directly from disk; even if such an item is loaded into memory, it is soon evicted. The more popular an item is, the more likely it is to be found in memory, and the better we exploit memory’s random-access capability.
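As a rough illustration of this policy (not code from the original text), Java’s LinkedHashMap in access-order mode already behaves like a small LRU cache; the capacity and key/value types below are made up:

import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU cache: the entry that has gone unaccessed the longest is evicted first.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration order follows access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry once we are full
    }
}

// Usage sketch: keep only the hottest items in memory; a miss falls through to the HDD.
// LruCache<Long, byte[]> hotItems = new LruCache<>(6_000_000);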

Whether placing only 6 million hot items can satisfy actual online service requests depends on the cache hit ratio of the LRU cache policy.

A random access to memory takes about 100ns, which means that, in the extreme case, one memory module can support 10 million random accesses per second. We are using 24TB of memory; at 8GB per module, that is 3,000 modules, which together can support 30 billion accesses per second (= 24TB / 8GB x 1s / 100ns). Based on Amazon’s roughly 300 million users in 2017, we estimate 100 million daily active users, each visiting 100 items on average, so about 120,000 items are accessed per second.

However, if a request misses memory, it has to go to the HDDs. A hard disk drive can support only about 100 random accesses per second (one access takes roughly 10ms). If the 2400TB of data is spread over 4TB disks, the 600 disks can together support 60,000 random accesses per second (2400TB / 4TB x 1s / 10ms).

From this you can see that if all item access requests went straight to the HDDs, they could not withstand the load. We need a cache hit ratio of at least 50% so that the HDDs only have to absorb the remaining accesses. Otherwise, we would either have to add more HDDs until they could sustain 120,000 random accesses per second, or replace the HDDs with SSDs so that a single drive can handle more random access requests.
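These back-of-envelope numbers can be checked with a few lines of arithmetic. A quick sketch (the prices, latencies, and user counts are simply the assumptions used in the text above):

public class CapacityEstimate {
    public static void main(String[] args) {
        double items = 600e6, itemMB = 4;                      // 600 million items, 4MB each
        double totalMB = items * itemMB;                       // 2400TB expressed in MB
        double allInMemory = totalMB * 0.015;                  // ~$36M at $0.015 per MB of DRAM
        double hotInMemory = allInMemory * 0.01                // top 1% of items kept in memory
                + totalMB * 0.00004;                           // the rest on HDD at $0.00004 per MB
        System.out.printf("all in memory: $%.0f, hot items in memory: $%.0f%n",
                allInMemory, hotInMemory);                     // ~36,000,000 vs ~456,000

        double memAccessesPerSec = 24e12 / 8e9 * (1e9 / 100);  // 3,000 x 8GB modules, 100ns per access
        double hddAccessesPerSec = 2400e12 / 4e12 * 100;       // 600 x 4TB disks, ~100 IOPS each
        double peakAccessesPerSec = 100e6 * 100 / 86400.0;     // 100M users x 100 items over a day
        System.out.printf("memory: %.0f/s, HDD: %.0f/s, demand: %.0f/s%n",
                memAccessesPerSec, hddAccessesPerSec, peakAccessesPerSec);
    }
}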

Calculation process:

First calculate how much storage all the items need in total, then estimate how many hot items need accelerated access, then work out what it costs to put the items that need acceleration into memory, plus what it costs to put the remaining items on disk. -> Count the money

Then calculate the average number of accesses per second at peak, work out how many accesses per second the memory holding the hot data can support given memory’s random-access latency, and work out how many requests per second the disks can support for the non-hot data, to determine whether the system can handle the peak user load. -> Figure out if it’s enough

In this example, you can think of the items as passengers, memory as airplanes, and disks as slow trains. Figure out how many passengers need to travel fast, how much flying costs versus taking the train, what the peak number of passengers is, and how many trips each plane and train can make in a day; put it all together and the whole estimate makes sense.

Finally, combine the two and, if enough, save money.

We used a quick estimate to determine whether this caching strategy would meet our needs and how much hardware would need to be planned given the estimated server load. This ability to “estimate + plan” is a must for every engineer who wants to grow as an architect.

Cache – part 1

Cache = CPU Cache = L1 Cache + L2 Cache + L3 Cache

Today, a single memory access takes about 120 CPU cycles.

The instructions the CPU needs to execute and the data it needs to access all live in main memory, whose speed is less than 1% of the CPU’s. And the performance gap between the CPU and memory has only widened over time.

Once a CPU Cache is added, instructions and data from memory are first loaded into the L1-L3 Cache rather than being fetched from memory by the CPU directly. 95% of the time, the CPU only needs to read instructions and data from the L1-L3 Cache, without touching memory at all.

A large amount of die area in modern CPUs is taken up by SRAM. (In the original figure, the CPU’s L3 Cache region is highlighted in red.)

int[] arr = new int[64 * 1024 * 1024]; // 256MB = 4 bytes x 64 x 2^20

// loop 1
for (int i = 0; i < arr.length; i++) arr[i] *= 3;

// loop 2
for (int i = 0; i < arr.length; i += 16) arr[i] *= 3;

// loop 1 takes about 50 milliseconds; loop 2 takes about 46 milliseconds

Intuitively, loop 2 touches only 1/16 of the array elements that loop 1 does, so it performs only 1/16 of the multiplications and ought to run about an order of magnitude (10x or more) faster than loop 1. Yet the difference was just 4 milliseconds.

Given the speed gap between the CPU Cache and memory described above, we can conclude that the running time of this program is mostly spent reading data from memory and loading it into the CPU Cache. The CPU reads data from memory into the Cache in small chunks rather than as individual array elements. These small chunks of data in the CPU Cache are called Cache Lines.

On a typical Intel server or PC, the size of the Cache Line is 64 bytes.

Data is loaded from memory into the Cache one Cache Line at a time. Since an int is 4 bytes and a Cache Line is 64 bytes, each Cache Line holds 16 array elements; reading any one of those 16 values loads all 16 of them into the Cache Line.

In loop 2 above, we touch one integer out of every 16, and 16 integers are exactly 64 bytes. Therefore loop 1 and loop 2 load the same number of Cache Lines from memory into the CPU Cache, which is why the two loops finish in nearly the same time.


The data structure and reading process of the Cache

When a modern CPU reads data, it always checks the Cache first, regardless of whether the data is actually there. Only when it cannot find the data in the Cache does it access memory, writing what it reads back into the Cache. Thanks to temporal locality, recently accessed data will soon be accessed again, and since Cache access is much faster than memory access, the CPU spends far less time waiting on memory. -> In various benchmarks and real application scenarios, the CPU Cache hit ratio can reach 95% or more.

Cache data structure and access logic

Direct mapping: the CPU reads memory in small blocks. To read some data, the first thing we have is the address of the memory block the data lives in. The direct-mapping strategy ensures that any given memory block address always maps to one fixed CPU Cache Line, and this mapping is usually implemented with a modulo (mod) operation.

For example, suppose main memory is divided into 32 blocks numbered 0 to 31 and we have eight cache blocks in total. If the user wants to access memory block 21 and its contents are in the cache, they must be in cache block 5 (21 mod 8 = 5).

In practice, the number of cache blocks is usually a power of two (2^N). That way, taking the modulus is just taking the lowest N bits of the address, i.e., the last few bits of its binary representation. With 8 cache blocks here, that is 2 to the third power, so instead of computing 21 mod 8 we can write 21 in binary as 10101 and take its lowest three bits, 101, which is 5: the cache block index.
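In code, the power-of-two trick is just a bit mask. A throwaway snippet (not from the original text) checking the example above:

public class DirectMapIndex {
    public static void main(String[] args) {
        int block = 21;                         // binary 10101
        int cacheBlocks = 8;                    // 2^3 cache blocks
        int index = block & (cacheBlocks - 1);  // same as 21 % 8: keep the lowest 3 bits
        System.out.println(index);              // prints 5 (binary 101)
    }
}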

A Cache has only a limited number of Cache Lines of 64 bytes each, so it cannot hold every memory block at once; many memory blocks map to the same Cache Line. For this reason, each Cache Line stores a group marker (tag) recording which memory block it currently holds. The Cache Line’s own index already corresponds to the lowest N bits of the memory block address, so the tag only needs to record the remaining high bits. In the example above, the cache block index already covers the low bits of 21, so the tag only needs to record the remaining two bits, namely 10.

Besides the group marker, a Cache Line holds two other pieces of data: the actual Data Block loaded from memory and a valid bit. The valid bit flags whether the data in the cache block is valid at all, so that we do not trust whatever garbage is there right after the machine starts up. If the valid bit is 0, the CPU ignores the group marker and the contents of the Cache Line, and instead accesses memory and reloads the data.

When the CPU actually reads data, it does not consume a whole Block; it reads just the piece it needs, which in CPU terms is called a word. Which word it is is determined by its position within the Block, and that position is called the Offset.

If the data in memory is already in the CPU Cache, the access to a memory address will go through four steps:

  1. Compute the index into the Cache from the low-order bits of the memory address.
  2. Check the valid bit to make sure the data in that Cache Line is valid.
  3. Compare the high-order bits of the memory address with the group marker (tag) in the Cache Line to confirm that the data there really is the memory data we want, and read the corresponding Data Block from the Cache Line.
  4. Read the desired word out of the Data Block according to the Offset bits of the memory address.

If, in step 2 or 3, the CPU finds that the data in the Cache is not the data for the address it wants, it accesses memory and writes the corresponding Block into the Cache Line, updating the valid bit and group marker along the way.
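Here is a toy simulation of this four-step lookup for a direct-mapped cache with 64-byte Cache Lines. The class, field names, and sizes are invented for illustration; real hardware is of course not written in Java:

// Toy direct-mapped cache: index = low bits of the block address, tag = the remaining high bits.
public class DirectMappedCache {
    static final int LINE_SIZE = 64;              // 64-byte Cache Line
    static final int NUM_LINES = 8;               // 2^3 lines, so a 3-bit index

    static class CacheLine {
        boolean valid;                            // valid bit
        long tag;                                 // group marker (tag)
        byte[] block = new byte[LINE_SIZE];       // Data Block loaded from memory
    }

    final CacheLine[] lines = new CacheLine[NUM_LINES];
    final byte[] memory;                          // stand-in for main memory

    DirectMappedCache(byte[] memory) {
        this.memory = memory;
        for (int i = 0; i < NUM_LINES; i++) lines[i] = new CacheLine();
    }

    byte read(long address) {
        long blockAddr = address / LINE_SIZE;       // which memory block the address belongs to
        int offset = (int) (address % LINE_SIZE);   // offset of the word inside the block
        int index = (int) (blockAddr % NUM_LINES);  // 1. index = low bits of the block address
        long tag = blockAddr / NUM_LINES;           //    tag  = the remaining high bits

        CacheLine line = lines[index];
        if (!line.valid || line.tag != tag) {       // 2+3. valid bit or tag check fails: a miss
            System.arraycopy(memory, (int) (blockAddr * LINE_SIZE), line.block, 0, LINE_SIZE);
            line.valid = true;                      //      reload the whole block from memory
            line.tag = tag;
        }
        return line.block[offset];                  // 4. pick the desired word out of the block
    }
}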

Cache – part 2

Java Memory Model (JMM)

The JMM is only the memory model of a process-level virtual machine, the Java Virtual Machine, but it is remarkably similar to the hardware combination of CPU, cache, and main memory in a physical computer. Understanding the JMM makes it easy to understand how the CPU, cache, and main memory relate to one another.

Volatile: It ensures that all reads and writes to this variable are synchronized to main memory and not read from Cache.

Volatile is widely used in embedded software written in C. Code that does not use volatile runs faster than code that does, but it cannot guarantee data consistency. The purpose of volatile is to tell the compiler that the variable’s value may change at any time, so every read or write must go to the variable’s actual memory address; the compiler must not, for efficiency, substitute reads and writes of a “temporary” copy held elsewhere. When the compiler sees the volatile keyword, it must emit memory-access instructions that read or write the variable directly. Without volatile, the compiler may, for efficiency, emit a single memory read before a loop starts to pull the variable into a register, operate on that register-held “temporary” copy inside the loop, and only write the register back to memory after the loop ends. If the variable in memory is changed in the meantime by something else (another thread, an interrupt handler, a signal handler, a DMA controller, or another hardware device), the data becomes inconsistent.

In addition, register-access instructions are faster than memory-access instructions, and the latter include the cache: a memory-access instruction may really be hitting cached data, but even then it is slower than a register access. The cache is also transparent to the compiler; as far as the compiler is concerned, a memory read/write instruction simply reads or writes memory, and keeping memory and the cache in sync is the CPU’s job.

case

Version 1 code:

public class VolatileTest {
    private static volatile int COUNTER = 0;

    public static void main(String[] args) {
        new ChangeListener().start();
        new ChangeMaker().start();
    }

    static class ChangeListener extends Thread {
        @Override
        public void run() {
            int threadValue = COUNTER;
            while (threadValue < 5) { // completely busy wait
                if (threadValue != COUNTER) {
                    System.out.println("Got Change for COUNTER : " + COUNTER + "");
                    threadValue = COUNTER;
                }
            }
        }
    }

    static class ChangeMaker extends Thread{
        @Override
        public void run() {
            int threadValue = COUNTER;
            while (COUNTER <5){
                System.out.println("Incrementing COUNTER to : " + (threadValue+1) + "");
                COUNTER = ++threadValue;
                try {
                    Thread.sleep(500);
                } catch (InterruptedException e) { e.printStackTrace(); }
            }
        }
    }
}
===================================
Incrementing COUNTER to : 1
Got Change for COUNTER : 1
Incrementing COUNTER to : 2
Got Change for COUNTER : 2
Incrementing COUNTER to : 3
Got Change for COUNTER : 3
Incrementing COUNTER to : 4
Got Change for COUNTER : 4
Incrementing COUNTER to : 5
Got Change for COUNTER : 5

Version 1 implementation:

The ChangeListener thread runs a simple task. It takes the current value of COUNTER, and then it listens for the value of COUNTER. As soon as the value of COUNTER changes, it prints out the new value through println. Until the value of COUNTER reaches 5. This process of listening is accomplished by busy waiting in a never-ending while loop.

The task that the ChangeMaker thread runs is also simple. It also takes the value of COUNTER and increments it by one every 500 milliseconds when it’s less than 5. Before the increment, print out the increment value using println.

Finally, in the main function, we start the two threads and watch the program execute. The output is unsurprising: the ChangeMaker thread increments COUNTER from 0 to 5 one step at a time. Since each increment happens every 500 milliseconds and ChangeListener is busy-waiting on COUNTER, every increment is observed by ChangeListener and the corresponding line is printed.

Explanation:

In the first example where volatile is used, all data is read and written to main memory. So naturally, between our ChangeMaker and our ChangeListener, we see the same COUNTER value.


Version 2 code:

Remove the volatile keyword from the definition of the COUNTER variable.

private static int COUNTER = 0;
====================================
Incrementing COUNTER to : 1
Incrementing COUNTER to : 2
Incrementing COUNTER to : 3
Incrementing COUNTER to : 4
Incrementing COUNTER to : 5

Version 2 implementation:

ChangeMaker still works, incrementing COUNTER by 1 every 500ms. But something strange happens to ChangeListener: it no longer reacts at all. In ChangeListener’s view, COUNTER still seems to be 0, as if every change to COUNTER were completely “invisible” to it.

Explanation:

With the volatile keyword removed, ChangeListener sits in a busy-waiting loop, continuously reading COUNTER from the current thread’s “cache”. The thread never gets a chance to synchronize the updated COUNTER value from main memory, so it is stuck in a loop where COUNTER is still 0.


Version 3 code:

Instead of having ChangeListener busy-wait the whole time, we have it sleep for a tiny 5 milliseconds inside the while loop.

static class ChangeListener extends Thread {
    @Override
    public void run() {
        int threadValue = COUNTER;
        while (threadValue < 5) {
            if (threadValue != COUNTER) {
                System.out.println("Sleep 5ms, Got Change for COUNTER : " + COUNTER + "");
                threadValue = COUNTER;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) { e.printStackTrace(); }
        }
    }
}
========================================
Incrementing COUNTER to : 1
Sleep 5ms, Got Change for COUNTER : 1
Incrementing COUNTER to : 2
Sleep 5ms, Got Change for COUNTER : 2
Incrementing COUNTER to : 3
Sleep 5ms, Got Change for COUNTER : 3
Incrementing COUNTER to : 4
Sleep 5ms, Got Change for COUNTER : 4
Incrementing COUNTER to : 5
Sleep 5ms, Got Change for COUNTER : 5

Version 3 implementation:

While the COUNTER variable still doesn’t have the volatile keyword set, ChangeListener seems to have “woken up.” After “sleeping” for 5 milliseconds in each loop through Thread.sleep(5), ChangeListener is able to fetch the value of COUNTER normally again.

Explanation:

The volatile keyword is still absent, but the brief Thread.sleep(5) gives the thread some breathing room. Because the thread is no longer so busy, it gets a chance to synchronize the latest data from main memory into its cache (interesting). So the next time ChangeListener looks at COUNTER, it sees the change ChangeMaker made.

Write policy

The Intel CPUs we use today are usually multi-core (four cores and eight threads, for example). Each CPU core has its own L1 and L2 Cache, behind which sit the L3 Cache and main memory shared by all the cores.

CPU caches are much faster than main memory, and the L1/L2 caches are much faster than the L3 cache. As a result, the CPU always tries to fetch data from the CPU Cache rather than going to main memory every time.

This hierarchy is similar to the Java memory model, where each thread has its own thread stack. When a thread reads COUNTER, it actually reads the copy cached in its local thread stack, not the value in main memory. If we only ever read the data, that is not a problem; but writing it raises two questions.

  1. Writing data to Cache is also faster than writing data to main memory. Should we write data to Cache or main memory?
  2. If we write directly to main memory, will the data in the Cache be invalidated?

There are two corresponding write strategies:

  1. Write-through (write direct)

The simplest write strategy is write-through: every write goes to main memory. Before writing, we check whether the data is already in the Cache. If it is, we update the Cache first and then write to main memory; if it is not, we update only main memory.

The write-through strategy is straightforward, but its drawback is obvious: it is slow. Whether or not the data is in the Cache, every write has to reach main memory. This is a bit like the volatile keyword above, where data is always synchronized to main memory.

  2. Write-back

Since reads come from the Cache by default anyway, we do not need to synchronize every write to main memory. Under the write-back strategy, writes go only to the CPU Cache; data is written to main memory only when that Cache Block is about to be “replaced” in the CPU Cache.

The write back strategy works like this: if the data we want to write is found in the CPU Cache, we simply update the data in the CPU Cache. At the same time, we mark the Block in the CPU Cache as dirty. By dirty, we mean that the Block in our CPU Cache is not the same as the main memory.

If we find that the Cache Block we are writing to holds data from another memory location, we check whether that Cache Block is marked as dirty. If it is dirty, we first write its data back to main memory, then write the current data into the Cache and mark the Cache Block as dirty. If it is not marked as dirty, we write the data directly into the Cache and then mark the Cache Block as dirty. -> Dirty: the data in the Cache differs from the data in main memory.

With write-back, loading data into the Cache also needs an extra step to synchronize dirty data: if the Cache Block we are loading into carries a dirty mark, we must first write its current contents back to main memory before loading the new data into it.

As you can see, with the write-back strategy, if most of our reads and writes hit the cache, then most of the time we never touch main memory at all, and performance is naturally much better than with write-through.
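A rough sketch of the write-back bookkeeping described above, again with invented names and a single cache line, just to make the dirty-flag logic concrete:

// Write-back sketch: a write only updates the cache line and marks it dirty;
// main memory is written when a dirty line has to be replaced by a different block.
public class WriteBackCacheLine {
    static final int LINE_SIZE = 64;
    boolean valid, dirty;
    long blockAddr;                              // which memory block this line currently holds
    byte[] block = new byte[LINE_SIZE];

    void write(long address, byte value, byte[] memory) {
        long newBlockAddr = address / LINE_SIZE;
        int offset = (int) (address % LINE_SIZE);

        if (valid && blockAddr != newBlockAddr && dirty) {
            // the line holds a different, modified block: flush it back to main memory first
            System.arraycopy(block, 0, memory, (int) (blockAddr * LINE_SIZE), LINE_SIZE);
        }
        if (!valid || blockAddr != newBlockAddr) {
            // load the block we are about to modify
            System.arraycopy(memory, (int) (newBlockAddr * LINE_SIZE), block, 0, LINE_SIZE);
            valid = true;
            blockAddr = newBlockAddr;
        }
        block[offset] = value;   // the write itself touches only the cache...
        dirty = true;            // ...and the block is now out of sync with main memory
    }
}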

CPU Cache – the MESI protocol

After CPU clock frequencies stopped scaling, modern computers found another way to improve CPU throughput: multi-core CPUs. However, each core in a multi-core CPU has its own L1 and L2 Cache, and the cores share only the L3 Cache and main memory. Because each core has its own cache and operates independently, cache consistency problems arise.

Cache consistency: for example, when the price of the iPhone changes, we want to write the latest price into memory. For performance, core 1 uses the write-back policy when writing to the CPU Cache: it writes the new price into its L2 Cache and marks the Cache Block as dirty. At this point the data has not actually been synchronized to the L3 Cache or main memory; core 1 expects it to be written back only when the Cache Block is eventually evicted.

On a single-core machine this is not a problem. On a multi-core machine, though, another core that reads the iPhone’s price from memory gets the stale price: core 1 has just updated it, but the update exists only in core 1’s L2 Cache, not in core 2’s L2 Cache or in main memory. This is the cache consistency problem: at this moment the caches of core 1 and core 2 disagree.

Therefore, we need to synchronize the cached data in two different cores. The following conditions need to be met:

  1. Write propagation: an update made in one CPU core’s Cache must propagate to the corresponding Cache Lines of the other CPU cores.
  2. Transaction serialization: the reads and writes performed on one CPU core must be seen in the same order by all the other cores.

As an example of transactions failing to serialize: CPU cores 3 and 4 observe the writes from cores 1 and 2 in different orders, so even though write propagation happened, the data still ends up wrong.

Transaction serialization is not only needed for cache consistency. Of the systems we use every day, the one that needs it most is the database: when many different connections access the database, we must ensure that transactions are serialized.

For serialization of transactions, two things need to be done:

  1. Data operations by one CPU core need to communicate synchronously to other CPU cores.
  2. If two CPU cores have a Cache of the same data, there needs to be a concept of “locking” for updates to this Cache. Data can be updated only after the lock of the corresponding Cache Block is obtained.

To solve the cache consistency problem, we must first solve the problem of data propagation between multiple CPU cores. The most common solution is called a “bus sniffing” mechanism. This strategy essentially broadcasts all read and write requests across the Bus to all CPU cores, which then “sniff” the requests and respond locally.

MESI

Write invalidate

The MESI protocol is a write-invalidate protocol. In a write-invalidate protocol, only one CPU core is responsible for writing the data; the other cores merely synchronize by reading the result of that write. After that core writes into its Cache, it broadcasts an “invalidate” request to all the other cores, which simply check whether they also hold a copy of that Cache Block and, if so, mark it as invalid.

Write broadcast

Write broadcast: in contrast to write invalidate, there is a protocol called write broadcast, in which a write request is broadcast to all CPU cores and the caches in every core are updated. However, this protocol consumes more bus bandwidth than write invalidate.

MESI – the four states a Cache Line can be marked with:

  1. M (Modified): the contents of this Cache Block have been updated but not yet written back to main memory.
  2. E (Exclusive): the data in this Cache Block is the same as in main memory, and no other CPU core has loaded it.
  3. S (Shared): the data in this Cache Block is the same as in main memory, and other CPU cores have loaded it as well.
  4. I (Invalidated): the data in this Cache Block is invalid; we cannot trust it.

The difference between “exclusive” and “shared” is this: in the exclusive state, the corresponding Cache Line is loaded only into the current CPU core’s Cache; no other core has it. In that case, if we want to write to this exclusive Cache Block, we can do so freely without telling any other CPU core.

Data in the exclusive state drops to the shared state when the core sniffs a request on the bus from another core to read the corresponding cache line. It becomes shared because another CPU core has also loaded that Cache Block from memory into its own Cache.

In the shared state, the same data sits in the Caches of multiple CPU cores. So when we want to update it, we do not modify it directly; instead we first broadcast a request to all the other cores to invalidate their copies, and only then update our own Cache. This broadcast operation is usually called an RFO (Request For Ownership): it acquires ownership of the corresponding Cache Block.

This operation is a bit like the read/write lock we use in multithreading. In the shared state, everyone can read the corresponding data in parallel. But to write, we need to acquire ownership of the current write location through a lock.

The whole of MESI can be represented as a finite state machine. The events that trigger transitions from each state may come from the current CPU core itself, or from signals broadcast on the bus by other CPU cores.
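A highly simplified sketch of that state machine, covering only a few representative transitions; it illustrates the protocol’s logic and is not a faithful hardware model:

// Simplified MESI transitions for one cache line, as seen from a single core.
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

class MesiLine {
    MesiState state = MesiState.INVALID;

    // This core reads the line.
    void onLocalRead(boolean otherCoreHasCopy) {
        if (state == MesiState.INVALID) {
            // load from memory; exclusive if no other core holds a copy, otherwise shared
            state = otherCoreHasCopy ? MesiState.SHARED : MesiState.EXCLUSIVE;
        }
        // MODIFIED / EXCLUSIVE / SHARED: a hit, state unchanged
    }

    // This core writes the line.
    void onLocalWrite() {
        if (state == MesiState.SHARED || state == MesiState.INVALID) {
            // broadcast an RFO / invalidate request to the other cores first
        }
        state = MesiState.MODIFIED;
    }

    // Another core's read request for this line is sniffed on the bus.
    void onRemoteRead() {
        if (state == MesiState.MODIFIED) {
            // write the dirty data back so the other core sees the latest value
        }
        if (state != MesiState.INVALID) state = MesiState.SHARED;
    }

    // Another core's write (invalidate) request is sniffed on the bus.
    void onRemoteWrite() {
        state = MesiState.INVALID;
    }
}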

The key to achieving cache consistency is two things. The first is write propagation, which means that what is written to one CPU core needs to be propagated to other CPU cores. More importantly, the second point is to ensure the serialization of transactions, so that our data is truly consistent, and the results of our programs running on different cores are consistent. This feature is important not only at the CPU cache level, but even more so at the database level.

memory

Memory is the storage unit among the five classic components of a computer; our instructions and data all have to be loaded into memory before the CPU can fetch and execute them.

On the Linux or Windows systems we use every day, programs do not access physical memory directly. Memory is divided into fixed-size pages, and a virtual memory address must be translated into a physical memory address to reach the location where the data actually lives. The memory addresses our programs see are all virtual memory addresses.

Simple page table

Virtual memory addresses are mapped to physical memory addresses through a mapping table, which on a computer is called the page table. The page table’s address-translation scheme splits a memory address into a page number and an offset.

Take a 32-bit memory address as an example: the high bits are the page number and the low bits are the offset within the page. The page table only needs to keep the mapping from virtual page number to physical page number, since memory within the same page is physically contiguous. With a page size of 4KB, we need a 20-bit page number and a 12-bit offset.

Memory address translation steps:

  1. Split the virtual memory address into a page number and an offset;
  2. Look up the physical page number corresponding to that virtual page number in the page table;
  3. Take the physical page number and append the offset to get the physical memory address (a small sketch of these steps follows below).
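A sketch of these three steps for 32-bit addresses and 4KB pages; the flat int[] used as the page table here is deliberately naive, matching the “simple page table” in the text:

// Simple (single-level) page table: 20-bit page number, 12-bit offset.
public class SimplePageTable {
    static final int OFFSET_BITS = 12;              // 4KB pages
    static final int PAGE_COUNT = 1 << 20;          // 2^20 virtual pages

    // pageTable[virtualPageNumber] = physicalPageNumber (a 4MB array per process)
    final int[] pageTable = new int[PAGE_COUNT];

    int translate(int virtualAddress) {
        int vpn = virtualAddress >>> OFFSET_BITS;                 // 1. split off the page number...
        int offset = virtualAddress & ((1 << OFFSET_BITS) - 1);   //    ...and the offset
        int ppn = pageTable[vpn];                                 // 2. look up the physical page number
        return (ppn << OFFSET_BITS) | offset;                     // 3. physical page number + offset
    }
}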

Here is the problem: with a 32-bit address space, the page table has to record 2^20 mappings to physical page numbers. That relationship is like an array of 2^20 entries, and at 4 bytes per entry, one page table needs 4MB of space (and that is just for 32 bits). Worse, every process has its own independent virtual address space, which means every process needs its own such page table, whether the process itself is only a few KB or needs several GB of memory.

Multilevel page table

Since the memory footprint of most processes is limited, the number of pages they need is also limited. There is no need to store all 2^20 physical page number mappings.

The memory address space of a whole process is usually “occupied at both ends and empty in the middle.” As the program runs, stack memory is allocated from the top of the address space downward, while heap memory is allocated from the bottom upward. So in a real process, the virtual memory actually in use typically forms two contiguous segments rather than completely scattered random addresses, and multilevel page tables are particularly well suited to this kind of address distribution.

Take a 4-level multilevel page table as an example. For the same virtual memory address, the offset part is the same as in the simple page table above, but the original page-number part is split into four segments, from high bits to low bits, forming the level-4 through level-1 page table indexes.

Correspondingly, a process has one level-4 page table. We first use the level-4 index to find the corresponding entry in the level-4 page table; that entry stores the location of a level-3 page table. Every entry in the level-4 table can point to its own level-3 table, so there may be multiple level-3 tables. Having found the right level-3 table, we use the level-3 index to find its entry, which points to a level-2 page table; likewise, the level-2 index leads us to a level-1 page table. The entry in the final level-1 page table holds the actual physical page number, and once we have that, we again use “page number + offset” to obtain the final physical memory address.
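As a sketch of the walk just described, here is a 4-level lookup where each level is indexed by 5 bits of the 20-bit page number (nested arrays stand in for the page tables; everything here is illustrative):

// 4-level page table walk: each level consumes 5 bits of the virtual page number,
// and only the tables that are actually needed ever get allocated.
public class MultiLevelPageTable {
    // level-4 table -> level-3 tables -> level-2 tables -> level-1 tables -> physical page number
    final long[][][][] level4 = new long[32][][][];

    long translate(int virtualAddress) {
        int offset = virtualAddress & 0xFFF;         // low 12 bits: offset within the page
        int i1 = (virtualAddress >>> 12) & 0x1F;     // 5-bit level-1 index
        int i2 = (virtualAddress >>> 17) & 0x1F;     // 5-bit level-2 index
        int i3 = (virtualAddress >>> 22) & 0x1F;     // 5-bit level-3 index
        int i4 = (virtualAddress >>> 27) & 0x1F;     // 5-bit level-4 index

        long[][][] l3 = level4[i4];                  // each lookup costs one extra memory access
        long[][] l2 = (l3 == null) ? null : l3[i3];
        long[] l1 = (l2 == null) ? null : l2[i2];
        if (l1 == null) return -1;                   // page not mapped: a page fault in a real system
        return (l1[i1] << 12) | offset;              // physical page number + offset
    }
}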

We may have many level 1, 2, or even 3 page tables. However, because the actual virtual memory space is usually contiguous, it is likely that we will need only a small number of level 2 page tables, or even one level 3 page table.

A multilevel page table is like a multi-way tree data structure, so it is often called a page table tree. Because virtual memory addresses are distributed contiguously, many pointers to nodes in the upper levels of the tree are empty, so the corresponding subtrees are simply not needed; not needing a subtree means not needing the corresponding level-2 and level-1 page tables. Finding the final physical page number is like walking a specific path down to a leaf node at the bottom of the tree.

Looking at a 4-level multilevel page table where each level is indexed by 5 bits: each page table at any level needs only 2^5 = 32 entries. If each entry is still 4 bytes, that is 128 bytes per table. A level-1 index table corresponds to 32 4KiB entries, i.e., 16KB, and a fully filled level-2 index table corresponds to 32 level-1 index tables, i.e., 512KB.

We can work it out: a process that occupies 1MB of memory, split into two contiguous 512KB regions, needs two independent, fully filled level-2 index tables (which implies 64 level-1 index tables), two independent level-3 index tables, and one level-4 index table. That is 69 index tables at 128 bytes each, roughly 9KB of space, about 1/500 of the 4MB a flat page table would take.

A multilevel page table is like a tree. Because the memory address of a process is relatively concentrated and continuous, the page table tree approach can greatly save the space required by the page table. And since each process requires a separate page table, the space savings can be significant.

However, multilevel page tables save storage at the cost of time; it is really a “trade time for space” strategy. Originally one address translation needed just a single memory access to find the physical page number and compute the physical address; with a 4-level page table, we need four memory accesses to find the physical page number.

Accelerated address translation: TLB

Because address translation is a very high frequency action, the performance of address translation becomes critical. At the same time, the instructions and data executed by the CPU are also stored in memory, so memory security needs to be considered.

The multilevel page table above performs poorly because it needs multiple memory accesses per translation. But the instructions a program executes are laid out sequentially in virtual memory and are mostly executed sequentially, which means instruction fetches have both spatial and temporal locality, and the same goes for the data being accessed. If we execute five instructions in a row, their memory addresses are contiguous, so all five usually sit in the same virtual page.

Those five consecutive address translations therefore all start from the same virtual page number, so we can cache the result of the previous translation and avoid going back to memory again and again just to translate addresses.

So computer engineers added a dedicated cache inside the CPU for exactly this purpose: the TLB (Translation Lookaside Buffer), an address-translation cache. It holds the results of previous translations, so when the same virtual page needs translating again we can look the answer up in the TLB instead of making several memory accesses to walk the page table.

The TLB is organized much like the CPU cache: it can be split into an instruction TLB and a data TLB (ITLB and DTLB), and it can likewise be layered by size into L1, L2, and so on. Similarly, it needs dirty flag bits, just like the CPU Cache, in order to implement a “write back” management strategy.

For performance, the whole address-translation process is done in hardware: the CPU chip contains a memory management unit (MMU) that performs the translation, and access to and interaction with the TLB is controlled by the MMU.
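A sketch of what the TLB buys us: recent virtual-page-to-physical-page translations are cached in front of the full page-table walk. The HashMap below merely stands in for the hardware TLB, and the walk itself is passed in as a placeholder function:

import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// TLB sketch: cache recent VPN -> PPN translations so that repeated accesses
// to the same page skip the multi-level page table walk entirely.
public class TlbSketch {
    private final Map<Integer, Integer> tlb = new HashMap<>();
    private final IntUnaryOperator pageTableWalk;   // the slow path: VPN -> PPN

    TlbSketch(IntUnaryOperator pageTableWalk) {
        this.pageTableWalk = pageTableWalk;
    }

    int translate(int virtualAddress) {
        int vpn = virtualAddress >>> 12;
        int offset = virtualAddress & 0xFFF;
        Integer ppn = tlb.get(vpn);                 // TLB hit: no page table walk needed
        if (ppn == null) {
            ppn = pageTableWalk.applyAsInt(vpn);    // TLB miss: do the full walk once
            tlb.put(vpn, ppn);                      // a real TLB also has to evict old entries
        }
        return (ppn << 12) | offset;
    }
}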

Security and memory protection

Normally, we have isolated processes by differentiating between virtual and physical memory addresses. However, due to the complexity of CPU and operating system execution logic, there are still many loopholes.

Executable space protection

This mechanism marks only the instruction region of a process’s memory as “executable”; other regions, such as the data region, are given no “execute” permission. From the CPU’s point of view, instructions and data are both just binary: if data is fed to the CPU and decodes into valid instructions, it can in fact be executed.

Hacker’s solution: Put some encoded data in the program’s data section, and then find a way for the CPU to load it as instructions, so that the CPU can execute the instructions it wants.

Corresponding method: Control the execution permission of the memory space in the process, so that the CPU can only execute the code of the instruction area. For the contents of the data area, even if other vulnerabilities are found to load into instructions for execution, they will be blocked because they do not have permissions.

SQL injection is a typical network attack of this kind: if the SQL executed by the server is assembled from strings, the parameters passed in a Web request can smuggle in extra SQL, causing the server to run statements it never intended. As a result, either the data in the database is destroyed, or the database is dumped and its data leaked.

Randomize the address space layout

Originally, a process’s memory layout was fixed, so any third party could easily know where the instructions were, where the stack was, where the data was, and where the heap was, which is very convenient for anyone intent on wreaking havoc. With address space layout randomization, these regions are no longer at fixed locations: the different parts of the process are placed at randomized positions in the address space, so an attacker cannot guess where they are. If you cannot guess, you cannot find the content you want to tamper with; a few blind modifications will simply crash the program instead of executing the attacker’s code.

A similar defense exists at the application level: when storing encrypted user account passwords, each user is given a random salt that is mixed into the encryption of that user’s password.

In order to save the memory space required by the page table, we adopt a data structure called the multilevel page table. However, multilevel page tables save space at the cost of multiple memory accesses, so we place the TLB, the address-translation cache, next to the MMU that actually performs the translation. Like the CPU Cache, the TLB is split into instruction and data parts and can be layered into L1 and L2.

By making the contents of the data space unexecutable, you can avoid attacks such as injection attacks. By randomizing the allocation of memory space, you can prevent the code in a process’s memory from being predicted and thus vulnerable to attack.

The bus

There are many different hardware devices in a computer; if N different devices each had to be wired to every other one, the complexity of the system would be N^2. To simplify the system we introduce the bus, turning that N^2 complexity into N. We design a shared set of wires: whatever device the CPU wants to talk to, it puts the commands and data on this line, and whatever a device wants to send to the CPU also goes onto this line. The line is like a highway: devices do not need a private road between every pair of them, just an on-ramp to the highway.

A bus is just a set of wires; our CPU, memory, and input/output devices all communicate with each other over it. The same design idea is common in software development: when building large systems we often use a design pattern called the event bus.

In the event bus design pattern, each module triggers the corresponding event and sends the event object to the bus. That is, each module is a publisher. Each module registers itself on the bus, listens for events on the bus, and decides whether to perform specific processing or response based on the object type or object content of the event. -> I also send up, I also take it from the top.

In this design, modules registered on the bus are loosely coupled. Modules are not dependent on each other. Both code maintenance and future extensions will be very convenient.

Modern Intel CPU architectures typically have several buses. First, for communication between the CPU, the cache, and memory there are usually two buses; this approach is called the dual independent bus. Inside the CPU there is a fast local bus and a relatively slower front-end bus. Modern CPUs usually include dedicated cache chips, and the high-speed local bus is used to talk to the CPU Cache, while the front-end bus is used to talk to main memory and the input/output devices. The local bus is sometimes called the back-end bus, in contrast to the front-end bus, and the front-end bus also goes by many other names, such as the processor bus or the memory bus.

Local bus = back-end bus, front-end bus = processor bus = memory bus = system bus

The northbridge chip inside the CPU splits the front-end bus described above into three buses. The front-end bus is really the system bus: the memory interface inside the CPU talks directly to the system bus, which then connects to an I/O bridge. On one side the I/O bridge connects to the memory bus, letting the CPU talk to memory; on the other side it connects to an I/O bus that leads to the I/O devices.

In a real computer, this bus level is split even more finely: depending on the device, there are separate PCI buses, ISA buses, and so on.

On a physical level, we can think of a bus as a set of “wires”, usually consisting of three types:

  1. Data lines, which carry the actual data: in the bus analogy, they carry the “passengers.”
  2. Address lines, which indicate where the data should go, whether to a location in memory or to an I/O device: like a slip of paper saying where the passenger wants to get off.
  3. Control lines, which control access to the bus: if the bus is literally a bus, someone who wants to ride it has to signal the driver, and that signal is our control signal.

Although the bus reduces the coupling between devices and reduces the complexity of system design, it also introduces a new problem, that is, the bus cannot provide communication function for multiple devices at the same time.

Our bus is shared by many devices, so when several devices want to use it at the same time we need a mechanism to decide which device gets it. This mechanism is called bus arbitration.

Input-output device

An I/O device is usually not a single component. Most I/O devices consist of two parts: the interface and the actual device. The hardware device does not connect to the bus and talk to the CPU directly; it does so through its interface, which connects to the bus and communicates with the CPU over it.

CPU – Bus – interface – IO device

The parallel ports, serial ports, and USB ports we usually hear about are all interfaces built into the computer’s motherboard. Actual hardware devices, such as a printer using the parallel port, an old-style mouse using the serial port, or a USB stick using a USB port, have to be plugged into these interfaces to work and to communicate with the CPU.

Take a SATA hard disk as an example (shown in the original figure): the green circuit board together with the gold connector teeth forms the interface circuit; the connector teeth are what mate with the interface built into the motherboard, and the green circuit board is the control circuitry.

The interface itself is a circuit board; the CPU never deals with the actual hardware, only with this interface board. The three types of registers we talk about for a device actually live on the device’s interface circuit, not on the device proper.

Besides the interfaces built into the motherboard, some interfaces are integrated into the device itself; such a device then connects its integrated interface to the motherboard through a cable. This separation of interface from device actually dates back to the era of open-architecture computing.

If you’re running Windows, you can open device Manager, which has all sorts of Devices, Controllers, Adaptors. All of these, in fact, are descriptions of different angles of input and output devices. Called Devices, the focus is on the actual I/O Devices themselves. Called Controllers, they focus on the control circuits inside the I/O device interfaces. They are called Adaptors because they value the interface as an adapter that can be plugged into different real devices.

How does the CPU control IO devices

Whether the interface is built into the motherboard or integrated into the device, it contains, in addition to the three types of registers, the corresponding control circuitry. It is through this control circuitry that the CPU, by sending signals to the interface board, can control the actual hardware.

Functions of three types of registers:

  1. Status register: tells the CPU whether the device is currently busy, in which case sending more data or commands is pointless. Only once the previous action completes and the status register flips back to ready can the CPU send the next character or command.
  2. Data register: the CPU writes the data destined for the I/O device here. For example, to print “GeekTime” we first send a “G” to the corresponding I/O device.
  3. Command register: the CPU writes a command here, for example telling the printer to print. The control circuit inside the printer then does two things: it sets the status register to not-ready, and it actually drives the printer to print. (A small polling sketch follows below.)
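A pseudo-device sketch of how software drives these three registers, busy-polling the status register before each character. The register names and the PrinterPort class are invented for illustration; in MMIO these fields would be fixed memory addresses defined by the device:

// Driving a printer-like device through its three interface registers.
public class PrinterPort {
    volatile boolean ready = true;   // status register: is the device idle?
    volatile char data;              // data register: the character to print
    volatile int command;            // command register: 1 = "print the data register"

    void printChar(char c) {
        while (!ready) {             // 1. poll the status register until the device is ready
            // busy wait
        }
        data = c;                    // 2. put the character into the data register
        command = 1;                 // 3. issue the print command; the device's control circuit
                                     //    will clear `ready` until the character is printed
    }

    void print(String text) {
        for (char c : text.toCharArray()) printChar(c);  // e.g. print("GeekTime")
    }
}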

In practice, printers usually have not only data registers, but also data buffers. The CPU does not actually hand the document to the printer one character at a time, but instead transfers the entire document to the printer’s memory or data buffer to print it all at once.

Signal and address

Communication between the CPU and the I/O device is also performed through the machine instructions supported by the CPU.

MIPS, for instance, has no dedicated class of machine instructions for talking to I/O devices; it communicates with I/O devices the same way it accesses main memory, through “memory addresses.” To keep an already complex CPU as simple as possible, the computer maps the I/O device’s registers and internal memory into the main memory address space: segments of that address space are reserved for different I/O devices, and when the CPU wants to talk to one of them it simply sends data to those addresses. The address information travels over the address lines from the previous section, and the data naturally travels over the data lines.

The I/O device, for its part, monitors the address lines, and when the CPU sends data to its address, it latches the data from the data lines into the registers and memory inside the device. In this way the CPU can send commands, query status, or transfer data to the I/O device. This approach is called memory-mapped I/O (MMIO).

The MIPS CPU is deliberately simple, so it offers only MMIO. Intel x86 machines, with more than 2,000 instructions, also have dedicated instructions for communicating with I/O devices: the in and out instructions.

Intel cpus also support MMIO, but they can also support port mapped I/O (PMIO), or independent I/O, through specific instructions. The PMIO communication differs from the MMIO core in that the device address accessed in PMIO is no longer in the memory address space, but in a dedicated port. This port is not a hardware socket, but an abstract concept for communicating with the CPU.

Whether via PMIO or MMIO, the CPU just sends binary data to the I/O device’s address; the device’s own interface circuitry decodes it into a command the device supports and drives the actual hardware through its control circuit. The CPU does not need to care what operations the device itself supports; all it does is push bits across the bus. This is similar to the Command pattern in software design: what travels on the bus is a data object, and each device receives these objects and, based on their content, decodes and executes the actual commands.

Here is my own graphics card’s resource information in Device Manager. You can see a Memory Range, which is the memory address range the device is mapped to: that is the MMIO access we just described. There is also an I/O Range, which is the PMIO side, the port addresses used to reach the I/O device. Finally there is an IRQ, the interrupt signal that this device can raise.

The CPU does not have a specific instruction for every operation on every I/O device; if it did, every newly invented I/O device would force us to extend the CPU’s instruction set.

In a computer system, the communication between the CPU and the I/O device is handled in this way:

First, on the I/O device side, we split the I/O device into interface circuits that communicate with the CPU and the actual I/O device itself. The interface circuit has corresponding status registers, command registers, data registers, data buffers and device memory, etc. Interface circuits communicate with the CPU through the bus and receive instructions and data from the CPU. The control circuit in the interface circuit decodes the received instructions to actually operate the corresponding hardware devices.

On the CPU side, to the CPU, it doesn’t see individual devices, it sees individual memory addresses or port addresses. The CPU simply transmits or reads data to these addresses. There is no essential difference between the required instructions and those that manipulate memory locations. The actual operation of the corresponding I/O hardware is based on the definition of the transmitted command data at the software level, rather than providing special new instructions.


The same is true for communication between the CPU and the bluetooth mouse, an input device.

To the CPU, this is just a normal USB device on the bus, no different from other USB devices such as USB disks, USB network cards, etc. These devices just send their own data to the operating system through the USB protocol. USB doesn’t care what the data is. The USB Bluetooth mouse receiver has the same data in this layer as a normal USB mouse.

To make these USB devices work, the operating system needs to process the data that comes in, and the data is processed by drivers, so different types of USB devices need different drivers.

Back to the USB Bluetooth mouse receiver, events generated by the mouse through bluetooth send -> Bluetooth receive -> USB send -> USB receive -> driver finally reach the operating system, bluetooth and USB are just the way of data transmission, in any other TCP/ IP transmission is the same. The essence is to transfer specific data to the operating system for processing.

I/O performance – io_wait

Disk performance indicators include response time and data transfer rate. The data transmission rate is the throughput rate of a hard disk.

There are two kinds of hard disks in common use:

  1. HDD – mechanical hard disk – usually a SATA 3.0 interface
  2. SSD – solid-state drive – usually one of two interfaces: SATA 3.0 or PCI Express

I/O sequential access and random access

The commonly used SATA 3.0 interface has a bandwidth of 6Gb/s. The “b” here means bits, so this bandwidth is equivalent to about 768MB of data per second. The data transfer rate of an everyday HDD is around 200MB/s.
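As a quick sanity check of these figures, here is the arithmetic spelled out (following the text’s convention of 1Gb = 1024Mb):

```python
# Sanity-checking the figures above (using 1Gb = 1024Mb, as the text does).
sata_bandwidth_gbit = 6                          # SATA 3.0: 6Gb/s
sata_mb_per_s = sata_bandwidth_gbit * 1024 / 8   # 8 bits per byte
hdd_mb_per_s = 200                               # everyday HDD transfer rate
print(sata_mb_per_s)                             # 768.0 MB/s, the link ceiling
print(round(hdd_mb_per_s / sata_mb_per_s, 2))    # 0.26: the HDD is far from saturating it
```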

Replace the HDD with a Crucial MX500 SSD and it can transfer data at nearly 500MB/s, more than twice as fast as an HDD. However, for a SATA-interface drive this is also close to the peak, because the SATA interface itself is only that fast.

SSDs can be faster still. A PCI Express drive can read at about 2GB/s, roughly 10 times the rate of an HDD, and write at about 1.2GB/s.

  • Acc.Time – response time: the time from when a program issues a write request to the hard disk until that request returns.
  • Seq – the data transfer rate for sequential reads and writes of the hard disk, i.e. its throughput rate.
  • 4K – how much data the program can read per second when it reads 4KB blocks from random locations on the disk.

For Acc.Time, the response time of the two SSDs is in the tens of microseconds, while HDDs typically range from a few milliseconds to tens of milliseconds. The difference here is not 10 times but dozens or even hundreds of times.

Judging by response time and throughput, our hard drives seem to perform well. Even a cheap HDD can receive a request from the CPU and return it within milliseconds, and it can transfer about 200MB of data per second. A record we write into a database is usually about 1KB in size, so dividing 200MB by 1KB suggests roughly 200,000 inserts per second. But this number is not accurate.

The answer lies in how the hard drive is read and written: disk performance is completely different in sequential and random read/write scenarios, which is exactly what the 4K metric above measures.

On this metric, the performance gap between SATA 3.0 and PCI Express drives becomes very small, because at this point the interface speed is no longer the bottleneck for disk access. Even over PCI Express, the data transfer rate for random reads and writes is only about 40MB/s, a few dozen times lower than in the sequential case.

Based on the figures above, 40MB / 4KB = 10,000, meaning that in one second the SSD can randomly read a 4KB block about 10,000 times, with writes managing slightly more. This number of random reads and writes per second is called ==IOPS==, the number of I/O operations per second. In practice we care about IOPS even more than about response time; IOPS and the data transfer rate (throughput) are the core indicators of I/O performance.
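Pulling the last few paragraphs together, here is the back-of-the-envelope arithmetic behind both the misleading 200,000-inserts-per-second estimate and the roughly 10,000 IOPS figure:

```python
# The two estimates contrasted in the surrounding paragraphs.
record_kb = 1                        # a typical ~1KB database record
hdd_seq_mb_per_s = 200               # HDD sequential transfer rate
naive_inserts_per_s = hdd_seq_mb_per_s * 1024 / record_kb
print(naive_inserts_per_s)           # ~204,800: the misleading "sequential" estimate

ssd_random_mb_per_s = 40             # random 4K throughput from the figures above
block_kb = 4
iops = ssd_random_mb_per_s * 1024 / block_kb
print(iops)                          # ~10,240: roughly the 10,000 IOPS cited
```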

In real application development, data access is more often random than sequential. When we talk about “concurrency” on a server, we mean many different processes and requests hitting it at once, and the data they access on disk can hardly be laid out in order. In this situation, random read/write IOPS is the core indicator of server performance, and an HDD typically manages an IOPS of only around 100, nowhere near the naive estimate of 200,000 inserts per second.
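If you want to see this effect on your own machine, the following rough Python sketch compares sequential and random 4KB reads. The file name is a placeholder, and because the sketch does not bypass the page cache the way a real benchmark tool does, use a file much larger than RAM for meaningful numbers.

```python
import os, random, time

# Sketch of a sequential-vs-random 4KB read test. Assumes "testfile.bin" is a
# large pre-existing file (ideally much bigger than RAM, since this does not
# bypass the page cache).
PATH = "testfile.bin"
BLOCK = 4 * 1024
COUNT = 10_000

fd = os.open(PATH, os.O_RDONLY)
num_blocks = os.fstat(fd).st_size // BLOCK

def measure(offsets):
    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, BLOCK, off)                  # read 4KB at the given offset
    elapsed = time.perf_counter() - start
    iops = len(offsets) / elapsed
    return iops, iops * BLOCK / (1024 * 1024)     # IOPS and MB/s

sequential = [i * BLOCK for i in range(min(COUNT, num_blocks))]
random_offsets = [random.randrange(num_blocks) * BLOCK for _ in range(COUNT)]

print("sequential: %.0f IOPS, %.1f MB/s" % measure(sequential))
print("random:     %.0f IOPS, %.1f MB/s" % measure(random_offsets))
os.close(fd)
```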

Locating io_wait

Even SSDs that use the PCI Express interface reach an IOPS of only around 20,000. Our CPUs are typically clocked above 2GHz, which means they can perform on the order of 2 billion operations per second. Even if issuing a read or write to the hard disk costs the CPU many clock cycles, the number of instructions a CPU can execute per second is still several orders of magnitude higher than the number of operations the hard disk can perform. This is why it is so often said in application development that “the performance bottleneck is in I/O”: much of the time, after the CPU has issued its instructions, it has to wait for the I/O operation to complete before it can move on.

Problem: when we actually run into a server performance problem, how do we know whether it is because the CPU is waiting for I/O operations to complete?

  1. With the top command, you can check whether the CPU is waiting for I/O operations to complete:
```
$ top

top - 06:26:30 up 4 days, 53 min,  1 user,  load average: 0.79, 0.69, 0.65
Tasks: 204 total,   1 running, 203 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.0 us,  1.7 sy,  0.0 ni, 77.7 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem:   7679792 total,  6646248 used,  1033544 free,   251688 buffers
KiB Swap:        0 total,        0 used,        0 free.  4115536 cached Mem
```

Look at the wa indicator on the %Cpu(s) line. It represents iowait, the percentage of CPU time spent waiting for I/O operations to complete.
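The wa figure that top prints comes from the per-CPU time counters the kernel exposes in /proc/stat. A minimal sketch that computes the iowait percentage over a one-second interval, assuming a Linux machine, might look like this:

```python
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:9]]

before = cpu_times()
time.sleep(1)
after = cpu_times()
delta = [b - a for a, b in zip(before, after)]
# Index 4 is iowait, per the field order documented in proc(5).
print("iowait: %.1f%%" % (100.0 * delta[4] / sum(delta)))
```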

  2. Once you know iowait is high, look at what the actual I/O looks like. The iostat command shows the actual disk read/write status:
```
$ iostat

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.02    0.01    2.18    0.04    0.00   80.76

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.81         2.02        30.87     706768   10777408
```

This command shows not only iowait as a percentage of CPU time, but also more specific metrics, broken down per hard disk installed in the machine.

tps corresponds to the disk’s IOPS, while kB_read/s and kB_wrtn/s correspond to the data transfer rate. With the actual tps, kB_read/s, and kB_wrtn/s figures for the disk, we can basically tell whether a machine’s performance is stuck on I/O.
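iostat derives these per-disk figures from the kernel counters in /proc/diskstats. A rough sketch that reproduces tps, kB_read/s, and kB_wrtn/s for one device (the device name "sda" here is just an example) could look like this:

```python
import time

def disk_counters(device="sda"):        # "sda" is just an example device name
    # /proc/diskstats columns: major minor name reads_completed reads_merged
    #   sectors_read ms_reading writes_completed writes_merged sectors_written ...
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[3]), int(parts[5]), int(parts[7]), int(parts[9])
    raise ValueError("device not found: " + device)

reads1, rsect1, writes1, wsect1 = disk_counters()
time.sleep(1)
reads2, rsect2, writes2, wsect2 = disk_counters()

# Sectors are 512-byte units; convert to kB/s like iostat's kB_read/s and kB_wrtn/s.
print("tps:       %.2f" % ((reads2 - reads1) + (writes2 - writes1)))
print("kB_read/s: %.2f" % ((rsect2 - rsect1) * 512 / 1024))
print("kB_wrtn/s: %.2f" % ((wsect2 - wsect1) * 512 / 1024))
```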

  3. The next step is to find out which process is the source of these reads and writes. For that, you need the iotop command:
```
$ iotop

Total DISK READ:        0.00 B/s | Total DISK WRITE:       15.75 K/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:      35.44 K/s
  TID  PRIO  USER      DISK READ  DISK WRITE  SWAPIN      IO>    COMMAND
  104  be/3  root       0.00 B/s    7.88 K/s   0.00 %   0.18 %  [jbd2/sda1-8]
       be/4  root       0.00 B/s    0.00 K/s   0.00 %   0.00 %  rsyslogd -n [rs:main Q:Reg]
       be/4  www-data   0.00 B/s         K/s   0.00 %   0.00 %  nginx: worker process
```

With iotop you can see which processes are actually doing the most I/O, so you can target and optimize the corresponding programs. In the examples above, both wa and tps are small. Let’s use the stress command to simulate a heavy I/O load on Linux and see how iowait behaves.

  4. Run `stress -i 2` on a machine with a single CPU core on a cloud platform, and let stress simulate two processes constantly writing data from memory to the hard disk:
```
$ stress -i 2

$ top

top - 06:56:02 up 3 days, 19:34,  2 users,  load average: 5.99, 1.82, 0.63
Tasks:  88 total,   3 running,  85 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.0 us, 29.9 sy,  0.0 ni,  0.0 id, 67.2 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1741304 total,  1004404 free,   307152 used,   429748 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1245700 avail Mem

$ iostat 2 5

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.03    0.00   67.92   27.04    0.00    0.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda           39762.26         0.00         0.00          0          0

$ iotop

Total DISK READ:        0.00 B/s | Total DISK WRITE:        0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER       DISK READ  DISK WRITE  SWAPIN      IO>    COMMAND
29161  be/4  xuwenhao    0.00 B/s    0.00 B/s   0.00 %  56.71 %  stress -i 2
29162  be/4  xuwenhao    0.00 B/s    0.00 B/s   0.00 %           init
```

As you can see in the output of top, the CPU now spends a lot of time in sy and wa, that is, in system calls and iowait.

Looking at the disk I/O through iostat, you will see that tps quickly climbs to around 40,000, which is the IOPS of this disk.

And if we look at iotop at this point, we will see that the I/O usage comes from the two processes spawned by stress.
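If you are curious where per-process I/O numbers can come from, a rough, poor-man’s approximation of iotop can be built from the counters in /proc/&lt;pid&gt;/io (real iotop uses the kernel’s taskstats interface and shows rates rather than cumulative totals). A sketch, assuming a Linux machine and ideally root privileges:

```python
import glob

# Poor-man's iotop: rank processes by the bytes they have actually read from /
# written to the storage layer, using read_bytes / write_bytes in /proc/<pid>/io.
rows = []
for path in glob.glob("/proc/[0-9]*/io"):
    pid = path.split("/")[2]
    try:
        with open(path) as f:
            stats = dict(line.split(": ") for line in f.read().splitlines())
        with open("/proc/%s/comm" % pid) as f:
            comm = f.read().strip()
    except (PermissionError, FileNotFoundError, ProcessLookupError):
        continue  # process exited, or we lack privileges (run as root for a full view)
    rows.append((int(stats["read_bytes"]) + int(stats["write_bytes"]), pid, comm))

for total_bytes, pid, comm in sorted(rows, reverse=True)[:10]:
    print("%8s  %-20s %12d bytes" % (pid, comm, total_bytes))
```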

conclusion

In the sequential read case, both HDDs and SSDs look very good. It is random read tests that reveal the true story of hard drive performance. In most application development scenarios, what we care about is not the volume of sequential reads and writes but the number of input/output operations per second, the core performance indicator IOPS.

You’ll find that even SSDs using PCI Express have an IOPS of only around 20,000, which is nowhere near the CPU’s capacity of roughly 2 billion operations per second. So a lot of the time, our programs respond slowly because the CPU is waiting for I/O operations to complete.

Under Linux, we can look at the overall load of the server with the top command. When an application responds slowly, we can use it first to see whether the CPU is waiting for I/O to complete. Then we can run iostat to view the read/write status of each disk, and use iotop to determine which process is doing the most I/O.

Combining these commands helps you quickly determine whether your application is hitting an I/O bottleneck and which programs the bottleneck comes from, so that you can optimize accordingly.

Problem solving

Question: why is random read performance worse than random write? I used to think it was the other way around.

The disk also has an internal write queue (similar in spirit to a WAL): a write only needs to be appended to that queue before the request can return, whereas a read that misses the cache has no such shortcut and must actually fetch the data from the medium.