
Problems with seckill logs

For services with low concurrency, we can simply write all required logs to a log file on disk. During peak periods, however, a seckill service may have to handle more than 100,000 requests per second (QPS) per node. Each request generates at least two log entries between entering the seckill service and failing or succeeding. In other words, a single seckill node may produce more than 300,000 log entries per second at peak.

What does that number actually mean?

The relevant disk performance metric is IOPS, the number of read/write operations per second. A well-performing solid-state drive delivers roughly 30,000 IOPS. That means a single seckill node generates ten times more log entries per second than the SSD can absorb! If these logs were written to disk immediately on every request, the disk could not keep up, let alone shipping the logs over the network to a monitoring system.

So the first problem seckill logs face is that the number of log entries per second far exceeds the disk's IOPS; writing directly to disk would hurt the service's performance and stability.

In addition, before the service outputs a log entry, it must allocate memory to concatenate the log message, and release that memory once the entry has been written. Where does this lead?

For languages with garbage collectors, such as Java and Go, frequent memory allocation and release can trigger frequent garbage-collection cycles. During collection, CPU usage rises significantly, which affects service performance and stability.

What about languages without a garbage collector, such as C++? They usually allocate from the heap, and large volumes of heap allocations and frees cause memory fragmentation, which also degrades performance.

Therefore, the second problem with seckill logs is that the sheer volume of logs forces the service to allocate and release memory frequently, which hurts performance.

Finally, seckill logs face the risk of losing large numbers of entries when the service exits abnormally.

Because the seckill service handles such a high volume of requests, at any given moment there are many requests whose logs have not yet been written to disk. If the seckill service suddenly goes down, those logs may be lost.

In highly concurrent systems this is unavoidable; the real question is how to control the time window for writing logs so that the number of lost entries stays within a small, acceptable range.

This is the third problem with seckill logs. As the discussion above shows, log collection is a major challenge in high-traffic business scenarios such as seckill, and a performance problem that must be solved.

How can I optimize the performance of seckill logs?

As we learned above, seckill logs face high disk I/O, memory pressure, and the risk of large-scale loss; ultimately, the log volume is simply too large for conventional logging approaches to handle. What can we do? Let me address these problems one by one.

Optimizing disk I/O performance

First, let’s look at the problem of seckill logs exceeding the disk’s IOPS.

Last time I introduced multilevel caching — do you remember the difference between memory performance and disk performance? Yes, memory is far faster than disk. So can we use memory to relieve disk pressure and improve log-writing performance? The answer is yes.

Linux has a special type of file system: tmpfs, the temporary file system, which is memory-based. When your program writes to a file on tmpfs, it looks like a disk write, but the data actually goes to memory. Although they live in memory, files in a temporary file system are not lost when the application exits, because they are managed by the operating system.

Because cloud architectures ensure high availability of cloud hosts, files in the temporary file system will not be lost as long as the operating system keeps running and nobody deletes them. We can therefore have the seckill service write its log files to a temporary file system. Writing logs to a temporary file system performs at least 100 times better than writing directly to disk.

Of course, log files in the temporary file system cannot grow indefinitely, or the temporary file system will eventually exhaust its memory. What then? One approach is to move the log file to disk every time it reaches, say, 20MB, and then clear it from the temporary file system. Disks perform far better on large sequential writes than on frequent small ones, so this also reduces write pressure.

Optimizing memory allocation performance

I wonder whether you have learned C. If so, you should be familiar with the malloc and free functions: malloc allocates memory from the heap, and free returns used memory to it. Heap memory is managed by the system, and when it is heavily fragmented, finding a block of a suitable size may require many comparisons, which significantly reduces program performance.

When the seckill service outputs a large number of logs, memory is allocated and returned constantly. If it is allocated in the ordinary way, performance suffers under high concurrency. We therefore need efficient memory management that can allocate quickly and avoid frequently triggering the garbage collector.

How do you do that?

We can take a cue from bike-sharing operators such as Mobike, Hellobike, and Qingju: bikes are picked up and returned at high-traffic locations, rather than at the operator’s warehouses. Similarly, if we manage the allocation and return of memory according to the actual workload, we can avoid performance problems such as memory fragmentation and garbage collection.

How do you do that?

The seckill system needs to include extra information in its logs for later troubleshooting and statistics: the user ID, source IP, purchased product ID, timestamp, and so on. A log file is plain text, however, while these fields may be integers or strings, so they must all be converted to strings before being written out. In high-level languages such as Java and Go, a string is a wrapper object over an underlying character array, and concatenating strings directly causes the program to allocate new string objects to hold the results.

For example, the following code triggers memory allocation.

```go
str := "hello " + userName
```

How do you avoid string memory allocation? Generally, we can work with a character array directly and concatenate parameters into it. A typical approach is to implement a log object backed by a character-array buffer, with methods such as AppendInt and AppendString for appending parameters. Let’s sketch this below.

```go
package main

import "strconv"

const maxDataSize = 65536

type Logger struct {
	data []byte
}

func NewLogger() *Logger {
	return &Logger{data: make([]byte, 0, maxDataSize)}
}

func (l *Logger) AppendInt(data int) {
	d := strconv.Itoa(data)
	l.data = append(l.data, d...)
	l.tryFlush()
}

func (l *Logger) AppendString(data string) {
	l.data = append(l.data, data...)
	l.tryFlush()
}

// Close writes the buffered data to the log file.
// It is usually called before the program exits.
func (l *Logger) Close() {
	l.Flush()
}

// Flush writes l.data to the log file, then clears the buffer.
func (l *Logger) Flush() {
	// ... write l.data to the log file ...
	l.data = l.data[:0]
}

// tryFlush flushes the buffer once it reaches the size limit.
func (l *Logger) tryFlush() {
	if len(l.data) >= maxDataSize {
		l.Flush()
	}
}
```

In the implementation above, each Append function appends its parameter to the buffer, and as long as the buffer has room, no new memory is allocated for the concatenated data.

How do we make sure the buffer is sufficient? The answer is the tryFlush function at the end, which keeps the buffer from growing too large. When tryFlush finds that the data length exceeds the configured maximum, it writes the data to the log file and clears the buffer. Throughout this process, the Logger never needs to return or reallocate the buffer.

Of course, this is just a simple example; a real production logger is far more capable. If you are interested, see the implementations of loggers such as Zap and Logrus.
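One related technique that production loggers such as Zap use is buffer pooling with sync.Pool: buffers are reused across log calls instead of being reallocated, which further reduces garbage-collector pressure. Here is a minimal sketch; the function and field names are illustrative, not from any particular library.

```go
package main

import (
	"strconv"
	"sync"
)

// bufPool hands out reusable byte-slice buffers; a new buffer is
// allocated only when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 1024) },
}

// formatLog concatenates log fields into a pooled buffer rather than
// using string concatenation, which would allocate on every call.
func formatLog(userID int, msg string) string {
	buf := bufPool.Get().([]byte)[:0]
	buf = append(buf, "user="...)
	buf = strconv.AppendInt(buf, int64(userID), 10)
	buf = append(buf, " msg="...)
	buf = append(buf, msg...)
	s := string(buf) // one unavoidable copy to produce an immutable string
	bufPool.Put(buf) // return the buffer for reuse (a production pool would store *[]byte)
	return s
}
```

The same idea extends naturally to the Logger above: instead of one long-lived buffer per Logger, each goroutine can borrow a buffer from the pool, fill it, and hand it to a writer.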

How can I reduce the risk of losing logs?

Earlier we learned that the seckill service can lose some logs when an exception occurs under high concurrency, and also that its logs cannot be written to the log file in real time. Do you notice that these two facts are in tension? Indeed, under high concurrency we cannot completely eliminate the risk of losing logs; we can only reduce its probability. Why?

Under high concurrency, we cache logs as much as possible in the program’s local memory, i.e. the Logger buffer, and write them to the log file in batches once enough have accumulated, which gives good write performance. However, if the program exits unexpectedly while the buffered logs are not yet large enough for a batch write, that portion of the logs may be lost.

What to do?

Program exceptions come in two kinds. One is controllable exceptions that can be caught, such as the panic triggered by an out-of-bounds array access in Go. The other is uncontrollable exceptions that cannot be caught, such as the fatal error caused by concurrent reads and writes to an unlocked map in Go.

Given these two kinds of exceptions, how do we write as much of the buffer to the log file as possible?

For the first kind, we usually catch the exception and call the Close function from the example code to write the logs to the log file before the program exits. For the second kind, we can use a timer to periodically write the buffered data to the log file, for example calling Flush every 100 milliseconds.
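Putting both safety nets together, a minimal sketch in Go might look like the following. The type and function names are illustrative; Flush is assumed to perform the actual file write, and a production logger would guard the buffer with a mutex since the ticker goroutine and handlers touch it concurrently.

```go
package main

import (
	"fmt"
	"time"
)

// bufLogger stands in for the buffered Logger sketched earlier.
type bufLogger struct{ data []byte }

// Flush writes the buffer to the log file (elided here), then clears it.
func (l *bufLogger) Flush() {
	// ... write l.data to the log file ...
	l.data = l.data[:0]
}

func (l *bufLogger) Close() { l.Flush() }

// serve runs a handler with two safety nets: a deferred recover that
// flushes the buffer on catchable panics, and a ticker that flushes
// every 100ms so an uncatchable crash loses at most ~100ms of logs.
func serve(l *bufLogger, handler func()) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	done := make(chan struct{})
	defer close(done)
	go func() {
		for {
			select {
			case <-ticker.C:
				l.Flush() // periodic flush bounds the loss window
			case <-done:
				return
			}
		}
	}()

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
			l.Close() // write remaining logs before shutting down
		}
	}()
	handler()
}
```

Note that the second safety net is best-effort by design: if the process dies between two ticks, the entries buffered since the last Flush are gone, which is exactly the small, bounded loss window discussed above.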