Background
The company's service system is being optimized and restructured. To realize full-link monitoring, call logs between all service systems need to be collected.
Data volume: 2 billion+ records per day
Machine cost: 3 Kafka clusters, 2 Logstash collection machines
Technology stack: Java, MQ, MLSQL, Logstash
The final result is described below.
Collection process
Process decomposition
Step 1: MLSQL consumes from MQ
The log producer serializes the raw logs with protobuf and pushes them to MQ; MLSQL then deserializes them, performs simple ETL, and pushes the results back to MQ
Step 2: Logstash consumes from MQ
The MLSQL-processed data is consumed by Logstash, processed again here with Ruby, and finally written to ES and HDFS
Note: part of the data is pushed to ES for use by the business side, while the rest is written to HDFS for use by the data warehouse
Step 3: Data warehouse modeling
Through data warehouse modeling, the final metric results are pushed to ES for use by the business side
Note: this article mainly uses this requirement to explain how Logstash is used and optimized in a real scenario; the other two steps of the process will not be explained in detail
Why is it designed this way?
Reason 1:
First, this requirement falls under log collection, but Logstash does not support deserialization out of the box; supporting it would require developing a custom Ruby plug-in, which is costly to develop and hard to maintain. MLSQL combined with a UDF is therefore used for the streaming processing.
Reason 2:
You may wonder why, in the final output flow, the data is not written to HDFS and ES directly from MLSQL. There are two reasons:
1. Writing to HDFS from MLSQL generates a large number of small files, so a file-merging function would have to be developed separately
2. The data finally written to ES has to be modeled in the data warehouse together with other business data, which MLSQL is not good at, so offline processing is adopted for that part
That said, the specific choice has to be made in light of the company's actual situation. Some readers may wonder why Flume was not used for log collection; I won't go into that here. Every tool has its fans, and whatever suits you best is best. Without further ado, let's get to the topic: given this scenario, how do you collect such a large data volume at low cost, and how do you optimize the pipeline?
Logstash development process
1. Determine the log format
First, a log file may contain more than one log format, or it may follow a standardized format; this needs to be confirmed with the side that produces the logs
2. Debugging grok
After determining the log format, write the grok pattern and debug it; I used the Grok Debugger built into Kibana 6. Given the background of this requirement, the data Logstash ultimately collects has already been processed by MLSQL and arrives as a JSON string, so grok syntax is not actually required here, but a simple example is given below to illustrate
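A minimal sketch of such a grok filter, assuming a hypothetical log line like 2019-08-01 12:00:00,123 INFO com.example.CallService - cost=25ms (the line format and field names are stand-ins, not the real format):

```
filter {
  grok {
    # parse timestamp, level, class, and call cost from the raw line
    match => {
      "message" => "%{TIMESTAMP_ISO8601:log_time} %{LOGLEVEL:level} %{JAVACLASS:class} - cost=%{NUMBER:cost_ms:int}ms"
    }
  }
}
```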
3. Debugging Ruby
For this requirement, part of the cleaning logic is implemented in a Logstash ruby filter, sketched below
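A minimal sketch of such a ruby filter; the fields trace_id and cost_ms are hypothetical stand-ins, since the real cleaning rules are business-specific:

```
filter {
  ruby {
    code => '
      # drop events that carry no trace id
      trace_id = event.get("trace_id")
      if trace_id.nil? || trace_id.to_s.empty?
        event.cancel
      else
        # normalize the cost field and flag slow calls
        cost = event.get("cost_ms").to_i
        event.set("cost_ms", cost)
        event.set("slow_call", cost > 1000)
      end
    '
  }
}
```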
4. Optimization
Because the data volume is large and the resources are small, optimization takes up a large share of the whole development cycle for this requirement. The specific optimization ideas are as follows:
1. MLSQL optimization
The optimization work in this part mainly focuses on deserialization, dropping useless fields, and filtering out part of the data volume in advance. Below is a sketch of the UDF registration
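A minimal sketch of registering a Scala UDF in MLSQL; the function name and body are hypothetical, since the real protobuf-decoding and field-pruning code is not shown:

```
set decode_call_log='''
def apply(raw: String): String = {
  // hypothetical: deserialize the record, keep only the needed fields,
  // and emit a slimmed-down JSON string for the downstream consumers
  raw
}
''';
load script.`decode_call_log` as decode_script;
register ScriptUDF.`decode_script` as decode_call_log;
```

The registered function can then be called in a select over the stream like any other SQL function.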
2. Kafka-side optimization
Because the Kafka clusters are shared across teams, Kafka-side optimization is limited to the consumer side. Only two parameters are tuned here (see the sketch after the list):
1. Data compression
2. Number of consumer threads
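A minimal sketch of the Logstash kafka input; broker, topic, and group names are hypothetical. consumer_threads is the thread-count knob, while compression itself is enabled on the producing side (compression.type in the producer config), after which consumers fetch the smaller compressed batches automatically, reducing network transfer:

```
input {
  kafka {
    bootstrap_servers => "kafka01:9092,kafka02:9092"
    topics            => ["service-call-log"]
    group_id          => "logstash-call-log"
    codec             => "json"
    # ideally equal to the number of partitions this instance consumes
    consumer_threads  => 4
  }
}
```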
3. HDFS optimization
The part of Logstash that writes to HDFS does not use the webhdfs plug-in but a custom one.
Because the custom plug-in has to deal with a file-lock problem, it decides when to flush by comparing whether the target file before and after an event is the same; context switches and flush operations can therefore only be reduced by lowering how often the target file changes, i.e. by reducing the file update frequency
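The custom plug-in itself is not shown, so as an illustration of the same idea with the stock webhdfs output: an hour-level path pattern makes the target file change less often, and larger flush settings cut flush frequency. Host, port, user, and path here are hypothetical:

```
output {
  webhdfs {
    host => "namenode.example.com"
    port => 50070
    user => "logstash"
    # hour-level granularity, so the target file switches at most hourly
    path => "/data/call_log/dt=%{+YYYY-MM-dd}/log-%{+HH}.json"
    flush_size => 5000
    idle_flush_time => 30
    codec => "json_lines"
  }
}
```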
4. ES optimization
The optimization of ES only involves write optimization: bulk writes, more write threads, a longer refresh interval, disabling memory swapping, disabling refresh and replicas during the initial load, and a larger index buffer, as sketched below
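A minimal sketch of the corresponding knobs; host and index names are hypothetical. The Logstash elasticsearch output bulk-writes by design, and batch size and worker threads are tuned in logstash.yml (pipeline.batch.size, pipeline.workers); the remaining settings live on the ES side:

```
output {
  elasticsearch {
    hosts => ["es01:9200"]
    index => "call-log-%{+YYYY.MM.dd}"
  }
}
# ES-side settings applied outside Logstash:
#   index.refresh_interval    : increase, or -1 during the initial load
#   index.number_of_replicas  : 0 during the initial load, restore afterwards
#   indices.memory.index_buffer_size : increase in elasticsearch.yml
#   bootstrap.memory_lock     : true in elasticsearch.yml to prevent swapping
```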
More articles will follow; if you like them, follow the WeChat public account “learning big data”.