Preface

Client-side user behavior logs are notorious for their volume: hundreds of millions, billions, or even tens of billions of records per day are common. When it comes to collecting client-side logs, two topics are unavoidable: the timeliness and the accuracy of collection. For storing large amounts of log data, both the classic Hadoop architecture and the recently popular ClickHouse are good choices, and their pros and cons are the focus of our discussion. Analyzing large amounts of data is an evergreen topic, and ClickHouse seems to offer a near-perfect solution for it, but the truth is that if you want to use it in a project, a thorough understanding of its characteristics is a major prerequisite. In the following articles, we will study this topic together.


1. How to collect massive logs quickly

To store massive data quickly, we must first solve the problem of landing data on disk under high concurrency and high throughput. For this, Nginx is an excellent choice. On mainstream server hardware, a single Nginx instance can easily handle 10,000+ requests per second; at that rate, a single server can process 10,000 × 3,600 s × 24 h ≈ 864 million records per day, close to one billion. What technology do we need after this step? Let's analyze it in detail.

1. Architecture analysis

2. What can Nginx do?

— Reverse proxy

— Load Balancing

— HTTP server (static and dynamic separation)

Along with forward proxying, these are its main functions. What can we use it for in log collection?

The answer is: caching data to files. With a simple configuration, data arriving at the collection interface can be landed on the server's disk as files. Combined with Nginx's high processing capacity, this easily completes the first step of massive log collection.
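As a rough sketch (the endpoint name, log path, and logged fields below are illustrative assumptions, not a prescribed setup), an access_log with a custom log_format inside the http block of nginx.conf is enough to land every tracking request as one JSON line on disk:

# Minimal sketch; escape=json needs Nginx 1.11.8 or later.
log_format track escape=json
    '{"time":"$time_iso8601","ip":"$remote_addr","ua":"$http_user_agent","args":"$args"}';

server {
    listen 80;
    location /track {
        # Each hit is appended to this file; a log shipper tails it later.
        access_log /data/nginx/logs/track.log track;
        return 200 'ok';
    }
}

No application code sits in the hot path: the request itself does the landing, which is why the throughput figures above are achievable.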

3. How do we handle the files Nginx lands on disk?

It's hard to talk about big data on the Internet today without mentioning Kafka, and once the data is in Kafka, processing it becomes straightforward. Here we use the well-known Filebeat to ship the data files collected by Nginx to Kafka in near real time. With that, the rest of the plan falls into place.
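A minimal Filebeat sketch of this step might look like the following (the hosts, topic, and path are illustrative assumptions):

# filebeat.yml (sketch): tail the Nginx log file and ship each line to Kafka.
filebeat.inputs:
  - type: log
    paths:
      - /data/nginx/logs/track.log

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: "user-behavior-log"
  required_acks: 1
  compression: gzip

Filebeat keeps its own registry of read offsets, so the quasi-real-time shipping survives restarts without re-sending the whole file.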

2. The application of ClickHouse in data storage

1. Writing data to ClickHouse

1. MergeTree table engine: among ClickHouse's table engines, the MergeTree type is the most powerful. It is not a single engine but a whole family, known as the MergeTree series.

2. MergeTree table engine features

Basic concepts of the MergeTree engine series:

(1) Data is written in batches.

(2) Data parts are merged in the background according to certain rules.

Main features (a minimal DDL sketch follows this list):

(1) The stored data is sorted according to the primary key.

(2) Automatically create sparse indexes for fast data retrieval.

(3) Allow partitioning (if the partitioning key is specified).

(4) Some operations on partitions are faster than general operations on the same data with the same result.

(5) When the partitioning key is specified in a query, ClickHouse automatically prunes the partitions that cannot match, which improves query performance.
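To make these features concrete, here is a minimal MergeTree sketch (the table and column names are invented for illustration):

-- Sketch only: table and column names are illustrative.
CREATE TABLE event_log.page_views
(
    `event_time` DateTime,
    `user_id`    Int64,
    `url`        String
)
ENGINE = MergeTree()
-- One partition per day: queries filtering on event_time can skip whole partitions.
PARTITION BY toYYYYMMDD(event_time)
-- Data in each part is sorted by this key, and a sparse index is built on it.
ORDER BY (user_id, event_time)
SETTINGS index_granularity = 8192;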

Given these characteristics, it is clear that we need to write data in batches. How large should each write be? The answer is: as large as possible. Why is that?

ClickHouse supports queries (SELECT) and inserts (INSERT), but does not directly support UPDATE and DELETE.

Insert: MergeTree is not an LSM tree, because it contains no "memtable" and no "log": inserted data is written directly to the file system. This makes it suitable only for bulk inserts rather than for very frequent single-row inserts; one insert per second is fine, but a thousand inserts per second is not. If you have many small writes, you can use a Buffer engine table, which caches data in RAM and periodically flushes it to another table.

So the appropriate size of each insert depends on how much data we produce per second. The key point is that we can improve ClickHouse performance by not writing too often.

For example, if we produce 100,000 records per second, we can insert them all in a single batch. If we only produced one record per second, we probably would not need ClickHouse at all. :)
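When producers cannot batch on their own, the Buffer engine mentioned above is one way to absorb frequent small writes. A minimal sketch, assuming a target table event_log.events already exists with the same structure (the layer count and thresholds are illustrative):

-- Writes go to the buffer table; data sits in RAM and is flushed to event_log.events
-- when all min thresholds are met or any max threshold is reached.
CREATE TABLE event_log.events_buffer AS event_log.events
ENGINE = Buffer(event_log, events, 16,
                10, 100,              -- min_time, max_time (seconds)
                10000, 1000000,       -- min_rows, max_rows
                10000000, 100000000); -- min_bytes, max_bytes

Applications then INSERT INTO event_log.events_buffer, and ClickHouse turns many small writes into a few large ones.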

2. Updating data in ClickHouse

We stated explicitly that ClickHouse's MergeTree table engine does not support updates, so what do we do if we really need them?

1. ClickHouse implements UPDATE and DELETE through variants of ALTER, called mutations.

A mutation is an ALTER query variant that allows rows in a table to be changed or deleted. Mutations are intended for operations that change many rows in a table (single-row operations are also possible).

This feature is in beta and has been available since version 1.1.54388; the UPDATE form of mutations is available since version 18.12.14. Currently the MergeTree family of engines supports mutations. An existing table is ready for mutations as-is (no conversion is necessary), but after the first mutation is applied to a table, its metadata format becomes incompatible with previous server versions, and rolling back to a previous version becomes impossible.

Run the following commands:

ALTER TABLE [db.]table DELETE WHERE filter_expr;

ALTER TABLE [db.]table UPDATE column1 = expr1 [, ...] WHERE filter_expr;

Note: UPDATE cannot modify columns that are part of the primary key or the partition key.
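As a concrete sketch against the event_log.users table defined later in this article (the id and field values are made up):

-- Delete all rows for one id.
ALTER TABLE event_log.users DELETE WHERE id = 1001;
-- Update a non-key column; updating id itself would be rejected because it is the primary key.
ALTER TABLE event_log.users UPDATE second_id = 'new-second-id' WHERE id = 1001;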

For MergeTree tables, a mutation is executed by rewriting entire data parts. The operation is not atomic: parts are replaced as soon as their mutated versions are ready, and a SELECT that runs while a mutation is executing will see data from parts that have already been mutated alongside data from parts that have not yet been mutated. Mutations are ordered by their creation time and are applied to each part in that order.

The same ordering applies to inserts: data inserted into the table before the mutation was submitted will be mutated, and data inserted after that will not. Note that mutations do not block inserts in any way.

Mutations themselves are executed asynchronously, using system profile settings. To track the progress of mutations, you can query the system.mutations table. A successfully submitted mutation continues to execute even if the ClickHouse server is restarted. Once submitted, a mutation cannot be rolled back, but if it gets stuck for some reason it can be cancelled with KILL MUTATION.
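For example (a sketch against the event_log.users table from below; the mutation_id is a placeholder):

-- Track mutation progress.
SELECT mutation_id, command, is_done
FROM system.mutations
WHERE database = 'event_log' AND table = 'users';

-- Cancel a stuck mutation.
KILL MUTATION WHERE database = 'event_log' AND table = 'users'
    AND mutation_id = 'mutation_3.txt';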

Entries for finished mutations are not deleted immediately; the number of entries to keep is determined by the finished_mutations_to_keep storage engine parameter, and older mutation entries are deleted.

Concretely, a mutation uses the WHERE condition to find the data parts that need to be modified, then rebuilds each of those parts, replacing the old part with the new one. For tables with large parts this can be time-consuming (the default maximum size of a part is 150 GB). The mutation is atomic within each individual part.

We will not go deeper into this approach, because in an actual project it is difficult to implement updates in the usual sense this way: updates in a project arrive randomly and piecemeal, which runs counter to the design principle of mutations.

2. If you do not want to use mutations, which after all are non-atomic and expensive, another good choice is to replace MergeTree with ReplacingMergeTree.

In addition to everything MergeTree itself offers, the most important feature of ReplacingMergeTree is that, during background merges, it automatically removes old versions of rows with the same sorting key and keeps only the newest version. This is exactly the end result we want from an update operation in our project.

(1) Create the table:

CREATE TABLE event_log.users (
    `id` Int64,
    `first_id` String,
    `second_id` String,
    `insert_date` DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(insert_date)
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192;

If we insert two rows with the same id into this table, ClickHouse will, during the automatic optimize (merge), use insert_date as the version column, discard the earlier row and keep only the later one.

(2) Update table operation:

When new data needs to update old data with the same primary key, we first run SELECT ... WHERE id = fix_id (the fixed primary-key id) to fetch the old row, merge the new fields carried by the new data into the query result, and execute an INSERT with the merged row. The next step is simply to wait for the automatic optimize to run.
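A sketch of this flow on event_log.users (the id and field values are made up):

-- Suppose a row with id = 42 already exists and second_id has changed.
-- 1) Read the current row and merge in the new fields on the application side.
SELECT id, first_id, second_id FROM event_log.users WHERE id = 42;
-- 2) Re-insert the merged row; insert_date defaults to now(), so it becomes the newest version.
INSERT INTO event_log.users (id, first_id, second_id) VALUES (42, 'f-42', 's-42-new');
-- 3) Optionally force the merge instead of waiting for the background one (expensive).
OPTIMIZE TABLE event_log.users FINAL;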

(3) If you need to guarantee that a query returns the latest data, you can add FINAL to the query; this reduces query efficiency, so use it selectively in a project.
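For example:

-- FINAL merges row versions at query time, so only the newest row per id is returned.
SELECT * FROM event_log.users FINAL WHERE id = 42;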

Conclusion

This article has introduced the collection and storage of massive user behavior data, an evergreen topic. Data storage is a means to an end: conducting data analysis and providing strong data support for the operation and development of the company's business. What we should look for in data analysis, and how to prepare adequately for analysis at storage time, will be shared in the following articles.