Preface:

When the front-end monitoring system was first implemented, Mongo was used as the log data store. Document storage makes it easy to add and remove log fields, and the native JSON format plays well with Node.js. The honeymoon lasted several months.

Then one day I found that the tables had grown huge (hundreds of millions of rows in a single table) and queries had slowed down, especially aggregate queries. I tried all kinds of optimizations: compound indexes, time-range restrictions, periodic cleanup of old data, and so on, but the end result was still unsatisfactory.

Along the way, a colleague mentioned a database called ClickHouse that might help in this scenario. And so I went through:

  • “Mongo is the best”
  • “It must be me using Mongo badly; if I keep optimizing, it will surely work”
  • “I can’t take this any more, let me check out ClickHouse”
  • “ClickHouse is delicious”

That was the journey of the mind. Now that the system has long been running stably on ClickHouse, I'm looking back over that history and writing this article.

Mongo’s joys and sorrows:

Mongo felt very comfortable at first, and we kept optimizing it as we went. Below are the core design points and the problems we ran into.

  • Table design

Front-end monitoring logs are split into tables by application and by data type, which already counts as a degree of optimization and unit splitting: the data is not all lumped together, and it also makes it easy to delete a whole application's data later. At some point we wanted to partition individual tables by time under our master/slave setup, but Mongo does not support time-based partitioning, only cluster sharding. We also considered splitting tables by month, yielding app1_202101, app1_202102, and so on. The consequence, however, is that every query becomes confined to one month's table, and continuous cross-month data cannot be searched easily, which would break many of our application scenarios.

  • Indexes

Log data is queried primarily by time, so time gets a single-field index. To speed up multi-field aggregate queries, we also added compound indexes shaped after the actual query conditions, but the effect was not ideal, and the indexes themselves take up storage space.
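
For illustration, a minimal sketch of both kinds of indexes using the Node mongodb driver (connection string, database, collection and field names are all hypothetical):

import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const logs = client.db("monitor").collection("app1_logs"); // hypothetical names

// single-field index on time: logs are queried mainly by time
await logs.createIndex({ time: -1 });

// compound index shaped after an actual aggregate-query condition;
// it speeds some queries up, but itself costs extra storage
await logs.createIndex({ type: 1, pageUrl: 1, time: -1 });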

  • Restricting query conditions

Initially the admin console imposed no time limit, so by default a query covered the entire time range. With a very large table that takes a long time, so we added a default time limit to all queries. But this approach cannot fully satisfy the query requirements of some scenarios.
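
The default limit amounts to something like this, reusing the logs collection from the sketch above (the seven-day window is an assumption, not our exact setting):

// apply a default time range when the caller did not supply one,
// so a query never scans the whole collection
const defaultStart = new Date(Date.now() - 7 * 24 * 3600 * 1000);

const recent = await logs
  .find({ type: "error", time: { $gte: defaultStart, $lte: new Date() } }) // "type" is a hypothetical field
  .sort({ time: -1 })
  .limit(100)
  .toArray();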

  • Data cleanup

Since the large tables were what made queries slow, we deleted data older than six months so that each table stays within a six-month window. That calls for a scheduled deletion task; but in practice only a few heavily monitored front-end applications had huge tables, and only those needed trimming (applications whose queries were still fast could keep their data as long as possible), which required extra judgment logic.

While deleting, I found that on a table with over 100 million rows, removing about one month of data (on the order of 10 million rows) pinned the database CPU at 100%, ran for a very long time, and degraded our queries and inserts in the meantime (and yes, we were on master/slave; it did not help here). In the end we simply deleted day by day. After cleaning up a large table, I then discovered that no disk space had been released: Mongo only frees space when a collection is dropped; merely deleting documents leaves the space allocated. To actually reclaim it you have to stop the database first, which disrupts normal use (there is a zero-downtime workaround with some implementation cost: copy into a new table, switch over, then sync the remaining data). So if a table's log volume spikes during some period, its tablespace usage grows permanently. It dawned on me that this whole optimization was only a palliative. Uncomfortable ~
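
Day-by-day deletion looks roughly like the sketch below (the six-month window comes from the text above; the rest is assumed, and the naive stop condition assumes no empty days in the middle of the data):

// trim one day at a time so the primary is never saturated by one huge delete
const DAY = 24 * 3600 * 1000;
const cutoff = Date.now() - 180 * DAY; // keep roughly six months

for (let end = cutoff; ; end -= DAY) {
  const { deletedCount } = await logs.deleteMany({
    time: { $gte: new Date(end - DAY), $lt: new Date(end) },
  });
  if (deletedCount === 0) break; // walked past the oldest data
}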

  • Explain query analysis

During optimization I also used Mongo's explain to analyze queries, only to find that the statements themselves were fine; the sheer data volume simply made them slow.
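
For reference, query analysis with explain in the Node driver looks like this (sketch; the filter is hypothetical):

// "executionStats" also reports timings and how many documents were scanned
const plan = await logs
  .find({ type: "error", time: { $gte: defaultStart } })
  .explain("executionStats");

console.log(
  plan.executionStats.totalDocsExamined,
  plan.executionStats.executionTimeMillis
);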

summary

As slow queries multiplied, the project itself became a risk. At first, the front-end monitoring system shared a cloud database with other business systems. During one period its slow database queries happened to coincide with a minute-level scheduled task, so the database accumulated slow queries and CPU usage stayed high. Front-end monitoring also writes a steady stream of concurrent monitoring data. It dragged down the whole database, and because monitoring produced so many slow-query log entries, other services struggled to find their own entries when checking the slow log. Eventually our ops colleagues provisioned a dedicated database for front-end monitoring. Not that they were doing me a favor, really; they spent a little money to cut out the cancer that was me.

Before the migration: 12 cores and 32 GB, three nodes (one primary, two secondaries). After the migration: 2 cores and 4 GB, three nodes. Despite the smaller configuration, the average response time improved once front-end monitoring had a database to itself, presumably because the fresh library had less space fragmentation and the memory cache served a single workload. What front-end monitoring stores is log data; letting it degrade database performance and affect business data services was unreasonable, so separating the logging system's database from the business databases makes sense anyway.

Given all of the above, I felt the road had come to an end and that I had done what I could. Time to change direction, take a look at ClickHouse, and see whether it could solve the current batch of problems.

New understanding of ClickHouse

When you first hear about ClickHouse, your first thought is that Mongo still has room for optimization, because switching databases costs a lot, so you wring everything you can out of Mongo. Looking back now, that had both upsides and downsides. The upside: a deeper understanding of Mongo's own characteristics, with real practice as the yardstick, which helped when comparing ClickHouse later. The downside: a little too much time spent on the problem itself. Let's start with what kind of database ClickHouse really is.

Row storage and column storage

By data storage structure, databases divide into two camps: row-oriented and column-oriented.

  • Row-oriented databases

Traditional databases, i.e. the most commonly used ones, mostly store data in a row-oriented structure, e.g. Mysql, Mongo and so on. Say a user table has three fields: id, name and year. Each user we insert is stored as one continuous row of complete user information.

id: 1, name: 'Ye Xiaochai', year: 21 -> stored

id: 2, name: 'Zimu', year: 20 -> stored

id: 3, name: 'David', year: 19 -> stored

index    id    name           year
#0       1     Ye Xiaochai    21
#1       2     Zimu           20
#2       3     David          19

  • Column-oriented databases

ClickHouse, Hive, Spark and the like store data in a column-oriented structure. Take the same user table with fields id, name and year: each field is a column, and when we insert user information, the values are stored contiguously column by column.

id: 1, 2, 3 -> stored

name: 'Ye Xiaochai', 'Zimu', 'David' -> stored

year: 21, 20, 19 -> stored

key     #0             #1      #2
id      1              2       3
name    Ye Xiaochai    Zimu    David
year    21             20      19

Differences in the underlying storage structure directly affect queries. The GIF below, from the official ClickHouse documentation, shows the difference when querying three fields of the rows that match a condition. In Figure 1, the row database scans row by row and returns the three required fields; along the way it reads many unneeded fields, which inflates I/O, wastes memory and slows the query down.

In Figure 2, because a column database stores each field's column contiguously, only the three relevant columns need to be scanned to find the matching data.

Row database (Figure 1):

Column database (Figure 2):

Composite indexes in row databases

If you stopped at the description above, someone meeting this concept for the first time might think: wow, this is great; once Mysql or Mongo blows up, I'll switch! And a Mysql veteran might scoff: pfft, an index solves that, don't feed me garbage! Of course the story is not that simple, and the picture above is not the whole of it; bear with me and read on.

Mysql and Mongo both support multi-field compound indexes to speed up queries. But an index needs extra storage, and the more fields and data it covers, the more space is spent. Compound indexes also obey the leftmost-prefix rule, so a query that does not match the leading fields cannot use the index flexibly.
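
A quick illustration of the leftmost-prefix limitation, again with hypothetical Mongo fields:

// compound index on (appId, type, time)
await logs.createIndex({ appId: 1, type: 1, time: -1 });

// can use the index: the filter matches the leftmost prefix (appId, type)
await logs.find({ appId: "app1", type: "error" }).toArray();

// cannot use it effectively: the filter skips the leading appId field,
// so Mongo falls back to a scan or another index
await logs.find({ type: "error", time: { $gte: new Date("2021-04-01") } }).toArray();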

summary

As mentioned earlier, I ran out of ideas with a row database like Mongo after hitting performance problems and fiddling for a long time. Then I learned about the column database ClickHouse and the core difference between the two storage models. Comparing queries, the column database looks so strong that you might swear off row databases entirely. If so, why is Mysql still so mainstream? That brings us to OLTP (online transaction processing) and OLAP (online analytical processing).

OLTP and OLAP

On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP) are concepts for two different application scenarios.

OLTP focuses on transaction processing, chiefly database writes (inserts, deletes, updates), and requires writes to be real-time and safe.

OLAP focuses on data analysis, chiefly database reads, and requires queries to be real-time.

Traditional databases such as Mysql mainly serve OLTP, and with small data volumes there is rarely a query bottleneck, so most of the time this is the kind of database you deal with. As technology advanced and Internet users multiplied, query performance on large data volumes hit a wall, and it became clear that a more OLAP-friendly analytical database was needed. For producing data, though, we usually still cannot do without a traditional OLTP database. So the common pattern is: keep business data in an OLTP database (easy to insert, update and delete), then clean part of that data into an OLAP database for specialized query analysis. The two coexist, reads and writes separated, each doing what it does best.

The now-popular data warehouse uses exactly this structure. In the figure below, the data sources are the OLTP databases in use, and ETL (Extract, Transform, Load) moves their data into a data warehouse built on an OLAP database.

(The diagram below is borrowed, not mine.)

After we started using ClickHouse, to improve the efficiency of certain business-data queries, we also began syncing Mysql data into ClickHouse on a schedule for query analysis.
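
Conceptually the timed sync is something like the sketch below (not our production code; hosts, credentials, tables and columns are all placeholders):

import mysql from "mysql2/promise";
import { ClickHouse } from "clickhouse";

const src = await mysql.createConnection({
  host: "mysql-host", user: "reader", password: "***", database: "biz",
});
const ch = new ClickHouse({ url: "http://ch-host", port: 8123 });

let watermark = new Date(0); // creation time of the last synced row

async function syncOnce() {
  // pull only the rows written since the previous run
  const [rows] = await src.query(
    "SELECT id, status, created_at FROM orders WHERE created_at > ? ORDER BY created_at",
    [watermark]
  );
  const list = rows as { id: number; status: number; created_at: Date }[];
  if (list.length === 0) return;
  await ch.insert("INSERT INTO orders_analytics (id, status, created_at)", list).toPromise();
  watermark = list[list.length - 1].created_at;
}

// run every 10 minutes
setInterval(() => syncOnce().catch(console.error), 10 * 60 * 1000);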

Of course, if the raw data collected on the server can be stored directly once the business logic is settled, and is never modified afterwards, it can go straight into the OLAP analytical database without the OLTP database + ETL step. That is exactly how our front-end monitoring works now.

summary

From these two application-scenario concepts we can see that row databases are good at inserting, deleting and updating, while column databases are good at querying. Once you understand the difference, you can design a more fitting database solution for each system's requirements.

ClickHouse pros and cons

The official documentation describes its features in detail, and every concept there opens onto plenty of further reading. ClickHouse stands out for its query performance and compression; in benchmarks among databases of the same type it also ranks near the top. Its advantages are introduced thoroughly in the official docs, so I won't repeat them here: see "Distinctive Features of ClickHouse".

Its disadvantage, I think, is that the community ecosystem is still a bit thin: it is not as mature as Mysql's, and much of the surrounding tooling and third-party library support is missing.

Performance comparison

After adopting ClickHouse, we compared it against Mongo using a table with the same structure as the current system, looking at conditional query performance and storage footprint. The gains in both demonstrate the advantages of columnar storage for conditional queries and of its compression. Granted, a head-to-head between two databases of different types is not entirely fair, but it serves as a reference for our scenario.

Besides solving the query performance problem, ClickHouse also saves storage space. Even better: remember how deleting data in Mongo did not free any space? That problem is gone. With a time partition configured on a table, ClickHouse splits the data files by time, and dropping a partition deletes the data and frees the space immediately, which stretches how long logs can be retained on fixed hardware.

Partitioning by month:

CREATE TABLE xxx (
    time DateTime,
    status Int32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY time

Table folder structure:

SQL to drop a month's partition:

ALTER TABLE xxx DROP PARTITION '202104'

Assorted notes from using ClickHouse

Many new APIs are not supported by older versions

Because iteration is fast, some syntax in the documentation may not be supported by the database version you run. Pay attention to your version, and try to keep production, test, and local environments on the same ClickHouse version.

Mysql protocol is supported with limitations

ClickHouse supports the Mysql protocol, so you can connect with a Mysql client library and use SQL syntax. Enable ClickHouse's port 9004, then configure the address, port, account and password in a Mysql connection library to connect and query. This is friendly to users. There are limitations, though: for example, precompiled (prepared) statements are not supported, and certain data types are transmitted as strings.
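
Connecting over the Mysql protocol with the mysql2 driver looks like this (sketch; host and credentials are placeholders):

import mysql from "mysql2/promise";

// 9004 is ClickHouse's Mysql-protocol port (when enabled in the server config)
const conn = await mysql.createConnection({
  host: "ch-host",
  port: 9004,
  user: "default",
  password: "",
});

// use query() rather than execute(): execute() prepares the statement first,
// and ClickHouse's Mysql protocol support has no prepared statements
const [rows] = await conn.query("SELECT count() AS total FROM xxx");
console.log(rows);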

MYSQL ORM problem

An ORM (Object Relational Mapping) library hides some of the cost of database operations and improves development efficiency. Since ClickHouse speaks the Mysql protocol, in theory a Mysql ORM library should work with it too.

I'm on Node, so I used Sequelize as the ORM library with Mysql2 as the driver. Because of differences in table creation and data types, Mysql's ORM library cannot manage ClickHouse table structures; only raw SQL syntax can be used.

Raw reads like select id, key1, key2, key3 mostly work, although ClickHouse has no auto-increment id field by default, which ORM conventions tend to assume. Barely usable.
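
So in practice it came down to raw queries, roughly like this sketch (connection details are placeholders):

import { Sequelize, QueryTypes } from "sequelize";

// point Sequelize's mysql dialect at ClickHouse's Mysql-protocol port
const sequelize = new Sequelize("default", "default", "", {
  host: "ch-host",
  port: 9004,
  dialect: "mysql",
});

// model definitions, sync and migrations don't work against ClickHouse;
// raw SQL reads mostly do
const rows = await sequelize.query(
  "SELECT key1, key2, key3 FROM xxx WHERE time >= now() - INTERVAL 1 DAY",
  { type: QueryTypes.SELECT }
);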

But using the save method to write data (i.e. an insert statement) produced a ClickHouse error:

MySQLHandler: MySQLHandler: Cannot read packet: : Code: 48, e.displayText() = DB::Exception: Command [ERRFMT] is not implemented., Stack trace (when copying this message, always include the lines below)

The error message is not very precise. After some time tracing the source, it turns out that some of the commands Mysql2 sends are simply not implemented by ClickHouse (see MySQLHandler.cpp, around line 183, in the ClickHouse source).

ClickHouse does not support precompilation over the Mysql protocol, while the ORM prepares statements by default (PrepareStatement). So the final conclusion: ClickHouse can speak the Mysql protocol, but it cannot support a Mysql ORM library, which is unfriendly for real application development.

CLICKHOUSE ORM

I then evaluated calling the database through ClickHouse's HTTP interface, and picked the library TimonKK/clickhouse from the official client-libraries list. It essentially implements a connection driver; HTTP is of course stateless, so the library does some wrapping to help assemble the HTTP requests, and we just write SQL statements.
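
Basic usage of that library, following its README (sketch; connection details are placeholders):

import { ClickHouse } from "clickhouse";

const clickhouse = new ClickHouse({
  url: "http://localhost",
  port: 8123, // ClickHouse HTTP port
  basicAuth: { username: "default", password: "" },
});

// the library assembles the stateless HTTP request; we just write SQL
const rows = await clickhouse
  .query("SELECT status, count() AS c FROM xxx GROUP BY status")
  .toPromise();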

A bare driver alone is not enough: without the boost of an ORM, the development experience is poor and efficiency is low. So, borrowing usage habits from Mongoose (an ODM) and SequelizeJS, I developed a basic ClickHouse ORM: node-clickhouse-orm. I was not yet familiar enough with ClickHouse to cover everything, so I only implemented the features I normally use.

The library has not been promoted yet, and usage is very small. Recently, around mid-February 2022, it was revamped and released as a 2.0.0 beta: npmjs.com/clickhouse-… .

I will keep iterating and maintaining it, but one person's energy and ability are limited, so I plan to set up an open-source discussion group. You are welcome to try it out and join the group to take part.

ORM/ODM

  • ORM: Object Relational Mapping
  • ODM: Object Document Mapping.

I was a little confused about the difference between ORM and ODM and why the names differ. Mongoose is called an ODM, not an ORM, yet they look the same. Is it renamed ODM just because Mongo is a document database?

On reflection, although the names differ, the essence of both is to abstract the database's storage format into logical objects in the program, letting users interact with the database by manipulating objects. Both are database middleware; the different names simply go with different database types. Besides, Mongo's own query syntax already operates in an object style, which is another difference from Mysql.

The LightHouse GUI is recommended

LightHouse is a GUI tool for ClickHouse (quite a few projects share this name, so don't get confused). It is built with plain HTML: just download the repository to your machine and open it in a browser. There is no server to start; double-click and it runs.

The tool is fairly simple, and also fairly safe: it talks to the database over the HTTP interface, and its Ajax query requests carry readonly=1, so the tool is mainly for querying. If you want to run other SQL operations, you can edit the source to remove readonly=1.
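
For reference, readonly is an ordinary setting of ClickHouse's HTTP interface and can be passed as a URL parameter; a request like this sketch is limited to reads (host is a placeholder):

// POST the SQL in the body; readonly=1 makes the server reject writes
const res = await fetch("http://localhost:8123/?readonly=1", {
  method: "POST",
  body: "SELECT count() FROM xxx FORMAT JSON",
});
console.log(await res.json());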

conclusion

The introduction of ClickHouse brought a significant performance improvement. That does not mean the system no longer uses Mongo: the system's own metadata, and data that needs frequent modification, still live in Mongo.

Going from Mongo to ClickHouse, I learned all sorts of things and experimented with all sorts of techniques, and got a great deal out of it.

Well, that's all for today. The experience with these databases, once again, comes from the continuous iteration of the front-end monitoring system. I have written several posts about front-end monitoring before; feel free to follow me and revisit the earlier articles, and if you like them, a like, favorite and comment would be much appreciated ~

  • Build a front-end monitoring system
  • Front-end monitoring SDK development sharing
  • Implementing a front-end exception monitoring system from scratch