Table structure optimization
Table engine
There are many table engines in ClickHouse. The most widely used, and the most robust, is the *MergeTree family of table engines. Choose the table engine according to the application scenario; for example, Memory is a good fit for temporary tables. Here we discuss some optimizations for MergeTree.
The official documentation lists four MergeTree features:
- Data is stored sorted by primary key;
- Partitions can be created by specifying a partitioning key;
- Data replication is supported;
- Data sampling is supported.
The data of a MergeTree table consists of groups of parts. Within each part, data is sorted lexicographically by primary key, and parts are periodically merged in the background. For details on partitioning and merging, refer to ClickHouse | MergeTree engine data partition.
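The parts of a table can be inspected through the `system.parts` system table; a minimal sketch (the table name `hits` is hypothetical):

```sql
-- list the active parts of one table, with their partition and row count
SELECT partition, name, rows, active
FROM system.parts
WHERE table = 'hits' AND active
ORDER BY partition;
```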
There are several main points about MergeTree optimization:
- Partition granularity is determined by the characteristics of the business. As a rule of thumb, for a single table holding about 100 million rows, 10-30 partitions is an optimal range.
- Set the primary key properly. By default the primary key is the sorting key, i.e. the fields specified by `ORDER BY` when the table is created. The `PRIMARY KEY` keyword can specify a primary key that differs from the sorting key, but as the documentation puts it: "In this case the primary key expression tuple must be a prefix of the sorting key expression tuple." In addition, there is no limit on the length of a primary key, but an excessively long primary key hurts write performance (queries are unaffected).
- If the data is time-sensitive, set an expiration time with `TTL`. A TTL can be applied to the whole table or to a specific column; for details, see the TTL section of the MergeTree documentation. A table definition that puts these points together is sketched after this list.
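A minimal sketch of a MergeTree table applying these points (table and column names are hypothetical):

```sql
CREATE TABLE user_events
(
    event_date Date,
    event_time DateTime,
    user_id    UInt64,
    event_type String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)          -- partition granularity chosen by business needs
ORDER BY (event_date, user_id, event_time) -- sorting key
PRIMARY KEY (event_date, user_id)          -- a prefix of the sorting key tuple
TTL event_time + INTERVAL 90 DAY;          -- expire rows after 90 days
```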
Field types
- Do not use strings for fields that can be represented by numeric or date/time types.
- It is recommended to store text in string fields as UTF-8, so that no extra conversion is needed when reading and writing data.
- Although ClickHouse stores DateTime internally as a Long timestamp, it is not recommended to store timestamps as Long yourself; store DateTime directly, since it is just as efficient, more readable, and requires no function conversion.
- Instead of Nullable columns, replace NULL values with default values such as -1 or a "NULL" string, as sketched after this list. The reason is that ClickHouse requires an additional file to record NULL marks, and the documentation clearly states that Nullable hurts performance:

> Using Nullable almost always negatively affects performance, keep this in mind when designing your databases.
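A minimal sketch contrasting a Nullable column with a sentinel default (names are hypothetical):

```sql
-- avoid: Nullable adds a separate mask file per column and slows queries
CREATE TABLE t_nullable (id UInt64, score Nullable(Int32)) ENGINE = MergeTree ORDER BY id;

-- prefer: encode "missing" as a sentinel default such as -1
CREATE TABLE t_default (id UInt64, score Int32 DEFAULT -1) ENGINE = MergeTree ORDER BY id;
```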
Query optimization
- Most IO-reducing operations, such as column pruning, partition pruning, and filtering before joining, are common ways to speed up queries, and they apply to ClickHouse as well. In particular, ClickHouse has no predicate push-down, so you need to manually place the WHERE filter before the join, as in the sketch below.
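A minimal sketch of filtering the right-hand table in a subquery before joining (table and column names are hypothetical):

```sql
-- filter b in a subquery instead of relying on predicate push-down
SELECT a.id, a.name, b.amount
FROM a
JOIN
(
    SELECT id, amount
    FROM b
    WHERE created_at >= '2021-01-01'   -- applied before the join
) AS b ON a.id = b.id;
```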
- Put the large table first in a join, the opposite of Hive. ClickHouse builds the in-memory hash table from the right-hand table, so the smaller table should go on the right.
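A minimal sketch with hypothetical table names:

```sql
-- keep the large table on the left; the right-hand table is held in memory
SELECT b.key, b.value, s.label
FROM big_table AS b
JOIN small_table AS s ON b.key = s.key;
```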
- Keep the number of joined tables to 3 or fewer. Although newer versions have optimized multi-table joins, joining too many tables still has a large negative impact on query efficiency. If possible, generate a wide table first and query that single table; ordinary single-table queries then complete at the millisecond level.
- Frequently joined dimension tables can be set up as dictionary tables, as sketched below. Dictionary data is kept in memory, so the data volume should not be too large.
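A minimal sketch of a dictionary backed by a local table and queried with `dictGet`; the names, source parameters, layout, and lifetime here are hypothetical choices:

```sql
CREATE DICTIONARY product_dict
(
    id   UInt64,
    name String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD '' DB 'default' TABLE 'products'))
LAYOUT(HASHED())             -- the whole dictionary is kept in memory
LIFETIME(MIN 300 MAX 600);   -- refresh every 5-10 minutes

-- replaces a join against the products table
SELECT order_id, dictGet('product_dict', 'name', product_id) AS product_name
FROM orders;
```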
- When the queried fields come from a single table, consider changing the join to an IN. IN queries in ClickHouse support both single fields and tuples:

```sql
SELECT UserID IN (123, 456) FROM ...

SELECT (CounterID, UserID) IN ((34, 123), (101500, 456)) FROM ...

-- the tuple can also come from a subquery
SELECT (CounterID, UserID) IN (SELECT CounterID, UserID FROM ...) FROM ...
```
- Use PREWHERE instead of WHERE (only available for *MergeTree). When the queried columns significantly outnumber the filter columns, this can improve query performance roughly tenfold. By default, ClickHouse automatically moves parts of the WHERE clause into PREWHERE; you can also specify the PREWHERE fields manually based on your data:

```sql
-- PREWHERE optimizes how data is read during the filtering stage, reducing IO
select * from work_basic_model where product='tracker_view' and (id='eDf8fZky' or code='eDf8fZky');

-- replace the WHERE keyword with PREWHERE
select * from work_basic_model prewhere product='tracker_view' and (id='eDf8fZky' or code='eDf8fZky');
```
- For exact statistics, use uniqExact instead of count(DISTINCT). If a small error is acceptable, use uniq or uniqCombined: uniqCombined consumes less memory and is more accurate, but its performance is slightly worse than uniq's. All three functions are more than 10x faster than count(DISTINCT).
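A minimal comparison sketch (the table `hits` is hypothetical):

```sql
-- exact, but the most expensive of the three
SELECT uniqExact(UserID) FROM hits;

-- approximate: uniq is fastest; uniqCombined trades a little speed
-- for lower memory use and better accuracy
SELECT uniq(UserID) FROM hits;
SELECT uniqCombined(UserID) FROM hits;
```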
- For fixed data models, statistical indicators can be built with materialized views, avoiding repeated calculation at query time. Materialized views are updated as new data is inserted; for creation details, refer to the Materialized section of CREATE VIEW. A sketch follows below.
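A minimal sketch of a materialized view that pre-aggregates a daily unique-user count, reusing the hypothetical `user_events` table from above:

```sql
CREATE MATERIALIZED VIEW daily_uv
ENGINE = AggregatingMergeTree
ORDER BY event_date
AS
SELECT
    event_date,
    uniqState(user_id) AS uv_state   -- partial aggregation state, updated on insert
FROM user_events
GROUP BY event_date;

-- the query merges the partial states instead of rescanning the raw data
SELECT event_date, uniqMerge(uv_state) AS uv
FROM daily_uv
GROUP BY event_date;
```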
Configuration optimization
config.xml
Configuration item | Description |
---|---|
background_pool_size | Size of the background merge thread pool. Default 16; a value of twice the number of CPU cores is recommended |
log_queries | Default 0. Set to 1 to automatically create the system.query_log table and record information about every query |
max_execution_time | Maximum query time, in seconds. Unlimited by default; note that a client-side timeout overrides this parameter |
max_threads | Maximum number of CPU cores a single query may use. Defaults to the number of CPU cores |
max_connections | Maximum number of connections. Default 4096 |
max_server_memory_usage | Maximum memory ClickHouse may use. Default 0, which dynamically claims up to all of the machine's resources |
max_concurrent_queries | Maximum number of concurrent queries |
max_open_files | Defaults to the system maximum. If "too many open files" errors occur, set this together with ulimit -n; if it does not take effect, check that /etc/security/limits.d/clickhouse.conf is consistent and that the service has been restarted. The default of 262144 is already quite large |
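Current values of query-level settings can be checked from SQL via the `system.settings` table; a minimal sketch:

```sql
-- inspect a few settings and whether they differ from the defaults
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_threads', 'max_execution_time', 'log_queries');
```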
users.xml
Configuration item | Description |
---|---|
max_memory_usage | Maximum memory available to a single query. Default 10 GB |
use_uncompressed_cache | Whether to use the cache of uncompressed blocks, which speeds up short queries. 0 disables, 1 enables; default 0 (disabled) |
max_rows_to_read | Maximum number of rows that may be read from a table when running a query; an exception is thrown if exceeded |
max_bytes_to_read | Maximum number of bytes (of uncompressed data) that may be read from a table when running a query; an exception is thrown if exceeded |
join_use_nulls | Whether empty cells produced by a JOIN are filled with NULL. Default 0: empty cells are filled with the column type's default value, e.g. 0 for Int and "" for String |
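These user-level settings can also be overridden per session or per query without editing users.xml; a minimal sketch (the table `hits` is hypothetical):

```sql
-- per session
SET max_memory_usage = 20000000000;

-- per query
SELECT count()
FROM hits
SETTINGS max_rows_to_read = 1000000000;
```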
Other optimization
Since version 20.6.3, ClickHouse has provided native EXPLAIN support for parsing query execution plans, which allows you to identify the parts of a SQL statement that may affect performance. See the EXPLAIN references below for details.
Two things to note about EXPLAIN:
- With the tool versions used here (DBeaver 7.2.2, ClickHouse 20.9.3.45), running `explain` may return no error yet also no parsing result; use the clickhouse-client command instead.
- Executing `SET send_logs_level = 'trace';` in clickhouse-client prints much more detailed log information.
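A minimal sketch of both, runnable in clickhouse-client:

```sql
-- parse the execution plan (available since 20.6.3)
EXPLAIN SELECT number FROM system.numbers LIMIT 10;

-- print detailed per-query logs for subsequent statements
SET send_logs_level = 'trace';
SELECT number FROM system.numbers LIMIT 10;
```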
Reference documentation
- ClickHouse optimization best practices
- ClickHouse | MergeTree engine data partition
- ClickHouse *MergeTree table engine
- MergeTree
- Initial EXPLAIN
- Feature request: EXPLAIN output
- Restrictions on Query Complexity
- Server Settings