ClickHouse& Column database

A simple introduction

ClickHouse was originally developed for YandexMetrica, the world’s second largest Web analytics platform. It is a column database management system (DBMS) for online analytics (OLAP). It’s really a column database. What is a column database

Column database

rowId guid name
001 aaa zhangsan
002 bbb fawaikuangtu
This is how a traditional row database stores data
001:aaa,zhangsan
002:bbb,fawaikuangtu
Copy the code

However, column database storage is stored by column, and the data in the same column is stored together

aaa:001,bbb:002
zhangsan:001,fawaikuangtu:002
Copy the code

So what are the characteristics of the data, that is to do data analysis is quite fast, to the row database, take MySQL as an example, locate a record quickly, but, do a column of statistics is not too line. (MySQ I also have a lot of research, interested can leave a message, later may consider a blog yo)

Key features of an OLAP scenario

  • The vast majority of requests are read oriented.
  • Data is updated in large batches (> 1000 rows) rather than in single rows; Or not at all.
  • Data is added to the database with little change.
  • For reads, a large amount of data is extracted from the database, but only a subset of the columns.
  • The tables are “wide,” which means they contain a large number of columns.
  • Queries are relatively small (typically hundreds or less per server).
  • For simple queries, a delay of about 50 ms is allowed.
  • The column values are fairly small – numbers and short strings (for example, 60 bytes per URL).
  • Processing a single query requires high throughput (billions of rows per second per server).
  • No transaction is processed.
  • Low data consistency requirements – each query has one large table and all other tables are small.
  • The query result is significantly smaller than the source data. That is, the data is filtered or aggregated. The results can be stored in the memory of a single server.

The characteristics of Clickhouse

  • The data length is fixed to avoid CPU calculation caused by unfixed length
  • You can compress data
  • It can be stored on disk
  • Support concurrent
  • Support distributed query processing
  • Support for SQL syntax
  • Not only are they stored as columns, but they are also processed as vectors (parts of columns)
  • Support the index
  • Support approximate calculation
    • Various aggregation functions, such as distinct Values, medians, quantiles
    • Approximate query was performed on partial samples of data
    • Instead of using all the aggregation conditions, a finite number of data aggregation conditions are randomly selected for aggregation
  • And allow access control of roles (MySQL set)

The installation

Can also be their own environment ah, I lazy here, with docker, really fragrant!

Creating a server instance


$ mkdir $HOME/some_clickhouse_database

$ docker run -d --name some-clickhouse-server --ulimit nofile=262144:262144 --volume=$HOME/some_clickhouse_database:/var/lib/clickhouse yandex/clickhouse-server

Copy the code

Create a client to connect to the server

$ docker run -it --rm --link some-clickhouse-server:clickhouse-server yandex/clickhouse-client --host clickhouse-server
Copy the code

The connection

Clickhouse-client Native command line client

As shown above, clickhouse-client is a native command-line client that is upgraded with the server and is backward compatible. This means that old versions do not support new features, and new features are compatible with old versions. It is generally upgraded with the server

The command parameter

  • --host, -h– – Host name of the server. The default value islocalhost. You can choose to use a host name or an IPv4 or IPv6 address.
  • --port– Port to be connected. Default value: 9000. Note that the HTTP interface and the TCP native interface use different ports.
  • --user, -u– User name. Default value:default.
  • --password– the password. Default: an empty string.
  • --query, -q– Query in non-interactive mode.
  • --database, -d– Default database for the current operation. Default value: The default configuration of the serverdefault).
  • --multiline, -m– If this parameter is specified, multi-line query is allowed. (Enter only indicates a newline but does not indicate the end of the query.)
  • --multiquery, -n– If specified, processing is allowed;Number – separated multiple queries that only work in non-interactive mode.
  • --format, -f– The output is in the specified default format.
  • --vertical, -E– If this parameter is specified, the output is in vertical format by default. This has to do with- the format = VerticalThe same. In this format, each value is printed on a separate row, which is useful for displaying a wide table.
  • --time, -t– If this parameter is specified, the query execution time is displayed in non-interactive modestderrIn the.
  • --stacktrace– If specified, the stack trace is printed if an exception occurs.
  • --config-file– Name of the configuration file.
  • --secure– If specified, the server is connected through a secure connection.
  • --history_file– Directory for storing command history files.
  • --param_<name>– Query Parameters Set the query parameters

Support HTTP interface invocation

By default, Clickhouse-Server monitors HTTP requests on port 8123

curl 'http://localhost:8123/? query=SELECT%201'Copy the code

It does not work well on HTTP 1.1 when using Keepalive and chunked transports encoding.

MySQL link

 mysql --protocol tcp -u default -P 9004
Copy the code

JDBC driver Link

Github.com/ClickHouse/…

The data type

  • The integer
    • Int8-[-128:127]
    • Int16-[-32768:32767]
    • Int32-[-2147483648:2147483647]
    • Int64-[-9223372036854775808:9223372036854775807]
  • Unsigned integer
    • UInt8-[0:255]
    • UInt16-[0:65535]
    • UInt32-[0:4294967295]
    • UInt64-[0:18446744073709551615]
  • Floating point (also loses precision)
    • Float32 – float
    • Float64 – double
  • Special floating point number
    • Inf is infinite
    • – inf minus infinity
    • NaN non-numeric
  • Big data type (P:1 to 38 how many decimal digits, S: how many small trees)
    • Decimal(P,S)
    • Decimal32(S)
    • Decimal64(S)
    • Decimal128(S)
  • Boolean type
    • There is no specific type, you can use UInt8 to limit it to 0 or 1
  • string
    • String (arbitrary length)
    • FixedString (N) (fixed length, null byte added at the end of length)
    • UUID (16-byte, which can be generated by generateUUIDv4())
  • The date of
    • Date is accurate to the day. You can insert a Date string or a 10-bit timestampINSERT INTO dt Values (1546300800, 1), ('2019-01-01', 2);
    • Datetime([timezone]) Accurate to secondsINSERT INTO dt Values (1546300800, 1), ('2019-01-01 00:00:00', 2);
    • DateTime64(precision, [timezone]) precision controls the precision and timezone controls the timezone
  • The enumeration
    • Enum8 eg.:Enum8(‘hello’ = 1, ‘world’ = 2)
    • Enum16
  • An array of
    • array(T)
  • other
    • Each child of the Nest variant is an array
    • Each subtype of a Tuple can be defined at will
    • Nullable is usually placed with the preceding types. Nullable(Int8) indicates that Nullable can be stored
    • IPv4
    • IPv6
    • Map

As above, the operation is similar to MySQL, the main annoyance is the engine, the follow-up reference clickhouse.tech/docs/ en/will be updated