This article is an expanded version of a write-up I shared publicly a few months ago.
Hopefully it provides a simple, step-by-step guide to getting started with ClickHouse. If you have a potential data-analysis need but don't know where to start, I hope this article helps.
Before we begin
I don't think I need to repeat the praise ClickHouse gets in performance-driven settings. As a rule of thumb, it lets you do quasi-real-time data analysis that RDBMSs such as MySQL cannot, at a fraction of the cost, and it also works well for consolidating data from multiple remote database instances or for archival storage.
I started using ClickHouse two years ago on a good friend's recommendation, and it has spared me a great deal of unnecessary infrastructure setup and maintenance since. We have run it on a fairly luxurious machine (256 cores, 512 GB of RAM) for quasi-real-time analysis, returning results from massive datasets in a few ten-thousandths of a second while sustaining hundreds of thousands of writes per second without hurting query performance. I have also run edge-computing tasks on a NUC costing about 2,000 RMB, getting the results I needed reasonably quickly (in 1/1000 to 1/100 of a second), and tried complex computations over billion-row datasets (minute-level) on an ordinary 16-core, 64 GB notebook.
So ClickHouse might also be perfect for you if you have the following needs:
- Quick analysis of offline data: calculation, aggregation, and filtering.
- Heavy read workloads with very little need to modify existing data; where updates are needed, you can accept an "append data" plus "version filtering" approach.
- Data with relatively rich fields, i.e. tables with many "columns".
- Low business concurrency: at most a few, or one to two hundred, queries (consumers).
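To make the "append data" plus "version filtering" point concrete, here is a minimal sketch (the table and column names are made up for illustration): instead of issuing an UPDATE, you insert a new row with a higher version number and let a ReplacingMergeTree table, queried with FINAL, keep only the latest version per key.

```sql
-- Hypothetical table: "updates" are modeled as appended rows with a version.
CREATE TABLE user_profile
(
    user_id  UInt64,
    nickname String,
    version  UInt32
)
ENGINE = ReplacingMergeTree(version)
ORDER BY user_id;

-- "Update" user 42 by appending a newer version of the row.
INSERT INTO user_profile VALUES (42, 'old-name', 1), (42, 'new-name', 2);

-- Read the latest version per key; FINAL forces duplicate rows to be merged.
SELECT * FROM user_profile FINAL WHERE user_id = 42;
```

Merges also deduplicate these rows in the background eventually; FINAL just guarantees it at read time.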
Let’s start with hardware choices.
Hardware selection strategy
For those of you considering ClickHouse, you are probably at a point where your business needs, or forces you, to "think about efficiency."
In general, there are very few cases where you need to return ClickHouse results synchronously to a caller. Frugal programmers tend to use it asynchronously, so with very little concurrency the hardware requirements keep dropping: datasets below the hundred-million-row level are easily handled by a minimal 4-core, 16 GB virtual machine; for larger datasets, as long as you have 32 to 64 GB of memory available, the computation time will depend almost entirely on the number of cores and the size of your CPU cache.
So, when considering ClickHouse for fast or relatively fast "offline" data analysis, first think about how much data you have and whether you can meet the memory threshold for fast computation, then about how quickly you need results. Within your budget, prioritize CPUs with more cores and larger caches. Cluster mode is not required by default; consider it only if you need to serve real-time interfaces, have strict availability requirements, or face exceptionally high write pressure. Even if you do need a cluster, I do not recommend the default distributed mode: the data is not a complete mirror but is spread evenly across the nodes, so if one node goes down you lose 1/N of your data in "real time", and the final results can no longer be called correct, only "probably wrong somewhere." ClickHouse has since introduced a "Replicated" database engine, built on the Atomic engine and using ZooKeeper for full data replication. It is still experimental, but it beats losing data.
Another factor with a significant impact on how long ClickHouse takes to produce results is storage. SSDs are recommended, and for small-sample analysis even cheap terabyte-class consumer drives will do; for extreme budgets you can refer to my earlier article on cheap home workstation builds. If you are doing high-frequency computation over massive data, with large and predictably growing storage needs, then for the sake of cost and data reliability an array of mechanical disks in RAID 50 is a better fit.
Of course, if all you want for now is to learn, a Docker container is enough to start the journey, and it will comfortably handle calculations over millions to tens of millions of rows.
Software environment selection
All of my current machines run Ubuntu plus containers. Why this environment? Because "Ubuntu is a first-class citizen of the container world," and this article likewise uses a container environment for a quick start.
Of course, if you choose to install ClickHouse on bare metal, the more stable Debian is a good option, while CentOS is really not recommended at this point (paying for RHEL in an enterprise is another topic).
Running ClickHouse in a container does lose some performance to "translation" overhead, both in storage and in network forwarding, so if you can install ClickHouse on bare metal in production, skip the container.
If you have the Docker environment installed, we can move on to the next chapter. If you are not familiar with how to install Docker, please refer to the contents about container installation in the knowledge map of this site.
Preparation: The data set used for the test
To get familiar with the basic syntax and experience ClickHouse's performance, you can start with the official Yandex.Metrica dataset. (For more performance testing, see the test datasets in the official repository.)
https://datasets.clickhouse.tech/hits/partitions/hits_v1.tar
https://datasets.clickhouse.tech/visits/partitions/visits_v1.tar
In addition, to demonstrate how to quickly import data from a traditional database without worrying about type conversion, we will also work with the MySQL database from a netizen's open-source "movies" project: https://yyets.dmesg.app/database.
After downloading, decompress the data first.
```bash
mkdir data
tar xvf hits_v1.tar -C data
tar xvf visits_v1.tar -C data
```
Using the du command, you can see that the decompressed data actually takes 1.7 GB of space. As an aside, if the same data were stored in MySQL, the footprint would grow by a factor of 3 to 5.

```bash
du -hs data
1.7G    data
```
With the data decompressed, we can prepare the configuration needed to run the ClickHouse container.
Preparation: the ClickHouse runtime configuration
```yaml
version: "2"

services:
  server:
    image: yandex/clickhouse-server:21.9.4.35
    container_name: clickhouse
    expose:
      - 9000
      - 8123
      - 9009
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    environment:
      - TZ=Asia/Shanghai
      # - CLICKHOUSE_USER=root
      # - CLICKHOUSE_PASSWORD=xmnzdwH5
    volumes:
      - ./data:/var/lib/clickhouse
      # Use as needed
      # - ./config.xml:/etc/clickhouse-server/config.xml
      # - ./users.xml:/etc/clickhouse-server/users.xml
```
Save the above configuration as docker-compose.yml, then start ClickHouse with docker-compose up -d for later use.
As an added bonus: ClickHouse iterates quickly, so before changing versions it is recommended to run some small-sample tests to confirm that your common scenarios still work.
First steps with ClickHouse
The SQL dialect ClickHouse uses is much looser than that of a database such as MySQL; by analogy, it is like a long-time Java writer jumping into the world of Python and JavaScript.
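A small taste of that looseness, using only the built-in system database so you can try it in any fresh instance (a sketch, not an exhaustive list): in the interactive client, count() takes no argument, and an output FORMAT can be tacked straight onto a query.

```sql
-- count() needs no * or column argument
SELECT count() FROM system.tables;

-- the output format can be chosen per query
SELECT name, engine FROM system.databases FORMAT Pretty;
```

Little conveniences like these add up when you spend a lot of time in the interactive terminal.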
Because we started ClickHouse in a container, we can reach its interactive terminal through docker exec:

```bash
docker exec -it clickhouse clickhouse-client
```
After entering the terminal, let’s have a look at the “databases” and tables:
```
# look at the databases
cc1b062138da :) show databases

SHOW DATABASES

Query id: efaa1c51-e112-43d6-b803-1e6dd86ad43b

┌─name─────┐
│ datasets │
│ default  │
│ system   │
└──────────┘

3 rows in set. Elapsed: 0.003 sec.

# switch database
cc1b062138da :) use datasets

USE datasets

Query id: b10ff8f3-0743-42f4-9ee1-663b9a2c4955

Ok.

0 rows in set. Elapsed: 0.002 sec.

# look at the tables
cc1b062138da :) show tables

SHOW TABLES

Query id: ea2c6eb8-2036-4576-9bb7-74ad4e1c7de9

┌─name──────┐
│ hits_v1   │
│ visits_v1 │
└───────────┘

2 rows in set. Elapsed: 0.005 sec.
```
The datasets database in the result above is the dataset we imported. ClickHouse stores its data as plain files on disk; if you look at the local directory we mounted into the container, you can see the per-column files under data/data/datasets.
```
tree -L 3 data/data/datasets
data/data/datasets
├── hits_v1
│   ├── 201403_10_18_2
│   │   ├── AdvEngineID.bin
│   │   ├── AdvEngineID.mrk
│   │   ├── Age.bin
│   │   ├── Age.mrk
│   │   ├── ...
│   │   ├── UserID.bin
│   │   ├── UserID.mrk
│   │   ├── WatchID.bin
│   │   ├── WatchID.mrk
│   │   ├── YCLID.bin
│   │   ├── YCLID.mrk
│   │   ├── checksums.txt
│   │   ├── columns.txt
│   │   ├── count.txt
│   │   ├── minmax_EventDate.idx
│   │   ├── partition.dat
│   │   └── primary.idx
│   ├── detached
│   └── format_version.txt
├── visits_v1
│   ├── ...
│   ├── detached
│   └── format_version.txt
└── ...

6 directories, 675 files
```
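Besides inspecting the filesystem, you can ask ClickHouse itself how much space each table uses. This is a sketch using the standard system.parts table (the exact figures will differ slightly from the du output because of ClickHouse's own bookkeeping):

```sql
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'datasets'
GROUP BY database, table;
```

The same system.parts table also tells you how data is split into parts and partitions, which is handy later when reasoning about merges.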
To make subsequent typing easier, let's rename the tables first.

```
# Remove the version suffix from both tables
cc1b062138da :) rename table hits_v1 to hits

RENAME TABLE hits_v1 TO hits

Query id: dba1405a-1836-4d5a-af23-3ce0f5b31d41

Ok.

0 rows in set. Elapsed: 0.014 sec.

cc1b062138da :) rename table visits_v1 to visits

Query id: 9ffc039c-86c3-42a9-91a6-3ed165254e0b

Ok.

0 rows in set. Elapsed: 0.012 sec.

# Check the tables again
cc1b062138da :) show tables

SHOW TABLES

Query id: c7fda91f-b7c5-2244-9a6c-ce4cf37f9fa8

┌─name───┐
│ hits   │
│ visits │
└────────┘

2 rows in set. Elapsed: 0.004 sec.
```
After renaming the table, let’s see how much data is in the two tables.
```sql
SELECT
    hits,
    visits
FROM
(
    SELECT count() AS hits
    FROM hits
) AS table_hits,
(
    SELECT count() AS visits
    FROM visits
) AS table_visits
```
As you can see, both tables have very small amounts of data, in the millions to tens of millions.
```
┌────hits─┬──visits─┐
│ 8873898 │ 1676861 │
└─────────┴─────────┘

1 rows in set. Elapsed: 0.005 sec.
```
Then let's look at the structure of the two tables. With 133 and 181 columns respectively, they are proper "wide tables," well suited to analysis.
```
cc1b062138da :) desc hits

DESCRIBE TABLE hits

Query id: b8d8b650-2395-4207-b2ce-4200dc9c0fce

┌─name──────┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ WatchID   │ UInt64          │              │                    │         │                  │                │
│ JavaEnable│ UInt8           │              │                    │         │                  │                │
│ Title     │ String          │              │                    │         │                  │                │
│ EventDate │ Date            │              │                    │         │                  │                │
│ CounterID │ UInt32          │              │                    │         │                  │                │
│ ClientIP6 │ FixedString(16) │              │                    │         │                  │                │
│ RegionID  │ UInt32          │              │                    │         │                  │                │
│ UserID    │ UInt64          │              │                    │         │                  │                │
│ URL       │ String          │              │                    │         │                  │                │
│ Referer   │ String          │              │                    │         │                  │                │
│ ...       │ ...             │              │                    │         │                  │                │
│ IslandID  │ FixedString(16) │              │                    │         │                  │                │
└───────────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

133 rows in set. Elapsed: 0.003 sec.
```
And here is the other table:
```
cc1b062138da :) desc visits

DESCRIBE TABLE visits

Query id: 821e1692-b571-42ad-a38f-88bd120e478d

┌─name─────────┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ CounterID    │ UInt32          │              │                    │         │                  │                │
│ StartDate    │ Date            │              │                    │         │                  │                │
│ Sign         │ Int8            │              │                    │         │                  │                │
│ IsNew        │ UInt8           │              │                    │         │                  │                │
│ VisitID      │ UInt64          │              │                    │         │                  │                │
│ UserID       │ UInt64          │              │                    │         │                  │                │
│ Duration     │ UInt32          │              │                    │         │                  │                │
│ UTCStartTime │ DateTime        │              │                    │         │                  │                │
│ ...          │ ...             │              │                    │         │                  │                │
│ IslandID     │ FixedString(16) │              │                    │         │                  │                │
└──────────────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

181 rows in set. Elapsed: 0.004 sec.
```
Data query operation
Next, let’s do some basic query operations on the data in these two tables.
Grouped queries: getting the Top N records
A frequent operation in daily business is filtering the "hot" records out of a large dataset. Suppose we want to count visits to the site's sub-modules, identified by their CounterID; we can do it with the following statement:
```
cc1b062138da :) SELECT CounterID, count() AS nums FROM hits GROUP BY CounterID ORDER BY nums DESC LIMIT 10;

SELECT
    CounterID,
    count() AS nums
FROM hits
GROUP BY CounterID
ORDER BY nums DESC
LIMIT 10

Query id: 4206896d-e52e-48fc-bad1-3fe60949f90a

┌─CounterID─┬───nums─┐
│   1704509 │ 523264 │
│    732797 │ 475698 │
│    598875 │ 337212 │
│    792887 │ 252197 │
│   3807842 │ 196036 │
│  25703952 │ 147211 │
│    716829 │  90109 │
│     59183 │  85379 │
│  33010362 │  77807 │
│    800784 │  77492 │
└───────────┴────────┘

10 rows in set. Elapsed: 0.032 sec. Processed 8.87 million rows, 35.50 MB (277.95 million rows/s., 1.11 GB/s.)
```
Or, if we want to build a "trending searches" feature like those on Sina Weibo or Zhihu, we can display the top ten keywords and their search counts like this:
```
cc1b062138da :) SELECT SearchPhrase AS keyword, count() AS nums FROM hits GROUP BY keyword ORDER BY nums DESC, keyword LIMIT 10

SELECT
    SearchPhrase AS keyword,
    count() AS nums
FROM hits
GROUP BY keyword
ORDER BY
    nums DESC,
    keyword ASC
LIMIT 10

Query id: 62952a2b-858a-463d-b5d3-7af60b9f7e92

┌─keyword─────────────────┬────nums─┐
│                         │ 8267016 │
│ (Russian search phrase) │    3567 │
│ orton                   │    2402 │
│ (Russian search phrase) │    2166 │
│ imgsrc                  │    1848 │
│ (Russian search phrase) │    1659 │
│ (Russian search phrase) │    1549 │
│ (Russian search phrase) │    1480 │
│ (Russian search phrase) │    1247 │
│ (Russian search phrase) │    1112 │
└─────────────────────────┴─────────┘

10 rows in set. Elapsed: 0.096 sec. Processed 8.87 million rows, 112.70 MB (92.77 million rows/s., 1.18 GB/s.)
```

(Most of the top keywords in this Yandex.Metrica sample are in Russian; the top entry is the empty search phrase.)
Exact filtering by numeric values
Sometimes we need an exact numeric filter. For example, to count only the records that came through an advertising engine, we can filter on AdvEngineID != 0:
```
cc1b062138da :) SELECT count() FROM hits WHERE AdvEngineID != 0

SELECT count()
FROM hits
WHERE AdvEngineID != 0

Query id: e3bc420a-ef87-42a4-879e-bbf548d2e3a0

┌─count()─┐
│   30641 │
└─────────┘

1 rows in set. Elapsed: 0.013 sec. Processed 8.87 million rows, 8.87 MB (668.11 million rows/s., 668.11 MB/s.)
```
Data calculation
Beyond filtering, we often need to sum a column or take an average across all records. ClickHouse provides a number of high-performance aggregate functions for this:
```
cc1b062138da :) SELECT sum(AdvEngineID), count(), avg(ResolutionWidth) FROM hits

SELECT
    sum(AdvEngineID),
    count(),
    avg(ResolutionWidth)
FROM hits

Query id: 47aaef91-69c9-49ed-8e2d-9801ec243338

┌─sum(AdvEngineID)─┬─count()─┬─avg(ResolutionWidth)─┐
│           329039 │ 8873898 │   1400.8565027454677 │
└──────────────────┴─────────┴──────────────────────┘

1 rows in set. Elapsed: 0.042 sec. Processed 8.87 million rows, 26.62 MB (210.68 million rows/s., 632.05 MB/s.)
```
Data deduplication
Of course, sometimes the application side stores redundant data. To compute accurately, we need to deduplicate on some field before counting.
The following shows how to filter the data for CounterID 800784, group it by the RegionID field, and display the deduplicated user counts in descending order.
```
cc1b062138da :) SELECT RegionID, uniq(UserID) AS UID FROM hits WHERE CounterID = 800784 GROUP BY RegionID ORDER BY UID DESC, RegionID LIMIT 10

SELECT
    RegionID,
    uniq(UserID) AS UID
FROM hits
WHERE CounterID = 800784
GROUP BY RegionID
ORDER BY
    UID DESC,
    RegionID ASC
LIMIT 10

Query id: 02049c4f-78e9-402a-93ab-1586c5d7a406

┌─RegionID─┬─UID─┐
│      196 │ 559 │
│        3 │ 161 │
│      241 │ 147 │
│      207 │ 106 │
│      225 │  62 │
│        1 │  60 │
│       46 │  59 │
│       36 │  58 │
│      104 │  58 │
│       47 │  57 │
└──────────┴─────┘

10 rows in set. Elapsed: 0.011 sec. Processed 81.92 thousand rows, 1.31 MB (7.61 million rows/s., 121.83 MB/s.)
```
For a more complex example, you can use uniqIf to apply several deduplication conditions at once.
```
cc1b062138da :) SELECT uniq(UserID), uniqIf(UserID, CounterID = 800784), uniqIf(FUniqID, RegionID = 213) FROM hits

SELECT
    uniq(UserID),
    uniqIf(UserID, CounterID = 800784),
    uniqIf(FUniqID, RegionID = 213)
FROM hits

Query id: 76839d79-9551-4863-a423-e0c21b2193fe

┌─uniq(UserID)─┬─uniqIf(UserID, equals(CounterID, 800784))─┬─uniqIf(FUniqID, equals(RegionID, 213))─┐
│       120665 │                                      4057 │                                    106 │
└──────────────┴───────────────────────────────────────────┴────────────────────────────────────────┘

1 rows in set. Elapsed: 0.162 sec. Processed 8.87 million rows, 212.97 MB (54.87 million rows/s., 1.32 GB/s.)
```
Data sorting
Sorting data is also a common requirement. For example, here we show the ten most recently written records by event time.
```
cc1b062138da :) SELECT EventTime FROM hits ORDER BY EventTime DESC LIMIT 10

SELECT EventTime
FROM hits
ORDER BY EventTime DESC
LIMIT 10

Query id: fbdc1df9-0ff1-408f-b681-3a94be1e6a86

┌───────────EventTime─┐
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
│ 2014-03-24 03:59:59 │
└─────────────────────┘

10 rows in set. Elapsed: 0.174 sec. Processed 8.87 million rows, 35.50 MB (51.11 million rows/s., 204.43 MB/s.)
```
Compound query conditions and sorting
ClickHouse also has strong query support for structured data. You can filter on array-typed columns and sort the results directly with statements like the following:
```
cc1b062138da :) SELECT ParsedParams.Key1 FROM visits FINAL WHERE VisitID != 0 AND notEmpty(ParsedParams.Key1) ORDER BY VisitID LIMIT 10

SELECT ParsedParams.Key1
FROM visits
FINAL
WHERE (VisitID != 0) AND notEmpty(ParsedParams.Key1)
ORDER BY VisitID ASC
LIMIT 10

Query id: 05a3ae18-43a5-414a-9c8a-8b98c6cbce42

┌─ParsedParams.Key1───────────────────────────┐
│ ['Кнопка']                                  │
│ ['pageParams','pageParams']                 │
│ ['pageParams']                              │
│ ['gen_time']                                │
│ ['pageParams']                              │
│ ['pageParams']                              │
│ ['pageParams']                              │
│ ['pageParams']                              │
│ ['Права','gen_timestamp','Права', ...]      │
│ ['Марка','gen_time','Марка']                │
└─────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.307 sec. Processed 1.68 million rows, 62.63 MB (5.46 million rows/s., 203.83 MB/s.)
```
Range queries
Apart from plain mathematical comparison operators, bounded range queries generally come in two kinds. The first is the "whitelist" query:
```
cc1b062138da :) select sum(Sign) from visits where CounterID in (942285);

SELECT sum(Sign)
FROM visits
WHERE CounterID IN (942285)

Query id: 44b2f770-8916-46b9-b056-74b24b155e95

┌─sum(Sign)─┐
│    108133 │
└───────────┘

1 rows in set. Elapsed: 0.008 sec. Processed 114.69 thousand rows, 573.44 KB (14.42 million rows/s., 72.12 MB/s.)
```
The second kind is based on a given range, typically a time range. In ClickHouse, if we want to know what time span our data covers, we can simply do this:
```
cc1b062138da :) SELECT min(EventDate), max(EventDate) FROM hits

SELECT
    min(EventDate),
    max(EventDate)
FROM hits

Query id: dd1b7383-6846-4cf6-86cb-5217ffa1357f

┌─min(EventDate)─┬─max(EventDate)─┐
│     2014-03-17 │     2014-03-23 │
└────────────────┴────────────────┘

1 rows in set. Elapsed: 0.015 sec. Processed 8.87 million rows, 17.75 MB (599.25 million rows/s., 1.20 GB/s.)
```
Join queries
Joins are also a common requirement, and when it comes to joins, an example is worth a thousand words.
If we want to join a table of nearly nine million rows with one of nearly two million rows to compute, per date, the click PV and the visit UV, we can do this:
```
SELECT
    EventDate,
    hits,
    visits
FROM
(
    SELECT
        EventDate,
        count() AS hits
    FROM hits
    GROUP BY EventDate
)
ANY LEFT JOIN
(
    SELECT
        StartDate AS EventDate,
        sum(Sign) AS visits
    FROM visits
    GROUP BY EventDate
) USING (EventDate)
ORDER BY hits DESC
LIMIT 10
SETTINGS joined_subquery_requires_alias = 0

Query id: dbfee618-01b1-44bf-af8f-12a73c7ff300

┌──EventDate─┬────hits─┬─visits─┐
│ 2014-03-17 │ 1406958 │ 265108 │
│ 2014-03-19 │ 1405797 │ 261624 │
│ 2014-03-18 │ 1383658 │ 258723 │
│ 2014-03-20 │ 1353623 │ 255328 │
│ 2014-03-21 │ 1245779 │ 236232 │
│ 2014-03-23 │ 1046491 │ 202212 │
│ 2014-03-22 │ 1031592 │ 197354 │
└────────────┴─────────┴────────┘

7 rows in set. Elapsed: 0.051 sec. Processed 10.55 million rows, 22.78 MB (207.12 million rows/s., 447.17 MB/s.)
```
Subqueries
Subqueries are also very common and can eliminate many unnecessary manual steps, such as filtering a batch of users first and then computing their accesses:
cc1b062138da :) SELECT count() FROM hits PREWHERE UserID IN (SELECT UserID FROM hits WHERE CounterID = 800784);
SELECT count()
FROM hits
PREWHERE UserID IN
(
SELECT UserID
FROM hits
WHERE CounterID = 800784
)
Query id: 5d254281-8c9e-4593-9d8a-bdce24037d97

┌─count()─┐
│ 1956422 │
└─────────┘

1 rows in set. Elapsed: 0.032 sec. Processed 8.96 million rows, 71.97 MB (283.90 million rows/s., 2.28 GB/s.)
cc1b062138da :)
In the example above, we use PREWHERE to optimize the query: the filter condition is evaluated before the remaining columns are read, which can eliminate a large number of unnecessary reads. That said, if your data volume is small or your concurrency pressure is low, using plain WHERE is not a problem (as the sample timings show, with a hot cache the difference can even go the other way).
# PREWHERE
1 rows in set. Elapsed: 0.106 sec. Processed 8.96 million rows, 71.97 MB (84.79 million rows/s., 681.46 MB/s.)

# WHERE
1 rows in set. Elapsed: 0.026 sec. Processed 8.96 million rows, 71.97 MB (339.11 million rows/s., 2.73 GB/s.)
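The intuition behind PREWHERE can be sketched outside ClickHouse: evaluate the filter on a cheap column first, and only read the heavy columns for rows that pass. A hypothetical sketch (column names and data are illustrative, not from the actual dataset):

```python
# WHERE reads every heavy value; PREWHERE touches heavy columns only on match.

def query_where(counter_ids, urls, target):
    reads, hits = 0, 0
    for cid, url in zip(counter_ids, urls):  # heavy column read for every row
        reads += 1
        if cid == target:
            hits += 1
    return hits, reads

def query_prewhere(counter_ids, urls, target):
    reads, hits = 0, 0
    for i, cid in enumerate(counter_ids):    # filter on the cheap column first
        if cid == target:
            _ = urls[i]                      # heavy column read only on match
            reads += 1
            hits += 1
    return hits, reads

counter_ids = [800784, 1, 2, 800784, 3]
urls = ["u0", "u1", "u2", "u3", "u4"]
print(query_where(counter_ids, urls, 800784))    # (2, 5)
print(query_prewhere(counter_ids, urls, 800784)) # (2, 2)
```

With a highly selective filter, the saved reads dominate; with an unselective filter the two strategies converge, which is consistent with the timings above.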
Fuzzy string search
There are times when we need to do a quick fuzzy search, and ClickHouse also provides amazing performance:
cc1b062138da :) SELECT count() FROM hits WHERE URL LIKE '%avtomobili%';
SELECT count()
FROM hits
WHERE URL LIKE '%avtomobili%'
Query id: 93ea9afc-070e-47ae-885e-6b14c26315a5

┌─count()─┐
│   51354 │
└─────────┘

1 rows in set. Elapsed: 0.152 sec. Processed 8.87 million rows, 767.95 MB (58.44 million rows/s., 5.06 GB/s.)
Easter egg: drawing a lazy chart
Beyond all this, we sometimes want a chart to visualize differences in data magnitude, and this too can be handled with ClickHouse's built-in bar function.
cc1b062138da :) SELECT CounterID, count() AS c, bar(c, 0, 523264) FROM hits GROUP BY CounterID ORDER BY c DESC, CounterID ASC LIMIT 100;
SELECT
CounterID,
count() AS c,
bar(c, 0, 523264)
FROM hits
GROUP BY CounterID
ORDER BY
c DESC,
CounterID ASC
LIMIT 100
Query id: ee79be07-c93f-4aba-ad81-8d6938c99f96

┌─CounterID─┬──────c─┬─bar(count(), 0, 523264)──────────────────────────────────────────────────────────┐
│   1704509 │ 523264 │ ████████████████████████████████████████████████████████████████████████████████ │
│    732797 │ 475698 │ ████████████████████████████████████████████████████████████████████████▋        │
│    598875 │ 337212 │ ███████████████████████████████████████████████████▌                             │
│    792887 │ 252197 │ ██████████████████████████████████████▌                                          │
│   3807842 │ 196036 │ █████████████████████████████▊                                                   │
│  25703952 │ 147211 │ ██████████████████████▌                                                          │
│    716829 │  90109 │ █████████████▋                                                                   │
│     59183 │  85379 │ █████████████                                                                    │
…(middle rows truncated)…
│  21052498 │   9780 │ █▍                                                                               │
└───────────┴────────┴──────────────────────────────────────────────────────────────────────────────────┘

100 rows in set. Elapsed: 0.034 sec. Processed 8.87 million rows, 35.50 MB (259.87 million rows/s., 1.04 GB/s.)
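Roughly speaking, `bar(x, min, max, width)` scales `x` linearly from `[min, max]` into a bar of at most `width` glyphs (80 by default). A simplified sketch of the scaling; the real function also draws fractional eighth-blocks like ▋ and ▌, which this version omits:

```python
# Approximate sketch of ClickHouse's bar() scaling (whole blocks only).

def bar(x, lo, hi, width=80):
    filled = int(round((x - lo) / (hi - lo) * width))
    return "█" * max(0, min(width, filled))

print(bar(523264, 0, 523264))       # a full-width 80-block bar
print(len(bar(252197, 0, 523264)))  # 39
```

Handy for a quick eyeball comparison straight from the terminal, without exporting anything to a plotting tool.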
Importing data
Next, let's learn how to import data, taking MySQL and CSV as examples. (ClickHouse officially supports direct import and export of dozens of data formats: clickhouse.tech/docs/useful/int… — choose according to your actual needs and avoid unnecessary data conversion steps.)
Start MySQL database
To enable database instances to communicate with each other, we need to create a virtual container network:
docker network create dbtest
As above, save the following as docker-compose.yml (you can store it in a different directory; if you store it in the same directory as before, you will need to change the name of the data directory):
version: '3'

services:
  mysql:
    container_name: mysql
    image: mysql:5.7.25
    ports:
      - "13306:3306"
    environment:
      MYSQL_USER: test
      MYSQL_PASSWORD: test
      MYSQL_DATABASE: test
      MYSQL_ROOT_PASSWORD: test
    command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci --default-storage-engine=INNODB --transaction-isolation=READ-COMMITTED --binlog_format=row --max_allowed_packet=33554432 --sql_mode="STRICT_ALL_TABLES" --local-infile=0
    volumes:
      # - /etc/localtime:/etc/localtime:ro
      # - /etc/timezone:/etc/timezone:ro
      - ./data:/var/lib/mysql
    healthcheck:
      test: [ "CMD-SHELL", "mysqladmin ping -h localhost -p$$MYSQL_ROOT_PASSWORD" ]
    networks:
      - dbtest

networks:
  dbtest:
    external: true
Then bring the container up with docker-compose up -d.
Initialize test data
After downloading the test data mentioned above, uncompress it and you can see that the compression ratio is approximately 1:6 (1:10 in the earlier example).
unzip yyets_mysql.zip
du -hs yyets*
262M yyets.sql
41M yyets_mysql.zip
Move the data into a directory the database container can access (this way is recommended, as it makes comparing data volume before and after migration more intuitive):
mv yyets-mysql5.7.sql data
Or use the docker cp command to copy the data into the MySQL container’s data folder:
docker cp yyets.sql mysql:/var/lib/mysql-files/
Then use the docker exec command to log in to the database terminal inside the container and create an empty database.
docker exec -it mysql mysql -uroot -ptest

mysql> create database yyets;
Query OK, 1 row affected (0.01 sec)
Next, initialize the downloaded data as the source that ClickHouse will later import from.
docker exec -it mysql bash
# Import from mysql-files or the mysql directory, depending on where you put the data
mysql -uroot -ptest yyets < /var/lib/mysql-files/yyets.sql
# Ignore the following prompts
mysql: [Warning] Using a password on the command line interface can be insecure.
After a while the data is restored to the MySQL instance.
After the SQL file is imported, use du again to check the disk space. Compared with the SQL file before the import, the disk space has expanded by about 270MB (including the index).
du -hs data
530M data
Let’s go into the database again and look at the table and the total amount of data.
docker exec -it mysql mysql -uroot -ptest

# check the database list
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| sys |
| test |
| yyets |
+--------------------+
6 rows in set (0.03 sec)

# switch database
mysql> use yyets;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

# check the table list
mysql> show tables;
+-----------------+
| Tables_in_yyets |
+-----------------+
| yyets |
+-----------------+
1 row in set (0.00 sec)

# check the data volume
mysql> select count(*) from yyets;
+----------+
| count(*) |
+----------+
| 17579 |
+----------+
1 row in set (0.83 sec)
As you can see, there are fewer than 20,000 rows in the database — relatively small, but enough for our test.
Adjust the ClickHouse instance
To visualize the data storage differences, close the original ClickHouse instance, replace it with a clean data directory, and start a new ClickHouse instance.
However, we also need to make a simple adjustment to the ClickHouse configuration to add it to the same container network:
version: "2"

services:
  server:
    ...
    networks:
      - dbtest

networks:
  dbtest:
    external: true
I won’t go over the process as much as I did earlier, but let’s take note of how much disk space an empty ClickHouse directory takes up.
# run inside the clickhouse directory
du -hs *
2.0M data
4.0K docker-compose.yml
Import MySQL data into ClickHouse
In order to simplify data import and reduce unnecessary "data transformation" operations, we can obtain the schema by first creating an "online table" — a table mounted directly on MySQL.
- Official documentation of data types: clickhouse.tech/docs/useful/sql…
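When ClickHouse mounts a MySQL table, column types are converted automatically, and nullable MySQL columns become `Nullable(...)` wrappers — which is what the DESCRIBE output below shows. An illustrative (not exhaustive) sketch of that mapping, with type names taken from this article's tables:

```python
# Hypothetical sketch of MySQL -> ClickHouse type conversion; only the types
# appearing in this article are covered.

MYSQL_TO_CLICKHOUSE = {
    "int": "Int32",
    "bigint": "Int64",
    "varchar": "String",
    "text": "String",
}

def to_clickhouse(mysql_type, nullable=True):
    ch = MYSQL_TO_CLICKHOUSE[mysql_type]
    return f"Nullable({ch})" if nullable else ch

print(to_clickhouse("int", nullable=False))  # Int32 — like the `id` column
print(to_clickhouse("text"))                 # Nullable(String) — like `data`
```

Knowing the converted types up front saves you from hand-writing the offline table's DDL from scratch.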
Log in to the terminal using the container first:
docker exec -it clickhouse clickhouse-client
Then "mount" MySQL into ClickHouse:
981a7a68eb35 :) CREATE DATABASE yyets_online ENGINE = MySQL('mysql:3306', 'yyets', 'root', 'test')
CREATE DATABASE yyets_online
ENGINE = MySQL('mysql:3306', 'yyets', 'root', 'test')
Query id: 45e4570e-3536-4240-9775-fbc9d91ffdcc
Ok.
0 rows in set. Elapsed: 0.026 sec.
981a7a68eb35 :)
Then switch to this database and take a look at the tables it contains:
981a7a68eb35 :) use yyets_online;
USE yyets_online
Query id: 9b373b74-11a5-47be-8358-989522311e3e
Ok.
0 rows in set. Elapsed: 0.002 sec.
981a7a68eb35 :) show tables
SHOW TABLES
Query id: 04cf20ce-b656-441e-b414-a63225d498d1

┌─name──┐
│ yyets │
└───────┘

1 rows in set. Elapsed: 0.006 sec.
Moving on to the table structure, you see that the data types have been converted.
981a7a68eb35 :) desc yyets
DESCRIBE TABLE yyets
Query id: 650a4f24-fd8b-4290-b40e-cb59b78a926e

┌─name───────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id         │ Int32            │              │                    │         │                  │                │
│ url        │ Nullable(String) │              │                    │         │                  │                │
│ name       │ Nullable(String) │              │                    │         │                  │                │
│ expire     │ Nullable(Int32)  │              │                    │         │                  │                │
│ expire_cst │ Nullable(String) │              │                    │         │                  │                │
│ data       │ Nullable(String) │              │                    │         │                  │                │
└────────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

6 rows in set. Elapsed: 0.004 sec.
Go ahead and use the show create statement to get the command to create the current table structure.
981a7a68eb35 :) show create yyets
SHOW CREATE TABLE yyets
Query id: 85caee09-9bc0-4e7c-9ff2-bfab820c2c6b

┌─statement──────────────────────────────────────────────────┐
│ CREATE TABLE yyets_online.yyets
(
    `id` Nullable(Int32),
    `cnname` Nullable(String),
    `enname` Nullable(String),
    `aliasname` Nullable(String),
    `views` Nullable(Int32),
    `data` Nullable(String)
)
ENGINE = MySQL('mysql:3306', 'yyets', 'yyets', 'root', 'test') │
└────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.005 sec.
Then, by swapping out the engine section and specifying a sorting key (ORDER BY), we can create the offline table we will use for data computation.
981a7a68eb35 :) CREATE DATABASE yyets_offline;
CREATE DATABASE yyets_offline
Query id: edab3813-f340-4a48-86d7-f12c6ad3297e
Ok.
0 rows in set. Elapsed: 0.019 sec.
981a7a68eb35 :) CREATE TABLE yyets_offline.yyets
:-] (
:-]     `id` Int32,
:-] `url` Nullable(String),
:-] `name` Nullable(String),
:-] `expire` Nullable(Int32),
:-] `expire_cst` Nullable(String),
:-] `data` Nullable(String)
:-] ) ENGINE = MergeTree()
:-] ORDER BY `id`
CREATE TABLE yyets_offline.yyets
(
`id` Int32,
`url` Nullable(String),
`name` Nullable(String),
`expire` Nullable(Int32),
`expire_cst` Nullable(String),
`data` Nullable(String)
)
ENGINE = MergeTree
ORDER BY id
Query id: b8371553-683a-418f-9593-b4da49fd5517
Ok.
0 rows in set. Elapsed: 0.031 sec.
981a7a68eb35 :)
Finally, use the INSERT INTO statement to import the data from the online table into the offline table.
981a7a68eb35 :) INSERT INTO yyets_offline.yyets SELECT * FROM yyets_online.yyets
INSERT INTO yyets_offline.yyets SELECT *
FROM yyets_online.yyets
Query id: fe5eec24-ae0d-4deb-abba-76df242ae8a9
Ok.
0 rows in set. Elapsed: 5.276 sec. Processed 17.58 thousand rows, 244.68 MB (3.33 thousand rows/s., 46.37 MB/s.)
After the import is complete, we can first do a simple query to confirm whether the data transfer is complete.
SELECT
online,
offline
FROM
(
SELECT count() AS online
FROM yyets_online.yyets
) AS table1
,
(
SELECT count() AS offline
FROM yyets_offline.yyets
) AS table2
Query id: c0edda99-8491-4dd7-bcbc-0323b986553e

┌─online─┬─offline─┐
│  17579 │   17579 │
└────────┴─────────┘

1 rows in set. Elapsed: 0.032 sec. Processed 17.52 thousand rows, 74.18 KB (542.94 thousand rows/s., 2.30 MB/s.)
You can see that the two tables have the same amount of data, indicating that the data import is complete. Feel free to try a simple keyword match.
466b62ff5dae :) SELECT id, name FROM yyets_offline.`yyets` WHERE `data` LIKE '%漫威%' limit 10
SELECT
id,
name
FROM yyets_offline.yyets
WHERE data LIKE '%漫威%'
LIMIT 10
Query id: 1d1a3884-b73b-4afd-b850-418e8b218e74

┌────id─┬─name─────────────────────┐
│ 29620 │ Marvel Phase Two Preview │
└───────┴──────────────────────────┘
┌────id─┬─name───────────────────────────────┐
│ 30574 │ Marvel One-Shot: Agent Carter      │
│ 31794 │ Marvel One-Shot: All Hail The King │
│ 33101 │ Marvel 75 Years: From Pulp to Pop! │
│ 33108 │ Marvel One-Shots                   │
│ 34471 │ Marvel                             │
│ 35159 │ Iron Fist                          │
│ 35506 │ The Defenders                      │
│ 35565 │ Spider-Man                         │
│ 35738 │ The Punisher                       │
└───────┴────────────────────────────────────┘

10 rows in set. Elapsed: 0.277 sec. Processed 17.58 thousand rows, 243.35 MB (63.52 thousand rows/s., 879.26 MB/s.)
After the data is imported, we clean up by dropping the mounted connection to the remote online database.
981a7a68eb35 :) drop database yyets_online
DROP DATABASE yyets_online
Query id: 31c82654-a15f-48ba-b27e-e7fa67896ecb
Ok.
0 rows in set. Elapsed: 0.007 sec.
Finally, use du to take a look at the disk space ClickHouse is using.
du -hs data
77M data
At 77M, ClickHouse's storage footprint is very light compared with MySQL's 530M for the same data.
CSV Data Import
In general, importing CSV data is very simple, but real-world CSV files are full of "non-standard data": stray tags, loose escaping, numeric fields mixed with strings, embedded newlines, fields containing "unexpected data", and so on. In such cases we need to "repair" the data first. Because the data differs completely from scenario to scenario, there is no universal one-size-fits-all solution; but speaking from experience, a small cleaning tool written in Golang or Node runs remarkably fast, and interested readers can give it a try.
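The cleaning tool described above can be sketched in a few lines — shown here in Python for brevity, though the same idea applies in Golang or Node. The field layout and repair rules are hypothetical: normalize embedded newlines, drop malformed rows, and re-emit strictly quoted CSV that a bulk loader can ingest:

```python
# Minimal CSV "repair" sketch: fix embedded newlines, drop rows with the
# wrong column count, re-emit strictly quoted rows.
import csv
import io

def clean(raw_text, expected_cols):
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(raw_text)):
        if len(row) != expected_cols:
            continue  # drop malformed rows instead of corrupting the import
        # strip embedded newlines that would break line-oriented tooling
        writer.writerow(cell.replace("\n", " ").strip() for cell in row)
    return out.getvalue()

raw = '1,"hello\nworld",ok\n2,broken\n3,"fine",ok\n'
print(clean(raw, 3))
# "1","hello world","ok"
# "3","fine","ok"
```

For large files, the same logic works as a streaming filter over stdin/stdout, which slots directly into the pipe-based import shown below.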
Back to "data import": depending on whether a pipeline step is needed, there are usually two ways to load CSV data. If your data is already pre-processed, you can import it directly:
clickhouse-client --format_csv_delimiter="|" --format_csv_allow_double_quotes=0 --format_csv_allow_single_quotes=0 --query="INSERT INTO yyets_csv.resource FORMAT CSV" < /var/lib/clickhouse/dump.csv
However, if your data needs to be transformed with a pipeline tool, you can import it in a way similar to the following, using awk, sed, or your own tools:
cat /var/lib/clickhouse/dump.csv | clickhouse-client --format_csv_delimiter="|" --format_csv_allow_double_quotes=0 --format_csv_allow_single_quotes=0 --query="INSERT INTO yyets_csv.resource FORMAT CSV"
Wrapping up
Before I knew it, I had written nearly ten thousand words, so let's stop here for now; when there is time, I'll write about the pitfalls encountered in actual use.
–EOF
We have a small group of a few hundred people who enjoy tinkering.
With no ads, we chat about software and hardware, HomeLab, and programming problems, and from time to time share technical-salon information in the group.
Friends who like to tinker are welcome to scan the code to add me as a friend. (When adding, please note your real name, source, and purpose; otherwise the request will not be approved.)
If you think the content is still practical, welcome to share it with your friends. Thank you.
This article is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Author: Su Yang
Created: October 16, 2021 · Word count: 33,858 · Estimated reading time: 68 minutes · Permalink: soulteary.com/2021/10/16/…