The full text takes about 30 minutes to read. If you only care about the results, jump straight to the end, which includes the benchmark method and scripts.
MySQL has been very stable in recent years, and its common architectures are mature. But one new business reality is that data keeps growing, and MySQL struggles to support analytical scenarios, which is why so many HTAP architectures have appeared. If the new HTAP options run over your budget, Clickhouse and Databend can support these workloads instead. This article compares the performance of Clickhouse and Databend on object storage; both products support S3 as a storage backend and allocate storage on demand.
Clickhouse is known as one of the fastest databases on the planet for wide-table analytics. It has been working on cloud native since last year, with support for S3 as a new storage option.
Databend is a new cloud-native object-storage-based database that delivers low cost, high performance, and elastic scaling.
Documentation: databend.rs/doc (includes a Databend architecture diagram)
Databend repo: github.com/datafuselab… (stars welcome)
First we need to ask: what is cloud native? A cloud-native service should satisfy the following:
• No hardware management or configuration is required
• No need to do software installation and management
• No need to worry about fault management, upgrades and optimizations
• Supports rapid and flexible capacity expansion and reduction
• Pay only for the storage and compute actually used; no charges when there are no business requests
• No need to worry much about the resources underneath
Measure MySQL running on the cloud against the conditions above, and you will find that users still have to handle many upgrade, configuration, optimization, and failure issues. Likewise, the cloud platform's MySQL support staff carry a heavy workload around failures, upgrades, and optimization. On top of that, MySQL on the cloud cannot scale in seconds or charge only for the resources actually used. Cloud native strives in this direction, so that users can live more relaxed and happy lives.
To be more specific, are there any cloud-native products out there already? Of course there are, and they set a standard for us: CockroachDB Cloud and PlanetScale in the database world, Snowflake in the data warehouse world. Databend is working toward the same goal.
Why does Databend use S3 object storage?
For database developers, building a proprietary storage engine can be an appealing pursuit. But when designing Databend, we asked the following of its storage layer:
• Supports high availability
• No need to manage the number of replicas
• Available across multiple IDCs, with multi-cloud switching
• Supports global data sharing and distribution
• No reserved space to worry about; pay only for the space you use
• Supports concurrent reads and writes on the same data from multiple clusters, with snapshot-level isolation
• Complete transaction support
• No backups to manage; supports flashback to any point within a specified window (at table and database granularity)
After evaluating the requirements above, we found that object storage on the cloud was exactly what we needed. All writes are incremental, and delete and drop can be handled lazily under snapshot isolation. So we decided to build a cloud-native data store on object storage. Currently, Databend supports AWS S3, Qcloud COS, Alibaba Cloud OSS, MinIO, and other S3-compatible products. For more deployment methods, see databend.rs/doc/categor…
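If you want to try this without a cloud account, any S3-compatible endpoint works. A minimal sketch (not the setup used in this test; the credentials are placeholders) running MinIO in Docker:

```bash
# Hypothetical local stand-in for S3. Map MinIO's port to 9900 so it
# does not collide with Clickhouse's native port 9000 used later on.
docker run -d --name minio -p 9900:9000 \
  -e "MINIO_ROOT_USER=minioadmin" \
  -e "MINIO_ROOT_PASSWORD=minioadmin" \
  minio/minio server /data
```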
Test steps
We requested a c5n.9xlarge machine on AWS: 36 cores, 72 GB memory, 200 GB disk (used only to store the ontime data).
– System: Ubuntu 20
– Clickhouse: 22.2.3 (installed per the official website)
– Databend: the current binary release from GitHub
github.com/datafuselab…
Data download
wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
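An optional sanity check before unzipping: the two brace ranges expand to 35 years × 12 months = 420 archives, so you can confirm they all arrived:

```bash
# Count the downloaded archives; expect 420 for 1987..2021 x 1..12
ls On_Time_Reporting_Carrier_On_Time_Performance_1987_present_*.zip | wc -l
```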
Clickhouse configuration and table structure
Add storage.xml under /etc/clickhouse-server/config.d:
<yandex>
<storage_configuration>
<disks>
<s3>
<type>s3</type>
<endpoint>https://databend-shared.s3.us-east-2.amazonaws.com/ch-data-s3/</endpoint>
<access_key_id>your-key-id</access_key_id>
<secret_access_key>your-key</secret_access_key>
<cache_enabled>true</cache_enabled>
</s3>
</disks>
<policies>
<s3>
<volumes>
<main>
<disk>s3</disk>
</main>
</volumes>
</s3>
</policies>
</storage_configuration>
</yandex>
Replace your-key-id and your-key above with the values from your real environment. For Clickhouse data import and table structure, see: clickhouse.com/docs/en/get…
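After adding the file, restart clickhouse-server and confirm the new disk is visible. A hedged check (system.disks is a standard system table in this Clickhouse version, though column sets can vary across releases):

```bash
sudo systemctl restart clickhouse-server
# The s3 disk should appear alongside the default local disk
clickhouse-client --query "SELECT name, type FROM system.disks"
```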
The Clickhouse ontime table structure needs one modification: add storage_policy='s3' at the end:
CREATE TABLE `ontime` ( ... ) ENGINE = MergeTree PARTITION BY Year ORDER BY (IATA_CODE_Reporting_Airline, FlightDate) SETTINGS index_granularity = 8192, storage_policy='s3';
This stores the data on S3, while the metadata remains local to Clickhouse. You can verify the placement as sketched below.
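A sketch of that check, querying system.parts (disk_name and bytes_on_disk are standard columns in recent Clickhouse releases):

```bash
# Parts of the ontime table should report disk_name = 's3'
clickhouse-client --query "
  SELECT disk_name, count() AS parts,
         formatReadableSize(sum(bytes_on_disk)) AS size
  FROM system.parts
  WHERE table = 'ontime' AND active
  GROUP BY disk_name"
```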
Databend configuration and table structure
Databend configuration reference: databend.rs/doc/deploy/… Ensure that Databend and Clickhouse point to the same bucket.
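As a rough sketch of what the storage section of databend-query.toml looks like (the key names follow the 2022-era docs and may differ between Databend versions; bucket and credentials are placeholders, so check the linked deploy doc for the exact keys):

```bash
# Hypothetical snippet appended to the query node's config
cat >> databend-query.toml <<'EOF'
[storage]
type = "s3"

[storage.s3]
bucket = "databend-shared"
endpoint_url = "https://s3.us-east-2.amazonaws.com"
access_key_id = "your-key-id"
secret_access_key = "your-key"
EOF
```

Databend table structure (create_ontime.sql):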
CREATE TABLE ontime
(
Year UInt16 NOT NULL,
Quarter UInt8 NOT NULL,
Month UInt8 NOT NULL,
DayofMonth UInt8 NOT NULL,
DayOfWeek UInt8 NOT NULL,
FlightDate Date NOT NULL,
Reporting_Airline String NOT NULL,
DOT_ID_Reporting_Airline Int32 NOT NULL,
IATA_CODE_Reporting_Airline String NOT NULL,
Tail_Number String NOT NULL,
Flight_Number_Reporting_Airline String NOT NULL,
OriginAirportID Int32 NOT NULL,
OriginAirportSeqID Int32 NOT NULL,
OriginCityMarketID Int32 NOT NULL,
Origin String NOT NULL,
OriginCityName String NOT NULL,
OriginState String NOT NULL,
OriginStateFips String NOT NULL,
OriginStateName String NOT NULL,
OriginWac Int32 NOT NULL,
DestAirportID Int32 NOT NULL,
DestAirportSeqID Int32 NOT NULL,
DestCityMarketID Int32 NOT NULL,
Dest String NOT NULL,
DestCityName String NOT NULL,
DestState String NOT NULL,
DestStateFips String NOT NULL,
DestStateName String NOT NULL,
DestWac Int32 NOT NULL,
CRSDepTime Int32 NOT NULL,
DepTime Int32 NOT NULL,
DepDelay Int32 NOT NULL,
DepDelayMinutes Int32 NOT NULL,
DepDel15 Int32 NOT NULL,
DepartureDelayGroups String NOT NULL,
DepTimeBlk String NOT NULL,
TaxiOut Int32 NOT NULL,
WheelsOff Int32 NOT NULL,
WheelsOn Int32 NOT NULL,
TaxiIn Int32 NOT NULL,
CRSArrTime Int32 NOT NULL,
ArrTime Int32 NOT NULL,
ArrDelay Int32 NOT NULL,
ArrDelayMinutes Int32 NOT NULL,
ArrDel15 Int32 NOT NULL,
ArrivalDelayGroups Int32 NOT NULL,
ArrTimeBlk String NOT NULL,
Cancelled UInt8 NOT NULL,
CancellationCode String NOT NULL,
Diverted UInt8 NOT NULL,
CRSElapsedTime Int32 NOT NULL,
ActualElapsedTime Int32 NOT NULL,
AirTime Int32 NOT NULL,
Flights Int32 NOT NULL,
Distance Int32 NOT NULL,
DistanceGroup UInt8 NOT NULL,
CarrierDelay Int32 NOT NULL,
WeatherDelay Int32 NOT NULL,
NASDelay Int32 NOT NULL,
SecurityDelay Int32 NOT NULL,
LateAircraftDelay Int32 NOT NULL,
FirstDepTime String NOT NULL,
TotalAddGTime String NOT NULL,
LongestAddGTime String NOT NULL,
DivAirportLandings String NOT NULL,
DivReachedDest String NOT NULL,
DivActualElapsedTime String NOT NULL,
DivArrDelay String NOT NULL,
DivDistance String NOT NULL,
Div1Airport String NOT NULL,
Div1AirportID Int32 NOT NULL,
Div1AirportSeqID Int32 NOT NULL,
Div1WheelsOn String NOT NULL,
Div1TotalGTime String NOT NULL,
Div1LongestGTime String NOT NULL,
Div1WheelsOff String NOT NULL,
Div1TailNum String NOT NULL,
Div2Airport String NOT NULL,
Div2AirportID Int32 NOT NULL,
Div2AirportSeqID Int32 NOT NULL,
Div2WheelsOn String NOT NULL,
Div2TotalGTime String NOT NULL,
Div2LongestGTime String NOT NULL,
Div2WheelsOff String NOT NULL,
Div2TailNum String NOT NULL,
Div3Airport String NOT NULL,
Div3AirportID Int32 NOT NULL,
Div3AirportSeqID Int32 NOT NULL,
Div3WheelsOn String NOT NULL,
Div3TotalGTime String NOT NULL,
Div3LongestGTime String NOT NULL,
Div3WheelsOff String NOT NULL,
Div3TailNum String NOT NULL,
Div4Airport String NOT NULL,
Div4AirportID Int32 NOT NULL,
Div4AirportSeqID Int32 NOT NULL,
Div4WheelsOn String NOT NULL,
Div4TotalGTime String NOT NULL,
Div4LongestGTime String NOT NULL,
Div4WheelsOff String NOT NULL,
Div4TailNum String NOT NULL,
Div5Airport String NOT NULL,
Div5AirportID Int32 NOT NULL,
Div5AirportSeqID Int32 NOT NULL,
Div5WheelsOn String NOT NULL,
Div5TotalGTime String NOT NULL,
Div5LongestGTime String NOT NULL,
Div5WheelsOff String NOT NULL,
Div5TailNum String NOT NULL
);
Data loading script:
cat load_ontime.sh
echo "unzip ontime ,input your ontime zip dir: ./load_ontime.sh zip_dir" ls $1/*.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {} '*.csv' -d ./dataset" if [ $? -eq 0 ]; then echo "unzip success" else echo "unzip was wrong!!!" The exit 1 fi cat create_ontime. SQL | mysql - h127.0.0.1 - P3307 - uroot if [$? - eq 0]; then echo "Ontime table create success" else echo "Ontime table create was wrong!!!" exit 1 fi time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8081/v1/streaming_loadCopy the code
# The mysql client is required: sudo apt-get install mysql-client
chmod +x load_ontime.sh
Run ./load_ontime.sh with the directory holding the ontime zip files, and the data will be loaded.
An interesting note: Clickhouse has no transaction support, so it may end up with different data under different concurrent loads; it is worth comparing row counts after loading, as sketched below.
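A simple way to confirm both systems loaded the same data (ports match the ones used elsewhere in this article):

```bash
# Databend, via its MySQL-protocol port
echo "SELECT count(*) FROM ontime" | mysql -h127.0.0.1 -P3307 -uroot
# Clickhouse, via clickhouse-client
echo "SELECT count(*) FROM ontime" | clickhouse-client --host 127.0.0.1 --port 9000
```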
The test script
hyperfine needs to be installed separately:
wget https://github.com/sharkdp/hyperfine/releases/download/v1.13.0/hyperfine_1.13.0_amd64.deb
sudo dpkg -i hyperfine_1.13.0_amd64.deb
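A quick smoke test of hyperfine with the same flags the scripts below use (-w warmup rounds, -r measured runs, markdown export; the output path is arbitrary):

```bash
# Benchmark a trivial query: 3 warmups, 10 measured runs, markdown report
hyperfine -w 3 -r 10 --export-markdown /tmp/demo.md \
  'echo "SELECT 1" | mysql -h127.0.0.1 -P3307 -uroot'
cat /tmp/demo.md
```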
Benchmark script:
cat run_ontime.sh
#!/bin/bash
cat << EOF > bench.sql
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year<=2008 GROUP BY Carrier ORDER BY c3 DESC;
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
SELECT Year, count(*) AS c1 FROM ontime GROUP BY Year;
SELECT avg(cnt) FROM (SELECT Year, Month, count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year, Month) a;
SELECT avg(c1) FROM (SELECT Year, Month, count(*) AS c1 FROM ontime GROUP BY Year, Month) a;
SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
EOF

WARMUP=3
RUN=10

script=""
function run() {
    port=$1
    result=$2
    script="hyperfine -w $WARMUP -r $RUN"
    i=0
    # Wrap each query in its own hyperfine command
    while read SQL; do
        f=/tmp/bench_${i}.sql
        echo "$SQL" > $f
        # For Clickhouse use: s="cat $f | clickhouse-client --host 127.0.0.1 --port $port"
        s="cat $f | mysql -h127.0.0.1 -P$port -uroot -s"
        script="$script '$s'"
        i=$[i+1]
    done < ./bench.sql
    script="$script --export-markdown $result"
    echo $script | bash -x
}

run "3307" "$1"
echo "select version() as version" | mysql -h127.0.0.1 -P3307 -uroot >> $result
For Clickhouse, copy run_ontime.sh to ch_run.sh and modify the run part:
script="" function run() { port=$1 result=$2 script="hyperfine -w $WARMUP -r $RUN" i=0 while read SQL; Do f = / TMP/bench_ ${I}. The SQL echo "$SQL" > $f s = "cat $f | clickhouse - the client - host 127.0.0.1 - port $port" script = "$script '$s'" i=$[i+1] done <<< $(cat bench.sql) script="$script --export-markdown $result" echo $script | bash -x } run "9000" "$1"Copy the code
Usage:
./run_ontime.sh D20220322.md
./ch_run.sh C20220322.md
Finally, compare the two resulting markdown files.
Comparing the results
Query | Clickhouse on S3 (ms) | Databend on S3 (ms) |
---|---|---|
Q1 | 498.2 | 186.6 |
Q2 | 682.1 | 247.2 |
Q3 | 620.7 | 354.7 |
Q4 | 269.6 | 125.1 |
Q5 | 160 | 146.6 |
Q6 | 694.3 | 371.3 |
Q7 | 699.9 | 389.2 |
Q8 | 994.9 | 524.9 |
Q9 | 35.9 | 372.1 |
Q10 | 1484.6 | 521.2 |
Q11 | 741.2 | 439.5 |
Q12 | 1945 | 2898.1 |
Q13 | 1129 | 1183.1 |
Graphical comparison
(The original post includes a bar chart of the per-query timings from the table above; it is not reproduced here.)
The results above show that Clickhouse clearly outperforms Databend only on Q9 (the SELECT Year, count(*) ... GROUP BY Year query), with smaller wins on Q12 and Q13. Analysis shows that Clickhouse answers Q9 directly from dictionary data, which gives Databend a clear direction for optimizing Q9.
Conclusion
In the object-storage direction, Databend currently outperforms Clickhouse overall in wide-table computing power. In essence, Databend's performance now also surpasses Snowflake's. If that interests you, keep an eye out for our upcoming Meetup.
If you are interested in using Databend, you can learn more about it by following the links below:
Databend on MinIO: databend.rs/doc/deploy/…
Databend on COS: databend.rs/doc/deploy/…
Databend on AWS S3: databend.rs/doc/deploy/…
Databend vectorization: databend.rs/doc/perform…
You can also add WeChat: 82565387 for support; use the password: Databend.