The full text takes about 30 minutes to read. If you only care about the results, jump straight to the end, which includes the benchmark method and scripts.
MySQL has been very stable in recent years, and its common architectures are mature. But one new business reality is that data keeps growing, and MySQL struggles to support analytical scenarios, which is why so many HTAP architectures have appeared. If the new HTAP options run over your budget, Clickhouse and Databend can support these workloads instead. This article compares the performance of Clickhouse and Databend on object storage; both products support S3 as a storage backend and allocate storage on demand.
Clickhouse is known as one of the fastest databases on the planet for wide-table analytics. It has been working on cloud native since last year, with support for S3 as a new storage option.
Databend is a new cloud-native object-storage-based database that delivers low cost, high performance, and elastic scaling.
Documentation: databend.rs/doc (includes a Databend architecture diagram)
Databend repo: github.com/datafuselab… (stars welcome)
First we need to ask: what is cloud native? A cloud-native service should satisfy the following:
• No hardware management or configuration is required
• No need to do software installation and management
• No need to worry about fault management, upgrades and optimizations
• Supports rapid and flexible capacity expansion and reduction
• Pay only for the storage and compute actually used; no charges when there are no business requests
• No need to worry much about the resources underneath
Measure MySQL running on the cloud against the conditions above, and you will find that users still have to handle many upgrade, configuration, optimization, and failure issues. Likewise, the cloud platform's MySQL support staff carry a heavy workload around failures, upgrades, and optimization. On top of that, MySQL on the cloud cannot scale in seconds or charge only for the resources actually used. Cloud native strives in this direction, so that users can live more relaxed and happy lives.
To be more specific, are there any cloud-native products out there already? Of course there are, and they set a standard for us: CockroachDB Cloud and PlanetScale in the database world, Snowflake in the data warehouse world. Databend is working toward the same goal.
Why does Databend use S3 object storage?
For database developers, building a proprietary storage engine can be an appealing pursuit. But when designing Databend, we asked the following of its storage layer:
• Supports high availability
• No need to manage the number of replicas
• Available across multiple IDCs, with multi-cloud switching
• Supports global data sharing and distribution
• No reserved space to worry about; pay only for the space you use
• Supports concurrent reads and writes on the same data from multiple clusters, with snapshot-level isolation
• Complete transaction support
• No backups to manage; supports flashback to any point within a specified window (at table and database granularity)
After evaluating the requirements above, we found that object storage on the cloud was exactly what we needed. All writes are incremental, and delete and drop can be handled lazily under snapshot isolation. So we decided to build a cloud-native data store on object storage. Currently, Databend supports AWS S3, Qcloud COS, Alibaba Cloud OSS, MinIO, and other S3-compatible products. For more deployment methods, see databend.rs/doc/categor…
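If you want to try this without a cloud account, any S3-compatible endpoint works. A minimal sketch (not the setup used in this test; the credentials are placeholders) running MinIO in Docker:

```bash
# Hypothetical local stand-in for S3. Map MinIO's port to 9900 so it
# does not collide with Clickhouse's native port 9000 used later on.
docker run -d --name minio -p 9900:9000 \
  -e "MINIO_ROOT_USER=minioadmin" \
  -e "MINIO_ROOT_PASSWORD=minioadmin" \
  minio/minio server /data
```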
Test steps
We requested a c5n.9xlarge machine on AWS: 36 cores, 72 GB memory, 200 GB disk (used only to store the ontime data).
– System: Ubuntu 20
– Clickhouse: 22.2.3 (installed per the official website)
– Databend: the current binary release from GitHub
github.com/datafuselab…
Data download
wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
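An optional sanity check before unzipping: the two brace ranges expand to 35 years × 12 months = 420 archives, so you can confirm they all arrived:

```bash
# Count the downloaded archives; expect 420 for 1987..2021 x 1..12
ls On_Time_Reporting_Carrier_On_Time_Performance_1987_present_*.zip | wc -l
```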
Clickhouse configuration and table structure
Add storage.xml under /etc/clickhouse-server/config.d:
<yandex>
<storage_configuration>
<disks>
<s3>
<type>s3</type>
<endpoint>https://databend-shared.s3.us-east-2.amazonaws.com/ch-data-s3/</endpoint>
<access_key_id>your-key-id</access_key_id>
<secret_access_key>your-key</secret_access_key>
<cache_enabled>true</cache_enabled>
</s3>
</disks>
<policies>
<s3>
<volumes>
<main>
<disk>s3</disk>
</main>
</volumes>
</s3>
</policies>
</storage_configuration>
</yandex>
Replace your-key-id and your-key above with the values from your real environment. For Clickhouse data import and table structure, see: clickhouse.com/docs/en/get…
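After adding the file, restart clickhouse-server and confirm the new disk is visible. A hedged check (system.disks is a standard system table in this Clickhouse version, though column sets can vary across releases):

```bash
sudo systemctl restart clickhouse-server
# The s3 disk should appear alongside the default local disk
clickhouse-client --query "SELECT name, type FROM system.disks"
```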
The Clickhouse ontime table structure needs one modification: add storage_policy='s3' at the end:
CREATE TABLE `ontime` ( ... ) ENGINE = MergeTree PARTITION BY Year ORDER BY (IATA_CODE_Reporting_Airline, FlightDate) SETTINGS index_granularity = 8192, storage_policy='s3';
This stores the data on S3, while the metadata remains local to Clickhouse. You can verify the placement as sketched below.
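A sketch of that check, querying system.parts (disk_name and bytes_on_disk are standard columns in recent Clickhouse releases):

```bash
# Parts of the ontime table should report disk_name = 's3'
clickhouse-client --query "
  SELECT disk_name, count() AS parts,
         formatReadableSize(sum(bytes_on_disk)) AS size
  FROM system.parts
  WHERE table = 'ontime' AND active
  GROUP BY disk_name"
```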
Databend configuration and table structure
Databend configuration reference: databend.rs/doc/deploy/… Ensure that Databend and Clickhouse point to the same bucket.
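As a rough sketch of what the storage section of databend-query.toml looks like (the key names follow the 2022-era docs and may differ between Databend versions; bucket and credentials are placeholders, so check the linked deploy doc for the exact keys):

```bash
# Hypothetical snippet appended to the query node's config
cat >> databend-query.toml <<'EOF'
[storage]
type = "s3"

[storage.s3]
bucket = "databend-shared"
endpoint_url = "https://s3.us-east-2.amazonaws.com"
access_key_id = "your-key-id"
secret_access_key = "your-key"
EOF
```

Databend table structure (create_ontime.sql):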
CREATE TABLE ontime
(
Year UInt16 NOT NULL,
Quarter UInt8 NOT NULL,
Month UInt8 NOT NULL,
DayofMonth UInt8 NOT NULL,
DayOfWeek UInt8 NOT NULL,
FlightDate Date NOT NULL,
Reporting_Airline String NOT NULL,
DOT_ID_Reporting_Airline Int32 NOT NULL,
IATA_CODE_Reporting_Airline String NOT NULL,
Tail_Number String NOT NULL,
Flight_Number_Reporting_Airline String NOT NULL,
OriginAirportID Int32 NOT NULL,
OriginAirportSeqID Int32 NOT NULL,
OriginCityMarketID Int32 NOT NULL,
Origin String NOT NULL,
OriginCityName String NOT NULL,
OriginState String NOT NULL,
OriginStateFips String NOT NULL,
OriginStateName String NOT NULL,
OriginWac Int32 NOT NULL,
DestAirportID Int32 NOT NULL,
DestAirportSeqID Int32 NOT NULL,
DestCityMarketID Int32 NOT NULL,
Dest String NOT NULL,
DestCityName String NOT NULL,
DestState String NOT NULL,
DestStateFips String NOT NULL,
DestStateName String NOT NULL,
DestWac Int32 NOT NULL,
CRSDepTime Int32 NOT NULL,
DepTime Int32 NOT NULL,
DepDelay Int32 NOT NULL,
DepDelayMinutes Int32 NOT NULL,
DepDel15 Int32 NOT NULL,
DepartureDelayGroups String NOT NULL,
DepTimeBlk String NOT NULL,
TaxiOut Int32 NOT NULL,
WheelsOff Int32 NOT NULL,
WheelsOn Int32 NOT NULL,
TaxiIn Int32 NOT NULL,
CRSArrTime Int32 NOT NULL,
ArrTime Int32 NOT NULL,
ArrDelay Int32 NOT NULL,
ArrDelayMinutes Int32 NOT NULL,
ArrDel15 Int32 NOT NULL,
ArrivalDelayGroups Int32 NOT NULL,
ArrTimeBlk String NOT NULL,
Cancelled UInt8 NOT NULL,
CancellationCode String NOT NULL,
Diverted UInt8 NOT NULL,
CRSElapsedTime Int32 NOT NULL,
ActualElapsedTime Int32 NOT NULL,
AirTime Int32 NOT NULL,
Flights Int32 NOT NULL,
Distance Int32 NOT NULL,
DistanceGroup UInt8 NOT NULL,
CarrierDelay Int32 NOT NULL,
WeatherDelay Int32 NOT NULL,
NASDelay Int32 NOT NULL,
SecurityDelay Int32 NOT NULL,
LateAircraftDelay Int32 NOT NULL,
FirstDepTime String NOT NULL,
TotalAddGTime String NOT NULL,
LongestAddGTime String NOT NULL,
DivAirportLandings String NOT NULL,
DivReachedDest String NOT NULL,
DivActualElapsedTime String NOT NULL,
DivArrDelay String NOT NULL,
DivDistance String NOT NULL,
Div1Airport String NOT NULL,
Div1AirportID Int32 NOT NULL,
Div1AirportSeqID Int32 NOT NULL,
Div1WheelsOn String NOT NULL,
Div1TotalGTime String NOT NULL,
Div1LongestGTime String NOT NULL,
Div1WheelsOff String NOT NULL,
Div1TailNum String NOT NULL,
Div2Airport String NOT NULL,
Div2AirportID Int32 NOT NULL,
Div2AirportSeqID Int32 NOT NULL,
Div2WheelsOn String NOT NULL,
Div2TotalGTime String NOT NULL,
Div2LongestGTime String NOT NULL,
Div2WheelsOff String NOT NULL,
Div2TailNum String NOT NULL,
Div3Airport String NOT NULL,
Div3AirportID Int32 NOT NULL,
Div3AirportSeqID Int32 NOT NULL,
Div3WheelsOn String NOT NULL,
Div3TotalGTime String NOT NULL,
Div3LongestGTime String NOT NULL,
Div3WheelsOff String NOT NULL,
Div3TailNum String NOT NULL,
Div4Airport String NOT NULL,
Div4AirportID Int32 NOT NULL,
Div4AirportSeqID Int32 NOT NULL,
Div4WheelsOn String NOT NULL,
Div4TotalGTime String NOT NULL,
Div4LongestGTime String NOT NULL,
Div4WheelsOff String NOT NULL,
Div4TailNum String NOT NULL,
Div5Airport String NOT NULL,
Div5AirportID Int32 NOT NULL,
Div5AirportSeqID Int32 NOT NULL,
Div5WheelsOn String NOT NULL,
Div5TotalGTime String NOT NULL,
Div5LongestGTime String NOT NULL,
Div5WheelsOff String NOT NULL,
Div5TailNum String NOT NULL
);
Data loading script:
cat load_ontime.sh
echo "unzip ontime ,input your ontime zip dir: ./load_ontime.sh zip_dir" ls $1/*.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {} '*.csv' -d ./dataset" if [ $? -eq 0 ]; then echo "unzip success" else echo "unzip was wrong!!!" The exit 1 fi cat create_ontime. SQL | mysql - h127.0.0.1 - P3307 - uroot if [$? - eq 0]; then echo "Ontime table create success" else echo "Ontime table create was wrong!!!" exit 1 fi time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8081/v1/streaming_loadCopy the code
# The mysql client is required: sudo apt-get install mysql-client
chmod +x load_ontime.sh
Run ./load_ontime.sh with the directory holding the ontime zip files, and the data will be loaded.
An interesting note: Clickhouse has no transaction support, so it may end up with different data under different concurrent loads; it is worth comparing row counts after loading, as sketched below.
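A simple way to confirm both systems loaded the same data (ports match the ones used elsewhere in this article):

```bash
# Databend, via its MySQL-protocol port
echo "SELECT count(*) FROM ontime" | mysql -h127.0.0.1 -P3307 -uroot
# Clickhouse, via clickhouse-client
echo "SELECT count(*) FROM ontime" | clickhouse-client --host 127.0.0.1 --port 9000
```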
The test script
hyperfine needs to be installed separately:
wget https://github.com/sharkdp/hyperfine/releases/download/v1.13.0/hyperfine_1.13.0_amd64.deb
sudo dpkg -i hyperfine_1.13.0_amd64.deb
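A quick smoke test of hyperfine with the same flags the scripts below use (-w warmup rounds, -r measured runs, markdown export; the output path is arbitrary):

```bash
# Benchmark a trivial query: 3 warmups, 10 measured runs, markdown report
hyperfine -w 3 -r 10 --export-markdown /tmp/demo.md \
  'echo "SELECT 1" | mysql -h127.0.0.1 -P3307 -uroot'
cat /tmp/demo.md
```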
Benchmark script:
cat run_ontime.sh
#!/bin/bash
cat << EOF > bench.sql
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year<=2008 GROUP BY Carrier ORDER BY c3 DESC;
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
SELECT Year, count(*) AS c1 FROM ontime GROUP BY Year;
SELECT avg(cnt) FROM (SELECT Year, Month, count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year, Month) a;
SELECT avg(c1) FROM (SELECT Year, Month, count(*) AS c1 FROM ontime GROUP BY Year, Month) a;
SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
EOF

WARMUP=3
RUN=10

script=""
function run() {
    port=$1
    result=$2
    script="hyperfine -w $WARMUP -r $RUN"
    i=0
    # Wrap each query in its own hyperfine command
    while read SQL; do
        f=/tmp/bench_${i}.sql
        echo "$SQL" > $f
        # For Clickhouse use: s="cat $f | clickhouse-client --host 127.0.0.1 --port $port"
        s="cat $f | mysql -h127.0.0.1 -P$port -uroot -s"
        script="$script '$s'"
        i=$[i+1]
    done < ./bench.sql
    script="$script --export-markdown $result"
    echo $script | bash -x
}

run "3307" "$1"
echo "select version() as version" | mysql -h127.0.0.1 -P3307 -uroot >> $result
For Clickhouse, copy run_ontime.sh to ch_run.sh and modify the run part:
script="" function run() { port=$1 result=$2 script="hyperfine -w $WARMUP -r $RUN" i=0 while read SQL; Do f = / TMP/bench_ ${I}. The SQL echo "$SQL" > $f s = "cat $f | clickhouse - the client - host 127.0.0.1 - port $port" script = "$script '$s'" i=$[i+1] done <<< $(cat bench.sql) script="$script --export-markdown $result" echo $script | bash -x } run "9000" "$1"Copy the code
Usage:
./run_ontime.sh D20220322.md
./ch_run.sh C20220322.md
Finally, compare the two resulting markdown files.
Comparing the results
Query | Clickhouse on S3 (ms) | Databend on S3 (ms) |
---|---|---|
Q1 | 498.2 | 186.6 |
Q2 | 682.1 | 247.2 |
Q3 | 620.7 | 354.7 |
Q4 | 269.6 | 125.1 |
Q5 | 160 | 146.6 |
Q6 | 694.3 | 371.3 |
Q7 | 699.9 | 389.2 |
Q8 | 994.9 | 524.9 |
Q9 | 35.9 | 372.1 |
Q10 | 1484.6 | 521.2 |
Q11 | 741.2 | 439.5 |
Q12 | 1945 | 2898.1 |
Q13 | 1129 | 1183.1 |
Graphical comparison
(The original post includes a bar chart of the per-query timings from the table above; it is not reproduced here.)
The results above show that Clickhouse clearly outperforms Databend only on Q9 (the SELECT Year, count(*) ... GROUP BY Year query), with smaller wins on Q12 and Q13. Analysis shows that Clickhouse answers Q9 directly from dictionary data, which gives Databend a clear direction for optimizing Q9.
Conclusion
In the object-storage direction, Databend currently outperforms Clickhouse overall in wide-table computing power. In essence, Databend's performance now also surpasses Snowflake's. If that interests you, keep an eye out for our upcoming Meetup.
If you are interested in using Databend, you can learn more about it by following the links below:
Databend on MinIO: databend.rs/doc/deploy/…
Databend on COS: databend.rs/doc/deploy/…
Databend on AWS S3: databend.rs/doc/deploy/…
Databend vectorization: databend.rs/doc/perform…
You can also add WeChat: 82565387 for support; use the password: Databend.