DolphinDB and Druid are distributed analytical time-series databases. Although the former is developed in C++ and the latter in Java, they have a lot in common in architecture, functionality, and application scenarios. This report compares their SQL query performance, data import performance, and disk space usage.

1. System introduction

Druid is an OLAP data warehouse implemented in Java, designed for low-latency queries and inserts on terabytes of data and for real-time streaming data analysis. Druid achieves high availability and scalability through key technologies such as distributed storage, a shared-nothing (SN) architecture, columnar storage, inverted indexes, and bitmap indexes. Druid also provides interfaces in multiple languages and partial SQL support.

DolphinDB is an analytical, distributed time-series database written in C++ with a built-in streaming data processing engine, parallel computing engine, and distributed computing capabilities. It has a built-in distributed file system and supports both horizontal and vertical cluster scaling. DolphinDB provides a SQL- and Python-like scripting language that not only manipulates data with SQL but also performs more complex in-memory calculations, and it offers APIs for other common programming languages to ease integration with existing applications. DolphinDB can process terabytes of data quickly and is well suited to historical data modeling and real-time streaming in finance, as well as mass sensor data processing and real-time analysis in the Internet of Things.

2. System configuration

2.1 Hardware Configuration

The hardware configuration of the test is as follows:

Device: DELL OptiPlex 7060

CPU: Intel® Core™ i7-8700 CPU @ 3.20GHz, 6 cores and 12 threads

Memory: 32 GB

Hard disk: 256GB SSD, 1.8TB Seagate ST2000DM008-2FR102 mechanical hard disk

Operating system: Ubuntu 16.04 x64

2.2 Environment Configuration

The test environment is a multi-node cluster on a single server. DolphinDB is configured with four data nodes, each with a maximum of 4 GB of available memory. Druid is configured with five nodes: Overlord, Broker, Historical, Coordinator, and MiddleManager. Because Druid caches query results by default, and our test averages the results of repeated runs of the same query, we disabled Druid's query cache. We also disabled Druid's rollup feature so that it would not affect the write performance tests. All other settings were left at their defaults.

The original CSV files are stored on the HDD. The databases are stored on the SSD.

3. Test dataset

This test uses the Level 1 TAQ dataset of the US stock market for August 2007. The dataset is divided into 23 CSV files, one per trading day; individual files range from 7.8 GB to 19.1 GB, and the whole dataset is about 290 GB with a total of 6,561,693,704 records.

The data types of the various DolphinDB and Druid fields in the TAQ test dataset are as follows:



In Druid, the DATE field is specified as the timestamp column; all other fields are used as dimensions.

4. Data partitioning scheme

DolphinDB uses a combination of stock symbol and date partitions, with 128 partitions by stock symbol range and 23 by date.

Druid supports only time-based range partitioning, so we specify the DATE column as the timestamp and partition by day into 23 partitions.
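
The composite scheme can be sketched in Python. This is only an illustration: the bucket boundaries below are hypothetical stand-ins for the 128 symbol-range boundaries that cutPoints produces in the actual test.

```python
import bisect
from datetime import date

# Hypothetical lower-bound boundaries standing in for the 128
# symbol-range buckets produced by cutPoints() in the real test.
BOUNDS = ["A", "F", "M", "S"]

def partition_of(symbol: str, day: date) -> tuple[int, date]:
    # Composite key: (symbol-range bucket, trading day).
    bucket = bisect.bisect_right(BOUNDS, symbol) - 1
    return (bucket, day)

# All quotes for one ticker on one day land in a single partition,
# so a query filtered by ticker touches only a fraction of each day.
print(partition_of("IBM", date(2007, 8, 1)))
```

A query that filters on both ticker and date can thus prune to a single partition, whereas time-only partitioning must scan a whole day regardless of the ticker filter.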

5. Comparison tests

DolphinDB and Druid were compared in terms of database query performance, I/O performance and disk footprint.

5.1 Database Query Performance

The DolphinDB scripting language supports SQL syntax with extensions for time-series data. Druid provides a JSON-based query language as well as DSQL for SQL queries. This test uses Druid's built-in DSQL.

We performed several common SQL queries against the TAQ dataset. To reduce the influence of chance factors on the results, each query was run 10 times and the total time was averaged; times are in milliseconds. For DolphinDB, we used the timer statement to measure the execution time of SQL statements on the server side. Druid provides no server-side tool or function that reports query time, so we used the execution time printed by the DSQL command-line client. The time reported for Druid is therefore slightly longer than for DolphinDB, as it also includes transferring and displaying the query results. Since DSQL and the Druid server run on the same node, this overhead is only about 1 ms, which does not affect our conclusions, so no adjustment was made for it.
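
The averaging procedure itself is straightforward; a minimal Python sketch, with a dummy workload standing in for the actual DolphinDB or DSQL query, looks like this:

```python
import statistics
import time

def avg_query_ms(run_query, reps=10):
    # Run the query `reps` times and average the wall-clock time in
    # milliseconds, mirroring the 10-run averaging used in this test.
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.mean(samples)

# Dummy workload; in the real test this is a DolphinDB SQL statement
# (timed server-side with `timer`) or a query issued through DSQL.
print(f"{avg_query_ms(lambda: sum(range(100_000))):.2f} ms")
```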

The following table shows the SQL representations of the seven queries.



The test results are shown in the following table.



DolphinDB performed better than Druid on almost all queries, running between 3 and 30 times faster.

DolphinDB allows partitioning by multiple dimensions, such as time and ticker in TAQ. In queries that filter or group by ticker (e.g., queries 1, 3, 6, and 7), DolphinDB's advantage was even greater.

5.2 I/O Performance Testing

We tested DolphinDB and Druid importing a single file (7.8 GB) and multiple files (290.8 GB in total). For fairness, Druid's rollup feature was turned off. The test results are shown in the following table; times are in seconds.



Druid takes more than 16 times longer than DolphinDB to import a single file. DolphinDB is also faster when importing multiple files because it supports parallel import. See Appendix 2 for the data import scripts.
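
The effect of parallel import can be sketched with a thread pool in Python. In-memory CSV chunks stand in for the 23 daily files, and submitting a task to the pool plays the role of DolphinDB's submitJob in the appendix script:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def load_file(text: str) -> int:
    # Parse one CSV chunk and return its record count -- a stand-in
    # for loadTextEx writing rows into the partitioned table.
    return sum(1 for _ in csv.reader(io.StringIO(text)))

# One in-memory "file" per day; the real test used 23 files on disk.
files = ["SYM,20070801,100.0,100.1\n" * 1000 for _ in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(load_file, files))
print(sum(counts))  # → 4000
```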

5.3 Testing Disk Space Usage

After importing the data into DolphinDB and Druid, we compared their compression ratios. The test results are shown in the following table.



DolphinDB uses the LZ4 algorithm to quickly compress column-stored data; its SYMBOL type additionally uses dictionary encoding to convert strings to integers before compression. Druid uses LZ4 to compress the timestamp and metrics columns directly, and compresses dimension columns with dictionary encoding, bitmap indexes, and roaring bitmap compression. Dictionary encoding reduces the space needed to store strings, bitmap indexes allow fast bitwise logical operations, and bitmap compression further saves storage space.
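
A minimal Python sketch of dictionary encoding on a symbol column (zlib stands in for LZ4, which is not in the standard library):

```python
import array
import zlib

# A column of ticker symbols with heavy repetition, as in TAQ data.
symbols = ["AAPL", "IBM", "MSFT"] * 10_000

# Dictionary-encode: store each distinct string once plus a small
# integer id per row, as both DolphinDB's SYMBOL type and Druid's
# dimension encoding do before block compression.
dictionary = list(dict.fromkeys(symbols))
ids = {s: i for i, s in enumerate(dictionary)}
encoded = array.array("h", (ids[s] for s in symbols)).tobytes()

raw = ",".join(symbols).encode()
print(len(raw), len(encoded))        # the encoded column is smaller
print(len(zlib.compress(encoded)))   # and still compresses well
```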

In this test, the DolphinDB database took up about 80% more disk space than Druid. The main reason is the very different compression ratios achieved on the BID and OFR floating-point fields: about 20% in DolphinDB versus about 5% in Druid. This is because the test dataset is historical data already sorted by date and stock, the number of unique quote values per stock is very limited, and Druid's bitmap compression works particularly well on such data.

Although Druid achieves a higher compression ratio and a smaller static disk footprint, it also generates segment cache directories totaling 65 GB at query time. DolphinDB requires no such extra space, so its overall disk usage is lower than Druid's.

6. Summary

DolphinDB’s performance advantages over Druid come from (1) differences in storage and partitioning mechanisms, (2) differences in development languages (C++ vs. Java), (3) differences in memory management, and (4) differences in the implementation of algorithms such as sorting and hashing.

In terms of partitioning, Druid supports only time-based range partitioning, which is far less flexible than DolphinDB's support for value, range, hash, and list partitioning, where multiple fields can be combined for each table. DolphinDB's partitions are more fine-grained, so data and queries are less likely to concentrate on a single node; queries scan fewer data blocks, giving faster response times and better performance.

DolphinDB is also more functional than Druid. It has a very powerful window function mechanism and more comprehensive support for SQL joins, including time-series-specific features such as sliding-window functions, asof join, and window join. DolphinDB combines a database, a programming language, and distributed computing: in addition to regular database queries, it also supports more complex in-memory, distributed, and stream computing.
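
Asof join is the characteristic time-series operation here: for each left-table timestamp, it picks the most recent right-table row at or before that time. A sketch of the semantics in Python (this illustrates the operation only, not DolphinDB's implementation):

```python
import bisect

def asof_join(quote_times, trade_times, trade_prices):
    # For each quote timestamp, take the last trade at or before it.
    # Both timestamp lists must be sorted ascending; None means no
    # earlier trade exists.
    out = []
    for t in quote_times:
        i = bisect.bisect_right(trade_times, t) - 1
        out.append(trade_prices[i] if i >= 0 else None)
    return out

print(asof_join([8, 10, 12, 15], [9, 11, 14], [100.0, 100.5, 101.0]))
# → [None, 100.0, 100.5, 101.0]
```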

DolphinDB and Druid also differ slightly in how they operate. If Druid crashes or its segment cache is emptied, it must spend considerable time reloading and decompressing each segment into the segment cache before queries can run. As a result, Druid takes longer to restart and requires more disk space.

Appendix

Appendix 1. Environment Configuration

(1) DolphinDB configuration

controller.cfg

localSite=localhost:9919:ctl9919
localExecutors=3
maxConnections=128
maxMemSize=4
webWorkerNum=4
workerNum=4
dfsReplicationFactor=1
dfsReplicaReliabilityLevel=0
enableDFS=1
enableHTTPS=0

cluster.nodes

localSite,mode
localhost:9910:agent,agent
localhost:9921:DFS_NODE1,datanode
localhost:9922:DFS_NODE2,datanode
localhost:9923:DFS_NODE3,datanode
localhost:9924:DFS_NODE4,datanode

cluster.cfg

maxConnection=128
workerNum=8
localExecutors=7
webWorkerNum=2
maxMemSize=4

agent.cfg

workerNum=3
localExecutors=2
maxMemSize=4
localSite=localhost:9910:agent
controllerSite=localhost:9919:ctl9919

(2) Druid configuration

_common

# Zookeeper
druid.zk.service.host=zk.host.ip
druid.zk.paths.base=/druid
# Metadata storage
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
# Deep storage
druid.storage.type=local
druid.storage.storageDirectory=var/druid/segments
# Indexing service logs
druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs

broker:

-Xms24g
-Xmx24g
-XX:MaxDirectMemorySize=4096m

# HTTP server threads
druid.broker.http.numConnections=5
druid.server.http.numThreads=25

# Processing threads and buffers
druid.processing.buffer.sizeBytes=2147483648
druid.processing.numThreads=7

# Query cache
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false

coordinator:
-Xms3g
-Xmx3g

historical:
-Xms8g
-Xmx8g

# HTTP server threads
druid.server.http.numThreads=25

# Processing threads and buffers
druid.processing.buffer.sizeBytes=2147483648
druid.processing.numThreads=7

# Segment storage
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":0}]
druid.server.maxSize=130000000000

druid.historical.cache.useCache=false
druid.historical.cache.populateCache=false

middleManager:
-Xms64m
-Xmx64m

# Number of tasks per middleManager
druid.worker.capacity=3

# HTTP server threads
druid.server.http.numThreads=25

# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=4147483648
druid.indexer.fork.property.druid.processing.numThreads=2

overlord:

-Xms3g
-Xmx3g

Appendix 2. Data import scripts

DolphinDB script:

if (existsDatabase("dfs://TAQ"))
    dropDatabase("dfs://TAQ")

db = database("/Druid/table", SEQ, 4)
t = loadTextEx(db, 'table', , "/data/data/TAQ/TAQ20070801.csv")
t = select count(*) as ct from t group by symbol
buckets = cutPoints(exec symbol from t, 128)
buckets[size(buckets)-1] = `ZZZZZ
t1 = table(buckets as bucket)
t1.saveText("/data/data/TAQ/buckets.txt")

db1 = database("", VALUE, 2007.08.01..2007.08.31)
partition = loadText("/data/data/TAQ/buckets.txt")
partitions = exec * from partition
db2 = database("", RANGE, partitions)
db = database("dfs://TAQ", HIER, [db1, db2])
db.createPartitionedTable(table(100:0, `symbol`date`time`bid`ofr`bidsiz`ofrsiz`mode`ex`mmid, [SYMBOL, DATE, SECOND, DOUBLE, DOUBLE, INT, INT, INT, CHAR, SYMBOL]), `quotes, `date`symbol)

def loadJob() {
    filenames = exec filename from files('/data/data/TAQ')
    db = database("dfs://TAQ")
    filedir = '/data/data/TAQ'
    for (fname in filenames) {
        jobId = fname.strReplace(".csv", "")
        jobName = jobId
        submitJob(jobId, jobName, loadTextEx{db, "quotes", `date`symbol, filedir+'/'+fname})
    }
}
loadJob()
select * from getRecentJobs()
TAQ = loadTable("dfs://TAQ", "quotes");

Druid script:

{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "TAQ",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "csv",
"dimensionsSpec" : {
"dimensions" : [
"TIME",
"SYMBOL",
{"name":"BID", "type" : "double"},
{"name":"OFR", "type" : "double"},
{"name":"BIDSIZ", "type" : "int"},
{"name":"OFRSIZ", "type" : "int"},
"MODE",
"EX",
"MMID"
]
},
"timestampSpec": {
"column": "DATE",
"format": "yyyyMMdd"
},
"columns" : ["SYMBOL",
"DATE",
"TIME",
"BID",
"OFR",
"BIDSIZ",
"OFRSIZ",
"MODE",
"EX",
"MMID"]
}
},
"metricsSpec" : [],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2007-08-01/2007-09-01"],
"rollup" : false
}
},
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "/data/data/",
"filter" : "TAQ.csv"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
}
}

The test used about 300 GB of TAQ data from the US stock market. Through the tests, we found:

  • DolphinDB writes data about 30 times faster than Druid.
  • DolphinDB's queries were about 10 times faster than Druid's on average.
  • DolphinDB's database takes up about 80% more static disk space than Druid's, but requires slightly less total disk space at runtime.