1. Overview
Spark is a general-purpose parallel computing framework for big data built on in-memory computing, with built-in components for batch processing, stream processing, machine learning, and graph processing. Hive is a Hadoop-based data warehouse that supports SQL queries, improving the ease of use of Hadoop. Spark is often used together with Hive and Hadoop. Data partitions in Hive make it convenient to manage and filter data and improve query efficiency.
DolphinDB is a high-performance distributed time-series database written in C++. It has a high-throughput, low-latency columnar in-memory engine and integrates a powerful programming language that supports both Python-like and SQL scripting. DolphinDB internally uses data sources to abstract partitioned data; SQL, machine learning, batch processing, and stream processing can all be performed on a data source. A data source can be either a partition of the built-in database or external data. If the data source is a partition of the built-in database, most of the computation can be done locally, greatly improving the efficiency of computation and queries.
This report compares the performance of DolphinDB, Spark accessing HDFS directly (Spark+Hadoop), and Spark accessing HDFS through the Hive component (Spark+Hive+Hadoop). The tests cover data import, disk space usage, data queries, and multi-user concurrent queries. Through these comparisons, we can better understand the major factors affecting performance and the optimal application scenarios for each tool.
2. Test Environment
2.1 Hardware Configuration
This test uses two servers (machine 1 and machine 2) with the same configuration. The configuration parameters are as follows:
Host: DELL PowerEdge R730xd
CPU: Intel Xeon(R) CPU E5-2650 V4 (24-core 48-thread 2.20GHz)
Memory: 512 GB (32GB x 16, 2666 MHz)
Hard disk: 17T HDD (1.7T × 10, 222 MB/s read; 210 MB/s write)
Network: 10 gigabit Ethernet
OS: CentOS Linux release 7.6.1810 (Core)
2.2 Cluster Configuration
The DolphinDB version tested is Linux v0.95. The controller node of the test cluster is deployed on machine 1. Each machine runs three data nodes, and each data node is configured with 8 workers, 7 executors, and 24 GB of memory.
The Spark version tested is 2.3.3, running on Apache Hadoop 2.9.0. Hadoop and Spark are configured in fully distributed mode: machine 1 is the master, and both machine 1 and machine 2 run slaves. Hive version 1.2.2 is deployed on both machines, with its metadata stored in a MySQL database on machine 1. Spark and Spark+Hive submit applications in standalone client mode.
DolphinDB, Spark, and Spark+Hive were all configured with six hard disks, and the total CPU and memory were kept the same across systems at every concurrency level: 48 threads and 144 GB of memory. The resources used by Spark and Spark+Hive belong to a specific application, each with six executors, so under multi-user concurrency the resources available to a single Spark or Spark+Hive user decrease as the number of users increases. Table 1 shows the resources used by each user at different concurrency levels.
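The report does not show the exact launch command, but a standalone-client-mode submission matching the single-user resource budget above (48 cores, six 24 GB executors) could look like the following sketch; the host name "machine1" and the application jar name are assumptions:

```
spark-submit \
  --master spark://machine1:7077 \
  --deploy-mode client \
  --executor-cores 8 \
  --total-executor-cores 48 \
  --executor-memory 24G \
  app.jar
```

In standalone mode, Spark derives the executor count from the core settings: 48 total cores at 8 cores per executor yields the six executors described above.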
Table 1. Resources used by a single user with different concurrency numbers of Spark and Spark+Hive
3. Data set and database design
3.1 Data Set
The test data set is the TAQ data set provided by the New York Stock Exchange (NYSE), which contains Level 1 quote data for more than 8,000 stocks from August 1, 2007 to August 31, 2007, including trading time, stock symbol, bid price, ask price, bid size, ask size, and other quote fields. The data set contains 6.5 billion (6,561,693,704) quote records, with each trading day saved as one CSV file. The month has 23 trading days, and the 23 uncompressed CSV files total 277 GB.
Data source: www.nyse.com/market-data…
3.2 Database Design
Table 2. Data types of TAQ in various systems.
In the DolphinDB database, data is partitioned by the date and symbol columns using a composite scheme: the first level is a VALUE partition on date, and the second level is a RANGE partition on symbol with 100 ranges. Each partition is around 120 MB.
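As a toy illustration (not DolphinDB internals) of how this composite scheme maps a record to a partition, the sketch below assigns each record a (date, bucket) key: the first level keys on the exact date, and the second assigns the symbol to a range bucket. The bucket boundaries are hypothetical, not the ones used in the test.

```python
import bisect
from datetime import date

# Hypothetical range boundaries: bucket i holds symbols in [BOUNDS[i], BOUNDS[i+1]).
# Symbols are assumed to sort at or after BOUNDS[0].
BOUNDS = ["A", "F", "M", "S"]

def partition_of(trade_date: date, symbol: str) -> tuple:
    """Return the (date, bucket) composite partition key for one quote record."""
    # VALUE partition: one partition per distinct trading day.
    date_part = trade_date.isoformat()
    # RANGE partition: find the range bucket the symbol falls into.
    bucket = bisect.bisect_right(BOUNDS, symbol) - 1
    return (date_part, bucket)
```

A query that filters on both date and symbol needs to touch only the single partition this key identifies, which is the basis of the partition pruning discussed in the query tests below.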
Spark stores the data in HDFS as 23 directories corresponding to the 23 CSV files. Spark+Hive uses two partition levels: the first uses the DATE column for static partitioning, and the second uses the stock SYMBOL for dynamic partitioning.
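Loading one day of data into the Hive table with a static DATE partition and a dynamic SYMBOL partition could look like the following sketch (the staging table name taq_staging is an assumption; the actual CREATE TABLE statement is in Appendix 2):

```sql
-- Dynamic partitioning must be enabled before mixed static/dynamic inserts.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- DATE is fixed (static); Hive creates one SYMBOL sub-partition per
-- distinct symbol value in the selected rows (dynamic).
INSERT OVERWRITE TABLE TAQ PARTITION (date='2007-08-01', symbol)
SELECT time, bid, ofr, bidsiz, ofrsiz, mode, ex, mmid, symbol
FROM taq_staging;
```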
See appendix for detailed scripts.
4. Data import and query test
4.1 Data import Test
The raw data is evenly distributed across six hard disks on two servers to make full use of all the resources in the cluster. DolphinDB imports data in parallel in asynchronous multi-node mode. Spark and Spark+Hive start six applications in parallel to read data and store data in HDFS. The data import time of each system is shown in Table 3. Table 4 lists the disk space occupied by data in each system. See appendix for the data import script.
Table 3. DolphinDB, Spark, Spark+Hive import time
Table 4. Disk space occupied by DolphinDB, Spark, and Spark+Hive data
DolphinDB's import performance was about four times that of Spark and about six times that of Spark+Hive. DolphinDB is written in C++ and contains many internal optimizations to take full advantage of disk IO.
DolphinDB takes up about twice as much disk space as Spark and Spark+Hive because both store data on Hadoop in the Parquet format, which Spark writes with Snappy compression by default.
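The Parquet codec can also be pinned explicitly in Spark's configuration; the fragment below shows the relevant property (snappy is already the default in Spark 2.3, so this merely makes the test setup explicit):

```
# spark-defaults.conf
spark.sql.parquet.compression.codec  snappy
```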
4.2 Data query test
To ensure fairness, each query statement was run several times. Before each run, the operating system's page cache, dentry cache, and disk cache were cleared using Linux commands, and DolphinDB's built-in cache was also cleared.
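The exact commands used are not given in the report; the standard Linux way to clear these OS-level caches (an assumption, and it requires root) is:

```
# Flush dirty pages to disk, then drop the page cache, dentries, and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches   # requires root
```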
The queries in Table 5 cover most query scenarios, including grouping, sorting, filtering, aggregation, point queries, and full-table queries, and are used to evaluate the query performance of DolphinDB, Spark, and Spark+Hive.
Table 5. DolphinDB, Spark, Spark+Hive query statements
4.2.1 DolphinDB and Spark Single-User Query Tests
The results of the DolphinDB and Spark single-user queries are shown below; each result is the average of eight runs.
Table 6. DolphinDB and Spark single-user query results
DolphinDB performed up to about 200 times better than Spark+HDFS. For Q1 to Q6, the DolphinDB queries filter on partition fields, so DolphinDB loads data only from the specified partitions and needs no full-table scan, whereas Spark must scan the entire table, which takes a lot of time. For Q7, both DolphinDB and Spark must scan the full table, but DolphinDB loads only the relevant columns rather than all the data, as Spark does. The gap between DolphinDB and Spark is smaller for Q7 because its runtime is dominated by data loading.
4.2.2 DolphinDB and Spark+Hive Single-User Query Tests
DolphinDB is more efficient than Spark because its data is partitioned and predicates are pushed down during queries. Here, Spark accesses HDFS through the Hive component, and we compare the query performance of DolphinDB and Spark+Hive. The results of the DolphinDB and Spark+Hive single-user queries are shown below, averaged over eight runs.
Table 7. DolphinDB, Spark+Hive single user query results
DolphinDB performed tens of times better than Spark+Hive. Compared with the results in Table 6, Spark+Hive queries were much faster than plain Spark, and DolphinDB's advantage shrank significantly. This is because Hive partitions the data and, when a query condition contains partition fields, loads only the relevant partitions, filtering the data and improving efficiency. Query Q7, which scans the entire table, caused a memory overflow in Spark+Hive.
Both DolphinDB and Spark+Hive partition data and can filter it by predicate pushdown when loading, yet DolphinDB is still faster. This is because Spark+Hive reads HDFS data across system boundaries: the data must be serialized, transmitted over the network, and deserialized, which costs time and degrades performance. DolphinDB performs most of its computation locally, reducing data transfer and improving efficiency.
4.2.3 DolphinDB and Spark Computing Power Comparison
The previous sections compared DolphinDB with Spark and Spark+Hive, where Spark's performance was affected by data partitioning, data filtering, and data transfer. In this section, we first load the data into memory and then compare the computing power of DolphinDB and Spark. Spark+Hive is omitted because Hive only helps filter data when reading from HDFS, which is irrelevant once the test data is already in memory.
Table 8 shows the statements used to test computing power. Each test consists of two statements: the first loads the data into memory, and the second performs the computation on the in-memory data. DolphinDB caches data automatically; for Spark, we create a temporary table TmpTbl using its default caching mechanism.
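In Spark SQL, caching such a temporary table for the second statement to reuse could look like the following sketch (the column list and filter are illustrative, not the exact statements from Table 8):

```sql
-- Materialize the filtered data in memory under the name TmpTbl.
CACHE TABLE TmpTbl AS
SELECT symbol, time, bid, ofr FROM taq WHERE date = '2007-08-01';

-- The timed computation then runs entirely against the cached table.
SELECT symbol, avg((bid + ofr) / 2.0) AS avg_mid
FROM TmpTbl
GROUP BY symbol;
```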
Table 8. DolphinDB and Spark computing power comparison statements
The DolphinDB and Spark results below are averages over eight runs.
Table 9. DolphinDB and Spark computing test results
Because the data is already in memory, Spark takes less time than in Table 6, but DolphinDB still outperforms it. DolphinDB is written in C++ and manages its own memory, which is more efficient than Spark's JVM-based memory management, and DolphinDB's built-in algorithms are also more efficient.
DolphinDB's distributed computing operates directly on the specified in-memory partitions, whereas Spark loads blocks from across HDFS, and a single data block contains rows with different symbol values; even when cached, the data still needs to be filtered, which is why the gap between Q1 and Q2 is large. In addition, the broadcast variables used in Spark's computation are compressed and must be decompressed after being transferred to each executor, which further hurts performance.
4.2.4 Concurrent Query by Multiple Users
We tested DolphinDB, Spark, and Spark+Hive with multi-user concurrent queries using the query statements in Table 5. Here are the test results, where the elapsed time is the average of eight queries.
Table 10. DolphinDB, Spark, Spark+Hive multi-user query results
Figure 1. Comparison of DolphinDB and Spark multi-user query results
Figure 2. Comparison of DolphinDB and Spark+Hive multi-user query results
As the results show, query time for all three systems gradually increases with the number of concurrent users. When the number of concurrent users reaches 8, Spark's performance degrades significantly compared with lower concurrency, and its workers die when executing Q7. Spark+Hive, like DolphinDB, remains stable under multi-user access, but a memory overflow occurs when it executes Q7.
Spark+Hive uses the same query configuration as Spark, but because its data is partitioned it can filter out irrelevant data, so the amount of data queried is small and it is more efficient than Spark, which must scan all the data.
DolphinDB performed significantly better than Spark and Spark+Hive in concurrent queries. As the figures show, DolphinDB's advantage over Spark grew almost linearly as the number of users increased, while its advantage over Spark+Hive remained roughly constant, demonstrating the importance of partition-based data filtering in queries.
DolphinDB allows multiple users to share data concurrently, unlike Spark, where data belongs to a specific application. With eight concurrent users, Spark therefore allocates fewer resources to each user, leading to a significant performance drop. DolphinDB's data sharing reduces resource usage, leaving more resources available for computation and increasing both efficiency and the number of users it can serve concurrently.
5. Summary
DolphinDB can load data in parallel, while Spark and Spark+Hive import data through multiple applications. DolphinDB imports four to six times faster than Spark and Spark+Hive, but occupies about twice as much disk space, since Spark and Spark+Hive store Parquet data on Hadoop with Snappy compression.
DolphinDB's advantage is even greater for SQL queries, and it comes from four main areas: (1) localized computation, (2) partition filtering, (3) optimized in-memory computation, and (4) cross-session data sharing.

In single-user queries, DolphinDB was several to a hundred times faster than Spark and dozens of times faster than Spark+Hive. Spark reads HDFS data across system boundaries, so serialization and deserialization take considerable time and resources, whereas DolphinDB's SQL queries are mostly localized, reducing data transfer and load time. Spark+Hive is faster than Spark because it scans only the relevant partitions and filters the data.

Even after stripping out localization and partition filtering (that is, with all the data already in memory), DolphinDB was still several times faster than Spark: its partition-based distributed computing is efficient, and its own memory management beats Spark's JVM-based management.

Spark's multi-user concurrency degrades as the number of users grows, and too many users querying large amounts of data cause its workers to die. Spark+Hive remains stable under concurrency but can overflow memory when too much data is loaded. DolphinDB shares data among users, reducing the resources spent loading it: it was hundreds of times faster than Spark and dozens of times faster than Spark+Hive, and its advantage over Spark grew as the number of users increased. Where partition-filtered queries are involved, both Spark+Hive and DolphinDB significantly improve query performance.
Spark is an excellent general-purpose distributed computing engine that performs well in SQL queries, batch processing, stream processing, and machine learning. However, because an SQL query typically evaluates the data only once, the advantages of in-memory computing are not fully realized there, unlike in machine learning, which iterates over the data hundreds of times. Spark is therefore better suited to computation-intensive machine learning.
In our tests, DolphinDB also proved very lightweight to deploy: cluster setup was quick and easy, whereas setting up the Spark+Hive+Hadoop cluster was very complex.
Appendix
Appendix 1. Data Preview
Appendix 2. Hive Create Table Statement
CREATE TABLE IF NOT EXISTS TAQ (
    time TIMESTAMP,
    bid DOUBLE,
    ofr DOUBLE,
    bidsiz INT,
    ofrsiz INT,
    mode INT,
    ex TINYINT,
    mmid STRING)
PARTITIONED BY (date DATE, symbol STRING)
STORED AS PARQUET;
Appendix 3. Data Import Scripts
DolphinDB import data script:
fps1 and fps2 are vectors of all the CSV paths on machine 1 and machine 2 respectively, and the vector fps contains fps1 and fps2. allSites1 and allSites2 are the names of the data nodes on machine 1 and machine 2 respectively, and the vector allSites contains allSites1 and allSites2.

DATE_RANGE = 2007.07.01..
date_schema = database('', VALUE, DATE_RANGE)
symbol_schema = database('', RANGE, buckets)
db = database(FP_DB, COMPO, [date_schema, symbol_schema])
taq = db.createPartitionedTable(schema, `taq, `date`symbol)
for(i in 0..1){
    for(j in 0..(size(fps[i])-1)){
        rpc(allSites[i][j % size(allSites[i])], submitJob, "loadData", "loadData",
            loadTextEx{database(FP_DB), "taq", `date`symbol, fps[i][j]})
    }
}
Spark and Hive data import configuration:
--master local[8]
--executor-memory 24G