An overview of Kudu + Impala
Kudu and Impala are among the top projects Cloudera has contributed to the Apache Software Foundation. Kudu, as an underlying storage engine, supports high-concurrency, low-latency key-value queries while maintaining good scan performance, which in theory makes it capable of serving both OLTP and OLAP workloads. Impala, an established SQL engine, has been widely validated for stability and speed on ad-hoc queries. Impala has no storage engine of its own: it parses SQL and delegates to an underlying storage engine. Initially Impala mainly supported HDFS; since the release of Kudu, the two have been deeply integrated.
Within the big data ecosystem, Impala occupies a position similar to Hive's, but Impala focuses on answering ad-hoc queries quickly, while Hive remains better suited to long-running SQL. For queries such as GROUP BY, Impala computes in memory, so it places high demands on machine configuration: the official recommendation is 128 GB of memory or more.
Execution efficiency is Impala's biggest advantage. Even over data stored in HDFS, Impala is much faster than Hive, and with Kudu underneath, some queries can run as much as 100 times faster than on Hive.
It's worth noting that the kudu and the impala are two species of African antelope; Cloudera likes to name its products after fast animals.
Relevant background
OLTP and OLAP
OLTP (On-line Transaction Processing) targets high-concurrency, low-latency INSERT, DELETE, UPDATE, and SELECT requests.
On-line Analytical Processing (OLAP) serves BI-style analytical requests. It tolerates higher latency and processes far more data per request than OLTP.
Traditionally, OLTP corresponds to MySQL and other relational databases, while OLAP corresponds to data warehouses. OLTP and OLAP use different storage and query engines, handle different requests, and require very different architectures. As a result, data must be stored in at least two places, synchronized regularly or in real time, and kept consistent. This has caused data engineers a great deal of trouble and wasted much of their time on synchronization and verification. Kudu + Impala is not a perfect solution to this problem, but it undeniably eases the tension.
It is important to note that OLTP does not strictly require transactions to satisfy the four ACID properties; in fact, OLTP is an older concept than ACID. In this article, OLTP and OLAP are distinguished by data volume, concurrency, and latency requirements, not by transaction semantics.
The origins of Kudu
Kudu was originally developed by Cloudera and was contributed to the Apache Foundation on December 3, 2015; it graduated as an Apache top-level project on July 25, 2016. It is worth noting that Kudu has received strong support from the Chinese company Xiaomi, which is deeply involved in its development and has a Kudu committer.
As its recent graduation suggests, Kudu is still young, with many details to polish and some important features still to be developed (such as transactions). But the Kudu + Impala combination is being adopted by a growing number of companies and is currently promoted by Cloudera as a new big data solution.
Other terms
- HDFS is the most basic storage engine in the Hadoop ecosystem. Note that HDFS is designed to store large files and provide high-throughput reads and writes; it is not suitable for storing small files and does not support large numbers of random reads and writes.
- MapReduce is the most basic computing framework of distributed computing. By dividing tasks into multiple Mapper and Reducer, MapReduce can handle common big data tasks well.
- Hadoop originally consisted of HDFS + MapReduce. With the releases of Hadoop 2.0 and 3.0, Hadoop has gained more functionality.
- Hbase was inspired by Google's Bigtable paper; the project first started at Powerset. Built on the Hadoop Distributed File System (HDFS), Hbase is a typical engine for OLTP-type requests, with high random read/write performance. Hbase also has good scan performance and can handle certain kinds of OLAP requests.
- Spark is a new-generation computing engine that integrates iterative computing and streaming computing. Compared with MapReduce, Spark provides richer distributed computing primitives and can complete distributed computing tasks more efficiently.
- Ad-hoc query is an important concept in data warehousing. In data exploration and analysis, users typically piece together SQL on the fly and expect reasonably fast answers; such queries are collectively called ad-hoc queries.
- Column storage. Compared with row storage, column storage puts the data of the same column together. Because values within one column repeat more often, columnar storage achieves higher compression ratios. Moreover, since most BI analyses read only a subset of columns, a columnar engine scans just the required columns and reads less data, yielding faster queries. Common columnar formats include Parquet.
Kudu introduction
What is Kudu
Kudu is a storage engine built for the Hadoop ecosystem and shares its design philosophy: it runs on commodity servers, can be deployed at scale, and meets the industry's high-availability requirements. Its design motto is "fast analytics on fast data." Compared with Hbase, Kudu trades away a little random read/write performance for much better scan performance; in most scenarios its random read/write performance is comparable to Hbase's, while its scan performance is far better.
Unlike storage engines such as Hbase, Kudu has the following advantages:
- Fast OLAP class query processing speed
- It integrates well with common Hadoop-ecosystem systems such as MapReduce and Spark, and its connectors are officially supported and maintained.
- Deep integration with Impala: Kudu + Impala outperforms the traditional HDFS + Parquet + Impala architecture in most scenarios.
- A strong and flexible consistency model that allows users to choose consistency requirements per request, up to and including strict-serializable consistency.
- Supports both OLTP and OLAP requests with good performance.
- Kudu is integrated into Cloudera Manager and is operations-friendly.
- High availability. The Raft consensus algorithm is used for leader election, and data remains readable even if a leader election fails.
- Support for structured data; pure columnar storage saves space while providing more efficient queries.
Typical Kudu usage scenarios
Streaming real-time computing scenario
Streaming computing scenarios typically have a constant stream of writes, while at the same time supporting near-real-time read, write, and update operations. Kudu’s design handles this scenario well.
Time Series Storage Engine (TSDB)
Kudu's hash-sharding design is a good way to avoid the local hotspots typical of time-series workloads, and its efficient scan performance lets Kudu serve such queries better than Hbase.
Machine learning & Data Mining
Machine learning and data mining intermediate results often require high-throughput batch write and read, with a small number of random read and write operations. Kudu is designed to meet the storage requirements of these intermediate results.
Coexisting with legacy data
In real production environments there is often a large amount of legacy data. Impala supports multiple underlying storage engines, such as HDFS and Kudu, so adopting Kudu does not require migrating all existing data to it.
Important concepts in Kudu
Columnar storage
Kudu is a pure columnar storage engine. Unlike Hbase, which groups data only by column family, Kudu's columnar layout is closer to Parquet's, allowing more efficient scans while using less storage space. The advantages of columnar storage come mainly from two factors: 1. OLAP queries usually access only some of the columns; a columnar engine reads just those columns, while a row-oriented engine must retrieve entire rows. 2. Values grouped by column generally compress better, because values within one column tend to be more similar.
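The compression benefit of grouping values by column can be sketched with a toy experiment. This is illustrative only: the table, its column names, and the use of zlib are invented here and stand in for Kudu's actual on-disk encodings.

```python
import zlib

# A hypothetical 10,000-row table (id, country, status); the column
# names and values are made up purely for illustration.
rows = [(i, "US" if i % 2 == 0 else "CN", "active") for i in range(10000)]

# Row-major layout: all fields of a row stored together.
row_major = "\n".join(",".join(map(str, r)) for r in rows).encode()

# Column-major layout: all values of one column stored contiguously.
columns = list(zip(*rows))
col_major = "\n".join(",".join(map(str, c)) for c in columns).encode()

# The highly repetitive 'country' column alone compresses dramatically.
country = ",".join(columns[1]).encode()
print("country column:", len(country), "->", len(zlib.compress(country)))
print("row-major compressed:", len(zlib.compress(row_major)))
print("col-major compressed:", len(zlib.compress(col_major)))
```

Real columnar formats such as Parquet go further with dictionary and run-length encodings, but the principle is the same: similar values stored together compress better.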
Table
All data in Kudu is stored in tables. Each Table has its corresponding Table structure and primary key, and data is stored in order according to primary key. Because Kudu is designed to support very large amounts of data, the data in the Table is split into pieces called tablets.
Tablet
Similar to other distributed storage services, each tablet has multiple replicas placed on different servers, and at any time only one replica acts as the leader. Every replica can serve reads independently, while writes must be synchronized for consistency.
Tablet Server
The Tablet Server, as the name suggests, handles all reads and writes to tablets. For a given tablet, one replica acts as leader and the others as followers. Leader election and failover follow the Raft consensus algorithm, described later in this article. Note that a Tablet Server can hold only a limited number of tablets, so a Kudu table must be designed with a reasonable number of partitions: too few partitions degrade performance, while too many create too many tablets and put pressure on the Tablet Servers.
Master
The master stores all meta information of other services. At the same time, at most one master can provide services as the leader. After the leader breaks down, the leader will be re-elected according to Raft consistency algorithm.
The master coordinates metadata reads and writes from clients. For example, when creating a new table, the client sends a request to the master, and the master forwards it to the catalog table, Tablet Servers, and other services.
The master does not store data directly; its metadata is stored in a tablet that is replicated like an ordinary tablet.
The Tablet service makes heartbeat connections to the Master every second.
Raft Consensus Algorithm
Kudu uses the Raft consensus algorithm, in which nodes play follower, candidate, or leader roles. When the leader fails, a follower becomes a candidate and is elected the new leader by majority vote. Because of the majority rule, there can be at most one leader at any time. The leader receives data-modification requests from clients and replicates them to the followers; once a majority of replicas have written the data, the leader considers the write successful and acknowledges the client.
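The majority rule described above can be sketched in a few lines of Python. This is a simplified model of quorum arithmetic, not Kudu's implementation; the replica counts are invented for illustration.

```python
def majority(num_replicas: int) -> int:
    # The smallest group size such that any two quorums must overlap.
    return num_replicas // 2 + 1

def write_succeeds(acks: int, num_replicas: int) -> bool:
    # The leader acknowledges the client once a majority of replicas
    # (leader included) have persisted the change.
    return acks >= majority(num_replicas)

# With 3 replicas, 2 acks are enough; 1 is not.
print(majority(3))           # 2
print(write_succeeds(2, 3))  # True
print(write_succeeds(1, 3))  # False
# With 5 replicas, the quorum is 3, so the group tolerates 2 failures.
print(majority(5))           # 3
```

Because two majorities of the same group always intersect, at most one leader can be elected per term, which is why the article can claim "at most one leader at any given time."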
Catalog Table
The catalog table stores part of Kudu's metadata, including information about tables and tablets.
Overview of the Kudu architecture
As can be seen from the figure below, there are three masters, one of which is the leader and the other two are followers.
There are four Tablet Servers; n tablets and their replicas are distributed evenly across the four servers. Each tablet has one leader and two followers. Each table is split into multiple tablets according to its partitioning.
Impala introduction
What is Impala
Impala is an interactive SQL parsing engine built on the Hadoop ecosystem. Impala’s SQL syntax is highly Hive compatible and provides standard ODBC and JDBC interfaces. Impala itself does not provide data storage services. Its underlying data can come from HDFS, Kudu, Hbase, and even Amazon S3.
Impala, which was first developed by Cloudera and contributed to the Apache Foundation in December 2015, is now officially named Apache Impala (Incubating).
Impala is not a complete replacement for Hive. For high-throughput, long-running requests, Hive is still the best and most stable choice; even SparkSQL's stability does not yet match Hive's.
Impala is not as stable as Hive, but in performance it far outstrips Hive. Impala uses an in-memory computing model and, for distributed shuffles, makes the most of a modern machine's memory and CPU. Impala also has preprocessing and analysis capabilities: after data is inserted into a table, the COMPUTE STATS statement tells Impala to analyze the table and column statistics in depth, which helps it optimize later queries.
The advantage of the Impala
- SQL syntax highly similar to Hive's, so there is little to relearn;
- Large-scale SQL parsing with efficient use of memory and CPU, returning query results quickly;
- Integration of multiple underlying data sources: data in HDFS, Kudu, and Hbase can be shared through Impala without synchronization;
- Deep integration with Hue, providing visual SQL operations and workflow management;
- Standard JDBC and ODBC interfaces for seamless access by downstream services;
- Permission management down to the column level, meeting the data security requirements of real production environments.
How compatible is Impala with Hive SQL?
Impala is highly Hive compatible, but some Hive SQL features are not supported in Impala. These include:
- Some types, such as DATE, are not supported
- XML and Json functions are not supported
- Multiple DISTINCT expressions in a single query are not supported. To compute several DISTINCT aggregates, use a CROSS JOIN instead:

select v1.c1 result1, v2.c1 result2
from (select count(distinct col1) as c1 from t1) v1
cross join (select count(distinct col2) as c1 from t1) v2;
Impala is compatible with Hive not only in syntax but also in architecture. Impala uses Hive's metastore directly, so tables already defined in Hive are available in Impala without migration.
What does Kudu+Impala mean to us
Kudu + Impala provides a good storage solution for real-time data warehouses. The architecture supports random reads and writes while maintaining good scan performance, and it ships official client support for streaming frameworks such as Spark. In practice this means Spark streaming jobs can write data into Kudu in real time, Impala on top can serve BI analysis SQL, and the same Kudu data can be processed directly by Spark's iterative computing framework for data mining and algorithmic workloads.
The shortcomings of Kudu and Impala
Limitation of Kudu primary key
- The primary key cannot be changed after the table is created.
- The Update operation cannot modify a row's primary key. To change it, you must delete the row and insert a new one, and this combination is not atomic;
- The primary key type does NOT support DOUBLE, FLOAT, BOOL, and the primary key must NOT be NULL (NOT NULL);
- Auto-generated primary keys are not supported;
- The primary key storage cell of each row may be at most 16 KB.
The limitation of Kudu column
- Some data types in MySQL, such as DECIMAL, CHAR, VARCHAR, DATE, and ARRAY, are not supported.
- The data type and nullable column attributes are not modifiable.
- A table has a maximum of 300 columns.
Limitation of the Kudu table
- The replication factor of a table must be an odd number, with a maximum of 7;
- The replication factor cannot be changed after it is set.
The limitation of Kudu Cells
- A single cell may be at most 64 KB, measured before compression.
Limitation of Kudu fragments
- Only pre-defined (manual) partitioning is supported; automatic re-partitioning is not;
- Partitioning cannot be changed after table creation. To change it, you must create a new table, migrate the data, and drop the old one;
- If a majority of a tablet's replicas are lost, manual repair is required.
Kudu capacity limit
- The recommended maximum number of Tablet Servers is 100;
- The recommended maximum number of masters is 3;
- Each Tablet Server should store no more than 4 TB of data;
- Each Tablet Server should host no more than 1,000 tablets;
- Each table should have no more than 60 tablets on any single Tablet Server.
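These recommendations can be combined into a quick sanity check when sizing a table. The cluster size, partition count, and replication factor below are hypothetical; only the limits come from the list above.

```python
# Recommended Kudu limits, taken from the list above.
MAX_TABLETS_PER_SERVER = 1000
MAX_TABLETS_PER_TABLE_PER_SERVER = 60

def tablets_per_server(partitions: int, replication: int,
                       num_servers: int) -> float:
    # Total tablet replicas for one table, spread evenly across servers.
    return partitions * replication / num_servers

# Hypothetical cluster: 10 Tablet Servers, a table with 100 partitions
# and a replication factor of 3.
load = tablets_per_server(partitions=100, replication=3, num_servers=10)
print(load)                                      # 30.0 replicas per server
print(load <= MAX_TABLETS_PER_TABLE_PER_SERVER)  # True: within the limit
```

If the per-table load exceeded 60, the design would call for fewer partitions or more Tablet Servers before the table is created, since partitioning cannot be changed afterwards.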
Kudu other limitations
- Kudu is designed for analysis purposes, and too much data per row can cause problems;
- Secondary indexes are not supported; only the primary key is indexed;
- Multi-line transaction operations are not supported;
- Some features of relational data, such as foreign keys, are not supported;
- Column and table names enforce UTF-8 encoding and a maximum of 256 bytes;
- Deleting data does not immediately release space; space is reclaimed by compaction, which cannot be triggered manually;
- Deleting a table immediately frees up space.
Impala stability
- Impala is not suitable for very long SQL requests;
- Impala does not support high concurrent read and write operations, even though Kudu does.
- Impala is partially incompatible with Hive syntax.
FAQ
Does Impala support high concurrent reads and writes?
No. Although Impala is designed as a BI ad-hoc query platform, each SQL execution has a relatively high fixed cost, so Impala is not suited to low-latency, high-concurrency scenarios.
Can Impala replace Hive?
No. Impala is built on an in-memory computing model that is more efficient but less stable than Hive. Hive remains the first choice for long-running SQL requests.
How much memory does Impala need?
Like Spark, Impala keeps data in memory for computation as much as possible. When memory is insufficient, Impala spills to disk, but memory size largely determines Impala's efficiency and stability. Impala officially recommends at least 128 GB of memory, with 80% of it allocated to Impala.
Does Impala have Cache?
Impala does not cache table data, only metadata such as table structures. Although in practice the same query may run faster the second time, this is not an Impala cache; it comes from the Linux page cache or the underlying storage's cache.
Can you add custom functions to Impala?
Yes. Impala 1.2 and later support UDFs, although adding a UDF to Impala is more complicated than adding one to Hive.
Why is Impala so fast?
Impala is built for speed, and many details of its execution path have been optimized. At a higher level, unlike Hive, Impala does not use MapReduce as its computing model. MapReduce is a great invention that solves many distributed computing problems, but it was not designed for SQL: translating SQL into MapReduce primitives often takes multiple rounds, and the data must be written to disk between rounds, which is very wasteful.
- Impala keeps as much data as possible in memory, so a SQL query can complete without intermediate results touching disk, which is far more efficient than MapReduce's write-to-disk-per-round design;
- Impala’s resident process avoids the MapReduce startup overhead, which is a disaster for ad-hoc queries.
- Impala is designed for SQL. It reduces the number of iterations, and avoids unnecessary Shuffle and Sort.
At the same time, Impala’s modern computing framework enables it to take advantage of modern, high-performance servers.
- Impala uses LLVM to generate and run code dynamically at query time;
- Impala takes advantage of the hardware configuration as much as possible, including the SSE4.1 instruction set to prefetch data.
- Impala coordinates the disk IO itself, carefully controlling the throughput of each disk to maximize the overall throughput.
- At the code-efficiency level, Impala is written in C++ and pays close attention to low-level details, including function inlining, inner-loop unrolling, and other speed-ups;
- As for memory usage, Impala benefits from C++'s naturally smaller memory footprint compared with JVM languages, and follows a minimal-memory principle at the code level, which frees up more memory for data caching.
What are the advantages of Kudu over Hbase, and why?
Kudu is similar to Hbase in some features, and comparisons are inevitable. However, Kudu and Hbase differ in two essential ways.
- Kudu's data model is closer to a traditional relational database's, while Hbase is a thoroughly NoSQL design in which everything is bytes.
- Kudu’s disk storage model is true column storage, and Kudu’s storage architecture is very different from Hbase’s. In general, pure OLTP requests are suitable for Hbase, and combined OLTP and OLAP requests are suitable for Kudu.
Is Kudu a pure in-memory database?
Kudu is not a pure in-memory database. The data blocks of Kudu are MemRowSet and DiskRowSet, and most of the data is stored on disk.
Does Kudu have its own storage format or does it follow Parquet’s?
Kudu stores data in row format in memory and in column format on disk. The on-disk format is similar to Parquet's, with some differences to support random reads and writes.
Do compactions need to be triggered manually?
No. Compactions are designed to run automatically in the Kudu background, slowly and in small batches. Manual triggering is currently not supported.
Does Kudu support automatic expiration deletion?
No, Kudu does not; Hbase does support this feature.
Does Kudu have the same local hotspots as Hbase?
Modern distributed storage designs tend to store data in primary-key order, which can create local hotspots. For example, in a real-time log store keyed by time, new logs are always appended at the end of the key space, which can cause severe local hotspots in Hbase. Kudu has the same problem in principle but handles it much better: Kudu supports hash partitioning, so writes are first distributed across tablets by hash and only then ordered by primary key within each tablet.
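The difference between pure range partitioning and hash partitioning for time-ordered keys can be sketched as follows. This is a toy model: the bucket count and the use of MD5 as the hash are illustrative choices, not Kudu's internals.

```python
import hashlib
from collections import Counter

def hash_bucket(key: int, num_buckets: int) -> int:
    # A stable hash so the demo is deterministic across runs.
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

# 1,000 monotonically increasing timestamps, as in a log workload.
timestamps = range(1_000)

# Pure range partitioning: every new write lands in the last range,
# so one tablet absorbs the entire write load.
range_load = Counter("last-range" for _ in timestamps)

# Hash partitioning over 4 buckets spreads writes across tablets.
hash_load = Counter(hash_bucket(ts, 4) for ts in timestamps)

print(range_load)  # all 1,000 writes hit one tablet
print(hash_load)   # roughly 250 writes per bucket
```

Within each hash bucket the data can still be kept in primary-key order, which is how Kudu preserves efficient range scans while spreading the write hotspot.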
Where does Kudu fit into CAP theory?
Like Hbase, Kudu is CP in terms of the CAP theorem. Once a write succeeds for one client, all other clients can read the same data; the trade-off is that writes may be delayed when parts of the system fail.
Does Kudu support multiple indexes?
No. Kudu supports only the primary-key index, although the primary key may consist of multiple columns. Features familiar from traditional databases, such as auto-generated primary keys, secondary indexes, and foreign keys, are still being designed and developed.
What is Kudu’s support for transactions?
Kudu does not support multi-line transactions and does not support rollback transactions, but Kudu guarantees atomicity for single-line operations.
Most of the content of this article is translated and collated from Kudu and Impala’s official websites
Author: Gao Yunxiang
Wrote: August 2017