Hi, I’m Big D.

Why column storage is widely used in OLAP, and what are its advantages over row storage? Today, let’s compare the two storage methods.

In fact, column storage is not a new technology, dating back to the 1983 paper Cantor. However, due to the early hardware conditions and application scenarios, traditional transactional databases (OLTP), such as Oracle, MySQL and other relational databases, store data in the way of rows.

Until the rise of analytical database (OLAP) in recent years, the concept of column storage becomes popular again. Big data-related databases such as HBase and Cassandra store data in the form of columns.

Principles and characteristics of line storage

For OLAP scenarios, where a full row is added, deleted, modified, and reviewed, row storage is a good choice for storing data on disk in line format.

When a query queries and returns results based on the required fields, because these fields are buried in each row of data, each complete row must be read, and the large number of disk rotation addressing operations make reading much less efficient.

As an example, the emP table for employee information is shown below.

Data is stored in rows on the disk, and the data in the same row is stored next to each other.

Select * from EMP where dept = A;

select name from emp where dept = A
Copy the code

Since the values of DEPT are stored discretely on disk, the disk needs to be rotated several times during the query to complete data location and return results.

The principle and characteristics of column storage

For OLAP scenarios, a typical query would traverse the entire table, grouping, sorting, aggregating, and so on, eliminating the advantage of having a whole row of records together in row storage. Also, analytical SQL often does not use all columns, but only those columns that are needed, and the irrelevant columns in that row have to be scanned.

In column storage, however, data from the same column is stored next to each other, as shown in the figure below.

Therefore, when querying and returning results based on the required fields, it is not allowed to scan each row of data and find the required data according to the column. In this way, disk rotation is reduced and the performance is improved.

Again in the example above, since the values of DEPT in the column store are stored sequentially on disk, disk only needs to query sequentially and return the results.

Column storage not only has the advantage of on-demand query to improve efficiency, because the data in the same column belongs to the same type, such as value type, string type, etc., the similarity is very high. You can also choose to use appropriate encoding compression to reduce the storage space of data, thereby reducing I/O and improving read performance.

In general, row storage and column storage do not say who is better than who, only who is better for which application scenario.

You are very welcome to scan the qr code below and add our wechat account: Abox_0226, and note “Enter the group”. Questions about big data technology can be discussed together in the group.