Abstract: In the Apache Druid low-level storage design, we have to talk about column storage. Here you are, looking at row storage patterns, column storage patterns, their advantages and disadvantages, and optimizations for column storage patterns.

Today’s motto: Don’t be limited to one way of thinking. Learn more about different dimensions.

Let’s start with data storage

Most of our first database systems were row storage systems. When I studied database in college, the teacher asked us to imagine a database as a table. Each data record is a row of data, and each row of data contains several columns. So the way we think about most of our data storage is a slightly more complex form management system. We write the data line by line and then filter out the rows we want based on the query criteria.

Most traditional relational databases organize data in a row-oriented manner. Like Mysql, Postgresql. In recent years, more and more traditional databases have added column storage capabilities. Although the technology of column storage has been around for more than a decade, it has never been as popular a way to organize storage as it is today.

Row storage and column storage are the underlying ways of organizing data in a database. (Not on the same level as document, K-V, timing, etc.)

Line storage

Row storage systems organize data in rows. Suppose now there are the following blog data (when the university teacher assigned the system project homework always let us do a blog system, probably because they first contact with the Internet is BBS) :

[

  {

    "title""Oriented Column Store".

    "author""Alex".

    "publish_time"1508423456.

    "like_num"1024

}, {

    "title""Apache Druid".

    "author""Bob".

    "publish_time"1504423069.

    "like_num"10

}, {

    "title""Algorithm".

    "author""Casey".

    "publish_time"1512523069.

    "like_num"16

  }

]

Copy the code

Row storage stores data on disk in the following ways. So we can think about, what kind of storage is this good for? (Pause for 5 seconds to think about it here) It is useful to write data line by line. When writing a data record, you only need to append the data to the existing data record.

Row storage is suitable for Online Transaction Processing (OLTP) systems. Since data is stored on a row basis, data can be written faster. It is also easier to query data by record.

Most of you will ask, isn’t that what our system is for? So why do I need column storage, and what is column storage?

Let’s imagine a scenario where instead of looking up Bob’s blog, I want to count the number of blogs Bob has posted, or the number of blog likes the system has received today. If it is a row storage system, how will the database operate? (Pause and think for 10 seconds)

As shown in the figure, to count all likes, first read all rows into memory, and then do the sum operation on the like_num column to get the result. Assuming that the disk can read data from three boxes in the figure at a time (it actually needs to read in bytes), this aggregate calculation requires N(N= data amount) disk accesses.

This need to aggregate statistics across large data sets is a common behavior of OLAP systems. Given this need, we can see why column storage has become popular in recent years. Because the demand for data, big data, data analysis, OLAP(Online Analytical Processing), is increasing, it’s not about the transactions that data is written to, or the query of data by record, it’s about data filtering, statistics.

Column storage

With the same example data above, let’s look at how column storage organizes data.

[

  {

    "title""Oriented Column Store".

    "author""Alex".

    "publish_time"1508423456.

    "like_num"1024

}, {

    "title""Apache Druid".

    "author""Bob".

    "publish_time"1504423069.

    "like_num"10

}, {

    "title""Algorithm".

    "author""Casey".

    "publish_time"1512523069.

    "like_num"16

  }

]

Copy the code

As shown in the figure, column storage organizes data from each column together. So you can think about what’s good for that? (Pause 5 seconds)

Yes, it is useful for column operations, such as counting the sum of all like_num. The process will be as follows:

Again, it is assumed that the disk can read three boxes at a time (in bytes). As you can see, the way the data is organized by column storage takes only one disk operation.

In the world of programming, we’ve learned that any choice or inclination comes with a price. Space for time, time for space, consistent availability balance, etc. There is inevitably a downside to choosing column storage. The first manifestation is data writing.

When a new piece of data arrives, each column needs to be stored in its proper location. This requires multiple writes to the disk. (Of course, a real database does not show the “squeeze” and “move” situation, the database will be different columns of data in different places; In the case of multiple write operations, most storage systems use buffering to mitigate these problems.

contrast

Row-Store Column-Store
Because data is written and read row by row, reading data often requires reading columns that are not necessary You can read only the necessary columns
Easy to read and write data by record Writing and reading data from record to record is slow
Suitable for OLTP systems Suitable for OLAP systems
Not conducive to aggregate statistics operations on large data sets Facilitates data aggregation of large data sets
It’s bad for data compression To compress data

Column storage benefits

There are several natural advantages to column-based storage:

  • Automatic indexing

    Because the storage is based on columns, each column acts as an index by itself. So when you do something that requires an index, you don’t need an extra data structure to create a proper index for this column.

  • Good for data compression

    Compression is good for two reasons. Firstly, you will find that the cardinality of most disaggregated data is actually repeated. For example, since the same author will publish multiple blogs, the cardinality of all values in the author column must be less than the number of blogs. Therefore, there is no need to store such a large amount of data in the author column. Second, the same column data type is consistent, which is conducive to the optimization and compression of data structure filling, and for the numeric column data type can take more favorable algorithm to compress storage.

The last

The column storage schema is now common in many analytical databases. Moreover, due to the increasing demand for big data analysis, more and more traditional row storage databases have also added column storage mode, such as Oracle and Sql Server have column storage feature.

The underlying data store in Apache Druid is based on the column schema. Those of you who are interested can review it. In addition, HBase is a typical column storage database. Let’s talk a little bit about how data is stored underneath HBase sometime. You can also talk about how to compress a number column (you can also think about how to compress a number column first).

Series of articles:

Time Series Database (TSDB) introduction and selection ten minutes to understand Apache Druid Underlying storage design Apache Druid cluster design and workflow

Reference article:

https://towardsdatascience.com/the-beauty-of-column-oriented-data-2945c0c9f560

https://dataschool.com/data-modeling-101/row-vs-column-oriented-databases/

For more information about data storage, please follow my official account.

MageByte

This article was typeset using MDNICE