This is the 20th day of my participation in the August Text Challenge.
Row storage
- Data is stored row by row
- Queries that cannot use an index consume a lot of I/O
- Building indexes and views costs storage space and time
- Under a heavy load of complex queries, the database must be expanded substantially to meet performance requirements
Column storage
- Data is stored by column, with each column stored separately
- Accessing only the columns involved in a query significantly reduces system I/O
- Values in a column share one data type and similar characteristics, so they compress efficiently
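To make the two layouts concrete, here is a minimal Python sketch. The table, its column names, and the "values read" counter are all hypothetical and only stand in for real disk I/O; the point is that an aggregate over one column touches every field in the row layout but only one sequence in the column layout.

```python
# Hypothetical sample table: (user_id, country, amount).
rows = [
    (1, "DE", 10.0),
    (2, "US", 25.5),
    (3, "US", 7.25),
    (4, "FR", 12.0),
]

# Row layout: each record is kept together as one unit.
row_store = list(rows)

# Column layout: one separate sequence per column.
column_names = ("user_id", "country", "amount")
column_store = {
    name: [record[i] for record in rows]
    for i, name in enumerate(column_names)
}


def sum_amount_row_store(store):
    """Sum `amount`; every field of every record is loaded along the way."""
    total, values_read = 0.0, 0
    for record in store:
        values_read += len(record)  # the whole record is read
        total += record[2]
    return total, values_read


def sum_amount_column_store(store):
    """Sum `amount`; only the `amount` column is read."""
    column = store["amount"]
    return sum(column), len(column)


row_total, row_values_read = sum_amount_row_store(row_store)
col_total, col_values_read = sum_amount_column_store(column_store)
print(row_total, row_values_read)
print(col_total, col_values_read)
```

Both functions return the same sum, but the row layout reads three values per record while the column layout reads one. With a wider table the gap grows in proportion to the number of columns the query does not need.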
Advantages of column storage
- Analytical queries often read a large number of rows but only a few columns. In row storage, data is stored contiguously row by row, with all columns of a row in the same block, so columns that take no part in the calculation are still read during I/O and the read volume is greatly amplified. In column storage, only the columns involved in the calculation are read, which greatly reduces I/O cost and speeds up queries.
- Data in the same column has the same type and therefore compresses well. Column storage often achieves compression ratios of 10x or higher, which saves a lot of storage space and reduces storage cost.
- A higher compression ratio means a smaller data size and less time to read the corresponding data from disk.
- The compression algorithm can be chosen freely. Different columns hold different data types, so different compression algorithms suit them, and you can pick the most appropriate one for each column.
- A high compression ratio means that more data can be stored in the same size of memory and the system cache works better.
Official figures show that in some analytical scenarios, column storage can deliver speedups of 100x or more.
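The compression points above can be illustrated with two simple encodings. This is only a sketch: run-length encoding and delta encoding are chosen here for illustration, not because the article names them, and the `status` and `timestamp` columns are made-up sample data.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store only the differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out


# Hypothetical columns of the same table.
status = ["ok"] * 5 + ["error"] * 2 + ["ok"]
timestamp = [1000, 1001, 1002, 1004, 1005, 1009, 1010, 1011]

# Each column gets the encoding that suits its data.
status_rle = run_length_encode(status)       # low-cardinality strings
timestamp_deltas = delta_encode(timestamp)   # slowly increasing integers

# The same values interleaved row by row contain no runs at all.
interleaved = [value for pair in zip(timestamp, status) for value in pair]
interleaved_rle = run_length_encode(interleaved)

print(status_rle)
print(timestamp_deltas)
print(len(interleaved_rle))
```

Stored by column, the eight status values collapse into three runs and the timestamps into one base value plus small differences. Interleaved row by row, the same data produces one run per value, so run-length encoding gains nothing.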
Row storage vs. column storage: advantages and disadvantages
Aspect | Row storage | Column storage |
---|---|---|
Write performance | A row is written in a single operation, so write performance is higher | Each row must be split into individual columns and written separately, so there are many more write operations than in row storage and writes take longer |
Read performance | Reading only a few columns still traverses the irrelevant columns, causing high I/O overhead. Reading entire rows is sequential, so performance is high | Reading a few columns skips the irrelevant ones, so performance is high. Reading entire rows requires reading every column separately and reassembling the rows, so performance is low |
Data compression | Values of different columns are stored together row by row, so the compression ratio is low | Data is stored column by column, so values of the same type sit together, which suits compression algorithms and gives a high compression ratio |
Typical representatives | TextFile, SequenceFile, etc. | ORC, Parquet, CarbonData, etc. |
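The write and full-row read trade-offs in the table can be sketched as follows. `RowStore`, `ColumnStore`, and the `write_ops` counter are hypothetical names invented for this illustration. Real column stores usually buffer and batch writes, which this sketch ignores.

```python
class RowStore:
    """Toy row-oriented store: one append per inserted row."""

    def __init__(self):
        self.records = []
        self.write_ops = 0

    def insert(self, record):
        self.records.append(tuple(record))
        self.write_ops += 1

    def read_row(self, index):
        # The row is already stored as one unit.
        return self.records[index]


class ColumnStore:
    """Toy column-oriented store: one append per column per inserted row."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}
        self.write_ops = 0

    def insert(self, record):
        # The row is split and each value goes to its own column.
        for name, value in zip(self.columns, record):
            self.columns[name].append(value)
            self.write_ops += 1

    def read_row(self, index):
        # Fetch one value from every column and reassemble the row.
        return tuple(column[index] for column in self.columns.values())


sample = [(1, "DE", 10.0), (2, "US", 25.5), (3, "US", 7.25)]

row_store = RowStore()
column_store = ColumnStore(("user_id", "country", "amount"))
for record in sample:
    row_store.insert(record)
    column_store.insert(record)

print(row_store.write_ops, column_store.write_ops)
print(row_store.read_row(1), column_store.read_row(1))
```

Inserting three rows of three columns costs three appends in the row store and nine in the column store. Both return the same row on read, but the column store has to visit every column to rebuild it.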
Performance comparison of row and column storage
- Row storage
- Column storage
The images are from the ClickHouse website. For more, see my blog post on ClickHouse: "What is ClickHouse? What features does ClickHouse have?"