Composite indexes, also known as multi-column indexes, are indexes built on multiple columns of a table. They are widely supported across database systems, and Greenplum is no exception. In a previous article we introduced Greenplum's default index type, the B-tree index. This article gives a brief overview of composite indexes in Greenplum and related best practices.
Creating test data
First we create a test table, student, with 4 columns:
- id, the student's unique number
- score, the exam score, between 0 and 100
- class, the student's class, between 1 and 10
- comment, a comment field, filled with a random string in this example
article=# create table student (id int, score int, class int, comment varchar(128)) distributed by (id);
CREATE TABLE
Then insert about 20 million rows of test data, plus a small number of failing students (the second INSERT):
article=# insert into student (id, score, class, comment)
  select g, random() * 40 + 60, random() * 10 + 1, md5(g::text)
  from generate_series(1, 20000000) as g;
INSERT 0 20000000
article=# insert into student (id, score, class, comment)
  select g, random() * 60, random() * 10 + 1, md5(g::text)
  from generate_series(20000001, 20010000) as g;
INSERT 0 10000
Then create a composite index on the (score, class, id) columns. Since no index type is specified, Greenplum creates its default index type, a B-tree index.
article=# create index on student(score,class,id);
CREATE INDEX
article=# analyze student;
ANALYZE
The index entries are sorted by (score, class, id): entries are ordered by score first; when score ties, by class; and when both tie, by id. A simple example:
score class id
60 1 34
60 4 20
60 4 198
75 1 230
75 8 45
75 8 100
90 3 200
92 1 400
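This ordering is simply lexicographic ordering on the key tuple (score, class, id). A quick Python sketch (illustrative only, not Greenplum code) reproduces the eight example rows above:

```python
# Composite B-tree index entries are kept in lexicographic order of the
# key tuple (score, class, id) -- the same order Python's sorted() gives.
rows = [
    (75, 8, 100), (60, 4, 20), (92, 1, 400), (60, 1, 34),
    (75, 1, 230), (60, 4, 198), (90, 3, 200), (75, 8, 45),
]
index_order = sorted(rows)  # order by score, then class, then id
assert index_order == [
    (60, 1, 34), (60, 4, 20), (60, 4, 198), (75, 1, 230),
    (75, 8, 45), (75, 8, 100), (90, 3, 200), (92, 1, 400),
]
```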
In Greenplum, the resulting B-tree composite index is logically laid out as shown in the figure below: the rounded rectangles are index tuples, whose keys are ordered by (score, class, id) as described above; the square rectangles are B-tree nodes/pages, each storing multiple index tuples.
Adjusting the cost parameters
Before running queries, we first adjust a Greenplum GUC tied to cost estimation (see the [documentation](docs-cn.greenplum.org/v6/ref\_gui…)): random_page_cost, the estimated I/O cost of a random page access, which defaults to 100 in Greenplum (tuned for mechanical disks). By contrast, seq_page_cost, the estimated I/O cost of a sequential page access, defaults to 1.
Because my environment uses an SSD, where random and sequential reads run at about the same speed, the default random_page_cost is too high, so I lower it to 5. In general, a larger random_page_cost makes the planner favor sequential scans, while a smaller value makes index scans more likely.
article=# show seq_page_cost;
seq_page_cost
---------------
1
(1 row)
article=# show random_page_cost;
random_page_cost
------------------
100
(1 row)
article=# set random_page_cost to 5;
SET
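To see why this knob matters, here is a toy cost model (made-up page counts, and not Greenplum's actual costing formulas) showing how random_page_cost alone can flip the seq-scan vs. index-scan decision:

```python
# Toy cost model (NOT Greenplum's real planner) illustrating how
# random_page_cost influences the seq-scan vs index-scan choice.
def seq_scan_cost(heap_pages, seq_page_cost=1.0):
    # A sequential scan reads every heap page at the sequential rate.
    return heap_pages * seq_page_cost

def index_scan_cost(index_pages, heap_fetches, random_page_cost):
    # Index pages are walked mostly sequentially; each heap fetch
    # triggered by a matching index tuple is a random page access.
    return index_pages * 1.0 + heap_fetches * random_page_cost

# Hypothetical numbers, chosen only for illustration.
heap_pages, index_pages, heap_fetches = 200_000, 75_000, 20_000

# With the mechanical-disk default (100), the index scan looks expensive:
assert index_scan_cost(index_pages, heap_fetches, 100) > seq_scan_cost(heap_pages)
# With an SSD-friendly setting (5), the index scan wins:
assert index_scan_cost(index_pages, heap_fetches, 5) < seq_scan_cost(heap_pages)
```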
Query examples
Let's look at how the composite index is used in Greenplum through a few queries.
- Query the total number of failing students
article=# explain (costs off) select count(*) from student where score < 60;
QUERY PLAN
-------------------------------------------------------------------------
Aggregate
-> Gather Motion 2:1 (slice1; segments: 2)
-> Aggregate
-> Bitmap Heap Scan on student
Recheck Cond: (score < 60)
-> Bitmap Index Scan on student_score_class_id_idx
Index Cond: (score < 60)
Optimizer: Postgres query optimizer
(8 rows)
As the query plan shows, an index scan [1] is used here (strictly a Bitmap Index Scan, which is also a kind of index scan; see the earlier article). Given the B-tree structure introduced above, the scan locates the first entry satisfying score < 60 in the B-tree; since the leaf level is ordered, all matching results can be retrieved quickly by following the right-sibling pointers from leaf page to leaf page.
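The leaf-level range scan can be sketched in Python with a sorted list standing in for the leaf level (illustrative only; the entries and the bisect-based lookup are assumptions for the sketch, not Greenplum internals):

```python
import bisect

# A stand-in for the leaf level: index entries sorted by (score, class, id).
entries = sorted([
    (60, 1, 34), (75, 8, 45), (59, 2, 7),
    (58, 9, 3), (92, 1, 400), (60, 4, 20),
])

# All entries with score < 60 form one contiguous run at the left end of
# the sorted leaf level, so a single range scan suffices.
hi = bisect.bisect_left(entries, (60,))  # first entry with score >= 60
failing = entries[:hi]
assert failing == [(58, 9, 3), (59, 2, 7)]
```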
- Query the total number of failing students in a given class
article=# explain (costs off) select count(*) from student where class = 1 and score < 60;
QUERY PLAN
-------------------------------------------------------------------------
Aggregate
-> Gather Motion 2:1 (slice1; segments: 2)
-> Aggregate
-> Bitmap Heap Scan on student
Recheck Cond: ((score < 60) AND (class = 1))
-> Bitmap Index Scan on student_score_class_id_idx
Index Cond: ((score < 60) AND (class = 1))
Optimizer: Postgres query optimizer
(8 rows)
As the plan shows, an index scan is still used. Together with the previous example, this illustrates the general rule: a composite index can be used when the query conditions match a leading prefix of the index columns (here score + class), regardless of the order in which they appear in the WHERE clause (here class is written first, yet the index is still usable).
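The prefix rule can be sketched as a small helper (a hypothetical illustration, not planner code) that computes which leading index columns a set of predicate columns can use for a range scan; note that, as the later examples show, the planner may still choose a full index scan even when no prefix is usable:

```python
def usable_prefix(index_cols, predicate_cols):
    """Return the longest leading prefix of the index columns whose members
    all appear in the predicate, regardless of the order they were written
    in the WHERE clause. (Toy model of range-scan applicability.)"""
    prefix = []
    for col in index_cols:
        if col not in predicate_cols:
            break
        prefix.append(col)
    return prefix

idx = ["score", "class", "id"]
assert usable_prefix(idx, {"score"}) == ["score"]
# WHERE class = 1 AND score < 60: textual order is irrelevant.
assert usable_prefix(idx, {"class", "score"}) == ["score", "class"]
# WHERE class = 3 alone matches no leading prefix.
assert usable_prefix(idx, {"class"}) == []
```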
- Query the students in class 3
article=# explain (costs off) select * from student where class = 3;
QUERY PLAN
------------------------------------------
Gather Motion 2:1 (slice1; segments: 2)
-> Seq Scan on student
Filter: (class = 3)
Optimizer: Postgres query optimizer
(4 rows)
This query does not use the index; it uses a plain sequential scan (Seq Scan). The reason is that index access tends to incur a lot of random I/O, and this query is expected to hit about 1/10 of all rows, so a sequential scan is more efficient: sequential I/O plus cache hits. This again reflects the rule that a composite index is normally used only when a leading prefix is matched.
- Query the student with id = 100
Although we have stressed the prefix rule of composite indexes several times, a query that matches no prefix does not automatically rule out the composite index.
article=# explain (costs off) select * from student where id = 100;
QUERY PLAN
----------------------------------------------------------------
Gather Motion 2:1 (slice1; segments: 2)
-> Index Scan using student_score_class_id_idx on student
Index Cond: (id = 100)
Optimizer: Postgres query optimizer
(4 rows)
As the plan shows, the composite index is used even though id is not a prefix of (score, class, id). Having worked with MySQL for many years, I knew its leftmost-prefix rule well, so this index choice in Greenplum surprised me. Could the query plan be wrong? Let's compare the actual running time of this query with a sequential scan and with an index scan:
article=# explain analyze select * from student where id = 100;
QUERY PLAN
-------------------------------------------------------------------------------------
[...]
-> Seq Scan on student (cost=0.00.. rows=1 width=45) (actual time=163.513..2282.622 rows=1 loops=1)
[...]
article=# explain analyze select * from student where id = 100;
QUERY PLAN
-------------------------------------------------------------------------------------
[...]
Index Scan using student_score_class_id_idx on student (cost=0.19.. rows=1 width=45) (actual time=134.649..636.103 rows=1 loops=1)
[...]
Execution time: 637.869 ms
The index scan is indeed faster, beating the sequential scan by roughly a factor of 3.6 (about 636 ms vs. 2283 ms). Why? For the predicate id = 100, since id is not a prefix of the composite index, the matching index tuples are not stored contiguously; by the B-tree search procedure, every index tuple must be read to test the condition. The figure below compares this read pattern with that of a sequential scan:
From the read patterns in the figure above we can see:
- Sequential scan (Seq Scan) reads data tuples directly from the heap. Since no ordering is required, the I/O is essentially sequential, but each data tuple is larger than the corresponding index tuple (it additionally carries the comment field).
- Index scan (Index Scan):
a. every index page must be read; I/O within a page is sequential, while I/O across pages is random;
b. tuples matching the index condition must then be fetched from the heap, which costs a large amount of random I/O (when many tuples match).
Back in our query, only one row matches id = 100, so part (b) of the index scan is negligible; the total time is essentially the time to read all index pages versus all heap pages. The index is smaller overall because it omits the comment field, which is why the index scan wins here. You can compare the total sizes of the heap pages and the index pages with the following SQL; the index is indeed smaller:
article=# select pg_relation_size('student');
pg_relation_size
------------------
1724776448
(1 row)
article=# select pg_indexes_size('student');
pg_indexes_size
-----------------
624852992
(1 row)
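Plugging in the two sizes reported above confirms the intuition (simple arithmetic, not a planner calculation):

```python
# Sizes reported by pg_relation_size('student') and pg_indexes_size('student').
heap_bytes = 1_724_776_448
index_bytes = 624_852_992
ratio = heap_bytes / index_bytes
# The heap is roughly 2.8x larger than the index, so reading every index
# page can still be cheaper than reading every heap page.
assert round(ratio, 1) == 2.8
```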
This also shows that Greenplum's query optimizer is sophisticated enough to cost out this situation accurately, rather than rigidly applying a leftmost-prefix rule to decide whether an index is usable.
Conclusion
As shown above, if queries often combine several columns of a table, creating a composite index in Greenplum can help speed them up. When creating the index, put the most selective or most frequently queried columns first.
In addition, because indexes in Greenplum are secondary indexes (i.e. stored separately from the table data), a composite index on (col1, col2, col3) is effectively equivalent to the three indexes (col1), (col1, col2), and (col1, col2, col3). Compared with creating multiple independent indexes, a single composite index can therefore greatly reduce index space overhead.