Greenplum is a massively parallel processing database consisting of a master and multiple segments, with table data distributed across the segments according to a defined distribution strategy. A single row of a table is allocated to one or more segments, but with so many segments, which one (or ones) will it land on? The distribution strategy decides.
Distribution strategy
In Greenplum 5, there are two distribution strategies:
- Hash distribution
- Random distribution
In Greenplum 6, a third strategy is added:
- Hash distribution
- Random distribution
- Replicated distribution
Hash distribution:
To use this strategy, create the table with the “DISTRIBUTED BY (column [, …])” clause.
The hash algorithm assigns each row to a specific segment based on the distribution key. Rows with the same distribution key value are always hashed to the same segment. Choosing a unique distribution key, such as the primary key, ensures a more even distribution of data. Hash distribution is the default distribution strategy for tables.
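As a minimal sketch (the table and column names here are only illustrative):

```sql
-- Hash-distribute on a unique key so rows spread evenly across segments.
-- Table and column names are illustrative, not from any particular schema.
CREATE TABLE orders (
    order_id    bigint,
    customer_id bigint,
    order_date  date,
    amount      numeric(12,2)
) DISTRIBUTED BY (order_id);
```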
If the DISTRIBUTED clause is not provided when the table is created, Greenplum uses the PRIMARY KEY (if the table actually has one) or the first eligible column of the table as the distribution key. Which column types are eligible? Columns of geometric types or user-defined data types cannot be used as Greenplum distribution key columns. If there are no eligible columns in the table, the table falls back to the random distribution strategy.
However, if a DISTRIBUTED clause is not provided, Greenplum’s final distribution strategy is influenced by other factors: the GUC gp_create_table_random_default_distribution and the optimizer in use at the time also affect the final decision. Therefore, don’t forget to add the DISTRIBUTED BY clause to CREATE TABLE. Otherwise, the distribution strategy of the table might be a Schrödinger’s cat.
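If you are unsure what a table ended up with, you can check it. A quick sketch, using the hypothetical orders table from above (note that the exact columns of gp_distribution_policy differ between GP5 and GP6):

```sql
-- In psql, \d+ orders prints a "Distributed by:" footer for the table.

-- Or query the catalog directly; the column layout of this catalog table
-- differs between Greenplum 5 and 6, so SELECT * keeps the sketch portable.
SELECT *
FROM gp_distribution_policy
WHERE localoid = 'orders'::regclass;
```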
Random distribution:
To use this strategy, create the table with the “DISTRIBUTED RANDOMLY” clause.
Random distribution sends rows to the segments in round-robin fashion, in the order they arrive. Unlike the hash distribution strategy, rows with the same value are not necessarily stored on the same segment. Although random distribution ensures an even distribution of data, hash distribution should be chosen whenever possible because it offers better performance.
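A short sketch, with an illustrative table name, for data that has no good hash key:

```sql
-- Round-robin rows across segments; useful when no natural, unique key exists.
CREATE TABLE raw_events (
    event_time timestamp,
    payload    text
) DISTRIBUTED RANDOMLY;
```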
Replicated distribution:
This distribution strategy is a new feature of GPDB 6.
To use this strategy, use the “DISTRIBUTED REPLICATED” clause when creating tables.
The Greenplum database stores every row on every segment. With this distribution strategy, the table data is trivially even because each segment stores the same rows. Replicated distribution is used when user-defined functions running on the segments need access to all rows of the table, or when a large table is joined with a small table; declaring the sufficiently small table as replicated may also improve performance.
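A sketch of the small-dimension-table case (GPDB 6 only; names are illustrative):

```sql
-- Keep a full copy of a small dimension table on every segment, so joins
-- against it do not force the large fact table to be redistributed.
CREATE TABLE dim_region (
    region_id   int,
    region_name text
) DISTRIBUTED REPLICATED;
```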
Note one exception: catalog tables have no distribution policy.
Summary of the three strategies:

| | Hash distribution | Random distribution | Replicated distribution |
|---|---|---|---|
| Applies to | GP5, GP6 | GP5, GP6 | GP6 |
| Clause | DISTRIBUTED BY (column [, …]) | DISTRIBUTED RANDOMLY | DISTRIBUTED REPLICATED |
| Default strategy? | ✔ | ✘ | ✘ |
| Each row stored on | 1 segment | 1 segment | All N segments |
| Uniform distribution | Depends on the distribution key chosen | ✔ | ✔ |
| Query performance | ✔ | ✘ | – |
Partitioning strategies
Now let’s look at partitioning. For Greenplum novices, partitioning is easily confused with distribution, but the two are fundamentally different: distribution is a physical division of stored data across segments, while partitioning is a logical division.
Partitioning is done through the “PARTITION BY” clause, which divides a large table into multiple child tables. The “SUBPARTITION BY” clause can divide those child tables into even smaller tables. In theory, Greenplum places no limit on how many partition levels or how many partitioned tables a root table can have, but at any given partition level (that is, at any level of the table hierarchy) a partitioned table can have at most 32,767 child partitions.
When you consider only distribution, you can treat a partitioned table as a normal table. For a root table, a row is first allocated to a child partition, and then that individual partition is distributed over Greenplum’s segments, just like any unpartitioned table, according to the partition’s distribution strategy. Tables in a Greenplum database are physically distributed across Greenplum segments, which makes parallel query processing possible. Table partitioning is a tool for logically dividing large tables to improve query performance and facilitate data warehouse maintenance tasks. Partitioning does not change the physical distribution of table data across segments.
Greenplum supports the following partition types:
- RANGE partitioning: divides data based on numeric ranges, such as dates or prices.
- LIST partitioning: divides data based on a list of values, such as sales regions or product lines.
- A combination of the two types.
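A sketch combining both types, with illustrative table, column, and partition names: a table distributed by id, range-partitioned by month, with LIST subpartitions by region.

```sql
-- Distribution (physical) and partitioning (logical) are declared together.
CREATE TABLE sales (
    id      bigint,
    sale_dt date,
    region  text,
    amount  numeric(12,2)
) DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_dt)
  SUBPARTITION BY LIST (region)
  SUBPARTITION TEMPLATE (
      SUBPARTITION emea VALUES ('emea'),
      SUBPARTITION apac VALUES ('apac'),
      DEFAULT SUBPARTITION other_regions
  )
( START (date '2023-01-01') INCLUSIVE
  END   (date '2024-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```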
Partitioning large tables improves query performance and simplifies database maintenance tasks, such as rolling old data out of the database.
But don’t create more partitions than you need. Creating too many partitions can slow down management and maintenance tasks, such as vacuuming, recovering segments, expanding the cluster, and checking disk usage.
Unless the query optimizer can prune partitions based on the query’s predicates, partitioning will not improve query performance. Queries that have to scan every child partition in turn run slower than the same query scanning an unpartitioned table, so if partition pruning rarely applies to your queries, avoid partitioning the table. In GPCC, you can check the visual plan in the query monitor to see whether irrelevant partitions are being scanned.
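One quick way to check whether pruning kicks in, using the hypothetical sales table sketched above, is to EXPLAIN a query whose predicate is on the partition key and count how many child partitions appear in the plan:

```sql
-- With a predicate on the partition key, the plan should touch only the
-- matching monthly partitions; without it, every child partition is scanned.
EXPLAIN
SELECT sum(amount)
FROM sales
WHERE sale_dt >= date '2023-03-01'
  AND sale_dt <  date '2023-04-01';
```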
You may also encounter another type of partition: the default partition.
When an incoming row does not match any partition, it is inserted into the default partition. If the partition design has no default partition, the insert is rejected.
The default partition is a double-edged sword: it makes table operations safer, but it can also mask problems.
Suppose you have a table and create partitions based on the “age” column. It defines a LIST such that a row with age 1 goes into partition1; a row with age 2 goes into partition2; … ; a row with age 100 goes into partition100. But one day a 101-year-old comes along, and BANG, an error occurs, because you never created a partition for age = 101, so there is no partition101 table either. This person has nowhere to go.
If you create a default partition for this table, the 101-year-old goes into that default partition. The problem is solved and everyone is happy.
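As a sketch of that fix, a default partition can be declared together with the LIST partitions, or added afterwards (the people table and partition names here are only illustrative):

```sql
-- Declare a catch-all partition so rows with unexpected ages are not rejected.
CREATE TABLE people (
    id  bigint,
    age int
) DISTRIBUTED BY (id)
PARTITION BY LIST (age)
( PARTITION age_1 VALUES (1),
  PARTITION age_2 VALUES (2),
  -- ... one partition per age, up to 100 ...
  PARTITION age_100 VALUES (100),
  DEFAULT PARTITION other_ages );

-- Or, if the table was created without one, add a default partition later:
ALTER TABLE people ADD DEFAULT PARTITION other_ages;
```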
But if one day we live longer, say to 200 years, then everyone over 100 lands in the default partition. The default partition grows bigger and bigger, and if no one notices, queries get slower and slower because that partition is too big for partition pruning to be effective.
Given how important these distribution and partitioning strategies are for tables, you might ask how to monitor these conditions and detect exceptions early.
We’ll explain more in our next post, How GPCC can Help.