Create and distribute tables
To create a distributed table, you first need to define the table schema. To do this, you can define a table using the CREATE TABLE statement, just as you would for a regular PostgreSQL table.
- CREATE TABLE
- www.postgresql.org/docs/curren…
CREATE TABLE github_events
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    actor jsonb,
    org jsonb,
    created_at timestamp
);
Next, you can specify the table's distribution column and create worker shards using the create_distributed_table() function.
SELECT create_distributed_table('github_events', 'repo_id');
This function informs Citus that the github_events table should be distributed on the repo_id column (by hashing the column values). It also uses the citus.shard_count and citus.shard_replication_factor configuration values to create shards on the worker nodes.
This example creates a total of citus.shard_count shards, each owning a portion of the hash token space and replicated according to the default citus.shard_replication_factor configuration value. The shard replicas created on the workers have the same table schema, indexes, and constraint definitions as the table on the coordinator. After creating the replicas, the function saves all distributed metadata on the coordinator.
Each created shard is assigned a unique shard ID, and all of its replicas share that shard ID. Each shard is represented on a worker node as a regular PostgreSQL table named tablename_shardid, where tablename is the name of the distributed table and shardid is the unique ID assigned to the shard. You can connect to a worker's Postgres instance to view or run commands on individual shards.
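If you want to see where those shards live, one rough way is to query the shard metadata on the coordinator and then connect to a worker and list the shard tables directly. This is only a sketch, using the metadata tables documented for recent Citus versions:
-- on the coordinator: list github_events shards and where they are placed
SELECT shardid, nodename, nodeport
FROM pg_dist_shard_placement
WHERE shardid IN (
    SELECT shardid
    FROM pg_dist_shard
    WHERE logicalrelid = 'github_events'::regclass
);

-- on a worker, the shards appear as ordinary tables; in psql, for example:
-- \dt github_events_*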
You are now ready to insert data into distributed tables and run queries against them. You can also learn more about the UDFs used in this section in the Citus Utility Functions section of the document.
- Citus Utility Functions
- Docs.citusdata.com/en/v11.0-be…
Reference tables
The method above distributes a table into multiple horizontal shards. Another possibility is to distribute a table into a single shard and replicate that shard to every worker node. Tables distributed this way are called reference tables. They are used to store data that needs to be frequently accessed by multiple nodes in the cluster.
Common candidates for reference tables include:
- Smaller tables that need to join with larger distributed tables.
- Tables in multi-tenant applications that lack a tenant ID column or are not associated with a tenant. (In some cases, to reduce migration effort, users may even choose to create reference tables from tables associated with a tenant that currently lack a tenant ID.)
- Tables that need unique constraints spanning multiple columns and are small enough.
For example, suppose a multi-tenant e-commerce site needs to calculate sales tax for transactions at any of its stores. Tax information is not specific to any tenant, so it makes sense to consolidate it into a shared table. A US-centric reference table might look like this:
-- a reference table
CREATE TABLE states (
    code char(2) PRIMARY KEY,
    full_name text NOT NULL,
    general_sales_tax numeric(4,3)
);

-- distribute it to all workers
SELECT create_reference_table('states');
Queries such as calculating sales tax for a shopping cart can now join against the states table with no network overhead, and a foreign key can be added to the state code for better validation.
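As an illustration only, the tax calculation and the foreign key might look like the following sketch; it assumes a hypothetical orders distributed table with subtotal and delivery_state columns, which is not part of the example above:
-- join a distributed table against the reference table (no network overhead)
SELECT o.order_id,
       o.subtotal * (1 + s.general_sales_tax) AS total_with_tax
FROM orders o
JOIN states s ON s.code = o.delivery_state;

-- validate state codes with a foreign key to the reference table
ALTER TABLE orders ADD CONSTRAINT orders_state_fk
    FOREIGN KEY (delivery_state) REFERENCES states (code);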
In addition to distributing the table as a single replicated shard, the create_reference_table UDF marks it as a reference table in the Citus metadata tables. Citus automatically performs two-phase commits (2PC) for modifications to tables marked this way, which provides strong consistency guarantees.
- 2PC
- En.wikipedia.org/wiki/Two-ph…
If you have an existing distributed table, you can change it to a reference table by running:
SELECT undistribute_table('table_name');
SELECT create_reference_table('table_name');
For another example of using reference tables in a multi-tenant application, see Sharing Data between tenants.
- Share data between tenants
- Docs.citusdata.com/en/v11.0-be…
Distribute coordinator data
If an existing PostgreSQL database is converted into a coordinator node for a Citus cluster, the data in its tables can be distributed efficiently with minimal disruption to the application.
The create_distributed_table function described earlier works on both empty and non-empty tables, and for the latter it automatically distributes the table's rows across the cluster. You will know it has done so by the presence of the message "NOTICE: Copying data from local table…". For example:
CREATE TABLE series AS SELECT i FROM generate_series(1, 1000000) i;
SELECT create_distributed_table('series', 'i');
NOTICE: Copying data from local table...
NOTICE: copying the data has completed
DETAIL: The local data in the table is no longer visible, but is still on disk.
HINT: To remove the local data, run: SELECT truncate_local_data_after_distributing_table($$public.series$$)
create_distributed_table
--------------------------
(1 row)
Data migration blocks writes to the table, and pending writes are handled as distributed queries once the function commits. (If the function fails, the queries become local again.) Reads can continue as normal and become distributed queries once the function commits.
If tables A and B are distributed and A has a foreign key to B, set the distribution key for the target table B first. Executing this in the wrong order causes an error:
ERROR: cannot create foreign key constraint
DETAIL: Referenced table must be a distributed table or a reference table.
If they cannot be distributed in the correct order, drop the foreign keys, distribute the tables, and then recreate the foreign keys.
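As a sketch of that workaround (the table, column, and constraint names here are illustrative; the two tables are distributed and co-located on the referenced key so that the recreated foreign key remains valid):
-- drop the foreign key, distribute both tables, then recreate it
ALTER TABLE a DROP CONSTRAINT a_b_fk;
SELECT create_distributed_table('b', 'id');
SELECT create_distributed_table('a', 'b_id', colocate_with => 'b');
ALTER TABLE a ADD CONSTRAINT a_b_fk
    FOREIGN KEY (b_id) REFERENCES b (id);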
After the tables are distributed, run the truncate_local_data_after_distributing_table function to remove the local data. Citus queries cannot access leftover local data in distributed tables, and it can cause unrelated constraint violations on the coordinator.
- truncate_local_data_after_distributing_table
- Docs.citusdata.com/en/v11.0-be…
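A minimal example, mirroring the HINT in the output above for the series table:
-- remove the now-hidden local copy of the data on the coordinator
SELECT truncate_local_data_after_distributing_table('public.series');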
When migrating data from an external database, such as from Amazon RDS to Citus Cloud, first create the Citus distributed tables with create_distributed_table, and then copy the data into them. Copying into distributed tables avoids running out of space on the coordinator node.
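A rough sketch of that workflow in psql (the file name and format are illustrative):
-- distribute the (still empty) table first
SELECT create_distributed_table('github_events', 'repo_id');

-- then load the exported data with psql's client-side \copy
\copy github_events FROM 'github_events.csv' WITH CSV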
Co-locate tables
Co-location is the practice of strategically partitioning data, keeping related information on the same machine for efficient relational operations while taking advantage of the horizontal scalability of the entire data set. For more information and examples, see table co-location.
- Table co-location
- Docs.citusdata.com/en/v11.0-be…
Tables are co-located in groups. To manually control a table's co-location group assignment, use the optional colocate_with parameter of create_distributed_table. If you do not care about a table's co-location, omit this parameter. It defaults to the value 'default', which groups the table with any other default co-location table that has the same distribution column type, shard count, and replication factor. If you want to break or update this implicit co-location, use update_distributed_table_colocation().
-- these tables are implicitly co-located by using the same
-- distribution column type and shard count with the default
-- co-location group
SELECT create_distributed_table('A', 'some_int_col');
SELECT create_distributed_table('B', 'other_int_col');
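If you later need to break or update such an implicit grouping, the update_distributed_table_colocation() call mentioned above can be used; a minimal sketch:
-- explicitly co-locate B with A ...
SELECT update_distributed_table_colocation('B', colocate_with => 'A');
-- ... or move B into its own group
SELECT update_distributed_table_colocation('B', colocate_with => 'none');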
Specify colocate_with => 'none' when a new table is not related to the other tables in its would-be implicit co-location group.
-- not co-located with other tables
SELECT create_distributed_table('A', 'foo', colocate_with => 'none');
Splitting unrelated tables into their own co-location groups improves shard rebalancing performance, because shards in the same group have to be moved together.
- Shard rebalancing
- Docs.citusdata.com/en/v11.0-be…
When tables are really related (for example, when they will be joined), it makes sense to put them together explicitly. The benefits of proper co-location outweigh any rebalancing costs.
To explicitly co-locate multiple tables, distribute one table and then place the others into its co-location group. For example:
-- distribute stores
SELECT create_distributed_table('stores', 'store_id');
-- add to the same group as stores
SELECT create_distributed_table('orders', 'store_id', colocate_with => 'stores');
SELECT create_distributed_table('products', 'store_id', colocate_with => 'stores');
Information about co-location groups is stored in the pg_dist_colocation table, while pg_dist_partition shows which tables are assigned to which groups.
- pg_dist_colocation
- Docs.citusdata.com/en/v11.0-be…
- pg_dist_partition
- Docs.citusdata.com/en/v11.0-be…
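For example, the assignments can be inspected on the coordinator like this (column names as documented for recent Citus versions):
-- list co-location groups and their properties
SELECT * FROM pg_dist_colocation;

-- see which group each distributed table belongs to
SELECT logicalrelid, colocationid FROM pg_dist_partition;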
Upgrade from Citus 5.x
Starting with Citus 6.0, co-location became a first-class concept, and table assignments to co-location groups began being tracked in pg_dist_colocation. Because Citus 5.x did not have this concept, tables created with Citus 5 were not explicitly marked as co-located in the metadata, even when they were physically co-located.
Because Citus uses co-location metadata for query optimization and push-down, it becomes critical to inform Citus of this co-location for previously created tables. To fix the metadata, simply mark such tables as co-located using mark_tables_colocated:
-- Assume that stores, products and line_items were created in a Citus 5.x database.
-- Put products and line_items into store's co-location group
SELECT mark_tables_colocated('stores', ARRAY['products', 'line_items']);
- mark_tables_colocated
- Docs.citusdata.com/en/v11.0-be…
This function requires that the tables be distributed with the same method, column type, shard count, and replication method. It does not re-shard or physically move data; it only updates Citus metadata.
Drop tables
You can use the standard PostgreSQL DROP TABLE command to drop your distributed tables. As with regular tables, DROP TABLE removes any indexes, rules, triggers, and constraints that exist on the target table. In addition, it drops the shards on the worker nodes and cleans up their metadata.
DROP TABLE github_events;
Modify the table
Citus automatically propagates many kinds of DDL statements, which means that modifying a distributed table on the coordinator node also updates its shards on the workers. Other DDL statements need to be propagated manually, and certain others, such as those that would modify a distribution column, are prohibited. Attempting to run DDL that is ineligible for automatic propagation raises an error and leaves the tables on the coordinator node unchanged.
The following is a reference to the types of DDL statements that are propagated. Note that automatic propagation can be enabled or disabled using configuration parameters.
- Configuration parameters
- Docs.citusdata.com/en/v11.0-be…
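For reference, the relevant setting is citus.enable_ddl_propagation, which is on by default; turning it off stops DDL from being sent to the workers:
-- disable automatic DDL propagation for the current session
SET citus.enable_ddl_propagation TO off;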
Add/modify columns
Citus automatically propagates most ALTER TABLE commands. Adding columns or changing their default values works the same as in a standalone PostgreSQL database:
- ALTER TABLE
- www.postgresql.org/docs/curren…
-- Adding a column
ALTER TABLE products ADD COLUMN description text;
-- Changing default value
ALTER TABLE products ALTER COLUMN price SET DEFAULT 7.77;
You can also make major changes to existing columns, such as renaming them or changing their data types. However, the data type of the distribution column cannot be changed. This column determines how table data is distributed across the Citus cluster, and changing its data type would require moving the data.
- Distribution column
- Docs.citusdata.com/en/v11.0-be…
Attempting to do this results in an error:
-- assuming store_id is the distribution column
-- for products, and that it has type integer
ALTER TABLE products
ALTER COLUMN store_id TYPE text;
/* ERROR: cannot execute ALTER TABLE command involving partition column */
As a workaround, you might consider changing the distribution column, updating it, and then changing back.
- Change distribution column
- Docs.citusdata.com/en/v11.0-be…
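As a sketch of that workaround using alter_distributed_table (available in Citus 10 and later; the column names here are illustrative): temporarily switch the distribution column to another column, alter the old one, and switch back.
-- make another column the distribution column temporarily
SELECT alter_distributed_table('products', distribution_column => 'other_col');
-- now the type of the former distribution column can be changed
ALTER TABLE products ALTER COLUMN store_id TYPE text;
-- switch the distribution column back
SELECT alter_distributed_table('products', distribution_column => 'store_id');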
Add/remove constraints
With Citus you can continue to enjoy the safety of a relational database, including database constraints (see the PostgreSQL documentation). Due to the nature of distributed systems, however, Citus does not cross-reference uniqueness constraints or referential integrity across worker nodes.
- Database constraints
- www.postgresql.org/docs/curren…
Foreign keys may be created in these cases:
- Between two local (non-distributed) tables
- Between two reference tables
- Between reference tables and local tables (enabled by default, via citus.enable_local_reference_table_foreign_keys (boolean))
- Docs.citusdata.com/en/v11.0-be…
- Between two co-located distributed tables, when the key includes the distribution column
- Docs.citusdata.com/en/v11.0-be…
- From a distributed table to a reference table
- Docs.citusdata.com/en/v11.0-be…
Foreign keys from reference tables to distributed tables are not supported.
Citus supports all foreign key referential actions from local tables to reference tables, but does not support ON DELETE/UPDATE CASCADE in the reverse direction (reference to local).
Primary keys and uniqueness constraints must include the distribution column. Adding them on a non-distribution column produces an error (see Unable to create a unique constraint).
- Unable to create a unique constraint
- Docs.citusdata.com/en/v11.0-be…
This example shows how to create primary and foreign keys on a distributed table:
--
-- Adding a primary key
-- --------------------
-- We'll distribute these tables on the account_id. The ads and clicks
-- tables must use compound keys that include account_id.
ALTER TABLE accounts ADD PRIMARY KEY (id);
ALTER TABLE ads ADD PRIMARY KEY (account_id, id);
ALTER TABLE clicks ADD PRIMARY KEY (account_id, id);
-- Next distribute the tables
SELECT create_distributed_table('accounts', 'id');
SELECT create_distributed_table('ads', 'account_id');
SELECT create_distributed_table('clicks', 'account_id');
--
-- Adding foreign keys
-- -------------------
-- Note that this can happen before or after distribution, as long as
-- there exists a uniqueness constraint on the target column(s) which
-- can only be enforced before distribution.
ALTER TABLE ads ADD CONSTRAINT ads_account_fk
FOREIGN KEY (account_id) REFERENCES accounts (id);
ALTER TABLE clicks ADD CONSTRAINT clicks_ad_fk
FOREIGN KEY (account_id, ad_id) REFERENCES ads (account_id, id);
Also, include distribution columns in the uniqueness constraint:
-- Suppose we want every ad to use a unique image. Notice we can
-- enforce it only per account when we distribute by account id.
ALTER TABLE ads ADD CONSTRAINT ads_unique_image
UNIQUE (account_id, image_url);
Non-null constraints can be applied to any column (distributed or not) because they do not require lookups between working nodes.
ALTER TABLE ads ALTER COLUMN image_url SET NOT NULL;
Use the NOT VALID constraint
In some cases it can be useful to enforce a constraint for new rows while allowing existing non-conforming rows to remain unchanged. Citus supports this for CHECK constraints and foreign keys using PostgreSQL's NOT VALID constraint designation.
For example, consider an application that stores user profiles in reference tables.
- Reference tables
- Docs.citusdata.com/en/v11.0-be…
-- we're using the "text" column type here, but a real application
-- might use "citext" which is available in a postgres contrib module
CREATE TABLE users ( email text PRIMARY KEY );
SELECT create_reference_table('users');
Over time, imagine that a few non-addresses get into the table.
INSERT INTO users VALUES
('foo@example.com'), ('bar@example.com'), ('lol');
We want to validate the addresses, but PostgreSQL normally does not allow us to add a CHECK constraint that existing rows would fail. However, it does allow a constraint marked NOT VALID:
ALTER TABLE users
ADD CONSTRAINT syntactic_email
CHECK (email ~
'^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'
) NOT VALID;
This works, and new rows are protected:
INSERT INTO users VALUES ('fake');
/* ERROR: new row for relation "users_102010" violates check constraint "syntactic_email_102010" DETAIL: Failing row contains (fake). */
Later, during off-peak hours, a database administrator can try to fix the bad rows and re-validate the constraint.
-- later, attempt to validate all rows
ALTER TABLE users
VALIDATE CONSTRAINT syntactic_email;
The PostgreSQL documentation has more information about NOT VALID and VALIDATE CONSTRAINT in the ALTER TABLE section.
- ALTER TABLE
- www.postgresql.org/docs/curren…
Add/drop indexes
Citus supports adding and removing indexes:
-- Adding an index
CREATE INDEX clicked_at_idx ON clicks USING BRIN (clicked_at);
-- Removing an index
DROP INDEX clicked_at_idx;
- Add and drop indexes
- www.postgresql.org/docs/curren…
Adding an index requires a write lock, which can be undesirable in a multi-tenant "system of record." To minimize application downtime, create the index concurrently instead. This method takes more total work and longer to complete than a standard index build. However, because it allows normal operations to continue while the index is built, it is useful for adding new indexes in a production environment.
-- Adding an index without locking table writes
CREATE INDEX CONCURRENTLY clicked_at_idx ON clicks USING BRIN (clicked_at);
- SQL-CREATEINDEX-CONCURRENTLY
- www.postgresql.org/docs/curren…
Manual modification
Currently other DDL commands are not automatically propagated; however, you can propagate changes manually. See Manual query propagation.
- Manual query propagation
- Docs.citusdata.com/en/v11.0-be…
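The linked page covers the helper functions for this. One of them, run_command_on_workers(), runs a statement on every worker node; a minimal sketch:
-- run an arbitrary statement on each worker and collect the results
SELECT run_command_on_workers($cmd$ SHOW server_version $cmd$);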
More
- Quickly test a Citus distributed PostgreSQL cluster (distributed tables, co-location, reference tables, columnar storage) on Kubernetes