This is the 10th day of my participation in the August More text Challenge. For details, see: August More Text Challenge

collation

Collations and Case Sensitivity: Collations and Case Sensitivity. In general, in queries and text, we care about case. Case sensitivity, in the general sense, should be called collations (or collates).

In related ORM frameworks, for case-sensitive databases, string.lower is most likely used to force case-insensitive comparisons. But doing so might prevent the application from using the index…

EF Core 5.0, for example, introduced the configuration of case-sensitive or collation rules. Case-sensitive comparisons can be implemented efficiently without affecting query performance.

Introduction to Collation Rules

A basic concept in text processing is collation. A collation is a set of rules that determine how text values should be sorted and compared for equality.

For example, case-insensitive collation ignores the difference between uppercase and lowercase letters for equality comparison purposes, but case-sensitive collation does not.

However, because case sensitivity is culturally sensitive (for example, I and I stand for different letters in Turkish), there are multiple case-insensitive collation rules, each with its own set of rules.

Collation also extends beyond case sensitivity to other aspects of character data; In German, for example, it is sometimes (but not always) desirable to treat A and AE as the same.

Finally, the collation also defines how the text values are arranged in order: German puts a after a, and Swedish puts it at the end of the alphabet.

All text operations in the database use a collation rule, whether explicit or implicit, to determine how the operation compares and sorts strings. The actual collation available and its naming scheme are database specific.

Databases typically allow default collation to be defined at the database or column level, and you can also explicitly specify which collation should be used for a particular operation in a query.

Create a table to store ABab. Perform the following operations based on the preceding information.

CREATE TABLE OrderTest(
  letter char(1) NOT NULL
);
INSERT INTO OrderTest values('B'), ('b'), ('A'), ('a');
Copy the code

SQL Server collation rules

Selection of collation rules

In most database systems, the default collation is defined at the database level; Unless overridden, this collation implicitly applies to all text operations that occur in the database.

DATABASE collation is usually set at DATABASE creation time (via the CREATE DATABASE DDL statement), and if not specified, defaults to server-level values (server-level, the COLLation used by RDBMS or operating system Settings) when set.

For example, the default server-level collation in SQL Server is SQL_Latin1_General_CP1_CI_AS, a case-insensitive, acoustic-sensitive collation. Case sensitive collation can be used with SQL_Latin1_General_CP1_CS_AS.

Although database systems often allow you to change collation rules for existing databases, doing so can cause complications; It is recommended that you select collation rules before creating the database.

View the collation rules configured for the current instance

Use the SERVERPROPERTY function to query the Server collation of the current SQL Server instance.

SELECT CONVERT(varchar, SERVERPROPERTY('collation'));

-- or: EXECUTE sp_helpsort;

-- (No column name)
-- Chinese_PRC_CI_AS
Copy the code

Or attributes:

Obtain the currently supported collation

To query all available collations, use the fn_helpCollations () built-in function:

SELECT * FROM sys.fn_helpcollations();
Copy the code

The hierarchy of collation rules

SQL Server supports collation at the following levels:

  • Server-level collations
  • Database-level collations
  • Column-level collations
  • Expression-level collations

Collation rules for the database

It is better to specify the required collation when creating a new database or database column, rather than changing the default collation for the SQL Server instance.

Specify collation rules when creating a database

CREATE DATABASE Collate_DB_Test
COLLATE Chinese_PRC_CS_AS;
Copy the code

Change the collation rules of the database

ALTER DATABASE Collate_DB_Test COLLATE Chinese_PRC_CI_AS;
Copy the code

Query the collation rules of the database

You can view the collation of the database by saying something like this:

SELECT CONVERT (VARCHAR(50), DATABASEPROPERTYEX('database_name'.'collation'));
Copy the code

Or obtain the information from sys. Databases in the system view.

SELECT name, collation_name FROM sys.databases;  
Copy the code

In addition, you can use the properties of SSMD to view database collation. However, you cannot see collation by generating the statement that creates the database.

Such as:

SELECT CONVERT (VARCHAR(50), DATABASEPROPERTYEX('Collate_DB_Test'.'collation'));

-- Chinese_PRC_CS_AS
Copy the code

Column collation rule

Collations can also be defined on text columns, overriding the database defaults. This can be useful if some columns need to be case insensitive while the rest of the database needs to be case sensitive.

Specifies the collation of columns

Create TABLE myTable (
    id int primary key,
    mycol NVARCHAR(10) COLLATE Chinese_PRC_CI_AS
);
Copy the code

Modify the collation of columns

ALTER TABLE myTable ALTER COLUMN mycol NVARCHAR(10) COLLATE Chinese_Simplified_Pinyin_100_CS_AI;
Copy the code

View the collation of columns

The collation of columns cannot be shown by creating a table sentence. You can view collation only in the SSMS properties or in the following statements

SELECT name, collation_name FROM sys.columns WHERE name = N'<insert character data type column name>';
Copy the code

For example, join sys.columns and sys.tables to view the collation rules

SELECT t.name TableName, c.name ColumnName, collation_name  
FROM sys.columns c  
inner join sys.tables t on c.object_id = t.object_id;  
Copy the code

Collation for the specified column of the specified table:

SELECT t.name TableName, c.name ColumnName, collation_name  
FROM sys.columns c  
inner join sys.tables t on c.object_id = t.object_id
where t.name='myTable' and c.name='mycol';  
Copy the code

Collation rules are explicitly used in SQL statements

Conditional clause

You can explicitly use Collation by specifying the COLLATE keyword and Collation name in the SQL statement and leaving the other parts unchanged as in normal SQL.

<column_name/constant_value> COLLATE <collate_name>
Copy the code

For example, query the column name of the OrderTest table and specify a case-sensitive collation:

-- Case insensitive by default
select * from OrderTest where letter='a';
Copy the code

The result is:

letter
A
a
Copy the code
-- Specifies that queries use case-sensitive collation
select * from OrderTest where letter COLLATE SQL_Latin1_General_CP1_CS_AS ='a';

-- 或 Chinese_PRC_CS_AS
select * from OrderTest where letter COLLATE Chinese_PRC_CS_AS ='a';
Copy the code

Because the specified collation is case sensitive, only a is returned in the result

letter
a
Copy the code

Specify collation when sorting

Sort By specifying the collation rule in the Order By clause of the query:

The default Chinese_PRC_CI_AS -
select * from OrderTest order by letter;

Results -
-- letter
-- A
-- a
-- B
-- b
Copy the code

Specify a case-sensitive collation:

select * from OrderTest order by letter COLLATE Chinese_PRC_CS_AS;

Results -
-- letter
-- a
-- A
-- b
-- B
Copy the code

Sorted by Unicode encoding:

Sort by Unicode code values
select * from OrderTest order by letter COLLATE Chinese_PRC_BIN2; -- 或者 COLLATE Chinese_PRC_BIN;

Results -
-- letter
-- A
-- B
-- a
-- b
Copy the code

Display collation and index

Indexes are one of the most important factors in database performance — queries that run efficiently using indexes may stop executing without them.

An index implicitly inherits the collation of its columns; This means that all queries on that column automatically use the index defined on that column — provided that the query does not specify a different collation.

Specifying an explicit collation in a query usually prevents the query from using the index defined on that column because the collations would no longer match;

Therefore, caution is recommended when using this feature. It is best to define the collation at the column (or database) level to allow all queries to implicitly use the collation and benefit from the index.

Note that some databases allow collation to be defined when an index is created (for example, PostgreSQL, Sqlite). This allows multiple indexes to be defined on the same column, speeding up operations with different collations (such as case-sensitive and case-insensitive comparisons).

Always check the query plan of the query and ensure that the correct indexes are used in performance-critical queries executed against large amounts of data.

【 Key 】 About the meaning of several common abbreviations in collation rules

There are many abbreviated flags in THE Collation of SQL Server. The following describes the meanings.

These indexes are also common in other collations.

  • _CS: a Case – sensitive. Case-sensitive letters, lowercase letters will precede uppercase letters. No distinction can be used_CI– Case – insensitive.
  • _AS: Accent – sensitive. Distinguish accented and unaccented characters. For example, “a” does not mean “gul”. If this parameter is not set, the collation does not distinguish accents. You can show the selection_AI– Accent – insensitive.
  • _BIN: Binary. Sort and compare the data in the SQL Server table according to the bit pattern defined for each character.

The binary sort order is case – and accented. Binary is also the fastest sort order.

  • _WS: Width – sensitive. Distinguish full-width and half-width characters. If this item is not selected, SQL Server will sort the full and half representations of the same character as the same. Omitting this is the only way to specify full half-angle insensitivity.

  • _BIN2: binary-code point. Sort and compare data in SQL Server tables based on Unicode code points of Unicode data. For non-Unicode data, binary code points use the same comparison as binary sort.

The advantage of using binary point sort order is that there is no data reordering required in applications that compare sorted SQL Server data. As a result,

The binary code point sort order provides simpler application development and possible performance gains. Binary collations

  • _UTF8: utf-8. Allows utF-8 encoded data to be stored in SQL Server. If this option is not selected, SQL Server will use the default non-Unicode encoding format for the applicable data types.

  • _KS: Kana – sensitive. Distinguish between two types of Japanese kana characters: hiragana and Katakana. If this item is not set, the collation does not distinguish kana names. Omiting this is the only way to specify that the kana is insensitive.

  • _VSS: Variation – the selector – sensitive. Distinguish between the various ideograph variant selectors in Japanese_Bushu_Kakusu_140 and Japanese_XJIS_140, the Japanese collation rules introduced in SQL Server 2017 (14.x). The variant sequence consists of a base character and an additional variant selector. If this _VSS item is not selected, the collation is insensitive to the variant selector and the variant selector is not considered in the comparison.

Supplementary character Supplementary characters_SC

SQL Server 2012 (11.x) introduced a new set of supplementary character sorting rules (_SC — Supplementary character). Can be used with nCHAR, NVARCHAR, and SQL_Variant data types to represent a full Unicode character range (000000 — 10FFFF). For example, Latin1_General_100_CI_AS_SC.

SQL Server character set information

Information about character sets can be viewed in the sys. syscharSets view.

References and Recommendations

  • Reference to Collation and Unicode Support

  • Check out this article: Questions About SQL Server Collations You Were Too Shy to Ask In particular, how to identify or find a unicode-only collation? , these can only use the NCHAR, NVARCHar, or NTEXT data types.