Abstract: This paper briefly introduces the principle and usage of GaussDB(DWS) full-text retrieval.

This article is shared from the Huawei Cloud community post "Advanced GaussDB(DWS) SQL: Full-Text Search" by Zhang Jingyao.

Text search, as its name implies, is the process of finding a specified pattern in a given document. GaussDB(DWS) supports full-text search on text-type fields in a table, and on combinations of such fields: it finds the text that matches a given pattern and displays the matches in the desired form.

Based on the author’s experience and thinking, this paper briefly introduces the full-text retrieval function of GaussDB(DWS), hoping to be helpful to readers.

1. Preprocessing

There are many ways to find a pattern in a given document, such as searching for a regular expression with the grep command. In theory, grep could be used to retrieve patterns from text fields in a database, and GaussDB(DWS) can match strings with either the keyword "LIKE" or the operator "~". But this approach has many problems. First, every piece of text must be scanned in full, which is inefficient, and there is no good way to measure how well a document "matches" or how "relevant" it is. Moreover, it matches strings purely mechanically and lacks any grammatical or semantic analysis. For example, it cannot automatically recognize and match plural nouns or verb tense changes in English, so it cannot produce satisfactory retrieval results on natural-language text.

GaussDB(DWS) performs full-text retrieval in a manner similar to a search engine. It first preprocesses the given text and pattern: extracting the words or phrases from a piece of text, removing stop words that are useless for retrieval, normalizing inflected word forms, and so on. Only then, once both sides are in a form suitable for retrieval, are they matched.

In GaussDB(DWS), raw documents and search conditions are represented as text, i.e. strings. A preprocessed document becomes a tsvector, produced by the to_tsvector function. For example,

postgres=# select to_tsvector('a fat cat ate fat rats');
            to_tsvector
------------------------------------
 'ate':4 'cat':3 'fat':2,5 'rat':6
(1 row)

Looking at the tsvector output above, you can see what to_tsvector does:

  • First, words are extracted and their positions in the document are marked with integers; for example, "fat" appears as the second and fifth word of the original sentence.

  • Moreover, the word "a" is so common that it appears in almost every document and contributes little to retrieval. In Shannon's terms, the more likely a word is to appear, the less information it carries. Words like "a" and "the" carry so little information that they are dropped as stop words. Note that this does not affect the position numbering of the other words: "fat" is still at positions 2 and 5, not 1 and 4.

  • In addition, the plural "rats" is replaced with the singular "rat". This operation, known as normalization, removes suffixes and keeps word stems, reducing the inflected forms that Western words take in different contexts to a common form. It simplifies natural-language retrieval: a search for "rat" can retrieve documents containing either "rat" or "rats". The normalized words are called lexemes, and the original word forms are called tokens.
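
The preprocessing steps above can be sketched outside the database. The following Python snippet is a deliberately simplified illustration: the stop-word list and the "strip a trailing s" stemmer are invented for this sketch and are nothing like GaussDB(DWS)'s real dictionaries.

```python
# Toy sketch of to_tsvector-style preprocessing: tokenize, number the
# positions, drop stop words, and normalize plurals.
STOP_WORDS = {"a", "the", "and", "on"}  # hypothetical stop-word list

def naive_stem(token):
    # Naive normalization: strip a trailing "s" (real dictionaries are smarter).
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def toy_to_tsvector(text):
    vector = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        if token in STOP_WORDS:
            continue  # dropped, but the position numbering is unaffected
        vector.setdefault(naive_stem(token), []).append(pos)
    return vector

print(toy_to_tsvector("a fat cat ate fat rats"))
# {'fat': [2, 5], 'cat': [3], 'ate': [4], 'rat': [6]}
```

Note how "fat" keeps positions 2 and 5 even though the stop word "a" at position 1 was dropped, mirroring the psql output above.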

Converting a document to tsvector form has many benefits. For one, indexes can easily be built on it to improve retrieval speed and efficiency; when the number of documents is large, keyword lookup through an index is much faster than grep-style full-text scanning. For another, keywords can be assigned different weights according to their importance, which makes it easy to rank search results and surface the most relevant documents.

A preprocessed search condition is converted to the tsquery type, via the to_tsquery function. For example,

postgres=# select to_tsquery('a & cats & rat');
  to_tsquery
---------------
 'cat' & 'rat'
(1 row)

As can be seen from the above example:

  • Similar to to_tsvector, to_tsquery also removes stop words and normalizes the input text, for example removing "a" and changing "cats" to "cat".

  • The words in the search condition itself must be connected with the and (&), or (|), and not (!) operators; otherwise an error is reported. For example, the following statement fails:

    postgres=# select to_tsquery('cats rat');
    ERROR:  syntax error in tsquery: "cats rat"
    CONTEXT:  referenced column: to_tsquery

The plainto_tsquery function does not have this limitation; it connects the input words with "and" conditions:

postgres=# select plainto_tsquery('cats rat');
 plainto_tsquery
-----------------
 'cat' & 'rat'
(1 row)

postgres=# select plainto_tsquery('cats,rat');
 plainto_tsquery
-----------------
 'cat' & 'rat'
(1 row)
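
The tolerant behaviour of plainto_tsquery can be mimicked with a toy sketch: split the input on anything that is not a word character, so spaces and commas are treated alike, then AND the normalized lexemes together. The stop-word list and stemmer here are invented for illustration, not the real parser.

```python
import re

STOP_WORDS = {"a", "the"}  # hypothetical stop-word list

def naive_stem(token):
    # Naive normalization: strip a trailing "s" (real dictionaries are smarter).
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def toy_plainto_tsquery(text):
    # Split on non-word characters, so "cats rat" and "cats,rat" are alike.
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    lexemes = [naive_stem(w) for w in words if w not in STOP_WORDS]
    return " & ".join(f"'{lex}'" for lex in lexemes)

print(toy_plainto_tsquery("cats rat"))   # 'cat' & 'rat'
print(toy_plainto_tsquery("cats,rat"))   # 'cat' & 'rat'
```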

In addition to functions, you can cast a string to tsvector or tsquery, for example

postgres=# select 'fat cats sat on a mat and ate a fat rat'::tsvector;
                      tsvector
-----------------------------------------------------
 'a' 'and' 'ate' 'cats' 'fat' 'mat' 'on' 'rat' 'sat'
(1 row)

postgres=# select 'a & fat & rats'::tsquery;
       tsquery
----------------------
 'a' & 'fat' & 'rats'
(1 row)

Unlike the functions, casts do not remove stop words, do not normalize word forms, and, for the tsvector type, do not record word positions.

2. Pattern matching

After the input document and the search condition have been converted to tsvector and tsquery, pattern matching can proceed. GaussDB(DWS) performs pattern matching with the @@ operator, which returns true on a successful match and false otherwise.

For example, create the following table,

postgres=# create table post(
postgres(# id bigint,
postgres(# author name,
postgres(# title text,
postgres(# body text);
CREATE TABLE
-- insert some tuples

To retrieve the titles of posts whose body contains "physics" or "math", use the following statement:

postgres=# select title from post where to_tsvector(body) @@ to_tsquery('physics | math');
            title
-----------------------------
 The most popular math books
(1 row)

You can also query by combining multiple fields:

postgres=# select title from post where to_tsvector(title || ' ' || body) @@ to_tsquery('physics | math');
            title
-----------------------------
 The most popular math books
(1 row)

Note that different query methods may produce different results. For example, the following match fails because ::tsquery does not normalize the search condition, so the word "cats" is not found in the tsvector:

postgres=# select to_tsvector('a fat cat ate fat rats') @@ 'cats & rat'::tsquery;
 ?column?
----------
 f
(1 row)

For the same document and search condition, the following match succeeds because to_tsquery changes "cats" to "cat":

postgres=# select to_tsvector('a fat cat ate fat rats') @@ to_tsquery('cats & rat');
 ?column?
----------
 t
(1 row)

Similarly, the following match fails because to_tsvector removes the stop word "a":

postgres=# select to_tsvector('a fat cat ate fat rats') @@ 'cat & rat & a'::tsquery;
 ?column?
----------
 f
(1 row)

The following succeeds because ::tsvector preserves all words, including the stop word "a". Note that the query must also use the unnormalized forms ("rats" rather than "rat"), since casts normalize nothing on either side:

postgres=# select 'a fat cat ate fat rats'::tsvector @@ 'cat & rats & a'::tsquery;
 ?column?
----------
 t
(1 row)

Therefore, appropriate retrieval methods should be selected according to needs.
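
The behaviour of @@ in the examples above can be pictured as a set-membership test. The sketch below handles only AND-style queries (real tsquery supports full boolean trees); the function and variable names are invented for illustration.

```python
def toy_match(tsvector_lexemes, and_query_lexemes):
    # '@@' for an AND-only query: every query lexeme must occur in the vector.
    return set(and_query_lexemes) <= set(tsvector_lexemes)

# Mirrors the examples: to_tsvector dropped 'a' and normalized 'rats' -> 'rat'.
vector = {"fat", "cat", "ate", "rat"}

print(toy_match(vector, {"cat", "rat"}))       # True  (like to_tsquery('cats & rat'))
print(toy_match(vector, {"cat", "rat", "a"}))  # False (stop word 'a' was removed)
```

This is why matching only works when both sides were preprocessed by the same rules: the lexeme sets must agree literally.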

In addition, the @@ operator can implicitly cast the input text, for example,

postgres=# select title from post where body @@ 'physics | math';
 title
-------
(0 rows)

Here, text @@ text is equivalent to to_tsvector(text) @@ plainto_tsquery(text). The query returns nothing because plainto_tsquery turns the "or" condition 'physics | math' into the "and" condition 'physic' & 'math'. Use this form with extreme care.

3. Create and use indexes

As mentioned earlier, scanning the text fields of a table one by one is slow and inefficient, while index lookups can greatly improve retrieval speed. GaussDB(DWS) supports full-text retrieval using GIN (Generalized Inverted Index). The inverted index is the index commonly used in search engines: its core idea is to locate documents by keyword, which greatly improves query efficiency. A GIN index can be created on a field of type text as follows:

postgres=# create index post_body_idx_1 on post using gin(to_tsvector('english', body));
CREATE INDEX

Note that the index expression must use the to_tsvector function to generate the tsvector; casts and implicit type conversions cannot be used. Furthermore, the to_tsvector call here takes one more argument than in the previous section: 'english', which names the text search configuration (covered in the next section). Different configurations compute different tsvectors and therefore different indexes, so this argument must be specified explicitly. In addition, only queries that use the same configuration and field as the index expression can be answered through the index. For example, of the two queries below, the first can be served by post_body_idx_1, while the second finds no matching index and falls back to a full table scan.

postgres=# explain select title from post where to_tsvector('english', body) @@ to_tsquery('physics | math');
                           QUERY PLAN
-----------------------------------------------------------------
 id |            operation            | E-rows | E-width | E-costs
----+---------------------------------+--------+---------+---------
  1 | ->  Streaming (type: GATHER)    |      1 |      32 | 42.02
  2 |    ->  Bitmap Heap Scan on post |      1 |      32 | 16.02
  3 |       ->  Bitmap Index Scan     |      1 |       0 | 12.00

postgres=# explain select title from post where to_tsvector('french', body) @@ to_tsquery('physics | math');
                           QUERY PLAN
-----------------------------------------------------------------
 id |          operation           | E-rows | E-width | E-costs
----+------------------------------+--------+---------+------------------
  1 | ->  Streaming (type: GATHER) |      1 |      32 | 1000000002360.50
  2 |    ->  Seq Scan on post      |      1 |      32 | 1000000002334.50
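
The core idea behind an inverted index like GIN can be shown in a few lines: map each lexeme to the set of documents containing it, so a query touches only the relevant posting lists instead of scanning every document. This is a sketch of the principle only; GIN's actual on-disk structure is far more involved, and the document contents below are invented.

```python
from collections import defaultdict

# Hypothetical documents, already preprocessed into lexemes.
docs = {
    1: ["popular", "math", "book"],
    2: ["cooking", "recipes"],
    3: ["physic", "lecture", "notes"],  # 'physics' after normalization
}

# Build the inverted index: lexeme -> set of document ids (a posting list).
index = defaultdict(set)
for doc_id, lexemes in docs.items():
    for lex in lexemes:
        index[lex].add(doc_id)

def search_or(*lexemes):
    # OR query: union of the posting lists; no document is scanned in full.
    result = set()
    for lex in lexemes:
        result |= index.get(lex, set())
    return result

print(search_or("physic", "math"))  # {1, 3}
```

A query like to_tsquery('physics | math') against a GIN index follows the same pattern: look up each lexeme's posting list and combine them.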

4. Text Search Configuration

This section describes how GaussDB(DWS) preprocesses documents, that is, how to_tsvector works.

Document preprocessing is generally carried out in the following three steps:

  • The first step is to extract the words or phrases in the text one by one. This work is done by a parser (or word segmenter). When it completes, the document has become a sequence of tokens.

  • The second step is to normalize the tokens obtained in the previous step: removing prefixes and suffixes, converting synonyms, dropping stop words, and so on according to specified rules, which yields lexemes. This step is driven by dictionaries; that is, the dictionaries define the normalization rules.

  • Finally, the position (and, optionally, weight) of each lexeme is recorded, producing the tsvector.

As you can see from the above description, if a parser and dictionary are given, then the rules for document preprocessing are established. In GaussDB(DWS), this set of rules for document preprocessing is called the Text Search Configuration. The full-text search configuration determines the result and quality of the match.

As shown in the figure below, a full-text retrieval configuration consists of a parser and a set of dictionaries. The input document is first broken into tokens by the parser; then each token is looked up dictionary by dictionary, and if the token is found in a dictionary, it is normalized according to that dictionary's rules. Some dictionaries normalize the token and mark it as "processed" so that subsequent dictionaries skip it. Others normalize it and emit a new token for subsequent dictionaries to process; such dictionaries are called "filter" dictionaries.

Figure 1 Document preprocessing process
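
The dictionary chain just described can be modelled as a pipeline: each dictionary either recognizes the token and ends the chain, or (if it is a filter dictionary) rewrites the token and hands it to the next dictionary. The dictionary objects below are invented stand-ins purely to illustrate the control flow.

```python
def run_dictionary_chain(token, dictionaries):
    # Each dictionary returns (handled, output_token). handled=True ends the
    # chain; a filter dictionary returns handled=False with a rewritten token
    # that later dictionaries will see instead of the original.
    for handle in dictionaries:
        handled, token = handle(token)
        if handled:
            return token
    return token  # unrecognized: kept as-is (a real config might drop it)

def lowercase_filter(token):
    # A "filter" dictionary: always rewrites and passes the token on.
    return (False, token.lower())

def english_stem_stub(token):
    # Toy stand-in for a stemming dictionary: handles plural nouns only.
    if token.endswith("s") and len(token) > 3:
        return (True, token[:-1])
    return (False, token)

chain = [lowercase_filter, english_stem_stub]
print(run_dictionary_chain("Rats", chain))  # rat
print(run_dictionary_chain("cat", chain))   # cat
```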

The parser used by the configuration is specified when the configuration is created and cannot be modified, for example,

postgres=# create text search configuration mytsconf (parser = default);
CREATE TEXT SEARCH CONFIGURATION

GaussDB(DWS) has four built-in parsers and does not support custom parsers.

postgres=# select prsname from pg_ts_parser;
 prsname
----------
 default
 ngram
 pound
 zhparser
(4 rows)

Dictionaries are specified by the ALTER TEXT SEARCH CONFIGURATION command, for example

postgres=# alter text search configuration mytsconf add mapping for asciiword with english_stem,simple;
ALTER TEXT SEARCH CONFIGURATION

This statement specifies that mytsconf normalizes tokens of type "asciiword" using the english_stem and simple dictionaries, in that order.

The "asciiword" in the above statement is a token type. The parser classifies the tokens it produces, and different parsers classify them differently; the types can be viewed with the ts_token_type function. For example, the default parser classifies tokens into the following 23 types:

postgres=# select * from ts_token_type('default');
 tokid |      alias      |               description
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
(23 rows)

Dictionaries already present in the current database can be queried from the system catalog pg_ts_dict.

If a configuration is specified, the document is preprocessed according to it, as in the GIN index created in the previous section. If no configuration is specified, to_tsvector uses the default configuration given by the default_text_search_config variable.

postgres=# show default_text_search_config;  -- view the current default configuration
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)

postgres=# set default_text_search_config = mytsconf;
SET
postgres=# show default_text_search_config;
 default_text_search_config
----------------------------
 public.mytsconf
(1 row)

postgres=# reset default_text_search_config;
RESET
postgres=# show default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)

Note that default_text_search_config is a session-level variable and is only valid for the current session. To make the default configuration persistent, modify the variable of the same name in the postgresql.conf configuration file, as shown in the figure below.

After the modification, restart the process.

Conclusion

The full text retrieval module of GaussDB(DWS) provides powerful document search capabilities. Compared with pattern matching using the “LIKE” keyword or the “~” operator, full-text retrieval provides richer semantic syntax support and can process natural language text more intelligently. With appropriate indexes, efficient retrieval of documents can be achieved.
