English full-text index is based on the space and other symbols of word segmentation, no special word segmentation

Mysql 5.7.6 supports Chinese full-text indexing

  • Ngram

  • The number of Chinese words is configured

ngram_token_size=2
Copy the code
  • Creating a full-text index
Alter table 'name' add fulltext index idx_ft_ngram_name(' column1 ', 'column2') with parser ngram; CREATE TABLE articles ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR (200), body TEXT, FULLTEXT (title, body) WITH PARSER ngram ) ENGINE = INNODB; CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER ngram;Copy the code
  • Full-text retrieval mode

    There are two commonly used full-text search modes:

    1. NATURAL LANGUAGE MODE

    Natural language mode is MySQL’s default full-text search mode. The natural language pattern cannot use operators and cannot specify complex queries such as the keyword must or must not appear.

    BOOLEAN MODE (BOOLEAN MODE)

    BOOLEAN schema can use operators to support complex queries that specify whether a keyword must or must not appear, or whether the keyword has a high or low weight.

  • Pay attention to

  1. Full-text indexes can only be created on fields of type CHAR, VARCHAR, or TEXT.
  2. Full text indexing only supports InnoDB and MyISAM engines.
  3. MATCH (columnName) AGAINST (‘keywords’). The field name used by the MATCH() function must be the same as that specified when the full-text index was created. In the example above, MATCH (title,body) uses the same field name as defined by the full-text index ft_articles(title,body). If you want to query the title or body fields separately, you need to create a new full-text index on the title and body fields respectively.
  4. The field names used by the MATCH() function must be those of the same table, because full-text indexes cannot be retrieved across multiple tables.
  5. If you are importing large data sets, it is much faster to import data and then create a full-text index on the table than to create a full-text index on the table and then import data, so full-text indexes affect TPS.

PGSQL Chinese full text index

1. Two data typestsvector.tsquery

2. Index creation

  • Build indexes for TSV functions
    create index post_tsvcontent_idx on posts using gin(to_tsvector('english', title || content));
Copy the code
  • Add a TSV column to index the column (recommended)
alter table posts add_column tsv_content tsvector;
update posts set tsv_content = to_tsvector('english', coalesce(title,'') || coalesce(content,''));
create index post_tsvcontent2_idx on posts using gin(tsv_content);
Copy the code

3. Chinese Full Text Search (zhparser)

Zhparser is a Chinese word segmentation plug-in for PostgreSQL. It enables PostgreSQL to support Full Text Search.

Enable Chinese word segmentation

CREATE EXTENSION zhparser; CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser); ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple; -- Optional parameter setting ALTER role all set zhparser.multi_short=on; SELECT * FROM ts_parse('zhparser', 'hello world! In 2010, the construction of government-subsidized housing was launched nationwide, and the construction and investment of government-subsidized housing were increased from the central government to local governments. In 2011, affordable housing entered a larger construction phase. Jiang Weixin, minister and party secretary of the Ministry of Housing and Urban-Rural Development, said at the national Housing and Urban-rural Development Work Conference late last year that low-income housing will continue to be built. '); SELECT to_tsvector(' testZhcfg ',' TestZhcfg ',' this year, although the number of affordable housing starts has been lowered, the actual annual scale of construction and completion will exceed the previous years, and the corresponding demand for funds will also hit a record. Chen guoqiang said. In his opinion, compared with 2011, affordable housing construction in 2012 will be more severe pressure on the financial support. '); SELECT to_tsquery(' testZhcfg ', 'housing fund pressure ');Copy the code

The method of full-text index using word segmentation is as follows:

Create index idx_T1 on T1 using gin (to_tsvector(' ZHCFG ',upper(name))); Select * from T1 where to_tsvector(' ZHCFG ',upper(t1.name)) @@to_tsquery (' ZHCFG ',' fireproof '); select * from t1 where to_tsvector(' ZHCFG ',' fireproof ');Copy the code

Custom Chinese word segmentation dictionary

SELECT to_tsquery('testzhcfg', 'housing fund pressure '); Insert into PG_ts_custom_word values (' affordable '); insert into pg_ts_custom_word values (' affordable '); Select zhprs_sync_dict_xdb(); SELECT to_tsquery('testzhcfg', 'housing fund pressure ');Copy the code

Note the following when using a custom participle:

  • A maximum of 1,000,000 user-defined participles are supported. If the number exceeds the threshold, no processing is performed. Ensure that the number of participles is within this range. Custom participles work together with the default participle dictionary.
  • Each word has a maximum length of 128 bytes, and any excess will be truncated.
  • The addition, deletion and modification of participles must be carried outselect zhprs_sync_dict_xdb();And re-establishing the connection will take effect.

Index data structure

Gist GIN Full text Index (common inverted index)Copy the code

reference

MySQL 5.7 Chinese full text search tutorial