preface
Some time ago, because the project needs, need to search chat records according to keywords, isn’t this a search engine function? ElasticSearch is a distributed search engine that comes to my mind at first. However, due to some reasons, the company has limited server resources. There is no extra machine to deploy a set of ElasticSearch service. Then came Mysql’s full-text index.
Introduction to the
In fact, Mysql has long supported full text indexing, but it has only supported English retrieval. Since version 5.7.6, Mysql has built ngram full text parser to support Chinese, Japanese, and Korean word segmentation.
Mysql full-text index adopts the principle of inverted index, in inverted index key is the primary key, each keyword corresponds to a series of files, these files appear in the keyword. So when the user searches for a keyword, the sorter locates that keyword in the inverted index and immediately finds all the files that contain that keyword.
This paper tests that InnoDB is used as the database engine based on Mysql 8.0
Ngram full text parser
A ngram is a sequence of n consecutive words in a text. The Ngram full-text parser is capable of segmentation of text, where each word is a consecutive sequence of n words. For example, use the Ngram full-text parser to participle “Hello pretty boy” :
N = 1: 'you' and 'good', 'beautiful', 'wang' n = 2: 'hello', 'beautiful', 'handsome' n = 3: 'hello jing', 'good handsome' n = 4: 'hello handsome'Copy the code
MySQL uses the global variable ngram_token_size to configure the size of n in ngram. The value ranges from 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum word count of the word to be queried. If you need to search for words, set ngram_token_size to 1. With a default value of 2, searching for a word yields no results. Since a Chinese word has at least two characters, the default value of 2 is recommended.
To check the default Mysql ngram_token_size size, run the following command:
show variables like 'ngram_token_size'
Copy the code
There are two ways to set the value of the global variable ngram_token_size:
Mysqld = mysqld;
mysqld --ngram_token_size=2
Copy the code
Mysql.ini = mysql.ini = my.ini
ngram_token_size=2
Copy the code
Creating a full-text index
Create a full-text index when creating a table
CREATE TABLE `article` (
`id` bigint NOT NULL,
`url` varchar(1024) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
`title` varchar(256) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
`source` varchar(32) COLLATE utf8mb4_general_ci DEFAULT '',
`keywords` varchar(32) COLLATE utf8mb4_general_ci DEFAULT NULL,
`publish_time` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
FULLTEXT KEY `title_index` (`title`) WITH PARSER `ngram`
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
Copy the code
2. Use ALTER TABLE
ALTER TABLE article ADD FULLTEXT INDEX title_index(title) WITH PARSER ngram;
Copy the code
3. Select Create Index
CREATE FULLTEXT INDEX title_index ON article (title) WITH PARSER ngram;
Copy the code
Retrieve the way
1. NATURAL LANGUAGE Retrieval
Natural language mode is MySQL’s default full-text search mode. The natural language pattern cannot use operators and cannot specify complex queries such as the keyword must or must not appear.
The sample
Select * from article where MATCH(title) AGAINST (' Beijing tourism 'IN NATURAL LANGUAGE MODE); Select * from article where MATCH(title) AGAINST (' Beijing tour '); select * from article where MATCH(title) AGAINST (' Beijing tour ');Copy the code
It can be seen that content containing “Beijing” or “tourism” can be found according to the search of “Beijing tourism” in this mode, because it is divided into two keywords according to natural language.
The results returned in the above example are automatically sorted by degree of match, with the highest degree of match first, which is a non-negative floating point number.
The sample
Select *, MATCH(title) AGAINST (' Beijing Tourism ') as score from article where MATCH(title) AGAINST (' Beijing Tourism 'IN NATURAL LANGUAGE MODE);Copy the code
2, BOOLEAN search (BOOLEAN MODE)
Boolean search mode can use operators to support complex queries that specify whether a keyword must or must not appear, or whether the keyword has a high or low weight.
The sample
Select * from article where MATCH(title) AGAINST (' date 'IN BOOLEAN MODE);Copy the code
Select * from article where MATCH(title) AGAINST ('+ 'IN BOOLEAN MODE); select * from article where MATCH(title) AGAINST ('+' IN BOOLEAN MODE);Copy the code
More examples of operators:
'Appointment strategy' has no operator, which means or, contains either 'appointment' or 'strategy'. '+ Appointment + strategy 'must contain both words. '+ Appointment strategy' must contain 'appointment', but if it also contains' strategy ', the match is higher. '+ date-walkthrough 'must contain the word' appointment 'and cannot contain the word' walkthrough '. '+ appointment ~ walkthrough 'must contain' appointment ', but records that also contain 'walkthrough' have a lower match than records that do not contain 'walkthrough'. The '+ Appointment +(> Strategy < technique)' query must contain records for appointment and strategy or Appointment and technique, but strategy is a better match than technique. The 'Appointment *' query contains records that begin with' Appointment '. 'Dating tips' uses double quotation marks around the search term, similar to like '% dating tips %'. For example, dating tips will be matched, but dating tips won't.Copy the code
Compared with the Like
Full-text index and like query comparison, has the following advantages:
- Like only performs fuzzy matching, but full-text indexing provides some syntactic and semantic query functions, which will perform word segmentation on the string to be checked, depending on the Mysql thesaurus.
- The full-text index can set its own minimum and maximum length of words, and the words to be ignored.
- Using a full-text index to look up a string in a column returns the matching degree, which can be interpreted as the number of matched keywords, which is a floating point number.
And the performance of full-text retrieval is better than that of like query
The following test is conducted with about 50W data:
Select * from article where title like '% Beijing %';Copy the code
Select * from article where MATCH(title) AGAINST (' Beijing 'IN BOOLEAN MODE);Copy the code
You can see that the like query is 1.536s, and the full text index query is 0.094s, which is about 16 times faster.
conclusion
Full-text indexes can be searched quickly, but there is also the overhead of maintaining the index. The larger the field length is, the larger the full-text index is created, which affects the throughput of DML statements. If the data volume is small, you can use the full text index to search, but if the data volume is large, you are advised to use the dedicated search engine ElasticSearch to do this.
END
Phase to recommend
How to ensure idempotency of interfaces?
Distributed transaction solutions that you must understand
Is this it? Distributed ID transmitter actual combat
Some knowledge of design patterns in factory patterns
Is this it? Spring transaction failure scenarios and solutions