Recently, while reviewing database indexes, I came across FULLTEXT, the full-text index. Full-text indexes are seldom used in day-to-day business code, but I found them interesting enough to spend some time studying them, and I am recording what I learned here.
Introduction
Concept
Most of the queries we need can be done by comparing values, filtering ranges, and so on. However, if you want to filter queries by keyword matching, you need queries based on similarity rather than precise numerical comparisons. Full-text indexes are designed for this scenario.
Why bother with full-text indexes, you might ask, when LIKE with % wildcards already does fuzzy matching? LIKE + % is fine for small amounts of text, but it becomes impractical for retrieval over large amounts of text. On large data sets a full-text index can be many times faster than LIKE + %; the two are not in the same order of magnitude. The trade-off is that full-text indexes may have accuracy problems.
You may not have noticed full-text indexing, but you should be familiar with at least one full-text indexing technique: the various search engines. Although search engines index huge amounts of data and are often not backed by relational databases, the basic principles of full-text indexing are the same.
Version support
Before we begin, let’s go over which MySQL versions, storage engines, and data types support full-text indexes:
- Prior to MySQL 5.6, only the MyISAM storage engine supported full-text indexes;
- In MySQL 5.6 and later, both the MyISAM and InnoDB storage engines support full-text indexes;
- Full-text indexes can be created only on columns of type CHAR, VARCHAR, or TEXT and their variants.
When testing or using full-text indexes, check to see if your version of MySQL, storage engine, and data type support full-text indexes.
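The checks above can be sketched with a few queries. This is only an illustration: it assumes the table is named fulltext_test (the name used in the examples later in this article) and that it lives in the currently selected database.

```sql
-- Server version (InnoDB full-text indexes need 5.6 or later)
select version();

-- Storage engine of the table you plan to index
select table_name, engine
from information_schema.tables
where table_schema = database()
  and table_name = 'fulltext_test';

-- Column data types: only CHAR, VARCHAR, and TEXT style columns qualify
select column_name, data_type
from information_schema.columns
where table_schema = database()
  and table_name = 'fulltext_test';
```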
Working with full-text indexes
Index operations are easy to look up, but I’ll go over them here anyway.
Create
Create a full-text index when the table is created
create table fulltext_test (
    id int(11) NOT NULL AUTO_INCREMENT,
    content text NOT NULL,
    tag varchar(255),
    PRIMARY KEY (id),
    FULLTEXT KEY content_tag_fulltext(content,tag)  -- full-text index on both columns
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Create full-text indexes on existing tables
create fulltext index content_tag_fulltext
on fulltext_test(content,tag);
Create a full-text index using the SQL statement ALTER TABLE
alter table fulltext_test
add fulltext index content_tag_fulltext(content,tag);
Modify
There is no modify operation; just drop the index and re-create it.
Delete
Drop a full-text index directly with DROP INDEX
drop index content_tag_fulltext
on fulltext_test;
Drop a full-text index with ALTER TABLE
alter table fulltext_test
drop index content_tag_fulltext;
Using full-text indexes
Full-text indexes do not use LIKE + % for fuzzy matching. They have their own syntax, built on the MATCH and AGAINST keywords, for example:
select * from fulltext_test
where match(content,tag) against('xxx xxx');
Note: The columns listed in match() must be exactly the same as the columns defined in the full-text index; otherwise an error is reported and the full-text index cannot be used, because the full-text index does not record which column a keyword came from. To run a full-text search on a single column, create a separate full-text index for that column.
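As a sketch of that last point, a search on content alone needs its own index. The index name content_fulltext is made up for this example; the table is the fulltext_test table from above.

```sql
-- Hypothetical index name; this index covers only the content column
alter table fulltext_test
add fulltext index content_fulltext(content);

-- The column list in match() now lines up with content_fulltext
select * from fulltext_test
where match(content) against('xxx');
```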
Testing the full-text index
Adding test data
With this background, we can put full-text indexes to the test.
First, create the test table and insert the test data:
create table test (
id int(11) unsigned not null auto_increment,
content text not null,
primary key(id),
fulltext key content_index(content)
) engine=MyISAM default charset=utf8;
insert into test (content) values ('a'),('b'),('c');
insert into test (content) values ('aa'),('bb'),('cc');
insert into test (content) values ('aaa'),('bbb'),('ccc');
insert into test (content) values ('aaaa'),('bbbb'),('cccc');
Run the following queries using the full-text index syntax:
select * from test where match(content) against('a');
select * from test where match(content) against('aa');
select * from test where match(content) against('aaa');
Going by intuition, we would expect four records, but in fact not a single record is returned. Only the following query returns anything:
select * from test where match(content) against('aaaa');
It returns the single record aaaa.
Why is that? There can be many reasons, the most common being the minimum search length. In addition, when using full-text indexes, the test table should contain at least four records; otherwise the results may be unexpected.
MySQL full-text indexes have two relevant variables: the minimum search length and the maximum search length. Words shorter than the minimum or longer than the maximum are not indexed. In plain terms, for a word to be found through a full-text index, its length must fall within the range defined by these two variables.
The default values of both can be viewed with the following command:
show variables like '%ft%';
The names and default values of these variables for the MyISAM and InnoDB storage engines are:
// MyISAM
ft_min_word_len = 4;
ft_max_word_len = 84;
// InnoDB
innodb_ft_min_token_size = 3;
innodb_ft_max_token_size = 84;
As you can see, the default minimum search length is 4 for MyISAM and 3 for InnoDB. In other words, by default MySQL’s full-text index only indexes words of at least 4 characters (MyISAM) or at least 3 characters (InnoDB).
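If the '%ft%' pattern returns more rows than you want to read through, narrower patterns should pick out just the four length variables. This is a small convenience sketch, not something from the original article.

```sql
-- MyISAM: ft_min_word_len and ft_max_word_len
show variables like 'ft_%_word_len';

-- InnoDB: innodb_ft_min_token_size and innodb_ft_max_token_size
show variables like 'innodb_ft_%_token_size';
```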
Configuring the minimum search length
Full-text index parameters cannot be changed dynamically; they must be set in the MySQL configuration file. To change the minimum search length to 1, open the MySQL configuration file /etc/my.cnf and add the following under [mysqld]:
[mysqld]
innodb_ft_min_token_size = 1
ft_min_word_len = 1
Then restart the MySQL server and repair the full-text index. Note that the index must be repaired after changing the parameter; otherwise the new value does not take effect on the existing index.
There are two ways to repair it. The first is the repair command below, which works for MyISAM tables such as test:
repair table test quick;
The second is to drop the index and re-create it. Run the queries above again, and a, aa, and aaa can now be found.
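The drop-and-re-create route might look like the following for the test table, using the index name content_index from its definition. It is a sketch that assumes the configuration change and server restart described above have already been done.

```sql
-- Drop the old index, which was built under the previous minimum length
alter table test drop index content_index;

-- Re-create it so that it is rebuilt under the new minimum length
alter table test add fulltext index content_index(content);

-- Retry one of the earlier queries
select * from test where match(content) against('aa');
```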
One question remains, though: why does a search for the keyword a not return aa, aaa, and aaaa? Before answering it, let’s look at the two types of full-text index.
Two types of full-text index
Natural language full-text index
By default, or when the IN NATURAL LANGUAGE MODE modifier is used, the match() function performs a natural language search over the text collection. The examples above are all natural language full-text searches.
The natural language search engine calculates the relevance of each document to the query. Relevance is based on the number of keywords matched and the number of times each keyword appears in the document. The less often a word appears across the whole index, the higher its relevance when it matches. Conversely, very common words are not searched at all: with MyISAM, if a word appears in 50% or more of the records, a natural language search ignores it. This is why, as mentioned above, the test table needs at least four records.
This mechanism is easy to understand. Suppose a table stores articles, one per row. Common words and filler words appear in nearly every article, so searching for them is meaningless; what separates one article from another is the words that carry specific meaning, and those are the ones worth searching for.
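To look at the relevance value itself, the same match() expression can be placed in the select list, where it returns a score instead of acting as a filter. The sketch below uses the test table from earlier; the alias score is chosen only for this example.

```sql
select id, content,
       match(content) against('aaaa' in natural language mode) as score
from test
where match(content) against('aaaa' in natural language mode)
order by score desc;
```

Rows that do not match have a score of 0, so the where clause keeps only the matching rows.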
Boolean full-text index
In a Boolean search, we can adjust the relevance of individual search terms in the query: when writing a Boolean search query, prefix modifiers attached to the terms customize the search.
MySQL has built-in modifiers for this; they are the value of the ft_boolean_syntax variable, which appears in the output of the show variables command used earlier. A few of them are explained below:
- + : the word must be included
- - : the word must not be included
- > : increases the relevance of the word, so matching results rank higher
- < : decreases the relevance of the word, so matching results rank lower
- * : (asterisk) wildcard; it can only be appended to the end of a word
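A rough sketch of how these modifiers are written in a query, using the test table and words from the earlier examples (the alias score is made up for illustration):

```sql
-- Rows that contain aaaa and do not contain bbbb
select * from test
where match(content) against('+aaaa -bbbb' in boolean mode);

-- Rows containing either word; aaaa adds more to the score, bbbb less
select id, content,
       match(content) against('>aaaa <bbbb' in boolean mode) as score
from test
where match(content) against('>aaaa <bbbb' in boolean mode)
order by score desc;
```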
The problem mentioned above can be solved with a Boolean full-text query. With the following command, a, aa, aaa, and aaaa are all returned:
select * from test where match(content) against('a*' in boolean mode);
Conclusion
Okay, we’re almost done. It’s time to summarize.
MySQL’s full-text index initially supported only English-like text, because English words are separated by spaces, which makes spaces a convenient word delimiter. Asian languages such as Chinese, Japanese, and Korean do not put spaces between words, which is a limitation. Starting with MySQL 5.7.6, however, an ngram full-text parser was introduced to address this, and it is available for both the MyISAM and InnoDB engines.
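A minimal sketch of using the ngram parser follows. The table name articles_ngram, the index name body_ngram, and the sample rows are all invented for this example, and it assumes MySQL 5.7.6 or later with the default ngram_token_size.

```sql
create table articles_ngram (
    id int unsigned not null auto_increment,
    body text not null,
    primary key(id),
    -- WITH PARSER ngram splits the text into fixed-size character sequences
    fulltext key body_ngram(body) with parser ngram
) engine=InnoDB default charset=utf8mb4;

insert into articles_ngram (body) values ('全文索引'), ('数据库索引');

-- Search for a two-character term
select * from articles_ngram
where match(body) against('索引' in natural language mode);
```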
In practice, the MyISAM storage engine’s full-text index support has a number of limitations, such as the performance impact of table-level locking, data file corruption, and recovery after a crash, which make MyISAM full-text indexes unsuitable for many application scenarios. In most cases, therefore, the recommendation is to use another solution: a third-party tool such as Sphinx or Lucene, or the InnoDB storage engine’s full-text index.
A few points to remember
- Check version support before using full-text indexes;
- A full-text index is many times faster than LIKE + %, but it may have accuracy problems;
- If a large amount of data needs full-text indexing, load the data first and create the index afterwards;
- For Chinese, use MySQL 5.7.6 or later, or a third-party tool.
Source: blog.csdn.net/mrzhouxiaof…