What/ What is Sphinx
Definition: Sphinx is a full-text search engine.
Features:
- Excellent indexing and search performance
- Easy integration with SQL and XML data sources, and a choice of search interfaces: SphinxAPI, SphinxQL, or SphinxSE
- Easy to scale through distributed search
- High-speed indexing (peaks of 10-15 MB/s on modern CPUs)
- High-speed searching (on roughly 1.2 GB of text across 1 million documents, about 150-250 queries per second)
Why/ Why use Sphinx
Usage scenarios encountered
We ran into a requirement like this: users should be able to search by article title and article content, but titles and content are stored in different databases, which in turn sit in different machine rooms.
Alternatives
A. Run cross-database LIKE queries directly in MySQL
Advantages: simple to implement. Disadvantages: inefficient, and it causes considerable network overhead.
B. Combine MySQL with the Sphinx Chinese full-text search engine
Advantages: efficient, highly scalable. Disadvantages: Sphinx is not responsible for data storage.
Sphinx indexes the data, loading it in once and then keeping it in memory, so the application only has to query the Sphinx server to retrieve results. Sphinx also avoids the disk I/O weakness that accompanies MySQL, so it performs better.
Other typical usage scenarios
1. Fast, efficient, scalable, core full-text retrieval
When the data volume is large, Sphinx is faster than MyISAM and InnoDB. It can build indexes over mixed data from multiple source tables rather than being limited to the fields of a single table (see the sketch below), it can merge search results from multiple indexes, and full-text searches can be refined with additional conditions on attributes.
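For instance, an index can list several sources so that a single search covers data drawn from more than one table. A minimal, hypothetical sketch (the source names and tables are made up; only the multiple source lines matter here):

```
# hypothetical sources, e.g. article titles and article bodies pulled from different tables
index articles
{
    source = src_article_title
    source = src_article_body      # document IDs must stay unique across all sources
    path   = /usr/local/coreseek/var/data/articles
}
```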
2. Use WHERE clauses and LIMIT clauses efficiently
When a SELECT query carries several WHERE conditions but the indexes on those fields are not very selective, or the fields are not indexed at all, performance suffers. Sphinx can index the keywords instead. Another difference is that in MySQL the internal engine decides whether to use an index or a full scan, whereas Sphinx lets you choose the access method yourself. Because Sphinx keeps the data in RAM, it performs very little I/O. MySQL, by contrast, ends up doing what amounts to semi-random disk I/O: it reads records row by row into the sort buffer, sorts them, and then throws most of the rows away. So Sphinx uses less memory and less disk I/O.
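As an illustration, the same kind of filtering can be pushed down to Sphinx through SphinxQL, where the full-text match, the attribute filter, and LIMIT are all handled by the engine (test_index and add_time come from the example configuration later in this article; the keyword and the timestamp literal are placeholders):

```sql
SELECT *
FROM test_index
WHERE MATCH('keyword') AND add_time > 1500000000
ORDER BY add_time DESC
LIMIT 0, 20;
```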
3. Optimize GROUP BY queries
Sphinx sorts and groups within a fixed amount of memory, which is slightly more efficient than comparable MySQL queries whose entire data set can fit in RAM.
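A rough SphinxQL sketch of such a grouped query (user_id is a hypothetical attribute, not part of the example configuration in this article):

```sql
SELECT user_id, COUNT(*) AS posts
FROM test_index
WHERE MATCH('keyword')
GROUP BY user_id
LIMIT 20;
```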
4. Generate result sets in parallel
Sphinx lets you produce several result sets from the same data at the same time, again using a fixed amount of memory. By contrast, the traditional SQL approach either runs two queries or creates a temporary table for each result set. Sphinx accomplishes this with its multi-query mechanism: instead of issuing queries one after another, you put several queries into a batch and submit them in a single request.
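For example, with the sphinxapi.py client that ships in Sphinx's api/ directory, several queries can be queued with AddQuery and sent as one batch with RunQueries. This is a minimal sketch; the host, port, and keywords are assumptions, while test_index is the index defined later in this article:

```python
# minimal multi-query sketch using the sphinxapi.py client bundled with Sphinx
import sphinxapi

cl = sphinxapi.SphinxClient()
cl.SetServer('127.0.0.1', 9312)        # searchd host and default API port (assumed)
cl.SetLimits(0, 20)                    # offset 0, at most 20 matches per query

# queue several full-text queries, then submit them in a single request
cl.AddQuery('first keyword', 'test_index')
cl.AddQuery('second keyword', 'test_index')
results = cl.RunQueries()              # one round trip; searchd can share work between them

for res in results or []:              # RunQueries returns None on a connection error
    print(res.get('total_found'), [m['id'] for m in res.get('matches', [])])
```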
5. Scale up and out
Scaling up: add CPUs/cores and expand disk I/O. Scaling out: add machines, i.e. distributed Sphinx.
6. Aggregate sharded data
This is ideal when the data is distributed across different physical MySQL servers. Example: a 1 TB table holding 1 billion posts, sharded across 10 MySQL servers by user ID. Queries for a single user are certainly fast, but an archive paging feature that shows all posts made by a user's friends would need to hit several MySQL servers simultaneously, which is slow. Sphinx, on the other hand, only needs a few instances that map the frequently accessed post attributes from each shard, and the paginated query then takes about three lines of code; a configuration sketch follows.
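A hedged sketch of such a setup using a distributed index that fans queries out to per-shard indexes (host names, ports, and index names are invented for illustration):

```
index posts_all
{
    type  = distributed
    # one agent per remote shard: host:port:index_name
    agent = shard1.example.com:9312:posts_shard1
    agent = shard2.example.com:9312:posts_shard2
    # a shard served by this machine's own searchd, if any
    local = posts_shard0
}
```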
How/ How to use Sphinx
Sphinx workflow flowchart:
Flow chart interpretation:
- Database: the data source, i.e. the data that Sphinx indexes. Because Sphinx is engine- and database-independent, the source can be MySQL, PostgreSQL, XML, and so on.
- Indexer: the indexing program. It fetches data from the data source and generates a full-text index over that data. Indexer can be run periodically as needed to keep the index up to date.
- Searchd: the search daemon. It talks directly to the client program and uses the indexes built by the Indexer program to answer search queries quickly.
- APP: the client program. It receives a search string from user input, sends the query to the Searchd program, and displays the returned results.
How Sphinx works
The overall workflow is: the Indexer program pulls data out of the database, parses it, builds one or more indexes from the parsed data, and hands them over to the searchd program. Clients can then search through API calls.
Now that you’ve seen how Sphinx works, it’s time to get Sphinx to work. Let’s take a look at the configuration of Sphinx.
The Sphinx configuration
Data source configuration
Let’s start with an example of a data source configuration file:
```
source test
{
    type                = mysql
    sql_host            = 127.0.0.1
    sql_user            = root
    sql_pass            = root
    sql_db              = test
    sql_port            = 3306   # optional, default is 3306

    sql_query_pre       = SET NAMES utf8
    sql_query           = SELECT id, name, add_time FROM tbl_test
    sql_attr_timestamp  = add_time

    sql_query_info_pre  = SET NAMES utf8
    sql_query_info      = SELECT * FROM tbl_test WHERE id=$id
}
```
Where:
- source is followed by the name of the data source, which is referenced later when defining the index;
- type: the data source type (MySQL, PostgreSQL, Oracle, etc.);
- sql_host, sql_user, sql_pass, sql_db, sql_port are the connection settings for the database;
- sql_query_pre: defines the encoding used for the query;
- sql_query: the core statement of the data source configuration; Sphinx uses it to pull data out of the database;
- sql_attr_*: index attributes, extra information (values) attached to each document that can be used for filtering and sorting when searching. Once attributes are set, Sphinx returns them when the search API is called;
- sql_query_info_pre: sets the encoding for command-line queries; set it if the results look garbled while debugging from the command line;
- sql_query_info: sets the information returned for command-line queries.
The index configuration
```
index test_index
{
    source           = test
    path             = /usr/local/coreseek/var/data/test
    docinfo          = extern
    charset_dictpath = /usr/local/mmseg3/etc/
    charset_type     = zh_cn.utf-8
    ngram_len        = 1
    ngram_chars      = U+3000..U+2FA1F
}
```
Where:
- The test_index that follows index is the index name;
- source: the name of the data source;
- path: the base name of the index files; the indexer program uses it as a prefix when generating the index file names (for example, a path of /usr/local/sphinx/data/test1 yields files such as test1.spa, test1.spd, and so on);
- docinfo: the storage mode for the indexed documents' attribute values;
- charset_dictpath: the directory holding the dictionary files when Chinese word segmentation is enabled; the uni.lib dictionary file must exist in this directory;
- charset_type: the data encoding type;
- ngram_len: the segmentation length;
- ngram_chars: the set of characters recognized in unary (single-character) segmentation mode.
Chinese word segmentation core configuration
Unary (single-character) segmentation

```
charset_type = utf8
ngram_len    = 1
ngram_chars  = U+3000..U+2FA1F
```

mmseg segmentation

```
charset_type     = utf8
charset_dictpath = /usr/local/mmseg3/etc/
ngram_len        = 0
```
Run the example
Database data
Use the indexer program to index
The query
As you can see, the add_time attribute set in the configuration file is returned with the search results, as shown in Figure 1 above, and sql_query_info determines the information shown in Figure 2.
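The commands behind those steps look roughly like this (the binary and configuration paths assume a default Coreseek layout and may differ on your machine):

```
# build (or rebuild) test_index as defined in the configuration file;
# add --rotate when searchd is already running so the new index gets picked up
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf test_index

# quick check from the command line with the bundled search utility
/usr/local/coreseek/bin/search -c /usr/local/coreseek/etc/csft.conf "keyword"

# start the search daemon so client programs can query through the API or SphinxQL
/usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft.conf
```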
Sphinx configuration is not particularly complex; the configuration of each part has been shown here following the workflow. For more advanced configuration, refer to the documentation as you need it.
With Sphinx configured, let's move on to how Indexer, Sphinx's indexing program, builds its indexes.
After Sphinx reads the data from the database according to the configuration file, the data is passed to the Indexer program, which reads the records one by one and indexes each record using the configured word segmentation algorithm (unary segmentation or mmseg segmentation). Let's start with the data structures and algorithms that Indexer uses.
Inverted index
An inverted index is a data structure used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.
Inverted Index: an inverted index is a concrete storage form that realizes the word-document matrix. Through the inverted index, the list of documents containing a given word can be obtained quickly. The inverted index is also known as a reverse index.
A traditional (forward) index maps index ID -> document content, whereas an inverted index maps document content (after segmentation) -> index ID. You can understand the difference by analogy with forward and reverse proxies: a forward proxy forwards internal requests to the outside, while a reverse proxy forwards external requests to the inside. So the name inverted (transposed) index is quite fitting.
The inverted index consists of two main sections: the word dictionary and the inverted file.
The word dictionary is a very important part of the inverted index. It maintains information about every word that has appeared in the document collection and records where, inside the inverted file, the inverted list for each word is located. When serving a search, the engine looks the user's query terms up in the word dictionary to obtain the corresponding inverted lists, which then serve as the basis for subsequent ranking.
A large document collection may contain hundreds of thousands or even millions of distinct words, and how quickly a word can be located directly affects search response time, so efficient data structures are needed for building and looking up the word dictionary; the common choices are hash tables with chaining and tree-structured dictionaries.
Inverted indexing basics
- Document: a general-purpose search engine processes Internet web pages, but the concept of a document is broader; it stands for any storage object that exists in text form. It covers more formats than web pages: Word, PDF, HTML, XML and other file formats can all be called documents, as can an email, a text message, or a tweet. In what follows, "document" is used to represent textual information.
- Document Collection: a set of documents is called a document collection. A large number of Internet pages, or a large number of emails, are concrete examples of document collections.
- Document ID (DocID): inside the search engine, every document in the collection is assigned a unique internal number that serves as its identifier and makes internal processing convenient. This internal number is called the document ID; DocID is used below as a shorthand for it.
- Word ID: similar to the document ID, the search engine represents each word with a unique internal number, and the word ID serves as that word's unique identifier.
The Indexer program segments the fetched records according to the configured word segmentation algorithm and then stores the result using an inverted index as the data structure.
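A toy Python sketch of that idea, using the unary segmentation described below (every character becomes a token); the two sample records are made up:

```python
from collections import defaultdict

# made-up records, keyed by document ID (DocID)
docs = {
    1: "研究生命科学",
    2: "研究生命令本科生",
}

def unigram_tokens(text):
    """Unary segmentation: every (non-space) character becomes a token."""
    return [ch for ch in text if not ch.isspace()]

# inverted index: token -> set of DocIDs whose text contains that token
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in unigram_tokens(text):
        inverted[token].add(doc_id)

# a query is segmented the same way; a document matches if it contains every query token
query = "生命"
hits = set(docs)
for token in unigram_tokens(query):
    hits &= inverted.get(token, set())
print(sorted(hits))   # -> [1, 2]
```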
Word segmentation algorithm
Unary segmentation

Core configuration of unary segmentation:

```
charset_type = zh_cn.utf8
ngram_len    = 1
ngram_chars  = U+4E00..U+9FBF
```

ngram_len is the segmentation length.
ngram_chars identifies the character set used for the unary segmentation mode.
The only segmentation natively supported by Sphinx is unary segmentation: every single character of a record is cut out and indexed. The advantage of such an index is high coverage, which guarantees that every record can be found. The disadvantages are that it produces large index files and consumes a lot of resources when the index is updated. So unless there is a special need and the amount of data is fairly small, unary segmentation is not recommended.
On top of Sphinx, Chinese developers built Coreseek, which supports Chinese word segmentation. The only difference between Coreseek and Sphinx is that Coreseek additionally supports the mmseg segmentation algorithm for Chinese.
mmseg segmentation
The mmseg segmentation algorithm is based on a statistical model, so its rules come from corpus analysis and mathematical induction. Because Chinese text has no clear boundaries between words, it produces a large number of boundary ambiguities; on top of that, it is hard in Chinese to pin down what counts as a word or a phrase. So besides doing the statistics and induction, the algorithm also has to resolve ambiguity.
In MMSEG, there is a concept called chunk.
A chunk is one candidate segmentation of a piece of text: it holds an array of words together with the four attributes used by the rules below.
For example, 研究生命 ("research life") can be segmented as 研究/生命 ("research / life") or 研究生/命 ("graduate student / life"). These are two chunks.
A chunk has four attributes: total length, average word length (length divided by the number of words), variance of the word lengths, and single-character word freedom (the sum of the logarithms of the frequencies of its single-character words).
After segmentation, several candidate segmentations are obtained; filtering rules are then applied to resolve the ambiguity and arrive at the final segmentation.
Ambiguity Resolution Rules:
1. Maximum matching: prefer the chunk that matches the longest words. For example, 国际化 ("internationalization") can be segmented as 国际/化 ("international / -ize") or kept whole as 国际化; the latter should be chosen.
2. Largest average word length: prefer the chunk with the largest average word length. For example, 南京市长江大桥 ("Nanjing Yangtze River Bridge") can be segmented as 南京市/长江大桥 ("Nanjing City / Yangtze River Bridge") or 南京/市长/江大桥 ("Nanjing / mayor / Jiang Bridge"). The average word length of the former is 7/2 = 3.5 and of the latter 7/3 ≈ 2.3, so the former is chosen.
3. Smallest variance of word lengths: prefer the chunk whose word lengths are the most even. For example, 研究生命科学 ("research life science") can be segmented as 研究生/命/科学 ("graduate student / fate / science") or 研究/生命/科学 ("research / life / science"). Both have the same average word length of 2, so filtering continues: the spread of word lengths in the former is about 0.82 while in the latter it is 0, so the latter ("research / life / science") is chosen.
4. Largest sum of single-character word frequency (degree of freedom): prefer the chunk whose single-character words occur most frequently. For example, 主要是因为 ("mainly because") can be segmented as 主要/是/因为 ("mainly / is / because") or 主/要是/因为 ("main / if / because"). Their lengths and variances are the same, but 是 ("is") is far more frequent than 主, so the first segmentation is chosen.
If more than one chunk still remains after these four filters, there is nothing more the algorithm can do; you would have to write your own extension.
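A rough Python sketch of how the four filters could be applied to candidate chunks. The frequency table and the candidate chunks are made up for illustration; a real implementation derives both from a dictionary and corpus statistics:

```python
import math

# hypothetical corpus frequencies for single-character words (made up for illustration)
freq = {"是": 5000, "主": 900}

def chunk_stats(words):
    """Return the four chunk attributes, oriented so that larger is always better."""
    lengths = [len(w) for w in words]
    total = sum(lengths)                           # rule 1: maximum total length
    avg = total / len(words)                       # rule 2: largest average word length
    variance = sum((l - avg) ** 2 for l in lengths) / len(words)
    # rule 4: single-character word freedom, the sum of log frequencies of 1-char words
    freedom = sum(math.log(freq.get(w, 1)) for w in words if len(w) == 1)
    return (total, avg, -variance, freedom)        # variance negated: smaller variance wins

def pick(chunks):
    """Apply the four filters in order, keeping only the best candidates at each step."""
    for rule in range(4):
        best = max(chunk_stats(c)[rule] for c in chunks)
        chunks = [c for c in chunks if chunk_stats(c)[rule] == best]
        if len(chunks) == 1:
            break
    return chunks[0]   # if several chunks still remain, just take the first one

# the "mainly because" example from above: 主要/是/因为 vs 主/要是/因为
candidates = [["主要", "是", "因为"], ["主", "要是", "因为"]]
print(pick(candidates))   # -> ['主要', '是', '因为'] because "是" is far more frequent than "主"
```

Each rule only narrows down the surviving candidates, and the next rule is consulted only when a tie remains, mirroring the filtering order described above.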
One last thing
Of course, you could argue that a database index could do what a Sphinx index does, only with a different data structure; but the big difference is that Sphinx works like a single-table database with no support for relational queries. Moreover, an index is mainly there to implement search rather than to be the primary source of data. So your database may conform to third normal form, while the index is completely denormalized and mainly contains the data that needs to be searched.
On the other hand, most databases suffer from an internal fragmentation problem that makes them run into too much semi-random I/O within a single large request. Put differently: in a database, the query goes to the index and the index points at the data; if fragmentation has scattered that data across the disk, the query takes a long time.
Conclusion
From using Sphinx in a real project, I found that the key point is the configuration file. Once you understand the configuration, the basic usage is easy to pick up. If you want to dig deeper, for example into how it works internally, you have to read more material. I have not used the advanced features yet and will share them later. Finally, if you want to extend Sphinx and customize more powerful features, you can read the source code and write extensions. Using Sphinx also has its drawbacks: to keep search quality high, you have to maintain the dictionary manually on a regular basis. If you cannot keep the dictionary up to date, you might consider a plug-in service such as Baidu search instead. It would be even better if machine learning could be added.
This is an original article; my writing and knowledge are limited, so if anything here is wrong, please let me know.
If this article helped you, please click the recommend button. Writing articles is not easy.