The problem

In recent attempts to search data indexed with Sphinx, the documents I wanted were often missing from the results, the returned list was poor, and significantly better matches never showed up. After some digging, I traced the problem to Sphinx's SetLimits call: I had set a value for the cutoff parameter.

SetLimits($offset, $limit, $max_matches=1000, $cutoff=0)

$max_matches controls the maximum number of matches the server keeps and can return for the current query. $cutoff limits how long the search runs: searchd stops searching as soon as $cutoff matching documents have been found.
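For context, here is a minimal PHP sketch of these parameters in use, assuming the standard sphinxapi.php client; the server address and the index name 'articles' are placeholders:

<?php
require('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// $offset=0, $limit=20: return the first 20 matches to the client.
// $max_matches=1000: keep at most the best 1000 matches server-side.
// $cutoff=0: no cutoff, every matching document is examined.
$cl->SetLimits(0, 20, 1000, 0);

$result = $cl->Query('some keywords', 'articles');
if ($result === false)
    echo 'Query failed: ' . $cl->GetLastError();
else
    printf("showing %d of %d total matches\n", $result['total'], $result['total_found']);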

The official documentation

Sets offset into server-side result set ($offset) and amount of matches to return to client starting from that offset ($limit). Can additionally control maximum server-side result set size for current query ($max_matches) and the threshold amount of matches to stop searching at ($cutoff). All parameters must be non-negative integers.

The first two parameters to SetLimits() are identical in behavior to the MySQL LIMIT clause. They instruct searchd to return at most $limit matches starting from match number $offset. The default offset and limit are 0 and 20, that is, return the first 20 matches.

The max_matches setting controls how many matches searchd will keep in RAM while searching. All matching documents are still processed, ranked, filtered, and sorted even if max_matches is set to 1, but only the best N documents are kept in memory at any given moment for performance and RAM usage reasons, and this setting controls that N. Note that the max_matches limit is enforced in two places: the per-query limit is controlled by this API call, but there is also a per-server limit controlled by the max_matches setting in the config file. To prevent RAM abuse, the server will not allow a per-query limit higher than the per-server limit.

You can’t retrieve more than max_matches matches to the client application. The default limit is 1000, and normally you should not need to go over it; one thousand records is enough to present to the end user. And if you’re thinking about pulling the results into the application for further sorting or filtering, that is much more efficient when performed on the Sphinx side.

The $cutoff setting is intended for advanced performance control. It tells searchd to forcibly stop the search query once $cutoff matches have been found and processed.
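That point about sorting and filtering on the Sphinx side is worth a sketch. This assumes the same sphinxapi.php client and two hypothetical attributes, category_id and date_added:

<?php
require('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Filter and sort inside searchd rather than in PHP:
// keep only documents whose category_id attribute is 3 or 7 ...
$cl->SetFilter('category_id', array(3, 7));
// ... and order the matches by the date_added attribute, newest first.
$cl->SetSortMode(SPH_SORT_ATTR_DESC, 'date_added');

// Only the best 20 of at most 1000 matches travel to the client.
$cl->SetLimits(0, 20, 1000);

$result = $cl->Query('some keywords', 'articles');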

Offset and limit

[Figure: the result set split into pages, each page addressed by an $offset that advances in steps of $limit]
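In practice this is plain pagination; a minimal sketch, with an arbitrary page size and page number:

<?php
require('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

$perPage = 20;
$page    = 3;  // 1-based page number requested by the user

// Page 3 at 20 per page: skip 40 matches, return the next 20,
// exactly like MySQL's "LIMIT 40, 20".
$cl->SetLimits(($page - 1) * $perPage, $perPage);

$result = $cl->Query('some keywords', 'articles');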

max_matches

The max_matches argument is easy to confuse with the cutoff argument. It is the maximum number of results a query can return. For a search on some string, the best 1000 results are usually all anyone needs; beyond that they are useless, and even Baidu or Google do not return every matching document. max_matches controls that count of (by default) 1000 documents.

To produce them, Sphinx still walks its inverted lists for the search string and processes (sorts, filters) every document that satisfies the query, then keeps only the first max_matches. So even with max_matches set to 1, Sphinx must still handle all matching documents before keeping the top one. The parameter can be set on each request or in the Sphinx config file, and the config file takes precedence: max_matches on a request cannot be higher than max_matches in the config.
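A sketch of raising the per-query cap; the 2000 here is an arbitrary example and only works if the server-side limit permits it:

<?php
require('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Keep the best 2000 matches server-side instead of the default 1000.
// This is only accepted if the searchd section of sphinx.conf sets
// max_matches to 2000 or more; otherwise the query is rejected.
$cl->SetLimits(0, 50, 2000);

$result = $cl->Query('some keywords', 'articles');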

cutoff

Like max_matches, cutoff caps the number of documents handled, but it works differently: Sphinx collects matching documents from the beginning of the list, and once the cutoff is reached it stops scanning and examines no further documents at all. The final processing (sorting, filtering) then takes place only within those cutoff documents. Once a cutoff is set, max_matches effectively sits below it and operates only on the documents the cutoff collected. That is exactly why many of my better documents were never returned: they lay beyond the cutoff I had set, so the fix was simply to remove it.
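A sketch of that behavior; the 5000 is an arbitrary example:

<?php
require('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Stop scanning as soon as 5000 matching documents have been found.
// Ranking, sorting and filtering then apply to those 5000 only, so a
// better match stored later in the index is never even examined.
// Passing 0 (the default) disables the cutoff.
$cl->SetLimits(0, 20, 1000, 5000);

$result = $cl->Query('some keywords', 'articles');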

Obviously, a cutoff greatly reduces the amount of data the search has to scan and improves performance, but it can cost a great deal of search precision, since better matches beyond the cutoff are never considered. If the precision requirements are not high and the cutoff does not cause a noticeable loss, it is a legitimate way to speed up searching.