Quick implementation of content similarity recommendations using ES

Question-and-answer systems: given a descriptive text entered by the user, find existing questions close to that input through similarity calculation. Recommendations: while a user is browsing an article, recommend other articles that are similar to it in content.

The more_like_this query finds documents that are similar to a given text or to a given document. To try it out, first create an index that contains the title and desc fields:

PUT /search_data
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "term_vector": "yes"
            },
            "desc": {
                "type": "text"
            }
        }
    }
}

term_vector: when set to yes, term vectors are stored at index time, which speeds up the similarity calculation. more_like_this still works on a field that has no term_vector configured, but the term vectors then have to be computed at query time, which is slower.
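
As a rough illustration, the body of the index-creation request above can also be generated from a list of field names, storing term vectors only on the fields that need them. This is a minimal sketch in plain Python; build_mapping and its arguments are hypothetical names and not part of any Elasticsearch client.

```python
import json

def build_mapping(text_fields, term_vector_fields=()):
    """Return an index-creation body with text fields (hypothetical helper)."""
    properties = {}
    for name in text_fields:
        field = {"type": "text"}
        # Store term vectors only for the fields that ask for it.
        if name in term_vector_fields:
            field["term_vector"] = "yes"
        properties[name] = field
    return {"mappings": {"properties": properties}}

# Same fields as the search_data index: term vectors on title only.
body = build_mapping(["title", "desc"], term_vector_fields=["title"])
print(json.dumps(body, indent=4))
```

The printed JSON can be sent as the body of the PUT request.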

Make recommendations based on a short paragraph or a problem description statement

GET /_search
{
    "query": {
        "more_like_this": {
            "fields": ["title", "desc"],
            "like": "Qingming Festival spring outing spring tourism school spring outing parent-child outing enterprise outing",
            "min_term_freq": 1,
            "max_query_terms": 12
        }
    }
}

fields: the fields to query. Only text and keyword fields are supported.
like: the input to find similar documents for, either free text or a reference to an indexed document (by _index and _id).
min_term_freq: minimum term frequency. Terms that occur fewer times than this in the input are ignored.
max_query_terms: the maximum number of query terms. The terms of like with the highest TF-IDF scores are selected, up to this number, and the remaining terms are ignored.
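
To make the roles of min_term_freq and max_query_terms more concrete, here is a toy sketch of the term selection in plain Python. It is not Elasticsearch's implementation: the whitespace tokenizer, the simplified IDF formula, the select_terms function and the document-frequency numbers are all assumptions made for illustration.

```python
import math
from collections import Counter

def select_terms(text, doc_freq, num_docs, min_term_freq=1, max_query_terms=12):
    """Pick the highest-scoring terms from text (toy TF-IDF, whitespace tokens)."""
    term_freq = Counter(text.lower().split())
    scored = []
    for term, tf in term_freq.items():
        if tf < min_term_freq:
            continue  # occurs too rarely in the input text: ignored
        # Simplified IDF; terms missing from doc_freq are treated as df = 0.
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1
        scored.append((tf * idf, term))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]

# Made-up corpus statistics: how many of 1000 documents contain each term.
doc_freq = {"spring": 400, "outing": 50, "school": 200, "tourism": 20}
text = "spring outing spring tourism school spring outing"
selected = select_terms(text, doc_freq, num_docs=1000, max_query_terms=2)
print(selected)
```

With these made-up numbers, "outing" ranks above "spring" even though it occurs less often in the text, because it is much rarer in the corpus.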

In addition, if the text is too long, recommendations can be made from the article ID instead

GET /_search
{
    "query": {
        "more_like_this": {
            "fields": ["title", "desc"],
            "like": [
                {
                    "_index": "search_data",
                    "_id": "1"
                }
            ],
            "min_term_freq": 1,
            "max_query_terms": 12
        }
    }
}

like can be an array that contains several articles, and _index may point to an index other than the one being searched.
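
A small sketch of how such a like array might be assembled in application code, mixing document references with free text. The build_like helper and the index name other_index are hypothetical; only the shape of the resulting JSON follows the queries shown in this article.

```python
import json

def build_like(items):
    """Turn strings and (index, id) pairs into a more_like_this like list."""
    like = []
    for item in items:
        if isinstance(item, str):
            like.append(item)  # free text is passed through unchanged
        else:
            index, doc_id = item
            like.append({"_index": index, "_id": str(doc_id)})
    return like

like = build_like([
    ("search_data", 1),
    ("other_index", 42),  # hypothetical second index
    "parent-child spring outing",
])
query = {
    "query": {
        "more_like_this": {
            "fields": ["title", "desc"],
            "like": like,
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
}
print(json.dumps(query, indent=4))
```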

Fine-tuning the results

unlike: if you are not satisfied with the recommendations, you can fine-tune them with unlike. It is used in the same way as like, except that you pass in content you do not want, and its terms are suppressed during the similarity calculation (Elasticsearch does not select terms that are found in the unlike content). Note that the effect on the top-ranked recommendations is often not obvious.

GET search_data/_search
{
  "size": 112,
  "_source": ["desc", "title"],
  "query": {
    "more_like_this": {
      "fields": ["title", "desc"],
      "unlike": [
        {"_index": "search_data", "_id": "1270715"},
        {"_index": "search_data", "_id": "1238991"},
        {"_index": "search_data", "_id": "506680"},
        "I'm going to block things I don't like."
      ],
      "like": [
        {"_index": "search_data", "_id": "986604"}
      ],
      "min_term_freq": 1
    }
  }
}

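One way to picture the effect of unlike, following the Elasticsearch documentation's description that terms found in unlike are not selected for the query, is the toy sketch below. It is an illustration only: the whitespace tokenizer and the candidate_terms function are assumptions, and the real query also applies the frequency and TF-IDF filters.

```python
def tokenize(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def candidate_terms(like_texts, unlike_texts):
    """Return terms from like_texts that do not occur in unlike_texts."""
    skip = {term for text in unlike_texts for term in tokenize(text)}
    kept = []
    for text in like_texts:
        for term in tokenize(text):
            if term not in skip and term not in kept:
                kept.append(term)  # keep first-seen order, no duplicates
    return kept

terms = candidate_terms(
    like_texts=["spring outing for school and enterprise"],
    unlike_texts=["enterprise team building"],
)
print(terms)
```

Here "enterprise" is dropped from the candidate terms because it appears in the unlike text, so it no longer contributes to the match.
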
Other optional parameters

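min_doc_freq: minimum document frequency; terms that appear in fewer documents than this are ignored. Defaults to 5.
max_doc_freq: maximum document frequency; terms that appear in more documents than this are ignored.
min_word_length: minimum length of a term; shorter terms are ignored.
max_word_length: maximum length of a term; longer terms are ignored.
stop_words: list of stop words, which are ignored.
analyzer: the analyzer used to analyze the input text.
minimum_should_match: the minimum number of selected terms that a document must match. Defaults to 30% of the terms produced by analyzing the input.
boost_terms: boost factor applied to each selected term according to its TF-IDF score.
include: whether the input documents should also be returned in the results.
boost: the boost of the whole query. Defaults to 1.0.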
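
Since most of these parameters are optional, application code usually only needs to send the ones it sets. The following is a minimal sketch of a query builder in plain Python; mlt_query and KNOWN_OPTIONS are hypothetical names, and the option list is limited to the parameters mentioned in this article.

```python
import json

# Option names taken from the parameter lists in this article.
KNOWN_OPTIONS = {
    "min_term_freq", "max_query_terms", "min_doc_freq", "max_doc_freq",
    "min_word_length", "max_word_length", "stop_words", "analyzer",
    "minimum_should_match", "boost_terms", "include", "boost", "unlike",
}

def mlt_query(fields, like, **options):
    """Build a more_like_this query body (hypothetical helper)."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown more_like_this options: {sorted(unknown)}")
    mlt = {"fields": list(fields), "like": like}
    # Leave out options set to None so Elasticsearch applies its defaults.
    mlt.update({k: v for k, v in options.items() if v is not None})
    return {"query": {"more_like_this": mlt}}

body = mlt_query(
    ["title", "desc"],
    [{"_index": "search_data", "_id": "1"}],
    min_term_freq=1,
    min_doc_freq=2,
    stop_words=["the", "a"],
    include=True,
    max_doc_freq=None,  # omitted from the request
)
print(json.dumps(body, indent=4))
```

Rejecting unknown option names early catches typos such as min_trem_freq before the request is sent.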

Author: Yi Qixiu Engineer Yarn -> Personal home page
