Elasticsearch master episode 73
Term Vector: term vectors in Elasticsearch

Introduction to term vectors
The term vector API returns statistics for each term within a field of a document:

- Term information:
  - term frequency in the field
  - term positions
  - start and end offsets
  - term payloads
- Term statistics (set term_statistics = true):
  - total term frequency: how often a term occurs across all documents
  - document frequency: how many documents contain the term
- Field statistics:
  - document count: how many documents contain the field
  - sum of document frequencies: the sum of df over all terms in the field
  - sum of total term frequencies: the sum of ttf over all terms in the field
GET /twitter/tweet/1/_termvectors

GET /twitter/tweet/1/_termvectors?fields=text
Term statistics and field statistics are not precise: deleted documents are not taken into account, so treat them as approximate.
Frankly, this API is rarely used, but when it is, it's generally for probing data. For example, you may want to see how many documents a particular term, say "Journey to the West", appears in; or, for a field such as film_desc (a movie's description), how many docs contain a given term, as in the sketch below.
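As a minimal sketch of such a probe (the films index, film type, doc id, and film_desc field are hypothetical, not part of the experiments below), requesting term statistics for an existing doc exposes doc_freq, the number of documents containing each of its terms:

GET /films/film/1/_termvectors
{
  "fields" : ["film_desc"],
  "term_statistics" : true,
  "field_statistics" : true
}

In the response, each entry under terms carries doc_freq (how many documents contain that term) and ttf (how often it occurs across all documents).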
Index-time term vector experiment
Term vectors involve a lot of term- and field-level statistics, which can be collected in two ways:

- Index-time: configure term vectors in the mapping, and the term and field statistics are generated while documents are indexed
- Query-time: no term vector information is generated up front; when you request the term vectors, everything is computed on the fly and returned

The goals here:

- Master how to collect term vector information
- Learn how to use term vectors for data exploration
- Create the index
PUT /waws_index
{
  "mappings": {
    "waws_type": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "type_as_payload"]
        }
      }
    }
  }
}
- Insert the data
PUT /waws_index/waws_type/1
{
  "fullname" : "Leo Li",
  "text" : "hello test test test "
}

PUT /waws_index/waws_type/2
{
  "fullname" : "Leo Li",
  "text" : "other hello test ..."
}
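If you reproduce this, note that docs become visible to these statistics only after a refresh; with the default 1s refresh interval that happens almost immediately, but an explicit refresh (a standard API) removes the timing dependency:

POST /waws_index/_refresh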
- Get the term vectors
GET /waws_index/waws_type/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
{
  "_index": "waws_index",
  "_type": "waws_type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 19,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,  # sum of df over all terms in the field
        "doc_count": 2,     # how many docs contain this field
        "sum_ttf": 8        # sum of ttf over all terms in the field
      },
      "terms": {
        "hello": {
          "doc_freq": 2,    # how many docs contain this term
          "ttf": 2,         # how often the term occurs across all docs
          "term_freq": 1,   # how often "hello" occurs in doc 1
          "tokens": [
            {
              "position": 0,         # position
              "start_offset": 0,     # start offset
              "end_offset": 5,       # end offset
              "payload": "d29yZA=="  # base64 of "word", the token type stored by type_as_payload
            }
          ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            { "position": 1, "start_offset": 6, "end_offset": 10, "payload": "d29yZA==" },
            { "position": 2, "start_offset": 11, "end_offset": 15, "payload": "d29yZA==" },
            { "position": 3, "start_offset": 16, "end_offset": 20, "payload": "d29yZA==" }
          ]
        }
      }
    }
  }
}
Query-time term vector experiment
GET /waws_index/waws_type/1/_termvectors
{
  "fields" : ["fullname"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

{
  "_index": "waws_index",
  "_type": "waws_type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 39,
  "term_vectors": {
    "fullname": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 4
      },
      "terms": {
        "leo": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
        },
        "li": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
        }
      }
    }
  }
}
In general, if conditions permit, use query-time term vectors: whatever data you want to probe can be computed on the fly, without storing term vectors at index time.
Manually specify a doc to get term vectors for
GET /waws_index/waws_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
{
  "_index": "waws_index",
  "_type": "waws_type",
  "_version": 0,
  "found": true,
  "took": 1,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            { "position": 1, "start_offset": 6, "end_offset": 10 },
            { "position": 2, "start_offset": 11, "end_offset": 15 },
            { "position": 3, "start_offset": 16, "end_offset": 20 }
          ]
        }
      }
    }
  }
}
Here you manually specify a doc: you are not pointing at an existing doc, but passing in the content you want analyzed (hello test test test). Elasticsearch splits each field into terms, and then, for each term, computes statistics against all existing docs. This is very useful: it lets you probe the data situation for arbitrary terms, for example the statistics of the term "Journey to the West", as sketched below.
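A minimal sketch of that probe against this same index (the probe text is arbitrary; the whitespace tokenizer plus lowercase filter splits it into the terms journey, to, the, west):

GET /waws_index/waws_type/_termvectors
{
  "doc" : {
    "text" : "Journey to the West"
  },
  "fields" : ["text"],
  "term_statistics" : true,
  "field_statistics" : true
}

Each returned term's doc_freq then tells you how many existing docs contain it.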
Manually specify an analyzer to generate the term vector. The per_field_analyzer setting overrides, for this request only, the analyzer defined in the mapping:
GET /waws_index/waws_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer" : {
    "text": "standard"
  }
}

{
  "_index": "waws_index",
  "_type": "waws_type",
  "_version": 0,
  "found": true,
  "took": 0,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            { "position": 1, "start_offset": 6, "end_offset": 10 },
            { "position": 2, "start_offset": 11, "end_offset": 15 },
            { "position": 3, "start_offset": 16, "end_offset": 20 }
          ]
        }
      }
    }
  }
}
Terms filtering
GET /waws_index/waws_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "filter" : {
    "max_num_terms" : 3,
    "min_term_freq" : 1,
    "min_doc_freq" : 1
  }
}

{
  "_index": "waws_index",
  "_type": "waws_type",
  "_version": 0,
  "found": true,
  "took": 1,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ],
          "score": 1
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            { "position": 1, "start_offset": 6, "end_offset": 10 },
            { "position": 2, "start_offset": 11, "end_offset": 15 },
            { "position": 3, "start_offset": 16, "end_offset": 20 }
          ],
          "score": 3
        }
      }
    }
  }
}
In other words, you can filter the returned term vector statistics down to just the terms you want, based on their statistics. When probing data, for example, you can filter out terms that occur too infrequently. A sketch of the reverse filter follows below.
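The filter also works in the other direction: max_doc_freq and min_word_length are, like the settings above, standard options of the term vectors filter object. A minimal sketch that keeps only rarer, longer terms:

GET /waws_index/waws_type/_termvectors
{
  "doc" : {
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "term_statistics" : true,
  "filter" : {
    "max_doc_freq" : 1,     # drop terms that occur in more than 1 doc
    "min_word_length" : 3   # drop terms shorter than 3 characters
  }
}

Against this two-doc index, hello and test each appear in both docs (doc_freq 2), so max_doc_freq: 1 filters them both out; the point here is the mechanics, not the output.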
Multi term vectors

- The first form: specify the index, type, and id for each doc in the request body. The second doc below targets my_index, which does not exist, so it comes back with an index_not_found_exception:
GET _mtermvectors
{
  "docs": [
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "_id": "2",
      "term_statistics": true
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "1",
      "fields": ["text"]
    }
  ]
}

{
  "docs": [
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 0,
      "term_vectors": {
        "text": {
          "field_statistics": {
            "sum_doc_freq": 6,
            "doc_count": 2,
            "sum_ttf": 8
          },
          "terms": {
            "...": {
              "doc_freq": 1,
              "ttf": 1,
              "term_freq": 1,
              "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20, "payload": "d29yZA==" } ]
            },
            "hello": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1,
              "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11, "payload": "d29yZA==" } ]
            },
            "other": {
              "doc_freq": 1,
              "ttf": 1,
              "term_freq": 1,
              "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
            },
            "test": {
              "doc_freq": 2,
              "ttf": 4,
              "term_freq": 1,
              "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16, "payload": "d29yZA==" } ]
            }
          }
        }
      }
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "1",
      "error": {
        "root_cause": [
          {
            "type": "index_not_found_exception",
            "reason": "no such index",
            "index_uuid": "_na_",
            "index": "my_index"
          }
        ],
        "type": "index_not_found_exception",
        "reason": "no such index",
        "index_uuid": "_na_",
        "index": "my_index"
      }
    }
  ]
}
- The second form: specify the index in the URL. These lookups use type test, which does not exist in waws_index, so neither doc is found:
GET /waws_index/_mtermvectors
{
  "docs": [
    {
      "_type": "test",
      "_id": "2",
      "fields": ["text"],
      "term_statistics": true
    },
    {
      "_type": "test",
      "_id": "1"
    }
  ]
}

{
  "docs": [
    {
      "_index": "waws_index",
      "_type": "test",
      "_id": "2",
      "_version": 0,
      "found": false,
      "took": 0
    },
    {
      "_index": "waws_index",
      "_type": "test",
      "_id": "1",
      "_version": 0,
      "found": false,
      "took": 0
    }
  ]
}
- The third form: specify both index and type in the URL, leaving only ids (and options) in the body:
GET /waws_index/waws_type/_mtermvectors
{
  "docs": [
    {
      "_id": "2",
      "fields": ["text"],
      "term_statistics": true
    },
    {
      "_id": "1"
    }
  ]
}

{
  "docs": [
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 0,
      "term_vectors": {
        "text": {
          "field_statistics": {
            "sum_doc_freq": 6,
            "doc_count": 2,
            "sum_ttf": 8
          },
          "terms": {
            "...": {
              "doc_freq": 1,
              "ttf": 1,
              "term_freq": 1,
              "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20, "payload": "d29yZA==" } ]
            },
            "hello": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1,
              "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11, "payload": "d29yZA==" } ]
            },
            "other": {
              "doc_freq": 1,
              "ttf": 1,
              "term_freq": 1,
              "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
            },
            "test": {
              "doc_freq": 2,
              "ttf": 4,
              "term_freq": 1,
              "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16, "payload": "d29yZA==" } ]
            }
          }
        }
      }
    },
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 0,
      "term_vectors": {
        "text": {
          "field_statistics": {
            "sum_doc_freq": 6,
            "doc_count": 2,
            "sum_ttf": 8
          },
          "terms": {
            "hello": {
              "term_freq": 1,
              "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
            },
            "test": {
              "term_freq": 3,
              "tokens": [
                { "position": 1, "start_offset": 6, "end_offset": 10, "payload": "d29yZA==" },
                { "position": 2, "start_offset": 11, "end_offset": 15, "payload": "d29yZA==" },
                { "position": 3, "start_offset": 16, "end_offset": 20, "payload": "d29yZA==" }
              ]
            }
          }
        }
      }
    }
  ]
}
- The fourth form: pass artificial docs, just like the single-doc API. Both docs here run against waws_index:
GET /_mtermvectors
{
  "docs": [
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "doc" : {
        "fullname" : "Leo Li",
        "text" : "hello test test test"
      }
    },
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "doc" : {
        "fullname" : "Leo Li",
        "text" : "other hello test ..."
      }
    }
  ]
}

{
  "docs": [
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "_version": 0,
      "found": true,
      "took": 0,
      "term_vectors": {
        "fullname": {
          "field_statistics": {
            "sum_doc_freq": 4,
            "doc_count": 2,
            "sum_ttf": 4
          },
          "terms": {
            "leo": {
              "term_freq": 1,
              "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
            },
            "li": {
              "term_freq": 1,
              "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
            }
          }
        },
        "text": {
          "field_statistics": {
            "sum_doc_freq": 6,
            "doc_count": 2,
            "sum_ttf": 8
          },
          "terms": {
            "hello": {
              "term_freq": 1,
              "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
            },
            "test": {
              "term_freq": 3,
              "tokens": [
                { "position": 1, "start_offset": 6, "end_offset": 10 },
                { "position": 2, "start_offset": 11, "end_offset": 15 },
                { "position": 3, "start_offset": 16, "end_offset": 20 }
              ]
            }
          }
        }
      }
    },
    {
      "_index": "waws_index",
      "_type": "waws_type",
      "_version": 0,
      "found": true,
      "took": 0,
      "term_vectors": {
        "text": {
          "field_statistics": {
            "sum_doc_freq": 6,
            "doc_count": 2,
            "sum_ttf": 8
          },
          "terms": {
            "...": {
              "term_freq": 1,
              "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20 } ]
            },
            "hello": {
              "term_freq": 1,
              "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11 } ]
            },
            "other": {
              "term_freq": 1,
              "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
            },
            "test": {
              "term_freq": 1,
              "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16 } ]
            }
          }
        },
        "fullname": {
          "field_statistics": {
            "sum_doc_freq": 4,
            "doc_count": 2,
            "sum_ttf": 4
          },
          "terms": {
            "leo": {
              "term_freq": 1,
              "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
            },
            "li": {
              "term_freq": 1,
              "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
            }
          }
        }
      }
    }
  ]
}