Elasticsearch
A distributed full-text search engine
I. Use cases
- Information search
  - E-commerce sites
  - Job sites
  - News websites
- Log collection and analysis – ELK
- Data analysis – product sales, visits, spending amounts
II. Core concepts
- Index – comparable to a database
  - Shard – a slice of an index
    - Each shard corresponds to one Lucene index
    - Each shard has its own translog
- Type (to be removed) – comparable to a table
- Document – comparable to a row
- Field – comparable to a column
- Mapping – comparable to a schema: field definitions and constraints
III. API
Appending ?explain to the URL (for example on the _validate/query endpoint) shows why a statement is invalid.
1. Index
- Create an index – PUT /{index}
- Check whether an index exists – HEAD /{index}
- View index properties
  - Single – GET /{index}
  - Multiple – GET /{index1},{index2},{index3}
  - All – GET /_all
  - Summary list – GET /_cat/indices?v
- Open an index – POST /{index}/_open
- Close an index – POST /{index}/_close
- Delete indices – DELETE /{index1},{index2},{index3}
- Index migration – POST /_reindex
  - version_type
    - [Default] internal – copies documents directly and overwrites any existing document with the same ID
    - external – keeps the source version; an existing document is updated only when the source version is newer
  - op_type
    - create – only creates missing documents; an existing document causes a version conflict
  - conflicts
    - proceed – on a version conflict the migration continues instead of aborting, and only the number of conflicts is reported
  - query – restricts which source documents are copied; sorting and a document limit can also be set
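The migration options above all live in the request body of POST /_reindex. The following is a minimal Python sketch that only assembles that body; the helper name `build_reindex_body`, the index names, and the `payment` filter are illustrative assumptions, and sending the request is left to whatever HTTP client is in use.

```python
import json

def build_reindex_body(source_index, dest_index, *, query=None,
                       version_type="internal", op_type=None,
                       conflicts=None, max_docs=None):
    """Assemble the JSON body for POST /_reindex (hypothetical helper)."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query          # copy only matching documents
    dest = {"index": dest_index, "version_type": version_type}
    if op_type is not None:
        dest["op_type"] = op_type        # "create": never overwrite existing documents
    body = {"source": source, "dest": dest}
    if conflicts is not None:
        body["conflicts"] = conflicts    # "proceed": count conflicts and keep going
    if max_docs is not None:
        body["max_docs"] = max_docs      # upper bound on the number of documents copied
    return body

reindex_body = build_reindex_body(
    "jobs-v1", "jobs-v2",                # made-up index names
    query={"range": {"payment": {"gte": 10000}}},
    op_type="create", conflicts="proceed", max_docs=1000,
)
print(json.dumps(reindex_body, indent=2))
```

Combining `op_type: create` with `conflicts: proceed` is a common way to fill in only the documents that are missing from the destination.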
2. Mapping
- Create a mapping – PUT /{index}/_mapping
PUT /{index}/_mapping
{
  "properties": {
    "field name": {
      "type": "type",
      "index": true,
      "store": true,
      "analyzer": "analyzer name"
    }
  }
}
PUT /lagou-company-index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "ik_max_word"
    },
    "job": {
      "type": "text",
      "analyzer": "ik_max_word"
    },
    "logo": {
      "type": "keyword",
      "index": false
    },
    "payment": {
      "type": "float"
    }
  }
}
- View a mapping – GET /{index}/_mapping
- View all mappings
  - GET /_mapping
  - GET /_all/_mapping
- Modify a mapping – PUT /{index}/_mapping
  - Fields can only be added; the mapping of an existing field cannot be changed
  - To change an existing field, delete the index and rebuild it with the new mapping
- Create an index together with its mapping – PUT /{index}
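The add-only rule for mappings can be checked on the client before calling PUT /{index}/_mapping. The sketch below is a simplified Python illustration: `plan_mapping_update` is a hypothetical helper, `city` is a made-up field, and the comparison treats any difference in a field definition as a forbidden change, which is stricter than what Elasticsearch itself enforces.

```python
def plan_mapping_update(existing, desired):
    """Return the body that may be sent to PUT /{index}/_mapping.

    New fields are allowed; a field whose definition differs from the
    existing one is rejected, mirroring the add-only rule.
    """
    additions = {}
    for field, definition in desired.items():
        if field not in existing:
            additions[field] = definition
        elif existing[field] != definition:
            raise ValueError(f"cannot change mapping of existing field {field!r}")
    return {"properties": additions}

existing = {
    "name": {"type": "text", "analyzer": "ik_max_word"},
    "logo": {"type": "keyword", "index": False},
}
desired = dict(existing, city={"type": "keyword"})   # adds one new field
update_body = plan_mapping_update(existing, desired)
print(update_body)
```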
3. Document
- Add a document
  - With a specified ID – POST /{index}/_doc/{id}
  - With an auto-generated ID – POST /{index}/_doc
- View documents
  - By ID – GET /{index}/_doc/{id}
  - By condition – GET /{index}/_search
  - Filter the returned fields – GET /{index}/_doc/{id}?_source=field1,field2
- Update a document
  - Full update (the old document is deleted and a new one is added) – PUT /{index}/_doc/{id} – the document is created if the ID does not exist
  - Partial update (modify individual fields) – POST /{index}/_update/{id}
- Delete documents
  - By ID – DELETE /{index}/_doc/{id}
  - By condition – POST /{index}/_delete_by_query
- Batch retrieval
  - GET /_mget
  - GET /{index}/_mget
- Batch add, delete, and update – POST /_bulk
{"action": {"_index": "index name", "_id": "id"}}
{document data}
  - create – adds a document; fails if the ID already exists
  - index – adds a document or replaces it in full – equivalent to PUT
  - update – partially updates a document
  - delete – deletes a document
A batch of 1,000 to 5,000 documents per request, with a total request size of 5 MB to 15 MB, is recommended.
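The _bulk body is newline-delimited JSON: one action line, followed by a data line for every action except delete. The Python sketch below shows one way to serialize such a payload and to split a large job into batches; `build_bulk_payload`, `chunked`, and the sample documents are illustrative assumptions, not part of any client library.

```python
import json

def build_bulk_payload(operations):
    """Serialize (action, metadata, source) tuples as NDJSON for POST /_bulk."""
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}))
        if action == "update":
            lines.append(json.dumps({"doc": source}))   # partial document
        elif action != "delete":                        # delete has no data line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"                      # the body must end with a newline

def chunked(items, size=1000):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ops = [
    ("index",  {"_index": "jobs", "_id": "1"}, {"name": "engineer"}),
    ("update", {"_index": "jobs", "_id": "1"}, {"payment": 15000.0}),
    ("delete", {"_index": "jobs", "_id": "2"}, None),
]
payload = build_bulk_payload(ops)
print(payload)
```

A batch size of 1,000 is only a starting point within the recommended range; the right value depends on document size.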
IV. Mapping attributes
- type
  - String
    - text – analyzed (tokenized); cannot be aggregated
    - keyword – not analyzed; can be aggregated
  - Numeric
    - byte
    - short
    - integer
    - long
    - double
    - float
    - half_float
    - scaled_float – a floating-point number stored as a long; a scaling_factor must be specified
  - date – [Suggestion] store the value as a long in epoch milliseconds
  - Array
    - A document matches if any element of the array matches
    - When sorting, ascending order uses the smallest element of the array and descending order uses the largest
  - object
  - geo_point – latitude and longitude
- index – whether the field is indexed, that is, searchable – [default] true
- store – whether the field value is stored separately from _source; this speeds up retrieval but uses more disk space – [default] false
- analyzer – the analyzer (tokenizer)
  - Chinese
    - ik_max_word [commonly used] – finest-grained segmentation
    - ik_smart – coarsest-grained segmentation
- dynamic – how unknown fields are handled (dynamic mapping mode)
  - true – the field is mapped automatically
  - false – the field is ignored
  - strict – an error is raised
- date_detection – whether date detection is enabled; when set to false, strings are always mapped as strings
- dynamic_date_formats – the date formats used to recognize dates in strings
- dynamic_templates – applies different mappings depending on field name or data type
- refresh_interval – index refresh interval – [default] 1 second
- index.translog.durability – how the translog is flushed to disk – [default] request (fsync on every request); async flushes in the background
- index.translog.sync_interval – translog flush interval – [default] 5 seconds
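The three dynamic modes are easiest to compare side by side. The following Python function is a toy model of the behavior described above, not Elasticsearch code: `apply_dynamic` is a hypothetical name, and the type guess for new fields is deliberately crude.

```python
def apply_dynamic(properties, document, mode):
    """Toy model of the `dynamic` setting; returns the resulting field mappings.

    mode: True (map new fields), False (ignore them), "strict" (reject them).
    """
    known = dict(properties)
    for field, value in document.items():
        if field in known:
            continue
        if mode == "strict":
            raise ValueError(f"unknown field {field!r} rejected by strict mapping")
        if mode is True:
            # Very rough type guess; real dynamic mapping applies more rules.
            known[field] = {"type": "long" if isinstance(value, int) else "text"}
        # mode False: the value stays in _source but the field is not mapped
    return known

props = {"name": {"type": "text"}}
doc = {"name": "engineer", "age": 30}
print(apply_dynamic(props, doc, True))
print(apply_dynamic(props, doc, False))
```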
PUT /{index}
{
  "settings": {
    "number_of_shards": <number of shards>,
    "number_of_replicas": <number of replicas>,
    "refresh_interval": "<index refresh interval>",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
  },
  "mappings": {
    "dynamic": "<dynamic mapping mode>",
    "date_detection": <true or false>,
    "dynamic_date_formats": ["MM/dd/yyyy"],
    "properties": {
      "<field name>": {
        "<mapping attribute name>": "<mapping attribute value>"
      }
    }
  }
}
V. Search types
POST /{index}/_search
{
  "query": {
    "<search type>": {
      "<search field>": "<search value>"
    }
  },
  "sort": [
    { "<field to sort by>": { "order": "asc" } }
  ],
  "highlight": {
    "pre_tags": "<font color='pink'>",
    "post_tags": "</font>",
    "fields": [
      { "<field to highlight>": {} }
    ]
  },
  "from": <offset of the first result, i.e. (page number - 1) * page size>,
  "size": <number of results per page>
}
- match_all – matches all documents
- match – analyzes the search text into terms and searches for them; terms are combined with OR by default, and the operator attribute must be set to and to require all of them
- match_phrase – analyzes the search text; the target document must contain all terms in the same order
- multi_match – like match (terms combined with OR), but searches across several specified fields
  - Field names can contain the * wildcard – *_name
  - A field can be boosted with ^ – subject^3
- term – exact lookup; the search value is not analyzed
- query_string – searches a specified field or all fields; the query string is parsed using operators such as AND, OR, and ~
- range – range search, used for numbers and dates
- exists – matches documents in which the field has a non-null value
- prefix – prefix match
- wildcard – wildcard match
- regexp – regular-expression match
- fuzzy – fuzzy match
- bool – compound query
  - must – must match
  - filter – must match; does not affect the score; results are cached, so repeated searches are faster
  - should – should match
  - must_not – must not match; does not affect the score
- dis_max – when several queries match, only the highest score is used as the document score; by default (as in bool) the scores of the queries are added together
- suggest – search suggestions
  - completion – prefix-matches the search text and returns completions
    - preserve_separators – whether separators in the search text are preserved
    - preserve_position_increments – whether position increments are preserved; if disabled, a stop word at the start of a suggestion is ignored
  - phrase – analyzes the search text, scores candidate phrases against the original text, and returns suggestions
  - term – analyzes the search text and suggests corrections for each term
    - missing – suggests only when the term is not found in the index
    - always – suggests whether or not the term is found in the index
    - popular – suggests only terms that occur more frequently than the original term
  - context – like completion, with categories added for further filtering
Production recommendation:
completion → no matches → phrase → no matches → term
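The fallback chain recommended for production can be written as a small loop. This Python sketch assumes a hypothetical callable `run_suggester(kind, text)` that sends the suggest request and returns a list of strings; `fake_suggester` is a stand-in so the example runs without a cluster.

```python
def suggest_with_fallback(text, run_suggester,
                          order=("completion", "phrase", "term")):
    """Try each suggester in order and return the first non-empty result."""
    for kind in order:
        suggestions = run_suggester(kind, text)
        if suggestions:
            return kind, suggestions
    return None, []

def fake_suggester(kind, text):
    """Stand-in for a real client call; returns canned suggestions."""
    canned = {"phrase": {"elastic serch": ["elastic search"]}}
    return canned.get(kind, {}).get(text, [])

result = suggest_with_fallback("elastic serch", fake_suggester)
print(result)
```

The completion suggester is tried first, and the slower phrase and term suggesters are only queried when it returns nothing.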
VI. Aggregation analysis
"aggregations": {
  "<aggregation_name>": {                          <!-- aggregation name -->
    "<aggregation_type>": {                        <!-- aggregation type -->
      <aggregation_body>                           <!-- aggregation body: which fields are aggregated -->
    }
    [, "meta": { [<meta_data_body>] } ]?           <!-- metadata -->
    [, "aggregations": { [<sub_aggregation>]+ } ]? <!-- sub-aggregations -->
  }
  [, "<aggregation_name_2>": { ... } ]*            <!-- aggregation name -->
}
1. Aggregation methods
- Metric aggregation
- Bucket aggregation – documents are grouped into buckets first, then statistics are computed per bucket
2. Metrics
- Maximum – max
- Minimum – min
- Sum – sum
- Average – avg
- Count – count
- Number of documents in which the field has a value – value_count
- Distinct count – cardinality
- Statistics – stats – includes max, min, sum, avg, and count
- Extended statistics – extended_stats – additionally includes sum of squares, variance, and standard deviation
- Percentiles – percentiles – the percentiles to compute can be specified
- Percentile ranks – percentile_ranks – the percentage of values at or below the given values
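A bucket aggregation with a metric nested inside it is the most common combination of the two methods. The Python sketch below builds such a request body; the helper name `terms_with_metric` and the `city` field are illustrative assumptions, while `payment` comes from the mapping example earlier in these notes.

```python
def terms_with_metric(bucket_field, metric_type, metric_field, size=10):
    """Build a terms bucket aggregation with one metric sub-aggregation."""
    return {
        "size": 0,                                   # return aggregations only, no hits
        "aggs": {
            f"by_{bucket_field}": {
                "terms": {"field": bucket_field, "size": size},
                "aggs": {
                    f"{metric_type}_{metric_field}": {
                        metric_type: {"field": metric_field}
                    }
                },
            }
        },
    }

agg_body = terms_with_metric("city", "stats", "payment")
print(agg_body)
```

Swapping `"stats"` for `"avg"`, `"max"`, or `"cardinality"` changes only the metric computed within each bucket.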
VII. Distributed cluster
1. Roles
- Cluster – a group of nodes that all share a common cluster name
- Node
  - master – whether the node is eligible to be elected master – [default] true
  - data – whether the node stores data – [default] true
- Shard – a partition of an index's data
  - The number of primary shards cannot be changed without rebuilding the index
  - By default, each primary shard has one replica shard, and the two are never placed on the same node
2. Characteristics
- New nodes are discovered automatically
- Nodes are peers – any node can receive a request and forward it to the nodes that hold the data
- When a node goes down, the lost data is recovered from replica shards
- Search latency is on the order of hundreds of milliseconds
3. Building and planning
Principles
- With 30 GB of JVM memory per node, cap each shard at 30 GB, then derive the total number of shards from the data volume
- The number of nodes is the total number of shards divided by 1.5 to 3
- Use 2 replicas to ensure high availability
- When search performance degrades, add replicas to increase concurrent search capacity
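The first two sizing principles are simple arithmetic. The following Python sketch works through them for an arbitrary 600 GB data set; `plan_cluster` is a hypothetical helper, and whether the data volume passed in already includes replica copies is left to the caller.

```python
import math

def plan_cluster(data_gb, max_shard_gb=30, shards_per_node=(1.5, 3)):
    """Estimate shard and node counts from the rules of thumb above."""
    shards = max(1, math.ceil(data_gb / max_shard_gb))
    low, high = shards_per_node
    min_nodes = max(1, math.ceil(shards / high))   # densest packing: 3 shards per node
    max_nodes = max(1, math.ceil(shards / low))    # sparsest packing: 1.5 shards per node
    return {"shards": shards, "min_nodes": min_nodes, "max_nodes": max_nodes}

plan = plan_cluster(600)   # 600 GB is an example figure, not a recommendation
print(plan)
```

For 600 GB this gives 20 shards and between 7 and 14 nodes.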
Applications
- Search – tens of millions to billions of documents – 2 to 4 nodes
- Online analytical processing – ELK – billions of documents or more – dozens to hundreds of nodes
4. Consistency assurance
- ?wait_for_active_shards={number of shard copies}&timeout={timeout}
VIII. Relationships
- Application-side joins – the indices stay independent and the application joins them – suits a small number of documents
- Data denormalization and nested objects (nested documents)
  - Field redundancy trades indexing performance for search performance
  - Redundant fields should rarely change
  - Suits a small number of relationships
  - Suits read-heavy, write-light workloads
- Parent/child relationships (join documents)
  - Trades search performance for indexing performance
  - A search cannot return parent and child documents together
  - Parent and child documents must be on the same shard
  - Suits write-heavy, read-light workloads
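The same-shard requirement for parent and child documents is met by routing each child on its parent's ID. The Python sketch below builds a join-field mapping and the request for one child document; the helper names, the `relation` field, and the `company`/`job` relation are illustrative assumptions.

```python
def join_mapping(field, parent, child):
    """Mapping fragment declaring a parent/child relation with a join field."""
    return {"properties": {field: {"type": "join", "relations": {parent: child}}}}

def child_request(index, doc_id, parent_id, field, child, source):
    """Path and body for indexing a child document with PUT.

    routing=<parent id> sends the child to the shard that holds its parent.
    """
    path = f"/{index}/_doc/{doc_id}?routing={parent_id}"
    body = dict(source)
    body[field] = {"name": child, "parent": parent_id}
    return path, body

mapping = join_mapping("relation", "company", "job")
path, body = child_request("companies", "j1", "c1", "relation", "job",
                           {"title": "java engineer"})
print(path)
print(body)
```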
IX. Persistence
- refresh
  - Writes the in-memory buffer to a new segment, making the documents searchable
  - [Default] runs every 1 second
- flush
  - Flushes all segments to disk, creates a commit point, and clears the translog
  - [Default] runs every 30 minutes, or when the translog reaches its size threshold
When a node crashes and restarts, the translog is replayed from the last commit point to recover the data.
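The interplay of refresh, flush, and translog replay can be illustrated with a toy model. The Python class below is a deliberately simplified sketch, not how Lucene or Elasticsearch is implemented; it assumes the translog survives a crash (as with per-request fsync) while unflushed segments do not.

```python
class ToyShard:
    """Toy model of refresh, flush and translog replay."""

    def __init__(self):
        self.buffer = []      # indexed but not yet searchable
        self.segments = []    # searchable, but not yet covered by a commit point
        self.committed = []   # covered by the last commit point (safe on disk)
        self.translog = []    # operations since the last commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        self.segments.extend(self.buffer)   # new segment: documents become searchable
        self.buffer.clear()

    def flush(self):
        self.refresh()
        self.committed.extend(self.segments)  # write a commit point
        self.segments.clear()
        self.translog.clear()

    def search(self):
        return self.committed + self.segments

    def crash_and_recover(self):
        self.buffer.clear()                   # uncommitted in-memory state is lost
        self.segments.clear()
        self.buffer.extend(self.translog)     # replay from the last commit point
        self.refresh()

shard = ToyShard()
shard.index("a")
shard.flush()
shard.index("b")
shard.refresh()
shard.index("c")                              # indexed but not refreshed yet
visible_before = shard.search()
shard.crash_and_recover()
visible_after = shard.search()
print(visible_before, visible_after)
```

Document "c" is not searchable before the crash because no refresh has run, yet it is recovered afterwards because it was recorded in the translog.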
X. Concurrency control
- Built-in versioning – ?if_seq_no={sequence number}&if_primary_term={primary term}
- Custom versioning – ?version={version}&version_type=external
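Both parameters implement optimistic concurrency control: a write succeeds only if the document has not changed since it was read. The Python sketch below mimics the if_seq_no / if_primary_term check with an in-memory store; `ToyIndex` and `VersionConflict` are hypothetical names and the model ignores primary-term changes.

```python
class VersionConflict(Exception):
    """Raised when the expected sequence number or primary term is stale."""

class ToyIndex:
    """Toy compare-and-set store mimicking if_seq_no / if_primary_term checks."""

    def __init__(self, primary_term=1):
        self.primary_term = primary_term
        self.next_seq_no = 0
        self.docs = {}   # id -> (source, seq_no, primary_term)

    def put(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            expected = (if_seq_no, if_primary_term)
            if current is None or (current[1], current[2]) != expected:
                raise VersionConflict(doc_id)
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = (source, seq_no, self.primary_term)
        return seq_no, self.primary_term

store = ToyIndex()
seq_no, term = store.put("1", {"stock": 10})
store.put("1", {"stock": 9}, if_seq_no=seq_no, if_primary_term=term)       # succeeds
try:
    store.put("1", {"stock": 8}, if_seq_no=seq_no, if_primary_term=term)   # stale values
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True
print(stale_write_rejected)
```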
XI. Pagination
- from + size – the usual paging method; deep paging causes performance problems
- scroll – keeps a snapshot of all matching results – not suitable for real-time search, suitable for background batch processing
- search_after – determines the next page from the sort values of the last hit on the previous page – cannot jump to an arbitrary page
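The difference between from + size and search_after shows up in the request body. The Python sketch below builds both; the helper names, the bool query, and the `job_id` tiebreaker field are illustrative assumptions. Note that from + size is also capped by index.max_result_window, which defaults to 10,000.

```python
def from_size_body(query, page, page_size):
    """from + size paging: convert a 1-based page number to an offset."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"query": query, "from": (page - 1) * page_size, "size": page_size}

def search_after_body(query, sort, page_size, last_hit=None):
    """search_after paging: continue from the sort values of the previous last hit.

    `sort` should end with a unique tiebreaker field so the order is total.
    """
    body = {"query": query, "sort": sort, "size": page_size}
    if last_hit is not None:
        body["search_after"] = last_hit["sort"]   # sort values returned with each hit
    return body

query = {"bool": {
    "must": [{"match": {"job": "engineer"}}],
    "filter": [{"range": {"payment": {"gte": 10000}}}],
}}
sort = [{"payment": "desc"}, {"job_id": "asc"}]

page3 = from_size_body(query, page=3, page_size=20)
first = search_after_body(query, sort, 20)
last_hit = {"_id": "42", "sort": [15000.0, "job-42"]}   # shape of a hit in the response
second = search_after_body(query, sort, 20, last_hit)
print(page3["from"], second["search_after"])
```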
XII. Performance optimization
- Set the number of replicas to 0 for the initial bulk import
- Use auto-generated document IDs to avoid the disk read needed to check for an existing ID
- Do not analyze or index unimportant fields
- Increase the index refresh interval – default 1 second
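Two of these tips are index settings that can be changed at runtime with PUT /{index}/_settings. The Python sketch below builds the settings to apply before and after a large import; the helper names and the "30s" interval are assumptions chosen for illustration.

```python
def bulk_load_settings(refresh_interval="30s"):
    """Settings to apply before a large initial import."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": refresh_interval}}

def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply once the import has finished."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": refresh_interval}}

before = bulk_load_settings()
after = restore_settings(replicas=2)
print(before)
print(after)
```

Restoring the replica count after the import lets Elasticsearch copy finished segments to the replicas instead of indexing every document several times.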