Historically, Elasticsearch has relied on a schema-on-write approach to make searching data fast. We have now added schema on read to Elasticsearch, giving users the flexibility to change a document's schema after ingestion and to generate fields that exist only as part of the search query. Together, schema on read and schema on write let users balance performance and flexibility according to their needs.
Our schema-on-read solution is runtime fields, which are evaluated only at query time. They are defined in the index mapping or in the query, and once defined they are immediately available to search requests, aggregations, filtering, and sorting. Because runtime fields are not indexed, adding one does not increase the size of the index. In fact, they can reduce storage costs and speed up ingestion.
But there are trade-offs. Queries on runtime fields can be expensive, so the data you commonly search or filter on should still be mapped to indexed fields. Runtime fields can also slow down searches, even though your index is smaller. We recommend combining runtime fields with indexed fields to find the right balance between ingest speed, index size, flexibility, and search performance for your use case.
Adding Runtime Fields is easy
The easiest way to define a runtime field is in the query itself. For example, suppose we have the following index:
PUT my_index
{
"mappings": {
"properties": {
"address": {
"type": "ip"
},
"port": {
"type": "long"
}
}
}
}
And load some documents into it:
POST my_index/_bulk
{"index":{"_id":"1"}}
{"address":"1.2.3.4","port":"80"}
{"index":{"_id":"2"}}
{"address":"1.2.3.4","port":"8080"}
{"index":{"_id":"3"}}
{"address":"2.4.8.16","port":"80"}
We can concatenate the two fields, separated by a static string, as follows:
GET my_index/_search
{
  "runtime_mappings": {
    "socket": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['address'].value + ':' + doc['port'].value)"
      }
    }
  },
  "fields": [
    "socket"
  ],
  "query": {
    "match": {
      "socket": "1.2.3.4:8080"
    }
  }
}
This produces the following response:
{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "address" : "1.2.3.4",
          "port" : "8080"
        },
        "fields" : {
          "socket" : [
            "1.2.3.4:8080"
          ]
        }
      }
    ]
  }
}
We defined the socket field in the runtime_mappings section. A short Painless script defines how the value of socket is calculated for each document, using + to concatenate the value of the address field, the static string ':', and the value of the port field. We then used the socket field in the query. It is an ephemeral runtime field that exists only for that query and is evaluated when the query runs. When you write a Painless script for a runtime field, you must call emit to return the calculated value.
If we find that socket is a field we want to use in multiple queries, without having to define it in each one, we can add it to the index mapping with a single call:
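To make the evaluation model concrete, here is a minimal Python sketch that emulates what such a search does: it computes socket for each document at query time and compares it with the requested value. This is only an illustration, not how Elasticsearch is implemented; the function names and the sample document values are assumptions.

```python
# Illustrative emulation only: Elasticsearch runs the Painless script itself.
# Sample documents (values assumed for illustration).
DOCS = {
    "1": {"address": "1.2.3.4", "port": 80},
    "2": {"address": "1.2.3.4", "port": 8080},
    "3": {"address": "2.4.8.16", "port": 80},
}


def socket_value(doc):
    """Mirror of emit(doc['address'].value + ':' + doc['port'].value)."""
    return f"{doc['address']}:{doc['port']}"


def search_by_socket(docs, wanted):
    """Return the ids of documents whose computed socket equals `wanted`."""
    return [doc_id for doc_id, doc in docs.items() if socket_value(doc) == wanted]


print(search_by_socket(DOCS, "1.2.3.4:8080"))  # prints ['2']
```

Nothing is stored for socket here: the value is derived from the other two fields every time the search runs, which is the property that keeps the index small.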
PUT my_index/_mapping
{
"runtime": {
"socket": {
"type": "keyword",
"script": {
"source": "emit(doc['address'].value + ':' + doc['port'].value)"
}
}
}
}
The query then no longer needs to include the definition of the socket field, for example:
GET my_index/_search
{
  "fields": [
    "socket"
  ],
  "query": {
    "match": {
      "socket": "1.2.3.4:8080"
    }
  }
}
The "fields": ["socket"] clause is needed only if you want the value of the socket field to be displayed. The socket field can now be used in any query, yet it does not exist in the index and does not increase the index size. It is computed only for the queries that need it and only for the documents that need it.
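The two placements of a runtime field definition can be summarized in a short Python sketch that only builds the request bodies. The helper names are hypothetical, and nothing here talks to a cluster; the bodies would still have to be sent to the _search and _mapping endpoints used in this article.

```python
import json

# The runtime field definition from the article, written once.
SOCKET_FIELD = {
    "socket": {
        "type": "keyword",
        "script": {
            "source": "emit(doc['address'].value + ':' + doc['port'].value)"
        },
    }
}


def search_body(query, fields, runtime_fields=None):
    """Body for GET <index>/_search; runtime_fields is scoped to this request only."""
    body = {"fields": fields, "query": query}
    if runtime_fields:
        body["runtime_mappings"] = runtime_fields
    return body


def mapping_body(runtime_fields):
    """Body for PUT <index>/_mapping; makes the fields available to every query."""
    return {"runtime": runtime_fields}


query = {"match": {"socket": "1.2.3.4:8080"}}

# Option 1: define the field inside the search request.
print(json.dumps(search_body(query, ["socket"], SOCKET_FIELD), indent=2))

# Option 2: define it once in the mapping, then search without the definition.
print(json.dumps(mapping_body(SOCKET_FIELD), indent=2))
print(json.dumps(search_body(query, ["socket"]), indent=2))
```

Keeping the definition in one place, as in SOCKET_FIELD, makes it easy to start with the per-request form while experimenting and move to the mapping once the field proves useful.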
Use it like any other field
Because runtime fields are exposed through the same API as indexed fields, a single query can span some indices where a field is a runtime field and others where the same field is an indexed field. You have the flexibility to choose which fields to index and which to keep as runtime fields. This separation between field generation and field consumption promotes more organized code that is easier to create and maintain.
You can define runtime fields in the index mapping or in the search request. This gives you flexibility in how you use runtime fields and indexed fields together.
Overwrite field values at query time
Too often, you discover errors in production data when it is already too late. While it is easy to fix the ingest instructions for documents that will be ingested in the future, it is much harder to fix data that has already been ingested and indexed. With runtime fields, you can fix errors in indexed data by overriding values at query time: a runtime field shadows an indexed field with the same name.
Here is a simple example to make this more concrete. Suppose we have an index with a raw_message field and an address field:
PUT my_raw_index
{
"mappings": {
"properties": {
"raw_message": {
"type": "keyword"
},
"address": {
"type": "ip"
}
}
}
}
Then load a document into it:
POST my_raw_index/_doc/1
{
  "raw_message": "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] GET /history/apollo/ HTTP/1.0 200 6245",
  "address": "1.2.3.4"
}
Alas, the document contains the wrong IP address (1.2.3.4) in the address field. The correct IP address is present in raw_message, but somehow the wrong address was parsed out before the document was sent to Elasticsearch and indexed. This is not a problem for a single document, but what if a month later we find that 10% of our documents contain the wrong address? Fixing it for new documents is not much of a challenge, but reindexing documents that have already been ingested is often operationally complex. With runtime fields, you can fix it immediately by shadowing the indexed field with a runtime field. This is how it is done in a query:
GET my_raw_index/_search
{
  "runtime_mappings": {
    "address": {
      "type": "ip",
      "script": """Matcher m = /\d+\.\d+\.\d+\.\d+/.matcher(doc["raw_message"].value); if (m.find()) emit(m.group());"""
    }
  },
  "fields": [
    "address"
  ]
}
The command above returns:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_raw_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "raw_message" : "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] GET /history/apollo/ HTTP/1.0 200 6245",
          "address" : "1.2.3.4"
        },
        "fields" : {
          "address" : [
            "199.72.81.55"
          ]
        }
      }
    ]
  }
}
You can also add this definition to the index mapping to make it available to all queries. Note that regular expressions are now enabled by default in Painless scripts.
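The shadowing logic can be tried out locally with a small Python sketch that mirrors the Painless script: find the first IP-like token in raw_message and return it, or return nothing when there is no match. The function name is hypothetical, and Python's re and ipaddress modules stand in for what Elasticsearch does with the script and the ip field type.

```python
import ipaddress
import re

# Same pattern as the Painless script: four dot-separated groups of digits.
IP_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")


def runtime_address(source):
    """Emulate the script: return the first IP-like token in raw_message, or None."""
    match = IP_PATTERN.search(source["raw_message"])
    if match is None:
        return None  # the Painless script emits nothing in this case
    # The runtime field has type "ip"; ip_address() raises ValueError if the
    # matched text is not a valid address.
    return str(ipaddress.ip_address(match.group()))


doc = {
    "raw_message": "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "
                   "GET /history/apollo/ HTTP/1.0 200 6245",
    "address": "1.2.3.4",
}

print(doc["address"])        # the stored value is untouched: 1.2.3.4
print(runtime_address(doc))  # the value a query would see: 199.72.81.55
```

The stored document is never modified. Only the value seen by queries changes, which is why the fix takes effect immediately and without reindexing.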
Balance performance and flexibility
With indexed fields, you do all the preparation work during ingestion and maintain sophisticated data structures for optimal performance. Querying runtime fields, however, is slower than querying indexed fields. So what if your queries become slow once you start using runtime fields?
We recommend using asynchronous search when querying runtime fields. If the query completes within a given time threshold, the complete result set is returned, just as in a synchronous search. If the query has not completed by then, you still get a partial result set, and the search keeps running until the complete result set is available. This mechanism is particularly useful when managing an index lifecycle, because more recent results are usually returned first and are often more important to the user.
For best performance, we rely on indexed fields to do the heavy lifting of the query, so that runtime field values have to be computed for only a subset of the documents.
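A client-side polling loop for this pattern might look like the following Python sketch. The transport is deliberately left out: submit and fetch are hypothetical callables that the caller would implement on top of the _async_search endpoints, and the sketch relies only on the id, is_running, and is_partial fields of the async search response.

```python
import time


def run_async_search(submit, fetch, poll_interval=1.0, max_polls=30):
    """Submit an async search and poll until it finishes or max_polls is reached.

    `submit` and `fetch` are hypothetical callables supplied by the caller:
    submit() would wrap POST <index>/_async_search, and fetch(search_id) would
    wrap GET _async_search/<search_id>. Both return the parsed JSON response.
    """
    result = submit()
    polls = 0
    while result.get("is_running") and polls < max_polls:
        time.sleep(poll_interval)
        result = fetch(result["id"])
        polls += 1
    return result


# Stand-in transport for demonstration; a real client would issue HTTP requests.
_later = iter([
    {"id": "abc", "is_running": True, "is_partial": True},
    {"id": "abc", "is_running": False, "is_partial": False},
])
final = run_async_search(
    submit=lambda: {"id": "abc", "is_running": True, "is_partial": True},
    fetch=lambda search_id: next(_later),
    poll_interval=0.0,
)
print(final["is_partial"])  # prints False
```

A caller that wants to show partial results as they arrive could render each intermediate result inside the loop instead of waiting for the final one.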
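The effect of letting an indexed field narrow the search first can be illustrated with a toy Python emulation that counts how many runtime values have to be computed. The documents and the search function are assumptions made for illustration, and a real indexed filter is an index lookup rather than the list scan used here.

```python
# Toy data set; values assumed for illustration.
DOCS = [
    {"address": "1.2.3.4", "port": 80},
    {"address": "1.2.3.4", "port": 8080},
    {"address": "2.4.8.16", "port": 80},
]


def search(docs, wanted_socket, port=None):
    """Return (hits, number of runtime values computed).

    When `port` is given, it plays the role of a filter on an indexed field
    that narrows the candidates before any runtime value is computed.
    """
    candidates = [d for d in docs if port is None or d["port"] == port]
    hits = [d for d in candidates
            if f"{d['address']}:{d['port']}" == wanted_socket]
    return hits, len(candidates)


print(search(DOCS, "1.2.3.4:8080")[1])             # prints 3
print(search(DOCS, "1.2.3.4:8080", port=8080)[1])  # prints 1
```

Both searches return the same hit, but the second computes the runtime value for one document instead of three.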
Change a field from runtime to indexed
Runtime fields give users the flexibility to change their mapping and parsing while working with data in a live environment. Because a runtime field takes up no space in the index, and because the script that defines it can be changed, users can experiment until they reach the optimal mapping. If you find that a runtime field is useful in the long run, you can precompute its value at index time by simply defining the field in the template as an indexed field and ensuring that ingested documents include it. The field will be indexed from the next index rollover and will provide better performance. Queries that use the field do not need to change at all.
This approach is particularly useful with dynamic mapping. On the one hand, allowing new documents to generate new fields is very helpful, because the data in them can be used immediately (the structure of log entries often changes, for example, due to changes in the software that generates the logs). On the other hand, dynamic mapping runs the risk of burdening the index and even causing a mapping explosion, because you never know whether some document will surprise you with 2000 new fields. Runtime fields offer a solution for this situation. New fields can be created automatically as runtime fields, so they do not burden the index (because they do not exist in it) and do not count toward index.mapping.total_fields.limit. These automatically created runtime fields can be queried, albeit with lower performance, so users can work with them and decide to change them to indexed fields at the next rollover if needed.
We recommend initially using runtime fields to experiment with your data structure. After working with the data, you may decide to index a runtime field to improve search performance. You can create a new index, add the field definition to the index mapping, add the field to _source, and ensure that the new field is included in ingested documents. If you are using data streams, you can update the index template so that Elasticsearch knows to index the field when it creates an index from that template. In a future release, we plan to make changing a runtime field into an indexed field as easy as moving the field from the runtime section of the mapping to the properties section.
The following request creates a simple index mapping with a timestamp field. Including "dynamic": "runtime" instructs Elasticsearch to dynamically create additional fields in this index as runtime fields. If a runtime field includes a Painless script, the value of the field is computed by that script. If a runtime field is created without a script, as in the following request, the system looks in _source for a field with the same name as the runtime field and uses its value as the value of the runtime field.
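As a rough illustration of how new fields end up in the runtime section rather than among the indexed properties, the following Python sketch emulates a simplified version of that decision. It is an assumption-laden model rather than Elasticsearch code: the function names are hypothetical, only top-level scalar values are handled, and date detection on strings is ignored.

```python
def infer_runtime_type(value):
    """Simplified type detection for a new field under "dynamic": "runtime"."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "keyword"  # date detection on strings is ignored in this sketch
    return None  # objects, arrays and nulls are out of scope here


def add_dynamic_runtime_fields(mapping, source):
    """Add unmapped top-level fields of `source` to the mapping's runtime section."""
    properties = mapping.setdefault("properties", {})
    runtime = mapping.setdefault("runtime", {})
    for name, value in source.items():
        if name in properties or name in runtime:
            continue  # already mapped, nothing to add
        field_type = infer_runtime_type(value)
        if field_type is not None:
            runtime[name] = {"type": field_type}
    return mapping


mapping = {
    "dynamic": "runtime",
    "properties": {"timestamp": {"type": "date", "format": "yyyy-MM-dd"}},
}
# Hypothetical document with two fields that are not in the mapping yet.
doc = {"timestamp": "2021-01-01", "message": "my message", "voltage": "12"}

add_dynamic_runtime_fields(mapping, doc)
print(mapping["runtime"])
print(len(mapping["properties"]))  # prints 1: no new indexed fields were added
```

Note that the string "12" becomes a keyword in this model, so a numeric type has to be assigned explicitly if range queries on the value are needed.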
PUT my_index-1
{
"mappings": {
"dynamic": "runtime",
"properties": {
"timestamp": {
"type": "date",
"format": "yyyy-MM-dd"
}
}
}
}
Let's index a document to see the benefit of these settings:
POST my_index-1/_doc/1
{
"timestamp": "2021-01-01",
"message": "my message",
"voltage": "12"
}
We now have an indexed timestamp field and two runtime fields (message and voltage). Let's look at the index mapping:
GET my_index-1/_mapping
The runtime section includes message and voltage. These fields are not indexed, but we can still query them exactly as if they were indexed fields. The command above shows:
{
"my_index-1" : {
"mappings" : {
"dynamic" : "runtime",
"runtime" : {
"message" : {
"type" : "keyword"
},
"voltage" : {
"type" : "keyword"
}
},
"properties" : {
"timestamp" : {
"type" : "date",
"format" : "yyyy-MM-dd"
}
}
}
}
}
We’ll create a simple search request to query the message field:
GET my_index-1/_search
{
"query": {
"match": {
"message": "my message"
}
}
}
The response includes the following hit:
{
  "took" : 999,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index-1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2021-01-01",
          "message" : "my message",
          "voltage" : "12"
        }
      }
    ]
  }
}
After reviewing this response, we notice a problem: we did not specify that voltage is a number! Because voltage is a runtime field, this is easy to fix by updating the field definition in the runtime section of the mapping:
PUT my_index-1/_mapping
{
"runtime":{
"voltage":{
"type": "long"
}
}
}
The previous request changes voltage to type long, which takes effect immediately for documents that are already indexed. To test this behavior, we construct a simple query for all documents with a voltage between 11 and 13:
GET my_index-1/_search
{
"query": {
"range": {
"voltage": {
"gt": 11,
"lt": 13
}
}
}
}
Because our voltage is 12, the query returns the document in my_index-1. If we view the mapping again, we will see that voltage is now a runtime field of type long, and the new type applies even to the document that was ingested into Elasticsearch before the field type was updated in the mapping:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index-1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2021-01-01",
          "message" : "my message",
          "voltage" : "12"
        }
      }
    ]
  }
}
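Why the type change applies to a document that was ingested earlier can be seen in a small Python emulation of a script-less runtime field: the value is read from _source at query time and interpreted according to whatever type the mapping declares at that moment. The function names and the cast table are assumptions made for illustration, not Elasticsearch internals.

```python
# How each declared runtime type interprets the raw _source value (simplified).
CASTS = {"keyword": str, "long": int, "double": float}


def runtime_value(source, name, runtime_mapping):
    """Script-less runtime field: read `name` from _source, cast to the declared type."""
    field_type = runtime_mapping[name]["type"]
    return CASTS[field_type](source[name])


def range_match(source, name, runtime_mapping, gt, lt):
    """Emulate a range query with exclusive bounds on a numeric runtime field."""
    return gt < runtime_value(source, name, runtime_mapping) < lt


# The stored document never changes; only the mapping does.
source = {"timestamp": "2021-01-01", "message": "my message", "voltage": "12"}

runtime = {"voltage": {"type": "keyword"}}
print(repr(runtime_value(source, "voltage", runtime)))  # prints '12'

runtime["voltage"] = {"type": "long"}  # the mapping update
print(repr(runtime_value(source, "voltage", runtime)))  # prints 12
print(range_match(source, "voltage", runtime, 11, 13))  # prints True
```

Because the interpretation happens at query time, there is nothing to reindex when the declared type changes.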
Later, we may decide that voltage is useful in aggregations and that we want it indexed in the next index created in the data stream. We create a new index (my_index-2) that matches the data stream's index template and define voltage as an integer, since we now know which data type we want after experimenting with runtime fields.
Ideally, we would update the index template itself so that the change takes effect at the next rollover. You can run a query on the voltage field across any indices that match the my_index* pattern, even though the field is a runtime field in one index and an indexed field in another.
PUT my_index-2
{
"mappings": {
"dynamic": "runtime",
"properties": {
"timestamp": {
"type": "date",
"format": "yyyy-MM-dd"
},
"voltage":
{
"type": "integer"
}
}
}
}
With runtime fields, we have therefore introduced a new field lifecycle workflow. In this workflow, a field can be generated automatically as a runtime field, without affecting resource consumption and without risking a mapping explosion, so users can start working with the data immediately. While the field is still a runtime field, its mapping can be refined against real data, and thanks to the flexibility of runtime fields these changes take effect even for documents that have already been ingested into Elasticsearch. Once it is clear that the field is useful, you can change the template so that the field is indexed for best performance in the indices created from that point on (after the next rollover).
Conclusion
In most cases, and especially if you know your data and how you intend to use it, indexed fields are the way to go because of their performance advantage. On the other hand, when you need flexibility in document parsing and schema structure, runtime fields now provide the answer.
Runtime fields and indexed fields are complementary features that form a symbiotic relationship. Runtime fields offer flexibility, but they will not perform well at scale without the support of the index. The strong, robust structure of the index provides a sheltered environment in which the flexibility of runtime fields can show its true colors, not unlike the way coral shelters algae. Everyone benefits from this symbiosis.