Like the Grok processor, the Dissect processor extracts structured fields from a single text field in a document. Unlike the Grok processor, however, it does not use regular expressions. This makes Dissect's syntax simpler and, in some cases, faster than Grok.
Dissect matches a single text field against a defined pattern. We covered the Grok and Dissect processors in a previous article, "Elastic Observability – Structuring Data using Pipeline." In this article, we take a closer look at the Dissect processor with some hands-on examples.
Hands-on practice
A simple example
Let’s start with a simple example:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} [%{loglevel}] %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"
      }
    }
  ]
}
Above, we extract fields from the message according to the pattern. With Dissect, pay particular attention to whitespace: if the spaces in the pattern do not match those in the text, parsing fails with an error. The result of the above is:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : 2019-09-29t00:39:02.912z ", "logLevel" : "Debug", "message" : [Debug] MyApp stopped", "status" : "MyApp stopped"}, "_ingest" : {"timestamp" : "2020-12-09T04.40:40.894589z"}}}]}Copy the code
It extracts the @timestamp, loglevel, and status fields. Note that the [ and ] characters are consumed as delimiters, so they do not appear in the extracted values.
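To make the mechanics concrete, here is a minimal Python sketch of this idea (my own illustration, not the actual Elasticsearch implementation; the function name `dissect` is hypothetical). It extracts values by scanning for the literal delimiters that sit between `%{key}` placeholders, with no regular expressions, and assumes the pattern starts with a placeholder:

```python
def dissect(pattern, text):
    """Split `text` on the literal delimiters that sit between %{key}
    placeholders in `pattern` (assumes the pattern starts with %{...})."""
    # First, break the pattern into (key, literal-that-follows) pairs.
    parts = []
    i = 0
    while i < len(pattern):
        start = pattern.index("%{", i)
        end = pattern.index("}", start)
        key = pattern[start + 2:end]
        nxt = pattern.find("%{", end)
        literal = pattern[end + 1:nxt] if nxt != -1 else pattern[end + 1:]
        parts.append((key, literal))
        i = nxt if nxt != -1 else len(pattern)
    # Then walk the text, cutting each value at the next literal delimiter.
    result, pos = {}, 0
    for key, literal in parts:
        cut = text.index(literal, pos) if literal else len(text)
        result[key] = text[pos:cut]
        pos = cut + len(literal)
    return result

print(dissect("%{@timestamp} [%{loglevel}] %{status}",
              "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"))
# {'@timestamp': '2019-09-29T00:39:02.912Z', 'loglevel': 'Debug', 'status': 'MyApp stopped'}
```

Because everything between placeholders is a plain literal, matching is a series of substring searches, which is why Dissect can be faster than regular-expression-based Grok.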
Skipping a field
Since Dissect performs an exact match, in practice there may be parts of the text that must be matched but that we do not want to keep as fields in our document. Consider the following example:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} [%{?loglevel}] %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"
      }
    }
  ]
}
In the example above, we used %{?loglevel}, which indicates that we do not want loglevel in the result:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : [Debug] MyApp stopped", "status" : "2019-09-29t00:39:02.912z ", "message" : "MyApp stopped"}, "_ingest" : {"timestamp" : "2020-12-09t04:47:24.7823z"}}}]}Copy the code
As expected, the loglevel field is absent from this output.
Handling multiple spaces
The Dissect processor is very strict: whitespace must match exactly, or parsing fails. For example:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29  MyApp stopped"
      }
    }
  ]
}
Here the message contains more than one space between the date and MyApp stopped, while the pattern expects exactly one. The result of running the above is:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : "2019-09-29", "message" : "2019-09-29 MyApp stopped", "status" : "" }, "_ingest" : { "timestamp" : "2020-12-09T05:01:58.065065z"}}}]}Copy the code
As you can see from the results above, the message is not parsed correctly: the status field is empty. So how do we deal with this? We can use the right padding modifier -> to ignore the padding:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp->} %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29  MyApp stopped"
      }
    }
  ]
}
The result of the above run is:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : "2019-09-29", "message" : "2019-09-29 MyApp stopped", "status" : "MyApp stopped" }, "_ingest" : { "timestamp" : "2020-12-09T05:07:23.294188z"}}}]}Copy the code
We can also use an empty key together with the padding modifier to skip unwanted spaces:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "[%{@timestamp}]%{->}[%{status}]"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "[2019-09-29] [MyApp stopped]"
      }
    },
    {
      "_source": {
        "message": "[2019-09-29]  [MyApp stopped]"
      }
    }
  ]
}
Above, we used %{->} to match the unwanted spaces. We supplied two documents: one containing a single space and one containing two spaces. The results are as follows:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : "2019-09-29", "message" : "[2019-09-29] [MyApp stopped]", "status" : "MyApp stopped" }, "_ingest" : { "timestamp" : "The 2020-12-09 T05: they. 752694 z"}}}, {" doc ": {" _index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : "2019-09-29", "message" : "[2019-09-29] [MyApp stopped]", "status" : "MyApp stopped" }, "_ingest" : {"timestamp" : "2020-12-09t05:21:14.752701z"}}}]}Copy the code
Appending fields
In many cases, we want to combine several matched values into a single field, for example:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{+@timestamp} %{+@timestamp} %{loglevel} %{status}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "Oct 29 00:39:02 Debug MyApp stopped"
      }
    }
  ]
}
Above, the time expression Oct 29 00:39:02 is made up of three strings. We combine these three strings into a single @timestamp field using %{@timestamp} %{+@timestamp} %{+@timestamp}. The result of running the above is:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : "Oct 29 00:39:02", "loglevel" : "Debug", "message" : "Oct 29 00:39:02 Debug MyApp stopped", "status" : "MyApp stopped"}, "_ingest" : {"timestamp" : "2020-12-09t05:27:29.785206z"}}}]}Copy the code
Note that in the example above we set append_separator to a single space. Otherwise the three strings would be concatenated directly, producing Oct2900:39:02, which is usually not what we want in practice.
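The append behavior reduces to collecting every value matched for the same key and joining them with append_separator, whose default is an empty string. A quick sketch in Python (`append_values` is an illustrative name, not an Elasticsearch API):

```python
def append_values(values, separator=""):
    """Join the values matched by %{key} %{+key} %{+key} using the
    configured append_separator (the processor's default is "")."""
    return separator.join(values)

matched = ["Oct", "29", "00:39:02"]
print(append_values(matched, " "))  # 'Oct 29 00:39:02' -- with append_separator " "
print(append_values(matched))       # 'Oct2900:39:02'   -- the default concatenation
```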
Extracting key-value pairs
We can use %{*field} to capture a key name and %{&field} to capture its value:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor key-value",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{*field1}=%{&field1} %{*field2}=%{&field2}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK"
      }
    }
  ]
}
The result of the above run is:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "@timestamp" : "2019009-29T00: 39:02.912z ", "host" : "AppServer", "message" : "2019009-29T00: 39:02.912z host=AppServer status=STATUS_OK", "status" : "STATUS_OK"}, "_INGest" : {"timestamp" : "2020-12-09T05:34:30.47561z"}}}]}Copy the code
Challenge yourself
From the exercises above, you can see that the Dissect processor is both useful and simple to use. Now let's tackle a more realistic example:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": []
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}
This is a HAProxy log. It is one long message. How can we use processors to turn it into a structured document?
We can use the Dissect processor. Based on what we have learned so far, we can start like this:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}
Above, we concatenate the first three strings into a timestamp field. Running the above gives:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "timestamp" : """Mar2201:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""" }, "_ingest" : { "timestamp" : 2020-12-09T05:38:44.674567z "}}}]}Copy the code
The first three strings are joined into one string, and the match is greedy: the last %{+timestamp} key swallows everything to the end of the message. We need to refine the pattern:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}
We add append_separator and use %{host} to match all subsequent strings:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "host" : """localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "timestamp" : "Mar 22 01:27:39" }, "_ingest" : {"timestamp" : "2020-12-09t05:41:53.667182z"}}}]}Copy the code
This time the timestamp field is correct, but the host field is still one long string. We continue:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{rest}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}
Above, we extract the process name and its ID, and put everything else into %{rest}:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "rest" : """ Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "process" : "haproxy", "host" : "localhost", "id" : "14415", "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "timestamp" : "Mar 22 01:27:39" }, "_ingest" : {"timestamp" : "2020-12-09t05:46:11.833548z"}}}]}Copy the code
Looking at rest, the first part is a status and the remainder is key-value data, which we can handle with the kv processor. Let's first extract the status:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{status}, %{rest}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}
Run the command above:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "rest" : """reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "process" : "haproxy", "host" : "localhost", "id" : "14415", "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "status" : " Server updates /appServer02 is UP", "timestamp" : "Mar 22 01:27:39" }, "_ingest" : { "timestamp" : "2020-12-09T05:50:18.300969z"}}}]}Copy the code
Now we get a status field. The remaining rest field is clearly key-value data, which we can process with the kv processor:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{status}, %{rest}",
          "append_separator": " "
        }
      },
      {
        "kv": {
          "field": "rest",
          "field_split": ", ",
          "value_split": ":"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}
Above, we added a kv processor:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "rest" : """reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "reason" : " Layer7 check passed", "process" : "haproxy", "code" : "2000", "check duration" : "3ms.", "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""", "host" : "localhost", "id" : "14415", "status" : " Server updates /appServer02 is UP", "timestamp" : "Mar 22 01:27:39", "info" : "\" OK \ ""}," _ingest ": {" timestamp" : "the 2020-12-09 T06:00:37. 990909 z"}}}}]Copy the code
From the above results, we now have all the fields we want. Let's remove the message and rest fields, which we no longer need:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{status}, %{rest}",
          "append_separator": " "
        }
      },
      {
        "kv": {
          "field": "rest",
          "field_split": ", ",
          "value_split": ":"
        }
      },
      {
        "remove": {
          "field": "message"
        }
      },
      {
        "remove": {
          "field": "rest"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}
Above, I used the remove processor to delete the message and rest fields:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "reason" : " Layer7 check passed", "process" : "haproxy", "code" : "2000", "check duration" : "3ms.", "host" : "localhost", "id" : "14415", "status" : " Server updates /appServer02 is UP", "timestamp" : "Mar 22 01:27:39", "info" : "\" OK \ ""}," _ingest ": {" timestamp" : "the 2020-12-09 T05:59:44. 138394 z"}}}}]Copy the code
This step-by-step process shows how to turn unstructured data into a structured document.