Standard output (Stdout)
outputs/stdout, like its inputs/stdin counterpart, is the most basic and simplest output plugin. It also serves here as a brief introduction to how output plugins work in general.
Configuration example
output {
    stdout {
        codec => rubydebug
        workers => 2
    }
}
Explanation
All output plugins share one common parameter: workers. Logstash is prepared to run the output stage with multiple threads, and workers sets how many threads this output uses.
The second common setting is codec. The role of codecs has been discussed earlier. With the possible exception of codecs/multiline, codec plugins have few settings of their own, so their configuration block is usually omitted. In other words, the full version of the example above would be:
output {
    stdout {
        codec => rubydebug {
        }
        workers => 2
    }
}
The single most important and common use of the outputs/stdout plugin is debugging. So, when things are not working as expected, run Logstash with the -vv command-line flag to see more detailed debugging information.
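For instance, a quick debug run might look like the sketch below; the inline config and sample event are made up for illustration.

# a minimal sketch: run Logstash with extra verbosity (-vv);
# the rubydebug codec pretty-prints every event as a Ruby hash
bin/logstash -e 'input { stdin { } } output { stdout { codec => rubydebug } }' -vv
# typing "hello world" then prints something roughly like:
# {
#        "message" => "hello world",
#       "@version" => "1",
#     "@timestamp" => "2015-01-01T00:00:00.000Z",
#           "host" => "example.local"
# }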
Save as a File
Collecting data scattered across hundreds of servers and storing it on a central server is the most basic requirement of operations work; some early log-collection tools even named their output syntax after exactly this purpose. Logstash can, of course, do this as well.
Unlike LogStash::Inputs::File, LogStash::Outputs::File accepts sprintf format strings in its path, so the output can automatically be written to a date-named path.
Configuration example
output {
    file {
        path => "/path/to/%{+yyyy/MM/dd/HH}/%{host}.log.gz"
        message_format => "%{message}"
        gzip => true
    }
}
Explanation
The first thing to notice when using the outputs/file plugin is the message_format parameter. By default the plugin writes the entire event as JSON, which is probably not what most users expect: you may simply want to keep the log in its original format. In that case, set message_format to %{message}, provided, of course, that you did not remove or modify the %{message} field with remove_field or update in an earlier filter plugin.
Another very useful parameter is gzip. The gzip format is a peculiar yet friendly format, consisting of:
- A 10-byte header containing a magic number, version number, and timestamp
- Optional extension header, such as the original file name
- File body, including DEFLATE compressed data
- An 8-byte trailer, containing a CRC-32 checksum and the length of the original uncompressed data
This allows gzip to recognize the data segment by segment — which, conversely, means data can be compressed and appended segment by segment!
This is great for streaming data!
Tip: You may have seen the seemingly magical documentation, circulated online, showing the parallel command-line tool processing data concurrently, yet never reproduced the effect yourself. The trick works precisely because the gzip files handled in those documents can be split up, processed separately, and then merged back together.
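A minimal sketch of that append-then-decompress behavior (the file name and contents are made up):

# two separately compressed gzip segments appended to the same file...
echo "first batch"  | gzip >  /tmp/append-test.log.gz
echo "second batch" | gzip >> /tmp/append-test.log.gz
# ...still decompress as one continuous stream
zcat /tmp/append-test.log.gz
# prints:
#   first batch
#   second batch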
Save in Elasticsearch
Early on, Logstash had three different Elasticsearch output plugins. As of version 1.4.0, the developers completely rewrote the LogStash::Outputs::ElasticSearch plugin. Since then, a single plugin lets you switch between the different protocols an Elasticsearch cluster supports.
Configuration example
output {
    elasticsearch {
        host => "192.168.0.2"
        protocol => "http"
        index => "logstash-%{type}-%{+YYYY.MM.dd}"
        index_type => "%{type}"
        workers => 5
        template_overwrite => true
    }
}
Explanation
Protocols
The new plug-in now supports three protocols: Node, HTTP, and Transport.
The node protocol is the most convenient in a small cluster. The Logstash process joins the cluster as an Elasticsearch client node. If you run the following command, you can see your Logstash process listed with a node.role value of c:
# curl 127.0.0.1:9200/_cat/nodes?v
host   ip            heap.percent ram.percent load node.role master name
local  192.168.0.102 7                             c         -      logstash-local-1036-2012
local  192.168.0.2   7                             d         *      Sunstreak
In particular, as a quick way to run an example, you can even run an embedded Elasticsearch server inside the Logstash process. The embedded server stores its indexes under the $PWD/data directory by default. If you want to change these settings, just write your custom configuration into a $PWD/elasticsearch.yml file; Logstash will try to load it automatically.
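For example, a minimal sketch of such a quick-start setup, assuming the embedded option of the 1.4-era plugin:

output {
    elasticsearch {
        protocol => "node"
        # start an embedded Elasticsearch server inside the Logstash process;
        # it keeps its indexes under $PWD/data by default
        embedded => true
    }
}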
For large clusters with many indexes, you can use the transport protocol. The Logstash process then forwards all data to a single host that you specify. This protocol differs from the node protocol above: a node-protocol process receives the full Elasticsearch cluster state, so when it receives an event it knows which machine in the cluster the event belongs on and connects directly to that machine to send the data. A transport-protocol process does not hold this information, so the cluster does not have to push state updates (node changes and index changes both trigger full updates) to every Logstash process. For more on Elasticsearch cluster state, see www.elasticsearch.org/guide.
If you already have an Elasticsearch cluster, but its version does not quite match the one bundled with Logstash, the http protocol is the recommended choice. Logstash sends data with HTTP POST requests.
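Conceptually, what Logstash does over the http protocol is close to posting events to the bulk API by hand; a rough equivalent (host, index, and event below are placeholders) would be:

# one bulk request: an action line followed by the event document
curl -XPOST http://192.168.0.2:9200/_bulk -d '
{"index":{"_index":"logstash-2015.01.01","_type":"logs"}}
{"message":"hello world","@version":"1","@timestamp":"2015-01-01T00:00:00.000Z"}
'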
Tips:
- Under the transport and http protocols, Logstash 1.4.2 keeps a fixed connection to one specified host and sends all data there. Starting with 1.5.0, host can be set to an array; each send picks a different node from the list, giving round-robin load balancing (see the sketch after these tips).
- Kibana 4 requires that every node in the ES cluster be at least version 1.4, so a Logstash 1.4 node (whose bundled elasticsearch.jar is version 1.1.1) joining via the node protocol will break Kibana 4. If you use Kibana 4, switch to the http protocol.
- “Elasticsearch 1.0 and above should work with the node protocol of the latest Logstash,” the developers stated in the freenode #logstash IRC channel. This is for reference only; test carefully before going to production.
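A minimal sketch of the 1.5.0-style host array mentioned in the first tip (the addresses are placeholders):

output {
    elasticsearch {
        protocol => "http"
        # since 1.5.0, host accepts an array; each batch goes to a different
        # node from the list, giving round-robin load balancing
        host => ["192.168.0.2", "192.168.0.3", "192.168.0.4"]
    }
}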
Performance issues
Under the http protocol, Logstash 1.4.2 by default uses the author's own FTW library, shipped at version 0.0.39. That version has a memory leak, so on long runs the output performance keeps getting worse!
Solutions:
- If performance requirements are not high, you can set the BULK environment variable (read by Logstash as ENV["BULK"]) when starting the Logstash process to force use of the official Elasticsearch Ruby library. The command is:
  export BULK="esruby"
- For higher performance requirements, try logstash-1.5.0RC2. The new version of outputs/elasticsearch drops the FTW library in favor of Manticore, a library specific to the JRuby platform. Testing shows its performance is already quite close to FTW.
- For the highest performance requirements, you can manually update the FTW library. The latest version at the time of writing is 0.0.42; the memory problem is reportedly fixed as of 0.0.40.
The template
Elasticsearch supports predefining settings and mappings for indexes (provided your Elasticsearch version supports this API, which it almost certainly does). Logstash ships with an optimized template, shown below:
{
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {
      "_all" : {"enabled" : true},
      "dynamic_templates" : [ {
        "string_fields" : {
          "match" : "*",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "string", "index" : "analyzed", "omit_norms" : true,
            "fields" : {
              "raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
            }
          }
        }
      } ],
      "properties" : {
        "@version": { "type": "string", "index": "not_analyzed" },
        "geoip" : {
          "type" : "object",
          "dynamic": true,
          "path": "full",
          "properties" : {
            "location" : { "type" : "geo_point" }
          }
        }
      }
    }
  }
}
Key Settings include:
- template for index-pattern
This template is applied only to indexes whose names match logstash-*. If you change Logstash's default index name, remember that you also need to PUT a template whose pattern matches your custom index name. Of course, my preference is to keep the "logstash-" prefix and just append your own name, e.g. index => "logstash-custom-%{+YYYY.MM.dd}".
- refresh_interval for indexing
Elasticsearch is a near-real-time search engine; by default it actually refreshes data once every second. Log analysis does not need to be that close to real time, so the Logstash template raises the interval to 5 seconds. You can increase the refresh interval further as needed to improve write performance.
- multi-field with not_analyzed
Elasticsearch automatically analyzes fields with its default tokenizer (splitting on spaces, dots, slashes, and so on). Analysis matters for search and relevance scoring, but it greatly reduces the performance of index writes and of aggregation requests. So the Logstash template defines the fields as multi-field types: each string field automatically gets a sub-field ending in ".raw" with analysis disabled. Simply put, when you want aggregate results for a url field, do not use "url" directly; use "url.raw" as the field name (see the sketch after this list).
- geo_point
Elasticsearch supports the geo_point type and geo distance aggregations, among other things. You can, for example, ask for the total number of data points within 10 kilometers of a given geo_point. This kind of data is what Kibana's bettermap panel type uses.
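As promised in the multi-field item above, here is a hypothetical terms aggregation against the unanalyzed sub-field; the index name and the url field are assumptions for illustration:

curl -XPOST http://localhost:9200/logstash-2015.01.01/_search -d '{
  "size": 0,
  "aggs": {
    "top_urls": { "terms": { "field": "url.raw" } }
  }
}'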
Other recommended template settings
- doc_values
doc_values is a feature introduced in Elasticsearch 1.3. For fields that enable it, fielddata is built on disk at index (write) time. Previously, fielddata could live only in memory, which made it easy to trigger an OOM error when a request covered a large range of data:
ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data for field [@timestamp] would be larger than limit of [639015321/609.4mb]]
doc_values can only be used on fields that are not analyzed (for string fields that means "index": "not_analyzed"; numeric and date fields are not analyzed by default).
doc_values is effectively a disk-backed cache, but with the operating system's own VFS cache on top the performance is not bad: in official tests, after the 1.4 optimizations, it was only about 15% slower than in-memory fielddata. So it is strongly recommended to enable this setting when the data volume is large:
{
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {
      "_all" : {"enabled" : true},
      "dynamic_templates" : [ {
        "string_fields" : {
          "match" : "*",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "string", "index" : "analyzed", "omit_norms" : true,
            "fields" : {
              "raw" : { "type": "string", "index" : "not_analyzed", "ignore_above" : 256, "doc_values": true }
            }
          }
        }
      } ],
      "properties" : {
        "@version": { "type": "string", "index": "not_analyzed" },
        "@timestamp": { "type": "date", "index": "not_analyzed", "doc_values": true, "format": "dateOptionalTime" },
        "geoip" : {
          "type" : "object",
          "dynamic": true,
          "path": "full",
          "properties" : {
            "location" : { "type" : "geo_point" }
          }
        }
      }
    }
  }
}
- order
If you have your own custom template you want to use, great. There are several options:
- Enable the manage_template => false option in the logstash/outputs/elasticsearch configuration, and then handle everything yourself;
- Enable the template => "/path/to/your/tmpl.json" option in the logstash/outputs/elasticsearch configuration, so that Logstash sends your own template file;
- Leave the Logstash configuration untouched, and instead send a separate template using Elasticsearch's template order feature.
The order feature works like this: when Elasticsearch creates an index and finds that it matches several templates at once, it first applies the settings from the template with the lower order value, then applies the template with the higher order value on top as an override, so the two end up merged.
For example, if you are satisfied with the above template and just want to change the refresh_interval, you just need to create a new one:
{
  "order" : 1,
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "20s"
  }
}
Then upload it with:
    curl -XPUT http://localhost:9200/_template/template_newid -d '@/path/to/your/tmpl.json'
The default template sent by Logstash has order 0 and the id "logstash"; the id can be changed with the template_name option in the logstash/outputs/elasticsearch configuration. Make sure your new template does not conflict with that name.
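A minimal sketch of pushing a custom template under a non-conflicting id, assuming the template_name, template, and template_overwrite options described in this section (the id and path are placeholders):

output {
    elasticsearch {
        host => "192.168.0.2"
        protocol => "http"
        # a custom template id, distinct from the default "logstash"
        template_name => "logstash_custom"
        template => "/path/to/your/tmpl.json"
        template_overwrite => true
    }
}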
Recommended reading
- www.elasticsearch.org/guide