This series of articles is divided into three parts: Filebeat, Logstash and ES. It covers designing, deploying and optimizing a logging system to maximize resource utilization and achieve the best possible performance on tens of billions of log entries per day. This article focuses on Filebeat.

Introduction

Version: Filebeat 7.12.0

This article covers log collection for K8s. Filebeat is deployed as a DaemonSet; during collection, logs are classified by the K8s namespace they come from, and a separate Kafka topic is created for each namespace.

K8s Log file description

In general, a container writes its logs to standard output (stdout), and they are stored as *-json.log files under the /var/lib/docker/containers directory. If the Docker data directory has been changed, the logs live under the modified data directory instead, for example:

# tree /data/docker/containers
/data/docker/containers
├── 009227c00e48b051b6f5cb65128fd58412b845e0c6d2bec5904f977ef0ec604d
│   ├── 009227c00e48b051b6f5cb65128fd58412b845e0c6d2bec5904f977ef0ec604d-json.log
│   ├── checkpoints
│   ├── config.v2.json
│   ├── hostconfig.json
│   └── mounts

So for every container there is a file of the form /data/docker/containers/<container id>/<container id>-json.log. By default, K8s then creates symbolic links to these log files under the /var/log/containers and /var/log/pods directories, as shown below:

cattle-node-agent-tvhlq_cattle-system_agent-8accba2d42cbc907a412be9ea3a628a90624fb8ef0b9aa2bc6ff10eab21cf702.log
etcd-k8s-master01_kube-system_etcd-248e250c64d89ee6b03e4ca28ba364385a443cc220af2863014b923e7f982800.log

You will then see that this directory contains all of the container logs for this host, named as:

[podName]_[nameSpace]_[deploymentName]-[containerId].log

This is the naming convention for Deployments; other workload types such as DaemonSets and StatefulSets vary slightly, but they all share one common pattern:

*_[nameSpace]_*.log
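For example, this common pattern translates directly into a Filebeat input path glob. A minimal sketch (test-1 is simply the sample namespace used later in this article):

filebeat.inputs:
- type: container
  paths:
  - /var/log/containers/*_test-1_*log    # only the container logs of the test-1 namespace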

At this point, knowing this feature, you can move on to the deployment and configuration of Filebeat.

Filebeat deployment

Filebeat is deployed as a DaemonSet. There is nothing special here; the deployment can simply follow the official documentation:

---
apiVersion: v1
data:
  filebeat.yml: |-
    filebeat.inputs:
    - type: container
      enabled: true
      paths:
      - /var/log/containers/*_test-1_*log
      fields:
        namespace: test-1
        env: dev
        k8s: cluster-dev
    - type: container
      enabled: true
      paths:
      - /var/log/containers/*_test-2_*log
      fields:
        namespace: test-2
        env: dev
        k8s: cluster-dev
    filebeat.config.modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false
    output.kafka:
      hosts: ["10.0.105.74:9092", "10.0.105.76:9092", "10.0.105.96:9092"]
      topic: '%{[fields.k8s]}-%{[fields.namespace]}'
      partition.round_robin:
        reachable_only: true
kind: ConfigMap
metadata:
  name: filebeat-daemonset-config-test
  namespace: default
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:7.12.0
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          runAsUser: 0
          # If using Red Hat OpenShift uncomment this:
          #privileged: true
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          readOnly: true
          subPath: filebeat.yml
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0640
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: varlog
        hostPath:
          path: /var/log
      # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
      - name: data
        hostPath:
          # When filebeat runs as non-root user, this directory needs to be writable by group (g+w).
          path: /var/lib/filebeat-data
          type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - namespaces
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
- apiGroups: ["apps"]
  resources:
    - replicasets
  verbs: ["get"."list"."watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat

Deploying is simply a matter of running kubectl apply -f on the manifest above; the deployment itself is not the focus of this article.

The official deployment reference: raw.githubusercontent.com/elastic/bea…

Filebeat configuration introduction

Here is the overall configuration structure of Filebeat:

filebeat.inputs:

filebeat.config.modules:

processors:

output.xxxxx:


The structure is roughly as shown above. The complete data flow can be described simply: inputs read the log files, processors enrich and trim each event, and the output ships the events onward (to Kafka in our case).
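Here is a minimal sketch of that flow as a single filebeat.yml, assuming the Kafka output used throughout this article; the broker address and topic name are illustrative only:

filebeat.inputs:                  # 1. read the container log files
- type: container
  paths:
  - /var/log/containers/*.log
filebeat.config.modules:          # 2. optional modules, unchanged from the defaults
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
processors:                       # 3. enrich / trim each event before shipping
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
output.kafka:                     # 4. ship the events to Kafka
  hosts: ["10.0.105.74:9092"]
  topic: 'logs-demo'

The rest of this article fills in each of these blocks.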

If you collect logs from more than one cluster, the classification by namespace stays the same, but the topic name also needs to carry the K8s cluster name so the clusters can be told apart. In the inputs, write a glob pattern that picks up only the log files of the specified namespace, for example:

filebeat.inputs:
- type: container
  enabled: true
  paths:
  - /var/log/containers/*_test-1_*log
  fields:
    namespace: test-1
    env: dev
    k8s: cluster-dev

Here the test-1 namespace is used as the example: the pattern *_test-1_*log picks up only the log files of that namespace, and a custom field is added for creating the topic later. If there is more than one namespace, the inputs can be arranged like this:

filebeat.inputs:
- type: container
  enabled: true
  paths:
  - /var/log/containers/*_test-1_*log
  fields:
    namespace: test-1
    env: dev
    k8s: cluster-dev
- type: container
  enabled: true
  paths:
  - /var/log/containers/*_test-2_*log
  fields:
    namespace: test-2
    env: dev
    k8s: cluster-dev

The downside of this approach is that many namespaces mean a lot of configuration. Don't worry — a more concise way of writing this is shown further below.

Note: the input type must be set to container.

I added a custom field named namespace, whose value becomes the topic name below. But there are many namespaces, so how can the topics be created dynamically in the output?

output.kafka:
  hosts: ["10.0.105.74:9092"."10.0.105.76:9092"."10.0.105.96:9092"]
  topic: '%{[fields.namespace]}'
  partition.round_robin:
    reachable_only: true

Note the syntax %{[fields.namespace]}: with fields.namespace set to test-1, the events from that input end up in a Kafka topic named test-1.

The complete configuration looks like this:

apiVersion: v1
data:
  filebeat.yml: |-
    filebeat.inputs:
    - type: container
      enabled: true
      paths:
      - /var/log/containers/*_test-1_*log
      fields:
        namespace: test-1
        env: dev
        k8s: cluster-dev
    - type: container
      enabled: true
      paths:
      - /var/log/containers/*_test-2_*log
      fields:
        namespace: test-2
        env: dev
        k8s: cluster-dev
    filebeat.config.modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false
    output.kafka:
      hosts: ["10.0.105.74:9092", "10.0.105.76:9092", "10.0.105.96:9092"]
      topic: '%{[fields.k8s]}-%{[fields.namespace]}'
      partition.round_robin:
        reachable_only: true
kind: ConfigMap
metadata:
  name: filebeat-daemonset-config-test
  namespace: default

If you did not need to do anything else with the logs, you could stop here. But look at one of the collected log events: what is missing? Right — you know the log content and which namespace it came from, but not which service it belongs to, which pod produced it, or the image address of that service. None of this information is available with the configuration above, so more has to be added.

This time we use a configuration item called processors. The official explanation:

You can define processors in your configuration to process events before they are sent to the configured output

In a nutshell, processors transform the log events before they are shipped; a minimal illustration follows.
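A minimal sketch (the values are illustrative, not taken from this article's setup): processors are declared as a list and run in order, each one modifying the event before it reaches the output.

processors:
  - add_fields:            # first attach a static field to every event
      target: ''
      fields:
        env: dev
  - drop_fields:           # then strip fields that are not needed downstream
      fields: ["agent", "ecs"]
      ignore_missing: true

Both of these processors, and a few others, are used for real further down.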

Now let's take a closer look at processors — they are very useful and important.

Processors with Filebeat

Adding basic K8s information

When collecting K8s logs with only the configuration above, the events carry no information about the pod, for example:

  • Pod Name
  • Pod UID
  • Namespace
  • Labels
  • Etc., etc.

To add this information, use a processor called add_kubernetes_metadata which, as the name suggests, adds Kubernetes metadata to each event. A basic add_kubernetes_metadata configuration looks like this:

processors:
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
      matchers:
      - logs_path:
          logs_path: "/var/log/containers/"

A few notes on these options:

  • host: the node Filebeat is running on, in case it cannot be detected accurately, for example when Filebeat runs in host network mode.
  • matchers: the matchers are used to construct the lookup keys that are matched against the identifiers created by the indexers.
  • logs_path: the base path of the container logs; if not specified, the default log path of the platform Filebeat runs on is used.

After adding the k8S metadata information, you can see the k8S information in the log.

{
  "@timestamp": "2021-04-19T07:07:36.065Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.11.2"
  },
  "log": {
    "offset": 377708,
    "file": {
      "path": "/var/log/containers/test-server-85545c868b-6nsvc_test-1_test-server-885412c0a8af6bfa7b3d7a341c3a9cb79a85986965e363e87529b31cb650aec4.log"
    }
  },
  "fields": {
    "env": "dev",
    "k8s": "cluster-dev",
    "namespace": "test-1"
  },
  "host": {
    "name": "filebeat-fv484"
  },
  "agent": {
    "id": "7afbca43-3ec1-4cee-b5cb-1de1e955b717",
    "name": "filebeat-fv484",
    "type": "filebeat",
    "version": "7.11.2",
    "hostname": "filebeat-fv484",
    "ephemeral_id": "8fd29dee-da50-4c88-88d5-ebb6bbf20772"
  },
  "ecs": {
    "version": "1.6.0"
  },
  "stream": "stdout",
  "message": "2021-04-19 15:07:36.065  INFO 23 --- [trap-executor-0] c.n.d.s.r.aws.ConfigClusterResolver      : Resolving eureka endpoints via configuration",
  "input": {
    "type": "container"
  },
  "container": {
    "image": {
      "name": "hub.test.com/test/test-server:3.3.1-ent-release-SNAPSHOT.20210402191241_87c9b1f841c"
    },
    "id": "885412c0a8af6bfa7b3d7a341c3a9cb79a85986965e363e87529b31cb650aec4",
    "runtime": "docker"
  },
  "kubernetes": {
    "labels": {
      "pod-template-hash": "85545c868b",
      "app": "geip-gateway-test"
    },
    "container": {
      "name": "test-server",
      "image": "hub.test.com/test/test-server:3.3.1-ent-release-SNAPSHOT.20210402191241_87c9b1f841c"
    },
    "node": {
      "uid": "511d9dc1-a84e-4948-b6c8-26d3f1ba2e61",
      "labels": {
        "kubernetes_io/hostname": "k8s-node-09",
        "kubernetes_io/os": "linux",
        "beta_kubernetes_io/arch": "amd64",
        "beta_kubernetes_io/os": "linux",
        "cloudt-global": "true",
        "kubernetes_io/arch": "amd64"
      },
      "hostname": "k8s-node-09",
      "name": "k8s-node-09"
    },
    "namespace_uid": "4fbea846-44b8-4d4a-b03b-56e43cff2754",
    "namespace_labels": {
      "field_cattle_io/projectId": "p-lgxhz",
      "cattle_io/creator": "norman"
    },
    "pod": {
      "name": "test-server-85545c868b-6nsvc",
      "uid": "1e678b63-fb3c-40b5-8aad-892596c5bd4d"
    },
    "namespace": "test-1",
    "replicaset": {
      "name": "test-server-85545c868b"
    }
  }
}

The kubernetes key contains the pod information, node information, namespace information and so on — essentially all of the key K8s metadata.

The problem, however, is that there is now too much information in each log event — more than half of it is not something we want — so the fields that are of no use to us need to be removed.

Delete unnecessary fields

processors:
  - drop_fields:
      # Delete unnecessary fields
      fields:
        - host
        - ecs
        - log
        - agent
        - input
        - stream
        - container
      ignore_missing: true

Note: the meta-information field @metadata cannot be deleted.

Adding the log time

As the log event above shows, there is no separate field for the log time. There is a @timestamp, but it is not Beijing time, and what we really want is the time recorded in the log line itself. That time is inside message — so how do we extract it into its own field? This is where the script processor comes in: a small JavaScript snippet does the replacement.

processors:
  - script:
      lang: javascript
      id: format_time
      tag: enable
      source: >
        function process(event) {
          var str = event.Get("message");
          var time = str.split(" ").slice(0, 2).join(" ");
          event.Put("time", time);
        }
  - timestamp:
      field: time
      timezone: Asia/Shanghai
      layouts:
        - '2006-01-02 15:04:05'
        - '2006-01-02 15:04:05.999'
      test:
        - '2019-06-22 16:33:51'

After this addition, each event has a separate time field that can be used later on.

Reassembling the K8s source information

At this point all of the requirements are actually met, but adding the K8s metadata also introduced many useless fields. These can again be removed with drop_fields, for example:

processors:
  - drop_fields:
      # Delete unnecessary fields
      fields:
        - kubernetes.pod.uid
        - kubernetes.namespace_uid
        - kubernetes.namespace_labels
        - kubernetes.node.uid
        - kubernetes.node.labels
        - kubernetes.replicaset
        - kubernetes.labels
        - kubernetes.node.name
      ignore_missing: true

This removes the useless fields, but the hierarchy stays the same — there are still many nested layers — and the end result looks something like this:

{
  "@timestamp": "2021-04-19T07:07:36.065Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.11.2"
  },
  "fields": {
    "env": "dev",
    "k8s": "cluster-dev",
    "namespace": "test-1"
  },
  "message": "2021-04-19 15:07:36.065  INFO 23 --- [trap-executor-0] c.n.d.s.r.aws.ConfigClusterResolver      : Resolving eureka endpoints via configuration",
  "kubernetes": {
    "container": {
      "name": "test-server",
      "image": "hub.test.com/test/test-server:3.3.1-ent-release-SNAPSHOT.20210402191241_87c9b1f841c"
    },
    "node": {
      "hostname": "k8s-node-09"
    },
    "pod": {
      "name": "test-server-85545c868b-6nsvc"
    },
    "namespace": "test-1"
  }
}

With this structure, the index template in ES ends up with many nested levels and querying becomes inconvenient. So the hierarchy should be flattened, again using the script processor:

processors:
  - script:
      lang: javascript
      id: format_k8s
      tag: enable
      source: >
        function process(event) {
          var k8s = event.Get("kubernetes");
          var newK8s = {
            podName: k8s.pod.name,
            nameSpace: k8s.namespace,
            imageAddr: k8s.container.name,
            hostName: k8s.node.hostname
          };
          event.Put("k8s", newK8s);
        }

Here a new k8s field is assembled containing only podName, nameSpace, imageAddr and hostName, and the original kubernetes field can then be removed with drop_fields. The final result is as follows:

{
  "@timestamp": "2021-04-19T07:07:36.065Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.11.2"
  },
  "fields": {
    "env": "dev",
    "k8s": "cluster-dev",
    "namespace": "test-1"
  },
  "time": "2021-04-19 15:07:36.065",
  "message": "2021-04-19 15:07:36.065  INFO 23 --- [trap-executor-0] c.n.d.s.r.aws.ConfigClusterResolver      : Resolving eureka endpoints via configuration",
  "k8s": {
    "podName": "test-server-85545c868b-6nsvc",
    "nameSpace": "test-1",
    "imageAddr": "hub.test.com/test/test-server:3.3.1-ent-release-SNAPSHOT.20210402191241_87c9b1f841c",
    "hostName": "k8s-node-09"
  }
}

Now the event looks very clean. It is still a little tedious, though: every time a new namespace is added later, the configuration has to be changed again. So is there a better way? There is.

The final optimization

Since output.kafka can build the topic name from event fields, we can take advantage of this and set it up like this:

    output.kafka:
      hosts: ["10.0.105.74:9092", "10.0.105.76:9092", "10.0.105.96:9092"]
      topic: '%{[fields.k8s]}-%{[k8s.nameSpace]}'  # the namespace now comes from the injected K8s metadata
      partition.round_robin:
        reachable_only: true

We also keep a field to distinguish between different K8s clusters — in this final version it is the k8sName value set inside the script processor — so the configuration file can be optimized to look like this:

apiVersion: v1
data:
  filebeat.yml: |-
    filebeat.inputs:
    - type: container
      enabled: true
      paths:
      - /var/log/containers/*.log
      multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}|^[1-9]\d*\.[1-9]\d*\.[1-9]\d*\.[1-9]\d*'
      multiline.negate: true
      multiline.match: after
      multiline.timeout: 10s
    filebeat.config.modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false
    processors:
      # Drop the logs of components we do not care about
      - drop_event:
          when:
            or:
              - regexp:
                  kubernetes.pod.name: "filebeat.*"
              - regexp:
                  kubernetes.pod.name: "external-dns.*"
              - regexp:
                  kubernetes.pod.name: "coredns.*"
              - regexp:
                  kubernetes.pod.name: "eureka.*"
              - regexp:
                  kubernetes.pod.name: "zookeeper.*"
      # Extract the log time from the message into a separate time field
      - script:
          lang: javascript
          id: format_time
          tag: enable
          source: >
            function process(event) {
              var str = event.Get("message");
              var time = str.split(" ").slice(0, 2).join(" ");
              event.Put("time", time);
            }
      - timestamp:
          field: time
          timezone: Asia/Shanghai
          layouts:
            - '2006-01-02 15:04:05'
            - '2006-01-02 15:04:05.999'
          test:
            - '2019-06-22 16:33:51'
      # Convert the UTC @timestamp to Beijing time and store it in time_utc
      - script:
          lang: javascript
          id: format_time_utc
          tag: enable
          source: >
            function process(event) {
              var utc_time = event.Get("@timestamp");
              var T_pos = utc_time.indexOf('T');
              var Z_pos = utc_time.indexOf('Z');
              var year_month_day = utc_time.substr(0, T_pos);
              var hour_minute_second = utc_time.substr(T_pos+1, Z_pos-T_pos-1);
              var new_time = year_month_day + " " + hour_minute_second;
              timestamp = new Date(Date.parse(new_time));
              timestamp = timestamp.getTime();
              timestamp = timestamp/1000;
              var timestamp = timestamp + 8*60*60;
              var bj_time = new Date(parseInt(timestamp) * 1000 + 8*3600*1000);
              var bj_time = bj_time.toJSON().substr(0, 19).replace('T', ' ');
              event.Put("time_utc", bj_time);
            }
      - timestamp:
          field: time_utc
          layouts:
            - '2006-01-02 15:04:05'
            - '2006-01-02 15:04:05.999'
          test:
            - '2019-06-22 16:33:51'
      - add_fields:
          target: ''
          fields:
            env: prod
      # Inject the K8s metadata
      - add_kubernetes_metadata:
          default_indexers.enabled: true
          default_matchers.enabled: true
          host: ${NODE_NAME}
          matchers:
          - logs_path:
              logs_path: "/var/log/containers/"
      # Rebuild a flat k8s field, including the cluster name
      - script:
          lang: javascript
          id: format_k8s
          tag: enable
          source: >
            function process(event) {
              var k8s = event.Get("kubernetes");
              var newK8s = {
                podName: k8s.pod.name,
                nameSpace: k8s.namespace,
                imageAddr: k8s.container.name,
                hostName: k8s.node.hostname,
                k8sName: "sg-saas-pro-hbali"
              };
              event.Put("k8s", newK8s);
            }
      # Drop unnecessary fields
      - drop_fields:
          fields:
            - host
            - tags
            - ecs
            - log
            - prospector
            - agent
            - input
            - beat
            - offset
            - stream
            - container
            - kubernetes
          ignore_missing: true
    output.kafka:
      hosts: ["10.127.91.90:9092","10.127.91.91:9092","10.127.91.92:9092"]
      topic: '%{[k8s.k8sName]}-%{[k8s.nameSpace]}'
      partition.round_robin:
        reachable_only: true
kind: ConfigMap
metadata:
  name: filebeat-daemonset-config
  namespace: default

A few tweaks have been made here; the important thing to remember is how the topic is created: %{[k8s.k8sName]}-%{[k8s.nameSpace]} (for example, a pod in the test-1 namespace on the sg-saas-pro-hbali cluster ends up in the topic sg-saas-pro-hbali-test-1). This will matter again later when we configure Logstash.

Conclusion

Personally, I think letting Filebeat do part of the processing at the collection layer shortens the overall processing time, because most of the bottleneck sits in ES and Logstash; any time-consuming operation that Filebeat can handle should be handled there, and only what it cannot do should be left to Logstash. Another point that is easy to overlook is simplifying the log content, which noticeably reduces the log volume. In a test on the same number of log entries, the unsimplified data came to about 20 GB, while the optimized data was under 10 GB — a change that is very friendly to, and very significant for, the whole ES cluster.


You are welcome to follow my official account, and let's learn and improve together.