Recently I wanted to organize my earlier logging project and my own thinking around it, which led to this article.

Background

Our application deployment environment is complex, including the following:

  • Deployment directly on physical or virtual machines
  • Deployment via native Docker
  • Deployment via a Kubernetes cluster

Different deployment environments make it inconvenient to view application logs.

Business requirements

The requirements for log collection are classified into the following types:

  • Text logs on machines (logs of applications running directly on physical or virtual machines)
  • Logs of applications running in Docker containers
  • Logs of applications running in a Kubernetes cluster

Specific business requirements can be broken down into:

  • Retrieve logs by project, application, and instance, with search-keyword highlighting (when you retrieve logs, you are always retrieving the logs of a particular project, application, or instance)
  • After finding a log entry of interest, view its context (the surrounding lines of the log file the entry came from)
  • Log download (two scenarios are supported: downloading search results and downloading context; two modes are supported: online download and offline download)
  • Supports automatic batch deployment and uninstallation of agents, and visualizes the deployment and uninstallation process
  • A single instance supports multiple ElasticSearch clusters
  • Support mapping text logs, Docker logs, and K8s logs onto their business meaning (that is, whatever a log's form or source, it ultimately has to be associated with a project, application, and instance, because for users of the log service the starting point of a query is always a project, optionally an application, optionally an instance, and a time range)
  • Supports deployment to service clusters

Now that the requirements are clear, let’s take a look at the industry proposals.

Industry log system architecture

  • The Collector is used to:
    • Clean and aggregate data to reduce the pressure on the back-end cluster.
    • Security: Agents are not allowed to connect directly to Kafka or other internal clusters; even if the back end changes, the Agent's connection and authentication mechanism stays stable.
  • MQ provides peak shaving, decoupling, and consumption by multiple consumers.

The architecture shown in the figure above is a common one in the industry, and it takes all scenarios into account.

Since the industry’s architecture is so complete, should we just adopt it?

For us, it raises the following problems:

  • It involves many components and a long pipeline, which makes it difficult to operate and maintain
  • Such an architecture is not well suited to standalone deployment (for example, a business application may be deployed in a network-isolated machine room, and the project may be small enough that only a limited number of machines are available; in that case resources are too constrained for the full industry architecture, and if we wanted the industry components to be pluggable, e.g. freely deciding whether a Collector or MQ is required, we would have to operate and maintain several sets of configuration or code)
  • Most importantly, we currently have no use for the functionality these components provide, such as MQ peak shaving and consumption by multiple consumers

Component selection

In selecting components, we mainly consider the following aspects:

  1. The component has a complete and highly active open-source ecosystem
  2. We are familiar with the corresponding technology stack. Our languages are mainly Java and Go; if a component is written in, say, C or Ruby, it is excluded.
  3. Operational costs
  4. Easy to deploy and high performance

Agent

ELK (Elasticsearch, Logstash, Kibana) is the first solution that comes to mind. However, Logstash is not our first choice for a log collector: it depends on the JVM, and in terms of performance and footprint it is not ideal.

In my opinion, a good agent should consume few resources, perform well, have no dependencies on other components, and be deployable on its own. Logstash obviously does not meet these requirements, and it is probably for these reasons that Elastic created Filebeat.

Collector and MQ

For an Elasticsearch cluster, capacity, machines, and shards are usually estimated in advance, and it is difficult to scale the cluster horizontally once it is running. Due to business and cost constraints, however, we cannot estimate capacity properly up front: each business side that uses the log service brings its own machines, which means each business side has its own Elasticsearch cluster.

Since each business side uses its own Elasticsearch cluster, no single cluster is under much pressure, and the Collector and MQ components are therefore of little use to us.

ETL

Because Elasticsearch's Ingest Node is perfectly adequate for our parsing needs, there is no need to introduce Logstash or other related components.

So far, we can basically see our architecture as follows:

Principles of architectural design:

  • A good fit is better than industry-leading
  • Simple is better than complex
  • Evolution is better than doing everything in one step

Specific implementation

Based on the requirements and the EFK stack, let's sort out what is special about our scenario:

  • The Docker log scenario is relatively simple: the containers are all deployed by product A, whose Docker naming rules are uniform, so the application name can be obtained by parsing docker.container.name. The IP address of the target machine is also known at deployment time, so application + IP can serve as the instance name.
  • The K8s scenario is likewise uniform: workloads are released and deployed by product B, whose Pod naming rules are consistent, so the application name can be obtained by parsing kubernetes.pod.name (a tenant is associated with namespaces, and each tenant corresponds one-to-one to a project). pod.name is unique within K8s and can serve as the instance name.
  • Text logs are mainly produced by applications running directly on machines. There is no automatic application migration in this scenario, so the application name and instance name can be attached as labels when the collector is deployed (see the configuration sketch after this list).
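
As a rough illustration, here is a minimal Filebeat configuration sketch covering the text-log and Docker cases. The paths, field names (app, instance), and pipeline name are assumptions made up for this example, not our actual configuration:

```yaml
# filebeat.yml - minimal sketch; paths, field names and pipeline name are illustrative
filebeat.inputs:
  # Text logs from applications deployed directly on the machine:
  # app/instance are attached as custom fields when the agent is deployed.
  - type: log
    paths:
      - /data/logs/*.log
    fields:
      app: demo-app                  # application name, known at deployment time
      instance: demo-app-10.0.0.1    # application + IP used as the instance name
    fields_under_root: true

  # Docker stdout logs: container metadata (e.g. the container name)
  # is attached by the processor below and parsed later in the ingest pipeline.
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log

processors:
  - add_docker_metadata: ~

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  pipeline: "log-clean"              # ingest pipeline that extracts app/instance and resets @timestamp
```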

The specific rules and parsing are shown in the figure below (instance handling is not yet annotated in the figure):

It is recommended that containerized applications write logs to standard output, and that applications deployed directly on machines write them to text files.

Filebeat is selected as the log collector and Elasticsearch is used to store and retrieve logs.

So where should log cleaning happen?

There are two ways to clean logs:

  • Collect the logs into Kafka first, then clean the data by consuming from Kafka with Logstash
  • Clean the data directly in Elasticsearch's Ingest Node, which also supports Grok expressions

For our scenario the data-cleaning requirements are relatively simple: mainly extracting the application and instance names, and handling the log time of text logs (resetting @timestamp and handling the time zone). We therefore chose the second option.
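
As a reference point, an ingest pipeline along these lines might look like the following (Kibana Dev Tools syntax). The pipeline name, field names, grok patterns, and time zone are illustrative assumptions, not our production rules:

```
PUT _ingest/pipeline/log-clean
{
  "description": "Illustrative sketch: extract app/instance and reset @timestamp",
  "processors": [
    {
      "grok": {
        "field": "docker.container.name",
        "patterns": ["%{DATA:app}-%{GREEDYDATA:instance_suffix}"],
        "ignore_missing": true
      }
    },
    {
      "grok": {
        "field": "message",
        "patterns": ["%{TIMESTAMP_ISO8601:log_time} %{GREEDYDATA:log_message}"],
        "ignore_missing": true
      }
    },
    {
      "date": {
        "if": "ctx.log_time != null",
        "field": "log_time",
        "formats": ["ISO8601", "yyyy-MM-dd HH:mm:ss,SSS"],
        "timezone": "Asia/Shanghai"
      }
    }
  ]
}
```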

In our plan we do not expose the Kibana interface to users directly; instead we developed our own interface tailored to our company's business.

Why develop our own front-end interface instead of just using Kibana?

  1. Kibana has a learning cost for business developers
  2. The Kibana interface does not correlate log content with business meaning well (choosing options in the interface beats typing queries over and over, which is why we parse out business information such as project, application, and instance)
  3. log-search supports Query String syntax, so developers who are familiar with Kibana get the same results in our own front-end interface (see the example after this list).
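
For instance, a developer used to Kibana could type something like the following Query String, which is passed through to Elasticsearch. The index pattern and field names here are purely illustrative:

```
GET logs-*/_search
{
  "query": {
    "query_string": {
      "query": "app:demo-app AND level:ERROR AND message:\"timeout\""
    }
  },
  "highlight": {
    "fields": { "message": {} }
  }
}
```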

For details about log-search, see github: github.com/jiankunking…

If heavy log cleaning is required, the first option can be used. Alternatively, the raw data can be stored in Elasticsearch and processed at query time; for example, in our scenario we could drop the raw logs into Elasticsearch and extract the application name in code whenever it is needed.

Monitoring and alerting

There are many things you can do based on logging, such as:

  • Log-based monitoring (in the spirit of Google Dapper)
  • Generating alerts based on logs
  • Machine learning based on logs

For specific ideas, please refer to the following figure:

Prerequisite: users can be required to print logs according to agreed rules. On building monitoring: the idea is roughly to get distributed tracing sorted out first, and then to strengthen service identification in the reported data or log data, i.e. to add a service-dimension perspective to monitoring.

Other

DaemonSet

Filebeat can be deployed as a DaemonSet to collect logs; in this mode it collects the host's container log directory, /var/lib/docker/containers. See the official guide "Running Filebeat on Kubernetes".
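
A minimal sketch of such a DaemonSet is shown below. The namespace, image version, and ConfigMap name are illustrative; the official manifest is more complete (RBAC, environment variables, and so on):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.17.0
          volumeMounts:
            # Mount the host's Docker container log directory read-only.
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /usr/share/filebeat/filebeat.yml
              subPath: filebeat.yml
      volumes:
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: filebeat-config
```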

Sidecar

A sidecar log-agent container runs inside the Pod and collects the logs generated by the Pod's main container.

Somehow I thought of Istio.

Filebeat can also collect container logs in Sidecar mode: Filebeat is deployed in the same Pod as the service container, configured with the paths or files to collect, and ships the logs to a destination such as Elasticsearch. Deploying Filebeat inside every Pod has the benefit of low coupling to the application and good scalability, although it requires extra configuration in each Pod's YAML.
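
A minimal sidecar sketch might look like this; the shared volume, log path, images, and ConfigMap name are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    # The main service container writes its log files into a shared emptyDir volume.
    - name: app
      image: demo-app:latest
      volumeMounts:
        - name: app-logs
          mountPath: /data/logs
    # The Filebeat sidecar reads those files and ships them to Elasticsearch.
    - name: filebeat
      image: docker.elastic.co/beats/filebeat:7.17.0
      volumeMounts:
        - name: app-logs
          mountPath: /data/logs
          readOnly: true
        - name: filebeat-config
          mountPath: /usr/share/filebeat/filebeat.yml
          subPath: filebeat.yml
  volumes:
    - name: app-logs
      emptyDir: {}
    - name: filebeat-config
      configMap:
        name: filebeat-sidecar-config
```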


Personal GitHub:

github.com/jiankunking

Personal Blog:

jiankunking.com