Preface
ELK (Elasticsearch, Logstash, Kibana) or EFK (Elasticsearch, Filebeat or Fluentd, Kibana) stacks are common choices when designing a company's container cloud logging solution. However, many of Elasticsearch's complex search features go unused at this stage, so we ultimately chose Grafana's open source Loki logging system. Let's look at some of Loki's basic concepts and architecture; of course, EFK remains a mature, industry-proven log aggregation solution that is well worth knowing and mastering.
Loki 2.0 released: transform logs as you're querying them, and set up alerts within Loki. The 2.0 update makes Loki noticeably easier to use.
Update history
02 October 2020 – First draft
Read the original: wsgzao.github.io/post/loki
Loki overview
Grafana Loki is a set of components that can be composed into a fully featured logging stack.
Unlike other logging systems, Loki is built around the idea of only indexing metadata about your logs: labels (just like Prometheus labels). Log data itself is then compressed and stored in chunks in object stores such as S3 or GCS, or even locally on the filesystem. A small index and highly compressed chunks simplify operations and significantly lower the cost of Loki.
Loki, the latest open source project from the Grafana Labs team, is a horizontally scalable, highly available, multi-tenant log aggregation system. It was designed to be economical, efficient, and easy to use: it does not index the log content, but instead attaches a set of labels to each log stream, optimized for Prometheus and Kubernetes users. The project was inspired by Prometheus, and its official description is: "Like Prometheus, but for logs."
Project address: github.com/grafana/lok…
Compared to other log aggregation systems, Loki has the following features:
- Logs are not full-text indexed. By storing compressed, unstructured logs and indexing only metadata, Loki is simpler and cheaper to operate.
- Logs are indexed and grouped using the same labels as Prometheus log streams, which allows logging to scale and operate more efficiently.
- Especially suitable for storing Kubernetes Pod logs; metadata such as Pod labels is automatically scraped and indexed.
- Natively supported by Grafana.
Background and Motivation
When there is a problem with an application or node running in our container cloud, the troubleshooting process goes roughly as follows:

Our monitoring system is based on Prometheus, in which Metrics and Alerts are the key pieces. A Metric records a current or historical value, and an Alert defines a threshold on a Metric that fires an alarm when crossed. But this information alone is clearly not enough.

As we all know, the basic scheduling unit of Kubernetes is the Pod, and a Pod writes its logs to stdout and stderr.

Here's an example: the memory usage of one of our Pods grows too large and triggers an Alert. The administrator goes to the dashboard to confirm which Pod has the problem, and then, to find out why the Pod's memory grew, also needs to query the Pod's logs. If there is no log system, we have to go to the node or query with commands:
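That manual check looks something like the following (the Pod name and namespace are placeholders, not from the original article):

```bash
# Confirm resource usage and recent events for the suspect Pod
kubectl describe pod my-app-0 -n prod

# Read its recent stdout/stderr output
kubectl logs my-app-0 -n prod --tail=100

# If the container restarted, inspect the previous container's logs
kubectl logs my-app-0 -n prod --previous
```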
If the application suddenly crashes at this point, we will not be able to find the relevant logs at all. Therefore, a log system is needed to collect logs centrally. But with ELK, you have to switch between Kibana and Grafana, which hurts the user experience. So the first goal of Loki is to minimize the switching cost between metrics and logs, helping to reduce response time to abnormal events and improve the user experience.
Problems with ELK
Many existing log collection solutions index logs with full-text search (such as the ELK stack). They offer rich functionality, but tend to be complex to scale, resource-hungry, and hard to operate. Many of the features go unused: most queries only cover a certain time range and a few simple parameters (host, service, etc.), so using these solutions can be overkill.
Therefore, the second goal of Loki is to strike a balance between ease of use and the expressiveness of the query language.
Cost considerations
Full-text search also brings cost problems. Simply put, sharding and replicating the inverted index of a full-text search engine (such as Elasticsearch) is expensive. Other designs have since emerged, such as:
- OKlog
Project address: github.com/oklog/oklog
OKLog adopts an eventually consistent, grid-based distribution strategy. These two design decisions save a lot of cost and keep operations very simple, but they make querying harder. Therefore, Loki's third goal is to provide a more cost-effective solution.
Overall architecture
Loki’s architecture is as follows:
As you can see, Loki’s architecture is very simple and consists of the following three parts:
- Loki is the main server, responsible for storing logs and processing queries.
- Promtail is the agent that collects logs and sends them to Loki.
- Grafana provides the UI.
Loki uses the same labels as Prometheus for its index. That means you can use the same labels to query both log content and monitoring data, which reduces the cost of switching between the two kinds of queries and greatly shrinks the log index.
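To illustrate (with a hypothetical shared `app` label; the metric name is a standard cAdvisor metric, not from this article), the same selector drives both query languages:

```
# PromQL: memory usage of the app's containers
container_memory_working_set_bytes{app="my-app"}

# LogQL: error logs of the same app, selected by the same label
{app="my-app"} |= "error"
```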
Loki implements Promtail using the same service discovery and label relabeling libraries as Prometheus. In Kubernetes, Promtail runs on each node as a DaemonSet, obtains the correct metadata for the logs through the Kubernetes API, and sends them to Loki. Here is the log storage architecture:
Read and write
Writing log data relies on Distributor and Ingester, and the overall process is as follows:
Distributor
Once Promtail collects logs and sends them to Loki, the Distributor is the first component to receive them. Because logs can arrive in very large volumes, they cannot be written to the database as they come in; that would overwhelm it. The data needs to be batched and compressed.
Loki does this by gzip-compressing logs as they come in and building them into blocks of compressed data. The Ingester is a stateful component that builds chunks and flushes them to storage when a chunk reaches a certain size or age. Each log stream corresponds to an Ingester; when logs reach the Distributor, a hash computed from the stream's metadata determines which Ingester they are sent to.

In addition, each stream is replicated n times (3 by default) for redundancy and resilience.
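The replication factor lives in the Ingester's ring configuration. A minimal sketch, assuming Loki 2.x config keys (values are illustrative, not from this article):

```yaml
ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory       # single-node demo; use consul/etcd in a real cluster
      replication_factor: 3   # each stream is written to 3 ingesters
```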
Ingester
The Ingester receives the log and starts building a chunk:

Basically, the logs are compressed and appended to the chunk. Once the chunk fills up (reaching a certain amount of data or a certain age), the Ingester flushes it to the database. We use separate databases for chunks and indexes because they store different types of data.

After flushing a chunk, the Ingester creates a new empty chunk and appends new entries to it.
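Both flush triggers can be tuned in the Ingester config. A sketch with illustrative values (option names from Loki's ingester config; defaults may differ):

```yaml
ingester:
  chunk_idle_period: 5m   # flush a chunk that has received no new entries for 5m
  max_chunk_age: 1h       # flush a chunk once it reaches this age, regardless of traffic
```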
Querier
Reading is straightforward: given a time range and a label selector, the Querier looks at the index to determine which chunks match, then greps through them and returns the results. It also fetches the latest data from the Ingesters that has not been flushed yet.

For each query, the Querier shows you all relevant logs. Queries are parallelized, providing a distributed grep, so even large queries remain manageable.
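You can also bypass Grafana and query the Querier's HTTP API directly; a sketch assuming the standard Loki 2.x API path and a `varlogs` job:

```bash
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="varlogs"} |= "error"' \
  --data-urlencode 'limit=10'
```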
Scalability
Loki's index can be stored in Cassandra, Bigtable, or DynamoDB, while chunks can go to all kinds of object storage. The Querier and Distributor are stateless components.

The Ingester, however, is stateful. When nodes are added or removed, the chunks across nodes are redistributed to fit the new hash ring. Cortex, the implementation underlying Loki's storage, has been running in production for years.
Install Loki
Installation methods
Instructions for different methods of installing Loki and Promtail.
- Install using Tanka (recommended)
- Install through Helm
- Install through Docker or Docker Compose
- Install and run locally
- Install from source
General process
In order to run Loki, you must:
- Download and install both Loki and Promtail.
- Download config files for both programs.
- Start Loki.
- Update the Promtail config file to get your logs into Loki (a config sketch follows this list).
- Start Promtail.
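For step 4, a minimal Promtail config sketch, closely following the defaults shipped with the Docker image (paths and labels are illustrative):

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml   # where Promtail records how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push   # Loki's push endpoint

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs              # becomes the {job="varlogs"} stream label
          __path__: /var/log/*log   # files to tail
```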
The official Loki documentation covers each method in detail; here I use Docker Compose for a simple demonstration.
Install with Docker Compose
version: "3"
networks:
loki:
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
networks:
- loki
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
command: -config.file=/etc/promtail/config.yml
networks:
- loki
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
networks:
- loki
Copy the code
Run the following commands in your command line. They work for Windows or Linux systems.
```bash
wget https://raw.githubusercontent.com/grafana/loki/v2.0.0/production/docker-compose.yaml -O docker-compose.yaml
docker-compose -f docker-compose.yaml up -d
```

```
[root@localhost loki]# docker-compose ps
     Name                   Command               State           Ports
---------------------------------------------------------------------------------
loki_grafana_1    /run.sh                         Up      0.0.0.0:3000->3000/tcp
loki_loki_1       /usr/bin/loki -config.file...   Up      0.0.0.0:3100->3100/tcp
loki_promtail_1   /usr/bin/promtail -config....   Up
```
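A couple of optional sanity checks once the containers are up (both endpoints are standard in Loki; not shown in the original article):

```bash
curl -s http://localhost:3100/ready            # Loki readiness probe
curl -s http://localhost:3100/metrics | head   # Loki's own Prometheus metrics
```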
Using Loki
After installation, access Grafana on port 3000 of the node above (the default credentials are admin:admin), then select Add data source:
grafana-loki-dashsource
Select Loki from the data source list and configure the source address of Loki:
grafana-loki-dashsource-config
Set the URL to http://loki:3100 and save the settings.
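If you prefer configuration as code, the same data source can be provisioned from a file, assuming Grafana's standard provisioning format (the file path is illustrative):

```yaml
# e.g. /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```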
After saving, switch to Explore on the left side of Grafana to enter Loki’s page:
grafana-loki
Then click Log labels to display the log labels collected by the current system, and filter the logs using these labels:
grafana-loki-log-labels
For example, selecting /var/log/messages filters down to entries from that file. Note that because of time zone differences you may need to adjust the time range before any data shows up:
grafana-loki-logs
Selectors
For the label part of the query expression, wrap it in curly braces {} and use key-value pair syntax to select labels. Multiple label expressions are separated by commas, for example:

```logql
{app="mysql",name="mysql-backup"}
```
Currently the following label matching operators are supported:

- `=`: exactly equal
- `!=`: not equal
- `=~`: matches the regular expression
- `!~`: does not match the regular expression
For example:

```logql
{name=~"mysql.+"}
{name!~"mysql.+"}
```
The same rules that apply to Prometheus tag selectors also apply to Loki log stream selectors.
For Loki’s original design documentation, check out the documentation here: Loki Design Documentation
Loki Cheat Sheet
See your logs
Start by selecting a log stream from the Log labels selector.
Alternatively, you can write a stream selector into the query field:
{job="default/prometheus"}
Here are some example streams from your logs:
{job="varlogs"}
Combine stream selectors
{app="cassandra",namespace="prod"}
Returns all log lines from streams that have both labels.
Filtering for search terms.
{app="cassandra"} |~ "(duration|latency)s*(=|is|of)s*[d.]+"
{app="cassandra"} |= "exact match"
{app="cassandra"} ! = "do not match"
LogQL supports exact and regular expression filters.
Count over time
count_over_time({job="mysql"}[5m])
This query counts all the log lines within the last five minutes for the MySQL job.
Rate
rate(({job="mysql"} |= "error" ! = "timeout")[10s])
This query gets the per-second rate of all non-timeout errors within the last ten seconds for the MySQL job.
Aggregate, count, and group
sum(count_over_time({job="mysql"}[5m])) by (level)
Get the count of logs during the last five minutes, grouping by level.
LogQL
For log retrieval, Loki uses a PromQL-like query syntax called LogQL.
LogQL: Log Query Language
Loki comes with its own PromQL-inspired language for queries called LogQL. LogQL can be considered a distributed grep
that aggregates log sources. LogQL uses labels and operators for filtering.
There are two types of LogQL queries:
- Log queries return the contents of log lines.
- Metric queries extend log queries and calculate sample values based on the content of logs from a log query.
Inspired by PromQL, Loki has its own query language, LogQL. Officially, it works like a distributed grep over aggregated log sources. Like PromQL, LogQL uses labels and operators to filter; a query is made up of two parts:
- Log Stream selector
- Filter expression
We can combine these two parts to get the functionality we want from Loki. Typically they are used to:

- view log content selected by the log stream selector;
- calculate relevant metrics on a log stream via filter rules.
Log Stream Selector
Like PromQL, the log stream selector determines which log streams to query by matching on the labels attached to incoming logs. Label matching supports the following operators:
- `=`: exact match
- `!=`: does not match
- `=~`: matches the regular expression
- `!~`: does not match the regular expression

```logql
{app="mysql",name=~"mysql-backup.+"}
```
Filter Expression
{instance=~"kafka-[23]",name="kafka"} ! = "kafka.server:type=ReplicaManager"Copy the code
| =
: Log line contains string.! =
: Log line does not contain string.| ~
: Log line matches regular expression.! ~
: Log line does not match regular expression.
Metric Queries
This is very similar to Prometheus's range queries, for example:

```logql
rate({job="mysql"} |= "error" != "timeout" [5m])
```

- `rate`: calculates the number of entries per second.
- `count_over_time`: counts the entries for each log stream within the given range.
- `bytes_rate`: calculates the number of bytes per second for each stream.
- `bytes_over_time`: counts the amount of bytes used by each log stream for a given range.
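For instance, a throughput query using one of these functions on the same hypothetical MySQL job:

```logql
bytes_rate({job="mysql"}[5m])
```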
Aggregation operators
Some aggregation operations are also supported, for example:

```logql
avg(rate(({job="nginx"} |= "GET")[10s])) by (region)
```
- `sum`: calculate sum over labels
- `min`: select minimum over labels
- `max`: select maximum over labels
- `avg`: calculate the average over labels
- `stddev`: calculate the population standard deviation over labels
- `stdvar`: calculate the population standard variance over labels
- `count`: count the number of elements in the vector
- `bottomk`: select the smallest k elements by sample value
- `topk`: select the largest k elements by sample value
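For example (an illustrative query; the `region` and `name` labels are assumptions), the ten busiest streams by log rate in a region:

```logql
topk(10, sum(rate({region="us-east1"}[5m])) by (name))
```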
Many other operations, such as `and` and `or`, are also supported; see the LogQL documentation for details:

grafana.com/docs/loki/l…
LogQL in 5 minutes
References
Loki Documentation
Loki Posts
Loki: Prometheus-inspired, open source logging for cloud natives
Loki log system details
Grafana log aggregation tool Loki
Grafana Loki brief tutorial
Loki Best Practices
Loki gets a major 2.0 update, with significantly enhanced LogQL syntax!