Many new Loki users have no idea where to start with the distributor, ingester, querier, and the various third-party stores they depend on. The official documentation for cluster deployment is also sketchy, which makes deployment difficult for beginners. Besides the official Helm chart, there is a cluster deployment pattern for production environments tucked away in the `production` directory of the Loki repository.
The community uses docker-compose there to quickly bring up a Loki cluster. Obviously we are not going to run docker-swarm on a single node just to use that docker-compose file in production, but the Loki architecture and configuration files it contains are well worth studying.
So what makes this solution special compared to a plain distributed Loki cluster? First, take a look at the following architecture diagram:
As you can see, there are three obvious differences:
- The Loki core services distributor, ingester, and querier are not separated; they run together in a single instance;
- External KV stores such as Consul and etcd are dropped; memberlist maintains cluster state directly in memory;
- boltdb-shipper is used instead of other log index stores.
As a result, the overall architecture of the Loki cluster is clearer and depends on fewer external systems. To summarize: apart from S3 storage for chunks and indexes, only a caching service is needed to speed up log queries and writes.
Since Loki 2.0, the boltdb index store has been substantially reworked. The new boltdb-shipper mode can store Loki's index on S3 and do away with Cassandra or Google Bigtable entirely, which makes horizontal scaling of the services much easier. For more details about boltdb-shipper, see: grafana.com/docs/loki/l…
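The single-process layout described above corresponds to Loki's monolithic deployment target. A minimal sketch of the relevant setting (assuming an otherwise complete Loki config file):

```yaml
# Run distributor, ingester, querier, etc. together in one process.
# "all" is Loki's default target when none is specified.
target: all
```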
With all that said, let's take a look at what is different about this solution's configuration.
The native part
memberlist
```yaml
memberlist:
  join_members: ["loki-1", "loki-2", "loki-3"]
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946
```
Loki's memberlist uses the Gossip protocol to achieve eventual consistency among all nodes in the cluster. This part of the configuration is almost entirely gossip frequency and timeout tuning, so the defaults are fine.
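As a toy illustration of how gossip converges (a simplification for intuition only, not Loki's actual memberlist implementation), each node periodically pushes what it knows to a random peer, and receivers keep the newest version of each entry, until every node shares the same view:

```python
import random

def gossip_round(states, peers=1):
    """One gossip round: every node pushes its known entries to `peers`
    random other nodes; receivers merge by keeping the newer version."""
    n = len(states)
    for i in range(n):
        for j in random.sample([k for k in range(n) if k != i], peers):
            for key, (ver, val) in states[i].items():
                if key not in states[j] or states[j][key][0] < ver:
                    states[j][key] = (ver, val)

# Three nodes; initially only node 0 knows that ingester loki-1 is ACTIVE.
random.seed(42)
states = [{"loki-1": (1, "ACTIVE")}, {}, {}]
rounds = 0
while any("loki-1" not in s for s in states):
    gossip_round(states)
    rounds += 1
print(f"converged after {rounds} round(s)")
```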
ingester
```yaml
ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    ring:
      replication_factor: 2
      kvstore:
        store: memberlist
    final_sleep: 0s
```
The ingesters' state is synchronized to all members of the cluster via the Gossip protocol, and the ingester replication factor is set to 2. That is, each log stream is written to two ingester instances simultaneously to ensure data redundancy.
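With `replication_factor: 2`, the distributor picks two distinct ingesters from a consistent hash ring for each stream. A toy sketch of that token-ring selection (a simplification for intuition, not Loki's actual ring code):

```python
import hashlib
from bisect import bisect

class Ring:
    """Toy hash ring: each ingester owns several tokens; a stream is
    replicated to the next `rf` distinct ingesters clockwise from its hash."""
    def __init__(self, ingesters, tokens_per_node=16, rf=2):
        self.rf = rf
        self.tokens = sorted(
            (self._hash(f"{name}-{i}"), name)
            for name in ingesters for i in range(tokens_per_node)
        )

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def replicas(self, stream):
        idx = bisect(self.tokens, (self._hash(stream),))
        out, n = [], len(self.tokens)
        while len(out) < self.rf:
            name = self.tokens[idx % n][1]
            if name not in out:
                out.append(name)
            idx += 1
        return out

ring = Ring(["loki-1", "loki-2", "loki-3"])
print(ring.replicas('{app="nginx"}'))  # two distinct ingesters for this stream
```

The same stream always hashes to the same pair of ingesters, which is what makes reads and deduplication predictable.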
extension
The native cluster-mode configuration from the community is still not quite enough: apart from the reasonably complete memberlist section, the rest does not meet production requirements. Below are the simple modifications I made, shared here for reference.
storage
Index and chunk storage are unified into S3 object storage, so that Loki can completely shed its third-party dependencies.
```yaml
schema_config:
  configs:
    - from: 2021-04-25
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    shared_store: aws
    active_index_directory: /loki/index
    cache_location: /loki/boltdb-cache
  aws:
    s3: s3://<S3_ACCESS_KEY>:<S3_SECRET_KEY>@<S3_URL>/<S3_BUCKET>
    s3forcepathstyle: true
    insecure: true
```
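One pitfall with the `s3://<key>:<secret>@<endpoint>/<bucket>` DSN shape used above: if the access key or secret key contains characters such as `/` or `+`, they must be percent-encoded or the URL will not parse correctly. A small helper sketch (the key values here are made up for illustration):

```python
from urllib.parse import quote

def s3_dsn(access_key: str, secret_key: str, endpoint: str, bucket: str) -> str:
    # Credentials embedded in a URL must be percent-encoded; otherwise
    # characters like '/' or '+' in the secret break URL parsing.
    return "s3://{}:{}@{}/{}".format(
        quote(access_key, safe=""), quote(secret_key, safe=""), endpoint, bucket
    )

# Hypothetical credentials containing URL-unsafe characters:
print(s3_dsn("AKIAXXXX", "abc/def+ghi", "minio:9000", "loki-data"))
```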
It's worth noting that the log stream index type used is boltdb-shipper, whose shared_store setting lets the index be written to S3. active_index_directory is the local directory holding the active index files before they are shipped, and cache_location is where Loki caches index files downloaded back from the object store.
In fact, the index path the ingester uploads to S3 is /index/.
redis
The native solution provides no caching, so here we introduce Redis for query and write caching. For those wondering whether to use a single Redis instance or several: it depends on the size of your cluster. For small clusters, one Redis instance is enough.
```yaml
query_range:
  results_cache:
    cache:
      redis:
        endpoint: redis:6379
        expiration: 1h
  cache_results: true

index_queries_cache_config:
  redis:
    endpoint: redis:6379
    expiration: 1h

chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: redis:6379
      expiration: 1h
  write_dedupe_cache_config:
    redis:
      endpoint: redis:6379
      expiration: 1h
```
ruler
Since Loki is deployed as a cluster, the ruler service naturally has to be clustered as well. Unfortunately this part is missing from the community configuration, so we have to complete it ourselves. We know that the ruler's alerting rules can be stored on S3, and that each ruler instance is assigned its own rules via a consistent hash ring. The configuration can therefore look like this:
```yaml
ruler:
  storage:
    type: s3
    s3:
      s3: s3://<S3_ACCESS_KEY>:<S3_SECRET_KEY>@<S3_URL>/<S3_RULES_BUCKET>
      s3forcepathstyle: true
      insecure: true
      http_config:
        insecure_skip_verify: true
  enable_api: true
  enable_alertmanager_v2: true
  alertmanager_url: "http://<alertmanager>"
  ring:
    kvstore:
      store: memberlist
```
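For reference, a rules file the ruler would load from the rules bucket might look like the following (a hypothetical example: the app label and thresholds are made up, and rules are conventionally stored under a per-tenant path such as /rules/<tenant_id>/ in the bucket):

```yaml
groups:
  - name: example-log-alerts
    rules:
      - alert: HighErrorRate
        # LogQL: per-app rate of error lines over the last 5 minutes
        expr: |
          sum(rate({app="myapp"} |= "error" [5m])) by (app) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.app }}"
```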
Support kubernetes
Finally, and most importantly, this official Loki cluster solution needs to be deployable on Kubernetes, otherwise all of the above is moot. Due to space constraints, I have pushed the manifests to GitHub; you can clone them and deploy directly.
GitHub address: github.com/CloudXiaoba…
The manifests depend on only one S3 object store, so before deploying to production make sure you have the object store's access key and secret key ready. After filling them into installation.sh, run the script directly to start the installation.
The ServiceMonitor in the manifests provides Prometheus Operator metrics service discovery for Loki; you can deploy it optionally.
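For reference, a minimal ServiceMonitor for Loki might look like this (a sketch only; the label selector and port name are assumptions and must match how the Loki Service in the manifests is actually defined):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki
  labels:
    release: prometheus    # must match your Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: loki            # assumed label on the Loki Service
  endpoints:
    - port: http-metrics   # assumed metrics port name; Loki exposes /metrics
      path: /metrics
```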
conclusion
This article introduced an official cluster deployment solution for Loki in production, extended it with configuration such as caching and S3 object storage, and adapted the official docker-compose deployment to Kubernetes. The official solution effectively simplifies the complex structure of a distributed Loki deployment and is well worth studying.
“Cloud Born white”