Architecture diagram

The whole system is divided into three parts:

  • Agent: Collects tracing and metrics data and reports it to the OAP
  • OAP: Receives tracing and metrics data, writes it to persistent storage (ES, H2 (in-memory database), MySQL, etc.) through the Analysis Core module, performs secondary aggregation, and evaluates alarms
  • Webapp (UI): The front end and back end are separated. The front end handles rendering and wraps each query as a GraphQL request to the back end. The back end forwards the query to the OAP cluster, load-balanced through Ribbon, and the results are returned for display

Image version selection

Deployment files are provided for 8.0.0 and 6.6.0, and three images are involved. Since the only officially provided deployment method is Helm, we looked up the official images on Docker Hub and wrote our own YAML files for the deployment

Select UI and OAP images of the same version. Note that the OAP image comes in ES variants. Since ES is also the official default storage, we use the ES variant as well

  1. apache/skywalking-oap-server:6.6.0-es7 or apache/skywalking-oap-server:8.0.0-es7
  2. apache/skywalking-ui:6.6.0 or apache/skywalking-ui:8.0.0

Since no official agent image is provided, we need to take the agent files from the official distribution and build the image ourselves

Download address: skywalking.apache.org/downloads/ (choose the matching version)

That is everything you need to know about the images

K8s deployment essentials

Deploy the OAP and UI services as Deployments, and attach the agent to each microservice application in sidecar fashion. This requires no changes to application code and places no restriction on the language of the original application. OAP needs its configuration files injected through a ConfigMap, including

  1. application.yml: the application configuration
  2. alarm-settings.yml: the alarm rule configuration
  3. log4j2.xml: the log format, configured for ELK collection

OAP

Port mapping: mainly gRPC and REST

The gRPC port is the important one. Agents connect to it to push data and make remote calls

The REST port serves the official GraphQL API. See github.com/apache/skyw…

ports:
- containerPort: 11800
  name: grpc
  protocol: TCP
- containerPort: 12800
  name: rest
  protocol: TCP

Environment variables

If you are deploying version 6.x, you need to add this setting

- name: SW_L0AD_CONFIG_FILE_FROM_VOLUME
  value: "true"

The reason is the following logic in the startup script docker-entrypoint.sh of the official 6.x image

if [[ -z "$SW_L0AD_CONFIG_FILE_FROM_VOLUME" ]] || [[ "$SW_L0AD_CONFIG_FILE_FROM_VOLUME" != "true" ]]; then
    generateApplicationYaml
    echo "Generated application.yml"
    echo "--------------------------------------------------"
    cat ${var_application_file}
    echo "--------------------------------------------------"
fi

Without this environment variable, the generated configuration file is used and our injected configuration file is ignored, which causes the service to fail to start
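
The check in that script boils down to a single condition. The Python sketch below mirrors it for illustration only. The function name should_generate_config is made up here, and the real logic lives in the shell script inside the image.

```python
# Hypothetical helper that mirrors the shell condition shown above.
FLAG = "SW_L0AD_CONFIG_FILE_FROM_VOLUME"  # spelled with a zero, as in the script


def should_generate_config(env):
    """Return True when the entrypoint would generate application.yml itself.

    An unset or empty variable and any value other than "true" both lead to
    generation, so the whole condition reduces to: value != "true".
    """
    return env.get(FLAG, "") != "true"


if __name__ == "__main__":
    print(should_generate_config({}))              # variable not set
    print(should_generate_config({FLAG: "true"}))  # mounted file is kept
```

Only the exact string "true" keeps the mounted file in this sketch, which is why the value is quoted in the Deployment YAML.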

Java agent

SkyWalking collects data in push mode: services push their metrics and traces to the OAP service for processing. There are several ways to implement this. The agent-based approach is the least intrusive to the system, and delivering the agent in sidecar fashion is a common pattern in microservice architectures

To build the image, we only need to copy the agent files into it

FROM busybox:latest 

RUN mkdir -p /opt/skywalking/agent/

ADD agent/ /opt/skywalking/agent/

WORKDIR /

The agent is copied into a shared volume by an init container (initContainers)

initContainers:
- name: init-agent
  image: xxx.com/skywalking-agent:latest
  command:
  - 'sh'
  - '-c'
  - 'set -ex; mkdir -p /skywalking/agent; cp -r /opt/skywalking/agent/* /skywalking/agent; '
  volumeMounts:
  - name: agent
    mountPath: /skywalking/agent

The container’s start command is then overridden with command

command: ["/bin/bash", "-c", "java -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=xxx-user -Dskywalking.collector.backend_service=$SKYWALKING_ADDR -jar /app.jar"]

This involves how Docker’s ENTRYPOINT and CMD relate to command and args in Kubernetes YAML.

Splitting a start line into a command and its arguments is awkward. Since the services take no extra arguments, it is simpler to put everything in command.

Docker’s ENTRYPOINT corresponds to command in the YAML, and CMD corresponds to args. Setting command overrides the container’s own startup command.

So check whether the container has any special startup procedure. If it starts through a script, for example, invoking java -jar directly will not work, and the command must be adjusted to the actual situation

Parameter meanings:

  1. -javaagent: standard JVM option that specifies the agent path
  2. -Dskywalking.agent.service_name: the service name displayed in SkyWalking
  3. -Dskywalking.collector.backend_service: the gRPC address of the OAP service to connect to
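
To make the role of each parameter concrete, here is a minimal Python sketch that assembles the start line from its parts. The helper name build_java_command is hypothetical. The paths and the service name are the placeholder values used in this article, and oap:11800 assumes the OAP Service is reachable under the name oap.

```python
# Hypothetical helper: assembles the JVM start line from the three agent
# parameters described above. It only builds a list, it does not run java.
def build_java_command(agent_jar, service_name, backend, app_jar):
    return [
        "java",
        f"-javaagent:{agent_jar}",                            # agent path
        f"-Dskywalking.agent.service_name={service_name}",    # name shown in the UI
        f"-Dskywalking.collector.backend_service={backend}",  # OAP gRPC address
        "-jar",
        app_jar,
    ]


if __name__ == "__main__":
    cmd = build_java_command(
        "/opt/skywalking/agent/skywalking-agent.jar",
        "xxx-user",
        "oap:11800",
        "/app.jar",
    )
    # Joined with spaces, this is the string passed to /bin/bash -c above.
    print(" ".join(cmd))
```

Keeping the agent options ahead of -jar matters: anything placed after the jar path is passed to the application as program arguments instead of to the JVM.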

Configure the agent

This mainly configures how the agent collects data. The settings live in agent/config/agent.config, so they can be baked into the image at build time or injected through environment variables

agent.config can read environment variable values, so we can set environment variables in the Dockerfile or YAML to override agent parameters
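
The ${NAME:default} placeholders used in agent.config (and in application.yml) can be read as "take the environment variable NAME if it is set, otherwise the default". The following Python sketch illustrates that lookup under this assumption. It is not the agent's own parser, and the function name resolve_placeholders is made up for the example.

```python
import re

# Matches ${NAME:default}. The default may itself contain colons,
# for example an address such as 127.0.0.1:11800.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve_placeholders(text, env):
    """Replace each ${NAME:default} with env[NAME], or the default if unset."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        return env.get(name, default)

    return _PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    line = "collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}"
    print(resolve_placeholders(line, {}))
    print(resolve_placeholders(line, {"SW_AGENT_COLLECTOR_BACKEND_SERVICES": "oap:11800"}))
```

This is why a single image can be reused across environments: the file in the image stays the same and only the environment variables in the Deployment change.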

Configuration reference: github.com/VanLiuZhi/d…

Ignoring endpoints

  1. In the agent directory, copy apache-skywalking-apm-bin-es7\agent\optional-plugins\apm-trace-ignore-plugin-6.6.0.jar to apache-skywalking-apm-bin-es7\agent\plugins

  2. In apache-skywalking-apm-bin-es7\agent\config, create a new configuration file apm-trace-ignore-plugin.config with the content trace.ignore_path=${SW_AGENT_TRACE_IGNORE_PATH:/healthy/**}

Endpoints matching /healthy/** are then ignored
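
To get a feel for what a pattern such as /healthy/** covers, the Python sketch below implements a simplified path matcher in which ** spans any number of path segments, * stays within one segment, and ? matches one character. These wildcard rules are an assumption made for illustration, and the plugin's own matcher may treat edge cases differently. Both function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard path pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # any characters within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def is_ignored(path, patterns):
    """Return True if the path matches at least one ignore pattern."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


if __name__ == "__main__":
    patterns = ["/healthy/**"]
    for path in ["/healthy/live", "/healthy/db/ping", "/api/users"]:
        print(path, is_ignored(path, patterns))
```

Health-check endpoints are the typical candidates, since probes call them constantly and their traces add volume without adding information.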

Configuration files

The following is an overview of the configuration files. Choose mount paths according to the directory layout of the original image, and preferably mount individual files rather than overwriting a whole directory. The configuration files themselves can be taken from inside the image or from the official distribution

application.yml

Cluster configuration

This section configures the deployment mode of the OAP service. Here Kubernetes is used for cluster coordination, and increasing the replica count makes the service highly available

kubernetes:
  watchTimeoutSeconds: ${SW_CLUSTER_K8S_WATCH_TIMEOUT:60}
  namespace: ${SW_CLUSTER_K8S_NAMESPACE:skywalking-min}
  labelSelector: ${SW_CLUSTER_K8S_LABEL:app=oap}
  uidEnvName: ${SW_CLUSTER_K8S_UID:SKYWALKING_COLLECTOR_UID}

Change the label selector to match the corresponding OAP Deployment. In addition, when Kubernetes is used as the cluster mode, the OAP needs Kubernetes RBAC access permissions. In our tests on version 8.0.0, the Kubernetes cluster mode did not work without those permissions

Storage configuration (where data is stored)

Set the ES address, user name, and password. nameSpace is prepended to index names to distinguish them from other indexes in the ES cluster. Adjust the number of replicas and shards, and set the data retention time with recordDataTTL

storage:
  elasticsearch7:
    nameSpace: "eos_sw"
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:10.90.xx.xx:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    # trustStorePath: ${SW_SW_STORAGE_ES_SSL_JKS_PATH:"../es_keystore.jks"}
    # trustStorePass: ${SW_SW_STORAGE_ES_SSL_JKS_PASS:""}
    user: ${SW_ES_USER:"elastic"}
    password: ${SW_ES_PASSWORD:"123456"}
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1}
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1}
    # Those data TTL settings will override the same settings in core module.
    recordDataTTL: ${SW_STORAGE_ES_RECORD_DATA_TTL:3} # Unit is day
    otherMetricsDataTTL: ${SW_STORAGE_ES_OTHER_METRIC_DATA_TTL:45} # Unit is day
    monthMetricsDataTTL: ${SW_STORAGE_ES_MONTH_METRIC_DATA_TTL:18} # Unit is month
    # Batch process setting, refer to https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.5/java-docs-bulk-processor.html
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the bulk every 1000 requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}

Note that the ES storage configuration differs between 8.x and 6.x. It is therefore best to download the matching official distribution and use its file as the basis for your changes

alarm-settings.yml

Alarm configuration

Example: Monitoring service response time

# rule name
service_instance_resp_time_rule:
  # service_instance_resp_time is a metric defined by an OAL query
  metrics-name: service_instance_resp_time
  op: ">"
  # threshold
  threshold: 10
  # cycle
  period: 10
  # number
  count: 2
  # Silence time after alarm
  silence-period: 5
  message: Response time of service instance {name} is more than 10ms in 2 minutes of last 10 minutes

Official description: the threshold is 10 milliseconds, which ordinary requests reach easily. The period is 10 minutes and the count is 2, so an alarm is generated when the average response time exceeds 10 milliseconds in 2 minutes of the last 10 minutes. After an alarm fires, it stays silent for 5 minutes before it can fire again

I feel the official description does not really convey the concept of “period”, though that may be a translation issue. The message field above is the original official wording.

period is the evaluation window, in minutes, here set to 10. Within a 10-minute window, if the threshold is exceeded twice, even within the first 30 seconds, an alarm is triggered, the silence period begins, and evaluation then continues. If the threshold is exceeded only once within 10 minutes and again 11 minutes later, no alarm is generated. Even with a silence period of 0, alarms are not sent back to back: the minimum interval is 1 minute

Actual test results:

silence-period determines how long alarms stay silent after one fires. If the service stays faulty and silence-period is 0, an alarm is pushed every minute. If silence-period is 1, an alarm is pushed every 2 minutes (again assuming the service stays faulty).
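
The interplay of period, count, and silence-period is easier to follow in a small simulation. The Python sketch below is a simplified model built from the behavior described above: one metric value per minute, a sliding window of period minutes, an alarm once at least count values in the window exceed the threshold, and silence-period minutes of quiet afterwards. It is not the OAP implementation, and the function name simulate_alarms is hypothetical.

```python
from collections import deque


def simulate_alarms(values, threshold, period, count, silence_period):
    """Return the minute indexes at which an alarm would be pushed."""
    window = deque(maxlen=period)  # keeps only the last `period` minutes
    silence = 0                    # remaining silent minutes
    fired = []
    for minute, value in enumerate(values):
        window.append(value)
        breaches = sum(1 for v in window if v > threshold)
        if silence > 0:
            silence -= 1           # still silent, skip this check
            continue
        if breaches >= count:
            fired.append(minute)
            silence = silence_period
    return fired


if __name__ == "__main__":
    always_slow = [50] * 10  # response time stays above the 10 ms threshold
    print(simulate_alarms(always_slow, 10, 10, 2, 0))
    print(simulate_alarms(always_slow, 10, 10, 2, 1))
    # One breach at minute 0 and the next at minute 11: never two in one window.
    sparse = [50] + [1] * 10 + [50]
    print(simulate_alarms(sparse, 10, 10, 2, 5))
```

In this model a permanently slow service alarms every minute with silence-period 0 and every second minute with silence-period 1, which matches the test results reported above.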

Indicators that can be monitored:

  1. Service monitoring
  2. Instance monitoring
  3. Monitoring between services
  4. Monitoring between instances
  5. Endpoint monitoring (that is, interface path)
  6. Monitoring between endpoints
  7. JVM and database access monitoring

OAL example:

// Calculate the p99 of the listed endpoints
Endpoint_p99 = from(Endpoint.latency).filter(name in ("Endpoint1", …)).summary(0.99)

// Calculate the p99 of endpoints whose name starts with serv
serv_Endpoint_p99 = from(Endpoint.latency).filter(name like ("serv%")).summary(0.99)

// Calculate the average response time of each Endpoint
Endpoint_avg = from(Endpoint.latency).avg()

// Calculate the p50, p75, p90, p95 and p99 latency of each Endpoint
Endpoint_percentile = from(Endpoint.latency).percentile(10)

// Calculate the percentage of responses whose status is true, for each service
Endpoint_success = from(Endpoint.*).filter(status = "true").percent()

// Calculate the percentage of response codes starting with 2
Endpoint_200 = from(Endpoint.*).filter(responseCode like "2%").percent()

// Calculate the percentage of response codes starting with 5
Endpoint_500 = from(Endpoint.*).filter(responseCode like "5%").percent()

// Calculate the total number of calls
… = from(Endpoint.*).sum()

disable(segment);
disable(endpoint_relation_server_side);
disable(top_n_database_statement);

In versions after 6.5.0, the alarm rules can be modified dynamically through the configuration center

SkyWalking

Official Chinese translation: github.com/SkyAPM/docu…

Rapid build: skywalking.apache.org/zh/blog/202…

-Dskywalking.agent.service_name=skywalking-test-local -Dskywalking.collector.backend_service=127.0.0.1:11800 -javaagent:D:\JavaLearProject\apache-skywalking-apm-bin-es7\agent\skywalking-agent.jar

Concepts

The gRPC API of the backend listens on 0.0.0.0:11800, and the REST API listens on 0.0.0.0:12800

Startup modes

Different deployment tools, such as Kubernetes, may need different startup modes. Besides the default, two other optional startup modes are offered.

Default mode: performs initialization if required, then starts listening and provides the service.

Run /bin/oapService.sh (.bat) to start in this mode. startup.sh (.bat) also starts the OAP in this mode.

Initialization mode: the OAP server starts, performs initialization, and then exits. Use this mode to initialize storage, such as ElasticSearch indexes, MySQL and TiDB tables, and initial data.

Run /bin/oapServiceInit.sh (.bat) to start in this mode.

Non-initialization mode: the OAP server does not perform initialization. Instead it waits until the ElasticSearch indexes or the MySQL and TiDB tables exist, then starts listening and serving. In other words, this OAP server expects another OAP server to do the initialization.

Run /bin/oapServiceNoInit.sh (.bat) to start in this mode.

Configuration file

application.yml is the core configuration file

Level 1 is the module name, which defines the module. Level 2 is the module type. Level 3 holds the attributes of that type.

storage:
  selector: mysql # the mysql storage will actually be activated, while the h2 storage takes no effect
  h2:
    driver: ${SW_STORAGE_H2_DRIVER:org.h2.jdbcx.JdbcDataSource}
    url: ${SW_STORAGE_H2_URL:jdbc:h2:mem:skywalking-oap-db}
    user: ${SW_STORAGE_H2_USER:sa}
    metadataQueryMaxSize: ${SW_STORAGE_H2_QUERY_MAX_SIZE:5000}
  mysql:
    properties:
      jdbcUrl: ${SW_JDBC_URL:"jdbc:mysql://localhost:3306/swtest"}
      dataSource.user: ${SW_DATA_SOURCE_USER:root}
      dataSource.password: ${SW_DATA_SOURCE_PASSWORD:root@1234}
      dataSource.cachePrepStmts: ${SW_DATA_SOURCE_CACHE_PREP_STMTS:true}
      dataSource.prepStmtCacheSize: ${SW_DATA_SOURCE_PREP_STMT_CACHE_SQL_SIZE:250}
      dataSource.prepStmtCacheSqlLimit: ${SW_DATA_SOURCE_PREP_STMT_CACHE_SQL_LIMIT:2048}
      dataSource.useServerPrepStmts: ${SW_DATA_SOURCE_USE_SERVER_PREP_STMTS:true}
    metadataQueryMaxSize: ${SW_STORAGE_MYSQL_QUERY_MAX_SIZE:5000}
  # other configurations

Here storage is the module name, selector picks which module type is active, and h2 and mysql are the available module types. Where a module lists a default type, its entries are the module’s default attributes. driver, url, …, metadataQueryMaxSize are type attributes.

Modules are either required or optional. The required modules provide the skeleton of the back end, and even though modularity makes them pluggable, removing them makes no sense. Some optional modules have a provider implementation named “none”, which is only a shell with no actual logic. Telemetry is the classic example. Setting selector to “-” excludes the entire module at run time. We strongly recommend not changing the APIs of these modules unless you are very familiar with the SkyWalking project and its code.
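
The three-level structure and the role of selector can be sketched in a few lines of Python. The dictionary below stands in for a parsed storage module with abbreviated settings. The function name active_provider is hypothetical, and the code only illustrates the selection rule described above rather than how OAP loads its modules.

```python
# A parsed module: level 1 is the module ("storage"), level 2 the module
# types ("h2", "mysql"), level 3 the attributes of each type.
storage_module = {
    "selector": "mysql",
    "h2": {"url": "jdbc:h2:mem:skywalking-oap-db", "user": "sa"},
    "mysql": {"metadataQueryMaxSize": 5000},
}


def active_provider(module_config):
    """Return (type name, attributes) of the selected type.

    A selector of "-" means the module is excluded, reported here as
    (None, {}).
    """
    selector = module_config["selector"]
    if selector == "-":
        return None, {}
    return selector, module_config[selector]


if __name__ == "__main__":
    name, settings = active_provider(storage_module)
    print(name, settings)  # the h2 block is present but has no effect
```

Unselected blocks can therefore stay in the file without harm, which makes switching storage a one-line change to selector.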

List of required modules

  • Core: the basic and main framework for all data analysis and stream dispatching.
  • Cluster: manages multiple back-end instances in a cluster, which provides high-throughput processing capacity.
  • Storage: persists the analysis results.
  • Query: provides the query interfaces to the UI.

Cluster and Storage have multiple implementations. See the Cluster Management and Choose Storage documents in the link list.

Several receiver modules are also provided. A receiver is a module that accepts incoming data requests to the back end. Most (if not all) are served over RPC-style protocols such as gRPC and RESTful HTTP. Receivers come under a number of different module names. See the Set Receivers document in the link list.

Java process in K8S

java -Dapp.id=spring-demo -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=spring-demo -Dskywalking.collector.backend_service=oap:11800 -jar /app/app.jar

Alarms

Entity names: the relationship between each scope and its entity name.

  • Service: service name
  • Instance: {instance name} of {service name}
  • Endpoint: {endpoint name} in {service name}
  • Database: database service name
  • Service relationship: {source service name} to {target service name}
  • Instance relationship: {source instance name} of {source service name} to {target instance name} of {target service name}
  • Endpoint relationship: {source endpoint name} in {source service name} to {target endpoint name} in {target service name}

Whether an alarm triggers is determined by the period, the count, and the silence period

Official description:

The average endpoint response time exceeded 1 second in the last 2 minutes

service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes

Example: Monitoring service response time

metrics-name: service_instance_resp_time
op: ">"
threshold: 10
period: 10
count: 2
silence-period: 5

The threshold is 10 milliseconds, which ordinary requests reach easily. The period is 10 minutes and the count is 2. An alarm is generated if the average response time exceeds 10 milliseconds in 2 minutes of the last 10. After an alarm fires, it stays silent for 5 minutes. This is the official description

The window is 10 minutes and count is the number of breaches. If the metric exceeds the threshold twice within 10 minutes, an alarm is triggered, followed by 5 minutes of silence. Then the cycle repeats

Within 10 minutes, if the threshold is exceeded twice, even within the first 30 seconds, an alarm is triggered and the silence period starts. If the threshold is exceeded only once within 10 minutes and again 11 minutes later, no alarm is generated. Even with a silence period of 0, alarms are not sent back to back: the minimum interval is 1 minute


Monitoring page UI metric concepts

CPM: calls (requests) per minute

SLA: service level agreement

It is a guarantee of a service’s performance and availability at a given cost.

For website availability SLAs, the more nines, the longer the service is available over the year, the more reliable it is, and the shorter the downtime

1 year = 365 days = 8760 hours

99.9%: 8760 * 0.1% = 8760 * 0.001 = 8.76 hours

99.99%: 8760 * 0.0001 = 0.876 hours = 0.876 * 60 = 52.6 minutes

99.999%: 8760 * 0.00001 = 0.0876 hours = 0.0876 * 60 = 5.26 minutes

To put that in perspective, reaching 99.999%, or five nines, allows only 5.26 minutes of downtime a year
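
The same arithmetic can be reproduced for any availability target with a few lines of Python. This is only a restatement of the calculation above, and the function name allowed_downtime_hours is made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Yearly downtime budget, in hours, for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}%: {hours:.4f} hours = {hours * 60:.2f} minutes")
```

Each extra nine divides the downtime budget by ten, which is why every additional nine is so much harder to achieve than the previous one.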

Percentile: SkyWalking reports latency figures such as P50, P90, and P95, which are percentiles.

For example, p99 = 520 means that 99% of requests completed in under 0.52 seconds, so only the slowest 1% took 0.52 seconds or longer; p95 = 300 means that 95% of requests had a response time under 0.3 seconds
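
A percentile can be computed by sorting the latencies and reading off the value at the corresponding rank. The Python sketch below uses the nearest-rank method on raw samples purely to illustrate the concept. How SkyWalking aggregates percentiles internally is not covered here, and the function name percentile is chosen for the example.

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    samples = list(range(1, 101))  # 100 requests taking 1..100 ms
    for p in (50, 90, 95, 99):
        print(f"p{p} = {percentile(samples, p)} ms")
```

Unlike an average, a high percentile is not diluted by the many fast requests, so it shows what the slowest users actually experience.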

The Application Performance Index (Apdex) reflects the state of the system as a computed score. The score combines several indicators
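
For reference, the standard Apdex formula classifies each request against a latency threshold T as satisfied (at most T), tolerating (at most 4T), or frustrated, and scores (satisfied + tolerating / 2) / total. The Python sketch below implements that standard formula. The 500 ms threshold in the usage example is an arbitrary assumption, and the way SkyWalking configures and scales its own Apdex metric is not shown here.

```python
def apdex(latencies_ms, threshold_ms):
    """Standard Apdex score in the range 0..1 for a list of latencies."""
    satisfied = sum(1 for v in latencies_ms if v <= threshold_ms)
    tolerating = sum(
        1 for v in latencies_ms if threshold_ms < v <= 4 * threshold_ms
    )
    # Frustrated requests (above 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(latencies_ms)


if __name__ == "__main__":
    latencies = [100, 200, 600, 1500, 3000]  # milliseconds
    print(apdex(latencies, 500))  # 2 satisfied, 2 tolerating, 1 frustrated
```

A score near 1 means almost every request was fast enough, while a score near 0 means most requests exceeded four times the threshold.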

K8s deployment

The following files are used in total. The official deployment is currently Helm-based, so we had to write the YAML files ourselves

We use the latest official 8.0.0 images: 8.0.0-es7 (OAP) and 8.0.0 (UI)

  • skywalking-min-oap-configmap.yaml
  • skywalking-min-oap-deployment.yaml
  • skywalking-min-oap-namespace.yaml
  • skywalking-min-oap-service.yaml
  • skywalking-min-oap-serviceaccount.yaml
  • skywalking-min-ui-deployment.yaml
  • skywalking-min-ui-service.yaml

There are two modules, OAP and UI

The UI is relatively simple: it connects to the OAP service on port 12800

The OAP exposes two ports

  1. gRPC 11800: remote calls and agent access
  2. REST 12800: the GraphQL API

Next, mount the configuration files. Copy them from the source code and mount them through a ConfigMap. As for file content and paths, only the application configuration file and the alarm rule file need to be mounted. To customize logging, mount the log configuration file as well

Configuration file notes:

Storage configuration: ES is used here, so modify the ES settings

The cluster mode uses Kubernetes service discovery, which requires Kubernetes RBAC, so a ServiceAccount is defined in the YAML. Without that access permission, the Kubernetes cluster mode cannot be used

ES tuning

Refer to the ES section for details

  1. Raise the maximum number of shards per node, via Kibana
PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "max_shards_per_node":10000
    }
  }
}
  2. Adjust max_buckets

  3. In the ES configuration file, change the write thread pool queue size: thread_pool.write.queue_size: 1000

  4. Call chain optimization: index.max_result_window: 1000000. By default, at most 10000 results can be returned. This setting may need to be raised if call chains are very long