SkyWalking – How to use it

This article uses SkyWalking 8.5.0, with the server and UI deployed with their default settings and Elasticsearch 7.6 as storage.

Using the Agent

SkyWalking does not require manual instrumentation (tracking points written by hand). Instead, the agent injects monitoring code into the Java application as it runs in the JVM. While the service is running, the SkyWalking agent sends the monitoring data it collects over gRPC to the back-end OAP cluster for analysis and storage.

To monitor Java project A, copy the entire agent directory from the SkyWalking installation directory to the server where project A runs, typically into the same directory as project A. The agent can be configured in either of the following two ways:

  1. Startup parameter mode

    Add the -javaagent parameter to the startup parameters of project A. There are two formats:

    • Format 1: -javaagent:/path/skywalking-agent.jar={config1}={value1},{config2}={value2}
    • Format 2: -javaagent:/path/skywalking-agent.jar -Dskywalking.[option1]=[value1]

    A demo startup command using the agent (format 2):

    java -javaagent:/home/demo/agent/skywalking-agent.jar -Dskywalking.agent.namespace=demo_183 -Dskywalking.agent.service_name=demo -Dskywalking.collector.backend_service=192.168.3.84:11800 -Xms128m -Xmx512m -jar /home/demo/demo-0.0.1-SNAPSHOT.jar --spring.profiles.active=test

  2. Configuration file mode

    Modify the agent/config/agent.config file in the agent directory copied for project A. To start the project, you then only need to specify the location of the agent jar, for example:

    java -javaagent:/home/demo/agent/skywalking-agent.jar -Xms128m -Xmx512m -jar /home/demo/demo-0.0.1-SNAPSHOT.jar --spring.profiles.active=test
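
The second startup format can also be assembled programmatically, for example in a deployment script. The sketch below is only an illustration: build_agent_command is a hypothetical helper, not part of SkyWalking, and the paths and option values are the ones from the demo command above.

```python
from typing import Dict, List, Sequence


def build_agent_command(agent_jar: str, app_jar: str,
                        options: Dict[str, str],
                        jvm_args: Sequence[str] = (),
                        app_args: Sequence[str] = ()) -> List[str]:
    """Assemble a `java` command line that attaches the agent (format 2)."""
    cmd = ["java", f"-javaagent:{agent_jar}"]
    # Each agent option becomes a -Dskywalking.<option>=<value> system property.
    cmd += [f"-Dskywalking.{key}={value}" for key, value in options.items()]
    cmd += list(jvm_args)
    cmd += ["-jar", app_jar]
    cmd += list(app_args)
    return cmd


command = build_agent_command(
    agent_jar="/home/demo/agent/skywalking-agent.jar",
    app_jar="/home/demo/demo-0.0.1-SNAPSHOT.jar",
    options={
        "agent.namespace": "demo_183",
        "agent.service_name": "demo",
        "collector.backend_service": "192.168.3.84:11800",
    },
    jvm_args=["-Xms128m", "-Xmx512m"],
    app_args=["--spring.profiles.active=test"],
)
print(" ".join(command))
```

Keeping the options in a dictionary makes it easy to see which values are set on the command line and which are left to agent.config.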

After the project starts, open the SkyWalking UI and the Demo service appears in the dashboard.

UI

The SkyWalking UI consists of six functional areas: dashboard, topology, tracing, performance analysis (profiling), log, and alarm. Below the function area are the metric objects; SkyWalking (SW) divides monitored objects into three kinds: service, endpoint, and instance. The lower right corner is the time area, which sets the time range of the statistics (all metric displays depend on this time range). Clicking the "Auto" button in the upper right turns on automatic refresh. The remaining space is the dashboard display area, where the metrics are shown. Here are three basic concepts of SkyWalking:

  • Service: Represents a set of workloads that provide the same behavior for incoming requests. When using the agent or an SDK, you can define the name of the service. If you do not, SkyWalking uses the name defined for the application. To interact with the alarm service, it is recommended that you set it to the application name used in the application center.

    Here, we can see that the application's service is “owind_base_info”, which is defined in the agent environment variable SW_AGENT_NAME.

  • Endpoint: a method-level entry point, such as the path of a requested interface, the method of a scheduled task, or a remote RPC call between services.

    Here, we can see one endpoint of the application, the API interface /owind/api/webSocket/{id}.

  • Service Instance: Each workload in the set above is called an Instance. Just like the Pods in Kubernetes, a service instance is not necessarily a process on the operating system. But when you use an Agent, a service instance is actually a real process on the operating system.

    Here, we can see that the instance of the Spring Boot application is {process UUID}@{hostname}, automatically generated by the Agent.
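
As a small illustration of the instance naming shown above, the sketch below builds a name of the form {process UUID}@{hostname}. It only mirrors the format; the dash-free UUID and the use of the local hostname are assumptions, and the agent's own generator may differ.

```python
import socket
import uuid


def default_instance_name() -> str:
    """Build an instance name of the form {process UUID}@{hostname}."""
    process_uuid = uuid.uuid4().hex  # assumption: 32 hex characters, no dashes
    return f"{process_uuid}@{socket.gethostname()}"


name = default_instance_name()
print(name)
```

Because the UUID is generated per process, two processes of the same service on the same host still appear as two separate instances.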

Dashboard

In the dashboard, we focus on Application Performance Management (APM) and Database. The dashboard contains various performance metrics for the service.

The main performance indicators are:

  • Apdex score: The Application Performance Index (Apdex) is a standard that quantifies users' satisfaction with application performance. It provides a uniform way to measure and report user experience, combining end-user experience and application performance into a single index between 0 (lowest) and 1 (highest).

  • Response time: the average response time (ms) of all requests to the service during the selected period.

  • Throughput: the number of requests to the service per minute (CPM, calls per minute) during the selected period.

  • SLA: service level agreement. In SkyWalking this is shown as the percentage of requests that succeeded during the selected period.

    The dashboard shows the current average and the historical trend of each of these metrics.
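
The four metrics above can be made concrete with a short calculation. The sketch below is an illustration, not SkyWalking's implementation: the Request type, the sample values, the 500 ms Apdex threshold, and the choice to count failed requests as frustrated are all assumptions. The Apdex formula is the standard one, (satisfied + tolerating / 2) / total, where satisfied requests finish within the threshold T and tolerating requests within 4T.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    success: bool


def apdex(requests: List[Request], threshold_ms: float = 500.0) -> float:
    """Apdex = (satisfied + tolerating / 2) / total."""
    satisfied = 0
    tolerating = 0
    for r in requests:
        if not r.success:
            continue  # assumption: a failed request counts as frustrated
        if r.latency_ms <= threshold_ms:
            satisfied += 1
        elif r.latency_ms <= 4 * threshold_ms:
            tolerating += 1
    return (satisfied + tolerating / 2) / len(requests)


def average_response_time(requests: List[Request]) -> float:
    return sum(r.latency_ms for r in requests) / len(requests)


def throughput_cpm(requests: List[Request], minutes: float) -> float:
    return len(requests) / minutes


def sla_percent(requests: List[Request]) -> float:
    return 100 * sum(1 for r in requests if r.success) / len(requests)


# Hypothetical requests observed during a one-minute window.
sample = [
    Request(120, True),
    Request(300, True),
    Request(800, True),    # tolerating: between T and 4T
    Request(2500, True),   # frustrated: slower than 4T
    Request(150, False),   # failed
]

print("Apdex:", apdex(sample))
print("Average response time (ms):", average_response_time(sample))
print("Throughput (CPM):", throughput_cpm(sample, minutes=1))
print("SLA (%):", sla_percent(sample))
```

With these sample values, two requests are satisfied and one is tolerating, so the Apdex score is (2 + 0.5) / 5 = 0.5.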

For example, the Slow Endpoints (ms) panel lists the endpoints with the longest response times in the selected time range; the number on the left is the response time.

In the line graph of Global Response Latency, P75 = 530 means that 75% of requests had a response latency of at most 530 ms.
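
To show what a percentile such as P75 means, here is a minimal sketch using the nearest-rank method. The method and the latency values are assumptions chosen for illustration; SkyWalking's own percentile calculation may differ in detail.

```python
import math
from typing import Sequence


def percentile(latencies_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response latencies in milliseconds.
latencies = [120, 250, 310, 400, 480, 530, 900, 1500]
p75 = percentile(latencies, 75)
print("P75 =", p75)
```

Here six of the eight samples (75%) are at or below 530 ms, so P75 is 530.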

In the Instance view, in addition to the usual metrics such as throughput and response time, the dashboard also shows JVM information for the running instance, such as heap usage, GC count, and GC time.

The Database view shows response times, request load, and slow SQL statements for projects that use a database. These metrics are easy to understand.

Topology

Topology diagrams show services and their dependencies. SW automatically detects service dependencies from the request data. Click a service to display some of its metrics. Click the dot on a dependency line to display metrics for the calls between the two services, such as throughput per minute and average latency.
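
The idea of deriving dependencies from request data can be sketched as a simple aggregation. This is not how the OAP server is implemented; the call records, service names, and the build_topology helper are hypothetical and only show how per-edge throughput and average latency fall out of observed calls.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Call = Tuple[str, str, float]  # (caller service, callee service, latency in ms)


def build_topology(calls: Iterable[Call], minutes: float) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate observed calls into dependency edges with per-edge metrics."""
    totals = defaultdict(lambda: [0, 0.0])  # edge -> [call count, summed latency]
    for caller, callee, latency_ms in calls:
        edge = totals[(caller, callee)]
        edge[0] += 1
        edge[1] += latency_ms
    return {
        edge: {"cpm": count / minutes, "avg_latency_ms": total / count}
        for edge, (count, total) in totals.items()
    }


# Hypothetical calls observed over two minutes.
calls = [
    ("gateway", "order-service", 40.0),
    ("gateway", "order-service", 60.0),
    ("order-service", "mysql", 5.0),
    ("gateway", "user-service", 20.0),
]
topology = build_topology(calls, minutes=2)
for (caller, callee), metrics in topology.items():
    print(f"{caller} -> {callee}: {metrics}")
```

Each key of the result corresponds to one line in the topology diagram, and its values correspond to what the dot on that line displays.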

Tracing

When a user sees a drop in a service's SLA or a significant increase in the response time of a specific endpoint, the tracing function can be used to query the individual request records.

  • The top is the search area. Users can specify search conditions, such as the service, the instance, the endpoint, or whether the request succeeded or failed. The endpoint name supports fuzzy search.
  • Each call in the call chain is called a span, and the duration and execution result of each span are listed (the default view is a list; a tree view and a table view are also available).
  • If a step fails, it is marked in red.
  • Click a span to display its details. If an exception occurred, its type, message, and stack trace are captured automatically. If the span is a database operation, the executed SQL is also recorded automatically.
  • If the project is associated with logs, you can click View Logs in the upper right corner to view the logs of the current request.
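
The relationship between a trace and its spans can be sketched as a small tree. The Span type, the field names, and all operation names except the endpoint path are hypothetical; the sketch only shows how spans nest and how a failed step can be flagged, which is what the tree view presents.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    span_id: int
    parent_id: Optional[int]  # None for the entry span of the trace
    operation: str
    duration_ms: int
    is_error: bool = False
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> Span:
    """Link each span to its parent and return the entry span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no entry span")
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, flagging failed spans."""
    marker = " [ERROR]" if span.is_error else ""
    lines = [f"{'  ' * depth}{span.operation} ({span.duration_ms} ms){marker}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


spans = [
    Span(0, None, "{GET}/owind/api/fuseQuery/fuseQueryAll", 320),
    Span(1, 0, "FuseQueryService.queryAll", 300),
    Span(2, 1, "MySQL query", 250, is_error=True),
]
lines = render(build_tree(spans))
print("\n".join(lines))
```

The indentation mirrors the tree view, and the [ERROR] marker plays the role of the red highlight.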

Performance analysis

Tracing shows spans at the granularity of service invocations. To see the thread stack information of the application itself, use profiling. After a task is created on the performance analysis page, SW starts to collect stack information from the application. After sampling is complete, click Analyze to view the detailed stack information.

  1. Click “View” on the right side of a span to see the details of the call chain;
  2. Below the span list are the thread stack details and the time taken, as collected by SW.

Note that profiling has a performance cost for the service itself, because it collects the service's JVM stack information at high frequency. It is only suitable for analyzing the behavior of time-consuming endpoints.

Parameters for creating a task:

  • Service: the service to be analyzed
  • Endpoint: the name of the endpoint in link monitoring, that is, the complete endpoint path shown in tracing, for example {GET}/owind/api/fuseQuery/fuseQueryAll
  • Monitoring time: the time at which data collection starts
  • Monitoring duration: how long the collection task runs
  • Start monitoring time: the delay after which collection begins
  • Monitoring interval: how often a sample is collected
  • Maximum number of samples: the maximum number of samples to collect
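
To show how the monitoring interval and the maximum number of samples relate to the analysis result, the sketch below aggregates stack snapshots into a rough time estimate per method. It is a simplified model of sampling-based profiling, not SkyWalking's analyzer; the method names, the 10 ms interval, and the helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Stack = List[str]  # one snapshot of a thread's call stack, outermost frame first


def estimate_time_per_frame(snapshots: List[Stack], interval_ms: int,
                            max_samples: int) -> Dict[str, int]:
    """Estimate time per method as (snapshots containing it) * sampling interval."""
    hits = Counter()
    for stack in snapshots[:max_samples]:  # never use more than max_samples snapshots
        for frame in set(stack):           # count a frame once per snapshot
            hits[frame] += 1
    return {frame: count * interval_ms for frame, count in hits.items()}


def hottest_leaf(snapshots: List[Stack]) -> str:
    """Return the innermost method that appears most often across snapshots."""
    leaves = Counter(stack[-1] for stack in snapshots if stack)
    return leaves.most_common(1)[0][0]


# Hypothetical snapshots of one slow request, taken every 10 ms.
snapshots = [
    ["Controller.fuseQueryAll", "Service.query", "Dao.select"],
    ["Controller.fuseQueryAll", "Service.query", "Dao.select"],
    ["Controller.fuseQueryAll", "Service.query", "Dao.select"],
    ["Controller.fuseQueryAll", "Service.render"],
]
times = estimate_time_per_frame(snapshots, interval_ms=10, max_samples=5)
print(times)
print("Hottest leaf:", hottest_leaf(snapshots))
```

A shorter interval gives a finer estimate but takes more snapshots per second, which is the performance cost mentioned above.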

Log

This project uses Logback as its logging framework, so Logback is used here as the example of integrating with SkyWalking. The steps for other frameworks such as log4j2 are similar. First add the dependency to the project (for other logging frameworks, look up the corresponding dependency):

<dependency>
    <groupId>org.apache.skywalking</groupId>
    <artifactId>apm-toolkit-logback-1.x</artifactId>
    <version>8.5.0</version>
</dependency>

Add a SkyWalking appender to the logback-spring.xml configuration file:

<appender name="SKYWALKING" class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.log.GRPCLogClientAppender">
    <encoder>
        <!-- Format output: %d is the date, %thread is the thread name, %-5level is the level left-aligned in a 5-character width, %msg is the log message -->
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{50} - %msg%n</pattern>
        <charset>UTF-8</charset>
    </encoder>
</appender>
<springProfile name="local">
    <root level="info">
        <appender-ref ref="SKYWALKING"/>
        <appender-ref ref="INFO_FILE"/>
        <appender-ref ref="WARN_FILE"/>
        <appender-ref ref="ERROR_FILE"/>
    </root>
</springProfile>

In the agent/config/agent.config file of the agent used by project A, add the following configuration:

plugin.toolkit.log.grpc.reporter.server_host=${SW_GRPC_LOG_SERVER_HOST:XXX.XXX.XXX.XXX}
plugin.toolkit.log.grpc.reporter.server_port=${SW_GRPC_LOG_SERVER_PORT:11800}
plugin.toolkit.log.grpc.reporter.max_message_size=${SW_GRPC_LOG_MAX_MESSAGE_SIZE:10485760}
plugin.toolkit.log.grpc.reporter.upstream_timeout=${SW_GRPC_LOG_GRPC_UPSTREAM_TIMEOUT:30}

server_host is the IP address of the SkyWalking server. After service A is restarted, its logs are displayed on the log page. The logs belonging to a span can also be viewed from the trace.
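
The values in the configuration above use the ${ENV_NAME:default} placeholder form: the environment variable wins if it is set, otherwise the default after the colon is used. The sketch below imitates that behavior for illustration only; the resolve function and its regular expression are assumptions, not the agent's actual parser.

```python
import re
from typing import Mapping

# Matches ${NAME:default}; NAME is an environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve(value: str, env: Mapping[str, str]) -> str:
    """Replace each ${NAME:default} with env[NAME] if present, else the default."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        return env.get(name, default)
    return _PLACEHOLDER.sub(replace, value)


line = "plugin.toolkit.log.grpc.reporter.server_port=${SW_GRPC_LOG_SERVER_PORT:11800}"
key, raw_value = line.split("=", 1)

default_port = resolve(raw_value, env={})
overridden_port = resolve(raw_value, env={"SW_GRPC_LOG_SERVER_PORT": "11900"})
print(key, "->", default_port, "or", overridden_port)
```

In practice this means the same agent.config can be shipped to every environment, with only the environment variables changing per deployment.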

Alarm

SkyWalking pushes alarms through webhooks. In the alarm-settings.yml file on the server, configure the alarm rules and the webhooks to push to.

# Alarm policy example
rules:
  # Rule unique name, must be ended with '_rule'.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: The average response time of service {name} exceeded 1 second within the last 10 minutes

  • Rule name: the unique name of the rule, which is also displayed in the alarm information. It must end with _rule; the prefix can be customized
  • Metrics name: the metric name, which is the name of a metric defined in the scripts in the OAL folder. Currently only metrics of type long, double, and int are supported. See the official OAL scripts for details
  • Include names: the entity names this rule applies to, such as service names or endpoint names (optional, default is all)
  • Exclude names: the entity names this rule does not apply to, such as service names or endpoint names (optional, empty by default)
  • Threshold: the threshold value
  • OP: the operator; >, <, and = are currently supported
  • Period: how long the alarm rule is checked. This is a time window that follows the time of the back-end deployment environment
  • Count: within a Period window, if the number of values that cross the Threshold (according to OP) reaches Count, an alarm is sent
  • Silence period: after an alarm is triggered at time N, no alarm is generated in the period TN -> TN + period. By default it is the same as Period, which means that the same alarm (with the same ID in the same metrics name) is triggered only once during the same period
  • Message: the text of the alarm message
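
How Threshold, OP, Period, Count, and Silence period work together can be sketched as a sliding-window check over one metric value per minute. This is a simplified model for illustration, not the OAP server's alarm engine; the Rule type, the evaluate function, and the sample values are hypothetical.

```python
import operator
from collections import deque
from dataclasses import dataclass
from typing import List, Sequence

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass
class Rule:
    threshold: float
    op: str
    period: int          # window length in minutes
    count: int           # matches needed inside the window
    silence_period: int  # minutes to stay silent after firing


def evaluate(rule: Rule, values_per_minute: Sequence[float]) -> List[int]:
    """Return the minute indexes at which an alarm would fire."""
    compare = OPS[rule.op]
    window = deque(maxlen=rule.period)  # keeps only the last `period` minutes
    fired = []
    silent_until = -1
    for minute, value in enumerate(values_per_minute):
        window.append(compare(value, rule.threshold))
        if minute <= silent_until:
            continue  # still inside the silence period
        if sum(window) >= rule.count:
            fired.append(minute)
            silent_until = minute + rule.silence_period
    return fired


# Same numbers as the example rule: > 1000 ms, 3 times within 10 minutes, silence 5.
rule = Rule(threshold=1000, op=">", period=10, count=3, silence_period=5)
response_times = [500, 1200, 1300, 900, 1500, 1600, 1700, 800, 800, 800, 800]
fired = evaluate(rule, response_times)
print("Alarm fired at minutes:", fired)
```

The third value above 1000 ms arrives at minute 4, so the first alarm fires there; minutes 5 to 9 are silenced, and at minute 10 the window still holds enough matches to fire again.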

After an alarm rule is configured, you can view the alarm information that corresponds to the rule. To push alarms to DingTalk, WeCom (enterprise WeChat), and similar tools, the corresponding webhook must be configured.
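
A webhook receiver is just an HTTP endpoint that accepts the alarm payload and forwards it. The sketch below assumes the payload is a JSON array whose items carry fields such as name, ruleName, and alarmMessage; check the payload of your SkyWalking version before relying on these names. The handler, the port, and the message format are hypothetical, and forwarding to a chat tool is left as a print statement.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


def format_alarms(body: bytes) -> List[str]:
    """Turn the posted JSON array of alarms into one text line per alarm."""
    alarms = json.loads(body)
    return [
        f"[{a.get('ruleName', '?')}] {a.get('name', '?')}: {a.get('alarmMessage', '')}"
        for a in alarms
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in format_alarms(self.rfile.read(length)):
            print(line)  # replace with a call to the chat tool's own webhook
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8090) -> None:
    """Start the receiver; its URL is what goes into the webhook list of alarm-settings.yml."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()


# Example payload with hypothetical values.
example = json.dumps([
    {
        "scope": "SERVICE",
        "name": "demo",
        "ruleName": "service_resp_time_rule",
        "alarmMessage": "The average response time of service demo exceeded 1 second",
    }
]).encode("utf-8")
print(format_alarms(example))
```

Calling serve() starts the receiver; keeping the formatting in its own function makes it easy to adapt the text to whatever the target chat tool expects.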