I'm 3y, a markdown programmer with one year of CRUD experience stretched over ten 👨🏻‍💻, long known as a professional reciter of canned interview answers.
Today the Austin project brings you something a little different: spend some time following along and you'll end up with a dashboard worthy of being your screen wallpaper. Here's a picture to give you the idea.
Every time a coworker glances at your screen and sees a sleek, dark, expensive-looking interface: "Hmm, chasing a bug again."
Yes, today's topic is monitoring.
01. Why monitoring
I remember being asked in past interviews: "When something goes wrong in production, how do you troubleshoot it? What's your thought process?"
The head of my previous department cared a lot about stability and often asked us to map out the system's upstream and downstream dependencies and interfaces. In my view, improving a system's stability requires complete monitoring and timely alerting.
With monitoring, problems can be located quickly (instead of digging through printed logs, many problems are visible directly from the monitoring data). With monitoring, we can configure metrics that cover both the technical and the business side (business data is usually called a kanban/dashboard, while system data is called monitoring). With monitoring, we get a different view of the system (a comprehensive picture of its performance metrics and business metrics).
If your production system has no monitoring, that's not a good sign.
02. Open-source monitoring components
For monitoring and alerting, don't overthink it: rely on open-source components. Only large companies have the manpower to build their own monitoring and alerting stack.
I picked Prometheus, which is well known in the industry and used by many companies for monitoring and alerting.
An architecture diagram is available from the documentation on Prometheus’ website:
Let me oversimplify the picture above according to my own understanding.
After simplifying it, I have to admit: the official diagram still looks better.
Broadly speaking, the core of Prometheus is the Server. When we integrate with Prometheus, what we actually do is expose an endpoint for Prometheus to pull data from, and then configure a graphical interface on top of a web UI to get the actual monitoring views.
03. Setting up the Prometheus environment
I installed Prometheus with Docker directly; after all, Redis and Kafka are already running on Docker. Create a prometheus folder and put the following docker-compose.yml in it:
```yaml
version: '2'

networks:
  monitor:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    hostname: prometheus
    restart: always
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # - ./node_down.yml:/usr/local/etc/node_down.yml:rw
    ports:
      - "9090:9090"
    networks:
      - monitor

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    # volumes:
    #   - ./alertmanager.yml:/usr/local/etc/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitor

  grafana:
    image: grafana/grafana
    container_name: grafana
    hostname: grafana
    restart: always
    ports:
      - "3000:3000"
    networks:
      - monitor

  node-exporter:
    image: quay.io/prometheus/node-exporter
    container_name: node-exporter
    hostname: node-exporter
    restart: always
    ports:
      - "9100:9100"
    networks:
      - monitor

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8899:8080"
    networks:
      - monitor
```
The Docker images used here are:

- cadvisor: collects metrics from the Docker containers
- node-exporter: collects server (host) metrics
- grafana: a handy web UI for visualizing the monitoring data
- alertmanager: the alerting component (not used for now)
- prometheus: the core monitoring component
Next, create the Prometheus configuration file prometheus.yml (it tells Prometheus which endpoints and ports to pull monitoring data from):
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['ip:9090']    # TODO: use your own IP
  - job_name: 'cAdvisor'
    static_configs:
      - targets: ['ip:8899']    # TODO: use your own IP
  - job_name: 'node'
    static_configs:
      - targets: ['ip:9100']    # TODO: use your own IP
```
(Pay attention to the ports here; adjust them to match your own configuration.)
This prometheus.yml is the file mapped to /etc/prometheus/prometheus.yml in the compose file above. (There is a lot of configuration I'm skipping here; Prometheus is very powerful, so read the official documentation if you want a deeper understanding.)
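If you want to sanity-check the YAML before restarting anything, the Prometheus image ships with promtool, so you can validate the mounted file from inside the container. A quick sketch, assuming the container is named prometheus as in the compose file above:

```bash
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
```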
With the configuration in place, start everything with docker-compose up -d. Once it's running, you can visit the following pages in a browser:
- http://ip:9100/metrics (server metrics from node-exporter)
- http://ip:8899/metrics (Docker container metrics from cAdvisor)
- http://ip:9090/ (Prometheus' built-in web UI)
- http://ip:3000/ (Grafana, the open-source monitoring visualization component)
So one docker-compose.yml brings up all five services.
04. Configuring monitoring in Grafana
Now that we have Grafana, we'll use it directly as the visualization tool for monitoring (Prometheus has a built-in visualization interface, but we won't use it). After opening the Grafana home page, we first need to configure Prometheus as our data source.
Go to the configuration page, fill in the corresponding Prometheus URL, and save it.
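If you'd rather not click through the UI each time, Grafana can also pick the data source up from a provisioning file. A minimal sketch, assuming Grafana runs on the compose network above (so the hostname prometheus resolves) and the file is mounted into the standard provisioning directory:

```yaml
# e.g. mounted into the grafana container as /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```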
With the data source configured, we can set up the corresponding dashboards. Common monitoring setups already have ready-made templates, so we don't need to configure every panel by hand. (If a template doesn't exist, you'll have to build it yourself.)
Here I'll show how to use an existing template by importing it directly; templates can be found at grafana.com/grafana/das…
To monitor the server, we can use template 8913 directly.
Import it and you can see the monitoring page right away:
Since we start quite a few services with Docker, it's also worth looking at Docker monitoring (the cAdvisor service started above collects the Docker information). We use template 893 to configure the Docker monitoring dashboard:
05. Java application metrics
Surprisingly, with just the short steps above, monitoring for the server and the Docker services is already configured. But something's still missing, right? We write Java programs; how can the JVM-related monitoring not be there? That won't do.
So, let's get that sorted.
Configuring monitoring for a Java application is also very easy: we just add two more dependencies to the project's pom.xml (Spring Boot's monitoring components).
```xml
<!-- monitoring -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- adapt to Prometheus -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
Then add the corresponding configuration to the application config file (to enable the endpoints and allow Prometheus to pull the metrics):
```yaml
management:
  endpoint:
    health:
      show-details: always
    metrics:
      enabled: true
    prometheus:
      enabled: true
  endpoints:
    web:
      exposure:
        include: '*'
  metrics:
    export:
      prometheus:
        enabled: true
```
When we start the service and visit the /actuator paths, we can see a bunch of metrics output, including the ones for Prometheus at /actuator/prometheus.
Seeing these metrics means our application side is wired up; all that's left is to have Prometheus collect them.
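For reference, what Prometheus actually scrapes from /actuator/prometheus is plain text in its exposition format, roughly like this (metric names come from Micrometer; labels are trimmed and values are made up for illustration):

```
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.2345678E7
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{method="GET",status="200",uri="/actuator/health",} 3.0
http_server_requests_seconds_sum{method="GET",status="200",uri="/actuator/health",} 0.021
```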
For Prometheus to collect the Java application's data, it's once again just a matter of editing the configuration file: add the following job to the prometheus.yml you wrote earlier:
```yaml
  - job_name: 'Austin'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['ip:port']    # TODO: use your own application's IP and port
```
Open ip:9090/targets and check that the endpoints Prometheus scrapes show a status of UP, which means the targets are healthy.
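As a quick cross-check besides the targets page, Prometheus' built-in up metric is 1 for every target it can scrape, so a query like the following in the Prometheus web UI (the job label matching the job_name configured above) should return 1:

```
up{job="Austin"}
```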
Then go back to Grafana and configure the corresponding dashboards. Here I use template 4701 for JVM monitoring and template 12900 for Spring Boot monitoring; you can get a quick look at what they give you:
06. Load testing
So far, when we want to send a message we call an HTTP interface, and as it happens Spring Actuator can monitor HTTP data. So why not load-test the interface and see whether the monitoring data changes?
Here I use the wrk load-testing tool (it's simple enough to use). First install it (my environment is CentOS 7.6):
```bash
sudo yum groupinstall 'Development Tools'
sudo yum install -y openssl-devel git
git clone https://github.com/wg/wrk.git
# build wrk inside the cloned directory
cd wrk && make
sudo cp wrk /usr/local/bin
# verify that the installation succeeded
wrk -v
```
```bash
wrk -t2 -c100 -d10s --latency http://www.baidu.com
```
Then load-test our own interface and look at the data once it finishes:

```bash
wrk -t4 -c100 -d10s --latency 'http://localhost:8888/sendSmsTest?phone=13888888888&templateId=1'
```
Clearly the graphs show noticeable fluctuations, but the numbers don't quite match what the load test reported.
My understanding: Prometheus pulls the exposed data every N seconds (configurable), and the dashboards query it every N seconds (also configurable). With this architecture it's hard to get an exact value for a specific moment (a specific second).
So with Prometheus you really only see values aggregated over a window of time, which isn't very friendly to instantaneous indicators like QPS and RT.
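That's also why, for QPS-style panels, the usual approach is to query a rate over a window rather than an instantaneous value. A rough sketch in PromQL, assuming the default Spring Boot/Micrometer metric names and the test interface above:

```
# requests per second for the test interface, averaged over the last minute
rate(http_server_requests_seconds_count{uri="/sendSmsTest"}[1m])
```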
07. Deploying the project to Linux
You can tell from the command above that I run the austin project on Linux. It's pretty basic stuff, but for the sake of newcomers I'd better post the details of the process. Can I get a thumbs up for that?
First, we need to download the JDK
```
Download JDK: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Account: [email protected]
Password: OracleTest1234
```
After that, upload the downloaded package to the Linux machine. I'm on a Mac (Windows folks can search for the equivalent; I'd guess it's just as simple). Swap the IP in the command for your own address.
```bash
scp -P 22 /Users/3y/Downloads/DDMG/jdk-8u311-linux-i586.tar.gz root@ip:/root/austin
```
Unpack the Java package
```bash
tar -zxvf jdk-8u231-linux-x64.tar.gz
```
Configure the corresponding environment variables:
```bash
vim /etc/profile

# append these lines to the end of the file
export JAVA_HOME="/root/java/jdk1.8.0_311"
export PATH="$JAVA_HOME/bin:$PATH"

# refresh the configuration file
source /etc/profile
```

If you then run java and see an error like -bash: /root/java/jdk1.8.0_311/bin/java: /root/java/jdk1.8.0_311/lib/ld-linux.so.2: Bad ELF interpreter: No such file or directory, it means a 32-bit (i586) JDK was unpacked on a 64-bit system; download the linux-x64 package instead.
Build the jar package locally:
```bash
mvn package -Dmaven.test.skip=true
```
Upload it to a Linux server (as above) and start it in the background:
```bash
nohup java -jar austin-web-0.0.1-SNAPSHOT.jar --server.port=8888 &
```
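Two ordinary commands are enough to confirm the service actually came up in the background (nothing austin-specific here):

```bash
# confirm the process is running
ps -ef | grep austin-web

# nohup redirects output to nohup.out by default; follow the startup log
tail -f nohup.out
```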
08. Business metrics
So far I've configured Docker monitoring, server monitoring, and Spring Boot application monitoring, but you'll notice these are mostly system-level metrics. Some of you may ask: "Huh? Didn't you say there'd be business monitoring? Why haven't you done it?"
We could implement custom metrics for Prometheus too, but if the system is already hooked up to an ELK-style stack, I prefer to build business metrics on ELK. After all, ELK is oriented toward log data: as long as we record the right logs, we can clean them up into a business-metrics dashboard.
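For completeness, if you did want a custom business metric in Prometheus, a minimal Micrometer sketch would look something like this (the metric name, tag, and class are made up for illustration; with the micrometer-registry-prometheus dependency above it would be exposed at /actuator/prometheus as austin_send_message_total):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class SendMessageMetrics {

    private final Counter sendCounter;

    public SendMessageMetrics(MeterRegistry registry) {
        // a hypothetical business counter: how many messages were handed to the sms channel
        this.sendCounter = Counter.builder("austin.send.message")
                .tag("channel", "sms")
                .description("messages handed over to the channel")
                .register(registry);
    }

    public void markSend() {
        sendCounter.increment();
    }
}
```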
For the austin project, the ELK-related components will be added later, so I'm not using Prometheus to collect business metrics; I'd rather keep Prometheus as the component for system metrics.
09. Summary
This article has focused on the basics of monitoring (normally this is the operations team's job, but as a developer it's good to understand). If your company's systems aren't monitored yet, it's quite easy to build something with open-source components.
A system really can't do without monitoring; with it, troubleshooting problems is much faster.
Finally, let me answer the interview question from the beginning: "When something goes wrong in production, how do you troubleshoot it? What's your thought process?"
Here's my understanding. First, when something goes wrong in production, think about whether you've deployed recently; many production problems are caused by release changes. If there was a recent release and the problem has a significant impact, roll the system back first rather than debugging first.
If there has been no recent release, check whether the system's monitoring looks normal (traffic monitoring, business monitoring, and so on). In general, we can find the problem from the monitoring; after all, we know our own system best, and anomalies there let us locate the issue quickly.
If the monitoring looks fine, then check whether there are unusual error logs around the time of the incident, and troubleshoot from the error logs.
So: we have a rollback mechanism and a monitoring mechanism, and common errors trigger timely alerts via SMS, email, and IM tools. If none of those surface the cause, it comes down to reading the error logs and reproducing the problem. That's my general troubleshooting approach.
That's it for this article. Coming up next: the distributed configuration centre, which I've already integrated into the code.
You've read this far; asking for a like isn't too much, is it? I'm 3y, see you next time.
Follow my WeChat official account [Java3y]. Besides technology I also chat about everyday life; some things can only be said quietly~ The [online interview questions + build a Java project from scratch] series is being updated intensively! Give it a star! Original content isn't easy to produce, so a like, a comment, and a share mean a lot!
Austin project source code Gitee link: gitee.com/austin
Austin project source code on GitHub: github.com/austin