The introduction

A few days ago I wrote "Linux Three Swordsmen: awk, grep, sed in Detail", which covered the basic use of the three tools. Next we look at how they are applied in performance testing; this article focuses on statistical analysis of Tomcat and Nginx access logs.

Tomcat collects statistics on request response time

In server.xml, configure the access log valve as follows (%D records the request processing time, %F the response commit time):

<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs"
       prefix="localhost_access_log."
       suffix=".txt"
       pattern="%h %l %u [%{yyyy-MM-dd HH:mm:ss}t] %{X-Real_IP}i &quot;%r&quot; %s %b %D %F" />

The fields are described as follows:

%h – IP address of the client that initiated the request. The recorded IP is not necessarily that of the real user; it may be the public mapping of a private address or the address of a proxy server.

%l – RFC 1413 identity of the client. Only clients that implement the RFC 1413 specification provide this information.

%u – Remote client user name, recording the name the user supplied for authentication (for example the login user name zuozewei); blank if not logged in.

%t – Time the request was received (access time and time zone, such as 18/Jul/2018:17:00:01 +0800; the trailing "+0800" indicates the server time zone is 8 hours ahead of UTC).

%{X-Real_IP}i – The real IP address of the client, taken from the X-Real_IP request header.

%r – Request line from the client (request URI and HTTP protocol; this is the most useful information in the log for PV analysis, recording exactly what request the server received).

%s – Status code returned to the client, such as 200 for success.

%b – Size of the response body sent to the client, excluding response headers (summing this value across log records gives a rough estimate of server throughput).

%D – Time taken to process the request, in milliseconds.

%F – Time taken to commit the response, in milliseconds.

Log example:

47.203.89.212 - - [19/Apr/2017:03:06:53 +0000] "GET / HTTP/1.1" 200 10599 50 49
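With this pattern in place, the %D field can be aggregated directly. Below is a minimal sketch (assuming %D is the next-to-last field, as in the sample line above; the log file name is hypothetical) that reports the request count and the average and maximum processing times:

awk '{
    d += $(NF-1)                       # accumulate %D (processing time, ms)
    if ($(NF-1) > max) max = $(NF-1)   # track the slowest request
} END {
    if (NR > 0) printf "requests: %d, avg: %.1f ms, max: %d ms\n", NR, d/NR, max
}' localhost_access_log.2017-04-19.txt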

Nginx collects statistics on the response time of requests and backend services

Extend the classic combined format with $request_time and $upstream_response_time.

In nginx.conf, configure the log format as follows:

log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent $request_time $upstream_response_time '
                '"$http_referer" "$http_user_agent" "$http_x_forwarded_for"';

The fields are described as follows:

$remote_addr – IP address of the client that initiated the request. The recorded IP is not necessarily that of the real user; it may be the public mapping of a private address or the address of a proxy server.

$remote_user – Remote client user name, recording the name the visitor supplied for authentication (for example zuozewei); blank if not logged in.

[$time_local] – Time the request was received (access time and time zone, such as 18/Jul/2018:17:00:01 +0800; the trailing "+0800" indicates the server time zone is 8 hours ahead of UTC).

"$request" – Request line from the client (request URI and HTTP protocol; this is the most useful information in the log for PV analysis).

$status – Status code returned to the client, such as 200 for success.

$body_bytes_sent – Size of the response body sent to the client, excluding response headers (summing this value across log records gives a rough estimate of server throughput).

$request_time – Total time of the whole request, in seconds (includes receiving the request data from the client, the back-end program's response time, and sending the response data back to the client; excludes the time spent writing the log).

$upstream_response_time – Upstream response time, in seconds (from establishing the connection to the upstream server until the response is received and the connection is closed).

"$http_referer" – The page from which the link was followed (the Referer request header).

"$http_user_agent" – Client browser information (the User-Agent request header).

"$http_x_forwarded_for" – The real IP address of the client. A web server placed behind a reverse proxy cannot see the client's address directly: $remote_addr then holds the IP of the reverse proxy. The reverse proxy can add an X-Forwarded-For header to record the IP of the original client and the address the original client requested.
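For reference, the reverse proxy itself must be configured to pass these headers along. A minimal sketch of the relevant Nginx directives (the upstream name backend is hypothetical):

location / {
    proxy_pass http://backend;                                    # hypothetical upstream
    proxy_set_header X-Real-IP $remote_addr;                      # pass the original client IP
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;  # append the client IP to the chain
}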

Log example:

218.56.42.148 - - [19/Apr/2017:01:58:04 +0000] "GET / HTTP/1.1" 200 0 0.023 - "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36" "-"

Using AWK

To understand the AWK programs below, let's briefly review the basics covered in the earlier article.

An AWK program can consist of one or more lines of text, the core of which is a combination of a pattern and an action: pattern {action}. The pattern is matched against each line of input text; for each line that matches, awk executes the corresponding action. Curly braces separate the pattern from the action. Awk scans the input line by line, using a record separator (typically a newline) to delimit each record and a field separator (typically spaces or tabs) to split each line into fields. The fields can be referenced as $1 for the first field, $2 for the second, and so on up to $n for the nth field, while $0 stands for the entire record. Either the pattern or the action may be omitted: a missing pattern matches every line, and a missing action defaults to {print}, which prints the entire record.
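As a quick illustration of the pattern {action} structure, using the same access.log as the examples that follow:

awk '$9 == 200 {print $1, $7}' access.log   # pattern and action: print the IP and URI of successful requests
awk '{print $1}' access.log                 # action only: runs for every line
awk '$9 == 404' access.log                  # pattern only: the default action prints the whole record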

Use Nginx access.log as an example.

Use AWK to decompose information from an Nginx access log

218.56.42.148 - - [19/Apr/2017:01:58:04 +0000] "GET / HTTP/1.1" 200 0 0.023 - "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36" "-"

Here $0 is the entire record line, $1 is the client IP "218.56.42.148", $4 is the first half of the request time "[19/Apr/2017:01:58:04", $5 is the second half "+0000]", and so on.

When we use the default field separator, we can parse out the following different types of information from the log:

awk '{print $1}' access.log

The IP address ($remote_addr)

awk '{print $3}' access.log

The user name ($remote_user)

awk '{print $4,$5}' access.log

The date and time ([$time_local])

awk '{print $7}' access.log

The URI ($request)

awk '{print $9}' access.log

The status code ($status)

awk '{print $10}' access.log

The response size ($body_bytes_sent)

awk '{print $11}' access.log

The request time ($request_time)

awk '{print $12}' access.log

The upstream response time ($upstream_response_time)

It is not hard to see that the default field separator alone makes it difficult to parse out the remaining information, such as the request line, referring page, and browser type, because these values contain an indeterminate number of spaces. Therefore, we need to change the field separator to " to read this information easily.

awk -F\" '{print $2}' access.log

The request line ($request)

awk -F\" '{print $4}' access.log

The referring page ($http_referer)

awk -F\" '{print $6}' access.log

The browser ($http_user_agent)

awk -F\" '{print $8}' access.log

The real IP ($http_x_forwarded_for)

Note: here, to prevent the Linux shell from misinterpreting the " as the start of a string, we escape it with a backslash. Now we have the basics of AWK and how it parses logs.
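If you prefer not to escape, wrapping the separator in single quotes works just as well; the two commands below are equivalent:

awk -F\" '{print $2}' access.log    # the double quote escaped with a backslash
awk -F'"' '{print $2}' access.log   # the double quote wrapped in single quotes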

Application Scenario Example

Use Nginx access.log as an example

Browser Type Statistics

If we want to know which types of browsers visited the site, listed in descending order of occurrence, we can use the following command:

awk -F\" '{print $6}' access.log | sort | uniq -c | sort -fr

This command line first parses out the browser field, then pipes the output into the first sort, which groups identical lines so that uniq -c can count the occurrences of each browser. The final sort prints those counts in reverse order.
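The same counting can also be done inside awk itself with an associative array (the same technique reused in the IP statistics below); a sketch:

awk -F\" '{ count[$6]++ } END { for (b in count) print count[b], b }' access.log | sort -rn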

Discover system problems

We can use the following command to count the status codes returned by the server and uncover possible problems in the system.

awk '{print $9}' access.log | sort | uniq -c | sort

Under normal circumstances, status codes 200 or 30x should appear most often; 40x indicates a client access problem and 50x a server problem. Here are some common status codes:

200 – The request was successful; the desired response header or data body is returned with this response.
206 – The server successfully processed part of a GET request.
301 – The requested resource has been permanently moved to a new location.
302 – The requested resource now temporarily responds to the request from a different URI.
400 – Bad request; the server cannot understand the current request.
401 – Unauthorized; the current request requires user authentication.
403 – Forbidden; the server understands the request but refuses to execute it.
404 – Not found; the resource does not exist on the server.
500 – The server encountered an unexpected condition that prevented it from completing the request.
503 – The server is currently unable to process the request due to temporary maintenance or overload.

The HTTP status codes are defined at www.w3.org/Protocols/r…
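To see at a glance whether 40x and 50x codes make up a meaningful share of traffic, here is a sketch that prints each status code with its count and percentage (NR in the END block holds the total number of records):

awk '{ code[$9]++ } END { for (c in code) printf "%s\t%d\t%.2f%%\n", c, code[c], 100*code[c]/NR }' access.log | sort -k2 -rn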

Statistics about status codes

Find and display all requests with a 404 status code

awk '($9 ~ /404/)' access.log

Count all requests whose status code is 404:

awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort

Now suppose one request (for example, the URI /path/to/notfound) generates a large number of 404 errors; we can use the following command to find out which referring pages the request came from and which browsers issued it:

awk -F\" '($2 ~ "^GET /path/to/notfound"){print $4,$6}' access.log

Sometimes you find that, for whatever reason, other sites are hotlinking images hosted on your site. If you want to know who is using the images on your website without authorization, you can use the following command:

awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^www.example.com/){print $4}' access.log | sort | uniq -c | sort

Note: before use, change www.example.com to your own website domain name.

This command splits each line on "; the request line must contain ".jpg", ".gif", or ".png"; the referring page must not start with your site's domain name (here, www.example.com); it then displays all referring pages and counts how often each appears.

IP-related statistics

Count how many different IP accesses there are:

awk '{print $1}' access.log | sort | uniq | wc -l

Count how many pages each IP accesses:

awk '{++S[$1]} END {for (a in S) print a,S[a]}' log_file

Order the number of pages accessed per IP from smallest to largest:

awk '{++S[$1]} END {for (a in S) print S[a],a}' log_file | sort -n

Count how many IPs accessed the site at 14:00 on August 31, 2018:

awk '{print $4,$1}' access.log | grep 31/Aug/2018:14 | awk '{print $2}' | sort | uniq | wc -l

Count the top 10 most active IP addresses:

awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -10

List the pages accessed by a certain IP address:

grep ^202.106.19.100 access.log | awk '{print $1,$7}'

List the access details of a certain IP, sorted by access frequency:

grep '202.106.19.100' access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 100

Response page size statistics

List the files with the largest transfer sizes

cat access.log | awk '{print $10 " " $1 " " $4 " " $7}' | sort -nr | head -100

List pages whose body size is larger than 200000 bytes (~200 KB), together with their occurrence counts:

cat access.log | awk '($10 > 200000){print $7}' | sort -n | uniq -c | sort -nr | head -100

List the most frequently visited pages (TOP 100):

awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -n 100

List the most frequently visited pages, excluding PHP pages (TOP 100):

grep -v ".php" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 100

List pages that have been accessed more than 100 times:

cat access.log | cut -d ' ' -f 7 | sort | uniq -c | awk '{if ($1 > 100) print $0}' | less

List the most visited pages among the 1000 most recent records:

tail -1000 access.log | awk '{print $7}' | sort | uniq -c | sort -nr | less

PV-related statistics

Count requests per minute, top 100 points in time (accurate to the minute):

awk '{print $4}' access.log | cut -c 14-18 | sort | uniq -c | sort -nr | head -n 100

Count requests per hour, top 100 points in time (accurate to the hour):

awk '{print $4}' access.log | cut -c 14-15 | sort | uniq -c | sort -nr | head -n 100

Count requests per second, top 100 points in time (accurate to the second):

awk '{print $4}' access.log | cut -c 14-21 | sort | uniq -c | sort -nr | head -n 100

PV of the current day:

grep "10/May/2018" access.log | wc -l

Explanation of the pipeline used above: awk '{print $4}' extracts the time field; cut -c selects the characters of interest; sort groups identical lines together; uniq -c prints the number of occurrences of each repeated line; sort -nr sorts the repeated lines in reverse numeric order; head -n 100 displays the top 100.

Page response time statistics

You can use the following command to count all log records with response times greater than 3 seconds.

awk '($NF > 3){print $11}' access.log

Note: $NF is the last field.
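A quick way to see what NF and $NF refer to:

echo "a b c" | awk '{print NF, $NF}'    # prints "3 c": NF is the field count, $NF the last field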

List all PHP pages whose request time is longer than 3 seconds and count the number of times they appear. Display the top 100 pages

cat access.log | awk '($NF > 3 && $7 ~ /\.php/){print $7}' | sort -n | uniq -c | sort -nr | head -100

List requests whose duration exceeds 5 seconds; display the first 20:

awk '($NF > 5){print $0}' access.log | sort -n | uniq -c | sort -nr | head -20

Spider crawl statistics

Count the number of spider crawls

grep 'Baiduspider' access.log | wc -l

Count the number of times the spider crawled a 404:

grep 'Baiduspider' access.log | grep '404' | wc -l

Summary

Through this introduction, I believe you will find the Linux Three Swordsmen powerful. On the command line they can also accept and execute external AWK program files, supporting very complex text processing, so you could almost say there is nothing they cannot do.