APM is an Application Performance Monitor
This article has three premises:
- In terms of product form, this is certainly no rival to Alibaba's offerings, so I apologize for riding on their name. You can replace "Alibaba" here with any APM tool you find on Google. At the end of the article, though, I will run Alibaba's tool against the same site to see how far apart we really are. Interestingly, although the tool I wrote is very simple compared with Alibaba Cloud's monitoring product, it still achieved my goal and helped me find out where the problem was. In that sense, it is indeed a victory.
- Site2share is a tool site I wrote for myself. Once it launched I needed to know how its performance would feel to users, so this tool was built for that purpose and was retired after completing its performance-testing mission.
- This post is actually a response to my article from last year, "The Crisis of Faith in Performance Metrics". That article was all about the rationale and design behind the tool, without a single line of actual code.
I remember a classic front-end question that circulated on the Internet many years ago, which basically asked you to explain everything that happens from the moment you type a URL into the browser's address bar to the moment you see the page. What's fascinating about this kind of question is the jolt it gives you: it reveals how blind we are to things right under our noses, and how big the problem hiding behind the question really is.
The question I'm going to answer in this article is straightforward: how do I know how slow my site is, and where? This is the first thing I needed to figure out when the website launched.
Problems to be solved
Determining the metrics
Under this big question, there are two sub-questions to figure out first:
- What metric should I use to measure speed?
- How do I find out where the bottleneck is?
These two sub-questions were explained in detail in last year's article "The Crisis of Faith in Performance Metrics", so for reasons of space only the conclusions are stated here. The answers to the two questions are inextricably linked and have to be discussed together.
In short, technical metrics such as OnLoad or DOMContentLoaded are simply not enough, and even First Contentful Paint is still far from what the user actually perceives (as I'll demonstrate later). A good metric should be as close to the user as possible, even deeply customized to the business. So my suggestion is to use the appearance time of the DOM element that hosts the page's core content as the core performance indicator. That moment is critical because it marks the point at which the site can be called usable.
Take the detail page of the website as an example: the key element is .single-folder-container.
But that doesn't mean one metric is enough, because if that single metric turns out to be poor, we can't pinpoint the problem from it. Good data should work in both directions: it accurately reflects the current state of the product (from product to data), and by observing it we can also tell what kind of problem the product has (from data to product).
With that in mind, we need to derive metrics from the "potential" performance bottlenecks. Let's assume the factors affecting the site's loading performance are:
- How quickly resources load (scripts, styles, and other external resources)
- API response time
So how about we simply track the load times of both?
To answer that, we have to ask ourselves: are these two kinds of information enough to figure out what went wrong? Compared with a single indicator, yes, but there is still room for refinement. Take resource loading as an example: as the Resource Timing diagram below shows, loading a resource is itself divided into multiple stages:
We could even diagnose whether there is a problem at the DNS resolution or TCP connection stage. But should we collect every indicator in such detail? There are several factors to consider:
- Even if the problem is exposed, does it really need fixing, and am I able to fix it? For example, a few hundred milliseconds of DNS resolution time may be several times the industry norm, but is it really the bottleneck of my whole site? Wouldn't adopting an existing CDN solution be more cost-effective than painstakingly shaving off those few hundred milliseconds myself?
- My personal experience is that code used to collect metrics has a maintenance cost, usually higher than that of business code, and the cost grows with how invasive the code is. It is expensive because it is hard to notice when it breaks, and unit testing and regression testing are harder.
Going back to determining the metrics, one thing we have to accept is that we won't know all the data we need up front. That's normal: determining metrics is a converging loop of hypothesis, test, hypothesis, test. Trying beats standing still and gets us closer to the right answer. Let's start by collecting the three metrics mentioned above:
- Timing of key elements
- Resource load time
- API response time
The API problem
A pitfall front-end engineers are bound to fall into is looking at the problem only from the front end and ignoring API performance, which matters most. For most people, page loading is simply linear:
But within the API request segment, we should look at things from a microservices perspective. From the moment a request is issued to the moment the response is received, it passes through different microservices to fetch data. If we can trace every hop of the request, we can locate problems in the production environment and measure the efficiency of individual microservices. This is Distributed Tracing, and the technology is already quite mature: Jaeger and Zipkin are both Distributed Tracing solutions.
And if we want to diagnose the performance of a single microservice, we can keep drilling down inside that service, for example comparing the performance of method calls in the service layer (the service-layer concept applies to the front end as well; for details see the article on architecture patterns and best practices that I translated a couple of years ago).
What I'm trying to say is that to fully explore an application's performance bottlenecks, we should look both upstream and downstream; conclusions drawn from only one side are biased.
The solution
Collect the logs
If you have any experience with log collection, you know that collecting logs and exporting them are two different things, especially for back-end applications: logs can be written to local files or straight to the console, but in a production environment they need to end up in a professional logging service.
For example, Winston, the open-source logging library for Node.js, supports integrating multiple transports (a transport is a storage destination for logs). It also supports writing custom transports, and the community's transport options currently cover almost all of the major logging services on the market. The same concept exists as logging providers in .NET Core.
However, this "active" way of collecting logs is not best practice. The Twelve-Factor App, a methodology for building web applications, proposes that the application itself should not concern itself with log storage; it should only make sure logs are written to stdout, and the environment is responsible for collecting and processing them. The proposal makes sense, because the application should not and cannot know which cloud environment it will be deployed to, and different environments handle logs differently.
For the sake of failing fast, I did not follow this philosophy when developing the Site2share back end. When I need to record a log, I directly call the platform-specific method. All my logs now live in Azure Application Insights, so recording one means calling the Application Insights client's trackTrace method with the message, as sketched below.
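For context, here is a minimal sketch of what that direct call looks like with the applicationinsights Node.js SDK; the instrumentation key and the message text are placeholders of my own, not values from the actual Site2share code:

const appInsights = require('applicationinsights');

// placeholder key; in practice it would come from configuration or an environment variable
appInsights.setup('<instrumentation-key>').start();
const appInsightsClient = appInsights.defaultClient;

// "actively" send a log straight to Application Insights
appInsightsClient.trackTrace({ message: 'fetching folder detail' });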
At the implementation level, however, the code can be more elegant with Winston: we can create a logger that writes to multiple output channels at the same time:
const winston = require('winston');
// custom transport (hypothetical path), sketched just below
const AppInsightsTransport = require('./AppInsightsTransport');

const logger = winston.createLogger({
  transports: [
    new AppInsightsTransport(),
    new winston.transports.Console()
  ]
});
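A custom transport like the AppInsightsTransport above can be built on the winston-transport base class. Here is a minimal sketch that simply forwards every entry to Application Insights as a trace (an illustration of the idea, not necessarily how a production transport would be written):

const Transport = require('winston-transport');
const appInsights = require('applicationinsights');

class AppInsightsTransport extends Transport {
  constructor(opts = {}) {
    super(opts);
    // reuse the Application Insights client initialized elsewhere
    this.client = opts.client || appInsights.defaultClient;
  }

  // winston calls log() for every entry handed to the logger
  log(info, callback) {
    this.client.trackTrace({ message: info.message });
    callback();
  }
}

module.exports = AppInsightsTransport;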
Because we are measuring front-end performance and the data is generated in visitors' browsers, we rely on a script embedded in the page to upload each user's data after their visit.
Application Insights
I chose Azure Application Insights to store and query the logs. One reason is that I already use Azure services from the front end (Azure Static Web Apps) to the back end (Azure App Service) and even DevOps, so the first-party Application Insights naturally integrates better with them; more importantly, it solves the Distributed Tracing problem for me.
To collect logs, you need to include the Application Insights SDK in your application; SDKs exist for both front-end and back-end applications. It collects logs in two ways: automatic collection and active reporting. Take a JavaScript web application as an example: once the SDK is installed on the page, it automatically collects errors, asynchronous requests, console.log output (via monkey patching), and performance information (through the Performance API) while the application runs. You can also call the trackMetric and trackEvent methods provided by the SDK to actively report custom metrics and events. We use both approaches in our performance collection.
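As a reference, a minimal sketch of wiring up the browser SDK and actively reporting something; the instrumentation key is a placeholder, the metric name comes from the naming section at the end of this post, and the event name is purely illustrative:

import { ApplicationInsights } from '@microsoft/applicationinsights-web';

// placeholder key; in a real setup it is copied from the Azure portal
const appInsights = new ApplicationInsights({
  config: { instrumentationKey: '<instrumentation-key>' }
});
appInsights.loadAppInsights(); // from here on, errors, requests and page views are collected automatically

// actively report a custom metric and a custom event
appInsights.trackMetric({ name: 'browser:first-paint', average: 321 });
appInsights.trackEvent({ name: 'apm-script-loaded' });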
Metrics, logs, and other information are collectively referred to as telemetry (data/items), and different kinds of telemetry are stored in separate tables. So how do we connect them? The way Application Insights correlates telemetry is simple: every piece of data carries a unique context identifier, operation_Id. For example, if a user visits a page once and all the data generated by that visit carries the operation_Id xyz, we can query the associated data on the Application Insights platform (using Kusto syntax) by that xyz:
(requests | union dependencies | union pageViews)
| where operation_Id == "xyz"
We can associate not only front-end data with other front-end data, but also front-end data with back-end data, which is exactly where Distributed Tracing shines. For microservice applications, Application Insights can even generate an Application Map that visualizes the calls between services and the time spent in each.
Having gone through the technical details in this section, we now have a picture of what data we need and how it all relates.
Resource loading indicators
Thanks to the Performance API, collecting these metrics in modern browsers is surprisingly easy. Without any work on our part, the browser already records performance data for every page load as PerformanceEntry objects along a timeline. All we have to do is filter out the entries we need, for example the scripts we care about:
window.performance.getEntries().filter(({ initiatorType, entryType }) => initiatorType === 'script' && entryType === 'resource')
As concluded in the previous section, we will not record every stage of resource loading in detail. Here I focus on two values, both available from PerformanceEntry: how long the resource took to load (duration) and when the browser started fetching it (fetchStart), because in my current view, starting loads earlier and making them shorter are the most practical ways to improve performance. If these two values turn out to show nothing unusual, more indicators can be collected later.
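A sketch of how this could be reported, assuming the appInsights instance from the SDK setup sketch above; what exactly follows the resource:script: prefix in the final metric names is my own choice (here the resource URL), and fetchStart is sent along as a custom property:

// assumes the appInsights instance created in the SDK setup sketch above
window.performance.getEntriesByType('resource')
  .filter(entry => entry.initiatorType === 'script')
  .forEach(entry => {
    appInsights.trackMetric(
      { name: `resource:script:${entry.name}`, average: entry.duration }, // load duration in ms
      { fetchStart: entry.fetchStart } // when the fetch started, relative to navigation start
    );
  });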
Reporting when the key element appears
The simplest, crudest way to find out when an element appears is to poll for it with setInterval, but in modern browsers we can use the MutationObserver API to watch for DOM changes, so the question can be rephrased as: when does the .single-folder-container element appear under the body tag?
const observer = new MutationObserver(mutations => {
  if (document.querySelector('.single-folder-container')) {
    observer.disconnect();
    return;
  }
});
observer.observe(document.querySelector('body'), {
  subtree: true,
  childList: true
});
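The snippet above only detects the element; to actually report the timing, a rough sketch might look like the following, using performance.now() (milliseconds since navigation start), the appInsights instance from earlier, and the folder-detail:visible name from the conclusion:

// assumes the appInsights instance created in the SDK setup sketch above
const observer = new MutationObserver(() => {
  if (document.querySelector('.single-folder-container')) {
    observer.disconnect();
    // how long after navigation start the key element became visible
    appInsights.trackMetric({
      name: 'folder-detail:visible',
      average: performance.now()
    });
  }
});
observer.observe(document.body, { subtree: true, childList: true });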
Here’s the problem: this code is critical and difficult to test.
The first problem is that there is no native MutationObserver in a Jest environment, for example, and mocking MutationObserver just to make the test pass defeats the purpose of testing it at all.
Second, even if you test in an environment such as Headless Chrome that does support MutationObserver, how do you know it is reporting the correct appearance time? You don't know the exact timing yourself (that is, what the expect in your test should assert): 10 seconds is obviously wrong, but is 2.2 seconds right?
Additional performance indicators
In theory, that covers all the metrics we planned to collect. But there are two more I'd like to gather: First Paint and First Contentful Paint, which record key moments in the browser's painting of the page. Both are also available from the Performance API:
window.performance.getEntries().filter(entry => entry.entryType === 'paint')
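A sketch of reporting them under the names used in the conclusion, again assuming the appInsights instance from earlier (for paint entries, entry.name is either first-paint or first-contentful-paint, so the prefix maps directly):

// assumes the appInsights instance created in the SDK setup sketch above
window.performance.getEntries()
  .filter(entry => entry.entryType === 'paint')
  .forEach(entry => {
    // produces browser:first-paint and browser:first-contentful-paint
    appInsights.trackMetric({
      name: `browser:${entry.name}`,
      average: entry.startTime
    });
  });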
Paint timing is closer to the user experience than purely technical metrics, but how will it compare with the moment users actually see the key element appear?
Back-end timing
I suspect there are two possible performance bottlenecks: 1) Redis queries and 2) MySQL queries.
Redis is mainly used for session storage, and because the back end is built with Node.js + ExpressJS, session-read performance is hard to monitor directly. MySQL queries, however, can be timed by hand; take the findByFolderId query as an example:
const findFolderIdStartTime = +new Date();
await FolderService.findByFolderId(parseInt(req.params!.id));
// report how long the query took, in milliseconds
appInsightsClient.trackMetric({
  name: "APM:GET_SINGLE_FOLDER:FIND_BY_ID",
  value: +new Date() - findFolderIdStartTime
});
Conclusion
Finally, to make it easy to find the corresponding metrics on the logging platform and to aggregate metrics of a given type, we need to name them. The naming is as follows.
- Time for the back-end database to query a single record — APM:GET_SINGLE_FOLDER:FIND_BY_ID
- First Paint in the browser — browser:first-paint
- First Contentful Paint in the browser — browser:first-contentful-paint
- Asynchronous request data in the front end — resource:xmlhttprequest:
- Script loading data in the browser — resource:script:
- Stylesheet loading data in the browser — resource:link:
- Visible time of the key element on the detail page — folder-detail:visible
- Visible time of the key element on the personal home page — dashboard:visible
This post is coming to an end. I have explained the thinking behind this performance-collection solution and essentially implemented our collection script. With this code we can collect a complete set of performance data for a single visit.
In the next installment we will tackle several remaining issues:
- How to get a large enough sample to measure performance when the site has hardly any visitors (solved with Azure Serverless and Azure Logic Apps)
- How to find problems in the 200,000 records that were eventually generated
- How the results compare with ARMS, Alibaba's monitoring service
This article is also published on my website "Technology Roundtable" and on Zhihu. Feel free to subscribe.