In this article, we use Chrome Developer Tools, an intuitive web analysis tool, to capture and analyze web pages
1. Test environment
Browser: Chrome
Browser Version: 67.0.3396.99 (official version) (32-bit)
Web analysis tool: Chrome Developer Tools
2. Web analysis
(1) Web source code analysis
As we know, web pages are either static or dynamic. Many people assume that a static page is simply a page without dynamic effects, but this is wrong
- A static page is a non-interactive web page without a background database, often with the suffix .htm, .html, or .xml
- A dynamic web page is an interactive web page that can transfer data to and from a background database, often with the suffix .aspx, .asp, .jsp, or .php
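The suffix rule above can be sketched in a few lines of Python. This is only a rough illustration: the function name `guess_page_type` is made up for this example, and a suffix is just a hint, since many dynamic sites use URLs with no suffix at all or even end them in .html.

```python
import os
from urllib.parse import urlparse

# Suffix lists taken from the two bullets above
STATIC_SUFFIXES = {".htm", ".html", ".xml"}
DYNAMIC_SUFFIXES = {".aspx", ".asp", ".jsp", ".php"}


def guess_page_type(url):
    """Return 'static', 'dynamic', or 'unknown' based only on the URL path suffix."""
    path = urlparse(url).path                    # drops the query string and fragment
    suffix = os.path.splitext(path)[1].lower()   # e.g. ".php", or "" if there is none
    if suffix in STATIC_SUFFIXES:
        return "static"
    if suffix in DYNAMIC_SUFFIXES:
        return "dynamic"
    return "unknown"


print(guess_page_type("https://example.com/index.html"))
print(guess_page_type("https://example.com/list.php?id=3"))
```

A result of "unknown" or "static" does not rule out asynchronous loading, so the checks described later in this article are still needed.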
In addition, many dynamic sites use asynchronous loading (Ajax), which is often why the source code we crawl differs from the source code the site actually displays
As for how to crawl dynamic web pages, there are two methods:
- One is to analyze the Ajax requests through packet capture, which is covered below
- The other is to render the page dynamically with Selenium or similar tools; see my other article, Basic Use of Selenium
Below, we take Jingdong (JD.com) products as an example to show how to capture packets with Chrome. First, open the page of a product
Right-click a blank area of the page and choose View page source (or press Ctrl+U to open it directly)
Note that View page source gives you the original source code of the site, which is usually the source code a crawler grabs
Right-click a blank area of the page again and select Inspect (or press Ctrl+Shift+I or F12 to open it directly)
Note that what you get this time is the source code after Ajax loading and JavaScript rendering, which is what the site currently displays
Comparing the two shows that their content is not the same, which is a typical result of asynchronous loading (Ajax)
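The comparison above can also be done programmatically. The following is a minimal sketch, assuming you have saved the two sources as strings yourself (one from View page source, one copied from the Elements panel). The function name `where_does_it_appear` and the HTML fragments are hypothetical and only serve as an illustration.

```python
def where_does_it_appear(raw_source, rendered_source, text):
    """Classify a piece of text by which version of the page contains it.

    raw_source      -- the original HTML (what View page source shows)
    rendered_source -- the HTML after JavaScript has run (what Inspect shows)
    """
    if text in raw_source:
        return "static"    # already present in the original HTML
    if text in rendered_source:
        return "dynamic"   # only appears after Ajax/JavaScript rendering
    return "absent"        # not found in either version


# Hypothetical fragments, not real Jingdong markup
raw = '<span class="price"></span><h1>Some product</h1>'
rendered = '<span class="price">99.00</span><h1>Some product</h1>'

print(where_does_it_appear(raw, rendered, "99.00"))
print(where_does_it_appear(raw, rendered, "Some product"))
```

A plain substring check is crude (the same text may be formatted differently in the two versions), but it is usually enough to spot values such as prices that only exist after rendering.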
At present, at least the prices of Jingdong products are generated through asynchronous loading. Here are three ways to determine whether a piece of content on a page is dynamically generated:
- First, examine the original source code returned by View page source. Look for typical dynamic-request statements, and compare it with the rendered source code shown by Inspect
- Second, use the packet capture analysis explained below. This is the most commonly used method and is worth mastering
- Third, a trick: disable JavaScript in Chrome. Enter chrome://settings/content/javascript in the address bar to open the JavaScript settings page, turn off the JavaScript option, and reload the page. The place where the price was displayed is now blank, which indicates that the price was generated dynamically by JavaScript
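For the first method, the search for "typical statements of dynamic requests" can be roughly automated. The sketch below scans the original source for a few common patterns. The pattern list and the name `find_dynamic_hints` are assumptions chosen for illustration; real sites may use other libraries or load their scripts from external files, in which case nothing will match.

```python
import re

# A few common signs that a page requests data after loading (not exhaustive)
DYNAMIC_HINTS = {
    "XMLHttpRequest": re.compile(r"XMLHttpRequest"),
    "jQuery ajax": re.compile(r"\$\.(ajax|get|getJSON|post)\s*\("),
    "fetch": re.compile(r"\bfetch\s*\("),
    "JSONP callback": re.compile(r"[?&]callback="),
}


def find_dynamic_hints(raw_source):
    """Return the names of all hint patterns found in the original page source."""
    return [name for name, pattern in DYNAMIC_HINTS.items()
            if pattern.search(raw_source)]


# Hypothetical page fragment
sample = '<script>$.getJSON("/api/price?callback=?", show)</script>'
print(find_dynamic_hints(sample))
print(find_dynamic_hints("<p>hello</p>"))
```

An empty result does not prove the page is static; it only means none of these particular patterns appear inline in the source.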
(2) Web page packet capture analysis
Let's again take Jingdong products as an example. Open the page of a product and try to capture the dynamically loaded price data
Press Ctrl+Shift+I or F12 to open the Developer Tools, then select the Network tab for packet capture analysis
Press F5 to refresh the page. Various requests appear in the Developer Tools, and we can use the filter bar to narrow them down
First, select Doc; only one request appears in the list
Typically, this is the first response the browser receives, and it carries the original source code of the requested page
Click Headers to see its request and response headers
Click Response to see the returned source code. It is easy to see that it matches what View page source returns
Back to the point. To analyze dynamically loaded data, focus on the XHR and JS filters
After filtering by JS, many requests appear in the list. After going through them, we singled out the request marked in the figure below
This request returns price information, but on closer inspection the prices are not for the current item, but for related items
Still, the request is price related, so let's look at its request URL
https://p.3.cn/prices/mgets?callback=jQuery1609108&type=1&area=1_72_2799_0&pdtk=&pduid=1539779074977382417990&pdpin=&pin=null&pdbp=0&skuIds=J_25630711066%2CJ_26395831446%2CJ_20823451030%2CJ_11332156897%2CJ_14020547214%2CJ_26498549638&ext=11100000&source=item-pc
Removing the unnecessary parameters, including callback, yields a simpler URL
https://p.3.cn/prices/mgets?skuIds=J_25630711066%2CJ_26395831446%2CJ_20823451030%2CJ_11332156897%2CJ_14020547214%2CJ_26498549638
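Trimming the parameters by hand works, but the same step can be sketched with Python's standard `urllib.parse` module. The helper name `keep_params` is made up for this example, and the captured URL is shortened here to a few of its parameters to keep the example readable.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url, wanted=("skuIds",)):
    """Return the URL with every query parameter removed except those in `wanted`."""
    parts = urlsplit(url)
    # parse_qsl decodes the query string into (name, value) pairs
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name in wanted]
    # urlencode re-encodes the values, so "," becomes "%2C" again
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Shortened version of the captured request URL
captured = ("https://p.3.cn/prices/mgets?callback=jQuery1609108&type=1"
            "&skuIds=J_25630711066%2CJ_26395831446&source=item-pc")

print(keep_params(captured))
```

Which parameters are truly optional can only be found by trial, as in the article: remove one, request the URL again, and check whether the response still contains the data.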
Open the URL directly in a browser, and you can see that it does return JSON data containing price information (although the prices are for other items)
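A callback parameter like the one in the captured URL normally means the response is JSONP, that is, JSON wrapped in a function call, which is one reason dropping it is convenient. The sketch below parses either form. The sample response body and its field names (`id`, `p`) are assumptions for illustration only, since the article does not list the actual fields.

```python
import json
import re

# Matches "name( ... )" with an optional trailing semicolon, e.g. jQuery123({...});
JSONP_PATTERN = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_json_or_jsonp(body):
    """Parse a response body that is either plain JSON or JSONP-wrapped JSON."""
    match = JSONP_PATTERN.match(body)
    payload = match.group(1) if match else body   # strip the wrapper if present
    return json.loads(payload)


# Hypothetical response bodies, not real data from the site
jsonp_body = 'jQuery1609108([{"id": "J_1", "p": "9.90"}]);'
json_body = '[{"id": "J_1", "p": "9.90"}]'

print(parse_json_or_jsonp(jsonp_body))
print(parse_json_or_jsonp(json_body))
```

Both calls produce the same Python list, so the rest of a crawler does not need to care which form the server returned.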
Judging from the URL parameters, each entry in skuIds should be the unique identifier of a product. So where can we find the skuId of the product we want?
SKU stands for Stock Keeping Unit, an abbreviation commonly used in logistics, transportation, and similar fields
It is the basic unit of inventory measurement and has come to mean a uniform product number, so each product should have a unique SKU
Look back at the URL of the product page we opened, item.jd.com/10072615543…
Isn't the unique identifier of the current product (10072615543) hidden right there? Let's try it!
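Pulling the identifier out of the product URL can be sketched with a regular expression. The rest of the product URL is elided in this article, so the sketch relies only on the visible part, namely that the path starts with a run of digits. The function name `extract_sku_id` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def extract_sku_id(product_url):
    """Return the leading run of digits in the URL path, or None if there is none."""
    if "//" not in product_url:
        # urlsplit needs "//" to recognize the host in URLs written without a scheme
        product_url = "//" + product_url
    path = urlsplit(product_url).path
    match = re.match(r"/(\d+)", path)
    return match.group(1) if match else None


print(extract_sku_id("item.jd.com/10072615543"))
```

Whether every Jingdong product URL follows this shape is an assumption based on the single example above.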
Sure enough, substituting it into the price URL gives us the full URL for this product's price, p.3.cn/prices/mget…
Visiting this URL directly returns the current price information
In fact, we can also generalize the URL so that it can be used to crawl the price of any Jingdong product
It is very simple: just separate out skuIds as a parameter, p.3.cn/prices/mget…
With the generalized URL, in theory, as long as we have the skuId of a product, we can get its price
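The generalization can be sketched as a small URL builder. The generalized URL itself is elided above, so this sketch is built from the simplified URL shown earlier and assumes that each identifier takes the `J_` prefix seen there and that several identifiers are joined with commas. The name `build_price_url` is made up for this example, and whether the endpoint still responds this way today is not guaranteed.

```python
from urllib.parse import urlencode

# Endpoint taken from the captured request earlier in the article
PRICE_ENDPOINT = "https://p.3.cn/prices/mgets"


def build_price_url(sku_ids):
    """Build a price URL for one or more product identifiers."""
    # Assumption: every identifier is prefixed with "J_", as in the captured URL
    joined = ",".join(f"J_{sku}" for sku in sku_ids)
    query = urlencode({"skuIds": joined})   # encodes "," as "%2C"
    return f"{PRICE_ENDPOINT}?{query}"


print(build_price_url(["10072615543"]))
print(build_price_url(["25630711066", "26395831446"]))
```

The resulting URL can then be fetched with any HTTP client and the response parsed as JSON to read the price.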