1.1 Ajax asynchronous loading generates web content
More and more web pages load content asynchronously with Ajax; that is, part of the page content is generated dynamically by front-end JS. Because this content is produced by JS after the page has loaded, it is visible in the browser but absent from the HTML source code.
For example, the search application page of App Treasure is displayed as follows:
As you can see above, the list of applications returned by the search sits under the DOM element with id="J_SearchDefaultListBox".
Now view the source code of the current page (shortcut key "Ctrl + U"):
In the source code, the element with id="J_SearchDefaultListBox" is empty and contains no application list information.
From this analysis we can conclude that the application list shown on the App Treasure search page is dynamically generated and loaded by JS.
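The same check can be scripted. The following is a minimal sketch, using only the standard library, that reports whether the element with a given id has any child tags or text in a piece of HTML. The two HTML strings are hypothetical stand-ins for "view source" and for the rendered DOM; they are not taken from the real page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerProbe(HTMLParser):
    """Counts child tags and non-blank text inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # 0 means we are outside the target element
        self.child_tags = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.child_tags += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: count it, but it opens no new level.
        if self.depth > 0:
            self.child_tags += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.text_chars += len(data.strip())


def is_empty_in_source(html, target_id):
    """Return True if the target element has no child tags and no text."""
    probe = ContainerProbe(target_id)
    probe.feed(html)
    return probe.child_tags == 0 and probe.text_chars == 0


# Hypothetical samples: raw source vs. DOM after JS has rendered the list.
raw_source = '<html><body><div id="J_SearchDefaultListBox"></div></body></html>'
rendered_dom = ('<html><body><div id="J_SearchDefaultListBox">'
                '<ul><li>app one</li></ul></div></body></html>')

print(is_empty_in_source(raw_source, "J_SearchDefaultListBox"))    # True
print(is_empty_in_source(rendered_dom, "J_SearchDefaultListBox"))  # False
```

If the container is empty in the downloaded HTML but populated in the browser, the content is most likely filled in by JS.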
In this case, how do we crawl web content? There are generally two methods:
(1) Find the data returned to the JS script in the web page's responses (mostly in JSON format, sometimes in XML);
(2) Use Selenium to simulate browser access to the web page.
We will introduce the first method below. The second method is covered separately.
Since the web content is dynamically generated and loaded by JS, the JS must first call an interface and then render the page from the data that interface returns. So we can locate the data interface called by the JS and read the data shown on the page directly from it. Here we take the search page of App Treasure as an example.
1.2.1 Finding the data interface requested by JS
Perform the following steps:
- Open the application search page (android.myapp.com/myapp/searc…)
- Press F12 to open the web debugger
- Select the "Network" tab
- Select "XHR" (XMLHttpRequest, a concept from Ajax, i.e. Asynchronous JavaScript and XML)
- Enter the application name (for example, 微信, i.e. WeChat)
You will see the following information:
Here, we see that there is only one request (other pages may have multiple). Click on the request and select “Preview” to see the following data:
In the data above we find "wechat", "duokai assistant", and other entries; these are the applications returned by the search.
Based on this analysis, the search result list is obtained through this request. The corresponding request URL is android.myapp.com/myapp/searc… (to copy it, select the request, then right-click -> Copy -> Copy Link Address).
Opening the URL above in a browser returns a string of data. It looks messy, but it’s actually data in JSON format.
Thus, we find the data interface for the JS request.
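To make such a compact JSON string readable, it can be parsed and re-serialized with indentation. The sketch below uses a hypothetical, heavily shortened payload: only the key names `obj`, `appDetails`, `appName`, and `pkgName` come from the code in this article, and the values are invented for illustration.

```python
import json

# Hypothetical stand-in for the response text; the real response has many more fields.
raw_response = '{"obj":{"appDetails":[{"appName":"微信","pkgName":"com.example.app"}]}}'

data = json.loads(raw_response)

# indent=2 adds line breaks; ensure_ascii=False keeps Chinese text readable.
pretty = json.dumps(data, indent=2, ensure_ascii=False)
print(pretty)

# Listing the keys level by level helps locate the part that holds the list.
print(list(data.keys()))         # ['obj']
print(list(data["obj"].keys()))  # ['appDetails']
```

Walking down the keys this way is how a path such as `data['obj']['appDetails']` can be identified.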
1.2.2 URL encoding
The data interface we found above for the JS request is android.myapp.com/myapp/searc… What does %E5%BE%AE%E4%BF%A1 in it mean?
In fact, %E5%BE%AE%E4%BF%A1 is the URL encoding of the Chinese word 微信 (WeChat). In Python it can be produced with urllib.parse.quote(), i.e. '%E5%BE%AE%E4%BF%A1' == urllib.parse.quote('微信').
According to the URL standard, a URL may contain only letters, digits, and a limited set of symbols. Other characters (such as Chinese characters) are not allowed in a URL as-is, so they must be URL-encoded.
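As a small illustration, the following sketch encodes and decodes the keyword with the standard library. It also shows that the encoded form is simply the keyword's UTF-8 bytes written as %XX, and that plain ASCII letters pass through unchanged.

```python
from urllib.parse import quote, unquote

keyword = "微信"

encoded = quote(keyword)   # percent-encodes the UTF-8 bytes of the keyword
print(encoded)             # %E5%BE%AE%E4%BF%A1
print(unquote(encoded))    # 微信

# The same six bytes, shown as hex without the percent signs.
print(keyword.encode("utf-8").hex().upper())  # E5BEAEE4BFA1

# ASCII letters are already valid in a URL, so they are left as they are.
print(quote("wechat"))     # wechat
```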
1.2.3 Code implementation
The implementation takes four steps:
(1) Import the required libraries;
(2) Request the data interface used by the JS;
(3) Parse the HTTP response data as JSON;
(4) Traverse the list to obtain each App's information (App name, package name, version number, APK download URL, etc.).
import json
import requests
from urllib.parse import quote

# Send an HTTP request to the data interface
app_name = '微信'
url = 'https://android.myapp.com/myapp/searchAjax.htm?kw={}&pns=&sid='\
    .format(quote(app_name))
web_data = requests.get(url).text

# Parse the HTTP response data as JSON
data = json.loads(web_data)
app_infos = data['obj']['appDetails']

# Get the App information (App name, package name, version number,
# APK download URL, etc.) for each entry in the list
for app_info in app_infos:
    app_name = app_info['appName']
    pkg_name = app_info['pkgName']
    version_code = app_info['versionCode']
    version_name = app_info['versionName']
    download_url = app_info['apkUrl']
    print(app_name, pkg_name, version_code, version_name, download_url)
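One possible refinement, offered as a sketch rather than as part of the original script, is to separate URL construction and JSON parsing from the network call so that the parsing can be checked offline. The function names `build_search_url` and `extract_apps` and the sample payload are hypothetical; the URL pattern and the key names are those used in the script above.

```python
import json
from urllib.parse import quote

SEARCH_URL = "https://android.myapp.com/myapp/searchAjax.htm?kw={}&pns=&sid="


def build_search_url(keyword):
    """Return the search interface URL with the keyword URL-encoded."""
    return SEARCH_URL.format(quote(keyword))


def extract_apps(response_text):
    """Parse the JSON text and return one dict per entry in obj.appDetails."""
    data = json.loads(response_text)
    # Tolerate a missing or null "obj"/"appDetails" instead of raising KeyError.
    details = (data.get("obj") or {}).get("appDetails") or []
    return [
        {
            "app_name": item.get("appName"),
            "pkg_name": item.get("pkgName"),
            "version_code": item.get("versionCode"),
            "version_name": item.get("versionName"),
            "download_url": item.get("apkUrl"),
        }
        for item in details
    ]


# Hypothetical sample response; all values are invented for illustration.
sample = json.dumps({"obj": {"appDetails": [{
    "appName": "微信", "pkgName": "com.example.app",
    "versionCode": 1, "versionName": "1.0",
    "apkUrl": "https://example.com/app.apk"}]}})

print(build_search_url("微信"))
print(extract_apps(sample))
print(extract_apps('{"obj": null}'))  # []
```

With this split, the live request reduces to passing `requests.get(build_search_url(keyword), timeout=10).text` to `extract_apps`, and an empty or malformed result list no longer stops the script with a KeyError.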