Preface
The data on many websites, such as product prices and reviews on e-commerce sites, is loaded dynamically, so a crawler may not be able to get that data directly on its first visit. So how do we deal with this problem?
First, the use of dynamic web pages
Let’s start with an example:
Consider the page for a book on JD.com. When we first open it, the price, ranking, and reviews of the book are not loaded right away; instead, they are fetched by a second request or by several asynchronous requests. Pages like this are dynamic pages.
Scenarios for using dynamic pages:
Scenarios where asynchronous refresh is desirable: some pages contain a lot of content, loading everything at once puts a heavy burden on the server, and many users never look at all of it anyway.
Second, back to the traditional approach: sending request data to the HTTP server
1. GET method
GET appends the parameters to the URL as a query string of key-value pairs, so the data is visible directly in the URL.
Some symbols and characters are not allowed in browser URLs, so the information has to be encoded before it is sent: the sender performs urlencode and the receiver performs urldecode.
www.baidu.com/s?ie=utf-8&…
Online testing tools: tool.chinaz.com/tools/urlen…
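In Python, the standard library's urllib.parse module can do this encoding and decoding directly (a minimal sketch, independent of any particular site):

```python
from urllib.parse import quote, unquote, urlencode

# URL-encode a single value that contains characters not allowed in URLs
keyword = "动态网页"
encoded = quote(keyword)
print(encoded)            # %E5%8A%A8%E6%80%81%E7%BD%91%E9%A1%B5
print(unquote(encoded))   # 动态网页  (decoding restores the original text)

# urlencode builds a complete key=value&key2=value2 query string
print(urlencode({"ie": "utf-8", "wd": keyword}))
```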
Examples:
1. www.baidu.com/s?wd=DNS — the part after the ?, of the form xxx=yyy&time=zzz, is the GET parameter string that identifies the request.
2. acb.com/login?name=… — a login request carrying name=zhangsan and password=123 as query parameters.
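To simulate such a GET request from a crawler, an HTTP client such as requests (assumed here) can take the parameters as a dict and handle the URL encoding for us. A minimal sketch based on the Baidu example above:

```python
import requests

# Rebuild the www.baidu.com/s?ie=utf-8&wd=... request from the example above.
params = {"ie": "utf-8", "wd": "DNS"}
headers = {"User-Agent": "Mozilla/5.0"}  # look like an ordinary browser

resp = requests.get("https://www.baidu.com/s", params=params, headers=headers)
print(resp.url)          # the full URL that was actually requested
print(resp.status_code)  # 200 means the request succeeded
```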
2. POST method
Here’s an example of how to use the POST method:
This is a page from Youdao Translate. Careful observation shows that the URL of the page does not change no matter what word the user enters to translate. This is a typical case of asynchronous Ajax requests transferring data in JSON format.
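To mimic that kind of Ajax call, the crawler sends a POST with the same form fields the browser sends and parses the JSON reply. The endpoint and field names below are placeholders, not the real Youdao interface (which changes over time); they only illustrate the pattern:

```python
import requests

# Placeholder endpoint and form fields -- the real Youdao Ajax interface
# differs and changes over time; this only shows the general pattern.
url = "https://example.com/translate/ajax"
data = {"i": "hello", "from": "en", "to": "zh"}
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.post(url, data=data, headers=headers)
result = resp.json()   # the server replies with JSON instead of a new page
print(result)
```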
Third, dynamic sites that are harder to handle
1. Sites that require simulating multiple rounds of data interaction
We sometimes encounter large websites such as Taobao that attach special importance to data copyright. These sites are maintained by large teams of engineers and may use multiple interacting data packets to complete the exchange between the web server and the user's browser. Analyzing the packets in the traditional way then becomes complicated and difficult. So, is there a way to solve this problem once and for all?
Our solution is Selenium + PhantomJS.
Our crawlers are essentially emulating the behavior of the browser.
2. Selenium
Selenium is a web automation testing tool, originally developed for automated testing of websites. It is similar in spirit to the macro tools ("button wizard" style tools) used to automate games, but it performs its actions inside a real browser.
Install: sudo pip install selenium (or pip install selenium)
Test the installation in Python: from selenium import webdriver
Note: Students who want to use Python for automated testing should study the use of Selenium.
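A minimal sketch of driving PhantomJS through Selenium (this assumes an older Selenium release, since PhantomJS support was removed from newer versions, where a headless Chrome or Firefox is used instead):

```python
from selenium import webdriver

# Requires an older Selenium release; recent versions dropped PhantomJS
# support in favour of headless Chrome/Firefox.
driver = webdriver.PhantomJS()             # launches the phantomjs process
try:
    driver.get("https://www.baidu.com")    # load the page and execute its JS
    print(driver.title)                    # title as rendered by the browser
    print(len(driver.page_source))         # full, JavaScript-rendered HTML
finally:
    driver.quit()                          # always shut the browser down
```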
3. PhantomJS (headless browser)
Note: in class we use the Firefox browser with its GUI, to make the demonstration easier to follow.
PhantomJS is a WebKit-based headless browser: it loads a website into memory and executes the JavaScript on the page, but it has no graphical user interface, so it consumes fewer resources.
Install: sudo apt install phantomjs
For a full installation on Ubuntu Linux, see blog.csdn.net/m0_38124502…
wget bitbucket.org/ariya/phant…
cd download
tar -xvf phantomjs-2.1.1-linux-x86_64.tar.bz2
cd phantomjs-2.1.1-linux-x86_64/
cd bin/
sudo cp phantomjs /usr/bin
When the Python script starts, it launches phantomjs as a separate browser process.
Testing: run the example scripts under SpiderCodes\Phantomjs.., such as helloworld.js and pageload.js, to verify that PhantomJS works.
Note: phantomjs processes that are not shut down properly may cause resource leakage; to avoid this, you need a strategy for killing phantomjs at the appropriate time.
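One simple strategy (a sketch, again assuming Selenium is driving PhantomJS as above) is to register a cleanup handler and always quit the driver, so stray phantomjs processes are not left running:

```python
import atexit
from selenium import webdriver

driver = webdriver.PhantomJS()

def cleanup():
    # Make sure the phantomjs process is terminated even if the crawl crashes;
    # orphaned processes slowly leak memory and file handles.
    try:
        driver.quit()
    except Exception:
        pass

atexit.register(cleanup)   # runs when the Python interpreter exits

try:
    driver.get("https://www.example.com")
    print(driver.title)
finally:
    cleanup()
```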
Fourth, summary of scraping dynamic websites
In general, our crawler tries to mimic the behavior of a real user visiting the site in a browser. Simulating the browser-server communication with GET or POST is relatively cheap, but it is hard to fool complex sites or sites whose servers are carefully defended. Selenium + PhantomJS makes our program behave much more like a normal user, but it is far less efficient and much slower. Scraping at scale also brings many new challenges (such as site size settings, waiting time settings, etc.).