Overview

When writing crawlers, we often meet pages such as www.1688.com/?spm=a261p…, whose data is loaded asynchronously. In the browser's developer tools (F12), the Headers tab of the XHR entry clearly shows the URL of the asynchronous request. If we access that URL directly (or look at the returned data in the Preview tab), we can see that the response is exactly the data we want to crawl. The next step is to analyze the URL format, which generally follows a recognizable pattern.
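As a rough illustration of requesting the asynchronous URL directly and exploiting a regular URL format, here is a minimal Python sketch. The endpoint `https://example.com/api/items`, the `page` and `pageSize` parameter names, and the assumption that the response body is JSON are all hypothetical; substitute whatever the XHR entry in the developer tools actually shows.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; use the URL that the XHR
# entry in the developer tools actually shows.
BASE_URL = "https://example.com/api/items"


def build_page_url(base_url: str, page: int, page_size: int = 20) -> str:
    """Build the URL for one page, assuming page/pageSize query parameters."""
    query = urlencode({"page": page, "pageSize": page_size})
    return f"{base_url}?{query}"


def fetch_json(url: str, timeout: float = 10.0):
    """Request the asynchronous URL directly and decode a JSON body."""
    # Some sites reject requests without a browser-like User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only print the generated URLs here; call fetch_json(url) on a real
    # endpoint once its format has been confirmed.
    for page in range(1, 4):
        print(build_page_url(BASE_URL, page))
```

Once the pattern is confirmed, crawling further pages is just a matter of changing the parameter values and calling `fetch_json` on each generated URL.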

Exploring the problem

In practice, you will sometimes see data in the browser that you cannot get hold of: because the page data is loaded asynchronously, the page fetched with a plain HTTP request does not contain it. Such pages send HTTP requests in the background and then fill the results into the page with Ajax or a similar technique. You can look for these requests in the browser's developer tools (F12). If the data is not there either, the page may be doing a lot of client-side computation before it presents the final screen. It is also worth checking the Status Code shown in F12 to see whether a redirect is involved (for example, 307).
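The redirect check can also be done from code. The sketch below uses only the Python standard library; the names `NoRedirect`, `is_redirect`, and `probe` are made up for this example. It turns off automatic redirect following so that a status such as 307 and its `Location` header stay visible instead of being followed silently.

```python
import urllib.error
import urllib.request

# Status codes that normally come with a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_redirect(status: int) -> bool:
    """Return True when the status code indicates a redirect."""
    return status in REDIRECT_CODES


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def probe(url: str, timeout: float = 10.0):
    """Return (status, location) for url without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as error:
        # 3xx responses (and 4xx/5xx errors) end up here.
        return error.code, error.headers.get("Location")
```

Calling `probe` on the page URL and passing the returned status to `is_redirect` tells you whether the data you are looking for may live behind the redirect target.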

The solution

If the data is hard to obtain with plain HTTP requests from a console program (the page may depend on many browser-side behaviors), consider opening the page in a real browser through a driver and controlling that driver with Selenium. This can save a lot of time.
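A minimal sketch of the driver approach follows. It uses Selenium's Python bindings rather than the C# used in the article referenced below, and it assumes that the `selenium` package and a Chrome installation are available. The function name `render_page` and the CSS selector passed to it are hypothetical.

```python
def render_page(url: str, wait_selector: str, timeout: float = 10.0) -> str:
    """Open url in headless Chrome, wait for an element, return the HTML."""
    # Imported inside the function so this snippet can be loaded even on a
    # machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the asynchronously loaded element is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        # page_source now reflects the DOM after the scripts have run.
        return driver.page_source
    finally:
        driver.quit()
```

A call such as `render_page("https://example.com/list", "div.item")` would return the HTML after the asynchronous content has been filled in, which can then be parsed like any static page. The explicit wait is what makes the difference here: without it, the page source may be read before the background requests have finished.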

For detailed methods, see:

Using C#+Selenium+ChromeDriver to crawl web pages perfectly simulates real user browsing behavior