Capture Jane book user information

All the crawlers I’ve written in the past have stored urls with known fixed data into a list and then iterated through the list of urls. This time for simple books, let’s try recursion.

The programming technique of what is a recursive program (or function) calling itself is called recursion. A method by which a process or function invokes itself directly or indirectly in its definition or specification, usually by converting a large and complex problem into a smaller problem similar to the original problem.

3. The ability of recursion lies in using finite statements to define an infinite set of objects. This case video explains as follows:




Grab taobao comments

Before my level is limited, for taobao reviews such a dynamic web page, because the data is not found in the source code of the web page, so I can not capture the data, can only use Selenium imitation people control browser to capture the data, the advantage is easy to see and should not be blocked by Taobao company; The disadvantage is that the speed is too slow.

After a day of study today, I finally learned to analyze data packets, and taobao review data packets are transmitted in JSON format. In addition to being able to capture packets, you also need to be able to extract the desired comment data from JSON.

Implementation difficulties: 1. Analyze data packets to find the url used for taobao comment transmission and analyze the characteristics of the url. 2

The video explanation of this case is as follows:



Douban is my favorite platform. Generally, people go to Douban to read movie reviews, book reviews, and decide whether to watch movies or buy books according to the reviews. So there are a lot of economic management students who have data acquisition needs in this area. Of course, I am one of them, and my interest in this area prompted me to learn Python.

So before writing a crawler, we must learn to analyze the structure of the web page and locate the node label where the data you want to capture is located. The positioning methods are as follows:

  1. If the tag is the only tag in the entire HTML page, look for it directly.

  2. If the tag is not unique, then you can start with the parent node of the node. If the parent node is unique, locate the parent node first and then select the children of the parent node. The child node is the target node.

The video explanation of this case is as follows:






The original post was published on March 19, 2017

Author: Deng Xudong

This article is from the Python Chinese Community, a cloud community partner. For more information, please follow the wechat official account of Python Chinese Community