After learning the basics of Python, I was eager to get a quick start on web crawlers, since crawlers were the reason I learned Python in the first place, so I picked the Douban Top 250 movie list to crawl. All right, let's cut to the chase.
1. Find the web page and analyze its structure
First, open the Douban Top 250 page and press F12 to open the developer tools.
Click the arrow in the upper left corner of the developer tools to inspect the data you need. Here I found that each movie's information sits inside an <li> tag, so I can use a regular expression to extract each movie block first and then pull the individual fields out of each block. That gives us the data for each movie, but this URL only holds 25 movies. How do we get the next page? Each page contains a link to the next one, so we can grab that link and keep fetching the following pages in a loop.
We can use the developer tools arrow again to click the next-page button, which highlights the corresponding element on the right. A regular expression can extract the next-page link from there, and the rest is just a loop. Good, the analysis is done; time to write some code!
2. Use an object-oriented approach to crawl the data
We use the requests library to send the requests (install it with pip install requests).
The request headers can be copied from the developer tools, as in the sketch below.
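Here is a minimal sketch of the class setup, assuming requests is installed; the User-Agent string is just a placeholder for whatever your browser shows in the developer tools:

```python
import requests

class DoubanSpider:
    def __init__(self):
        # base URL of the Douban Top 250 list
        self.url = 'https://movie.douban.com/top250'
        # headers copied from the browser's developer tools;
        # the User-Agent below is a placeholder example
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/96.0.4664.110 Safari/537.36'
        }

    def get_html(self, url):
        # send the request with our headers so Douban does not reject us
        response = requests.get(url, headers=self.headers)
        response.encoding = 'utf-8'
        return response.text
```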
Next, use regular expressions to get the data
First match each movie block on each page (regular expressions come from Python's built-in re library), as sketched below.
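A sketch of this first pass, assuming the 25 movies sit inside an <ol class="grid_view"> list as the page showed at the time of writing; check the markup in the developer tools and adjust the patterns if needed:

```python
import re

def get_movie_blocks(html):
    # narrow down to the movie list first so stray <li> tags
    # elsewhere on the page are not picked up
    grid = re.findall(r'<ol class="grid_view">(.*?)</ol>', html, re.DOTALL)
    if not grid:
        return []
    # one string per movie
    return re.findall(r'<li>(.*?)</li>', grid[0], re.DOTALL)
```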
Next, get the data for each movie
Note: some of the fields obtained above may be empty, so we still need to check for that. To keep the code tidy, I use ternary expressions for the checks and then save the results into a dictionary, as in the sketch below.
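A sketch of the per-movie extraction; the patterns (title in <span class="title">, rating in <span class="rating_num">, quote in <span class="inq">) are assumptions based on Douban's markup, so verify them against your own screenshots:

```python
import re

def parse_movie(block):
    title = re.findall(r'<span class="title">([^&<]+)</span>', block)
    rating = re.findall(r'<span class="rating_num"[^>]*>(.*?)</span>', block)
    quote = re.findall(r'<span class="inq">(.*?)</span>', block)

    # some fields (e.g. the quote) can be missing, so use ternary
    # expressions to substitute a default before building the dictionary
    return {
        'title': title[0] if title else '',
        'rating': rating[0] if rating else '',
        'quote': quote[0] if quote else '',
    }
```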
The next step is to loop through the remaining pages via the next-page link.
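A sketch of that paging loop, reusing the names from the snippets above. The pattern for the next-page link is an assumption: the button's href is relative (e.g. ?start=25&filter=), so we append it to the base URL until the link disappears on the last page:

```python
import re

def crawl_all(spider):
    movies = []
    url = spider.url
    while True:
        html = spider.get_html(url)
        for block in get_movie_blocks(html):
            movies.append(parse_movie(block))
        # extract the relative link behind the "next page" button
        next_link = re.findall(r'<span class="next">.*?<a href="(.*?)"',
                               html, re.DOTALL)
        if not next_link:
            break  # no "next" link on the last page
        # unescape &amp; and join with the base URL
        url = spider.url + next_link[0].replace('&amp;', '&')
    return movies
```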
3. If you know some database basics, you can also save the results to a database. Here I store the data in MySQL; the code is below. Note that you need to create the database and the table first.
This is the class that handles the database operations (using the pymysql library, installed with pip install pymysql).
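A minimal sketch of that helper, assuming a local MySQL server, a database named douban, and a table named movies created beforehand; all of these names and the credentials are placeholders for your own setup:

```python
import pymysql

# assumed table, created beforehand, e.g.:
#   CREATE TABLE movies (
#       id INT PRIMARY KEY AUTO_INCREMENT,
#       title VARCHAR(100), rating VARCHAR(10), quote VARCHAR(255)
#   );
class MovieDB:
    def __init__(self):
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='your_password',
                                    database='douban', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def save(self, movie):
        # parameterised SQL avoids quoting problems and SQL injection
        sql = 'INSERT INTO movies (title, rating, quote) VALUES (%s, %s, %s)'
        self.cursor.execute(sql, (movie['title'], movie['rating'],
                                  movie['quote']))
        self.conn.commit()

    def close(self):
        self.cursor.close()
        self.conn.close()
```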
Then go back to the crawler and store each movie in the database:
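Putting the sketches together, the glue code could look like this:

```python
if __name__ == '__main__':
    spider = DoubanSpider()
    db = MovieDB()
    # crawl every page and insert each movie as one row
    for movie in crawl_all(spider):
        db.save(movie)
    db.close()
```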
4. After it runs successfully, you will find the crawled data in the database.
Finally, thank you very much for reading this article. If you liked it, you can follow, share, and like it. If you have any questions, please leave a message in the comment section.