Preface
If you follow public accounts about Python crawlers, you will have noticed that many posts pair a crawler with some data analysis, which makes for an interesting read with nice charts. Today I would like to share a crawler and a short analysis of movies rated 9 or above on Maoyan Movies (literally "Cat's Eye Movies"), to see what is worth watching.
Development tools
Python version: 3.6.4
Related modules:
openpyxl module;
requests module;
win_unicode_console module;
bs4 module;
and some modules that come with Python.
Environment setup
Install Python and add it to the environment variables, then use pip to install the required modules.
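Before running the crawler, it can help to confirm that the modules listed above are actually importable. The following is only a minimal sketch: the helper name missing_modules is made up for illustration, and note that the bs4 import is provided by the beautifulsoup4 package on PyPI.

```python
import importlib.util

# Import names taken from the module list above.
REQUIRED = ["openpyxl", "requests", "win_unicode_console", "bs4"]


def missing_modules(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Anything reported as missing can then be installed with pip.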
Main idea
Objective:
Scrape movie information from Maoyan Movies.
The content to be scraped:
Movie title, movie score, link to the movie's introduction page, and the movie introduction.
Approach:
First scrape the movie title, movie score and introduction page link from the list page, then follow each introduction page link to scrape the movie introduction.
That works out to 31 requests for a page of 30 movies: one for the list page plus one for each introduction page.
All that just to crawl this little bit of data, so you can imagine the efficiency T_T
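The two-stage approach described above can be sketched roughly as follows. This is not the code from my_spider.py: the function crawl_list_page and its fetch, parse_list and parse_intro parameters are hypothetical names, and the demo uses stand-in functions instead of real network requests so that the request count is easy to see.

```python
def crawl_list_page(list_url, fetch, parse_list, parse_intro):
    """Crawl one list page, then the introduction page of every movie on it.

    fetch(url)        -> HTML text of the page (hypothetical callable)
    parse_list(html)  -> list of (title, score, intro_url) tuples
    parse_intro(html) -> introduction text
    """
    movies = []
    list_html = fetch(list_url)              # 1 request for the list page
    for title, score, intro_url in parse_list(list_html):
        intro_html = fetch(intro_url)        # 1 request per movie
        movies.append({
            "title": title,
            "score": score,
            "link": intro_url,
            "intro": parse_intro(intro_html),
        })
    return movies


# Offline demonstration with stand-in functions (no network access).
request_log = []


def fake_fetch(url):
    request_log.append(url)
    return url  # in this demo the "HTML" is just the URL itself


def fake_parse_list(html):
    return [("Movie %d" % i, "9.0", "intro/%d" % i) for i in range(30)]


def fake_parse_intro(html):
    return "Introduction for " + html


movies = crawl_list_page("list/1", fake_fetch, fake_parse_list, fake_parse_intro)
print(len(movies), "movies,", len(request_log), "requests")
```

In a real crawler, fetch would wrap requests.get and the two parse functions would use bs4 on the returned HTML.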
Finally, the crawled data is stored in Excel.
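Writing the results to Excel with openpyxl could look something like the sketch below. The column headers, the function names build_rows and save_to_excel, and the dictionary keys are assumptions for illustration; the actual source code may lay out the sheet differently.

```python
HEADER = ["Title", "Score", "Link", "Introduction"]  # assumed column layout


def build_rows(movies):
    """Turn a list of movie dicts into rows, starting with a header row."""
    rows = [HEADER]
    for movie in movies:
        rows.append([movie["title"], movie["score"], movie["link"], movie["intro"]])
    return rows


def save_to_excel(rows, path):
    """Write the rows to an .xlsx file, one list per worksheet row."""
    # Imported here so that build_rows can be used without openpyxl installed.
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    sheet.title = "movies"
    for row in rows:
        sheet.append(row)
    workbook.save(path)
```

For example, save_to_excel(build_rows(movies), "movies.xlsx") would write one movie per row below the header.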
For details, see the source code in the relevant files on the home page.
Additional notes
(1) How to obtain the cookies for the cookie.py file
Get the cookie value from the browser and fill it in at the relevant location in the cookie.py file:
The first list holds cookies that do not contain login information;
The second list, login_Cookies, holds cookies that contain login information.
Cookies containing login information are obtained in the same way as those without, except that you need to log in on the web page beforehand (the login option is in the upper right corner of the page).
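As a rough illustration of how the two cookie lists might be used, here is a small sketch. Only the name login_Cookies comes from the description above; the name Cookies for the first list, the placeholder strings, the User-Agent value and the helper build_headers are all assumptions.

```python
import random

# Assumed layout of cookie.py; replace the placeholders with real cookie strings.
Cookies = ["<cookie string copied without logging in>"]
login_Cookies = ["<cookie string copied after logging in>"]


def build_headers(logged_in=False):
    """Build request headers using a cookie picked at random from one list."""
    pool = login_Cookies if logged_in else Cookies
    return {
        "User-Agent": "Mozilla/5.0",  # placeholder browser identifier
        "Cookie": random.choice(pool),
    }
```

The resulting dictionary can be passed to requests.get(url, headers=build_headers()).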
(2) Maoyan's anti-crawler measures
Maoyan Movies has a very good anti-crawler mechanism. T_T
First, users who are not logged in can only view the first 100 pages of movies. Second, if your requests are too frequent, your IP will be blocked:
After some testing, the conclusions are as follows:
Setting random intervals between requests does not prevent the IP from being blocked.
Changing the cookie value does not prevent the IP from being blocked either.
After crawling about 20 pages of movie data, the crawler gets blocked (GG).
I have not studied carefully how Maoyan identifies crawlers.
But the recognition mechanism does seem to have some intelligence to it.
For example, after I was blocked, I tried to get around it by changing my IP address, but found that the amount of data I could obtain with the new IP was much smaller than before.
Therefore, the source code I provide plays fair: it does not try to get into a battle of wits with Maoyan's maintainers and simply adds a random time interval between requests. The amount of data it can crawl is around 25 pages.
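Adding a random time interval between requests takes only a few lines. The sketch below is one way to do it; the function name polite_sleep and the 1 to 3 second range are assumptions, not values taken from the source code.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random duration and return the number of seconds waited.

    The sleep parameter exists only so the pause can be replaced in tests.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling polite_sleep() after every request keeps the request rate irregular, although, as noted above, this alone does not stop the IP from being blocked.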
Crawler solution
So as not to disappoint crawler enthusiasts, one thing is worth mentioning.
There are many ways around anti-crawler mechanisms described on the Internet, such as proxy pools.
Here I only put forward a solution to the anti-crawler mechanism encountered while crawling Maoyan movie information.
The solution is simple. Maoyan just wants us to enter a captcha to prove that we are human.
A simple test shows that:
Just POST the recognized captcha and a few other parameters to:
Maoyan.com/films?__oce…
Posting to this link lifts the IP block ~~~ very simple.
I provide a test version (test.py) in the related files.
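The general shape of such a request might look like the sketch below. This is not the content of test.py: the form field name "captcha", the helper names, and the idea of passing the other parameters as a dictionary are all assumptions, and the verification URL is left for the caller to supply because the real one is not reproduced here.

```python
def build_unblock_payload(captcha_text, extra_params=None):
    """Combine the recognized captcha with any other required form fields."""
    payload = dict(extra_params or {})
    payload["captcha"] = captcha_text.strip()  # hypothetical field name
    return payload


def submit_captcha(post, verify_url, captcha_text, extra_params=None):
    """Send the captcha form.

    post is any callable with the shape post(url, data=...), for example
    requests.post; verify_url must be supplied by the caller.
    """
    return post(verify_url, data=build_unblock_payload(captcha_text, extra_params))
```

Passing requests.post as the post argument would send the form for real; a stand-in function can be passed instead to try it out offline.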
Usage demonstration
Run the my_spider.py file in a CMD window.
The result looks like this (crawling 20 pages of data):
The movies that scored 9 or more were:
Dangal
Wolf Warrior 2
Bajrangi Bhaijaan
Amazing China
Coco
Farewell My Concubine
Fast and Furious 7
Operation Red Sea
Secret Superstar
Zootopia
Frozen
Titanic
The Invisible Guest
Titanic 3D
Fast and Furious 8
Wonder
City of Rock
Goodbye Mr. Loser
Wolf Warrior
Operation Mekong
Batman: The Dark Knight
Hail the Judge
Ready Player One
Detective Chinatown 2
The Ex-File 3: The Return of the Exes
Paddington 2
Ferdinand
Schindler's List
A One-Man Class
The Taking of Tiger Mountain
Your Name.
Hero
Twenty Two
Captain America 2
Sword Art Online: Ordinal Scale
Guardians of the Galaxy
Transformers 4: Age of Extinction
Our Times
Happy Together
Our Shining Days
Boonie Bears: The Big Shrink
Never Say Die
Youth
Monster Hunt
The Founding of an Army
Jurassic World
Boonie Bears: Entangled Worlds
To search for these
The end of the collapse
Mr. Six
The bear is back
Peter Rabbit
Love cyclotron
Love at the South Pole
Chasing the Dragon
Guardians of the Galaxy 2
Go Away Mr. Tumor
The little mermaid
Detective Chinatown
Avatar
Shock Wave
Warcraft
Iron Man 3
Smurfs: The Lost Village
Loving Vincent
Love before memories fade
Mission: Impossible – Rogue Nation
Pride and Prejudice
That's all~~~
I'm off to watch a movie.
If Maoyan Movies makes a major update in the future, the source code should be treated as a reference only.