Preface

If you often read Python crawler public accounts, you will have noticed that many posts take the form of crawler + data analysis, which is interesting and makes for nice charts. Today I would like to share a crawler and analysis of the movies rated 9 or above on Maoyan Movies ("Cat's Eye Movies"), to see what is worth watching.

Development tools

Python version: 3.6.4
Related modules:

openpyxl module;

requests module;

win_unicode_console module;

bs4 module;

and some modules that come with Python.

Environment setup

Install Python and add it to the PATH environment variable, then use pip to install the required modules.

Main idea

Objective:

Crawl movie information from Maoyan Movies.

The contents to be crawled are:

Movie title, movie score, link to movie introduction page, and movie introduction.

Approach:

First crawl the movie title, movie score and the link to the movie introduction page from the list page, then follow each link to crawl the movie introduction itself.
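As a rough illustration of the first step, here is a minimal sketch of pulling the title, score and introduction link out of a list page with bs4. The CSS selectors (div.movie-item, .title, .score) and the https://maoyan.com base URL are my assumptions for illustration, not Maoyan's actual markup, so adjust them to whatever the real page uses.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://maoyan.com"  # assumed site root


def parse_list_page(html):
    """Extract title, score and introduction-page link for each movie.

    The selectors below are hypothetical placeholders, not the real
    Maoyan page structure.
    """
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.select("div.movie-item"):
        link = item.select_one("a[href]")
        title = item.select_one(".title")
        score = item.select_one(".score")
        if not (link and title and score):
            continue  # skip entries that are missing a field
        movies.append({
            "title": title.get_text(strip=True),
            "score": float(score.get_text(strip=True)),
            "url": urljoin(BASE_URL, link["href"]),
        })
    return movies
```

Each returned dict carries the link that the second step follows to fetch the introduction.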

That works out to 31 requests for 30 movies on a page.

All of that just to crawl this little bit of data, so you can imagine the efficiency T_T
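The request count follows directly from the two-step approach: one request for the list page plus one per movie. A minimal sketch, with the fetching and parsing passed in as functions (crawl_page, fetch, parse_list and parse_intro are hypothetical names, not taken from the source code):

```python
def crawl_page(list_url, fetch, parse_list, parse_intro):
    """Crawl one list page, then the introduction page of every movie on it.

    fetch(url) returns the page HTML; parse_list(html) returns a list of
    dicts that each contain a "url" key; parse_intro(html) returns the
    introduction text. All three are supplied by the caller.
    """
    movies = parse_list(fetch(list_url))  # 1 request for the list page
    for movie in movies:
        # 1 more request per movie, so 30 movies cost 1 + 30 = 31 requests
        movie["intro"] = parse_intro(fetch(movie["url"]))
    return movies
```

Passing the functions in keeps the loop itself independent of requests and bs4, which also makes it easy to count requests with a fake fetch function.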

Finally, the crawled data is stored in Excel.
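For the Excel step, something along these lines should work with openpyxl. This is only a sketch: the column headers, the sheet title and the default file name are my own choices, and the dict keys match the hypothetical movie dicts used for illustration rather than the actual source code.

```python
def movies_to_rows(movies):
    """Turn movie dicts into rows, header first."""
    rows = [["Title", "Score", "Link", "Introduction"]]
    for m in movies:
        rows.append([m["title"], m["score"], m["url"], m.get("intro", "")])
    return rows


def save_to_excel(movies, path="movies.xlsx"):
    """Write the rows to an .xlsx file with openpyxl."""
    # Imported here so movies_to_rows stays usable without openpyxl installed.
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.title = "Maoyan"
    for row in movies_to_rows(movies):
        ws.append(row)  # append() writes one row after the last used row
    wb.save(path)
```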

For details, see the source code in the relevant files on the home page.

Additional notes

(1) How to obtain the cookies for the cookie.py file

Get the cookie value as shown above and fill it in at the relevant location in the cookie.py file:

The first list holds cookies that do not contain login information;

The second list, login_Cookies, holds cookies that contain login information.

Cookies with login information are obtained the same way as cookies without it, except that you need to log in on the web page first (the login option is in the upper right corner of the page).
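To make the two lists concrete, here is a minimal sketch of how they might be used when building request headers. The name Cookies for the first list, the placeholder strings, the User-Agent value and the build_headers helper are all assumptions for illustration; only login_Cookies is a name taken from the description above.

```python
import random

# Stand-ins for the two lists in cookie.py. Paste real cookie strings
# copied from the browser in place of these placeholders.
Cookies = ["<cookie string without login information>"]
login_Cookies = ["<cookie string with login information>"]


def build_headers(logged_in=False):
    """Pick a random cookie from the matching list and build request headers."""
    pool = login_Cookies if logged_in else Cookies
    return {
        "User-Agent": "Mozilla/5.0",  # assumed value; copy your browser's
        "Cookie": random.choice(pool),
    }
```

The resulting dict can be passed as the headers argument of requests.get.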

(2) Anti-crawler measures on Maoyan

The anti-crawler mechanism of Maoyan Movies is very good. T_T

The first restriction is that users who are not logged in can only view the first 100 pages of movies. The second is that if your requests are too frequent, you will be blocked:

After testing, the conclusions are as follows:

Setting random intervals between requests does not prevent the IP address from being blocked.

Changing the cookie value does not prevent the IP address from being blocked either.

After crawling about 20 pages of movie data, the crawler is done for (GG).

I have not carefully studied how Maoyan Movies recognizes crawlers.

But the recognition mechanism does seem to have some intelligence to it.

For example, after I was blocked, I tried to solve the problem by changing my IP address, but I found that the amount of data I could obtain after the change was much less than before.

Therefore, the source code I provide plays it straight: it does not try to outwit Maoyan's maintainers and just adds a random time interval between requests. With it, around 25 pages of data can be crawled.
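The random time interval can be as simple as the following sketch. The 1 to 3 second range and the polite_sleep name are my assumptions; the interval used in the actual source code may differ.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random duration between two requests.

    The sleep function is injectable so the helper can be exercised
    without actually waiting. Returns the chosen delay in seconds.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Call polite_sleep() after each request in the crawl loop. As noted above, this only slows the crawler down; it does not stop the IP address from eventually being blocked.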

Anti-crawler solution

So as not to disappoint crawler enthusiasts, the following is worth mentioning.

There are many ways around anti-crawler mechanisms on the Internet, such as proxy pools and so on.

Here I only describe the one I found while crawling Maoyan movie information.

It is simple: once you are blocked, the site just wants you to enter a captcha to prove that you are human.

A simple test shows that:

You only need to POST the recognized captcha and some other parameters to:

Maoyan.com/films?__oce…

Posting to this link lifts the IP block ~~~ very simple.

I provide a test version (test.py) in the related files.

Usage demonstration

Run the my_spider.py file in a CMD window.

As shown below (crawling 20 pages of data):

The movies that scored 9 or more were:

Dangal

Wolf Warrior 2

Bajrangi Bhaijaan

Amazing China

Coco

Farewell My Concubine

Fast and Furious 7

Operation Red Sea

Secret Superstar

Zootopia

Frozen

Titanic

The Invisible Guest

Titanic 3D

Fast and Furious 8

Wonder

City of Rock

Goodbye Mr. Loser

Wolf Warrior

Operation Mekong

Batman: The Dark Knight

Hail the Judge

Ready Player One

Detective Chinatown 2

The Ex-File 3: The Return of the Exes

Paddington 2

Ferdinand

Schindler’s List

A one-man class

The Taking of Tiger Mountain

Your Name.

Hero

22

Captain America 2

Sword Art Online: Ordinal Scale

Guardians of the Galaxy

Transformers 4: Age of Extinction

Our Times

Happy together

Our Shining Days

Boonie Bears: The Big Shrink

Never Say Die

Youth

Monster Hunt

During the great cause

Jurassic World

Boonie Bears: Entangled Worlds

To search for these

San Andreas

Mr. Six

The bear is back

Peter Rabbit

Love cyclotron

Love at the South Pole

Chasing the Dragon

Guardians of the Galaxy 2

Go Away Mr. Tumor

The little mermaid

Detective Chinatown

Avatar

Shock Wave

Warcraft

Iron Man 3

Smurfs: The Lost Village

Loving Vincent

Love before memories fade

Mission: Impossible – Rogue Nation

Pride and Prejudice
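A list like the one above can be produced from the crawled records with a simple filter. This is a sketch under the assumption that each record is a dict with a numeric "score" key; high_rated is a hypothetical helper name, not part of the source code.

```python
def high_rated(movies, threshold=9.0):
    """Return the movies scoring at least `threshold`, highest score first."""
    picked = [m for m in movies if m["score"] >= threshold]
    # sorted() is stable, so movies with equal scores keep their crawl order
    return sorted(picked, key=lambda m: m["score"], reverse=True)
```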

That's all~~~

I'm off to the movies.

If Maoyan Movies undergoes a major update in the future, the source code should be treated as reference only.