In this section, we use the requests library and regular expressions to crawl the Maoyan (Cat's Eye) TOP100 movie list. requests is more convenient to use than urllib, and since we have not yet covered a dedicated HTML parsing library, we will use regular expressions as the parsing tool.

  1. Goal of this section

In this section, we will extract information about the Maoyan TOP100 movies, such as their names, release dates, scores, and poster images. The URL of the target site is http://maoyan.com/board/4, and the extracted results will be saved to a file.

  2. Preparation

Before this section begins, make sure you have installed the Requests library properly. If not, refer to the installation instructions in Chapter 1.

  3. Crawl analysis

The target site we need to crawl is http://maoyan.com/board/4. Opening it, you can see the ranking list, as shown in Figure 3-11.

The No. 1 movie is Farewell My Concubine. The useful information displayed on the page includes the movie title, stars, release date, release region, rating, and poster image.

Scroll down to the bottom of the page to find a paginated list, and click on page 2 to see how the URL and content of the page change, as shown in Figure 3-12.

You can see that the URL of the page changes to http://maoyan.com/board/4?offset=10. Compared with the previous URL, it has an extra parameter, offset=10, and the results shown are the movies ranked 11–20, so we can preliminarily infer that this is an offset parameter. Clicking the next page, the URL changes to http://maoyan.com/board/4?offset=20, the offset parameter becomes 20, and the results shown are the movies ranked 21–30.

offset represents an offset value: if the offset is n, the movies displayed are those ranked n+1 to n+10, with 10 movies per page. Therefore, to get the TOP100 movies we only need to make 10 separate requests, with the offset parameter set to 0, 10, 20, ..., 90. After obtaining each page, we can extract the relevant information with a regular expression, and thus obtain all the information for the TOP100 movies.
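Before writing the crawler itself, here is a quick illustrative sketch of the 10 page URLs implied by this offset rule (for clarity only; the actual crawling code comes later in this section):

# Enumerate the 10 page URLs implied by the offset rule:
# offsets 0, 10, 20, ..., 90 cover ranks 1-100.
base = 'http://maoyan.com/board/4?offset='
urls = [base + str(i * 10) for i in range(10)]
for url in urls:
    print(url)
# http://maoyan.com/board/4?offset=0
# http://maoyan.com/board/4?offset=10
# ...
# http://maoyan.com/board/4?offset=90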

  4. Crawling the first page

Now let's implement this process in code. First, we crawl the first page. We implement a get_one_page() method and pass it a url parameter; it returns the content of the fetched page, and we call it from the main() method. The preliminary code is as follows:

import requests
 
def get_one_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    return None
 
def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    print(html)
 
main()

After running this, we can successfully obtain the source code of the first page. Once we have the source code, we need to parse the page and extract the information we want.

  5. Extracting with regular expressions

Next, go back to the web page and look at the page's real source code. View it in the Network panel of the browser's developer tools, as shown in Figure 3-13.

Note that you should not look at the source code directly in the Elements tab, because the code there may have been modified by JavaScript and differ from the original request. Instead, look at the source of the original request in the Network tab.

View the source code for one of the entries, as shown in Figure 3-14.

It can be seen that the source code corresponding to one movie's information is a dd node, and we use regular expressions to extract some of the movie information here. First, we extract the ranking, which is inside the i node whose class is board-index. Using non-greedy matching to extract the text inside the i node, the regular expression is written as follows:

<dd>.*?board-index.*?>(.*?)</i>
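To make the non-greedy matching concrete, here is a tiny standalone check against a simplified, hypothetical fragment that only imitates the page structure (it is not the real Maoyan HTML):

import re

# Hypothetical fragment imitating one entry's ranking markup
html = '<dd><i class="board-index board-index-1">1</i>...</dd>'
pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>', re.S)
print(re.search(pattern, html).group(1))  # -> 1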

Next, we need to extract the movie's poster image. As you can see, there is an a node after it that contains two img nodes. On inspection, the data-src attribute of the second img node holds the link to the image, so we extract the data-src attribute of the second img node. The regular expression is rewritten as follows:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)"

After that, we need to extract the movie's name, which is in the following p node whose class is name. We can therefore use name as a marker and further extract the text inside its a node. The regular expression is rewritten as follows:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>

The same principle applies to extracting the stars, release date, rating, and so on. Finally, the regular expression becomes:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>

Such a regular expression can match one movie's entry, capturing seven pieces of information. Next, we extract all of the content by calling the findall() method.
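As a quick sanity check before wiring this into the crawler, here is how findall() behaves on a simplified, hypothetical dd fragment; the markup below only imitates the page structure, and the field values are made up:

import re

# Hypothetical fragment imitating one <dd> entry on the ranking page
html = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1"><img src="placeholder.png"><img data-src="http://example.com/poster.jpg" alt=""></a>
  <p class="name"><a href="/films/1">Sample Movie</a></p>
  <p class="star">Starring: Actor A, Actor B</p>
  <p class="releasetime">Release time: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
'''

pattern = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a'
    '.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
print(re.findall(pattern, html))
# [('1', 'http://example.com/poster.jpg', 'Sample Movie',
#   'Starring: Actor A, Actor B', 'Release time: 1993-01-01', '9.', '6')]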

Next, we define a parse_one_page() method to parse the page. It uses the regular expression above to extract the content we want from the result:

import re

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a'
        '.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    print(items)

This successfully extracts the information for all 10 movies on a page, in the form of a list. The output is as follows:

[('1', 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'Farewell My Concubine', '\n Starring: Leslie Cheung, Zhang Fengyi, Gong Li\n ', 'Release date: 1993-01-01(Hong Kong, China)', '9.', '6'), ('2', 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'The Shawshank Redemption', '\n Starring: Tim Robbins, Morgan Freeman, Bob Gunton\n ', 'Release date: 1994-10-14(USA)', '9.', '5'), ('3', 'http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c', 'Léon: The Professional', '\n Starring: Jean Reno, Gary Oldman, Natalie Portman\n ', 'Release date: 1994-09-14(France)', '9.', '5'), ('4', 'http://p0.meituan.net/movie/23/6009725.jpg@160w_220h_1e_1c', 'Roman Holiday', '\n Starring: Gregory Peck, Audrey Hepburn, Eddie Albert\n ', 'Release date: 1953-09-02(USA)', '9.', '1'), ('5', 'http://p0.meituan.net/movie/53/1541925.jpg@160w_220h_1e_1c', 'Forrest Gump', '\n Starring: Tom Hanks, Robin Wright, Gary Sinise\n ', 'Release date: 1994-07-06(USA)', '9.', '4'), ('6', 'http://p0.meituan.net/movie/11/324629.jpg@160w_220h_1e_1c', 'Titanic', '\n Starring: Leonardo DiCaprio, Kate Winslet, Billy Zane\n ', 'Release date: 1998-04-03', '9.', '5'), ('7', 'http://p0.meituan.net/movie/99/678407.jpg@160w_220h_1e_1c', 'My Neighbor Totoro', '\n Starring: Noriko Hidaka, Chika Sakamoto, Shigesato Itoi\n ', 'Release date: 1988-04-16(Japan)', '9.', '2'), ('8', 'http://p0.meituan.net/movie/92/8212889.jpg@160w_220h_1e_1c', 'The Godfather', '\n Starring: Marlon Brando, Al Pacino, James Caan\n ', 'Release date: 1972-03-24(USA)', '9.', '3'), ('9', 'http://p0.meituan.net/movie/62/109878.jpg@160w_220h_1e_1c', 'Flirting Scholar', '\n Starring: Stephen Chow, Gong Li, Cheng Pei-pei\n ', 'Release date: 1993-07-01(Hong Kong, China)', '9.', '2'), ('10', 'http://p0.meituan.net/movie/9bf7d7b81001a9cf8adbac5a7cf7d766132425.jpg@160w_220h_1e_1c', 'Spirited Away', '\n Starring: Rumi Hiiragi, Miyu Irino, Mari Natsuki\n ', 'Release date: 2001-07-20(Japan)', '9.', '3')]

But this is not enough: the data is still messy. Next we process the matched results, iterating over them and generating a dictionary for each movie. The method is rewritten as follows:

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a'
        '.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }
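The [3:] and [5:] slices strip the label prefixes from the raw text: on the original Chinese page these fields are prefixed with 主演： (3 characters) and 上映时间： (5 characters), which is what those slice lengths correspond to. A quick illustration with made-up strings:

print('主演：张国荣,张丰毅,巩俐'[3:])          # 张国荣,张丰毅,巩俐
print('上映时间：1993-01-01(中国香港)'[5:])    # 1993-01-01(中国香港)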

This successfully extracts the movie's ranking, image, title, actors, release date, and rating, assigning them to dictionaries to form structured data. The output is as follows:

{'image': 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'actor': 'Leslie Cheung, Zhang Fengyi, Gong Li', 'score': '9.6', 'index': '1', 'title': 'Farewell My Concubine', 'time': '1993-01-01(Hong Kong, China)'}
{'image': 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'actor': 'Tim Robbins, Morgan Freeman, Bob Gunton', 'score': '9.5', 'index': '2', 'title': 'The Shawshank Redemption', 'time': '1994-10-14(USA)'}
{'image': 'http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c', 'actor': 'Jean Reno, Gary Oldman, Natalie Portman', 'score': '9.5', 'index': '3', 'title': 'Léon: The Professional', 'time': '1994-09-14(France)'}
{'image': 'http://p0.meituan.net/movie/23/6009725.jpg@160w_220h_1e_1c', 'actor': 'Gregory Peck, Audrey Hepburn, Eddie Albert', 'score': '9.1', 'index': '4', 'title': 'Roman Holiday', 'time': '1953-09-02(USA)'}
{'image': 'http://p0.meituan.net/movie/53/1541925.jpg@160w_220h_1e_1c', 'actor': 'Tom Hanks, Robin Wright, Gary Sinise', 'score': '9.4', 'index': '5', 'title': 'Forrest Gump', 'time': '1994-07-06(USA)'}
{'image': 'http://p0.meituan.net/movie/11/324629.jpg@160w_220h_1e_1c', 'actor': 'Leonardo DiCaprio, Kate Winslet, Billy Zane', 'score': '9.5', 'index': '6', 'title': 'Titanic', 'time': '1998-04-03'}
{'image': 'http://p0.meituan.net/movie/99/678407.jpg@160w_220h_1e_1c', 'actor': 'Noriko Hidaka, Chika Sakamoto, Shigesato Itoi', 'score': '9.2', 'index': '7', 'title': 'My Neighbor Totoro', 'time': '1988-04-16(Japan)'}
{'image': 'http://p0.meituan.net/movie/92/8212889.jpg@160w_220h_1e_1c', 'actor': 'Marlon Brando, Al Pacino, James Caan', 'score': '9.3', 'index': '8', 'title': 'The Godfather', 'time': '1972-03-24(USA)'}
{'image': 'http://p0.meituan.net/movie/62/109878.jpg@160w_220h_1e_1c', 'actor': 'Stephen Chow, Gong Li, Cheng Pei-pei', 'score': '9.2', 'index': '9', 'title': 'Flirting Scholar', 'time': '1993-07-01(Hong Kong, China)'}
{'image': 'http://p0.meituan.net/movie/9bf7d7b81001a9cf8adbac5a7cf7d766132425.jpg@160w_220h_1e_1c', 'actor': 'Rumi Hiiragi, Miyu Irino, Mari Natsuki', 'score': '9.3', 'index': '10', 'title': 'Spirited Away', 'time': '2001-07-20(Japan)'}

At this point, we have successfully extracted a single page of movie information.

  6. Writing to a file

Next, we write the extracted results to a file, in this case directly to a text file. Here the dictionary is serialized with the json library's dumps() method, and the ensure_ascii parameter is set to False to ensure that the output is written as Chinese characters rather than Unicode escape sequences. The code is as follows:

import json

def write_to_json(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        print(type(json.dumps(content)))
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

The dictionary is written to the text file by calling the write_to_json() method, where the content argument is the extracted result for one movie, in the form of a dictionary.
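To see what ensure_ascii=False changes, here is a small standalone illustration (the dictionary below is just an example):

import json

data = {'title': '霸王别姬'}
print(json.dumps(data))                      # {"title": "\u9738\u738b\u522b\u59ec"}
print(json.dumps(data, ensure_ascii=False))  # {"title": "霸王别姬"}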

  7. Integrating the code

Finally, we implement the main() method to call the methods defined above and write the single-page movie results to the file. The relevant code is as follows:

def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    for item in parse_one_page(html):
        write_to_json(item)

At this point, we have completed the extraction of a single page: the 10 movies on the first page can be successfully extracted and saved to a text file.

  8. Paginated crawling

Since we need to crawl the TOP100 movies, we also need to iterate, passing different offset parameters in the URL to crawl the remaining 90 movies. Add the following call:

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)

We also need to modify the main() method to receive an offset value as a parameter and use it to construct the URL to crawl. The implementation is as follows:

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

So far, our Maoyan TOP100 movie crawler is complete. After a little tidying up, the complete code is as follows:

import json
import requests
from requests.exceptions import RequestException
import re
import time
 
def get_one_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
 
def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + '.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + '.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }
 
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')
 
def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)
 
if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)

Maoyan now has a number of anti-crawling measures in place; if requests are made too quickly, the site stops responding, so a delay is added here.
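If the site still refuses to respond, one optional tweak beyond the time.sleep(1) delay is to send a browser-like User-Agent header with each request. This is only a sketch, not part of the original code, and the header value below is just an example:

import requests
from requests.exceptions import RequestException

# Example browser-like header; the exact value is an assumption
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

def get_one_page(url):
    try:
        # Pass the headers along with the request
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None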

  9. Running results

Finally, we run the code and the output looks something like this:

{'index': '1', 'image': 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'title': 'Farewell My Concubine', 'actor': 'Leslie Cheung, Zhang Fengyi, Gong Li', 'time': '1993-01-01(Hong Kong, China)', 'score': '9.6'}
{'index': '2', 'image': 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'title': 'The Shawshank Redemption', 'actor': 'Tim Robbins, Morgan Freeman, Bob Gunton', 'time': '1994-10-14(USA)', 'score': '9.5'}
...
{'index': '98', 'image': 'http://p0.meituan.net/movie/76/7073389.jpg@160w_220h_1e_1c', 'title': 'Tokyo Story', 'actor': 'Chishu Ryu, Setsuko Hara, Haruko Sugimura', 'time': '1953-11-03(Japan)', 'score': '9.1'}
{'index': '99', 'image': 'http://p0.meituan.net/movie/52/3420293.jpg@160w_220h_1e_1c', 'title': 'I Love You', 'actor': 'Song Jae-ho, Lee Chae-eun, Gil Hae-yeon', 'time': '2011-02-17(South Korea)', 'score': '9.0'}
{'index': '100', 'image': 'http://p1.meituan.net/movie/443351388470779.jpg@160w_220h_1e_1c', 'title': 'Migratory Birds', 'actor': 'Jacques Perrin, Philippe Poirot, Philippe Labro', 'time': '2001-12-12(France)', 'score': '9.1'}

The intermediate output is omitted here. As you can see, the TOP100 movies have been successfully crawled.

Now look at the text file, and the result is shown in Figure 3-15.

As you can see, the movie information has also been saved to the text file, and you are done!
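As a final, optional check (not part of the original walkthrough), here is a small sketch for loading the saved results back into Python, assuming result.txt contains one JSON object per line as written by write_to_file():

import json

movies = []
with open('result.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            movies.append(json.loads(line))

print(len(movies))         # should be 100 if the whole crawl completed
print(movies[0]['title'])  # e.g. the No. 1 movie's title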