Python Crawler Practice – Crawling Movie Rankings

Python crawler practice: use Requests and regular expressions to crawl the Maoyan TOP100 movie chart.

Target

We intend to extract the names, screening times, ratings, poster images and other information of the Maoyan TOP100 movies, and save the extracted results to a file.

Preparation

  • System environment: macOS High Sierra 10.13.6
  • Development Language: Python 3.7.2 (default)
  • Third-party libraries: Requests

Analysis

The target site: https://maoyan.com/board/4

After opening the page, we can see that the useful information displayed on it is the movie name, stars, screening time, screening region, rating, and poster image.

Then, let’s take a look at the HTML source code for the top item, from which we’ll start designing regular expressions:

<dd>
        <i class="board-index board-index-1">1</i>
        <a href="/films/1203" title="Farewell my Concubine" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
                <img src="//s0.meituan.net/bs/?f=myfe/mywww:/image/loading_2.e3d934bf.png" alt=""
                        class="poster-default" />
                <img data-src="https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c"
                        alt="Farewell my Concubine" class="board-img" />
        </a>
        <div class="board-item-main">
                <div class="board-item-content">
                        <div class="movie-item-info">
                                <p class="name"><a href="/films/1203" title="Farewell my Concubine" data-act="boarditem-click"
                                                data-val="{movieId:1203}">Farewell my concubine</a></p>
                                <p class="star">Starring: Leslie Cheung, Feng Yi Zhang, Gong Li</p>
                                <p class="releasetime">Screening time: 1993-01-01</p>
                        </div>
                        <div class="movie-item-number score-num">
                                <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
                        </div>

                </div>
        </div>

</dd>

Take a look at page 2: the URL changes from https://maoyan.com/board/4 to https://maoyan.com/board/4?offset=10. Yes, just one extra ?offset=10. On the next page it becomes ?offset=20. As it turns out, offset grows by 10 per page (a page shows exactly 10 entries, which makes sense). In fact, we can even try setting the value to 0 (the home page), or to any value in the range [0, 100).
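As a minimal sketch, the ten page URLs can be generated from this offset rule like so:

base = 'https://maoyan.com/board/4'
for offset in range(0, 100, 10):
    print(base + '?offset=' + str(offset))
# https://maoyan.com/board/4?offset=0
# https://maoyan.com/board/4?offset=10
# ... up to ?offset=90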

Design

We are now designing a crawler that can accomplish this goal.

The crawler should have these parts:

  • Grab the page (also paying attention to page turning): we can use requests.get(), and we'd better fake the headers.
  • Regular-expression extraction (matching the movie name, stars, screening time, screening region, rating, and picture): use re.findall() with an appropriate regular expression to pull out the information.
  • Write to a file (saving the information in JSON format): this involves json.dumps(). See the minimal sketch after this list.
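Before writing the full program, here is a minimal sketch of how the three building blocks fit together (the User-Agent below is a simplified stand-in for a full browser string, and the throwaway pattern grabs only the rank numbers):

import re
import json

import requests

headers = {'User-Agent': 'Mozilla/5.0'}   # simplified faked header
html = requests.get('https://maoyan.com/board/4', headers=headers).text
ranks = re.findall(r'board-index.*?>(.*?)</i>', html, re.S)   # ranks only
print(json.dumps(ranks, ensure_ascii=False))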

Now, we need to design a particularly critical regular expression:

'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>'
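To verify the pattern before wiring it into the crawler, we can run it against a condensed version of the <dd> snippet above (a quick sanity check; the image URL is shortened here for readability):

import re

html = '''<dd>
    <i class="board-index board-index-1">1</i>
    <img data-src="https://p1.meituan.net/movie/20803f.jpg@160w_220h_1e_1c" alt="Farewell my Concubine" class="board-img" />
    <p class="name"><a href="/films/1203" title="Farewell my Concubine">Farewell my concubine</a></p>
    <p class="star">Starring: Leslie Cheung, Feng Yi Zhang, Gong Li</p>
    <p class="releasetime">Screening time: 1993-01-01</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

pattern = r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>'
print(re.findall(pattern, html, re.S))
# [('1', 'https://p1.meituan.net/movie/20803f.jpg', 'Farewell my Concubine',
#   'Leslie Cheung, Feng Yi Zhang, Gong Li', '1993-01-01', '9.', '6')]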

Implementation

Let’s now implement the first version of the program as designed:

import re
import json
import time

import requests

url = 'https://maoyan.com/board/4'
filename = './movies.txt'
pattern = r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
    'Accept-Language': 'zh-cn'
}


def get_page(url):      # grab the page and return the HTML string
    print('\tGetting...')
    try:
        response = requests.get(url, headers=headers)
        return response.text
    except Exception as e:
        print('[Error]', e)
        return ''


def extract(html):      # regex extraction, returns a list of result dicts
    print('\tExtracting...')
    raws = re.findall(pattern, html, re.S)
    # raws: [(rank, image URL, name, stars, screening time,
    #         integer part of score, decimal part of score), ...]
    result = []
    for raw in raws:
        dc = {                          # adjust the field order here
            'index': raw[0],
            'title': raw[2],
            'stars': raw[3],
            'otime': raw[4],
            'score': raw[5] + raw[6],   # merge integer and decimal parts
            'image': raw[1]
        }
        result.append(dc)
    return result


def save(data):         # write to file
    print('\tSaving...')
    with open(filename, 'a', encoding='utf-8') as f:
        for i in data:
            f.write(json.dumps(i, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    for i in range(0, 100, 10):     # page turning
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        save(data)
        time.sleep(0.5)     # avoid sending requests too densely and getting blocked
    print('[100%] All Finished.\n Results in', filename)
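When run, the loop prints its progress page by page; assuming every request succeeds, the console output looks roughly like this (reconstructed from the print() calls above):

[0%](https://maoyan.com/board/4?offset=0)
        Getting...
        Extracting...
        Saving...
[10%](https://maoyan.com/board/4?offset=10)
        Getting...
        ...
[100%] All Finished.
 Results in ./movies.txt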

Debugging

Run the program, and if all goes well, we get the result:

{"index": "1"."title": "Farewell my Concubine"."stars": "Leslie Cheung, Feng Yi Zhang, Li Gong"."otime": "1993-01-01"."score": "9.6"."image": "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg"}
{"index": "2"."title": "The Shawshank Redemption"."stars": "Tim Robbins, Morgan Freeman, Bob Gunton."."otime": "The 1994-10-14 (us)"."score": "9.5"."image": "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg"}
{"index": "3"."title": Roman Holiday."stars": "Gregory Peck, Audrey Hepburn, Eddie Albert."."otime": "The 1953-09-02 (us)"."score": "9.1"."image": "https://p0.meituan.net/movie/54617769d96807e4d81804284ffe2a27239007.jpg"}
{"index": "4"."title": "The killer is not too cold."."stars": "Jean Reno, Gary Oldman, Natalie Portman."."otime": "The 1994-09-14 (France)"."score": "9.5"."image": "https://p0.meituan.net/movie/e55ec5d18ccc83ba7db68caae54f165f95924.jpg"}
{"index": "5"."title": "Titanic"."stars": "Leonardo dicaprio, Kate Winslet, Billy Zane."."otime": "1998-04-03"."score": "9.6"."image": "https://p1.meituan.net/movie/0699ac97c82cf01638aa5023562d6134351277.jpg"}
Copy the code

At first glance everything looks fine: we have all the results we need. On closer inspection, however, there are still a couple of problems:

  1. “Stars” usually lists several people, so it would be better to store them in a list or tuple.
  2. Each record is written as a separate, unconnected line, which makes the data awkward to pass around and access as a whole.

To solve the first problem, we can add a step to extract() that splits the star names; the second requires us to read all the pages first, collect the records in one place, and then write them to the file in a single pass.

Modify the source program:

Add a new function that turns the stars string into a list:

def stars_split(st):
    return st.split(', ')
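For example:

>>> stars_split('Tim Robbins, Morgan Freeman, Bob Gunton')
['Tim Robbins', 'Morgan Freeman', 'Bob Gunton']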

Modify extract() to add a call to stars_split:

def extract(html):      # regex extraction, returns a list of result dicts
    print('\tExtracting...')
    raws = re.findall(pattern, html, re.S)
    # raws: [(rank, image URL, name, stars, screening time,
    #         integer part of score, decimal part of score), ...]
    result = []
    for raw in raws:
        dc = {                          # adjust the field order here
            'index': raw[0],
            'title': raw[2],
            'stars': stars_split(raw[3]),   # [edit]: split the star names
            'otime': raw[4],
            'score': raw[5] + raw[6],   # merge integer and decimal parts
            'image': raw[1]
        }
        result.append(dc)

    return result

Add a new global variable and function to integrate the results:

result = {'top movies': []}

def merge(data):
    print('\tMerging... ')
    result['top movies'] += data

Modify save() so that it writes the merged result in one go:

def save(data):      # write to file
    print('Saving... ')
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(data, ensure_ascii=False))

Modify the program framework:

if __name__ == '__main__':
    for i in range(0, 100, 10):     # page turning
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        merge(data)
        time.sleep(0.5)     # avoid sending requests too densely and getting blocked

    save(result)
    print('[100%] All Finished.\n Results in', filename)

The integrated code:

import re
import json
import time

import requests

url = 'https://maoyan.com/board/4'
result = {'top movies': []}
filename = './movies.json'      # better to change the saved file name, otherwise output is appended to the end of the last run's results

pattern = r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
    'Accept-Language': 'zh-cn'
}


def get_page(url):      # grab the page and return the HTML string
    print('\tGetting...')
    try:
        response = requests.get(url, headers=headers)
        return response.text
    except Exception as e:
        print('[Error]', e)
        return ''


def stars_split(st):
    return st.split(', ')


def extract(html):      # regex extraction, returns a list of result dicts
    print('\tExtracting...')
    raws = re.findall(pattern, html, re.S)
    # raws: [(rank, image URL, name, stars, screening time,
    #         integer part of score, decimal part of score), ...]
    result = []
    for raw in raws:
        dc = {                          # adjust the field order here
            'index': raw[0],
            'title': raw[2],
            'stars': stars_split(raw[3]),   # split the star names
            'otime': raw[4],
            'score': raw[5] + raw[6],   # merge integer and decimal parts
            'image': raw[1]
        }
        result.append(dc)
    return result


def merge(data):
    print('\tMerging...')
    result['top movies'] += data


def save(data):         # write to file
    print('Saving...')
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(data, ensure_ascii=False))


if __name__ == '__main__':
    for i in range(0, 100, 10):     # page turning
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        merge(data)
        time.sleep(0.5)     # avoid sending requests too densely and getting blocked

    save(result)
    print('[100%] All Finished.\n Results in', filename)

Run the modified program and get the new result:

{"top movies": [{"index": "1"."title": "Farewell my Concubine"."stars": ["Leslie Cheung"."Zhang Fengyi"."Gong li"]."otime": "1993-01-01"."score": "9.6"."image": "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg"}, {"index": "2"."title": "The Shawshank Redemption"."stars": ["Tim Robbins"."Morgan Freeman"."Bob Gunton"]."otime": "The 1994-10-14 (us)"."score": "9.5"."image": "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg"}, ..., {"index": "100"."title": "My neighbor totoro"."stars": ["Qin Lan"."Itoi Shigori"."Sumi Shimamoto"]."otime": "2018-12-14"."score": "9.2"."image": "https://p0.meituan.net/movie/c304c687e287c7c2f9e22cf78257872d277201.jpg"}}]Copy the code

That is exactly the result we wanted.

Conclusion

The crawler project is now complete. To sum up: we used requests.get() with faked headers to make the requests, re.findall() with a regular expression to parse the pages (adjusting the order of the fields afterwards), and JSON formatting to save the results.

In fact, with a little modification we can use this project to crawl many other movie rankings. For example, we turned it into a Douban Top250 crawler with only slight changes; see the sketch below.
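For illustration, here is a hypothetical sketch of the only pieces that would need to change for Douban Top250 (the pattern itself is left as a placeholder, to be derived from that page's HTML the same way we did above):

url = 'https://movie.douban.com/top250'
step = 25                       # Douban shows 25 movies per page
page_param = '?start='          # Douban pages with ?start=0, ?start=25, ...
pattern = r'...'                # placeholder: re-derive from Douban's HTML

for i in range(0, 250, step):
    target = url + page_param + str(i)
    print(target)               # then get_page / extract / merge as before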

We have shown the development process of this project step by step, from setting the goal to completing the program. This development sequence applies to many projects, and we think the approach is worth understanding and practicing.