Python Crawler Practice – Crawling Movie Rankings
Python crawler practice: use Requests and regular expressions to crawl the Maoyan TOP100 movie chart.
Goal
We intend to extract the name, screening time, rating, poster image, and other information for each of the Maoyan TOP100 movies, and save the extracted results to a file.
Preparation
- System environment: macOS High Sierra 10.13.6
- Development Language: Python 3.7.2 (default)
- Third-party libraries: Requests
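Requests is the only third-party dependency; if it is not installed yet, it can be installed with pip:

pip3 install requests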
Analysis
The target site: https://maoyan.com/board/4
Opening the page, we can see that the useful information it displays is the movie name, stars, screening time, screening region, rating, and poster image.
Next, let’s look at the HTML source for the top-ranked item, from which we’ll design our regular expression:
<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1203" title="Farewell my Concubine" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
<img src="//s0.meituan.net/bs/?f=myfe/mywww:/image/loading_2.e3d934bf.png" alt=""
class="poster-default" />
<img data-src="https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c"
alt="Farewell my Concubine" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1203" title="Farewell my Concubine" data-act="boarditem-click"
data-val="{movieId:1203}">Farewell my concubine</a></p>
<p class="star">Starring: Leslie Cheung, Feng Yi Zhang, Gong Li</p>
<p class="releasetime">Screening time: 1993-01-01</p>
</div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</div>
</div>
</div>
</dd>
Now look at pagination: on page 2, the URL changes from https://maoyan.com/board/4 to https://maoyan.com/board/4?offset=10 — an extra ?offset=10. On the next page it becomes ?offset=20. Evidently the offset grows by 10 per page (each page shows exactly 10 entries, which makes sense). We can even set it back to 0 (the home page), or to any multiple of 10 in the range [0, 90].
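As a minimal sketch (assuming the ?offset= behavior observed above), the ten page URLs can be generated like this:

# Generate all ten page URLs using the ?offset= rule above
base = 'https://maoyan.com/board/4'
urls = [base + '?offset=' + str(i) for i in range(0, 100, 10)]
print(urls[1])  # https://maoyan.com/board/4?offset=10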
Design
Let’s now design a crawler that can accomplish this goal.
The crawler should have these parts:
- Fetching the page (also handling pagination): use requests.get(); it’s best to fake the request headers.
- Regex extraction (matching the movie name, stars, screening time, screening region, rating, and image): use re.findall() with an appropriate regular expression.
- Writing to a file (saving the information in JSON format): use json.dumps() and write the result to a file.
Now we need to design the crucial regular expression:
'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>'
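To sanity-check the pattern, we can try it on a condensed copy of the <dd> snippet above (a minimal sketch; the sample HTML and image URL are abbreviated):

import re

# Condensed from the <dd> snippet above -- just enough markup for the pattern
sample = '''<dd>
<i class="board-index board-index-1">1</i>
<img data-src="https://p1.meituan.net/movie/20803f.jpg@160w_220h_1e_1c" alt="" class="board-img" />
<p class="name"><a href="/films/1203" title="Farewell my Concubine">Farewell my concubine</a></p>
<p class="star">Starring: Leslie Cheung, Feng Yi Zhang, Gong Li</p>
<p class="releasetime">Screening time: 1993-01-01</p>
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

pattern = (r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)"'
           r'.*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>'
           r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>')
print(re.findall(pattern, sample, re.S))
# [('1', 'https://p1.meituan.net/movie/20803f.jpg', 'Farewell my Concubine',
#   'Leslie Cheung, Feng Yi Zhang, Gong Li', '1993-01-01', '9.', '6')]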
Implementation
Let’s now implement the first version of the program as designed:
import re
import json
import time

import requests

url = 'https://maoyan.com/board/4'
filename = './movies.txt'
pattern = (r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)"'
           r'.*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>'
           r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>')
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
                  'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                  'Version/12.0.3 Safari/605.1.15',
    'Accept-Language': 'zh-cn'
}


def get_page(url):  # grab the page and return the HTML string
    print('\tGetting...')
    try:
        response = requests.get(url, headers=headers)
        return response.text
    except Exception as e:
        print('[Error]', e)
        return ''


def extract(html):  # regex extraction, return a list of result dicts
    print('\tExtracting...')
    # [(rank, image address, name, stars, screening time,
    #   score integer part, score decimal part), ...]
    raws = re.findall(pattern, html, re.S)
    result = []
    for raw in raws:
        dc = {  # adjust the field order here
            'index': raw[0],
            'title': raw[2],
            'stars': raw[3],
            'otime': raw[4],
            'score': raw[5] + raw[6],  # merge the integer and decimal parts
            'image': raw[1]
        }
        result.append(dc)
    return result


def save(data):  # write to file
    print('\tSaving...')
    with open(filename, 'a', encoding='utf-8') as f:
        for i in data:
            f.write(json.dumps(i, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    for i in range(0, 100, 10):  # page through the list
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        save(data)
        time.sleep(0.5)  # throttle requests so they don't get blocked
    print('[100%] All Finished.\n Results in', filename)
Debugging
Run the program, and if all goes well, we get the result:
{"index": "1"."title": "Farewell my Concubine"."stars": "Leslie Cheung, Feng Yi Zhang, Li Gong"."otime": "1993-01-01"."score": "9.6"."image": "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg"}
{"index": "2"."title": "The Shawshank Redemption"."stars": "Tim Robbins, Morgan Freeman, Bob Gunton."."otime": "The 1994-10-14 (us)"."score": "9.5"."image": "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg"}
{"index": "3"."title": Roman Holiday."stars": "Gregory Peck, Audrey Hepburn, Eddie Albert."."otime": "The 1953-09-02 (us)"."score": "9.1"."image": "https://p0.meituan.net/movie/54617769d96807e4d81804284ffe2a27239007.jpg"}
{"index": "4"."title": "The killer is not too cold."."stars": "Jean Reno, Gary Oldman, Natalie Portman."."otime": "The 1994-09-14 (France)"."score": "9.5"."image": "https://p0.meituan.net/movie/e55ec5d18ccc83ba7db68caae54f165f95924.jpg"}
{"index": "5"."title": "Titanic"."stars": "Leonardo dicaprio, Kate Winslet, Billy Zane."."otime": "1998-04-03"."score": "9.6"."image": "https://p1.meituan.net/movie/0699ac97c82cf01638aa5023562d6134351277.jpg"}
At first glance this looks fine: we have all the results we need. On closer inspection, however, there are still a couple of problems:
- A movie usually has several stars, so they are better stored as a list or tuple than as one string.
- Each record sits on its own line, so the file is not a single JSON document, which makes it awkward to transmit and parse as a whole.
To solve the first problem, we can add a step to extract() that splits the star string. The second requires us to collect the results from all the pages into one structure and write it to the file in a single pass.
Modify the source program:
Add a new function that turns the star string into a list:
def stars_split(st):
    return st.split(', ')
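For example, with the star string from the first record:

>>> stars_split('Leslie Cheung, Feng Yi Zhang, Gong Li')
['Leslie Cheung', 'Feng Yi Zhang', 'Gong Li']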
Modify extract() to add a call to stars_split:
def extract(html):  # regex extraction, return a list of result dicts
    print('\tExtracting...')
    # [(rank, image address, name, stars, screening time,
    #   score integer part, score decimal part), ...]
    raws = re.findall(pattern, html, re.S)
    result = []
    for raw in raws:
        dc = {  # adjust the field order here
            'index': raw[0],
            'title': raw[2],
            'stars': stars_split(raw[3]),  # [edit] split the stars into a list
            'otime': raw[4],
            'score': raw[5] + raw[6],  # merge the integer and decimal parts
            'image': raw[1]
        }
        result.append(dc)
    return result
Add a new global variable and function to integrate the results:
result = {'top movies': []}


def merge(data):
    print('\tMerging...')
    result['top movies'] += data
Modify save():
def save(data):  # write to file
    print('Saving...')
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(data, ensure_ascii=False))
Modify the program framework:
if __name__ == '__main__':
    for i in range(0, 100, 10):  # page through the list
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        merge(data)
        time.sleep(0.5)  # throttle requests so they don't get blocked
    save(result)
    print('[100%] All Finished.\n Results in', filename)
The integrated code:
import re
import json
import time

import requests

url = 'https://maoyan.com/board/4'
result = {'top movies': []}
filename = './movies.json'  # better to use a new file name here, otherwise the output is appended after the previous run's results
pattern = (r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)"'
           r'.*?Starring: (.*?)\s*</p>.*?Screening time: (.*?)</p>'
           r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>')
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
                  'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                  'Version/12.0.3 Safari/605.1.15',
    'Accept-Language': 'zh-cn'
}


def get_page(url):  # grab the page and return the HTML string
    print('\tGetting...')
    try:
        response = requests.get(url, headers=headers)
        return response.text
    except Exception as e:
        print('[Error]', e)
        return ''


def stars_split(st):  # split the star string into a list
    return st.split(', ')


def extract(html):  # regex extraction, return a list of result dicts
    print('\tExtracting...')
    # [(rank, image address, name, stars, screening time,
    #   score integer part, score decimal part), ...]
    raws = re.findall(pattern, html, re.S)
    result = []
    for raw in raws:
        dc = {  # adjust the field order here
            'index': raw[0],
            'title': raw[2],
            'stars': stars_split(raw[3]),  # split the stars
            'otime': raw[4],
            'score': raw[5] + raw[6],  # merge the integer and decimal parts
            'image': raw[1]
        }
        result.append(dc)
    return result


def merge(data):  # collect the results from each page
    print('\tMerging...')
    result['top movies'] += data


def save(data):  # write to file
    print('Saving...')
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(data, ensure_ascii=False))


if __name__ == '__main__':
    for i in range(0, 100, 10):  # page through the list
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        merge(data)
        time.sleep(0.5)  # throttle requests so they don't get blocked
    save(result)
    print('[100%] All Finished.\n Results in', filename)
Run the modified program and get the new result:
{"top movies": [{"index": "1"."title": "Farewell my Concubine"."stars": ["Leslie Cheung"."Zhang Fengyi"."Gong li"]."otime": "1993-01-01"."score": "9.6"."image": "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg"}, {"index": "2"."title": "The Shawshank Redemption"."stars": ["Tim Robbins"."Morgan Freeman"."Bob Gunton"]."otime": "The 1994-10-14 (us)"."score": "9.5"."image": "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg"}, ..., {"index": "100"."title": "My neighbor totoro"."stars": ["Qin Lan"."Itoi Shigori"."Sumi Shimamoto"]."otime": "2018-12-14"."score": "9.2"."image": "https://p0.meituan.net/movie/c304c687e287c7c2f9e22cf78257872d277201.jpg"}}]Copy the code
This is exactly what we wanted.
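As a quick check (a sketch, assuming the filename and key used above), we can load the file back and count the collected entries:

import json

with open('./movies.json', encoding='utf-8') as f:
    data = json.load(f)
print(len(data['top movies']))  # should print 100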
Conclusion
The crawler project is now complete. To sum up: we used requests.get() to make the requests, faking the headers; re.findall() with a regular expression to parse the results, adjusting the field order along the way; and json.dumps() to save the results in JSON format.
In fact, with only slight modification, this project can crawl many other movie charts. We actually adapted it into a program that crawls the Douban Top 250, and the changes really were minimal, as sketched below.
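For instance, the main changes are the base URL and the pagination step (a sketch, assuming Douban pages with a ?start= parameter in steps of 25; the regular expression would also need to be redesigned for Douban's HTML):

# Sketch of the pagination change for the Douban Top 250
url = 'https://movie.douban.com/top250'
for i in range(0, 250, 25):
    target = url + '?start=' + str(i)
    # ...then get_page(target), extract(page), merge(data) as before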
We have shown the development process of this project step by step, from setting the goal to final completion. This development sequence applies to many other projects as well, and we think it is worth understanding and practicing.