Following on from my previous post on Python Requests, which also covered regular expressions, this article builds a crawler for the Maoyan TOP100 movie chart. It covers web page analysis, writing the regular expression, data conversion, and saving the data. Here are the details.
Analyzing the web page
Open the Maoyan TOP100 page, open the developer tools, and view the page source to find the markup for the list, or just look at the snippet below.
<dd>
<i class="board-index board-index-4">4</i>
<a href="/films/4055" title="Léon: The Professional" class="image-link" data-act="boarditem-click" data-val="{movieId:4055}">
<img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default">
<img alt="Léon: The Professional" class="board-img" data-src="http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c">
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name">
<a href="/films/4055" title="Léon: The Professional" data-act="boarditem-click" data-val="{movieId:4055}">Léon: The Professional</a>
</p>
<p class="star">
Starring: Jean Reno, Gary Oldman, Natalie Portman
</p>
<p class="releasetime">Release time: 1994-09-14</p>
</div>
<div class="movie-item-number score-num">
<p class="score">
<i class="integer">9.</i>
<i class="fraction">5</i>
</p>
</div>
</div>
</div>
</dd>
From each list item we want the ranking, image, title, stars, release date and score. Analyzing the page gives us the regular expression:
<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>
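To sanity-check the expression, you can run it against a minimal, made-up `<dd>` block shaped like the real page (the film data below is illustrative, not scraped):

```python
import re

# The pattern from the article, split across lines for readability.
pattern = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?'
    '>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)

# A trimmed, hypothetical <dd> block in the same shape as the real page.
sample = '''<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="http://example.com/poster.jpg" alt="" class="board-img">
  <p class="name"><a href="/films/1">Farewell My Concubine</a></p>
  <p class="star">Starring: Leslie Cheung</p>
  <p class="releasetime">Release time: 1993-01-01</p>
  <i class="integer">9.</i><i class="fraction">6</i>
</dd>'''

print(re.findall(pattern, sample))
# [('1', 'http://example.com/poster.jpg', 'Farewell My Concubine',
#   'Starring: Leslie Cheung', 'Release time: 1993-01-01', '9.', '6')]
```

Note that `re.S` makes `.` match newlines too, which is what lets the pattern span the whole multi-line block.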
Write the crawler
import requests
import re
import json
import time


def get_one_page(url):
    # Fetch the page source, pretending to be a desktop browser.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/52.0.2743.116 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


def parse_one_page(html):
    # Extract ranking, image, title, stars, release date and score.
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?'
        '>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # [3:] drops the "主演：" label, [5:] drops "上映时间：".
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }


def write_to_json(content):
    # Append each record to result.json (objects end up back to back).
    with open('result.json', 'ab') as f:
        f.write(json.dumps(content, ensure_ascii=False).encode('utf-8'))


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_json(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # be polite: pause between pages
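The `[3:]` and `[5:]` slices in `parse_one_page` exist because the raw page text carries Chinese field labels. A quick sketch with a made-up match tuple shows what they remove:

```python
# One regex match as parse_one_page sees it (hypothetical values; the
# labels "主演：" (3 chars) and "上映时间：" (5 chars) come from the page).
item = ('1', 'http://example.com/poster.jpg', '霸王别姬',
        '主演：张国荣,张丰毅,巩俐', '上映时间：1993-01-01', '9.', '6')

record = {
    'index': item[0],
    'image': item[1],
    'title': item[2].strip(),
    'actor': item[3].strip()[3:],   # drop the "主演：" label
    'time': item[4].strip()[5:],    # drop the "上映时间：" label
    'score': item[5].strip() + item[6].strip()  # "9." + "6" -> "9.6"
}

print(record['actor'])  # 张国荣,张丰毅,巩俐
print(record['time'])   # 1993-01-01
print(record['score'])  # 9.6
```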
Let’s walk through the above code step by step:

- get_one_page: fetches the page source for a given URL.
- parse_one_page: extracts the fields we want from the page with the regular expression.
- write_to_json: writes the resulting data to a file.
- main: loops over the pages. There are one hundred entries in total, and the page content changes with the offset in the URL.
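Concretely, main() is called ten times with a growing offset, so the crawler walks the chart ten entries at a time:

```python
# The ten offsets main() is called with; each page lists ten movies.
offsets = [i * 10 for i in range(10)]
urls = ['http://maoyan.com/board/4?offset=' + str(o) for o in offsets]

print(offsets)   # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
print(urls[0])   # http://maoyan.com/board/4?offset=0
print(urls[-1])  # http://maoyan.com/board/4?offset=90
```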
The results
{
    "index": "1",
    "image": "http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c",
    "title": "Farewell My Concubine",
    "actor": "Leslie Cheung, Zhang Fengyi, Gong Li",
    "time": "1993-01-01(Hong Kong, China)",
    "score": "9.6"
}
{
    "index": "2",
    "image": "http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c",
    "title": "The Shawshank Redemption",
    "actor": "Tim Robbins, Morgan Freeman, Bob Gunton",
    "time": "1994-10-14(USA)",
    "score": "9.5"
}
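One caveat: because write_to_json appends objects back to back with no separator, result.json is not a single valid JSON document and `json.load()` on it fails. A sketch of one way to read the records back, using the standard library's `raw_decode` (the helper name here is my own):

```python
import json

def read_records(text):
    # Walk through back-to-back JSON objects; raw_decode returns each
    # parsed object plus the index where it ended.
    decoder = json.JSONDecoder()
    pos, records = 0, []
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records

print(read_records('{"index": "1"}{"index": "2"}'))
# [{'index': '1'}, {'index': '2'}]
```

Writing one JSON object per line (appending a newline in write_to_json) would be an alternative design that makes the file easier to process line by line.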
So that’s our first crawler done. Tomorrow I’ll try to crawl news data from Xueqiu. After that I may try tracking a top investor’s live portfolio, sending a notification to my phone the moment they trade or adjust a position, so I can act in time.