Following on from my previous post on Python Requests and regular expressions, this article builds a crawler for the Maoyan TOP100 movie chart, covering web page analysis, writing the regular expression, converting the data, and saving it. Here are the details.

Analyzing the web page

Open the Maoyan TOP100 page, open the developer tools, view the page source, and find the HTML for each list entry, or just look at the snippet below.


     
    <dd>
        <i class="board-index board-index-4">4</i>
        <a href="/films/4055" title="Léon: The Professional" class="image-link" data-act="boarditem-click" data-val="{movieId:4055}">
            <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default">
            <img data-src="http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c" alt="Léon: The Professional" class="board-img">
        </a>
        <div class="board-item-main">
            <div class="board-item-content">
                <div class="movie-item-info">
                    <p class="name">
                        <a href="/films/4055" title="Léon: The Professional" data-act="boarditem-click" data-val="{movieId:4055}">Léon: The Professional</a>
                    </p>
                    <p class="star">
                        Starring: Jean Reno,Gary Oldman,Natalie Portman
                    </p>
                    <p class="releasetime">Release time: 1994-09-14</p>
                </div>
                <div class="movie-item-number score-num">
                    <p class="score">
                        <i class="integer">9.</i>
                        <i class="fraction">5</i>
                    </p>
                </div>
            </div>
        </div>
    </dd>


From each entry we want to extract the ranking, poster image, title, starring cast, release date, and rating. Analyzing the HTML above gives us the regular expression:


     
    <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>

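Before wiring the pattern into the crawler, it is worth sanity-checking it against the <dd> block from the analysis above. The snippet below is only a quick check: the HTML is trimmed to the parts the pattern actually touches, and re.S lets . match across line breaks.

    import re

    # Sample trimmed from the <dd> block shown in the analysis section
    sample = '''<dd>
        <i class="board-index board-index-4">4</i>
        <img data-src="http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c" alt="" class="board-img">
        <p class="name"><a href="/films/4055">Léon: The Professional</a></p>
        <p class="star">Starring: Jean Reno,Gary Oldman,Natalie Portman</p>
        <p class="releasetime">Release time: 1994-09-14</p>
        <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
    </dd>'''

    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
        '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)

    print(re.findall(pattern, sample))
    # [('4', 'http://p0.meituan.net/movie/...@160w_220h_1e_1c', 'Léon: The Professional',
    #   'Starring: Jean Reno,Gary Oldman,Natalie Portman', 'Release time: 1994-09-14', '9.', '5')]

Each movie comes back as one tuple of seven captured groups, which is exactly what the parsing function below iterates over.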

Writing the crawler


     
    import requests
    import re
    import json
    import time

    def get_one_page(url):
        # Fetch the raw HTML of one board page; a browser User-Agent header
        # gets us past the simplest anti-crawler check
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None

    def parse_one_page(html):
        # Pull ranking, image, title, cast, release date and score out of each <dd> block
        pattern = re.compile(
            '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
            '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
            '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
        items = re.findall(pattern, html)
        for item in items:
            yield {
                'index': item[0],
                'image': item[1],
                'title': item[2].strip(),
                # Drop the leading "主演：" ("Starring:") prefix, 3 characters on the Chinese page
                'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
                # Drop the leading "上映时间：" ("Release time:") prefix, 5 characters on the Chinese page
                'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
                # Integer and fraction parts of the score live in separate <i> tags
                'score': item[5].strip() + item[6].strip()
            }

    def write_to_json(content):
        # Append each record to result.json as UTF-8 encoded JSON
        with open('result.json', 'ab') as f:
            f.write(json.dumps(content, ensure_ascii=False).encode('utf-8'))

    def main(offset):
        url = 'http://maoyan.com/board/4?offset=' + str(offset)
        html = get_one_page(url)
        for item in parse_one_page(html):
            print(item)
            write_to_json(item)

    if __name__ == '__main__':
        # 100 movies, 10 per page: offset = 0, 10, ..., 90
        for i in range(10):
            main(offset=i * 10)
            time.sleep(1)


Let’s walk through the above code step by step

  • get_one_page: fetches the HTML source of the page at the given URL (a more defensive variant is sketched after this list).

  • parse_one_page: extracts the fields we want from the HTML using the regular expression.

  • write_to_json: writes each record to a text file as JSON.

  • main: loops over the pages. There are one hundred entries in total, ten per page, and the page served is controlled by the offset parameter in the URL.
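As promised above, one practical tweak for get_one_page: as written, a timeout or connection error will raise and kill the whole run. A slightly more defensive variant (an optional change, not part of the original script) can catch requests.RequestException and return None instead:

    import requests
    from requests.exceptions import RequestException

    def get_one_page(url):
        # Same logic as before, but a failed request returns None instead of raising
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            return None

If you adopt this, main should also check that html is not None before passing it to parse_one_page, otherwise re.findall will complain about a non-string argument.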

The results


     
    {
        "time": "1993-01-01(Hong Kong, China)",
        "image": "http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c",
        "title": "Farewell My Concubine",
        "score": "9.6",
        "index": "1",
        "actor": "Leslie Cheung,Zhang Fengyi,Gong Li"
    }{
        "time": "1994-10-14(USA)",
        "image": "http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c",
        "title": "The Shawshank Redemption",
        "score": "9.5",
        "index": "2",
        "actor": "Tim Robbins,Morgan Freeman,Bob Gunton"
    }

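One caveat about the output format: because write_to_json appends the objects back to back, result.json is not a single valid JSON document and json.load will choke on it. If you later want the records back in Python, one option is to decode them one at a time with json.JSONDecoder.raw_decode (read_results below is a hypothetical helper, not part of the script above):

    import json

    def read_results(path='result.json'):
        # result.json holds JSON objects concatenated without separators,
        # so decode them one by one instead of calling json.load on the whole file
        decoder = json.JSONDecoder()
        with open(path, encoding='utf-8') as f:
            text = f.read()
        records, pos = [], 0
        while pos < len(text):
            obj, pos = decoder.raw_decode(text, pos)
            records.append(obj)
        return records

    for record in read_results():
        print(record['index'], record['title'], record['score'])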

That's the end of our first crawler. Tomorrow I'll try to crawl news data from the Xueqiu (Snowball) home page. After that I may try to track the real-money portfolios of well-known investors, so that any adjustment to their positions triggers an immediate notification to my phone and I can act on it in time.

