Following on from my previous post on Python Requests and regular expressions, this article builds a crawler for the Maoyan TOP100 movie chart, covering web page analysis, writing the regular expression, converting the data, and saving it. Here are the details.

Analyzing the web page

Open the Maoyan TOP100 page, open the developer tools, view the page source, and find the HTML for each list entry, or just look at the snippet below.


     
    <dd>
        <i class="board-index board-index-4">4</i>
        <a href="/films/4055" title="Léon: The Professional" class="image-link" data-act="boarditem-click" data-val="{movieId:4055}">
            <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default">
            <img data-src="http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c" alt="Léon: The Professional" class="board-img">
        </a>
        <div class="board-item-main">
            <div class="board-item-content">
                <div class="movie-item-info">
                    <p class="name">
                        <a href="/films/4055" title="Léon: The Professional" data-act="boarditem-click" data-val="{movieId:4055}">Léon: The Professional</a>
                    </p>
                    <p class="star">
                        Starring: Jean Reno,Gary Oldman,Natalie Portman
                    </p>
                    <p class="releasetime">Release time: 1994-09-14</p>
                </div>
                <div class="movie-item-number score-num">
                    <p class="score">
                        <i class="integer">9.</i>
                        <i class="fraction">5</i>
                    </p>
                </div>
            </div>
        </div>
    </dd>


From each entry we want to extract the ranking, poster image, title, starring cast, release date, and rating. Analyzing the HTML above gives us the regular expression:


     
    <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>

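Before wiring the pattern into the crawler, it is worth sanity-checking it against the <dd> block from the analysis above. The snippet below is only a quick check: the HTML is trimmed to the parts the pattern actually touches, and re.S lets . match across line breaks.

    import re

    # Sample trimmed from the <dd> block shown in the analysis section
    sample = '''<dd>
        <i class="board-index board-index-4">4</i>
        <img data-src="http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c" alt="" class="board-img">
        <p class="name"><a href="/films/4055">Léon: The Professional</a></p>
        <p class="star">Starring: Jean Reno,Gary Oldman,Natalie Portman</p>
        <p class="releasetime">Release time: 1994-09-14</p>
        <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
    </dd>'''

    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
        '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)

    print(re.findall(pattern, sample))
    # [('4', 'http://p0.meituan.net/movie/...@160w_220h_1e_1c', 'Léon: The Professional',
    #   'Starring: Jean Reno,Gary Oldman,Natalie Portman', 'Release time: 1994-09-14', '9.', '5')]

Each movie comes back as one tuple of seven captured groups, which is exactly what the parsing function below iterates over.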

Writing the crawler


     
    import requests
    import re
    import json
    import time

    def get_one_page(url):
        # Fetch the raw HTML of one board page; a browser User-Agent header
        # gets us past the simplest anti-crawler check
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None

    def parse_one_page(html):
        # Pull ranking, image, title, cast, release date and score out of each <dd> block
        pattern = re.compile(
            '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
            '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
            '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
        items = re.findall(pattern, html)
        for item in items:
            yield {
                'index': item[0],
                'image': item[1],
                'title': item[2].strip(),
                # Drop the leading "主演：" ("Starring:") prefix, 3 characters on the Chinese page
                'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
                # Drop the leading "上映时间：" ("Release time:") prefix, 5 characters on the Chinese page
                'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
                # Integer and fraction parts of the score live in separate <i> tags
                'score': item[5].strip() + item[6].strip()
            }

    def write_to_json(content):
        # Append each record to result.json as UTF-8 encoded JSON
        with open('result.json', 'ab') as f:
            f.write(json.dumps(content, ensure_ascii=False).encode('utf-8'))

    def main(offset):
        url = 'http://maoyan.com/board/4?offset=' + str(offset)
        html = get_one_page(url)
        for item in parse_one_page(html):
            print(item)
            write_to_json(item)

    if __name__ == '__main__':
        # 100 movies, 10 per page: offset = 0, 10, ..., 90
        for i in range(10):
            main(offset=i * 10)
            time.sleep(1)


Let’s walk through the above code step by step

  • get_one_page: fetches the HTML source of the page at the given URL (a more defensive variant is sketched after this list).

  • parse_one_page: extracts the fields we want from the HTML using the regular expression.

  • write_to_json: writes each record to a text file as JSON.

  • main: loops over the pages. There are one hundred entries in total, ten per page, and the page served is controlled by the offset parameter in the URL.
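As promised above, one practical tweak for get_one_page: as written, a timeout or connection error will raise and kill the whole run. A slightly more defensive variant (an optional change, not part of the original script) can catch requests.RequestException and return None instead:

    import requests
    from requests.exceptions import RequestException

    def get_one_page(url):
        # Same logic as before, but a failed request returns None instead of raising
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            return None

If you adopt this, main should also check that html is not None before passing it to parse_one_page, otherwise re.findall will complain about a non-string argument.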

The results


     
    {
        "time": "1993-01-01(Hong Kong, China)",
        "image": "http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c",
        "title": "Farewell My Concubine",
        "score": "9.6",
        "index": "1",
        "actor": "Leslie Cheung,Zhang Fengyi,Gong Li"
    }{
        "time": "1994-10-14(USA)",
        "image": "http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c",
        "title": "The Shawshank Redemption",
        "score": "9.5",
        "index": "2",
        "actor": "Tim Robbins,Morgan Freeman,Bob Gunton"
    }

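One caveat about the output format: because write_to_json appends the objects back to back, result.json is not a single valid JSON document and json.load will choke on it. If you later want the records back in Python, one option is to decode them one at a time with json.JSONDecoder.raw_decode (read_results below is a hypothetical helper, not part of the script above):

    import json

    def read_results(path='result.json'):
        # result.json holds JSON objects concatenated without separators,
        # so decode them one by one instead of calling json.load on the whole file
        decoder = json.JSONDecoder()
        with open(path, encoding='utf-8') as f:
            text = f.read()
        records, pos = [], 0
        while pos < len(text):
            obj, pos = decoder.raw_decode(text, pos)
            records.append(obj)
        return records

    for record in read_results():
        print(record['index'], record['title'], record['score'])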

That's the end of our first crawler. Tomorrow I'll try to crawl news data from the Xueqiu (Snowball) home page. After that I may try to track the real-money portfolios of well-known investors, so that any adjustment to their positions triggers an immediate notification to my phone and I can act on it in time.

