This article was originally published in my WeChat public account "Geek Monkey". Follow it to be the first to see more original content.

This article was written three months ago; read it for the ideas rather than the specifics.

Foreword: Avengers: Infinity War was released in mainland China on May 11, 2018. As of May 16, its cumulative box office had reached 1.525 billion yuan, already breaking the record for a single Marvel film. I have to say, Marvel movies have become a cultural phenomenon.

Avengers: Infinity War is the finale of Marvel's ten-year saga, and Marvel has said the team put in a great deal of work to deliver an amazing movie. I went to the cinema over the weekend, and both the fight effects and the story were a pleasure to watch. The film also keeps Marvel's usual humorous style and often had the audience laughing.

If you haven’t seen it yet, you can go to the cinema. It’s really worth seeing.

In this article, I build a web crawler in Python to scrape Douban reviews of the movie, analyze them, and turn them into a word cloud.

1 Analysis

First, inspect the review page to decide what to crawl. I will crawl the username, whether the user has seen the movie, the five-star rating, the comment time, the number of "useful" votes, and the comment content.

Then determine the URL structure of the comment pages. The pages differ only in the start query parameter; page 2, for example, is:

https://movie.douban.com/subject/24773958/comments?start=20
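
As a quick sketch (assuming Douban's 20 comments per page, which is what the site returned at the time), the URL for any page can be built from the start offset:

# Sketch: build the comments URL for a given 1-indexed page number.
# Assumes 20 comments per page, so page N starts at offset (N - 1) * 20.
base_url = 'https://movie.douban.com/subject/24773958/comments'

def page_url(page_number, page_size=20):
    return '{}?start={}'.format(base_url, (page_number - 1) * page_size)

print(page_url(1))  # https://movie.douban.com/subject/24773958/comments?start=0
print(page_url(2))  # https://movie.douban.com/subject/24773958/comments?start=20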

2 Crawling the data

This article uses the Requests library to fetch pages and XPath (via the lxml library) to extract the data. Although Douban is fairly friendly to web crawlers, it still has an anti-crawler mechanism: if you fire off many requests at once without any delay, your IP can get blocked. In addition, if you are not logged in to Douban, you can only access the first 10 pages of reviews, so the HTTP requests used to crawl the data must carry your own cookies. Getting the cookies is not hard: log in to Douban in a browser and copy them from the developer tools.
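
If you copy the whole raw Cookie header string from the browser's developer tools, you can turn it into the dict that Requests expects with a couple of lines. A minimal sketch (the cookie string below is a placeholder, not a real value):

# Sketch: convert a raw Cookie header copied from devtools into a dict.
# The string here is a placeholder -- paste your own logged-in cookie value.
raw_cookie = 'bid=abc123; dbcl2="12345678:XXXXXXXX"; ck=YYYY'
cookies = dict(pair.split('=', 1) for pair in raw_cookie.split('; '))
print(cookies)  # {'bid': 'abc123', 'dbcl2': '"12345678:XXXXXXXX"', 'ck': 'YYYY'}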

I start crawling from the first page of reviews; the entry point is movie.douban.com/subject/247… . From each page I extract the next page's URL along with the content to crawl, then visit that next address, and so on.

import codecs
import csv
import random
import time

import jieba
import pandas as pd
import requests
from lxml import etree

def start_spider():
    base_url = 'https://movie.douban.com/subject/24773958/comments'
    start_url = base_url + '?start=0'

    number = 1
    html = request_get(start_url)

    while html.status_code == 200:
        # Get the next page's URL
        selector = etree.HTML(html.text)
        nextpage = selector.xpath("//div[@id='paginator']/a[@class='next']/@href")
        if not nextpage:  # no "next" link means this is the last page
            break
        next_url = base_url + nextpage[0]

        # Get the comments on the current page
        comments = selector.xpath("//div[@class='comment']")
        marvelthree = []
        for each in comments:
            marvelthree.append(get_comments(each))

        data = pd.DataFrame(marvelthree)
        # Write to the CSV file; 'a+' is append mode
        try:
            if number == 1:
                csv_headers = ['user', 'watched', 'five-star rating', 'comment time', 'useful votes', 'comment content']
                data.to_csv('./Marvel3_yingpping.csv', header=csv_headers, index=False, mode='a+', encoding='utf-8')
            else:
                data.to_csv('./Marvel3_yingpping.csv', header=False, index=False, mode='a+', encoding='utf-8')
        except UnicodeEncodeError:
            print("Encoding error. This batch cannot be written to the file; skipping it.")

        number += 1
        html = request_get(next_url)

I add a randomly chosen User-Agent to the request headers, together with the cookies. Finally, I add a random wait between requests so that too many requests in a row don't get the IP blocked.

def request_get(url):
    """Use a session so that certain parameters, including cookies,
    are kept across all requests made by the same Session instance."""
    timeout = 3

    UserAgent_List = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.3319.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.2309.372 Safari/537.36",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.2117.157 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1866.237 Safari/537.36",
    ]

    header = {
        'User-Agent': random.choice(UserAgent_List),
        'Host': 'movie.douban.com',
        'Referer': 'https://movie.douban.com/subject/24773958/?from=showing',
    }

    session = requests.Session()

    cookie = {
        'cookie': "Your cookie value",
    }

    # Wait a random 5-15 seconds so frequent requests don't get the IP banned
    time.sleep(random.randint(5, 15))
    response = session.get(url, headers=header, cookies=cookie, timeout=timeout)
    if response.status_code != 200:
        print(response.status_code)
    return response
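One thing worth noting: request_get creates a new Session on every call, so cookies set by the server are not actually reused between requests. If you want the cross-request persistence the docstring describes, one option (a sketch, not the original code) is to create the session once at module level:

# Sketch: a single module-level Session shared by every request, so that
# server-set cookies and connection pooling persist across calls.
shared_session = requests.Session()
shared_session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
})

def request_get_shared(url, timeout=3):
    time.sleep(random.randint(5, 15))  # keep the original's random delay
    response = shared_session.get(url, timeout=timeout)
    if response.status_code != 200:
        print(response.status_code)
    return response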

The last step is extracting the data from each comment:

def get_comments(eachComment):
    commentlist = []
    user = eachComment.xpath("./h3/span[@class='comment-info']/a/text()")[0]  # username
    watched = eachComment.xpath("./h3/span[@class='comment-info']/span[1]/text()")[0]  # watched or not
    rating = eachComment.xpath("./h3/span[@class='comment-info']/span[2]/@title")  # five-star rating
    if len(rating) > 0:
        rating = rating[0]

    comment_time = eachComment.xpath("./h3/span[@class='comment-info']/span[3]/@title")  # comment time
    if len(comment_time) > 0:
        comment_time = comment_time[0]
    else:
        # Some comments have no five-star rating; then span[2] holds the
        # time instead, so shift it over and leave the rating empty.
        comment_time = rating
        rating = ' '

    votes = eachComment.xpath("./h3/span[@class='comment-vote']/span/text()")[0]  # useful votes
    content = eachComment.xpath("./p/text()")[0]  # comment content

    commentlist.append(user)
    commentlist.append(watched)
    commentlist.append(rating)
    commentlist.append(comment_time)
    commentlist.append(votes)
    commentlist.append(content.strip())
    return commentlist
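To see the structure these XPath expressions expect, here is a quick sanity check of get_comments against a stripped-down, hand-written fragment of the comment markup (Douban's real HTML may have changed since this was written):

# Sketch: a minimal comment fragment matching the XPaths above, useful
# for testing get_comments() without hitting Douban at all.
sample = '''
<div class="comment">
  <h3>
    <span class="comment-vote"><span class="votes">3027</span></span>
    <span class="comment-info">
      <a href="#">SomeUser</a>
      <span>watched</span>
      <span class="rating" title="Recommended"></span>
      <span class="comment-time" title="2018-05-11 10:00:00">2018-05-11</span>
    </span>
  </h3>
  <p> A great finale to ten years of Marvel movies. </p>
</div>
'''
fragment = etree.HTML(sample).xpath("//div[@class='comment']")[0]
print(get_comments(fragment))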

3 Making the word cloud

Because the crawled comments form one large block of text, we need to segment it into words before counting word frequencies. I use the jieba library for the segmentation, then hand the segmented output to the WordItOut website to generate the word cloud.

def split_word():
    with codecs.open('Marvel3_yingpping.csv', 'r', 'utf-8') as csvfile:
        reader = csv.reader(csvfile)
        content_list = []
        for row in reader:
            try:
                content_list.append(row[5])
            except IndexError:
                pass

        content = ' '.join(content_list)

        # Precise-mode segmentation, one word per line for WordItOut
        seg_list = jieba.cut(content, cut_all=False)
        result = '\n'.join(seg_list)
        print(result)
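WordItOut does the frequency counting itself, but if you want to count locally first (and drop one-character tokens, which are mostly punctuation and function words), here is a small sketch:

# Sketch: count word frequencies in the newline-joined output of jieba.cut,
# keeping only tokens longer than one character.
from collections import Counter

def top_words(segmented_text, n=20):
    words = [w.strip() for w in segmented_text.split('\n') if len(w.strip()) > 1]
    return Counter(words).most_common(n)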

The resulting cloud image looks like this:

It's no surprise that the word "Thanos" appears most frequently, because the whole movie's storyline is Thanos collecting the six Infinity Stones from planets across the universe while the superheroes team up to stop him from destroying it.


This article was first published on WeChat under the title "Douban reviews tell you what Avengers 3 is about". You are welcome to reprint it; please contact me first to be added to the reprint whitelist, and respect the author's original work. On my WeChat account "Geek Monkey" I share original Python content every week, covering web crawlers, data analysis, web development, and more.