In this article, we continue to use Requests and XPath, this time to extract the short comments of a Douban movie.
1. Web page analysis
(1) Turn the page
We still use the Chrome browser to open the short-comment page of a Douban movie for analysis, taking A Good Play as the example.
As before, we could fetch every page by constructing each URL in advance, but this time we will try a different approach: following the link to the next page.
Press Ctrl+Shift+I to open the developer tools, then Ctrl+Shift+C to enable the element picker.
Now click the "Next" link at the bottom of the page, and the developer tools will jump to the corresponding position in the page source.
Next we use xpath to match the link address of the next page:
html.xpath('//div[@id="paginator"]/a[@class="next"]/@href')
In this way, we can simply loop: fetch a page, extract the link to the next page, and repeat until there is no next page.
The core code is as follows:
# Get the source code of the web page
def get_page(url):
    # Construct the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Send the request and get the response
    response = requests.get(url=url, headers=headers)
    # Get the source code of the page
    html = response.text
    # Return the page source code
    return html

# Parse the page source code to get the link to the next page
def parse4link(html, base_url):
    # Initialize the return value
    link = None
    # Construct the _Element object
    html_elem = etree.HTML(html)
    # Match the link to the next page; note that it is a relative address
    url = html_elem.xpath('//div[@id="paginator"]/a[@class="next"]/@href')
    # If the match succeeds, join it with the base URL to form a complete link
    if url:
        link = base_url + url[0]
    return link
(2) Analyze web content
The data we need this time includes (again matched with XPath):
- Number of upvotes:
//div[@class="comment-item"]/div[2]/h3/span[1]/span/text()
- Comment author:
//div[@class="comment-item"]/div[2]/h3/span[2]/a/text()
- Rating:
//div[@class="comment-item"]/div[2]/h3/span[2]/span[2]/@title
- Comment content:
//div[@class="comment-item"]/div[2]/p/span/text()
The core code is as follows:
# Parse the page source code to extract the data
def parse4data(html):
    # Construct the _Element object
    html = etree.HTML(html)
    # Number of upvotes
    agrees = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[1]/span/text()')
    # Comment author
    authors = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/a/text()')
    # Rating
    stars = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/span[2]/@title')
    # Comment content
    contents = html.xpath('//div[@class="comment-item"]/div[2]/p/span/text()')
    # Zip the fields together
    data = zip(agrees, authors, stars, contents)
    # Return the result
    return data
(3) Save data
We save the data as a TXT, JSON, or CSV file, depending on the format the user chooses:
import json
import csv
# Open the output file
def openfile(fm):
    fd = None
    if fm == 'txt':
        fd = open('douban_comment.txt', 'w', encoding='utf-8')
    elif fm == 'json':
        fd = open('douban_comment.json', 'w', encoding='utf-8')
    elif fm == 'csv':
        fd = open('douban_comment.csv', 'w', encoding='utf-8', newline='')
    return fd
# Save the data to the file
def save2file(fm, fd, data):
    if fm == 'txt':
        for item in data:
            fd.write('----------------------------------------\n')
            fd.write('agree:' + str(item[0]) + '\n')
            fd.write('author:' + str(item[1]) + '\n')
            fd.write('star:' + str(item[2]) + '\n')
            fd.write('content:' + str(item[3]) + '\n')
    if fm == 'json':
        temp = ('agree', 'author', 'star', 'content')
        for item in data:
            json.dump(dict(zip(temp, item)), fd, ensure_ascii=False)
            # Write one JSON object per line so the file can be read back line by line
            fd.write('\n')
    if fm == 'csv':
        writer = csv.writer(fd)
        for item in data:
            writer.writerow(item)
2. Code implementation
Note that this program asks the user to enter a movie ID in order to construct the initial URL. For example:
If the movie's link is: movie.douban.com/subject/269…
then the movie ID is: 26985127
[PS: This method is not very user-friendly, but due to limits on my skill and time I have not yet come up with a better solution.
The original idea was for the user to enter a movie name and have the program map the name to the movie ID automatically in order to construct the initial URL.]
import requests
from lxml import etree
import re
import json
import csv
import time
import random
# Get the source code of the web page
def get_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    html = response.text
    return html
# Parse the page source code to get the link to the next page
def parse4link(html, base_url):
    link = None
    html_elem = etree.HTML(html)
    url = html_elem.xpath('//div[@id="paginator"]/a[@class="next"]/@href')
    if url:
        link = base_url + url[0]
    return link
# Parse the page source code to extract the data
def parse4data(html):
    html = etree.HTML(html)
    agrees = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[1]/span/text()')
    authors = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/a/text()')
    stars = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/span[2]/@title')
    contents = html.xpath('//div[@class="comment-item"]/div[2]/p/span/text()')
    data = zip(agrees, authors, stars, contents)
    return data
# Open the output file
def openfile(fm):
    fd = None
    if fm == 'txt':
        fd = open('douban_comment.txt', 'w', encoding='utf-8')
    elif fm == 'json':
        fd = open('douban_comment.json', 'w', encoding='utf-8')
    elif fm == 'csv':
        fd = open('douban_comment.csv', 'w', encoding='utf-8', newline='')
    return fd
# Save the data to the file
def save2file(fm, fd, data):
    if fm == 'txt':
        for item in data:
            fd.write('----------------------------------------\n')
            fd.write('agree:' + str(item[0]) + '\n')
            fd.write('author:' + str(item[1]) + '\n')
            fd.write('star:' + str(item[2]) + '\n')
            fd.write('content:' + str(item[3]) + '\n')
    if fm == 'json':
        temp = ('agree', 'author', 'star', 'content')
        for item in data:
            json.dump(dict(zip(temp, item)), fd, ensure_ascii=False)
            # Write one JSON object per line so the file can be read back line by line
            fd.write('\n')
    if fm == 'csv':
        writer = csv.writer(fd)
        for item in data:
            writer.writerow(item)
# Start crawling
def crawl():
    movieID = input('Please enter the movie ID: ')
    while not re.match(r'\d{8}', movieID):
        movieID = input('Input error, please re-enter the movie ID: ')
    base_url = 'https://movie.douban.com/subject/' + movieID + '/comments'
    fm = input('Please enter the output format (txt, json, csv): ')
    while fm != 'txt' and fm != 'json' and fm != 'csv':
        fm = input('Input error, please re-enter the output format (txt, json, csv): ')
    fd = openfile(fm)
    print('Start crawling')
    link = base_url
    while link:
        print('Crawling ' + str(link) + ' ...')
        html = get_page(link)
        link = parse4link(html, base_url)
        data = parse4data(html)
        save2file(fm, fd, data)
        time.sleep(random.random())
    fd.close()
    print('End of crawl')

if __name__ == '__main__':
    crawl()
After writing the code, let's run it and see what happens:
Hmm, that's strange. Why are there only 11 pages of comments? That can't be right; A Good Play has over 100,000 short comments.
Let's open the last link directly in the browser:
It turns out that comments beyond page 11 require logging in, so we have to simulate a login.
Here we use the simplest way to simulate a login, namely Cookies, obtained manually (the lazy approach).
In short, a Cookie is data stored on the user's machine to record user information.
When we log in through the browser, our login information is recorded in the Cookie.
After that, the browser automatically attaches the Cookie to every request header, indicating that the request comes from that particular user.
So how do we get the Cookie? It is also very simple: log in to the Douban movie site in the browser, then inspect the captured request to find the Cookie header.
Finally, all we need to do is copy the Cookie value and send it along in the request headers, and we can continue happily crawling the comments; a minimal sketch is shown below.
[PS: Pay attention to the validity period of the Cookie and use it as soon as possible after obtaining it.]
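The sketch below shows one way the copied Cookie could be attached to the request headers in get_page. The COOKIE value is a placeholder, not a real credential; you would paste in the string copied from your own logged-in browser session:

import requests

# Placeholder: replace with the Cookie string copied from your own logged-in browser session
COOKIE = 'paste_your_cookie_string_here'

# Get the source code of the web page, sending the Cookie along with the request
def get_page(url):
    # The Cookie plus the User-Agent makes the request look like it comes from a logged-in user
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        'Cookie': COOKIE
    }
    response = requests.get(url=url, headers=headers)
    return response.text

With this change, the rest of the crawler stays the same; only get_page needs to be replaced.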