In this article, we continue to use Requests and XPath, this time to extract the short comments of a Douban movie.
1. Web page analysis
(1) Turn the page
We still use the Chrome browser to open the short-comment page of a Douban movie for analysis, taking A Good Play as the example.
As before, we could fetch every page by constructing each URL in advance, but this time we will try a different approach: following the link to the next page.
Press Ctrl+Shift+I to open the developer tools, then Ctrl+Shift+C to enable the element picker.
Now click the "Next" link at the bottom of the page, and the developer tools will jump to the corresponding position in the page source.
Next we use xpath to match the link address of the next page:
html.xpath('//div[@id="paginator"]/a[@class="next"]/@href')
In this way, we can simply loop: fetch a page, extract the link to the next page, and repeat until there is no next page.
The core code is as follows:
# Get the source code of the web page
def get_page(url):
    # Construct the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Send the request and get the response
    response = requests.get(url=url, headers=headers)
    # Get the source code of the page
    html = response.text
    # Return the page source code
    return html

# Parse the page source code to get the link to the next page
def parse4link(html, base_url):
    # Initialize the return value
    link = None
    # Construct the _Element object
    html_elem = etree.HTML(html)
    # Match the link to the next page; note that it is a relative address
    url = html_elem.xpath('//div[@id="paginator"]/a[@class="next"]/@href')
    # If the match succeeds, join it with the base URL to form a complete link
    if url:
        link = base_url + url[0]
    return link
(2) Analyze web content
The data we need this time includes (again matched with XPath):
- Number of upvotes:
//div[@class="comment-item"]/div[2]/h3/span[1]/span/text()
- Comment author:
//div[@class="comment-item"]/div[2]/h3/span[2]/a/text()
- Rating:
//div[@class="comment-item"]/div[2]/h3/span[2]/span[2]/@title
- Comment content:
//div[@class="comment-item"]/div[2]/p/span/text()
The core code is as follows:
# Parse the page source code to extract the data
def parse4data(html):
    # Construct the _Element object
    html = etree.HTML(html)
    # Number of upvotes
    agrees = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[1]/span/text()')
    # Comment author
    authors = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/a/text()')
    # Rating
    stars = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/span[2]/@title')
    # Comment content
    contents = html.xpath('//div[@class="comment-item"]/div[2]/p/span/text()')
    # Zip the fields together
    data = zip(agrees, authors, stars, contents)
    # Return the result
    return data
(3) Save data
We save the data as a TXT, JSON, or CSV file, depending on the format the user chooses:
import json
import csv
# Open the output file
def openfile(fm):
    fd = None
    if fm == 'txt':
        fd = open('douban_comment.txt', 'w', encoding='utf-8')
    elif fm == 'json':
        fd = open('douban_comment.json', 'w', encoding='utf-8')
    elif fm == 'csv':
        fd = open('douban_comment.csv', 'w', encoding='utf-8', newline='')
    return fd
# Save the data to the file
def save2file(fm, fd, data):
    if fm == 'txt':
        for item in data:
            fd.write('----------------------------------------\n')
            fd.write('agree:' + str(item[0]) + '\n')
            fd.write('author:' + str(item[1]) + '\n')
            fd.write('star:' + str(item[2]) + '\n')
            fd.write('content:' + str(item[3]) + '\n')
    if fm == 'json':
        temp = ('agree', 'author', 'star', 'content')
        for item in data:
            json.dump(dict(zip(temp, item)), fd, ensure_ascii=False)
            # Write one JSON object per line so the file can be read back line by line
            fd.write('\n')
    if fm == 'csv':
        writer = csv.writer(fd)
        for item in data:
            writer.writerow(item)
2. Code implementation
Note that this program asks the user to enter a movie ID in order to construct the initial URL. For example:
If the movie's link is: movie.douban.com/subject/269…
then the movie ID is: 26985127
[PS: This method is not very user-friendly, but due to limits on my skill and time I have not yet come up with a better solution.
The original idea was for the user to enter a movie name and have the program map the name to the movie ID automatically in order to construct the initial URL.]
import requests
from lxml import etree
import re
import json
import csv
import time
import random
# Get the source code of the web page
def get_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    html = response.text
    return html
# Parse the page source code to get the link to the next page
def parse4link(html, base_url):
    link = None
    html_elem = etree.HTML(html)
    url = html_elem.xpath('//div[@id="paginator"]/a[@class="next"]/@href')
    if url:
        link = base_url + url[0]
    return link
# Parse the page source code to extract the data
def parse4data(html):
    html = etree.HTML(html)
    agrees = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[1]/span/text()')
    authors = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/a/text()')
    stars = html.xpath('//div[@class="comment-item"]/div[2]/h3/span[2]/span[2]/@title')
    contents = html.xpath('//div[@class="comment-item"]/div[2]/p/span/text()')
    data = zip(agrees, authors, stars, contents)
    return data
# Open the output file
def openfile(fm):
    fd = None
    if fm == 'txt':
        fd = open('douban_comment.txt', 'w', encoding='utf-8')
    elif fm == 'json':
        fd = open('douban_comment.json', 'w', encoding='utf-8')
    elif fm == 'csv':
        fd = open('douban_comment.csv', 'w', encoding='utf-8', newline='')
    return fd
# Save the data to the file
def save2file(fm, fd, data):
    if fm == 'txt':
        for item in data:
            fd.write('----------------------------------------\n')
            fd.write('agree:' + str(item[0]) + '\n')
            fd.write('author:' + str(item[1]) + '\n')
            fd.write('star:' + str(item[2]) + '\n')
            fd.write('content:' + str(item[3]) + '\n')
    if fm == 'json':
        temp = ('agree', 'author', 'star', 'content')
        for item in data:
            json.dump(dict(zip(temp, item)), fd, ensure_ascii=False)
            # Write one JSON object per line so the file can be read back line by line
            fd.write('\n')
    if fm == 'csv':
        writer = csv.writer(fd)
        for item in data:
            writer.writerow(item)
# Start crawling
def crawl():
    movieID = input('Please enter the movie ID: ')
    while not re.match(r'\d{8}', movieID):
        movieID = input('Input error, please re-enter the movie ID: ')
    base_url = 'https://movie.douban.com/subject/' + movieID + '/comments'
    fm = input('Please enter the output format (txt, json, csv): ')
    while fm != 'txt' and fm != 'json' and fm != 'csv':
        fm = input('Input error, please re-enter the output format (txt, json, csv): ')
    fd = openfile(fm)
    print('Start crawling')
    link = base_url
    while link:
        print('Crawling ' + str(link) + ' ...')
        html = get_page(link)
        link = parse4link(html, base_url)
        data = parse4data(html)
        save2file(fm, fd, data)
        time.sleep(random.random())
    fd.close()
    print('End of crawl')

if __name__ == '__main__':
    crawl()
After writing the code, let's run it and see what happens:
Hmm, that's strange. Why are there only 11 pages of comments? That can't be right; A Good Play has over 100,000 short comments.
Let's open the last link directly in the browser:
It turns out that comments beyond page 11 require logging in, so we have to simulate a login.
Here we use the simplest way to simulate a login, namely Cookies, obtained manually (the lazy approach).
In short, a Cookie is data stored on the user's machine to record user information.
When we log in through the browser, our login information is recorded in the Cookie.
After that, the browser automatically attaches the Cookie to every request header, indicating that the request comes from that particular user.
So how do we get the Cookie? It is also very simple: log in to the Douban movie site in the browser, then inspect the captured request to find the Cookie header.
Finally, all we need to do is copy the Cookie value and send it along in the request headers, and we can continue happily crawling the comments; a minimal sketch is shown below.
[PS: Pay attention to the validity period of the Cookie and use it as soon as possible after obtaining it.]
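The sketch below shows one way the copied Cookie could be attached to the request headers in get_page. The COOKIE value is a placeholder, not a real credential; you would paste in the string copied from your own logged-in browser session:

import requests

# Placeholder: replace with the Cookie string copied from your own logged-in browser session
COOKIE = 'paste_your_cookie_string_here'

# Get the source code of the web page, sending the Cookie along with the request
def get_page(url):
    # The Cookie plus the User-Agent makes the request look like it comes from a logged-in user
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        'Cookie': COOKIE
    }
    response = requests.get(url=url, headers=headers)
    return response.text

With this change, the rest of the crawler stays the same; only get_page needs to be replaced.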