
Preface

Using the requests module + XPath to crawl Douban movie comments.

Let’s have a good time

Development tools

**Python version:** 3.6.4

Related modules:

Requests module;

Jieba module;

Pandas module;

Numpy module;

Pyecharts module;

And some modules that come with Python.

Environment setup

Install Python and add it to the environment variables, then use pip to install the required modules (see the example below).
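For example, the third-party modules used in this article can be installed in one go. This is just an illustration: lxml is added because the code below imports etree from it, and the exact versions are up to you.

```
pip install requests lxml jieba pandas numpy pyecharts
```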

Preparation

1. Obtain the page content

```python
import requests

# URL of the page to crawl
douban_url = 'https://movie.douban.com/subject/26647117/comments?status=P'
# Send the request with requests
get_response = requests.get(douban_url)
# Convert the returned response to text (the entire web page)
get_data = get_response.text
```
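Before parsing, it can help to confirm the request actually succeeded. A small optional check, using only attributes of the requests response object above:

```python
# Optional sanity check on the response before parsing
print(get_response.status_code)  # 200 means the page came back normally
print(get_response.encoding)     # the text encoding requests detected
```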

2. Analyze page content to get what we want

  • Open the page we want to crawl in the browser
  • Press F12 to open the developer tools and find where the data we want lives
  • Here we just need the commentator + the comment content

  • Analyze the xpath values we obtained:

```
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div[1]/div[2]/h3/span[2]/a'
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div[2]/div[2]/h3/span[2]/a'
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div[3]/div[2]/h3/span[2]/a'
```
  • Comparing these xpaths, they differ only in the index on the fourth div (div[1], div[2], div[3]). So to crawl every commentator at once, we just change the xpath to:

```
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/h3/span[2]/a'
```

That is, with no index after the div: when we query with this xpath, all matching nodes are captured automatically, as the toy snippet below shows.
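Here is a tiny self-contained lxml example (a toy HTML string, not the Douban page) showing why dropping the index works: an index-free div matches every sibling div, so the query returns all of them.

```python
from lxml import etree

# Toy HTML standing in for the Douban comment list
html = etree.HTML('<div id="comments"><div><a>user1</a></div><div><a>user2</a></div></div>')
# div without an index matches both child divs
print(html.xpath('//div[@id="comments"]/div/a/text()'))  # ['user1', 'user2']
```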

  • By the same analysis, the xpath for the comment content is:

```
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/p'
```
```python
from lxml import etree

# (Following the code above) parse the page and print the retrieved contents
a = etree.HTML(get_data)
commentator = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/h3/span[2]/a/text()')
comment_content = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/p/text()')
# Parse the content and remove the superfluous characters
for i in range(0, len(commentator)):
    comment = comment_content[i].strip('\n').strip(' ')
    print(commentator[i] + ' said:')
    print(comment)
```

The results

Oriol Paulo said: 'Wrath of Silence' is quite different from the crime movies I've seen. It's a mix of genres. It's a crime movie, a mystery movie, an action movie, it's also a social realistic movie. Xin Yu Kun plays very well the mix of different genres in this film, and it has a powerful ending.

'We should encourage young directors who are above average,' Mr. Wen said. 'We should punish those who are too old to say anything.'

Xilou dust said: The boss's son eats vacuum mutton, greedy ground into the meat machine; the butcher's son drinks polluted well water, justice is only on the TV screen. If he blinds his left eye, his fellow countryman who is stabbed will be able to protect him. Bite off the tongue, and the saved lawyer dare not speak. You can't build a pyramid by force, you can't become a mother rabbit by falsetto. The Superman mask is like a conscience mantra that cannot be sent back to its original owner; the poster fluttered like a charmer in the wind. The truth is buried in the dirt, hidden in the cave, and finally no one knows. The second work of Xin Yukun is not a show operation "Labyrinth of the Heart 2.0". When it comes to the style, it is like and unlike anyone else: Kubrick's single point perspective staring at the cave, the neurotic killer shaped like the Coen Brothers, the corridor fight like Old Boy... The difference is, it does not just tell you who the killer is, but his choices, and like a scalpel it cuts through the social problems of gaffes at the top, moral failure in the middle, aphasia at the bottom, and anomie on the human side. Jiang Wu guessed the ending from the moment he picked up the ashtray. But I was afraid that the well water was getting saltier and saltier. Why are so many people oedematous? The village head knows, or he wouldn't drink mineral water. However, this plot point is, in the end, not given too much of an account.

big meat pot said: The upper class is hypocritical and cruel, the middle class indifferent and selfish, the lower class speechless and powerless.

Wuxia little prince said: MOTOROLA's battery is still not as good as Nokia's.

Liu Xiaoyang said: At only 80 percent, the film is already fantastic. That's how Chinese genre films should be made. Good multi-line narrative control, deep mapping of human nature, the explosive growth of the economy, the explosion of uncontrollable social problems, men's silent resentment and pain, just like the voiceless people at the bottom. The dark ending, the child not found, the truth not revealed: this is the truth of society. Sometimes the wicked do evil only in order to become true allies with those who share their interests.

Europa says: It is the kind of film that descends downward, downward into the dark, about the major contradictions of society, and it is not responsible for providing puzzle-solving pleasure, so it is very heavy, very clogged. If Heart Maze was still a spontaneous creation of the manual age, Crack Silent is obviously a product of the industrial age (cast, action, special effects). The three of them compete against each other; the lawyer is too weak, Song Yang is too strong in battle, and Jiang Wu is a stereotype. Both the advantages and disadvantages are obvious.

The Bavarian Dionysian says: The ending was so fucking cool, you gasped in the theater. The innuendo is also awesome: the motorcycle license plate in 1984, a low-class loser set as dumb (no voice), the lawyer (representing the middle class and the law) and the coal boss (representing the powerful and the Mafia). So even if Zhang Baomin has the explosive force of Mianzhenghyuk in Yellow Sea, he can only become the victim of this cruel society.

Ling Rui said: When you look into the abyss, the abyss is also looking into you.

Frozenmoon says: Chang Wannian is a meat eater, Xu Wenjie is a soup drinker, and Zhang Baomin himself is "meat". They used to play their roles in their own places on the food chain, but accidentally destroyed everything. When Chang takes off his wig and suit, he has to surrender to violence and luck. Xu, without the protection of money and words, has to face cruelty. Zhang's price may be even greater. The crack of humanity.

Inglourious Basterds says: What struck me most about the film was not the obvious, or even too obvious, metaphors, but the whole film's aphasia. We belong to the "aphasia generation"; corresponding to the film, there is not only the surface mute Zhang Baomin's "physiological aphasia", but also the elite lawyer's choice of "active aphasia" at the end of the film. The film's accurate display of "aphasia" not only sensitively captures the pain point of the times, but is also extremely painful.

3. Turn pages and save commentators and comments into a CSV file

  • Step 1

Unlike the xpath analysis earlier, here we just need to find the differences and patterns in the URL from page to page.

```python
# The start parameter represents the starting position
turn_page1 = 'https://movie.douban.com/subject/26647117/comments?status=P'
turn_page2 = 'https://movie.douban.com/subject/26647117/comments?start=20&limit=20&sort=new_score&status=P'
turn_page3 = 'https://movie.douban.com/subject/26647117/comments?start=40&limit=20&sort=new_score&status=P'
turn_page4 = 'https://movie.douban.com/subject/26647117/comments?start=60&limit=20&sort=new_score&status=P'
```

The value of "start" increases by 20 per page, while "limit" fixes the number of comments each page returns at 20.
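A minimal sketch of generating these page URLs with format (the variable names here are illustrative, not from the original code):

```python
# Build the URL for each page: start = page index * 20
base = 'https://movie.douban.com/subject/26647117/comments?start={}&limit=20&sort=new_score&status=P'
page_urls = [base.format(page * 20) for page in range(4)]  # the first four pages
```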

  • Step 2
```python
# Get the total number of comments
comment_counts = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[1]/ul/li[1]/span/text()')
comment_counts = int(comment_counts[0].strip("Seen ()"))
# Calculate the total number of pages (20 comments per page)
page_counts = int(comment_counts / 20)
# Request each page and save the crawled data to a CSV file
for i in range(0, page_counts):
    turn_page_url = 'https://movie.douban.com/subject/26647117/comments?start={}&limit=20&sort=new_score&status=P'.format(i * 20)
    get_respones_data(turn_page_url)
```

Before we do that, we need to tidy up the code we wrote earlier: wrap it into a function get_respones_data() that takes the URL to visit and returns the parsed HTML.

Code implementation

```python
import requests
from lxml import etree
import pandas as pd

def get_respones_data(douban_url='https://movie.douban.com/subject/26647117/comments?status=P'):
    # Send the request with requests
    get_response = requests.get(douban_url)
    # Convert the returned response to text (the entire web page)
    get_data = get_response.text
    # Parse the page
    a = etree.HTML(get_data)
    return a

first_a = get_respones_data()
# Turn the pages
comment_counts = first_a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[1]/ul/li[1]/span/text()')
comment_counts = int(comment_counts[0].strip("Seen ()"))
page_counts = int(comment_counts / 20)
# The editor has tested this: without logging in, you can only access at most 10 pages (200 comments)
# The next installment will cover how to deal with anti-crawling measures
for i in range(0, page_counts + 1):
    turn_page_url = 'https://movie.douban.com/subject/26647117/comments?start={}&limit=20&sort=new_score&status=P'.format(i * 20)
    print(turn_page_url)
    a = get_respones_data(turn_page_url)
    # Get the commentators and comments
    commentator = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/h3/span[2]/a/text()')
    comment_content = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/p/text()')
    # Parse the content and store it in a CSV file
    content = ['' for _ in range(len(commentator))]
    for j in range(0, len(commentator)):
        comment = comment_content[j].strip('\n').strip(' ')
        content[j] = [commentator[j], comment]
    name = ['Critic', 'Comment content']
    file_test = pd.DataFrame(columns=name, data=content)
    if i == 0:
        file_test.to_csv(r'H:\PyCoding\FlaskCoding\Test_all\test0609\app\comment_content.csv', encoding='utf-8', index=False)
    else:
        file_test.to_csv(r'H:\PyCoding\FlaskCoding\Test_all\test0609\app\comment_content.csv', mode='a+', encoding='utf-8', index=False)
```
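One caveat about the code above: every append writes the 'Critic'/'Comment content' header row again, which is why the parsing step later has to skip rows. If you prefer a file with a single header, to_csv accepts a header argument; here is a sketch under that assumption (all_pages is a hypothetical stand-in for the rows collected per page above, not a variable from the original code):

```python
import pandas as pd

# all_pages is hypothetical: each element holds one page's [commentator, comment] rows
all_pages = [
    [['user1', 'comment 1'], ['user2', 'comment 2']],  # page 1
    [['user3', 'comment 3']],                          # page 2
]
csv_path = 'comment_content.csv'  # placeholder path
for i, rows in enumerate(all_pages):
    pd.DataFrame(rows, columns=['Critic', 'Comment content']).to_csv(
        csv_path, mode='w' if i == 0 else 'a+',
        encoding='utf-8', index=False, header=(i == 0))
```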

Data visualization

Installing new modules

```
pip install jieba
pip install pyecharts
pip install numpy
```

(The re and csv modules come with Python and do not need to be installed with pip.)

1 Parsing the data

```python
import codecs
import csv
import re

with codecs.open(r'H:\PyCoding\FlaskCoding\Test_all\test0609\app\comment_content.csv', 'r', 'utf-8') as csvfile:
    content = ''
    reader = csv.reader(csvfile)
    i = 0
    for file1 in reader:
        # Skip the header rows
        if i == 0 or i == 1:
            pass
        else:
            content = content + file1[1]
        i = i + 1
    # Remove all superfluous characters from the comments
    content = re.sub('[,,。. \r\n]', '', content)
```

2 Analyzing the data

```python
import jieba
import numpy
import pandas as pd

# Break the comments down into words
segment = jieba.lcut(content)
words_df = pd.DataFrame({'segment': segment})
# quoting=3 means strings in stopwords.txt are not quoted
stopwords = pd.read_csv(r"H:\PyCoding\FlaskCoding\Test_all\test0609\app\stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
# Filter out the stop words
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
# Count the number of times each word appears
words_stat = words_df.groupby(by=['segment'])['segment'].agg({"Count": numpy.size})
words_stat = words_stat.reset_index().sort_values(by=["Count"], ascending=False)
```
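Note that on recent pandas versions the .agg({"Count": numpy.size}) call above raises an error (renaming a column through a dict was deprecated and later removed). An equivalent count, assuming the same words_df as above:

```python
# Equivalent word count that also works on newer pandas versions
words_stat = (words_df.groupby('segment')['segment']
              .count()
              .reset_index(name='Count')
              .sort_values(by='Count', ascending=False))
```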

3 Visualizing the data

```python
from pyecharts import WordCloud

test = words_stat.head(1000).values
# Get all the words
words = [test[i][0] for i in range(0, len(test))]
# Get the number of occurrences of each word
counts = [test[i][1] for i in range(0, len(test))]
# Generate the word cloud
wordcloud = WordCloud(width=1300, height=620)
wordcloud.add("Crack and Silence.", words, counts, word_size_range=[20, 100])
wordcloud.render()
```
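With the pyecharts 0.x API used in this article, render() writes the chart to render.html in the working directory by default; in 0.5.x you can also pass a path to choose the output file:

```python
# Write the word cloud to a specific file instead of the default render.html
wordcloud.render(path='wordcloud.html')
```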

Results display