I. Project Overview

1. Project Description

This project captures the full content of all messages on the Leadership Message Board: the message details, reply details, and evaluation details are extracted and saved for subsequent data analysis and further processing, which can provide a basis for government decision-making and the implementation of e-government. The website is liuyan.people.com.cn/home?p=0. Select any message and click through to its details page; all of the annotated fields on that page are crawled and together form one message record.

2. Configure the environment

(1) Python: 3.x

(2) Third-party libraries:

  • dateutil
    • Install with: pip install python-dateutil
  • selenium
    • Install with: pip install selenium

(3) chromedriver: the driver used to automate Chrome. It can be downloaded from download.csdn.net/download/CU… (built for Chrome version 80.0.3987.16), or the version matching your own Chrome can be downloaded from chromedriver.storage.googleapis.com/index.html. Place it in the Scripts directory of the Python installation directory.
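
Before running the crawler, it is worth checking that Selenium can actually find chromedriver. The snippet below is only a sketch under that setup; the executable_path in the comment is an example path, not part of the project:

from selenium import webdriver

# If chromedriver is not on the PATH, pass its location explicitly (Selenium 3 API), e.g.
# driver = webdriver.Chrome(executable_path=r'C:\Python38\Scripts\chromedriver.exe')  # example path only
driver = webdriver.Chrome()
driver.get('http://liuyan.people.com.cn/home?p=0')
print(driver.title)   # a page title means Chrome and chromedriver are wired up correctly
driver.quit()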

II. Project implementation

1. Import required libraries

import csv
import os
import random
import re
import time

import dateutil.parser as dparser
from random import choice
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options

This step imports the standard-library modules needed during the crawl and the Selenium classes used to drive the browser.

2. Configure global variables and parameters

# Cut-off date: only messages from this date onwards are crawled
start_date = dparser.parse('2019-06-01')
# Browser options: disable image loading to save bandwidth
chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false')

Only messages posted after 2019-06-01 are crawled, because earlier messages were given automatic positive reviews and have no reference value. A cut-off date is therefore set as a global variable, and image loading is disabled so the pages need less bandwidth and load faster.
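
As a small illustration of how the cut-off is applied later in the crawl (the date string here is invented):

msg_date = dparser.parse('2019-05-20')   # example date taken from a message
if msg_date < start_date:
    print('skip: posted before 2019-06-01')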

3. Generate random delays and user agents

def get_time():
    """Return a random delay, in seconds, between 3 and 6."""
    return round(random.uniform(3, 6), 1)


def get_user_agent():
    """Return a randomly chosen user-agent string."""
    user_agents = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1",
        "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36",
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    ]
    # Pick a random entry from the list so requests look like they come from different browsers
    user_agent = choice(user_agents)
    return user_agent


A random delay and a randomly chosen user agent are used for every page visit, which lowers the chance of the server identifying the traffic as a crawler and blocking it.
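
The two helpers are used together whenever a new browser session starts, roughly as follows (this is the pattern used in the functions below):

# Attach a random user agent to the Chrome options, then pause for a random interval.
chrome_options.add_argument('user-agent=%s' % get_user_agent())
time.sleep(get_time())   # wait 3-6 seconds between requests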

4. Obtain the FID of the leader

def get_fid():
    """Read all leader fids from the local text file."""
    with open('url_fid.txt', 'r') as f:
        content = f.read()
        fids = content.split()
    return fids

Each leader has an fid that distinguishes them from the others. In this version the fids are collected manually and saved to url_fid.txt, which is read in when the crawl starts.
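
For reference, url_fid.txt is just a whitespace-separated list of fids, so reading it back looks like this (the values shown are placeholders, not real board ids):

fids = get_fid()
print(fids)   # e.g. ['12345', '23456'] -- placeholder fids for illustration only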

5. Obtain the message links

def get_detail_urls(position, list_url):
    """Get the links to all of one leader's messages."""
    user_agent = get_user_agent()
    chrome_options.add_argument('user-agent=%s' % user_agent)
    drivertemp = webdriver.Chrome(options=chrome_options)
    drivertemp.maximize_window()
    drivertemp.get(list_url)
    time.sleep(2)
    # Keep loading the list until messages older than the cut-off date appear
    while True:
        datestr = WebDriverWait(drivertemp, 10).until(
            lambda driver: driver.find_element_by_xpath(
                '//*[@id="list_content"]/li[position()=last()]/h3/span')).text.strip()
        datestr = re.search(r'\d{4}-\d{2}-\d{2}', datestr).group()
        date = dparser.parse(datestr, fuzzy=True)
        print('Crawling links --', position, '--', date)
        if date < start_date:
            break
        # Scroll down and click the "load more" button
        try:
            WebDriverWait(drivertemp, 50, 2).until(EC.element_to_be_clickable((By.ID, "show_more")))
            drivertemp.execute_script('window.scrollTo(document.body.scrollHeight, document.body.scrollHeight - 600)')
            time.sleep(get_time())
            drivertemp.execute_script('window.scrollTo(document.body.scrollHeight - 600, document.body.scrollHeight)')
            WebDriverWait(drivertemp, 50, 2).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="show_more"]')))
            drivertemp.find_element_by_xpath('//*[@id="show_more"]').click()
        except:
            break
        time.sleep(get_time() - 1)
    detail_elements = drivertemp.find_elements_by_xpath('//*[@id="list_content"]/li/h2/b/a')
    # Yield every message link one at a time
    for element in detail_elements:
        detail_url = element.get_attribute('href')
        yield detail_url
    drivertemp.quit()

Using the fid from step 4, this function finds the links to all of that leader's messages. The message list is not displayed all at once: there is a "load more" button below the list, so the function repeatedly scrolls down and clicks it until messages older than the cut-off date are reached. Instead of returning everything as one list, the function uses the yield keyword to create a generator that produces URLs at the pace the caller consumes them, reducing the pressure on memory.
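
If the generator pattern is unfamiliar, here is a minimal standalone illustration of how yield hands back values one at a time instead of building a full list:

def count_up(n):
    """Yield the numbers 0 .. n-1 one at a time."""
    for i in range(n):
        yield i          # execution pauses here until the caller asks for the next value

for value in count_up(3):
    print(value)         # prints 0, 1, 2, each produced on demand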

6. Obtain message details

def get_message_detail(driver, detail_url, writer, position):
    """Get the details of one message and write them to the CSV file."""
    print('Crawling message --', position, '--', detail_url)
    driver.get(detail_url)
    # Skip the message if it has no evaluation
    try:
        satis_degree = WebDriverWait(driver, 2.5).until(
            lambda driver: driver.find_element_by_class_name("sec-score_firstspan")).text.strip()
    except:
        return
    # Extract every part of the message
    message_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[6]/h3/span")).text
    message_date = re.search(r'\d{4}-\d{2}-\d{2}', message_date_temp).group()
    message_datetime = dparser.parse(message_date, fuzzy=True)
    if message_datetime < start_date:
        return
    message_title = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_class_name("context-title-text")).text.strip()
    label_elements = WebDriverWait(driver, 2.5).until(lambda driver: driver.find_elements_by_class_name("domainType"))
    try:
        label1 = label_elements[0].text.strip()
        label2 = label_elements[1].text.strip()
    except:
        label1 = ''
        label2 = label_elements[0].text.strip()
    message_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[6]/p")).text.strip()
    replier = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/h3[1]/i")).text.strip()
    reply_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/p")).text.strip()
    reply_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/h3[2]/em")).text
    reply_date = re.search(r'\d{4}-\d{2}-\d{2}', reply_date_temp).group()
    review_scores = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_elements_by_xpath("/html/body/div[8]/ul/li[2]/h4[1]/span/span/span"))
    resolve_degree = review_scores[0].text.strip()[:-1]
    handle_atti = review_scores[1].text.strip()[:-1]
    handle_speed = review_scores[2].text.strip()[:-1]
    review_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[2]/p")).text.strip()
    is_auto_review = 'yes' if (('Automatic praise' in review_content) or ('Default rating' in review_content)) else 'no'
    review_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[2]/h4[2]/em")).text
    review_date = re.search(r'\d{4}-\d{2}-\d{2}', review_date_temp).group()
    # Save the record to the CSV file
    writer.writerow(
        [position, message_title, label1, label2, message_date, message_content, replier, reply_content, reply_date,
         satis_degree, resolve_degree, handle_atti, handle_speed, is_auto_review, review_content, review_date])

Only messages that have been evaluated are needed, so messages without an evaluation are filtered out at the start. The remaining elements are then located with XPath or class-name selectors to extract each part of the message. Every message yields 16 fields, which are saved as one row of the CSV file.
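
All of the date fields above go through the same pattern: a regular expression pulls out the YYYY-MM-DD part and dateutil turns it into a comparable datetime. A small standalone example (the sample string is made up):

text = 'Message time: 2020-03-15 08:30'                     # invented sample text
date_str = re.search(r'\d{4}-\d{2}-\d{2}', text).group()    # '2020-03-15'
date = dparser.parse(date_str, fuzzy=True)
print(date >= start_date)                                   # True for this sample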

7. Obtain and save all of one leader's messages

def get_officer_messages(index, fid):
    """Get and save all messages addressed to one leader."""
    user_agent = get_user_agent()
    chrome_options.add_argument('user-agent=%s' % user_agent)
    driver = webdriver.Chrome(options=chrome_options)
    list_url = "http://liuyan.people.com.cn/threads/list?fid={}#state=4".format(fid)
    driver.get(list_url)
    position = WebDriverWait(driver, 10).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[4]/i")).text
    # time.sleep(get_time())
    print(index, '-- crawling --', position)
    start_time = time.time()
    # The CSV is written with gb18030 encoding so the Chinese text opens correctly in Excel
    csv_name = position + '.csv'
    # If the file already exists, delete it and create it again
    if os.path.exists(csv_name):
        os.remove(csv_name)
    with open(csv_name, 'a+', newline='', encoding='gb18030') as f:
        writer = csv.writer(f, dialect="excel")
        writer.writerow(
            ['Position name', 'Message title', 'Message Tag 1', 'Message Tag 2', 'Message date', 'Message Content',
             'Responder', 'Reply content', 'Reply Date', 'Satisfaction', 'Degree of solution', 'Attitude score',
             'Processing speed points', 'Is it automatically favorable?', 'Evaluation content', 'Evaluation Date'])
        for detail_url in get_detail_urls(position, list_url):
            get_message_detail(driver, detail_url, writer, position)
            time.sleep(get_time())
    end_time = time.time()
    crawl_time = int(end_time - start_time)
    crawl_minute = crawl_time // 60
    crawl_second = crawl_time % 60
    print(position, 'finished crawling!')
    print('Time taken: {} minutes {} seconds.'.format(crawl_minute, crawl_second))
    driver.quit()
    time.sleep(5)

This function reads the leader's position title from the list page and creates a separate CSV file for that leader to hold the extracted messages. It then calls get_message_detail() for each message link to fetch and save the details, and records how long the crawl for that leader took. A short usage example follows.
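
For reference, one leader can be crawled on its own like this (the fid below is a placeholder for illustration, not a real board id):

# '12345' is a placeholder fid; replace it with a real value from url_fid.txt.
get_officer_messages(1, '12345')
# This writes '<position name>.csv' (gb18030-encoded) into the working directory.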

8. Merge files

def merge_csv():
    """Merge every leader's CSV file into one DATA.csv."""
    file_list = os.listdir('.')
    csv_list = []
    for file in file_list:
        # Collect the per-leader CSV files, excluding any previously merged output
        if file.endswith('.csv') and file != 'DATA.csv':
            csv_list.append(file)
    # If the merged file already exists, delete it and create it again
    if os.path.exists('DATA.csv'):
        os.remove('DATA.csv')
    with open('DATA.csv', 'a+', newline='', encoding='gb18030') as f:
        writer = csv.writer(f, dialect="excel")
        writer.writerow(
            ['Position name', 'Message title', 'Message Tag 1', 'Message Tag 2', 'Message date', 'Message Content',
             'Responder', 'Reply content', 'Reply Date', 'Satisfaction', 'Degree of solution', 'Attitude score',
             'Processing speed points', 'Is it automatically favorable?', 'Evaluation content', 'Evaluation Date'])
        for csv_file in csv_list:
            with open(csv_file, 'r', encoding='gb18030') as csv_f:
                reader = csv.reader(csv_f)
                line_count = 0
                for line in reader:
                    line_count += 1
                    # Skip each file's header row
                    if line_count != 1:
                        writer.writerow(
                            (line[0], line[1], line[2], line[3], line[4], line[5], line[6], line[7], line[8],
                             line[9], line[10], line[11], line[12], line[13], line[14], line[15]))

This merges the per-leader CSV files produced by the crawl into a single DATA.csv.

9. Main function call

def main():
    """Main entry point."""
    fids = get_fid()
    print('Crawler starts execution:')
    s_time = time.time()
    for index, fid in enumerate(fids):
        try:
            get_officer_messages(index + 1, fid)
        except:
            # Retry once if the crawl for this leader fails
            get_officer_messages(index + 1, fid)
    print('Crawler execution completed!')
    print('Start merging files:')
    merge_csv()
    print('File merge finished!')
    e_time = time.time()
    c_time = int(e_time - s_time)
    c_minute = c_time // 60
    c_second = c_time % 60
    print('{} leaders crawled in a total of {} minutes {} seconds.'.format(len(fids), c_minute, c_second))


if __name__ == '__main__':
    # Execute the main function
    main()

The main function first crawls every leader's messages and then merges all of the data files, completing the whole crawl. It also measures the total running time, which makes it easy to assess the crawler's efficiency.

III. Results, analysis and description

1. Description of results

The complete code and test results can be downloaded from download.csdn.net/download/CU… for learning and exchange only; please do not abuse them. The whole run takes a long time because the crawler is single-threaded, so one leader's data must be fully crawled before the next one can start. I selected 10 leaders for testing and ran the crawler on a cloud server: the total running time was nearly 5 hours, which is fairly inefficient and leaves plenty of room for improvement. The final output is the merged DATA.csv.

2. Improvement analysis

(1) This version does not crawl the fids automatically; they have to be collected and saved by hand, which is one shortcoming that can be improved later. (2) The message detail pages are also fetched through Selenium, which slows down the requests; the Requests library could be used for those pages instead. (3) This version is single-process and single-threaded, so the next leader can only be crawled after the previous one finishes, which is inefficient, especially for leaders with many messages. Multiprocessing or multithreading could be used to speed this up, as sketched below.
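
As one possible direction for point (3), a thread pool could crawl several leaders at the same time. This is only a sketch, under the assumption that each worker opens its own Chrome instance (as get_officer_messages already does) and that the overall crawl rate stays polite:

from concurrent.futures import ThreadPoolExecutor

def crawl_all_concurrently(fids, max_workers=3):
    """Crawl several leaders in parallel; each task still drives its own browser."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for index, fid in enumerate(fids):
            pool.submit(get_officer_messages, index + 1, fid)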

3. Statement on legality

  • This project is for learning and scientific research only. Readers may refer to the ideas and the code, but must not use them for malicious or illegal purposes (such as attacking the website's servers or making illegal profits); anyone who does so bears the responsibility themselves.
  • The data obtained by this project is meant, after further analysis, to support improvements in e-government and to serve as a reference for government decision-making. It is not used to gain an unfair competitive advantage by maliciously scraping data, nor for commercial purposes or illegal gain. The code was run with only a few fids for testing rather than a large-scale crawl, and the crawl rate is strictly limited so as not to put pressure on the server. If the interests of the affected party (the website being crawled) are infringed, please get in touch so the content can be changed or removed.
  • This project is the first in a series on crawling the message board and will continue to be updated.