Versailles literature is in vogue. This special style of online writing, often found in Moments posts or on microblogs, shows off wealth and a charmed love life in a deliberately casual, placid tone. Ordinary ostentation means posting photos of sports cars on social media or "inadvertently" revealing the logo of a designer bag; Versailles literature is more understated. Microbloggers have even created an instructional video on Versailles literature explaining its three essential elements. On Douban there is also a study group called Versailles, whose members define it as the spirit of performing a higher life.

OK, to get to the topic: today we will quickly crawl the answers to a Zhihu question about Versailles quotations.

1. The website to crawl

Search Zhihu for "Versailles quotations"; the second result fits best, so we will use that one.

Click into it, and you can see the question has 393 answers.

Website: www.zhihu.com/question/42…

Remove the answer part, and what is left is the URL of the question to crawl. The string of numbers at the end of that URL is the question id: www.zhihu.com/question/42…
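If you want to pull the id out programmatically rather than by eye, a minimal sketch could look like this (the regular expression and the example URL string are illustrative, not part of the original script):

import re

# Hypothetical helper: extract the numeric question id from a Zhihu question URL
question_url = "https://www.zhihu.com/question/429548386"
question_id = re.search(r"/question/(\d+)", question_url).group(1)
print(question_id)  # 429548386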

2. Crawl the answers to the question

Looking at the above URL, we need to crawl two kinds of data:

  1. The question details, including the creation time, number of followers, number of views, the question description, etc.
  2. The answers, including each answerer's username, follower count and other profile information, plus the content of each answer, its publication time, comment count, upvote count, etc.

The answers can then be fetched through the API link below by setting the page size (limit) and the starting offset (offset), much like an ordinary paginated crawl.

def init_url(question_id, limit, offset):
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed" \
                   "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by" \
                   "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count" \
                   "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info" \
                   "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting" \
                   "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B" \
                   "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics" \
                   "&limit={0}&offset={1}".format(limit, offset)

    return base_url_start + question_id + base_url_end

Set the number of answers per page to limit=20; offset then takes the values 0, 20, 40, and so on. question_id is the string of numbers at the end of the URL mentioned above, here 429548386. Once the logic is clear, the data can be fetched by the crawler. The complete crawler code follows; when running it, you only need to change the question id.
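Before looking at the full script, here is a minimal sketch of how the per-page URLs are generated, assuming the init_url function shown above:

# Sketch: build the URLs for the first three pages of answers
question_id = "429548386"
limit = 20
for page in range(3):
    offset = page * limit  # 0, 20, 40
    print(init_url(question_id, limit, offset))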

3. Complete code

# Import the required libraries
import json
import re
import time
from datetime import datetime
from time import sleep
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import random
import warnings
warnings.filterwarnings('ignore')


def get_ua():
    """ Select a random UA from the UA list @return: a random User-Agent string """
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]

    return random.choice(ua_list)
    

def filter_emoij(text):
    """ Filter emoji characters out of the text @param text: @return: the filtered text """
    try:
        co = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
    text = co.sub(' ', text)

    return text


def get_question_base_info(url):
    """ Get a detailed description of the question @param url: @return: """
    response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)

    """ Get the data and parse it. ""
    soup = BeautifulSoup(response.text, 'lxml')
    # Question title
    title = soup.find("h1", {"class": "QuestionHeader-title"}).text
    # Specific questions
    question = ' '
    try:
        question = soup.find("div", {"class": "QuestionRichText--collapsed"}).text.replace('\u200b', '')
    except Exception as e:
        print(e)
    # followers
    follower = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[0].text.strip().replace(",", ""))
    # number of views
    watched = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[1].text.strip().replace(",", ""))
    # Number of answers to the question
    answer_str = soup.find_all("h4", {"class": "List-headerText"})[0].span.text.strip()
    # extract the leading number from "xxx answers" (\d* matches zero or more digits)
    answer_count = int(re.findall(r'\d*', answer_str)[0])

    # Question tag
    tag_list = []
    tags = soup.find_all("div", {"class": "QuestionTopic"})
    for tag in tags:
        tag_list.append(tag.text)

    return title, question, follower, watched, answer_count, tag_list


def init_url(question_id, limit, offset):
    """ Construct the url for each page visited @param question_id: @param limit: @param offset: @return: """
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = "/answers? include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed" \
                   "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by" \
                   "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count" \
                   "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info" \
                   "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting" \
                   "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B" \
                   "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics" \
                   "&limit={0}&offset={1}".format(limit, offset)

    return base_url_start + question_id + base_url_end


def get_time_str(timestamp):
    """ Convert a timestamp to a standard date string @param timestamp: @return: """
    datetime_str = ' '
    try:
        # convert the timestamp to a datetime object
        datetime_time = datetime.fromtimestamp(timestamp)
        # datetime Time format to date string
        datetime_str = datetime_time.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(e)
        print("Date conversion error")

    return datetime_str


def get_answer_info(url, index):
    """ Crawl one page of answers @param url: @param index: @return: """
    response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)
    text = response.text.replace('\u200b', '')

    per_answer_list = []
    try:
        question_json = json.loads(text)
        """ Get the answer data for the current page """
        print("Climb answer list on page {0}, get {1} answer on current page".format(index + 1.len(question_json["data")))for data in question_json["data"] :""" Information about the problem """
            # question type, ID, question type, creation time, modification time
            question_type = data["question"] ['type']
            question_id = data["question"] ['id']
            question_question_type = data["question"] ['question_type']
            question_created = get_time_str(data["question"] ['created'])
            question_updated_time = get_time_str(data["question"] ['updated_time'])

            """ Relevant information of the respondent """
            # User name, signature, gender, number of followers
            author_name = data["author"] ['name']
            author_headline = data["author"] ['headline']
            author_gender = data["author"] ['gender']
            author_follower_count = data["author"] ['follower_count']

            """ Relevant information about the answer """
            # Question answer ID, creation time, update time, number of approval, number of comments, specific content
            id = data['id']
            created_time = get_time_str(data["created_time"])
            updated_time = get_time_str(data["updated_time"])
            voteup_count = data["voteup_count"]
            comment_count = data["comment_count"]
            content = data["content"]

            per_answer_list.append([question_type, question_id, question_question_type, question_created,
                                    question_updated_time, author_name, author_headline, author_gender,
                                    author_follower_count, id, created_time, updated_time, voteup_count, comment_count,
                                    content
                                    ])

    except Exception as e:
        print("Json format verification error:", e)
    finally:
        answer_column = ['Question type', 'Question id', 'Question sub-type', 'Question creation time', 'Question update time',
                         'Answerer username', 'Answerer headline', 'Answerer gender', 'Answerer follower count',
                         'Answer id', 'Answer creation time', 'Answer update time', 'Answer upvote count',
                         'Answer comment count', 'Answer content']
        per_answer_data = pd.DataFrame(per_answer_list, columns=answer_column)

    return per_answer_data


if __name__ == '__main__':
    # question_id = '424516487'
    question_id = '429548386'
    url = "https://www.zhihu.com/question/" + question_id
    """ Get a detailed description of the problem """
    title, question, follower, watched, answer_count, tag_list = get_question_base_info(url)
    print("Problem URL:"+ url)
    print("Question title:" + title)
    print("Problem Description:" + question)
    print("The problem is defined as:" + ', '.join(tag_list))
    print("Number of followers: {0}, has been viewed by {1} people".format(follower, watched))
    print("Cutoff {}, there are {} answers to this question.".format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), answer_count))

    """ Get the answer data for the question """
    # constructs the url
    limit, offset = 20, 0
    page_cnt = int(answer_count/limit) + 1
    answer_data = pd.DataFrame()
    for page_index in range(page_cnt):
        answer_url = init_url(question_id, limit, offset+page_index*limit)
        # Fetch data
        data_per_page = get_answer_info(answer_url, page_index)
        # DataFrame.append is removed in newer pandas versions; concat works across versions
        answer_data = pd.concat([answer_data, data_per_page], ignore_index=True)
        sleep(3)
    
    print("\n Climb complete, data saved!!")

    answer_data.to_csv('Sand Sculpture of Versailles _{0}.csv'.format(question_id), encoding='utf-8', index=False)

4. Results

The output file is encoded in UTF-8. If you see garbled characters when opening it, check that the encoding used by your reader matches.
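To verify the export, a minimal sketch using pandas (assuming the default file name written by the script above for question 429548386):

import pandas as pd

# Sketch: read the exported CSV back with the matching UTF-8 encoding
answer_data = pd.read_csv('Sand Sculpture of Versailles _429548386.csv', encoding='utf-8')
print(answer_data.shape)   # number of answers x 15 columns
print(answer_data.head())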

A screenshot of part of the crawled result is shown below:

Thank you for reading this far. You can follow me on my home page for more Python content; your likes, favorites, and comments keep me going. Thanks.
