Qzone (QQ space) accompanied us from childhood through youth into adulthood, from the 2G era all the way to the tail end of 4G, and it holds far too many of our youth memories. These days friends update their Zones far less often than they used to. I vaguely remember the days of buying and selling friends, grabbing parking spots, and claiming couple's spaces; looking back, it all seems childish, but that was our silly childhood, and things like mutually "stepping" on each other's Zones, writing in Martian text, and making the rounds of each other's comment sections witnessed those carefree years.

Sometimes QQ pushes an "On This Day" notification and I see the posts I made years ago, saying such silly things that I can hardly believe it was me. Sometimes, looking at a good friend's old posts, I can't help laughing and cursing, "How could this guy be so lame!" I hardly use QQ anymore, but occasionally I still look back at the shuoshuo posts I once made and the messages left on my Zone. However embarrassing they are, I'm not willing to delete them, because they are full of the memories of youth.

My Zone has thousands of messages and plenty of shuoshuo posts, and paging through them one screen at a time is tedious. So I decided to simply download all of this data locally. With the same approach, we can also export the comments and message boards of all our contacts.


Selenium

Since accessing a friend's message board requires logging in, for convenience we use Selenium, a tool for web application testing. It can be used for unit testing, integration testing, system testing, and more, and it drives the browser just like a real user. It supports Mozilla Firefox, Google Chrome, Safari, Opera, Internet Explorer, and other browsers.

To use it, you need to install the Selenium library and download the driver for your browser. Analyzing the Qzone login page, we find that the default login method is QR-code scanning, so we need to switch to account/password login.
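A minimal setup sketch (the chromedriver path below is only an example, adjust it to wherever you saved the driver): install the library with pip install selenium, then point Selenium at the driver if it is not already on your PATH.

from selenium import webdriver

# If chromedriver is not on the system PATH, pass its location explicitly
# (Selenium 3 style; the path is an assumption, not from the original post)
driver = webdriver.Chrome(executable_path=r'C:\tools\chromedriver.exe')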


By inspecting the HTML attributes, we find that the element with id="switcher_plogin" is the globally unique element that switches the login method. In the same way, we locate the account input field, the password input field, and the login button, so that Selenium can simulate the whole login.


The login code is as follows:

from selenium import webdriver

# Get the Google Chrome driver
driver = webdriver.Chrome()
# Open the login page
driver.get('https://i.qq.com')
# The login form lives inside an iframe, so switch into it first
driver.switch_to_frame('login_frame')
# Switch to account/password login
driver.find_element_by_id('switcher_plogin').click()
# Enter your account
driver.find_element_by_id('u').send_keys('Your QQ number')
# Enter your password
driver.find_element_by_id('p').send_keys('Your QQ password')
# Click login
driver.find_element_by_id('login_button').click()
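Depending on network speed, the login iframe may not exist the instant the page is requested. A small safeguard (an addition, not part of the original code) is to wait for the iframe before switching into it:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the login iframe to appear, then switch into it
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'login_frame'))
)
driver.switch_to_frame('login_frame')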

Preparing the request parameters

Looking at the requests in the browser's developer tools, we find that besides the usual parameters, every request after login carries a qzonetoken and a g_tk value acquired at login time. The token can be pulled straight out of the page source, but g_tk is not in the source. Based on past experience, the next step is to search the JS, and sure enough there is a piece of encryption code. From its context we can see that it extracts the value of p_skey from the cookie and then runs it through a series of operations to produce g_tk. So we first need to get the cookies, and then compute g_tk from them.


Part of the JS encryption logic:

if (e) {
    if (e.host && e.host.indexOf("qzone.qq.com") > 0) {
        try {
            t = parent.QZFL.cookie.get("p_skey");
        } catch (e) {
            t = QZFL.cookie.get("p_skey");
        }
    }
    // ...
}
"g_tk=" + QZFL.pluginsDefine.getACSRFToken(t)

QZFL.pluginsDefine.getACSRFToken._DJB = function(e) {
    var t = 5381;
    for (var n = 0, r = e.length; n < r; ++n) {
        t += (t << 5) + e.charCodeAt(n);
    }
    return t & 2147483647;
};


Code to get the cookie, g_tk, and token:

""" Get the value of g_tk """
def get_g_tk(cookie) :
    hashes = 5381
    for letter in cookie['p_skey']:
        hashes += (hashes << 5) + ord(letter)
    return hashes & 0x7fffffff

Get cookie information after login
cookie = {}
for elem in driver.get_cookies():
    cookie[elem['name']] = elem['value']
# get g_tk
g_tk = get_g_tk(cookie)
Use xpath to get the source code of the web page after login
html = driver.page_source
xpath = r'window.g_qzonetoken = (function(){ try{return "(.*?) "; } '
Obtain the token after login through xpath
token = re.compile(xpath).findall(html)[0]
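One practical note (an addition, not from the original post): the cookies above are only complete once the redirect after login has finished, so it is worth pausing briefly before calling driver.get_cookies():

import time

# Give the browser a few seconds to finish the redirect after clicking login,
# otherwise p_skey may not yet be present in the cookies
time.sleep(5)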


Getting down to business


Having cracked this simple anti-crawling measure, we can write the actual crawler. First, we determine the target URL, inspect the JSON object in the response using the browser, and prepare the request headers.

Since every request needs to carry the login information, we use a requests session for convenience. Also, looking at the response, we find some useless characters wrapped around the returned JSON, so they need to be trimmed off.


import json
import requests

headers = {
    'authority': 'user.qzone.qq.com',
    'method': 'GET',
    'scheme': 'https',
    'accept-language': 'zh-CN,zh;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}

def get_resp(cookie, g_tk, token, page):
    session = requests.session()
    # Convert the cookie dictionary to a RequestsCookieJar
    c = requests.utils.cookiejar_from_dict(cookie)
    # Add the headers to the session
    session.headers = headers
    # Attach the RequestsCookieJar to the session
    session.cookies = c
    # URL of the message board interface; start/num control the paging.
    # my_qq is your own QQ number, friend_qq is the QQ number whose message
    # board you want to export (set these yourself before running).
    url = f'https://user.qzone.qq.com/proxy/domain/m.qzone.qq.com/cgi-bin/new/get_msgb?uin={my_qq}&hostUin={friend_qq}&start={page}&num=10&g_tk={g_tk}&qzonetoken={token}'
    print(url)
    response = session.get(url)
    # Trim the useless characters around the JSON
    resp_text = response.text[10:-3]
    # Parse the JSON
    resp_json = json.loads(resp_text)
    return resp_json
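The hard-coded slice [10:-3] works because the interface wraps the JSON in a JSONP-style callback such as _Callback( ... );. A slightly more defensive sketch (an alternative, not the original author's code) is to cut at the outermost braces instead:

import json

def extract_json(text):
    """Strip a JSONP-style wrapper by keeping everything between the outermost braces."""
    start = text.find('{')
    end = text.rfind('}') + 1
    return json.loads(text[start:end])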


The method above only fetches a single page of the interface. We read the total number of messages from the JSON and divide it by the number of messages per page to get the total number of pages, then loop through the pages to collect every record. For easy viewing, the data is saved to a CSV file, and the message text is also written to a TXT file so we can generate a word cloud from it.

def get_zone_xx(cookie, g_tk, token, page=0):
    # Make an initial request to get the total number of messages
    resp_json = get_resp(cookie, g_tk, token, page)
    # Total number of messages
    total = resp_json['data']['total']
    print(f'{total} messages in total')
    # Total number of pages
    size = int(total / 10 + 1)
    # Number of messages read so far
    use_page = 0
    # Collect every record here; used later to generate the CSV file
    content_arr = []
    for i in range(0, size):
        # Request the content of one page
        resp_json = get_resp(cookie, g_tk, token, i)
        # Stop once we have read all the messages
        if use_page >= total:
            break
        # Pull the required fields out of each page of data
        for comment in resp_json['data']['commentList']:
            use_page += 1
            print(f'Reading message {use_page}')
            page_json = []
            # Message date
            page_json.append(comment['pubtime'])
            # Nickname
            page_json.append(comment['nickname'])
            # Message content (replace_html strips the HTML markup, see the sketch below)
            content = replace_html(comment['htmlContent'])
            # Also append the text to a file for generating the word cloud
            with open('zone_text.txt', 'a') as f:
                f.write(content)

            page_json.append(content)
            content_arr.append(page_json)
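The loop above calls a replace_html() helper that the post never shows. A minimal sketch, assuming it simply strips HTML tags and a few common entities from the message body:

import re

def replace_html(html_content):
    """Remove HTML tags and unescape a few common entities (hypothetical helper)."""
    text = re.sub(r'<[^>]+>', '', html_content)
    text = text.replace('&nbsp;', ' ').replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>')
    return text.strip()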


Generating a CSV file

    # Convert the collected data to a DataFrame and write it out
    # (requires pandas: import pandas as pd at the top of the script)
    df = pd.DataFrame(data=content_arr,
                      columns=['Message date', 'Nickname', 'Message content'])
    df.to_csv('QQ_ZONE.csv', index=False, encoding='utf-8_sig')
    print('Saved as CSV file.')
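Putting it together: once the Selenium login, the cookie/g_tk/token extraction, and the functions above are in place, the whole export is a single call. This glue step is an assumption about the overall flow, not code from the original post.

# Assumed overall flow, assembled from the snippets above:
#   1. launch Chrome and complete the account/password login
#   2. read the cookies, compute g_tk, and extract the qzonetoken
#   3. crawl the message board page by page and write the files
get_zone_xx(cookie, g_tk, token)
print('Done: QQ_ZONE.csv and zone_text.txt have been written')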


Running the code above generates a CSV file; part of its content looks like this:


The code for generating a wordcloud is as follows

from wordcloud import WordCloud
import matplotlib.pyplot as plt

with open('zone_text.txt', 'r') as f:
    mytext = f.read()

# Path to a Chinese font so the characters render correctly (Windows path)
font = r'C:\Windows\Fonts\simfang.ttf'
wc = WordCloud(collocations=False, font_path=font, width=1400, height=1400, margin=2).generate(mytext)
plt.imshow(wc)
plt.axis("off")
plt.show()


The running results are as follows:


Final thoughts

  

The code above is not complicated. Maybe it was nostalgia, or maybe it was a reaction against all the cluttered, noisy posts in today's WeChat Moments, that made me want to go back and recall those memories of youth.

Neither WeChat Moments nor Qzone can measure whether a person has matured, but for most of us born after 1990, Qzone really does carry a great many innocent memories. Stay true to your original aspiration and keep moving forward!

