There is a lot of excellent content on Juejin, and I often browse the hot list for articles that interest me. On the one hand, I can pick up practical technical solutions; on the other hand, I can get a sense of how the top authors think. However, the platform publishes a large number of new articles every day, on top of the accumulated history, so it is practically impossible for one person to read it all.

Still, I was curious about what these top authors were talking about, so I decided to mine the keywords of the hot articles and build a word cloud. I will also record the process in detail and share it here, so that you have something to refer to whenever you have a similar need.

Since we want to analyze Juejin's hot list articles, we first need to get the article content. Fortunately, Juejin's site architecture makes this easy: the page content is loaded through an API, and the API returns JSON, which is very easy to parse.

The code is as follows:

import json
import requests

JUEJIN_HEAD = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
    'Connection': 'keep-alive',
    'Content-Length': '170',
    'Content-Type': 'application/json',
    'DNT': '1',
    'Host': 'web-api.juejin.im',
    'Origin': 'https://juejin.cn',
    'Referer': 'https://juejin.im/timeline',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36',
    'X-Agent': 'Juejin/Web',
}

JUEJIN_QUERY = "https://web-api.juejin.im/query"

COUNT = 0

def get_juejin_articles(after="", order="WEEKLY_HOTTEST"):
    global COUNT
    
    # The interface is a POST request; data holds the parameters to send
    data = {
        "operationName": "",
        "query": "",
        "variables": {
            "first": 20,
            "after": after,
            "order": order
        },
        "extensions": {
            "query": {
                "id": "21207e9ddb1de777adeaca7a2fb38030"
            }
        }
    }

    rep = requests.post(JUEJIN_QUERY, data=json.dumps(data), headers=JUEJIN_HEAD)
    articles = rep.json()['data']['articleFeed']['items']
    edges, pageInfo = articles['edges'], articles['pageInfo']

    for edge in edges:
        node = edge['node']

        if node['type'] == 'post':
            # The output can be redirected to a file
            print("{}\t{}".format(node["title"], node["originalUrl"]))

    COUNT += 1
    if COUNT <= 500 and pageInfo['hasNextPage']:
        get_juejin_articles(after=pageInfo['endCursor'], order=order)

By calling the code above, we can get a list of the top articles of the week, including the title and the link to the article details page.

Capturing and analyzing a detail page shows that the article body is fetched from the https://post-storage-api-ms.juejin.im/v1/getDetailData interface, where the parameter that identifies the article is postId. The ID can be taken from the detail page link, for example:

juejin.cn/post/684490…

Here, 5dccbe0ff265da795315a119 is the article ID.
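
To make this step concrete, here is a minimal sketch of calling that interface with requests, reusing the headers defined earlier. Treat the details as assumptions: apart from postId, the interface may expect extra parameters, and where exactly the HTML body sits inside the returned JSON is left for you to inspect.

import requests

JUEJIN_DETAIL = "https://post-storage-api-ms.juejin.im/v1/getDetailData"

def get_article_detail(post_id):
    # postId identifies the article, as described above; any other
    # required parameters are omitted in this sketch
    params = {"postId": post_id}
    rep = requests.get(JUEJIN_DETAIL, params=params, headers=JUEJIN_HEAD)
    # The article body (HTML) sits somewhere inside this JSON payload;
    # find the content field and pass it on to the cleaning step below
    return rep.json()

# Hypothetical usage, with the ID string taken from a detail page link:
# detail = get_article_detail("5dccbe0ff265da795315a119")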

We have now solved the problem of getting the content of the hot list articles. Next comes data cleaning: the content we fetch back is HTML, so we need to strip the useless tags and keep only the plain text.

The code is as follows:

from lxml import etree

def get_content(html):
    # Parse the HTML
    tree = etree.HTML(html, etree.HTMLParser())
    # Keep only the text nodes
    content = tree.xpath('//text()')

    return "".join(content)

With the plain text in hand, we come to the most critical step: Chinese word segmentation. Its quality directly affects the final mining result. Here we use jieba, a popular Chinese word segmentation tool.

The code is as follows:

import re
import jieba

# regular expressions for words
re_han_default = re.compile('([\u4E00-\u9FD5a-zA-Z0-9+#&\.\_]+)', re.U)

def cut_all(content):
    # Segment the text with jieba
    seg_lst = jieba.lcut(content)
    # Keep only tokens that look like words
    words = [seg for seg in seg_lst if re_han_default.match(seg)]

    return " ".join(words)

This is only the most basic approach to Chinese word segmentation, and its quality is fairly average. If you are interested in the topic, I recommend the Juejin booklet I wrote, "In-depth Understanding of NLP Chinese Word Segmentation: From Principle to Practice", which uses plenty of examples and illustrations to explain both the theory and the practice of Chinese word segmentation, so that you can master the technique from scratch and handle all kinds of NLP tasks with ease.

Once we have the word segmentation results, we then calculate the TF-IDF value for each word (also explained in detail in the booklet above), which measures the importance of the word to the article.
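
As a quick refresher, the textbook definition multiplies how often a term appears in a document by the log-scaled inverse of how many documents contain it (note that sklearn's TfidfTransformer applies a smoothed idf and L2 normalization by default, so its exact numbers differ slightly from this form):

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.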

The code is as follows:

import collections

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# corpus is a list of segmented texts, one string per article;
# N is the number of top keywords to keep from each article
def get_keywords(corpus, N=25):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    words = vectorizer.get_feature_names()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(X)
    # Convert the TF-IDF values to an array
    weight = tfidf.toarray()

    # A dictionary of keywords, defaulting to float
    keywords = collections.defaultdict(float)

    for w in weight:
        loc = np.argsort(-w)

        for idx in range(N):
            # Accumulate the keyword's TF-IDF value
            keywords[words[loc[idx]]] += w[loc[idx]]

    return keywords

At this point, the keyword mining of the Juejin hot list articles is done, and we could simply print the results to look at them. But that is still not very intuitive, so we go one step further and turn the keywords into a word cloud to visualize the data.
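
If you just want a quick look at the raw results first, a few lines are enough to sort the keywords by their accumulated weight and print the top entries (a small sketch; the cut-off of 50 is arbitrary):

def print_keywords(keywords, top=50):
    # Sort by accumulated TF-IDF weight, highest first
    ranked = sorted(keywords.items(), key=lambda kv: kv[1], reverse=True)
    for word, weight in ranked[:top]:
        print("{}\t{:.4f}".format(word, weight))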

Here we use the WordCloud package, which looks like this:

from wordcloud import WordCloud

def gen_word_cloud(keywords):
    wc = WordCloud(
        font_path=u"./SimHei.ttf",  # specify a Chinese font
        max_words=3000,
        width=1920,
        height=1080,
        background_color="black",
        margin=5
    )
    
    # Generate word clouds from weights
    wc.generate_from_frequencies(keywords)
    # Save the word cloud image
    wc.to_file("juejin_keywords.png")

Note that if you want WordCloud to display Chinese, you need to download a Chinese font. You can download one from this link: simhei.ttf.
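
Putting the pieces together, the overall flow looks roughly like this. It is only a sketch: html_pages is a made-up name for a list holding the detail-page HTML of each hot article, collected with the requests shown earlier.

# Hypothetical glue code: html_pages is assumed to hold the raw HTML
# of each hot list article
corpus = [cut_all(get_content(html)) for html in html_pages]

keywords = get_keywords(corpus, N=25)
gen_word_cloud(keywords)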

We got a picture of a word cloud like this:

If you want to improve the result further, you can add stop words or tune the Chinese word segmentation; I leave that to you as an exercise, with a possible starting point for stop words sketched below.
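
For example, one simple way to handle stop words is to load a stop-word list and drop those tokens during segmentation. The file name stopwords.txt is just a placeholder for whatever list you choose:

# Hypothetical stop-word filtering: stopwords.txt is a placeholder for
# any stop-word list you like, one word per line
with open("stopwords.txt", encoding="utf-8") as f:
    STOPWORDS = set(line.strip() for line in f)

def cut_without_stopwords(content):
    seg_lst = jieba.lcut(content)
    words = [seg for seg in seg_lst
             if re_han_default.match(seg) and seg not in STOPWORDS]
    return " ".join(words)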

Finally, let me recommend my Juejin booklet "In-depth Understanding of NLP Chinese Word Segmentation: From Principle to Practice" once more. I hope the content above is helpful to you. If you liked it, please like, share, and comment. Thank you very much!