Please credit the source when reprinting: blog.csdn.net/forezp/arti… This article is from Fang Zhipeng's blog.
Today, on a whim, I decided to crawl my blog with Python, aggregate the data, and build a high-resolution word cloud (a visual display of word frequency) to see what I've been writing about lately.
1. A few word clouds built from my blog data
1.1 Word cloud aggregated from crawled article titles
1.2 Word cloud aggregated from crawled article abstracts
1.3 Word cloud aggregated from titles plus abstracts
I've recently been writing a series of Spring Cloud tutorials, plus a few posts on microservice architecture, and the word clouds match that pretty well. If you don't believe me, check my blog; they're quite accurate.
2. Technology stack
- Development tool: PyCharm
- Crawler libraries: BeautifulSoup (bs4), Requests, Jieba
- Analysis tool: WordArt
3. Crawler structure design
The entire crawler architecture is very simple (a minimal driver sketch follows the list):
- Crawl my blog: blog.csdn.net/forezp
- Parse out the data (article titles and abstracts).
- Segment the data into words with the jieba ("stutter") library.
- Feed the resulting word frequencies into WordArt to draw the clouds.
- Show the cloud image to the reader.
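Here is a minimal sketch of a driver tying these steps together. The `/article/list/<n>` paging pattern is my assumption about CSDN's list-page URLs and the page count is arbitrary; `download`, `parse_title`, `parse_description`, and `jiebaSet` are all defined in the next section.

```python
# Sketch of the overall flow; URL pattern and page range are assumptions.
def main():
    for page in range(1, 6):
        html = download('http://blog.csdn.net/forezp/article/list/%d' % page)
        if html is not None:
            parse_title(html)        # collect article titles
            parse_description(html)  # collect article abstracts
    jiebaSet()                       # segment text and print word frequencies

if __name__ == '__main__':
    main()
```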
4. Concrete implementation
First, download the pages from the blog address:
```python
import requests

url = 'http://blog.csdn.net/forezp'
titles = set()

def download(url):
    """Fetch a page and return its raw HTML, or None on any failure."""
    if url is None:
        return None
    try:
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/53.0.2785.143 Safari/537.36',
        })
        if response.status_code == 200:
            return response.content
        return None
    except requests.RequestException:
        return None
```
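One hardening step worth mentioning: a bare requests.get call can block indefinitely on a stalled connection. A variant of the fetch with an explicit timeout (the 10-second figure is an arbitrary choice of mine, not from the original code):

```python
import requests

def download_with_timeout(url, seconds=10):
    # Same idea as download(), but gives up after `seconds` instead of hanging.
    try:
        response = requests.get(url, timeout=seconds)
        if response.status_code == 200:
            return response.content
    except requests.RequestException:
        pass
    return None
```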
Parse the titles:
```python
import re
from bs4 import BeautifulSoup

def parse_title(html):
    """Collect article titles from a list page into the global `titles` set."""
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    # Article links on the list page all point at /forezp/article/details/...
    links = soup.find_all('a', href=re.compile(r'/forezp/article/details'))
    for link in links:
        titles.add(link.get_text())
```
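A quick sanity check of the parser on a single page (the printed count depends on how many posts CSDN shows per list page):

```python
html = download(url)
parse_title(html)
print('%d titles collected' % len(titles))
```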
Parse the abstracts:
```python
def parse_description(html):
    """Collect article abstracts into the same `titles` set, so titles and
    abstracts end up in one pool of text for the combined cloud."""
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    descriptions = soup.find_all('div', attrs={'class': 'article_description'})
    for description in descriptions:
        titles.add(description.get_text())
```
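To produce the three separate clouds from section 1 (titles only, abstracts only, titles plus abstracts), a small variant keeps abstracts in their own set. This split is my own suggestion, not something the original code does:

```python
descriptions = set()

def parse_description_separate(html):
    # Same parsing as above, but keeps abstracts out of the `titles` set
    # so each cloud can be generated from its own pool of text.
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all('div', attrs={'class': 'article_description'}):
        descriptions.add(div.get_text())
```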
For word segmentation I used the jieba ("stutter") library; for how to use it, see github.com/fxsjy/jieba… .
```python
import jieba.analyse

def jiebaSet():
    """Segment the collected text and print the top keywords with weights."""
    strs = ''
    if len(titles) == 0:
        return
    for item in titles:
        strs = strs + item

    # topK: keep the 100 highest-weighted keywords;
    # withWeight: return (word, weight) pairs instead of bare words.
    tags = jieba.analyse.extract_tags(strs, topK=100, withWeight=True)
    for item in tags:
        # Scale the weight up so WordArt gets integer-looking frequencies.
        print(item[0] + '\t' + str(int(item[1] * 1000)))
```
Because there was little data, I just printed it to the console and copied it by hand; it would be better to store it in MongoDB.
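A minimal sketch of the MongoDB route, using pymongo; the connection string, database, and collection names are placeholders of mine. Inside `jiebaSet` you would call `save_tags(tags)` instead of (or in addition to) printing:

```python
import pymongo

def save_tags(tags):
    # Store (word, weight) pairs in a local MongoDB instance.
    client = pymongo.MongoClient('mongodb://localhost:27017/')
    collection = client['blog']['word_freq']   # placeholder names
    for word, weight in tags:
        collection.insert_one({'word': word, 'weight': weight})
```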
Make the cloud images with the WordArt online tool: wordart.com.
First, import the data copied from the console.
Somewhat awkwardly, the site's default fonts don't support Chinese when drawing the diagram, so you need to upload a font that does, for example one from C:/Windows/Fonts. Mac users can copy that folder from a Windows machine or download a suitable font online.
Then click Visualize to generate a high-resolution word cloud. That wraps up the walkthrough; if anything needs improving, please leave a comment.
Download the source code: github.com/forezp/Zhih…
5. References
Super simple: quickly make a classy word cloud