Please credit the source when reprinting: blog.csdn.net/forezp/arti… This article is from Fang Zhipeng's blog.
Today, on a whim, I decided to crawl my blog with Python, aggregate the data, and build a high-resolution word cloud (a visual display of word frequency) to see what I've been writing about lately.
1. A few word clouds built from my blog data
1.1 Word cloud aggregated from crawled article titles
1.2 Word cloud aggregated from crawled article abstracts
1.3 Word cloud aggregated from titles plus abstracts
I've recently been writing a series of Spring Cloud tutorials, plus a few posts on microservice architecture, and the word clouds match that pretty well. If you don't believe me, check my blog; they're quite accurate.
2. Technology stack
- Development tool: PyCharm
- Crawler libraries: BeautifulSoup (bs4), Requests, Jieba
- Analysis tool: WordArt
3. Crawler structure design
The entire crawler architecture is very simple (a minimal driver sketch follows the list):
- Crawl my blog: blog.csdn.net/forezp
- Parse out the data (article titles and abstracts).
- Segment the data into words with the jieba ("stutter") library.
- Feed the resulting word frequencies into WordArt to draw the clouds.
- Show the cloud image to the reader.
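Here is a minimal sketch of a driver tying these steps together. The `/article/list/<n>` paging pattern is my assumption about CSDN's list-page URLs and the page count is arbitrary; `download`, `parse_title`, `parse_description`, and `jiebaSet` are all defined in the next section.

```python
# Sketch of the overall flow; URL pattern and page range are assumptions.
def main():
    for page in range(1, 6):
        html = download('http://blog.csdn.net/forezp/article/list/%d' % page)
        if html is not None:
            parse_title(html)        # collect article titles
            parse_description(html)  # collect article abstracts
    jiebaSet()                       # segment text and print word frequencies

if __name__ == '__main__':
    main()
```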
4. Concrete implementation
First, download the pages from the blog address:
```python
import requests

url = 'http://blog.csdn.net/forezp'
titles = set()

def download(url):
    """Fetch a page and return its raw HTML, or None on any failure."""
    if url is None:
        return None
    try:
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/53.0.2785.143 Safari/537.36',
        })
        if response.status_code == 200:
            return response.content
        return None
    except requests.RequestException:
        return None
```
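One hardening step worth mentioning: a bare requests.get call can block indefinitely on a stalled connection. A variant of the fetch with an explicit timeout (the 10-second figure is an arbitrary choice of mine, not from the original code):

```python
import requests

def download_with_timeout(url, seconds=10):
    # Same idea as download(), but gives up after `seconds` instead of hanging.
    try:
        response = requests.get(url, timeout=seconds)
        if response.status_code == 200:
            return response.content
    except requests.RequestException:
        pass
    return None
```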
Parse the titles:
```python
import re
from bs4 import BeautifulSoup

def parse_title(html):
    """Collect article titles from a list page into the global `titles` set."""
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    # Article links on the list page all point at /forezp/article/details/...
    links = soup.find_all('a', href=re.compile(r'/forezp/article/details'))
    for link in links:
        titles.add(link.get_text())
```
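A quick sanity check of the parser on a single page (the printed count depends on how many posts CSDN shows per list page):

```python
html = download(url)
parse_title(html)
print('%d titles collected' % len(titles))
```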
Parse the abstracts:
```python
def parse_description(html):
    """Collect article abstracts into the same `titles` set, so titles and
    abstracts end up in one pool of text for the combined cloud."""
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    descriptions = soup.find_all('div', attrs={'class': 'article_description'})
    for description in descriptions:
        titles.add(description.get_text())
```
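To produce the three separate clouds from section 1 (titles only, abstracts only, titles plus abstracts), a small variant keeps abstracts in their own set. This split is my own suggestion, not something the original code does:

```python
descriptions = set()

def parse_description_separate(html):
    # Same parsing as above, but keeps abstracts out of the `titles` set
    # so each cloud can be generated from its own pool of text.
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all('div', attrs={'class': 'article_description'}):
        descriptions.add(div.get_text())
```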
For word segmentation I used the jieba ("stutter") library; for how to use it, see github.com/fxsjy/jieba… .
```python
import jieba.analyse

def jiebaSet():
    """Segment the collected text and print the top keywords with weights."""
    strs = ''
    if len(titles) == 0:
        return
    for item in titles:
        strs = strs + item

    # topK: keep the 100 highest-weighted keywords;
    # withWeight: return (word, weight) pairs instead of bare words.
    tags = jieba.analyse.extract_tags(strs, topK=100, withWeight=True)
    for item in tags:
        # Scale the weight up so WordArt gets integer-looking frequencies.
        print(item[0] + '\t' + str(int(item[1] * 1000)))
```
Because there was little data, I just printed it to the console and copied it by hand; it would be better to store it in MongoDB.
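A minimal sketch of the MongoDB route, using pymongo; the connection string, database, and collection names are placeholders of mine. Inside `jiebaSet` you would call `save_tags(tags)` instead of (or in addition to) printing:

```python
import pymongo

def save_tags(tags):
    # Store (word, weight) pairs in a local MongoDB instance.
    client = pymongo.MongoClient('mongodb://localhost:27017/')
    collection = client['blog']['word_freq']   # placeholder names
    for word, weight in tags:
        collection.insert_one({'word': word, 'weight': weight})
```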
Make the cloud images with the WordArt online tool: wordart.com.
First, import the data copied from the console.
Somewhat awkwardly, the site's default fonts don't support Chinese when drawing the diagram, so you need to upload a font that does, for example one from C:/Windows/Fonts. Mac users can copy that folder from a Windows machine or download a suitable font online.
Then click Visualize to generate a high-resolution word cloud. That wraps up the walkthrough; if anything needs improving, please leave a comment.
Download the source code: github.com/forezp/Zhih…
5. References
Super simple: quickly make a classy word cloud