Author: Ying Zhaokang
I used a crawler to fetch all the articles of the "Tencent Cloud Technology Community"; here's what I got
Preface
With a free weekend to practice crawling, I picked the Tencent Cloud Technology Community as the target. Ha, the classic Pikachu opens the post.
This time I combined a Python crawler with an (imperfect) word segmentation system to build a word cloud of all the articles in the Tencent Cloud Technology Community, to get a rough sense of what the community writes about, hehe 🙂
Main text
Programming approach
- Get the addresses of all articles
- Extract the content of a single article page
- Run the content extraction over all articles and store the results in a MongoDB database
- Build the word cloud with a word segmentation system and wordcloud
**Note:** before storing the article records, I attach a random index to each one; articles are later selected at random for extraction, so the results aren't skewed by publication date.
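To illustrate why the random index helps, here is a minimal sketch of how random batches could be pulled out of MongoDB later. The database and collection names ('tencent', 'articles') are hypothetical, pymongo is assumed, and MongoDB is assumed to run locally:

```python
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
articles = client['tencent']['articles']

def random_batch(lo, hi):
    """Articles whose random index falls in [lo, hi): a date-independent
    sample, since the indexes were drawn uniformly at crawl time."""
    return list(articles.find({'index': {'$gte': lo, '$lt': hi}}))

# with ~6500 articles and indexes drawn from range(0, 6500),
# a window of 100 index values yields roughly 100 random articles
batch = random_batch(0, 100)
```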
Get all article information from the list pages
Storage format:
- index: the random index
- title: the article title
- address: the article URL
- content: the article body
```python
def get_one_page_all(self, url):
    try:
        html = self.get_page_index(self.baseURL)
        # parse with BeautifulSoup
        soup = BeautifulSoup(html, 'lxml')
        title = soup.select('.article-item > .title')
        address = soup.select('.article-item > .title > a[href]')
        for i in range(len(title)):
            # generate a random index for later random sampling
            random_num = random.randrange(0, 6500)
            content = self.parse_content('https://www.qcloud.com' + address[i].get('href').strip())
            yield {
                'index': random_num,
                'title': title[i].get_text().strip(),
                'address': 'https://www.qcloud.com' + address[i].get('href').strip(),
                'content': content
            }
    # skip this page if an index error is encountered
    except IndexError:
        pass
```
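The crawler above calls `self.get_page_index`, which isn't shown in the post. A minimal sketch of what it might look like, assuming a plain `requests` GET is sufficient here:

```python
import requests

def get_page_index(self, url):
    """Fetch a page and return its HTML, or None on failure."""
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException:
        return None
```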
Parse the text
```python
def parse_content(self, url):
    html = self.get_page_index(url)
    soup = BeautifulSoup(html, 'lxml')
    # the article body lives in the .J-article-detail element
    content = soup.select('.J-article-detail')
    return content[0].get_text()
```
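Step three of the plan stores every extracted article in MongoDB, but that code isn't shown either. A minimal pymongo sketch, with the same hypothetical database and collection names as above:

```python
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
articles = client['tencent']['articles']

def save_all(spider):
    # get_one_page_all yields dicts with index/title/address/content
    for item in spider.get_one_page_all(spider.baseURL):
        articles.insert_one(item)
```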
Results
Here I'll just show the final result. It isn't ideal, because the word segmentation system didn't perform well; to clean the input, I used a regular expression to strip all non-Chinese characters from the content.
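For that cleanup step, a minimal sketch of stripping non-Chinese characters with a regex over the CJK Unified Ideographs range (the exact pattern the author used isn't shown, so this is an assumption):

```python
import re

def keep_chinese(text):
    # drop everything outside the basic CJK Unified Ideographs block
    return re.sub(r'[^\u4e00-\u9fa5]', '', text)

print(keep_chinese('Python爬虫 2017 测试!'))  # -> 爬虫测试
```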
Since my personal computer isn't very powerful, I split the job into 20 batches, each built from 100 randomly selected articles.
Below is the word cloud generated from all the articles. The segmentation and filtering still aren't great, so numerals and proper nouns slipped through.
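The post doesn't name the segmentation library it used; here is a minimal sketch of the segmentation-plus-word-cloud step, assuming jieba for segmentation. The font path 'simhei.ttf' is a placeholder: a Chinese font is required or the cloud renders as empty boxes.

```python
import jieba
from wordcloud import WordCloud

def build_wordcloud(texts, out_path='wordcloud.png'):
    # segment each article, then join tokens with spaces because
    # WordCloud splits its input text on whitespace
    segmented = ' '.join(
        word for text in texts for word in jieba.cut(text)
    )
    cloud = WordCloud(font_path='simhei.ttf', width=800, height=600)
    cloud.generate(segmented)
    cloud.to_file(out_path)
```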
Conclusion
As you can see, most of the articles in the Tencent Cloud Technology Community are data-related.
Ha ha, it's not great yet; I'll improve it later (better word filtering).
Finally, a small plug: I hope you'll follow my WeChat public account:
Ikang_kj, hey hey :)
Further reading
Using proxies for automated crawling in Python 3
This article was published in the Tencent Cloud Technology Community with the author's authorization.
Original link: cloud.tencent.com/community/a…
Massive hands-on technical experience, all in the Tencent Cloud community.