The annual "dog abuse" festival (Valentine's Day) has finally passed. WeChat Moments were full of selfies, baby photos, food shots, and public displays of affection. And what were the programmers posting? The programmers were working overtime. Still, a gift is a must, so what to give? As a programmer, I prepared a special one: a "heart" built from her past Weibo posts. I figured she would be moved to tears. Haha.
Preparation
Once I had the idea, Python was naturally the first thing that came to mind. The general plan: crawl the Weibo data, clean it, run it through word segmentation, then feed the processed data into a word cloud tool and, with the help of scientific computing and plotting tools, render it as an image. The toolkits involved are:
Requests for the network requests that crawl the Weibo data; jieba ("stutter") for Chinese word segmentation; the wordcloud library for building the word cloud; the image-processing library Pillow; the scientific computing tool NumPy; and Matplotlib, a MATLAB-like 2D plotting library.
Installing the tools
Installing these toolkits may produce different errors on different platforms. wordcloud, Requests, and jieba can all be installed normally with pip:
pip install wordcloud
pip install requests
pip install jieba
Installing Pillow, NumPy, and Matplotlib directly with pip on Windows may fail. One recommended approach is to download the corresponding .whl files from the third-party site Python Extension Packages for Windows, picking the build that matches your environment: cp27 corresponds to Python 2.7 and amd64 to a 64-bit system. Download them locally and install:
pip install Pillow-4.0.0-cp27-cp27m-win_amd64.whl
pip install scipy-0.18.0-cp27-cp27m-win_amd64.whl
pip install numpy-1.11.3+mkl-cp27-cp27m-win_amd64.whl
pip install matplotlib-1.5.3-cp27-cp27m-win_amd64.whl
On other platforms, follow the error messages and search for a fix. Alternatively, develop directly on Anaconda, a Python distribution that bundles a large number of modules for scientific computing and machine learning.
Getting the data
The official Sina Weibo API is very limited: it only returns a user's 5 most recent posts. The fallback is to crawl the data. Before crawling, I assessed the difficulty and checked whether someone had already written such a crawler; what I found gave me some ideas, and I decided to write my own, crawling through the mobile site m.weibo.cn. The endpoint m.weibo.cn/index/my?fo… returns Weibo data page by page in JSON format, which saves a lot of trouble. However, this endpoint requires the cookies of a logged-in session, which you can find in Chrome's developer tools after logging in to your account.
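For example, the cookies can be passed to Requests as a plain dict; a minimal sketch, assuming you copy the real values from Chrome's developer tools after logging in (the cookie names and values below are placeholders):

import requests

# Placeholder cookie names/values: copy the real ones from Chrome's
# developer tools (Network panel) after logging in to m.weibo.cn
cookies = {
    "SUB": "xxx",
    "SUHB": "xxx",
}

resp = requests.get("http://m.weibo.cn/index/my?format=cards&page=1",
                    cookies=cookies)
print(resp.status_code)  # 200 means the cookies are still valid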
Implementation code:
import requests

def fetch_weibo():
    api = "http://m.weibo.cn/index/my?format=cards&page=%s"
    for i in range(1, 102):
        # `cookies` is the logged-in cookie dict captured from the browser
        response = requests.get(url=api % i, cookies=cookies)
        data = response.json()[0]
        groups = data.get("card_group") or []
        for group in groups:
            text = group.get("mblog").get("text")
            text = text.encode("utf-8")
            # cleanring() strips punctuation, HTML tags, "retweet" markers, etc.
            text = cleanring(text).strip()
            yield text
The account has 101 pages of posts in total. To avoid the memory cost of building and returning a full list at once, the function uses yield to return a generator. The text also has to be cleaned, for example by removing punctuation, HTML tags, and filler words like "retweet".
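A minimal sketch of what such a cleanring() helper might look like; the exact regular expressions and filler phrases here are assumptions, not the original implementation:

import re

def cleanring(text):
    # Hypothetical cleaner: drop HTML tags, links, "retweet" markers
    # and common punctuation before segmentation
    text = re.sub(r"<[^>]+>", "", text)          # HTML tags
    text = re.sub(r"http\S+", "", text)          # links
    text = re.sub(r"转发微博|Repost", "", text)   # "retweet" markers
    text = re.sub(r"[,.!?;:()\[\]]", " ", text)  # punctuation
    return text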
Saving the data
Once the data is retrieved, store it offline so it can be reused without crawling again. Save it in CSV format as weibo.csv for further processing. The file may look garbled when you open it; that's fine, it displays correctly in Notepad++.
import codecs
import csv

def write_csv(texts):
    # persist the cleaned posts so later runs can skip the crawl
    with codecs.open('weibo.csv', 'w') as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        for text in texts:
            writer.writerow({"text": text})

def read_csv():
    # stream the saved posts back one by one
    with codecs.open('weibo.csv', 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row['text']
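A quick usage sketch of the two helpers: crawl once and cache to weibo.csv, then later runs only read the file back:

if __name__ == "__main__":
    # crawl once and cache the cleaned posts
    write_csv(fetch_weibo())
    # later runs can skip the crawl and just stream the file
    for text in read_csv():
        print(text)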
Word segmentation
Each post read from weibo.csv is segmented into words and then handed to wordcloud to generate the word cloud. jieba's segmentation handles most Chinese usage scenarios well. A stopwords.txt file filters out uninformative words (e.g. "then", "because", etc.).
import jieba.analyse

def word_segment(texts):
    jieba.analyse.set_stop_words("stopwords.txt")  # filter stop words
    for text in texts:
        # keep the 20 highest-weighted keywords of each post
        tags = jieba.analyse.extract_tags(text, topK=20)
        yield " ".join(tags)
Generating the image
After segmentation, the data can be handed to WordCloud, which sizes each keyword according to its frequency and weight in the data. By default it produces a square image, as shown below:
Admittedly, the default image isn't pretty, and since it's going to be a gift it has to look good. So find an image with some artistic flair as a mask template and render a nicer picture against it. I found a "heart" image online:
Image generation code:
import matplotlib.pyplot as plt
from scipy.misc import imread
from wordcloud import WordCloud

def generate_img(texts):
    data = " ".join(text for text in texts)
    # read the heart-shaped template as a grayscale mask
    mask_img = imread('./heart-mask.jpg', flatten=True)
    wordcloud = WordCloud(
        font_path='msyh.ttc',        # a Chinese font, otherwise characters render as boxes
        background_color='white',
        mask=mask_img
    ).generate(data)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.savefig('./heart.jpg', dpi=600)
Note: copy the Microsoft YaHei UI font from C:\Windows\Fonts to matplotlib's font directory: C:\Python27\Lib\site-packages\matplotlib\mpl-data\fonts\ttf.
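Putting the steps together, a minimal sketch of how the functions above can be chained (crawl once, then segment the cached posts and render the heart):

if __name__ == "__main__":
    write_csv(fetch_weibo())           # step 1: crawl and cache to weibo.csv
    texts = word_segment(read_csv())   # step 2: segment the cached posts
    generate_img(texts)                # step 3: render heart.jpg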
That’s about it.
When I proudly sent her this image, this conversation came up:
What’s this? I: Love ah, personally do so professional, so touched ah, you only have eyes for Python, not me (crying and laughing) ME: clearly “heart” have Python ah
I think I said something wrong. Ha, ha, ha
The full code can be downloaded at “H”.
Follow the public account "A programmer's micro site" for practical Python content with a human touch.