Planning to create a word cloud from Chinese text? Then you have to learn how to do Chinese word segmentation first. Follow this tutorial to get your hands dirty with Python.
Motivation
In “How to Make Word Clouds in Python”, we introduced a method for making word clouds from English text. Did you have fun with it?
As mentioned in that article, English text was chosen as the example because it is the easiest to process. But it didn’t take long for readers to try making word clouds out of Chinese text. Did the same method work for you?
Probably not, because an important step is missing.
Look at your English text. You’ll find that English uses spaces as mandatory separators between words.
For example:
Yes Minister is a satirical British sitcom written by Sir Antony Jay and Jonathan Lynn that was first transmitted by BBC Television between 1980 and 1984, split over three seven-episode series.
However, Chinese text has no such spacing. To make a word cloud, we first need to figure out what the “words” in a Chinese text are.
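To see the difference concretely, here is a minimal sketch (the sentences are just illustrative examples):

# English: whitespace alone is enough to recover the words.
english = "Yes Minister is a satirical British sitcom."
print(english.split())
# ['Yes', 'Minister', 'is', 'a', 'satirical', 'British', 'sitcom.']

# Chinese: there are no spaces to split on, so split() just
# hands back the entire sentence as one "word".
chinese = u"是一部英国讽刺情景喜剧"
print(chinese.split())
# a single-element list containing the whole sentence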
You might think this isn’t a problem at all — I know the boundaries between words when I see them!
Yes, of course you can. You can manually process 1 sentence, 100 sentences, even 10,000 sentences. But what if I gave you a million?
This is the most significant difference between manual processing and computer automation — scale.
Don’t be so quick to give up. You can use a computer to help.
The question, then, should be: how do you get a computer to divide Chinese text into words correctly?
The technical term for this task is word segmentation (in Chinese, 分词).
Before introducing the word segmentation tool and its installation, make sure you have read and followed the steps in How to Make Word Clouds in Python, and then follow the step-by-step instructions in this article.
Word Segmentation
There are many tools for Chinese word segmentation. Some are free, others are paid. Some can be installed and used on your laptop, while others require an Internet connection and run in the cloud.
Today, I’m going to show you how to use Python to do Chinese word segmentation for free on your laptop.
The tool we’ll use has a characteristically odd name: jieba, which literally means “stutter.”
Why such a strange name?
After reading this article, you should be able to figure it out for yourself.
Let’s start by installing the word segmentation tool. Go back to your terminal or command prompt.
Go to the demo folder you created earlier.
Enter the following command:
pip install jieba
With that, the Python installation on your computer knows how to segment Chinese text.
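If you’d like to verify the installation right away, here is a quick sanity-check sketch (the sample sentence is just an example):

import jieba

# Segment a short sample sentence. jieba.cut returns a generator
# of words, so we join the pieces with slashes to see the boundaries.
print("/".join(jieba.cut(u"我来到北京清华大学")))
# roughly: 我/来到/北京/清华大学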
Data
In “How to Make Word Clouds in Python”, we used the Wikipedia introduction to the British sitcom Yes, Minister. This time, we’ll use the corresponding Chinese-language Wikipedia page for the same show.
Copy the text of that page, save it to a text file named yes-minister-cn.txt, and move the file to our working directory, demo.
OK, now we have Chinese text data to analyze.
But don’t rush into programming just yet. Before writing any code, there is one more thing to do: download a Chinese font file.
Go to this website and download simsun.ttf.
Once downloaded, move the TTF font file to the demo directory as well, alongside the text file.
Code
On the command line, run:
jupyter notebook
The browser automatically starts and the following page is displayed.
Here is what we had from the last time we made a word cloud, with one addition: the directory now contains the text file holding the Chinese introduction of Yes, Minister.
Open the file and browse the contents.
We confirm that the Chinese text content has been stored correctly.
Go back to the main Jupyter Notebook page. Click the New button to create a new notebook, and choose the Python 2 option from the dropdown.
We will be prompted for the name of the Notebook. To distinguish it from the last wordcloud notebook, let’s call it wordcloud-cn.
Enter the following three statements into the only code cell on the page, then press Shift+Enter.
filename = "yes-minister-cn.txt"
with open(filename) as f:
    mytext = f.read()
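A side note: plain open() relies on default decoding behavior that varies across systems and Python versions. If the printout in the next step comes out garbled, a minimal sketch that decodes the file explicitly as UTF-8 (works in both Python 2 and 3) is:

import io

# Decode the file explicitly as UTF-8 instead of relying on defaults.
with io.open(filename, encoding="utf-8") as f:
    mytext = f.read()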
Then let’s display the contents of mytext. After entering the following statement, press Shift+Enter again.
print(mytext)
The result is shown in the following figure.
Since the Chinese text reads in without any problem, let’s start segmenting it. Enter the following two lines:
import jieba
mytext = " ".join(jieba.cut(mytext))
The first time you run jieba, it prints some messages while it builds its dictionary cache. You can safely ignore them.
What are the results of word segmentation? Let’s see. Input:
print(mytext)
You can see the segmentation result as shown below.
The words are no longer run together; they are separated by spaces, just as they naturally are in English.
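As an aside, jieba has more than the default “precise” mode we used above. This sketch shows two other modes on a sample sentence of our own choosing; the exact output depends on jieba’s version and dictionary:

import jieba

sentence = u"我来到北京清华大学"

# Full mode: every dictionary word found in the sentence, overlaps included.
print("/".join(jieba.cut(sentence, cut_all=True)))

# Search-engine mode: long words are additionally re-cut into shorter ones.
print("/".join(jieba.cut_for_search(sentence)))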
Can’t wait to use the segmented Chinese text to compose a word cloud?
Yes, enter the following statement:
from wordcloud import WordCloud
wordcloud = WordCloud().generate(mytext)
%pylab inline
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off"
Excited for the Chinese word cloud?
Unfortunately, this is what you see in the word cloud.
Are you frustrated, feeling like you’ve fallen into yet another hole?
Don’t worry. Neither the segmentation tool nor the word cloud tool is broken, and our tutorial steps aren’t wrong either; the problem is simply a missing font. By default, wordcloud uses an English font that contains no Chinese glyphs, which is why every Chinese character is rendered as a box. The solution is to specify simsun.ttf, the font you downloaded earlier, as the output font.
Enter the following statement:
from wordcloud import WordCloud
wordcloud = WordCloud(font_path="simsun.ttf").generate(mytext)
%pylab inline
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
This time the output looks something like this:
In walking through the making of a Chinese word cloud, we have now seen first-hand why Chinese word segmentation is necessary.
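If you want to keep the result as an image file rather than only viewing it inline, wordcloud can write a PNG directly. A minimal sketch, with a few optional appearance parameters and an output filename of our own choosing:

from wordcloud import WordCloud

# Regenerate with explicit size and background, then save to disk.
wordcloud = WordCloud(font_path="simsun.ttf",
                      width=800,
                      height=600,
                      background_color="white").generate(mytext)
wordcloud.to_file("yes-minister-cn.png")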
Here is a question to ponder: compare the Chinese word cloud you just generated with the English word cloud we made last time.
What are the similarities and differences between the two word clouds, which are both from Wikipedia and describe the same show? From this comparison, what interesting patterns can you see between the Chinese and English articles on Wikipedia?
Discussion
Now that you’ve mastered this method, what Chinese word clouds have you made on your own? Besides making word clouds, what other applications of Chinese word segmentation do you know of? Leave a comment and share with us, and let’s discuss.
If you liked this article, please give it a thumbs up. You can also follow and pin my official WeChat account, “Nkwangshuyi.”
If you’re interested in data science, check out my tutorial series index post, “How to Get Started in Data Science Effectively.” It has more interesting problems and solutions.