This article shows you how to use Python to extract keywords from Chinese text step by step. If you need to “read” a long text, give it a try.
The need
A friend of mine recently became interested in natural language processing (NLP): he wanted an automated way to extract keywords from long texts in order to identify their topics.
He asked me for advice, and I recommended my article "How to Extract Topics from Massive Text with Python?".
After reading it, he said he had learned a lot, but the application scenario was different from what he needed.
"How to Extract Topics from Massive Text with Python?" deals with a large number of documents and uses topic discovery to cluster articles. He does not need to handle many documents, nor does he need clustering; instead, each document he works with is very long, and he wants an automated way to extract its keywords and get an overall picture.
It suddenly occurred to me that I had never written an article introducing keyword extraction for a single text.
While this feature isn’t complicated to implement, there are some pitfalls to avoid.
In this article, I will show you step by step how to implement Chinese keyword extraction in Python.
The environment
Python
The first step is to install the Python runtime environment. We use the integrated environment Anaconda.
Please download the latest version of Anaconda from its website. Scroll down the page to find the download section. The site automatically recommends a suitable version based on the operating system you are using; I use macOS, so my download is a PKG file.
The download page shows Python version 3.6 on the left and 2.7 on the right. Please select version 2.7.
Double-click the downloaded PKG file and follow the installer's prompts to complete the installation step by step.
The sample
I have prepared a GitHub project to host the source code and data accompanying this article. Please download the zip file from this address and unzip it.
The decompressed directory is demo-keyword-extraction-master, and it contains the following:
Besides README.md, GitHub's default project description file, there are two other files in the directory: the data file sample.txt and the program source file demo-extract-keyword.ipynb.
Jieba
The keyword extraction tool we use is jieba (结巴分词, literally "stammering segmentation").
We already used this tool in "How to Do Chinese Word Segmentation with Python?" to segment Chinese sentences into words. This time we use a different feature of it: keyword extraction.
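Once jieba is installed (the next step shows how), a minimal sketch like the following illustrates the difference between the two features; the sample sentence is just an arbitrary placeholder.
# -*- coding: utf-8 -*-
import jieba
import jieba.analyse

text = u"我来到北京清华大学参观学习"  # placeholder sentence for illustration

# Word segmentation: split the sentence into a sequence of words
print(u" / ".join(jieba.cut(text)))

# Keyword extraction: return the most salient words, ranked by TF-IDF
print(u" ".join(jieba.analyse.extract_tags(text, topK=3)))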
Go to the terminal and use the cd command to enter the decompressed demo-keyword-extraction-master folder. Then run the following command:
pip install jieba
With that, the package is ready. Now let's run
jupyter notebook
Enter the Jupyter notebook environment.
Now that the environment is ready, let’s introduce the Chinese text data used in this article.
Data
At first, I had trouble finding ready-made Chinese texts.
There is a vast amount of Chinese text available online.
But I was not sure whether using it for a demo would cause copyright problems. "Where did you get this digital copy?" is the kind of question someone might ask when analyzing it.
I cannot afford to defend myself against lawsuits over something like this.
Then it occurred to me that I was making trouble for myself: why use someone else's text at all? Why not just use my own?
Since the beginning of this year, I have written more than 90 articles totaling over 270,000 characters.
I purposely chose a non-technical one to avoid extracting keywords that were all Python commands.
I chose last year’s article “Two or three Things about Ride-hailing Drivers”.
This article tells some interesting stories.
I extracted the text from the web page and stored it in sample.txt.
Mind you, this is a common pitfall. At a workshop this summer, several students got stuck for a long time because of problems grabbing Chinese text from the Internet.
The reason is that, unlike English, Chinese text comes with encoding problems. Different systems have different default encodings, different versions of Python accept different encodings, and a text file downloaded from the Internet may use yet another encoding that differs from your system's.
Any of these factors can leave you with unreadable, garbled text when you open the file.
Therefore, a reliable way to prepare Chinese text data is to create a new text file directly in Jupyter Notebook.
A blank file will then appear.
Open the text you downloaded elsewhere in any editor that displays it correctly, copy the entire content, and paste it into this blank file. That way you avoid encoding errors.
Sidestepping this pit will spare you a lot of unnecessary frustration.
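Alternatively, you can handle the encoding explicitly in code. Below is a minimal sketch that assumes the downloaded file is UTF-8 encoded (adjust the encoding name if yours differs); io.open behaves the same way in Python 2.7 and 3, and 'downloaded.txt' is just a placeholder name.
import io

# Read the file with an explicit encoding instead of relying on the system default.
# 'downloaded.txt' stands in for whatever file you grabbed from the web.
with io.open('downloaded.txt', encoding='utf-8') as f:
    text = f.read()

print(len(text))  # quick sanity check: number of characters read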
Now that you know this trick, you can have fun with keyword extraction.
Execution
Go back to the main screen of Jupyter Notebook and click demo-extract-keyword.ipynb to see the source code.
Yes, you read that right. These four short statements are all it takes to extract keywords with two different methods (TF-IDF and TextRank).
In this section, we first explain the execution steps. The principle of different key word extraction methods is introduced later.
First, we import all the keyword extraction functions from jieba's analysis toolbox.
from jieba.analyse import *
Press Shift+Enter in the corresponding cell to execute the statement and get the result.
Then, have Python open our sample file and read the entire contents into the data variable.
with open('sample.txt') as f:
data = f.read()
Next, we extract keywords and their weights with TF-IDF and display them in order. If you don't specify otherwise, the default is 20 keywords.
for keyword, weight in extract_tags(data, withWeight=True):
print('%s %s' % (keyword, weight))
Before showing the results, jieba prints a brief notice; just ignore it.
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8s/k8yr4zy52q1dh107gjx280mw0000gn/T/jieba.cache
Loading model cost 0.547 seconds.
Prefix dict has been built succesfully.
Then the list comes up:
Uber 0.280875594782
driver 0.119951947597
passenger 0.105486129485
master 0.0958888107815
Master Zhang 0.0838162334963
destination 0.0753618512886
online ride-hailing 0.0702188986954
sister 0.0683412127766
own 0.0672533110661
car 0.0623276916308
job 0.0600134354214
Tianjin 0.0569158056792
10 0.0526641740216
driving Uber 0.0526641740216
thing 0.048554456767
Master Li 0.0485035501943
Tianjin people 0.0482653686026
detour 0.0478244723097
taxi 0.0448480260748
when 0.0440840298591
Looking over the list, I think the keyword extraction is fairly reliable. Of course, a stray "10" is mixed in, but it does no real harm.
If you need a different number of keywords, specify the topK parameter. For example, to output 10 keywords, you can do this:
for keyword, weight in extract_tags(data, topK=10, withWeight=True):
print('%s %s' % (keyword, weight))
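As an aside, extract_tags also accepts an allowPOS parameter that keeps only certain parts of speech, which is one way to filter out stray tokens such as the "10" above. The following is a minimal sketch; please check the documentation of your installed jieba version for the exact part-of-speech tag set.
# Keep only nouns (n), place names (ns) and verbal nouns (vn); tags follow jieba's POS convention
for keyword, weight in extract_tags(data, topK=10, withWeight=True, allowPOS=('n', 'ns', 'vn')):
    print('%s %s' % (keyword, weight))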
Let’s try another keyword extraction method – TextRank.
for keyword, weight in textrank(data, withWeight=True):
print('%s %s' % (keyword, weight))
The keyword extraction results are as follows:
Uber 1.0
driver 0.749405996648
passenger 0.594284506457
sister 0.485458741991
Tianjin 0.451113490366
destination 0.429410027466
times 0.418083863303
author 0.416903838153
no 0.357764515052
jobs 0.291371566494
on the car 0.277010013884
detour 0.274608592084
reprint 0.271932903186
out 0.242580745393
taxi 0.238639889991
thing 0.228700322713
singular 0.213450680366
taxi 0.212049665481
door 0.205816713637
follow 0.20513470986
Note that these results are different from the TF-IDF ones. At the very least, the odd "10" is gone.
But does this mean that the TextRank method is necessarily superior to TF-IDF?
I will leave this as a question to think about; after you carefully read the principles section below, you should be able to answer it on your own.
If you only need to apply this method to a practical problem, skip the principles and move on to the discussion.
The principles
Let’s briefly explain the basic principles of the two different keyword extraction methods mentioned above — TF-IDF and TextRank.
To keep things from getting boring, I will not use mathematical formulas here. Links to the original papers are given later; if you are interested in the details, feel free to study them.
Let's start with TF-IDF.
Its full name is Term Frequency - Inverse Document Frequency. The hyphen in the middle joins two parts that together determine the importance of a word.
The first part is Term Frequency (TF), the frequency with which a word appears.
We often say “tell the important thing three times”.
In the same way, the frequency with which a word appears indicates that it is likely to be of high importance.
But that’s a possibility, not a certainty.
For example, many function words in modern Chinese ("de" 的, "di" 地, "de" 得) and many sentence-final particles in classical Chinese ("zhi" 之, "hu" 乎, "zhe" 者, "ye" 也, "xi" 兮) may appear many times in a text, but they are obviously not keywords.
This is why we need the second part (IDF) in order to determine keywords.
Inverse Document Frequency (IDF) starts by counting how many documents a word appears in. Suppose there are 10 documents; word A appears in all 10 of them, while word B appears in only 3. Which word is more important?
I’ll give you a minute to think about it, and then read on.
It’s time to announce the answers.
The answer is that B is more important.
A may be a function word, or a theme word shared by all the documents. B appears in only three documents, so it is more likely to be a keyword.
The inverse document frequency is based on the reciprocal of this document frequency, so the rarer a word is across documents, the higher its IDF. A word is likely a keyword only when both parts are high.
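To make this concrete, here is a minimal sketch of the idea applied to the toy example above, using a common logarithmic formulation of IDF; actual libraries, including jieba, may use somewhat different weighting formulas.
import math

def tf_idf(term_freq, docs_with_term, total_docs):
    # TF: how often the word appears in the current document
    # IDF: log of (total documents / documents containing the word)
    return term_freq * math.log(float(total_docs) / docs_with_term)

# Word A appears in all 10 documents, word B in only 3;
# assume both appear 5 times in the document being scored.
print(tf_idf(5, 10, 10))  # word A: 0.0, it appears everywhere, so IDF is zero
print(tf_idf(5, 3, 10))   # word B: about 6.02, rarer across documents, higher score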
Now that TF-IDF is done, let’s talk about TextRank.
Compared with TF-IDF, TextRank is more complex. It is not simple arithmetic, but a graph-based calculation.
Below is an example diagram from the original literature.
TextRank first extracts terms and turns them into nodes, then builds links between words based on their associations.
Assign an initial weight value to each node according to the number of connected nodes.
Then you start iterating.
The weight of a word is recalculated from the weights of all the words linked to it, and the recalculated weights are then propagated onward. The iteration continues until the weights reach an equilibrium and stop changing. This is the same idea behind Google's PageRank algorithm.
The top-ranked words by final weight are taken as the keyword extraction result.
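Here is a minimal sketch of this iterative idea on a tiny, made-up word graph (the words and links are invented purely for illustration); real TextRank additionally weights the edges by co-occurrence and builds the graph from a sliding window over the text.
# A made-up undirected word graph: each word points to its associated words
graph = {
    'uber': ['driver', 'passenger', 'taxi'],
    'driver': ['uber', 'passenger'],
    'passenger': ['uber', 'driver'],
    'taxi': ['uber'],
}

damping = 0.85
scores = {word: 1.0 for word in graph}  # start every node with the same weight

# Iterate until the weights settle into an equilibrium state
for _ in range(50):
    new_scores = {}
    for word in graph:
        # Each neighbor contributes its weight, split evenly among its own links
        incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[word])
        new_scores[word] = (1 - damping) + damping * incoming
    scores = new_scores

for word, score in sorted(scores.items(), key=lambda x: -x[1]):
    print('%s %.3f' % (word, score))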
If you are interested in the original literature, please refer to the following links:
- TF-IDF original paper link.
- TextRank original paper link.
Discussion
To summarize, this article has walked through how to extract keywords from Chinese text with Python. Specifically, we used the TF-IDF and TextRank methods, and their extraction results may differ.
Have you ever done Chinese keyword extraction? What tools did you use? How effective were they? Is there a more efficient method than the one described here? Feel free to leave a comment and share your experience and thoughts, so we can discuss together.
If you like this article, please give it a thumbs up. You can also follow and pin my official WeChat account "Nkwangshuyi".
If you are interested in data science, check out my series of tutorials indexed in the post "How to Get Started in Data Science Effectively". There are more interesting problems and solutions there.