1. Background
Two years ago the author wrote an article on installing and using Elasticsearch (ES) word segmentation from PHP, recording the steps for implementing segmentation with ES. Recently the author needed word segmentation again, searched online for Chinese word segmentation, and found a Baidu project, Lexical Analysis of Chinese (LAC). The author decided to try it and wrote up the trial in this article as a reference for anyone who needs it.
LAC can be called from Python, C++, Java, and Android; for other languages, developers have to write their own wrappers. The author chose Python for the experiment, and the relevant code is included in the article for reference.
2. Installing LAC
Installing LAC 2.0 is simple: it can be installed directly with pip. Access to the default pip package index can be slow from within China, so a mirror address can be used to speed things up, as in the command below
pip3 install lac -i https://mirror.baidu.com/pypi/simple
After the command is executed, the following information is displayed
Installation notes
- LAC currently has two versions, 1.0 and 2.0. The copy on Gitee ("code cloud") is version 1.0, although it is not labeled as such. Installing version 1.0 is troublesome and error-prone, so check the LAC 2.0 installation method on GitHub instead
- If you are on Windows and plan to install under WSL, do not use WSL 1: it does not support paddle, a component LAC depends on, so LAC cannot be installed correctly
- Pay attention to your Python version when installing LAC: it must not be newer than 3.7. The author used Python 3.8 in the experiment, which resulted in an installation error
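A quick way to check the interpreter before running pip is sketched below. It assumes the limit stated in the notes (nothing newer than Python 3.7) applies to the LAC release being installed; other releases may behave differently. The helper name `python_ok_for_lac` and the constant `MAX_SUPPORTED` are made up for this illustration and are not part of LAC.

```python
import sys

# Assumption taken from the installation notes: Python 3.7 is the newest
# version on which this LAC release installs.
MAX_SUPPORTED = (3, 7)

def python_ok_for_lac(version=None, max_supported=MAX_SUPPORTED):
    """Return True if the (major, minor) version is not newer than max_supported."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) <= max_supported

if __name__ == "__main__":
    current = "%d.%d" % tuple(sys.version_info[:2])
    if python_ok_for_lac():
        print("Python %s should be fine for installing LAC" % current)
    else:
        print("Python %s is newer than 3.7; the install may fail" % current)
```

Running the check first is cheaper than discovering the problem halfway through a failed dependency build.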
3. Running the demo
To verify that LAC has been installed successfully, the author runs some demo code. First create a code file named lac.py, then copy the demo code into it, as shown below
# -*- coding: utf-8 -*-
from LAC import LAC
# Load the word segmentation model
lac = LAC(mode='seg')
# Single sample input, enter a Unicode encoded string
text = u"Your Majesty asked me to patrol the mountains."
seg_result = lac.run(text)
print(seg_result)
# Batch sample input, input as a list of multiple sentences, the average speed will be faster
texts = [u"There's a temple in the mountains.", u"There was an old monk and a young monk in the temple."]
seg_result = lac.run(texts)
print(seg_result)
Then run the file with Python, using the following command
python lac.py
After the command is executed, the following information is displayed
As the output shows, LAC has segmented the text, which indicates that LAC was installed successfully.
In addition to word segmentation, LAC can also be used for part-of-speech tagging and entity recognition, so we continue with another demo. The author creates a second code file, lac2.py, and copies into it the demo code for part-of-speech tagging and entity recognition, as shown below
from LAC import LAC
# Load the LAC model
lac = LAC(mode='lac')
# Single sample input, enter a Unicode encoded string
text = u"I want a raise."
lac_result = lac.run(text)
print(lac_result)
# Batch sample input, input as a list of multiple sentences, the average speed is faster
texts = [u"Tang Qingsong is so handsome.", u"I love being a safety development engineer."]
lac_result = lac.run(texts)
print(lac_result)
Then run the file with Python, using the following command
python lac2.py
After the command is executed, the following information is displayed
In the output above, we can see that LAC returns not only the word segmentation result but also a second list giving the type of each word. A rough check by the author shows that the tags basically match: for example, the author's name is tagged PER (person name), while "so handsome" (hao shuai) is tagged a (adjective)
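To make the two parallel lists easier to read, they can be zipped into (word, tag) pairs. The sketch below assumes that `lac.run` in `lac` mode returns `[words, tags]` for a single sentence, as described above. `pair_words_with_tags` is a hypothetical helper, and `sample_result` is hand-written illustrative data, not real LAC output.

```python
def pair_words_with_tags(lac_result):
    """Turn [words, tags] into a list of (word, tag) tuples."""
    words, tags = lac_result
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return list(zip(words, tags))

# Hand-written sample in the assumed [words, tags] shape (not real LAC output).
sample_result = [["Tang Qingsong", "is", "so handsome"], ["PER", "v", "a"]]

for word, tag in pair_words_with_tags(sample_result):
    print("%s\t%s" % (word, tag))
```

For batch input the same helper could be applied to each element of the returned list, assuming each element has the same `[words, tags]` shape.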
Below is the set of part-of-speech and proper-name category tags. The four most commonly used proper-name categories are written in uppercase:
Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
---|---|---|---|---|---|---|---|
n | common noun | f | noun of locality | s | noun of place | nw | title of a work |
nz | other proper noun | v | common verb | vd | verb used as adverb | vn | verb used as noun |
a | adjective | ad | adjective used as adverb | an | adjective used as noun | d | adverb |
m | numeral | q | measure word | r | pronoun | p | preposition |
c | conjunction | u | particle | xc | other function word | w | punctuation |
PER | person name | LOC | place name | ORG | organization name | TIME | time |
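
As a small illustration of how the uppercase tags can be used, the sketch below keeps only the words tagged PER, LOC, ORG, or TIME. It assumes the words and tags arrive as two parallel lists; `extract_entities`, `ENTITY_TAGS`, and the sample data are hypothetical and not part of LAC.

```python
# The four proper-name categories from the table above.
ENTITY_TAGS = {
    "PER": "person name",
    "LOC": "place name",
    "ORG": "organization name",
    "TIME": "time",
}

def extract_entities(words, tags):
    """Return (word, tag) pairs whose tag is one of the four entity categories."""
    return [(w, t) for w, t in zip(words, tags) if t in ENTITY_TAGS]

# Hand-written sample data, not real LAC output.
words = ["Tang Qingsong", "works", "in", "Beijing"]
tags = ["PER", "v", "p", "LOC"]

for word, tag in extract_entities(words, tags):
    print("%s is a %s" % (word, ENTITY_TAGS[tag]))
```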
4. Impressions after trying it
LAC is a very good word segmentation tool. It is not meant to provide search support to a business directly, but to serve as a basic building block of a search engine.
For example, if you want on-site search over post titles and use LAC for segmentation, you have to store the segmented data yourself and build the search on top of it, because LAC only provides the segmentation function. LAC therefore feels like the segmentation component of a search engine; for searching site content by segmented words, it is not as convenient as ES.
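The extra storage mentioned above can be pictured as an inverted index that maps each word to the titles containing it. The following is a minimal in-memory sketch under that assumption, not a description of how ES or any real search engine stores data. `build_index` and `search` are hypothetical helpers, and `segment` is any function that returns a list of words. With LAC installed it could be `LAC(mode='seg').run`; the example uses whitespace splitting so that it runs without LAC.

```python
from collections import defaultdict

def build_index(titles, segment):
    """Map every segmented word to the set of title ids that contain it."""
    index = defaultdict(set)
    for title_id, title in enumerate(titles):
        for word in segment(title):
            index[word].add(title_id)
    return index

def search(index, query, segment):
    """Return ids of titles that contain every word of the segmented query."""
    word_sets = [index.get(word, set()) for word in segment(query)]
    if not word_sets:
        return set()
    return set.intersection(*word_sets)

# Stand-in segmenter so the example runs without LAC installed.
def whitespace_segment(text):
    return text.lower().split()

titles = [
    "Installing LAC with pip",
    "Word segmentation with ES",
    "LAC entity recognition",
]
index = build_index(titles, whitespace_segment)
print(sorted(search(index, "lac", whitespace_segment)))
print(sorted(search(index, "segmentation with", whitespace_segment)))
```

Everything beyond the `segment` call (storage, lookup, ranking) is work that ES does for you, which is the convenience gap described above.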
The author was also curious about which scenarios the LAC project is suited to. The answer from the LAC project's product staff was as follows:
LAC is better suited to scenarios related to entity recognition, such as knowledge graphs, question answering, and information extraction, and it can also serve as a basic tool for other models and algorithms. This is because its segmentation granularity is centered on entities, and it balances segmentation quality with entity recognition. The segmentation used in search engines generally has a finer granularity, or offers several granularities at once. For search-oriented segmentation, users need to fine-tune the model themselves
Author: Tang Qingsong
Date: 2020-07-07