UDF for Chinese word segmentation
Word segmentation
This is a fairly common scenario: a company's product generates a large number of bullet-screen messages or comments every day, and we may want to find the hot topics people care about most, or analyze what has been trending recently. Two problems come up. The first is the dictionary you segment against: a generic dictionary often cannot produce good segmentation results, especially for internet slang that has not made it into the vocabulary yet. The second is stop words: many tokens carry no meaning, so we need to filter them out, and the way to do that is with a stop-word list.
For the dictionary, there are two main solutions: use a dictionary provided by a third party, or build our own and have someone maintain it, which is also a fairly common situation.
The last consideration is the segmentation tool itself. There are many mainstream word segmentation tools, and choosing a different tool can have a large impact on the segmentation results.
Word segmentation tools
1: Elasticsearch IK Analysis (Star: 2471)
This plugin integrates the IK Chinese analyzer into Elasticsearch. While native IK reads its dictionaries from the file system, ES-IK can be extended to read dictionaries from different sources; currently it can read from a SQLite3 database. Set the location of your SQLite3 dictionary in elasticsearch.yml: ik_analysis_db_path: /opt/ik/dictionary.db
2: Open source Java Chinese analyzer IKAnalyzer (Star: 343)
IK Analyzer is an open source, lightweight Chinese word segmentation toolkit written in Java. Since the release of version 1.0 in December 2006, IKAnalyzer has gone through four major versions. At first it was built around the open source Lucene project, combining dictionary-based segmentation with grammar analysis for Chinese phrases. Since version 3.0, IK has evolved into a general-purpose Java segmentation component, independent of Lucene.
3: Open source Java Chinese segmenter Ansj (Star: 3019)
This is a Java implementation of ICTCLAS that basically rewrites all the data structures and algorithms. The dictionary comes from the open source version of ICTCLAS. Segmentation speed is about 2 million characters per second, and accuracy can reach above 96%.
It currently implements Chinese word segmentation, Chinese name recognition, part-of-speech tagging, user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging.
It can be applied to natural language processing and is suitable for any project with high requirements on segmentation quality.
4: ElasticSearch with Jieba (结巴) segmentation (Star: 188)
ElasticSearch itself only offers smartCN for Chinese word segmentation, and it does not work very well. Luckily, there are two Chinese segmentation plugins written by medcl, one for IK and one for MMSEG.
5: Java distributed Chinese segmentation component word (Star: 672)
word is a distributed Java component for Chinese word segmentation. It provides a variety of dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It can accurately recognize English, numbers, dates, times and other quantifiers, and can recognize unknown words such as people's names, place names, and organization names.
6: Open source Java Chinese segmenter Jcseg (Star: 400)
What is Jcseg? Jcseg is a lightweight open source Chinese segmenter based on the MMSEG algorithm. It integrates keyword extraction, key phrase extraction, key sentence extraction, and automatic article summarization, and provides segmentation interfaces for the latest versions of Lucene, Solr, and ElasticSearch. Jcseg ships with a jcseg.properties file…
7: Chinese word segmentation library Paoding
Paoding is a Chinese segmenter built in Java that can be combined with Lucene to build Chinese search engines for the internet and enterprise intranets. Paoding filled the gap in open source Chinese segmentation components in China, aims to become the preferred open source Chinese segmenter for internet sites, and pursues both high efficiency and good user experience.
8: Chinese word segmenter mmseg4j
mmseg4j implements Chinese word segmentation using Chih-Hao Tsai's MMSeg algorithm (technology.chtsai.org/mmseg/), and implements Lucene's Analyzer and Solr's TokenizerFactory to facilitate…
9: Chinese segmenter Ansj (Star: 3015)
This is a Java implementation of ICTCLAS that basically rewrites all the data structures and algorithms. The dictionary comes from the open source version of ICTCLAS. With some manual optimization, in-memory segmentation runs at about 1 million characters per second (faster than ICTCLAS) and file-based segmentation at about 300,000 characters per second, with accuracy above 96%. Currently implemented…
10: Lucene Chinese word segmentation library ICTCLAS4J
ictclas4j is an open source Java segmentation project completed by sinboy on the basis of FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun at the Chinese Academy of Sciences. It simplifies the complexity of the original segmentation program and aims to give Chinese word segmentation enthusiasts a better opportunity to learn.
Code implementation
**Step 1:** Add the dependencies
Here we add two dependencies, which are actually two different word segmentation tools:
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.6</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>
Before we start, let's write a small demo to get a basic feel for the tools:
@Test
public void testAnsjSeg() {
    // Imports assumed: org.ansj.splitWord.analysis.ToAnalysis, org.ansj.domain.Result,
    // org.ansj.app.keyword.KeyWordComputer, org.ansj.app.keyword.Keyword, java.util.Collection
    String str = "My name is Li Taibai. I'm a poet. I live in the Tang Dynasty.";
    // Choose one of BaseAnalysis, ToAnalysis, NlpAnalysis, IndexAnalysis
    Result result = ToAnalysis.parse(str);
    System.out.println(result);

    // Extract the top 5 keywords by TF-IDF
    KeyWordComputer kwc = new KeyWordComputer(5);
    Collection<Keyword> keywords = kwc.computeArticleTfidf(str);
    System.out.println(keywords);
}
The output:
I/r, called/v, li Taibai/nr, /w, I/r, is/v, a/m, poet/n, /w, I/r, life/vn, in/p, Tang Dynasty/t
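The commented line in the demo mentions Ansj's four analysis modes. Below is a minimal sketch for comparing them, assuming the ansj_seg artifact above; the mode descriptions follow Ansj's documentation, and the exact tokens you get depend on the Ansj version and dictionaries:

import org.ansj.splitWord.analysis.BaseAnalysis;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

public class AnsjModeDemo {
    public static void main(String[] args) {
        String str = "My name is Li Taibai. I'm a poet. I live in the Tang Dynasty.";
        // BaseAnalysis: dictionary-only, the fastest mode
        System.out.println(BaseAnalysis.parse(str));
        // ToAnalysis: the recommended default, balancing speed and accuracy
        System.out.println(ToAnalysis.parse(str));
        // NlpAnalysis: adds new-word discovery, the slowest but most flexible
        System.out.println(NlpAnalysis.parse(str));
        // IndexAnalysis: emits more overlapping terms, aimed at building search indexes
        System.out.println(IndexAnalysis.parse(str));
    }
}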
**Step 2:** Add the stop-word library
Since the stop-word library itself is not very large, I put it directly in the project, but you can also put it somewhere else, such as HDFS.
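For reference, a stop-word file such as baidu_stopwords.txt is simply one token per line; the loader we write later adds each line to a set. The entries below are illustrative examples of common Chinese stop words, not the actual file contents:

的
了
是
在
一个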
**Step 3:** Write the UDFs
The code is simple, so I will not explain it in detail; just pay attention to the usage rules of the methods inside GenericUDF. We will discuss the code design and possible improvements later. The two implementations below follow almost the same idea and differ only in the segmentation tool they use.
The Ansj implementation
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;

import org.ansj.domain.Term;
import org.ansj.library.DicLibrary;
import org.ansj.library.StopLibrary;
import org.ansj.splitWord.analysis.DicAnalysis;
import org.ansj.util.MyStaticValue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

/**
 * Chinese word segmentation with the user dictionary in com.kingcall.dic.
 * Uses Ansj (an open source Java analyzer).
 */
// This information is returned when you run `desc function` on the UDF
@Description(name = "ansj_seg",
        value = "_FUNC_(str) - chinese words segment using ansj. Return list of words.",
        extended = "Example: select _FUNC_('I am a test string') from src limit 1;\n"
                + "[\"I\", \"is\", \"test\", \"string\"]")
public class AnsjSeg extends GenericUDF {

    private transient ObjectInspectorConverters.Converter[] converters;
    private static final String userDic = "/app/stopwords/com.kingcall.dic";

    // Load the user dictionary from HDFS once, when the class is first used
    static {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path(userDic));
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String line = null;
            String[] strs = null;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) {
                    strs = line.split("\t");
                    strs[0] = strs[0].toLowerCase();
                    DicLibrary.insert(DicLibrary.DEFAULT, strs[0]); // ignore nature and freq
                }
            }
            MyStaticValue.isNameRecognition = Boolean.FALSE;
            MyStaticValue.isQuantifierRecognition = Boolean.TRUE;
        } catch (Exception e) {
            System.out.println("Error when load userDic" + e.getMessage());
        }
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 1 || arguments.length > 2) {
            throw new UDFArgumentLengthException(
                    "The function AnsjSeg(str) takes 1 or 2 arguments.");
        }

        converters = new ObjectInspectorConverters.Converter[arguments.length];
        converters[0] = ObjectInspectorConverters.getConverter(arguments[0], PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        if (2 == arguments.length) {
            converters[1] = ObjectInspectorConverters.getConverter(arguments[1], PrimitiveObjectInspectorFactory.writableIntObjectInspector);
        }
        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        boolean filterStop = false;
        if (arguments[0].get() == null) {
            return null;
        }
        if (2 == arguments.length) {
            IntWritable filterParam = (IntWritable) converters[1].convert(arguments[1].get());
            if (1 == filterParam.get()) filterStop = true;
        }

        Text s = (Text) converters[0].convert(arguments[0].get());
        ArrayList<Text> result = new ArrayList<>();

        if (filterStop) {
            // Run the stop-word recognition over the parse result
            for (Term words : DicAnalysis.parse(s.toString()).recognition(StopLibrary.get())) {
                if (words.getName().trim().length() > 0) {
                    result.add(new Text(words.getName().trim()));
                }
            }
        } else {
            for (Term words : DicAnalysis.parse(s.toString())) {
                if (words.getName().trim().length() > 0) {
                    result.add(new Text(words.getName().trim()));
                }
            }
        }
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return getStandardDisplayString("ansj_seg", children);
    }
}
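One note on the dictionary file: the static block above splits each line on tabs and keeps only the first column (ignoring nature and frequency), so each line of com.kingcall.dic only needs the word itself. The lines below are made-up illustrations of the usual Ansj word / nature / frequency layout:

李太白	nr	1000
网红	n	1000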
The IKAnalyzer implementation
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

@Description(name = "ik_seg",
        value = "_FUNC_(str) - chinese words segment using IKAnalyzer. Return list of words.",
        extended = "Example: select _FUNC_('I am a test string') from src limit 1;\n"
                + "[\"I\", \"is\", \"test\", \"string\"]")
public class IknalyzerSeg extends GenericUDF {

    private transient ObjectInspectorConverters.Converter[] converters;
    // The set of stop words
    Set<String> stopWordSet = new HashSet<String>();

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 1 || arguments.length > 2) {
            throw new UDFArgumentLengthException(
                    "The function IknalyzerSeg(str) takes 1 or 2 arguments.");
        }

        // Read the stop-word file into the set, one word per line
        BufferedReader stopWordFileBr = null;
        try {
            stopWordFileBr = new BufferedReader(new InputStreamReader(new FileInputStream(new File("stopwords/baidu_stopwords.txt"))));
            String stopWord = null;
            while ((stopWord = stopWordFileBr.readLine()) != null) {
                stopWordSet.add(stopWord);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        converters = new ObjectInspectorConverters.Converter[arguments.length];
        converters[0] = ObjectInspectorConverters.getConverter(arguments[0], PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        if (2 == arguments.length) {
            converters[1] = ObjectInspectorConverters.getConverter(arguments[1], PrimitiveObjectInspectorFactory.writableIntObjectInspector);
        }
        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        boolean filterStop = false;
        if (arguments[0].get() == null) {
            return null;
        }
        if (2 == arguments.length) {
            IntWritable filterParam = (IntWritable) converters[1].convert(arguments[1].get());
            if (1 == filterParam.get()) filterStop = true;
        }

        Text s = (Text) converters[0].convert(arguments[0].get());
        StringReader reader = new StringReader(s.toString());
        // The second argument enables IK's smart (coarse-grained) segmentation mode
        IKSegmenter iks = new IKSegmenter(reader, true);
        List<Text> list = new ArrayList<>();

        if (filterStop) {
            try {
                Lexeme lexeme;
                while ((lexeme = iks.next()) != null) {
                    if (!stopWordSet.contains(lexeme.getLexemeText())) {
                        list.add(new Text(lexeme.getLexemeText()));
                    }
                }
            } catch (IOException e) {
            }
        } else {
            try {
                Lexeme lexeme;
                while ((lexeme = iks.next()) != null) {
                    list.add(new Text(lexeme.getLexemeText()));
                }
            } catch (IOException e) {
            }
        }
        return list;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "Usage: evaluate(String str)";
    }
}
**Step 4:** Write test cases
GenericUDF provides us with methods that can be used to build the environment and parameters needed for testing, so that we can test the code without a Hive deployment.
@Test
public void testAnsjSegFunc() throws HiveException {
    AnsjSeg udf = new AnsjSeg();
    ObjectInspector valueOI0 = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector valueOI1 = PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    ObjectInspector[] init_args = {valueOI0, valueOI1};
    udf.initialize(init_args);

    Text str = new Text("I'm a test string.");
    GenericUDF.DeferredObject valueObj0 = new GenericUDF.DeferredJavaObject(str);
    GenericUDF.DeferredObject valueObj1 = new GenericUDF.DeferredJavaObject(0);
    GenericUDF.DeferredObject[] args = {valueObj0, valueObj1};
    ArrayList<Object> res = (ArrayList<Object>) udf.evaluate(args);
    System.out.println(res);
}
@Test
public void testIkSegFunc() throws HiveException {
    IknalyzerSeg udf = new IknalyzerSeg();
    ObjectInspector valueOI0 = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector valueOI1 = PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    ObjectInspector[] init_args = {valueOI0, valueOI1};
    udf.initialize(init_args);

    Text str = new Text("I'm a test string.");
    GenericUDF.DeferredObject valueObj0 = new GenericUDF.DeferredJavaObject(str);
    GenericUDF.DeferredObject valueObj1 = new GenericUDF.DeferredJavaObject(0);
    GenericUDF.DeferredObject[] args = {valueObj0, valueObj1};
    ArrayList<Object> res = (ArrayList<Object>) udf.evaluate(args);
    System.out.println(res);
}
When we run the Ansj test, we can see an error saying the user dictionary could not be loaded, because the file lives on HDFS and cannot be read from the local test environment; the test itself still runs. The second example does not load stop words from HDFS, so its test runs perfectly. Note that I put the dictionary on HDFS, as in the AnsjSeg code, precisely so the file can be updated from outside the jar.
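If we wanted the IK version to read its stop words from HDFS as well, the same FileSystem pattern used in AnsjSeg would apply. A minimal sketch, assuming a hypothetical HDFS path (the real location is up to you):

// Hypothetical HDFS location; adjust to wherever the file actually lives
Path stopWordPath = new Path("/app/stopwords/baidu_stopwords.txt");
FileSystem fs = FileSystem.get(new Configuration());
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(stopWordPath)))) {
    String stopWord;
    while ((stopWord = br.readLine()) != null) {
        stopWordSet.add(stopWord.trim());
    }
} catch (IOException e) {
    System.out.println("Error when load stop words: " + e.getMessage());
}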
**Step 5:** Register the UDFs and use them
add jar /Users/liuwenqiang/workspace/code/idea/HiveUDF/target/HiveUDF-0.0.4.jar;

create temporary function ansjSeg as 'com.kingcall.bigdata.HiveUDF.AnsjSeg';
select ansjSeg("I am a string, what are you");
-- The second argument enables stop-word filtering
select ansjSeg("I am a string, what are you", 1);

create temporary function ikSeg as 'com.kingcall.bigdata.HiveUDF.IknalyzerSeg';
select ikSeg("I am a string, what are you");
select ikSeg("I am a string, what are you", 1);
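A temporary function disappears when the session ends; Hive can also register a permanent function from a jar on HDFS. A sketch, assuming you have uploaded the jar to a made-up path like the one below:

-- The HDFS path here is a made-up example; upload the jar first
create function ansj_seg as 'com.kingcall.bigdata.HiveUDF.AnsjSeg'
using jar 'hdfs:///app/udf/HiveUDF-0.0.4.jar';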
The second argument of the functions above toggles stop-word filtering; we demonstrate it here with the ikSeg function.
Finally, let's try to fetch the description of the function. If everything is in place, the output shows the information we put in the @Description annotation.
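For example, with the temporary function registered above:

desc function ikSeg;
desc function extended ikSeg;

The plain form prints the value string of the @Description annotation, and the extended form adds the example from the extended attribute.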
Conclusion
- In this section we implemented a relatively common UDF, word segmentation, on top of GenericUDF.
- The focus of this section is the implementation of the code and an understanding of the methods in the GenericUDF class.
- One problem with the implementation above is the loading of stop words: can we load the stop words dynamically?