UDF overview

As an SQL query engine, Hive provides built-in functions such as count and sum, but sometimes these cannot meet our requirements. In that case we need a Hive UDF (user-defined function). The general workflow for a Hive UDF is as follows (a minimal sketch of step 2 appears after the list):

  • Choose dependency versions that match your environment, mainly Hadoop and Hive (use hadoop version and hive --version to check them);
  • Extend the org.apache.hadoop.hive.ql.exec.UDF class, implement the evaluate method, and package the result into a JAR;
  • Use the add jar command to add the JAR to the distributed cache (if the JAR is uploaded under the $HIVE_HOME/lib/ directory, the add command is not needed);
  • Use create temporary function to create a temporary function; omitting temporary creates a permanent function;
  • Use the UDF you created in SQL.
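
As a minimal, illustrative sketch of step 2 (assuming the classic org.apache.hadoop.hive.ql.exec.UDF API; the MyLower class below is a made-up example, not the UDF built later in this article):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Illustrative only: a trivial UDF that lower-cases its input.
public class MyLower extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}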

The word-segmentation UDF

This is a fairly common scenario: our products generate a large number of danmu (bullet-screen) messages or comments every day, and we may want to analyze which hot topics people care about most, or what has been trending on the network recently. There are two problems here. The first is how the dictionary is built: a general-purpose dictionary may not give good segmentation results, especially because many popular internet terms are not in it. The second is stop words: many words are meaningless for this kind of analysis, so we need to filter them out, and the way to do that is with a stop-word list.

At this point we mainly have two options: use a dictionary provided by a third party, or build our own dictionary and have someone maintain it, which is also a fairly common situation.

The last consideration is the word segmentation tool itself: there are many mainstream tools, and choosing a different one can have a large influence on the segmentation results.

Word segmentation tools

1: Elasticsearch – IK Analysis(Star:2471)

IK Chinese word segmentation for Elasticsearch. While the native IK segmenter reads dictionaries from the file system, ES-IK can be extended to read dictionaries from different sources; currently it can read them from an SQLite3 database. Set the location of your SQLite3 dictionary in elasticsearch.yml: ik_analysis_db_path: /opt/ik/dictionary.db

2: Open Source Java Chinese word Analyzer IKAnalyzer(Star:343)

IK Analyzer is an open source, lightweight Chinese word segmentation toolkit developed in Java. Since the release of version 1.0 in December 2006, IKAnalyzer has gone through four major versions. It started out as a component of the open source Lucene project, combining dictionary-based segmentation with grammar analysis for Chinese phrases. Since version 3.0, IK has evolved into a general-purpose Java segmentation component, independent of the Lucene project.

3: Java Open Source Chinese word segmentation Ansj(Star:3019)

This is a Java implementation of ICTCLAS, with essentially all of the data structures and algorithms rewritten. A dictionary from the open source version of ICTCLAS is included. Segmentation speed is about 2 million words per second, and accuracy can reach above 96%.

It currently implements Chinese word segmentation, Chinese name recognition, part-of-speech tagging, user-defined dictionaries, keyword extraction, automatic summarization, keyword tagging, and other features.

It can be applied to natural language processing and similar tasks, and is suitable for projects that demand high-quality segmentation.

4: ElasticSearch with Jieba (Star:188)

Out of the box, ElasticSearch only offers smartcn for Chinese word segmentation, and it does not work very well. Luckily, there are two Chinese word segmentation plug-ins written by Medcl, one for IK and one for MMSEG.

5: Java Distributed Chinese Word Segmentation Component word (Star:672)

word is a distributed Java Chinese word segmentation component. It provides a variety of dictionary-based segmentation algorithms and uses an n-gram model to disambiguate. It can accurately recognize English, numbers, dates, times, and other quantifiers, and can recognize unknown words such as person names, place names, and organization names.

6: Java Open Source Chinese Word Segmentation Jcseg (Star:400)

What is Jcseg? Jcseg is a lightweight open source Chinese word segmenter based on the MMSEG algorithm. It integrates keyword extraction, key phrase extraction, key sentence extraction, and automatic article summarization, and provides segmentation interfaces for recent versions of Lucene, Solr, and ElasticSearch. Jcseg comes with a jcseg.properties file…

7: Chinese word segmentation library Paoding

Paoding Chinese word segmentation is a Java component that can be combined with Lucene applications to build Chinese search engines for the Internet and enterprise intranets. Paoding filled the gap in open source Chinese word segmentation components in China, and aims to become the preferred open source Chinese segmentation component for Internet websites. It pursues high efficiency and a good user experience.

8: Chinese word segmentation mmseg4j

mmseg4j implements Chinese word segmentation using Chih-Hao Tsai's MMSeg algorithm (technology.chtsai.org/mmseg/), and implements Lucene's Analyzer and Solr's TokenizerFactory to make it easy to…

9: Chinese word segmentation Ansj (Star:3015)

This is a Java implementation of ICTCLAS, with essentially all of the data structures and algorithms rewritten. A dictionary from the open source version of ICTCLAS is included, with some manual optimization. In-memory segmentation runs at about 1 million words per second (faster than ICTCLAS), file-reading segmentation at about 300,000 words per second, and accuracy can reach above 96%. Currently implemented…

10: Lucene Chinese word segmentation library ICTCLAS4J

The ictclas4j Chinese word segmentation system is a Java open source segmentation project completed by Sinboy on the basis of FreeICTCLAS, developed by Zhang Huaping and Liu Quan at the Chinese Academy of Sciences. It simplifies the complexity of the original segmentation program and aims to provide a good learning opportunity for Chinese word segmentation enthusiasts.

Code implementation

Step 1: Introduce dependencies

Here we introduce two dependencies, which are in fact two different word segmentation tools:

<dependency>
  <groupId>org.ansj</groupId>
  <artifactId>ansj_seg</artifactId>
  <version>5.1.6</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>com.janeluo</groupId>
  <artifactId>ikanalyzer</artifactId>
  <version>2012_u6</version>
</dependency>

Before we start, let's write a small demo to get a basic feel for it:

@Test
public void testAnsjSeg() {
    String str = "My name is Li Taibai. I'm a poet. I live in the Tang Dynasty.";
    // Choose among BaseAnalysis, ToAnalysis, NlpAnalysis, IndexAnalysis
    Result result = ToAnalysis.parse(str);
    System.out.println(result);
    KeyWordComputer kwc = new KeyWordComputer(5);
    Collection<Keyword> keywords = kwc.computeArticleTfidf(str);
    System.out.println(keywords);
}

The output

I /r, called /v, Li Taibai /nr, /w, I /r, is /v, a /m, poet /n, /w, I /r, life /vn, in /p, Tang Dynasty /t

Step 2: Introduce the stop word library

Since the stop word library itself is not very large, I put it directly in the project, but you can also put it somewhere else, such as HDFS.
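
For reference, the stop word file is simply one word per line; a hypothetical excerpt (baidu_stopwords.txt is a commonly used Chinese stop-word list):

的
了
是
和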

Step 3: Write the UDF

The code is simple, so I will not explain it in detail; just pay attention to the usage rules of the methods inside GenericUDF (a short skeleton follows below). As for the code design and possible improvements, we will talk about them later. The two implementations below follow almost the same idea; they differ only in the word segmentation tool used.
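
Before reading the real code, here is a minimal, illustrative skeleton of the GenericUDF contract (the class name and function name are placeholders, not part of this article's project):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

public class SkeletonUDF extends GenericUDF {
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Called once per task, before any rows: validate the number and
        // types of the arguments, then return the ObjectInspector of the result.
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("skeleton_udf takes exactly 1 argument.");
        }
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Called once per row; get() materializes the lazily evaluated argument.
        Object value = arguments[0].get();
        return value == null ? null : new Text(value.toString());
    }

    @Override
    public String getDisplayString(String[] children) {
        // Shown by EXPLAIN and used in error messages.
        return getStandardDisplayString("skeleton_udf", children);
    }
}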

The Ansj implementation

/**
 * Chinese word segmentation with a user dict in com.kingcall.dic,
 * using Ansj (a Java open source analyzer).
 */

// This information is returned when you retrieve the function information with desc
@Description(name = "ansj_seg",
        value = "_FUNC_(str) - chinese words segment using ansj. Return list of words.",
        extended = "Example: select _FUNC_('I am a test string') from src limit 1;\n"
                + "[\"I\", \"is\", \"test\", \"string\"]")
public class AnsjSeg extends GenericUDF {
    private transient ObjectInspectorConverters.Converter[] converters;
    private static final String userDic = "/app/stopwords/com.kingcall.dic";

    // load userDic from HDFS
    static {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path(userDic));
            BufferedReader br = new BufferedReader(new InputStreamReader(in));

            String line = null;
            String[] strs = null;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) {
                    strs = line.split("\t");
                    strs[0] = strs[0].toLowerCase();
                    DicLibrary.insert(DicLibrary.DEFAULT, strs[0]); // ignore nature and freq
                }
            }
            MyStaticValue.isNameRecognition = Boolean.FALSE;
            MyStaticValue.isQuantifierRecognition = Boolean.TRUE;
        } catch (Exception e) {
            System.out.println("Error when load userDic: " + e.getMessage());
        }
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 1 || arguments.length > 2) {
            throw new UDFArgumentLengthException(
                    "The function AnsjSeg(str) takes 1 or 2 arguments.");
        }

        converters = new ObjectInspectorConverters.Converter[arguments.length];
        converters[0] = ObjectInspectorConverters.getConverter(arguments[0], PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        if (2 == arguments.length) {
            converters[1] = ObjectInspectorConverters.getConverter(arguments[1], PrimitiveObjectInspectorFactory.writableIntObjectInspector);
        }
        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        boolean filterStop = false;
        if (arguments[0].get() == null) {
            return null;
        }
        if (2 == arguments.length) {
            IntWritable filterParam = (IntWritable) converters[1].convert(arguments[1].get());
            if (1 == filterParam.get()) filterStop = true;
        }

        Text s = (Text) converters[0].convert(arguments[0].get());
        ArrayList<Text> result = new ArrayList<>();

        if (filterStop) {
            for (Term words : DicAnalysis.parse(s.toString()).recognition(StopLibrary.get())) {
                if (words.getName().trim().length() > 0) {
                    result.add(new Text(words.getName().trim()));
                }
            }
        } else {
            for (Term words : DicAnalysis.parse(s.toString())) {
                if (words.getName().trim().length() > 0) {
                    result.add(new Text(words.getName().trim()));
                }
            }
        }
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return getStandardDisplayString("ansj_seg", children);
    }
}

The IKAnalyzer implementation

@Description(name = "ik_seg",
        value = "_FUNC_(str) - chinese words segment using IKAnalyzer. Return list of words.",
        extended = "Example: select _FUNC_('I am a test string') from src limit 1;\n"
                + "[\"I\", \"is\", \"test\", \"string\"]")
public class IknalyzerSeg extends GenericUDF {
    private transient ObjectInspectorConverters.Converter[] converters;
    // The set of stop words
    Set<String> stopWordSet = new HashSet<String>();

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 1 || arguments.length > 2) {
            throw new UDFArgumentLengthException(
                    "The function IknalyzerSeg(str) takes 1 or 2 arguments.");
        }
        // Read in the stop words file
        BufferedReader stopWordFileBr = null;
        try {
            stopWordFileBr = new BufferedReader(new InputStreamReader(new FileInputStream(new File("stopwords/baidu_stopwords.txt"))));
            String stopWord = null;
            while ((stopWord = stopWordFileBr.readLine()) != null) {
                stopWordSet.add(stopWord);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        converters = new ObjectInspectorConverters.Converter[arguments.length];
        converters[0] = ObjectInspectorConverters.getConverter(arguments[0], PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        if (2 == arguments.length) {
            converters[1] = ObjectInspectorConverters.getConverter(arguments[1], PrimitiveObjectInspectorFactory.writableIntObjectInspector);
        }
        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        boolean filterStop = false;
        if (arguments[0].get() == null) {
            return null;
        }
        if (2 == arguments.length) {
            IntWritable filterParam = (IntWritable) converters[1].convert(arguments[1].get());
            if (1 == filterParam.get()) filterStop = true;
        }
        Text s = (Text) converters[0].convert(arguments[0].get());
        StringReader reader = new StringReader(s.toString());
        IKSegmenter iks = new IKSegmenter(reader, true);
        List<Text> list = new ArrayList<>();
        if (filterStop) {
            try {
                Lexeme lexeme;
                while ((lexeme = iks.next()) != null) {
                    if (!stopWordSet.contains(lexeme.getLexemeText())) {
                        list.add(new Text(lexeme.getLexemeText()));
                    }
                }
            } catch (IOException e) {
            }
        } else {
            try {
                Lexeme lexeme;
                while ((lexeme = iks.next()) != null) {
                    list.add(new Text(lexeme.getLexemeText()));
                }
            } catch (IOException e) {
            }
        }
        return list;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "Usage: evaluate(String str)";
    }
}

Step 4: Write test cases

GenericUDF gives us methods that we can use to build the environment and parameters needed for testing, so we can test the code before deploying it:

@Test
public void testAnsjSegFunc() throws HiveException {
    AnsjSeg udf = new AnsjSeg();
    ObjectInspector valueOI0 = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector valueOI1 = PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    ObjectInspector[] init_args = {valueOI0, valueOI1};
    udf.initialize(init_args);

    Text str = new Text("I'm a test string.");

    GenericUDF.DeferredObject valueObj0 = new GenericUDF.DeferredJavaObject(str);
    GenericUDF.DeferredObject valueObj1 = new GenericUDF.DeferredJavaObject(0);
    GenericUDF.DeferredObject[] args = {valueObj0, valueObj1};
    ArrayList<Object> res = (ArrayList<Object>) udf.evaluate(args);
    System.out.println(res);
}


@Test
public void testIkSegFunc() throws HiveException {
    IknalyzerSeg udf = new IknalyzerSeg();
    ObjectInspector valueOI0 = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector valueOI1 = PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    ObjectInspector[] init_args = {valueOI0, valueOI1};
    udf.initialize(init_args);

    Text str = new Text("I'm a test string.");

    GenericUDF.DeferredObject valueObj0 = new GenericUDF.DeferredJavaObject(str);
    GenericUDF.DeferredObject valueObj1 = new GenericUDF.DeferredJavaObject(0);
    GenericUDF.DeferredObject[] args = {valueObj0, valueObj1};
    ArrayList<Object> res = (ArrayList<Object>) udf.evaluate(args);
    System.out.println(res);
}

We can see that loading the user dictionary fails, because the file lives on HDFS and cannot be read from the local test environment, but the test as a whole still runs.

Our second example, however, does not need to load stop words from HDFS, so its test runs perfectly.

Note that I put the dictionary on HDFS, as the code in AnsjSeg shows, so that the file can be updated without repackaging the UDF.

Step 5: Create the UDF and use it

add jar /Users/liuwenqiang/workspace/code/idea/HiveUDF/target/HiveUDF-0.0.4.jar;

create temporary function ansjSeg as 'com.kingcall.bigdata.HiveUDF.AnsjSeg';
select ansjSeg("I am a string, what are you");
select ansjSeg("I am a string, what are you", 1);

create temporary function ikSeg as 'com.kingcall.bigdata.HiveUDF.IknalyzerSeg';
select ikSeg("I am a string, what are you");
select ikSeg("I am a string, what are you", 1);

The second argument to these functions controls whether stop word filtering is enabled; we demonstrate it with the ikSeg function.

Let's try to get the description of the function, for example with desc function ikSeg; the output comes from the @Description annotation we wrote. If the UDF does not provide that information, desc only shows a generic usage string.

Other Application Scenarios

Many common requirements can be implemented easily by writing Hive UDFs. Other scenarios include:

  • IP address to region: convert the ip field reported in user logs into a Country-Province-City format, making regional distribution analysis easy;
  • Compute label data with Hive SQL when you do not want to write a Spark program: initialize a connection pool in a static code block of the UDF, then use the parallel MR tasks launched by Hive to quickly import large amounts of data into codis in parallel; this is applied in some recommendation businesses (see the sketch after this list);
  • Other tasks that are relatively complex to implement in SQL can be handled by writing permanent Hive UDFs;
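
A minimal sketch of that connection-pool pattern, assuming codis is reached through the Jedis client (the host, port, and class name are illustrative assumptions, not taken from the original project):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class CodisSink {
    // Created once per JVM, i.e. once per parallel task, so every row the
    // UDF processes reuses the same small set of connections.
    private static final JedisPool POOL;

    static {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(8); // keep the pool small; many tasks run in parallel
        POOL = new JedisPool(config, "codis-proxy.example.com", 19000);
    }

    public static void write(String key, String value) {
        try (Jedis jedis = POOL.getResource()) {
            jedis.set(key, value);
        }
    }
}

The UDF's evaluate method would then call CodisSink.write(...) for each row, and the parallel tasks launched by Hive provide the import parallelism.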

Conclusion

  1. In this section we implemented a fairly common UDF by extending GenericUDF. The focus is on the implementation of the code and on understanding the methods of the GenericUDF class.
  2. One problem with the implementation above is the loading of stop words: can we load stop words dynamically?
