Preface

All stories have a beginning and an end.

This article is the final chapter on Chinese character similarity for NLP and brings the series to a close.

NLP: ideas for computing the similarity of visually similar Chinese characters

An open-source Chinese glyph similarity algorithm, a small contribution to Chinese NLP

What is the most expensive Chinese character in contemporary China?

Shortcomings

This article exists because the previous implementation of the algorithm had some shortcomings.

Tower of Babel

There is a story in the Bible about the construction of the Tower of Babel, and eventually people stopped work because of language problems.

Gen 11:6 "Behold! They have become one people, all of them speaking the same language. Now that they have begun to do this, nothing will be impossible for them to do afterward.
Gen 11:7 And let us go down, and there confound their language, that they may not understand one another's speech."
Gen 11:8 And the Lord scattered them abroad from thence upon the face of the whole earth; so they stopped building the city.

To avoid language barriers, I started by building a comparison program packaged with exe4j, and it ran smoothly on my own machine.

For others, however, it failed to run: after all kinds of environment configuration, it still ended with an error.

So I wrote a simple Python version for those of you doing NLP research.

Github.com/houbb/nlp-h…

Java is a language, Python is a language.

Programming languages, which allow people to communicate with machines, create barriers between people.

Character splitting

In "What is the most expensive Chinese character in contemporary China?" we explained the splitting of Chinese characters for the first time.

One of the core purposes of splitting Chinese characters is to improve similarity comparison between them.

By splitting two characters into their parts and measuring how similar those parts are, we can make the comparison more accurate.

Character similarity

A simple requirement

To make this easier to follow, let's first think like a product manager and then walk through the implementation.

My requirement is fairly simple. Look, [明] can be split into [日] and [月], and [冐] can also be split into [日] and [月]. Compare the two and the result is obvious. I don't care how you do it. We go live tomorrow.

Guys, you already know how to do that, right?

Try it out

Just as the product manager asked, this requirement has already been implemented.

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>nlp-hanzi-similar</artifactId>
    <version>1.2.0</version>
</dependency>

Usage

double rate1 = HanziSimilarHelper.similar('末', '未');

The corresponding result is 0.96969696969697.

For more details, please refer to the open source address:

Github.com/houbb/nlp-h…

Before we wrap up

Projects involved

This brings the work on Chinese character similarity to a close.

The main resources and projects involved are:

Pinyin

Character splitting

Four-corner code dictionary

Chinese character structure dictionary

Chinese word library

Stroke count dictionary

Of course, opencc4j can also be brought in to handle Traditional and Simplified Chinese, but we won't expand on that here.

Future plans

There is still a lot of work to be done in NLP. After all, Chinese NLP has only just begun.

The technology has not yet succeeded; comrades, we must keep working hard.

Rumor has it that a certain Master Huang in Goose City has recently had everyone complaining.

Many of you said it would be great if there were a program that could do the same thing: 卂 ditto.

As the saying goes, the speaker meant nothing by it, but the listener took it to heart.

So I will write a piece of communication software, mainly to consolidate my Netty learning; nothing else matters.

He knew that even if he did, they would not change, but he was ready to try.

Java Implementation

Warning: if you have very little hair left, or are not interested in the implementation, you can bookmark, like, comment, and leave now.

Here’s the boring code implementation.

Programmer’s mind

Here’s what a programmer thinks.

First of all, we need to solve several problems:

(1) Splitting Chinese characters

Here we directly reuse the character-splitting implementation we already have.

List<String> stringList = ChaiziHelper.chai(charWord.charAt(0));

The same Chinese character can be split in many ways, so for simplicity, we’ll default to the first one.
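As a minimal sketch of what "default to the first one" means, the snippet below uses a tiny hand-written table in place of the real dictionary. It does not call ChaiziHelper: the class FirstSplitSketch, the SPLITS table, and the firstSplit method are all hypothetical, the second split listed for 明 (月 + 日, simply the first one reversed) is a made-up placeholder, and falling back to the character itself when no split is known is my own assumption rather than documented library behavior. Characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a split dictionary; this is not the real ChaiziHelper. */
final class FirstSplitSketch {
    static final char MING = '\u660E'; // U+660E, "ming" (bright)
    static final char RI = '\u65E5';   // U+65E5, "ri" (sun)
    static final char YUE = '\u6708';  // U+6708, "yue" (moon)

    // One entry per character; each string is one possible split.
    // Illustrative data only: the second option for MING is a made-up placeholder.
    static final Map<Character, List<String>> SPLITS = Map.of(
            MING, List.of("" + RI + YUE, "" + YUE + RI));

    /** Returns the components of the first listed split, or the character itself if none is known. */
    static char[] firstSplit(char hanzi) {
        List<String> options = SPLITS.get(hanzi);
        if (options == null || options.isEmpty()) {
            // Assumption: an unknown character is treated as its own single component.
            return new char[] {hanzi};
        }
        return options.get(0).toCharArray();
    }

    public static void main(String[] args) {
        System.out.println(firstSplit(MING).length); // number of components in the first split
    }
}
```

Picking the first entry keeps the code simple, but it also means the result depends on the order in which the dictionary lists its splits.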

(2) Similarity comparison

Suppose we compare two Chinese characters, A and B, and split them into the following subsets.

A = {A1, A2, … , Am}

B = {B1, B2, … , Bn}

/**
 * Gets the split characters.
 * @param charWord the character
 * @return the result
 */
private char[] getSplitChars(String charWord) {
    List<String> stringList = ChaiziHelper.chai(charWord.charAt(0));

    // A split has to be chosen here. For simplicity, the first one is chosen by default.
    String string = stringList.get(0);
    return string.toCharArray();
}

The comparison of the split subsets can be implemented in many ways. For simplicity, we just iterate over the components of one character and check whether each one also appears among the components of the other.

Of course, the number of iterations is capped at the size of the smaller subset.

int minLen = Math.min(charsOne.length, charsTwo.length);

// compare
double totalScore = 0.0;
for(int i = 0; i < minLen; i++) {
    char iChar = charsOne[i];
    String textChar = iChar+"";
    if(ArrayPrimitiveUtil.contains(charsTwo, iChar)) {
        // Add up the score
    }
}
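To see how the loop above behaves, here is a self-contained sketch. The contains method is a plain stand-in for ArrayPrimitiveUtil.contains, and sharedAsWritten and sharedOverShorter are hypothetical names. The first mirrors the loop as written, which inspects the first min(m, n) components of the first character. The second is one possible variant, not the library's behavior, that walks the shorter array in full.

```java
/** Counts shared components in two ways; ASCII letters stand in for split components. */
final class ContainsLoopSketch {

    /** Linear membership test, standing in for ArrayPrimitiveUtil.contains. */
    static boolean contains(char[] chars, char target) {
        for (char c : chars) {
            if (c == target) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the loop above: checks the first min(m, n) components of 'one' against 'two'. */
    static int sharedAsWritten(char[] one, char[] two) {
        int minLen = Math.min(one.length, two.length);
        int shared = 0;
        for (int i = 0; i < minLen; i++) {
            if (contains(two, one[i])) {
                shared++;
            }
        }
        return shared;
    }

    /** Variant (an assumption, not the library's code): iterate the shorter array in full. */
    static int sharedOverShorter(char[] one, char[] two) {
        char[] shorter = one.length <= two.length ? one : two;
        char[] longer = one.length <= two.length ? two : one;
        int shared = 0;
        for (char c : shorter) {
            if (contains(longer, c)) {
                shared++;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        char[] a = {'x', 'y', 'z'};
        char[] b = {'z'};
        System.out.println(sharedAsWritten(a, b));
        System.out.println(sharedAsWritten(b, a));
        System.out.println(sharedOverShorter(a, b));
    }
}
```

With components {x, y, z} and {z}, sharedAsWritten finds 0 matches in one argument order and 1 in the other, while sharedOverShorter finds 1 either way. Whether that difference matters in practice depends on how the split data orders its components.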

(3) Weight of each split component

For example, two components may both belong to the split subset, yet their weights differ because they have different numbers of strokes.

We take the weight of a component to be its stroke count divided by the total stroke count of the whole character.

int textNumber = getNumber(textChar, similarContext);

double scoreOne = textNumber*1.0 / numberOne * 1.0;
double scoreTwo = textNumber*1.0 / numberTwo * 1.0;

totalScore += (scoreOne + scoreTwo) / 2.0;

P.S. We divide by 2 here for normalization, to keep the final result between 0 and 1.
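A worked example may help here. Assuming 明 (8 strokes) splits into 日 + 月 and 胆 (9 strokes) splits into 月 + 旦, the only shared component is 月 (4 strokes), so the score is (4/8 + 4/9) / 2 = 17/36, roughly 0.47. The sketch below reproduces that arithmetic. The class WeightSketch, its STROKES table, and the two splits are hand-written illustrations rather than data read from the library.

```java
import java.util.Map;

/** Worked example of the stroke-weighted score; all data here is hand-written. */
final class WeightSketch {
    static final char RI = '\u65E5';   // "ri" (sun), 4 strokes
    static final char YUE = '\u6708';  // "yue" (moon), 4 strokes
    static final char DAN = '\u65E6';  // "dan" (dawn), 5 strokes

    // Every component used in this example has an entry, so a plain get() is enough.
    static final Map<Character, Integer> STROKES = Map.of(RI, 4, YUE, 4, DAN, 5);

    /** Same accumulation as the article's loop, with the whole-character stroke counts passed in. */
    static double score(char[] one, int strokesOne, char[] two, int strokesTwo) {
        int minLen = Math.min(one.length, two.length);
        double total = 0.0;
        for (int i = 0; i < minLen; i++) {
            char c = one[i];
            if (new String(two).indexOf(c) >= 0) {
                int strokes = STROKES.get(c);
                double scoreOne = strokes * 1.0 / strokesOne;
                double scoreTwo = strokes * 1.0 / strokesTwo;
                // Average the two ratios so a full match sums to 1.
                total += (scoreOne + scoreTwo) / 2.0;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        char[] ming = {RI, YUE};     // assumed split of the 8-stroke character "ming"
        char[] danGall = {YUE, DAN}; // assumed split of the 9-stroke character "dan" (gall)
        System.out.println(score(ming, 8, danGall, 9));
        System.out.println(score(ming, 8, ming, 8));
    }
}
```

Comparing a character with itself makes each component contribute its full share of the strokes, which is why the averaged sum reaches 1 in that case.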

(4) Number of strokes

To get the number of strokes, we can directly reuse the previous method.

If there is no match, the default stroke count is 1.

private int getNumber(String text, IHanziSimilarContext similarContext) {
    Map<String, Integer> map = similarContext.bihuashuData().dataMap();
    Integer number = map.get(text);
    if(number == null) {
        return 1;
    }
    return number;
}
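The same fallback can be written a little more compactly with Map.getOrDefault from the standard library. This is only a sketch, under the assumption that the stroke table is a plain Map<String, Integer> with no null values. StrokeCountSketch and its two-entry table are hypothetical and stand in for similarContext.bihuashuData().dataMap().

```java
import java.util.Map;

/** Stroke lookup with a default of 1; the table is a hand-written stand-in. */
final class StrokeCountSketch {
    // "\u65E5" is "ri" (sun) and "\u6708" is "yue" (moon); both have 4 strokes.
    static final Map<String, Integer> STROKES = Map.of("\u65E5", 4, "\u6708", 4);

    /** Returns the stroke count for the text, or 1 when the table has no entry. */
    static int getNumber(String text) {
        return STROKES.getOrDefault(text, 1);
    }

    public static void main(String[] args) {
        System.out.println(getNumber("\u65E5"));
        System.out.println(getNumber("?"));
    }
}
```

A default of 1 keeps the later division safe, at the cost of under-weighting any component that is missing from the table.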

Full Java implementation

We put all the pieces together and we get a complete implementation.

import java.util.List;
import java.util.Map;

// Imports for the library types (IHanziSimilar, IHanziSimilarContext,
// ChaiziHelper, ArrayPrimitiveUtil) are omitted here.

/**
 * Similarity based on character splitting.
 * @author The old horse shouts the west wind
 * @since 1.0.0
 */
public class ChaiziSimilar implements IHanziSimilar {

    @Override
    public double similar(IHanziSimilarContext similarContext) {
        String hanziOne = similarContext.charOne();
        String hanziTwo = similarContext.charTwo();

        // total stroke counts of the two characters
        int numberOne = getNumber(hanziOne, similarContext);
        int numberTwo = getNumber(hanziTwo, similarContext);

        // split
        char[] charsOne = getSplitChars(hanziOne);
        char[] charsTwo = getSplitChars(hanziTwo);

        int minLen = Math.min(charsOne.length, charsTwo.length);

        // compare
        double totalScore = 0.0;
        for(int i = 0; i < minLen; i++) {
            char iChar = charsOne[i];
            String textChar = iChar+"";
            if(ArrayPrimitiveUtil.contains(charsTwo, iChar)) {
                int textNumber = getNumber(textChar, similarContext);

                double scoreOne = textNumber*1.0 / numberOne * 1.0;
                double scoreTwo = textNumber*1.0 / numberTwo * 1.0;

                totalScore += (scoreOne + scoreTwo) / 2.0;
            }
        }

        return totalScore * similarContext.chaiziRate();
    }

    /**
     * Gets the split characters.
     * @param charWord the character
     * @return the result
     */
    private char[] getSplitChars(String charWord) {
        List<String> stringList = ChaiziHelper.chai(charWord.charAt(0));

        // A split has to be chosen here. For simplicity, the first one is chosen by default.
        String string = stringList.get(0);

        return string.toCharArray();
    }

    /**
     * Gets the stroke count.
     * @param text the text
     * @param similarContext the context
     * @return the result
     */
    private int getNumber(String text, IHanziSimilarContext similarContext) {
        Map<String, Integer> map = similarContext.bihuashuData().dataMap();

        Integer number = map.get(text);
        if(number == null) {
            return 1;
        }

        return number;
    }
}

Summary

This article introduced the splitting of Chinese characters to further enrich the similarity implementation.

Of course, the implementation itself still leaves plenty of room for improvement, such as which split to choose and whether the splitting could be recursive. These questions are left for others to explore.
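For anyone curious what recursive splitting might look like, here is a rough sketch, not something the library provides. It assumes a table that maps a character to its first split and keeps expanding components until no further split is known. The class RecursiveSplitSketch, the FIRST_SPLIT table, the maxDepth guard, and the two sample entries (胆 into 月 + 旦, and 旦 into 日 + 一) are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Recursive expansion over a hand-written split table; not part of the library. */
final class RecursiveSplitSketch {
    static final char DAN_GALL = '\u80C6'; // "dan" (gall)
    static final char YUE = '\u6708';      // "yue" (moon)
    static final char DAN_DAWN = '\u65E6'; // "dan" (dawn)
    static final char RI = '\u65E5';       // "ri" (sun)
    static final char YI = '\u4E00';       // "yi" (one)

    // Illustrative entries only: character -> its first split.
    static final Map<Character, String> FIRST_SPLIT = Map.of(
            DAN_GALL, "" + YUE + DAN_DAWN,
            DAN_DAWN, "" + RI + YI);

    /** Expands a character into leaf components, descending at most maxDepth levels. */
    static List<Character> splitRecursively(char hanzi, int maxDepth) {
        List<Character> leaves = new ArrayList<>();
        collect(hanzi, maxDepth, leaves);
        return leaves;
    }

    private static void collect(char hanzi, int depthLeft, List<Character> leaves) {
        String split = FIRST_SPLIT.get(hanzi);
        // Stop when no split is known, the depth budget is spent,
        // or the entry maps a character to itself.
        if (split == null || depthLeft == 0 || split.equals(String.valueOf(hanzi))) {
            leaves.add(hanzi);
            return;
        }
        for (char part : split.toCharArray()) {
            collect(part, depthLeft - 1, leaves);
        }
    }

    public static void main(String[] args) {
        System.out.println(splitRecursively(DAN_GALL, 8).size()); // number of leaf components
    }
}
```

The depth limit is there because a real dictionary could contain cycles. Finer components would also change the stroke weights, so the scoring would need to be revisited alongside any such change.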

I'm Lao Ma, and I look forward to seeing you next time.