Applying Bayesian inference: an approach to identifying whether a text string is an address, based on Chinese word segmentation

The scenario

Using the HanLP Chinese word segmentation library, we can obtain a segmentation result set for any input. For example, “Shenzhen Nanshan Software Industry Base” yields the following result:

[Shenzhen/ns, Nanshan District/ns, Software/n, Industry/n, Base/n]

Any other input likewise produces its own segmentation result set.

For address recognition, address text is often non-standard, so the two obvious approaches both fall short:

Dictionary-based regular-expression matching has clear limits: province, city, district, town and township keywords may not appear at all, and building names follow no discernible pattern.
Counting the words that are tagged ns or that end with an address keyword (province, city, district, county, road, street, bridge, building, number, stairwell, house, room, lane, avenue, garden, township, unit and so on), and then using their proportion in the segmentation result set, is prone to misjudgment because a suitable threshold is hard to determine.

Just as seeing a cloudy sky raises our estimate that it will rain, seeing an ns tag or a word ending in one of these keywords should raise the probability that the input is an address, while seeing words that are essentially unrelated to addresses, such as “we” and “you”, should lower it. The final decision can then be made from a weighted average over the parts of speech. In theory, this reduces the chance of misjudgment.

Bayes' formula

Bayes’ formula, which I learned in college, fits this scenario. Ruan Yifeng’s article “Bayesian Inference and Its Internet Applications (I): A Brief Introduction to the Theorem” already gives a good introduction.

Recognition approach

Conceptual design

Suppose that:

Event A is “the input is an address”, so P(A) is the probability that the input is an address.
Event B is “part of speech XX appears”, so P(B) is the probability that part of speech XX appears.

Then:

P(A|B) is the probability that the input is an address, given that part of speech XX appears.
P(B|A) is the probability that part of speech XX appears, given that the input is an address.

By Bayes' formula, P(A|B) can be calculated from P(B|A), P(A) and P(B).
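For reference, with events A and B defined as above, the standard form of Bayes' formula is:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```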

Calculate P(A|B) for every part of speech. To predict whether a new input is an address, segment it, sum the P(A|B) values of the parts of speech in the result, take the average, and compare that average with a threshold.
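The prediction step can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `default` value for unseen tags, and the numbers in `P_ADDR_GIVEN_TAG` are all hypothetical, and averaging once per token (rather than once per distinct tag) is one possible reading of the description.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def address_score(tokens: List[Token],
                  p_addr_given_tag: Dict[str, float],
                  default: float = 0.5) -> float:
    """Average P(A|B) over the tags of the segmented tokens.

    `default` (an assumption) is used for tags missing from the table.
    """
    if not tokens:
        return 0.0
    probs = [p_addr_given_tag.get(tag, default) for _, tag in tokens]
    return sum(probs) / len(probs)


def is_address(tokens: List[Token],
               p_addr_given_tag: Dict[str, float],
               threshold: float) -> bool:
    """Compare the averaged score with the chosen threshold."""
    return address_score(tokens, p_addr_given_tag) >= threshold


# Hypothetical P(A|B) values, for illustration only.
P_ADDR_GIVEN_TAG = {"ns": 0.9, "n": 0.6, "r": 0.1}

tokens = [("Shenzhen", "ns"), ("Nanshan District", "ns"),
          ("Software", "n"), ("Industry", "n"), ("Base", "n")]
score = address_score(tokens, P_ADDR_GIVEN_TAG)  # (0.9*2 + 0.6*3) / 5
```

Because every token contributes once, a tag that occurs several times in the input weighs more heavily in the average.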

Data preparation

Calculating the prior probability requires sample data, and the quality of that data directly affects how well Bayesian inference works, so the samples should reflect real-world input as closely as possible.

1. Positive samples

Prepare 22,000 real addresses.

2. Negative samples

Extract article text from a news site, cut the articles into simple sentences at “,” and “.”, filter the sentences by length, and obtain 10,000 negative samples.
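A minimal sketch of the sentence-cutting step is shown below. The function name and the length bounds `min_len` and `max_len` are assumptions; the length limits actually used are not stated here. Both ASCII and full-width Chinese commas and periods are treated as separators.

```python
import re
from typing import List


def split_simple_sentences(article: str,
                           min_len: int = 4,
                           max_len: int = 40) -> List[str]:
    """Cut article text at commas and periods, then filter by length.

    min_len and max_len are hypothetical bounds chosen for illustration.
    """
    # Separators: ASCII comma/period and their full-width Chinese forms.
    parts = re.split(r"[,.，。]", article)
    cleaned = [part.strip() for part in parts]
    return [part for part in cleaned if min_len <= len(part) <= max_len]


sentences = split_simple_sentences("Today it rained heavily, we stayed home. Ok.")
```

Each surviving fragment would then be segmented in the same way as the positive samples, so that both sets are described by part-of-speech tags.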

Probability calculation

1. Calculate P(A)

P(A) follows directly from the prepared data: number of positive samples / total number of samples.

2. Calculate P(B)

Collect every part of speech that appears and, for each one, calculate: number of samples containing that part of speech / total number of samples.

3. Calculate P(B|A)

For each part of speech, calculate: number of positive samples containing that part of speech / number of positive samples.

4. Calculate P(A|B)

From the three values above and Bayes' formula, we obtain the probability that the input is an address when a given part of speech appears.
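The four steps above can be sketched as one function. This is an illustrative sketch, not the original implementation: the function name is hypothetical, each sample is assumed to be the list of tags from one segmented input, and a tag is counted at most once per sample.

```python
from collections import Counter
from typing import Dict, List


def tag_posteriors(pos_samples: List[List[str]],
                   neg_samples: List[List[str]]) -> Dict[str, float]:
    """Return P(A|B) for every tag B seen in the samples."""
    n_pos = len(pos_samples)
    n_total = n_pos + len(neg_samples)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")

    p_a = n_pos / n_total  # step 1: P(A)

    in_pos = Counter()  # positive samples containing each tag
    in_all = Counter()  # all samples containing each tag
    for sample in pos_samples:
        tags = set(sample)  # count a tag once per sample
        in_pos.update(tags)
        in_all.update(tags)
    for sample in neg_samples:
        in_all.update(set(sample))

    posteriors = {}
    for tag, count in in_all.items():
        p_b = count / n_total             # step 2: P(B)
        p_b_given_a = in_pos[tag] / n_pos  # step 3: P(B|A)
        posteriors[tag] = p_b_given_a * p_a / p_b  # step 4: Bayes
    return posteriors


# Tiny made-up data set, for illustration only.
pos = [["ns", "ns", "n"], ["ns", "n"], ["ns"]]
neg = [["r", "v"], ["n", "v"]]
table = tag_posteriors(pos, neg)
```

With the sample sizes given in the data preparation, P(A) = 22,000 / (22,000 + 10,000) = 0.6875. Under these counting assumptions the sample totals cancel, so P(A|B) reduces to the number of positive samples containing the tag divided by the number of all samples containing it.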

Follow-up adjustment

Because of the segmenter's limitations, a word ending in one of the address keywords (province, city, district, county, road, street, bridge, building, number, stairwell, room, house, lane, avenue, garden, township, unit and so on) is sometimes not tagged “ns”, even though such a word does make it more likely that the input is an address.

Therefore, the probability for words matching this pattern is manually set to the probability that the input is an address when the ns part of speech appears.
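One way this adjustment might look in code is sketched below. The function name and the `default` value are hypothetical, and the Chinese characters in `KEYWORD_SUFFIXES` are an assumed back-translation of only a few of the keywords (province, city, district, county, road, street); the full list should be taken from the original keyword set.

```python
from typing import Dict, Tuple

# ASSUMPTION: a partial, back-translated keyword list for illustration.
KEYWORD_SUFFIXES: Tuple[str, ...] = ("省", "市", "区", "县", "路", "街")


def token_probability(word: str,
                      tag: str,
                      p_addr_given_tag: Dict[str, float],
                      suffixes: Tuple[str, ...] = KEYWORD_SUFFIXES,
                      default: float = 0.5) -> float:
    """Return P(A|B) for one token, applying the keyword override.

    A word that ends in an address keyword but was not tagged ns is
    scored as if it had been tagged ns.
    """
    if tag != "ns" and word.endswith(suffixes):
        return p_addr_given_tag.get("ns", default)
    return p_addr_given_tag.get(tag, default)


# Hypothetical P(A|B) values, for illustration only.
example_table = {"ns": 0.9, "n": 0.6}
overridden = token_probability("科苑路", "n", example_table)
unchanged = token_probability("软件", "n", example_table)
```

Checking the suffix per token keeps the override local: only the mis-tagged word is rescored, and the rest of the input is still averaged by its own tags.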

Determining the threshold

The probability threshold was chosen from the scores calculated on positive and negative test data sets, so that the misjudgment rate stayed within 1%.
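A simple way to search for such a threshold is sketched below. It assumes that “misjudgment rate” means the share of all test inputs that are classified wrongly, and that an input counts as an address when its score is at or above the threshold; the function names and scores are hypothetical.

```python
from typing import List, Tuple


def misjudgment_rate(threshold: float,
                     pos_scores: List[float],
                     neg_scores: List[float]) -> float:
    """Share of test inputs classified wrongly at this threshold."""
    missed = sum(1 for s in pos_scores if s < threshold)        # addresses rejected
    false_alarms = sum(1 for s in neg_scores if s >= threshold)  # non-addresses accepted
    return (missed + false_alarms) / (len(pos_scores) + len(neg_scores))


def best_threshold(pos_scores: List[float],
                   neg_scores: List[float]) -> Tuple[float, float]:
    """Try every observed score as a threshold; return (threshold, rate)."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return min(((t, misjudgment_rate(t, pos_scores, neg_scores))
                for t in candidates),
               key=lambda pair: pair[1])


# Made-up scores, for illustration only.
pos_scores = [0.82, 0.75, 0.90, 0.68]
neg_scores = [0.20, 0.35, 0.50, 0.41]
threshold, rate = best_threshold(pos_scores, neg_scores)
within_target = rate <= 0.01  # the 1% target mentioned above
```

If no candidate reaches the target rate, that suggests revisiting the samples or the per-tag probabilities rather than the threshold alone.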