These are study notes on Introduction to Natural Language Processing by He Han.
Part-of-speech tagging
- Part-of-speech (POS): the grammatical category of a word. POS tags provide an abstract representation of words.
- Part-of-speech tag set: the set of all part-of-speech tags.
Examples
Tag | Description | Chinese examples | English examples |
---|---|---|---|
r | pronoun | I you he | me you he |
u | particle | If you get it, you get it | oh well be |
n | noun | Sun and Moon Nanjing | city house name |
v | verb | Go run and jump | run sit study |
p | preposition | Was that to | at on in before after |
a | adjective | Cute, smart, smart. | beautiful pretty good |
nr | person name | Ma Yun Ma Huateng | James Kobe Bush |
- Part-of-speech tagging: The task of predicting a part-of-speech tag for each word in a sentence.
- Pipeline lexical analyzer: a word segmenter is trained on a large-scale corpus and then flexibly combined with a part-of-speech tagging model trained on a smaller POS-annotated corpus.
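The pipeline idea in the last bullet can be sketched as two independent stages that are composed at analysis time. This is only an illustration: the class name `PipelineAnalyzerSketch`, the whitespace "segmenter", and the toy lexicon "tagger" are all hypothetical stand-ins for models that would really be trained on corpora.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class PipelineAnalyzerSketch {
    // Stage 1 (hypothetical stand-in): a "segmenter" that only splits on whitespace.
    static final Function<String, List<String>> SEGMENTER =
            text -> Arrays.asList(text.trim().split("\\s+"));

    // Toy lexicon standing in for a tagger trained on a small POS corpus.
    static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("Bush", "nr");
        LEXICON.put("will", "d");
        LEXICON.put("become", "v");
    }

    // Stage 2 (hypothetical stand-in): look each word up, defaulting to noun.
    static final Function<List<String>, List<String>> TAGGER = words -> {
        List<String> tags = new ArrayList<>();
        for (String word : words) {
            tags.add(LEXICON.getOrDefault(word, "n"));
        }
        return tags;
    };

    // The pipeline: segment first, then tag the resulting words.
    static List<String> analyze(String text) {
        List<String> words = SEGMENTER.apply(text);
        List<String> tags = TAGGER.apply(words);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            result.add(words.get(i) + "/" + tags.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", analyze("Bush will become president")));
    }
}
```

Because the two stages only share the list of words, either one can be retrained or swapped without touching the other, which is the flexibility the bullet refers to.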
PKU format annotation:
We/r China/ns so/r big/a /u a/m many/a nation/n /u country/n if/c not/d unity/a ,/w will/d not/d may/v development/v economy/n ,/w people/n life/n standard/n also/d on/d not/d may/v get/v improve/vn and/c improve/vn ./w
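As a rough illustration of how such a line can be read, the sketch below splits a PKU-style line into word and tag pairs. The class name `PkuLineParser` is hypothetical, and the code assumes whitespace-separated `word/tag` tokens; bracketed compound entities are not handled here, and tokens with an empty word are skipped.

```java
import java.util.ArrayList;
import java.util.List;

class PkuLineParser {
    /** Splits one PKU-style line into {word, tag} pairs. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // The tag follows the last slash, so a word may itself contain '/'.
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no slash, or nothing before it: not a usable token
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            pairs.add(new String[]{word, tag});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("We/r China/ns big/a country/n ./w")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```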
Part-of-speech tagging based on perceptron
Bush will become President of the United States
Bush/nr will/d become/v US/ns President/n
Transition features | State features | Example | Example |
---|---|---|---|
_B_1 -> 1 | | | |
Bush | | | |
will | | | |
First character | | | |
Prefix of length 2 | | | |
Prefix of length 3 | | | |
Last character | | | |
Suffix of length 2 | | | |
Suffix of length 3 | | | |
The concrete features are extracted in POSInstance:
// Method excerpt from POSInstance; it relies on java.util.List and java.util.ArrayList.
protected int[] extractFeature(String[] words, FeatureMap featureMap, int position) {
    List<Integer> featVec = new ArrayList<Integer>();
    // Previous word, or the "_B_" marker at the start of the sentence
    String preWord = position >= 1 ? words[position - 1] : "_B_";
    // Current word
    String curWord = words[position];
    // Next word, or the "_E_" marker at the end of the sentence
    String nextWord = position <= words.length - 2 ? words[position + 1] : "_E_";

    // Feature 1: previous word
    StringBuilder sbFeature = new StringBuilder();
    sbFeature.append(preWord).append('1');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    // Feature 2: current word
    sbFeature.append(curWord).append('2');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    // Feature 3: next word
    sbFeature.append(nextWord).append('3');
    addFeatureThenClear(sbFeature, featVec, featureMap);

    int length = curWord.length();
    // prefix(w0, i) (i = 1, 2, 3)
    // First character of the current word, e.g. 中 for 中华民族 (Zhonghua minzu)
    sbFeature.append(curWord.substring(0, 1)).append('4');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    if (length > 1) {
        // First two characters of the current word, e.g. 中华 for 中华民族
        sbFeature.append(curWord.substring(0, 2)).append('4');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }
    if (length > 2) {
        // First three characters of the current word, e.g. 中华民 for 中华民族
        sbFeature.append(curWord.substring(0, 3)).append('4');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }

    // suffix(w0, i) (i = 1, 2, 3)
    // Last character of the current word, e.g. 族 for 中华民族
    sbFeature.append(curWord.charAt(length - 1)).append('5');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    if (length > 1) {
        // Last two characters of the current word, e.g. 民族 for 中华民族
        sbFeature.append(curWord.substring(length - 2)).append('5');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }
    if (length > 2) {
        // Last three characters of the current word, e.g. 华民族 for 中华民族
        sbFeature.append(curWord.substring(length - 3)).append('5');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }
    return toFeatureArray(featVec);
}
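To see what the template produces, here is a minimal, self-contained sketch that builds the same feature strings but returns them directly instead of mapping them to integer ids. The class name `PosFeatureSketch` is hypothetical, and the lookup through a feature map is deliberately left out; only the template logic mirrors the method above. Words are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;

class PosFeatureSketch {
    /** Returns the feature strings for words[position], without id mapping. */
    static List<String> extract(String[] words, int position) {
        List<String> features = new ArrayList<>();
        String preWord = position >= 1 ? words[position - 1] : "_B_";
        String curWord = words[position];
        String nextWord = position <= words.length - 2 ? words[position + 1] : "_E_";

        // Word features, each suffixed with its template number
        features.add(preWord + '1');
        features.add(curWord + '2');
        features.add(nextWord + '3');

        int length = curWord.length();
        // Prefixes of length 1..3 (template 4), bounded by the word length
        for (int k = 1; k <= Math.min(3, length); k++) {
            features.add(curWord.substring(0, k) + '4');
        }
        // Suffixes of length 1..3 (template 5), bounded by the word length
        for (int k = 1; k <= Math.min(3, length); k++) {
            features.add(curWord.substring(length - k) + '5');
        }
        return features;
    }

    public static void main(String[] args) {
        String[] words = {"Bush", "will", "become", "US", "President"};
        // prints [_B_1, Bush2, will3, B4, Bu4, Bus4, h5, sh5, ush5]
        System.out.println(extract(words, 0));
    }
}
```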
Named entity recognition
Named entity recognition based on sequence labeling
Entity boundaries are marked with the {B, M, E, S} tag set, and the entity category is carried by an additional category label, as in B-nt. A general-purpose corpus can be converted into a sequence-labeled named entity recognition corpus, for example:
Sahaf/nr said/v ,/w Iraq/ns will/d with/p [UN/nt destruction/v Iraq/ns WMD/n weapons/n special/a commission/n]/nt continue/v maintain/v cooperate/v ./w
Word (input) | POS tag (input) | NER label (output) |
---|---|---|
Sahaf | nr | S |
said | v | O |
, | w | O |
Iraq | ns | S |
will | d | O |
with | p | O |
UN | nt | B-nt |
destruction | v | M-nt |
commission | n | E-nt |
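The conversion shown in the table can be sketched roughly as follows. The class name `NerCorpusConverter` is hypothetical, and two things are assumptions rather than facts from the notes: that the entity tags are nr, ns and nt, and that a single-word entity is labeled plain S as in the table. The input is assumed to be a well-formed PKU-style line in which a compound entity is written as `[word/tag ... word/tag]/category`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NerCorpusConverter {
    // Assumption: these POS tags mark named entities (person, place, organization).
    static final Set<String> ENTITY_TAGS = new HashSet<>(Arrays.asList("nr", "ns", "nt"));

    /** Returns rows of {word, posTag, nerLabel} for one well-formed PKU-style line. */
    static List<String[]> convert(String line) {
        List<String[]> rows = new ArrayList<>();
        int compoundStart = -1; // row index of the first word of an open [...] span
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("[")) {
                token = token.substring(1);
                compoundStart = rows.size();
            }
            String category = null;
            int close = token.indexOf("]/");
            if (close >= 0) {
                category = token.substring(close + 2);
                token = token.substring(0, close);
            }
            int slash = token.lastIndexOf('/');
            String word = token.substring(0, slash);
            String pos = token.substring(slash + 1);
            // Outside a compound: S for a single-word entity, O otherwise.
            // Inside a compound the label is provisional and rewritten on close.
            String label = compoundStart >= 0 ? "M" : (ENTITY_TAGS.contains(pos) ? "S" : "O");
            rows.add(new String[]{word, pos, label});

            if (category != null && compoundStart >= 0) {
                int end = rows.size() - 1;
                for (int i = compoundStart; i <= end; i++) {
                    if (compoundStart == end) {
                        rows.get(i)[2] = "S"; // one-word compound
                    } else {
                        String boundary = i == compoundStart ? "B" : (i == end ? "E" : "M");
                        rows.get(i)[2] = boundary + "-" + category;
                    }
                }
                compoundStart = -1;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "Sahaf/nr said/v ,/w Iraq/ns will/d with/p "
                + "[UN/nt destruction/v Iraq/ns WMD/n weapons/n special/a commission/n]/nt "
                + "continue/v maintain/v cooperate/v ./w";
        for (String[] row : convert(line)) {
            System.out.println(row[0] + "\t" + row[1] + "\t" + row[2]);
        }
    }
}
```

Note that a word inside the bracketed span, such as the inner Iraq/ns, is labeled M-nt rather than S, because the enclosing compound entity takes precedence.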
Feature extraction
Transition features | Word features | Part-of-speech features |
---|---|---|