These are study notes on Introduction to Natural Language Processing by He Han.

Part-of-speech tagging

Part of speech (POS): the grammatical category of a word. Its role is to provide an abstract representation of the word.
Part-of-speech tag set: the set of all part-of-speech tags.

Examples

Tag	Description	Chinese examples (glossed in English)	English examples
r	pronoun	I you he	me you he
u	particle	de (的, 地, 得)	oh well be
n	noun	sun and moon, Nanjing	city house name
v	verb	go run jump	run sit study
p	preposition	was that to	at on in before after
a	adjective	cute, smart	beautiful pretty good
nr	person name	Ma Yun, Ma Huateng	James Kobe Bush

Part-of-speech tagging: the task of predicting a part-of-speech tag for each word in a sentence.
With a pipeline-style lexical analyzer, the word segmenter can be trained on a large segmentation corpus and then combined flexibly with a part-of-speech tagging model trained on a smaller POS-tagged corpus.
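
The pipeline idea can be illustrated with a minimal sketch in which the segmenter and the tagger are two independent stages that only meet at a list of words. Everything here is hypothetical: the class `PipelineSketch`, the whitespace splitter, and the lexicon lookup are toy stand-ins for trained models, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PipelineSketch {
    /** Stage 1 stand-in: a whitespace splitter in place of a trained segmenter. */
    public static final Function<String, List<String>> SEGMENTER =
            text -> Arrays.asList(text.trim().split("\\s+"));

    /** Stage 2 stand-in: a lexicon lookup in place of a trained tagger. */
    public static Function<List<String>, List<String>> lookupTagger(Map<String, String> lexicon) {
        return words -> {
            List<String> tags = new ArrayList<>();
            for (String word : words) {
                // Assumption of this sketch: unknown words default to noun.
                tags.add(lexicon.getOrDefault(word, "n"));
            }
            return tags;
        };
    }

    /** Runs the two stages in sequence and joins the results as word/tag. */
    public static List<String> analyze(String text,
                                       Function<String, List<String>> segmenter,
                                       Function<List<String>, List<String>> tagger) {
        List<String> words = segmenter.apply(text);
        List<String> tags = tagger.apply(words);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            result.add(words.get(i) + "/" + tags.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("Bush", "nr");
        lexicon.put("will", "d");
        lexicon.put("become", "v");
        lexicon.put("US", "ns");
        System.out.println(analyze("Bush will become US President", SEGMENTER, lookupTagger(lexicon)));
        // prints [Bush/nr, will/d, become/v, US/ns, President/n]
    }
}
```

Because the stages share only the word list, either one can be retrained or swapped without touching the other, which is what makes training on corpora of different sizes practical.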

PKU format annotation (each token is written as word/tag; the words below are an English gloss of a Chinese sentence):

We/r China/ns so/r big/a de/u a/m many/a nation/n de/u country/n if/c not/d unity/a ,/w will/d not/d may/v development/v economy/n ,/w people/n life/n standard/n also/d on/d not/d may/v get/v improve/vn and/c improve/vn ./w
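
To make the word/tag layout concrete, here is a minimal sketch of reading one such line into word and tag pairs. The class `PkuParser` is a hypothetical helper, not a library API, and it assumes plain tokens only (no bracketed compound words).

```java
import java.util.ArrayList;
import java.util.List;

public class PkuParser {
    /** Splits one PKU-format line into {word, tag} pairs. */
    public static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty line yields a single empty token
            }
            // Split on the last slash so that a word containing '/' stays intact.
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                throw new IllegalArgumentException("Not in word/tag form: " + token);
            }
            pairs.add(new String[] {token.substring(0, slash), token.substring(slash + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("people/n life/n standard/n ./w")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```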

Part-of-speech tagging based on perceptron

Bush will become President of the United States
Bush/nr will/d become/v US/ns President/n

Transition features	State features	Example
	previous word	_B_1 -> 1
	current word	bush
	next word	will
	first character
	length-2 prefix
	length-3 prefix
	last character
	length-2 suffix
	length-3 suffix

The features are extracted in POSInstance:

protected int[] extractFeature(String[] words, FeatureMap featureMap, int position) {
    List<Integer> featVec = new ArrayList<Integer>();
    // The previous word
    String preWord = position >= 1 ? words[position - 1] : "_B_";
    // Current word
    String curWord = words[position];
    // Next word
    String nextWord = position <= words.length - 2 ? words[position + 1] : "_E_";

    // The previous word
    StringBuilder sbFeature = new StringBuilder();
    sbFeature.append(preWord).append('1');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    // Current word
    sbFeature.append(curWord).append('2');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    // Next word
    sbFeature.append(nextWord).append('3');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    
    int length = curWord.length();
    // First character of the current word, e.g. 中 for 中华民族 ("Chinese nation")
    sbFeature.append(curWord.substring(0, 1)).append('4');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    if (length > 1) {
        // First two characters of the current word, e.g. 中华 for 中华民族
        sbFeature.append(curWord.substring(0, 2)).append('4');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }
    if (length > 2) {
        // First three characters of the current word, e.g. 中华民 for 中华民族
        sbFeature.append(curWord.substring(0, 3)).append('4');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }

    // Suffix(w0, i) for i = 1, 2, 3
    // Last character of the current word, e.g. 族 for 中华民族
    sbFeature.append(curWord.charAt(length - 1)).append('5');
    addFeatureThenClear(sbFeature, featVec, featureMap);
    if (length > 1) {
        // Last two characters of the current word, e.g. 民族 for 中华民族
        sbFeature.append(curWord.substring(length - 2)).append('5');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }
    if (length > 2) {
        // Last three characters of the current word, e.g. 华民族 for 中华民族
        sbFeature.append(curWord.substring(length - 3)).append('5');
        addFeatureThenClear(sbFeature, featVec, featureMap);
    }

    return toFeatureArray(featVec);
}
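
To see what these templates produce, the following standalone sketch mirrors the method above but returns the feature strings themselves instead of looking up their ids. The class `PosFeatureSketch` is a hypothetical illustration. It omits the FeatureMap lookup and, unlike the method above, simply emits no prefix or suffix features for an empty word.

```java
import java.util.ArrayList;
import java.util.List;

public class PosFeatureSketch {
    /** Returns the feature strings for words[position]. */
    public static List<String> extract(String[] words, int position) {
        List<String> features = new ArrayList<>();
        String preWord = position >= 1 ? words[position - 1] : "_B_";
        String curWord = words[position];
        String nextWord = position <= words.length - 2 ? words[position + 1] : "_E_";

        // Templates 1 to 3: previous, current, and next word.
        features.add(preWord + '1');
        features.add(curWord + '2');
        features.add(nextWord + '3');

        int length = curWord.length();
        // Template 4: prefixes of length 1 to 3, as far as the word allows.
        for (int n = 1; n <= Math.min(3, length); n++) {
            features.add(curWord.substring(0, n) + '4');
        }
        // Template 5: suffixes of length 1 to 3, as far as the word allows.
        for (int n = 1; n <= Math.min(3, length); n++) {
            features.add(curWord.substring(length - n) + '5');
        }
        return features;
    }

    public static void main(String[] args) {
        String[] words = {"Bush", "will", "become", "US", "President"};
        System.out.println(extract(words, 0));
        // prints [_B_1, Bush2, will3, B4, Bu4, Bus4, h5, sh5, ush5]
    }
}
```

The trailing digit identifies the template, so the same string appearing as the previous word and as the current word yields two distinct features.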

Named entity recognition

Named entity recognition based on sequence labeling

Entity boundaries are marked with the {B, M, E, S} tags, and the entity category with an additional category label, as in B-nt. A general-purpose corpus can be converted into a sequence-labeled named entity recognition corpus, for example:

Sahaf/nr said/v ,/w Iraq/ns will/d with/p [UN/nt destruction/v Iraq/ns WMD/n weapons/n special/a Commission/n]/nt continue/v maintain/v cooperate/v ./w
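
The conversion can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the class `NerTagSketch` is hypothetical, the entity categories are assumed to be nr, ns, and nt, a single-word entity is tagged S plus its category, everything else is tagged O, and no word is assumed to start with a literal "[".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NerTagSketch {
    // Assumption: these three tags are the entity categories of interest.
    static final Set<String> ENTITY = new HashSet<>(Arrays.asList("nr", "ns", "nt"));

    /** Converts one annotated line into "word/NER-tag" strings. */
    public static List<String> convert(String line) {
        List<String> out = new ArrayList<>();
        List<String> inner = null; // words of the compound currently open, if any
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("[")) { // a compound word opens here
                inner = new ArrayList<>();
                token = token.substring(1);
            }
            String label = null; // category of the compound, set when it closes
            int close = token.indexOf("]/");
            if (close >= 0) {
                label = token.substring(close + 2);
                token = token.substring(0, close);
            }
            int slash = token.lastIndexOf('/');
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            if (inner == null) { // plain word outside any compound
                out.add(word + "/" + (ENTITY.contains(tag) ? "S-" + tag : "O"));
                continue;
            }
            inner.add(word);
            if (label != null) { // compound closes: emit B, M..., E over its words
                int n = inner.size();
                for (int i = 0; i < n; i++) {
                    String ner;
                    if (!ENTITY.contains(label)) ner = "O";
                    else if (n == 1) ner = "S-" + label;
                    else if (i == 0) ner = "B-" + label;
                    else if (i == n - 1) ner = "E-" + label;
                    else ner = "M-" + label;
                    out.add(inner.get(i) + "/" + ner);
                }
                inner = null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(convert("Sahaf/nr said/v [UN/nt special/a Commission/n]/nt ./w"));
        // prints [Sahaf/S-nr, said/O, UN/B-nt, special/M-nt, Commission/E-nt, ./O]
    }
}
```

Note that inside a compound the inner tags are ignored: UN/nt becomes B-nt because it opens the organization name, not because it is itself tagged nt.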