What is word representation?

In normal human language, every word has its own meaning. In the human world, we have all kinds of dictionaries to record and define words for us to understand and look up. So for machines, how do they express and record words? Wordnet, from Princeton University, is one solution. WordNet is a huge database of English words, different types of words, such as: Noun, verb, Adjectives and adverbs are grouped into cognitive synsets (synsets), and each word has its own synset list and corresponding synonyms for hypernyms by conceptual semantics and lexical relations Relations). From this, you can call the wordNet package in NLTK to find synonym sets and superwords for words.

Use WordNet to find a synonym set for “good” :

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

poses = {'n':'noun'.'v':'verb'.'s':'adj (s)'.'a':'adj'.'r':'adv'}

for synset in wn.synsets("happy") :print("{}, {}".format(poses[synset.pos()],",".join([lemma.name() for lemma in synset.lemmas()])))
Copy the code

Get results:

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
adj:  happy
adj (s):  felicitous, happy
adj (s):  glad, happy
adj (s):  happy, well-chosen
Copy the code

Use WordNet to find superwords for “cat” :

from nltk.corpus import wordnet as wn
cat = wn.synset("cat.n.01")
hyper = lambda s: s.hypernyms()
list(cat.closure(hyper))
Copy the code

Get results:

[Synset('feline.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
Copy the code

The problem with corpora like WordNet

Wordnet, like a human dictionary, is a good corpus. But its shortcomings are as obvious as our human dictionary, and WordNet has no way of even adding new words to its vocabulary as we come up with them. Therefore, if you want a corpus like WordNet to keep up with the trend of The Times, a large number of manual input is inevitable, which is also the common disease of many corpora. Also, as you can see from the above two examples, WordNet is missing an important part of machine learning languages, which is the similarity between two words. Wordnet doesn’t offer any opportunity for machines to learn the similarity between two words, which is one of its big weaknesses.