1. Gensim

Gensim: Topic modelling for Humans

Gensim is an open-source, third-party Python toolkit for unsupervised learning of latent topic-vector representations from raw, unstructured text. It supports TF-IDF, LSA, LDA, Word2Vec and other topic-model algorithms, supports distributed training, and provides commonly used APIs such as similarity calculation and information retrieval.

These algorithms are unsupervised, meaning no human annotation is required; only a plain-text corpus is needed.
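For example, a minimal TF-IDF pipeline needs nothing but tokenized plain text; the three toy documents below are made up for illustration:

from gensim import corpora, models

# A toy corpus: each document is simply a list of tokens
texts = [["human", "computer", "interaction"],
         ["graph", "trees", "computer"],
         ["graph", "minors", "trees"]]

dictionary = corpora.Dictionary(texts)                  # map each token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                       # learn TF-IDF weights, no labels needed
print(tfidf[corpus[0]])                                 # the first document, re-weighted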

Features

  • Memory independence: training never requires the entire corpus to be held in RAM at once (it can handle large, Web-scale corpora)
  • Memory sharing: trained models can be saved to disk and loaded back via mmap; multiple processes can share the same data, reducing the RAM footprint.
  • Contains implementations of several popular vector space algorithms, such as Word2Vec, Doc2Vec, FastText, TF-IDF, latent semantic analysis (LSI/LSA), LDA, and so on.
  • I/O wrappers and readers for several popular data formats.
  • Semantic similarity queries over documents.

Design goals

  • Simple interfaces and APIs
  • Memory independence: all intermediate steps and algorithms run in streaming mode, accessing one document at a time (see the sketch below).
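A minimal sketch of what streaming means in practice, assuming a hypothetical file corpus.txt with one whitespace-tokenized document per line:

from gensim import corpora

class StreamingCorpus:
    """Yield one bag-of-words document at a time; the corpus never sits in RAM."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:  # one document per line
                yield self.dictionary.doc2bow(line.lower().split())

# The dictionary itself can be built in a single streaming pass
dictionary = corpora.Dictionary(line.lower().split() for line in open("corpus.txt"))
corpus = StreamingCorpus("corpus.txt", dictionary)  # any gensim model can consume this iterable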

2. Save and load

Note the distinction between saving and loading the model and saving and loading the word vector file:

  • Model saving and loading: all state from training, such as the weights, the binary tree and the vocabulary frequencies, is retained, so retraining/additional training is possible after loading
  • Word vector file saving and loading: the training state is discarded, so retraining/additional training is not possible after loading

Please refer to gensim: API Reference for details

2.1 Model saving and loading

Save the model

Use the model.save() method. Models saved this way can be retrained (additionally trained) after loading, because all the training state is preserved.

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Train Word2Vec vectors (common_texts is a tiny toy corpus bundled with gensim)
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
# Save the model
model.save("word2vec.model")

If you need to continue training, you will need the complete Word2Vec object state as stored by save(), not just the KeyedVectors.

Load the model

Load the model for retraining

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")
model.train([["hello", "world"]], total_examples=1, epochs=1)

The trained word vectors are stored in a KeyedVectors instance, accessible via model.wv:

vector = model.wv['computer'] # numpy vector of a word

If you have finished training the model (that is, no more updates, only queries), switch to a KeyedVectors instance and free the full model:

word_vectors = model.wv
del model

2.2 Word vector file saving and loading

Save the trained word vector file

Save

  1. Use model.wv.save to save the word vectors as a KeyedVectors instance. A model saved this way loses the full model state and cannot be retrained, but the saved object is smaller and faster to load.
    model.wv.save("model.wv")
  2. Use wv.save_word2vec_format to save the word vector file (the older model.save_word2vec_format() is deprecated)
    model.wv.save_word2vec_format("model.bin", binary=True)

Load

  1. Use KeyedVectors.load to load the word vector file into a KeyedVectors instance (for when the full model state is not needed and no further training will be done):
    from gensim.models import KeyedVectors
    wv = KeyedVectors.load("model.wv", mmap='r')
    vector = wv['computer'] # numpy vector of a word
  2. Use KeyedVectors.load_word2vec_format() to load word vectors in word2vec C format into a KeyedVectors instance. Word vector files can be loaded in two formats: C text format and C binary format:
    from gensim.models import KeyedVectors
    wv_from_text = KeyedVectors.load_word2vec_format("model_kv_c", binary=False) # C text format
    wv_from_bin = KeyedVectors.load_word2vec_format("model_kv.bin", binary=True) # C bin format

Vectors loaded from the C format cannot continue training, because the hidden weights, vocabulary frequencies and binary tree are missing.
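The loaded vectors still answer similarity queries; only training is unavailable. A quick check, assuming 'computer' actually appears in the file's vocabulary:

# Queries work as usual on the loaded KeyedVectors
print(wv_from_bin.most_similar("computer", topn=3))
# KeyedVectors has no train() method, so additional training is simply not possible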

3. KeyedVectors

The module models.KeyedVectors implements word vectors and similarity lookups. Trained word vectors are independent of how they were trained, so they can be represented by a standalone structure.

This structure is called KeyedVectors and is essentially a mapping between entities and vectors. Each entity is identified by its string ID, so this is a mapping between strings and 1-dimensional arrays. An entity usually corresponds to a word, so words are mapped to one-dimensional vectors, but for some models an entity can also correspond to a document, an image, or something else.

KeyedVectors differ from the full model in that they cannot be trained further; in exchange they have a smaller RAM footprint and a simpler interface.
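Concretely, the mapping behaves much like a read-only dict from strings to numpy arrays. A minimal sketch, where wv stands for any trained KeyedVectors instance (for example model.wv from section 2.1):

vector = wv['computer']          # string key -> 1-dimensional numpy vector
print(vector.shape)              # e.g. (100,)
print('computer' in wv.vocab)    # the keys are the vocabulary strings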

3.1 How to obtain word vectors

Train a complete model, then obtain its model.wv property, which holds the standalone KeyedVectors. For example, train vectors with Word2Vec:

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv

Load the word vector file from disk

from gensim.models import KeyedVectors

word_vectors.save("vectors_wv")
word_vectors = KeyedVectors.load("vectors_wv", mmap='r')

Load a raw word vector file in Google's word2vec C format from disk as a KeyedVectors instance:

from gensim.test.utils import datapath

wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
wv_from_bin = KeyedVectors.load_word2vec_format(datapath('word2vec_vector.bin'), binary=True)  # C bin format

3.2 What can be done with these word vectors?

These word vectors can be used for a variety of syntactic and semantic NLP tasks:

>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data
>>>
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.7699

>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
# Similarity between two words
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
# List of words closest to the specified word
>>> result = word_vectors.similar_by_word("cat")
>>> print("{}: {:.4f}".format(*result[0]))
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
# Two sentence WMD distance
>>> similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print("{:.4f}".format(similarity))
3.4893
# Distance between two words
>>> distance = word_vectors.distance("media", "media")
>>> print("{:.1f}".format(distance))
0.0
# Similarity between two sentences
>>> sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print("{:.4f}".format(sim))
0.7067
# word vector
>>> vector = word_vectors['computer']  # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> vector = word_vectors.word_vec('office', use_norm=True)
>>> vector.shape
(100,)