1. Gensim
Gensim: Topic modelling for Humans
Gensim is an open-source, third-party Python toolkit for unsupervised learning of the latent semantic structure of raw, unstructured text, represented as topic/word vectors. It supports TF-IDF, LSA, LDA, Word2Vec and other topic-model algorithms, supports distributed training, and provides commonly used APIs for similarity computation, information retrieval, and so on.
These algorithms are unsupervised, meaning no human annotation is required; a plain-text corpus is enough.
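To give a feel for these APIs, here is a minimal sketch of a TF-IDF plus similarity-query pipeline; the toy corpus and query below are illustrative assumptions, not taken from this article.
from gensim import corpora, models, similarities

# Toy tokenized corpus (an assumption, for illustration only)
texts = [["human", "computer", "interaction"],
         ["graph", "trees", "computer"],
         ["graph", "minors", "survey"]]

dictionary = corpora.Dictionary(texts)                 # map each token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                      # fit a TF-IDF transformation

# Index the TF-IDF corpus and query it with a new document
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
query_bow = dictionary.doc2bow(["computer", "graph"])
print(list(index[tfidf[query_bow]]))                   # cosine similarity to each indexed document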
Features
- Memory independence: the entire training corpus never needs to be held in RAM at any time (so large, web-scale corpora can be processed)
- Memory sharing: trained models can be saved to disk and loaded back via mmap, so multiple processes can share the same data and reduce the RAM footprint
- Implementations of several popular vector space algorithms, such as Word2Vec, Doc2Vec, FastText, TF-IDF, latent semantic analysis (LSI/LSA), LDA, and so on
- I/O wrappers and readers for several common data formats
- Semantic similarity queries over documents
Design goals
- Simple interfaces and APIs
- Memory independence: all intermediate steps and algorithms run in streaming mode, accessing one document at a time (see the sketch below)
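As a concrete illustration of this streaming design, the sketch below defines a corpus that reads one document per line from disk and yields it on demand; the file name and one-document-per-line layout are assumptions made for the example.
from gensim import corpora

class LineCorpus:
    """Iterable corpus: yields one tokenized document at a time, never loading the whole file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:                  # one document per line (assumed layout)
                yield line.lower().split()  # simple whitespace tokenization

# The dictionary is built in a single streaming pass over the corpus
dictionary = corpora.Dictionary(LineCorpus("corpus.txt"))
# Bag-of-words vectors can likewise be produced lazily, one document at a time
bow_stream = (dictionary.doc2bow(doc) for doc in LineCorpus("corpus.txt"))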
2. Save and load
Note the distinction between saving/loading the model and saving/loading the word vector file:
- Model saving and loading: all state from model training, such as the weight files, the binary tree, and vocabulary frequencies, is retained, so the model can be retrained (trained further) after loading
- Word vector file saving and loading: the training state is discarded, so the vectors cannot be retrained (trained further) after loading
See gensim: API Reference for details.
2.1 Model saving and loading
Save the model
Use the model.save() method. Models saved this way can be retrained (trained further) after loading, because all the training information is preserved.
from gensim.models import Word2Vec
from gensim.test.utils import common_texts as texts  # example corpus: a list of tokenized sentences
# Train Word2Vec vectors
model = Word2Vec(texts, size=100, window=5, min_count=1, workers=4)
# Save the model
model.save("word2vec.model")
If you need to continue training later, you need the complete Word2Vec object state as stored by save(), not just the KeyedVectors.
Load the model
Load the saved model and continue training it:
from gensim.models import Word2Vec
model = Word2Vec.load("word2vec.model")
model.train([["hello"."world"]], total_examples=1, epochs=1)
The trained word vectors are stored in a KeyedVectors instance, accessible via model.wv:
vector = model.wv['computer'] # numpy vector of a word
If you have finished training the model (that is, no more updates, only queries), switch to a KeyedVectors instance and discard the full model:
word_vectors = model.wv
del model
2.2 Word vector file saving and loading
Save the trained word vector file
save
- Use model.wv.save to save the word vectors as a KeyedVectors instance. Vectors saved this way lose the full model state and cannot be retrained, but the saved object is smaller and faster to load:
model.wv.save("model.wv")
- Use wv.save_word2vec_format (formerly model.save_word2vec_format(), now deprecated) to save the word vectors in word2vec C format:
model.wv.save_word2vec_format("model.bin", binary=True)
Load the word vector file
- Use KeyedVectors.load to load a word vector file into a KeyedVectors instance (for when the complete model state is not required and no further training will be done):
from gensim.models import KeyedVectors
wv = KeyedVectors.load("model.wv", mmap='r')
vector = wv['computer']  # numpy vector of a word
- Use KeyedVectors.load_word2vec_format() to load word vectors in word2vec C format into a KeyedVectors instance. Files can be loaded in two formats: C text format and C bin format (binary):
from gensim.models import KeyedVectors
wv_from_text = KeyedVectors.load_word2vec_format("model_kv_c", binary=False)  # C text format
wv_from_bin = KeyedVectors.load_word2vec_format("model_kv.bin", binary=True)  # C bin format
Vectors loaded from the C format cannot be trained further, because the weight files, binary tree, and vocabulary frequencies are missing.
3. KeyedVectors
The module models.KeyedVectors implements word vectors and similarity lookups. Trained word vectors are independent of how they were trained, so they can be represented by a standalone structure.
This structure is called KeyedVectors and is essentially a mapping between entities and vectors. Each entity is identified by its string ID, so it is a mapping between strings and 1-dimensional numpy arrays. An entity usually corresponds to a word, so words are mapped to one-dimensional vectors, but in some models a key can also correspond to a document, an image, or something else.
KeyedVectors differ from the full model in that they cannot be trained further, but they have a smaller RAM footprint and a simpler interface.
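A small sketch of this mapping behaviour, assuming the gensim 3.x KeyedVectors API used throughout this article and the pre-trained vectors that also appear in section 3.2:
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # returns a KeyedVectors instance

print("computer" in wv.vocab)             # keys are plain strings
print(wv["computer"].shape)               # each value is a 1-dimensional numpy vector, here (100,)
print(len(wv.vocab), wv.index2word[:5])   # size of the mapping and the first few keys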
3.1 How to obtain word vectors
Train a complete model, then access its model.wv property, which holds the standalone KeyedVectors. For example, train vectors with Word2Vec:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv
Load the word vector file from disk
from gensim.models import KeyedVectors
word_vectors.save("vectors_wv")
word_vectors = KeyedVectors.load("vectors_wv", mmap='r')
Load a word vector file in the original Google word2vec C format from disk into a KeyedVectors instance:
from gensim.test.utils import datapath
wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False) # C text format
wv_from_bin = KeyedVectors.load_word2vec_format(datapath('word2vec_vector.bin'), binary=True) # C bin format
3.2 What can be done with these word vectors?
They can be used for a variety of syntactic and semantic NLP word tasks, as the examples below show.
>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100") # load pre-trained word-vectors from gensim-data
>>>
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.7699
>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
# Similarity between two words
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
# List of words closest to the specified word
>>> result = word_vectors.similar_by_word("cat")
>>> print("{}: {:.4f}".format(*result[0]))
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
# WMD distance between two sentences
>>> similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print("{:.4f}".format(similarity))
3.4893
# Distance between two words
>>> distance = word_vectors.distance("media", "media")
>>> print("{:.1f}".format(distance))
0.0
# Similarity between two sentences
>>> sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print("{:.4f}".format(sim))
0.7067
# word vector
>>> vector = word_vectors['computer'] # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> vector = word_vectors.word_vec('office', use_norm=True)
>>> vector.shape
(100,)