Preface
In the previous article, "Talk about the principle of Google Word2vec", the ideas behind Word2vec were explained in detail. In this article we will train word vectors with TensorFlow according to the Word2vec principle, using the Skip-Gram model.
Corpus preparation
For the corpus, we collected online articles about real estate news and concatenated them all into a single text file to form the corpus.
A brief description of Skip-Gram
The core idea of Skip-Gram can be seen from the following figure. Assume the window size is 2 and the text is "the quick brown fox jumps over the lazy dog." Training samples are generated as the window slides over the text: starting at "the" we get (the, quick) and (the, brown); moving one step to the right gives (quick, the), (quick, brown) and (quick, fox); another step gives (brown, the), (brown, quick), (brown, fox) and (brown, jumps); and so on, producing new training samples each time the window shifts to the right. That is the core idea of the Skip-Gram model.
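To make the sample generation concrete, here is a minimal sketch (not part of the original post, plain Python and independent of the TensorFlow code below) that enumerates the skip-gram pairs for the example sentence with a window size of 2:

# Enumerate (centre word, context word) pairs within a +/- 2 word window
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]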
Corpus loading & word segmentation
import codecs

import jieba
import tensorflow as tf


def read_data(filename):
    """Load the corpus file and segment it into a list of words with jieba."""
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        data = f.read()
    seg_list = jieba.cut(data, cut_all=False)
    text = tf.compat.as_str("/".join(seg_list)).split('/')
    return text


filename = "D:\\data6\\house_train\\result.txt"
vocabulary = read_data(filename)
This loads the corpus file and segments it into words. filename specifies the corpus file, and jieba performs the word segmentation; the function returns a list containing all the words in the corpus.
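As a quick illustration (the sentence below is made up and is not from the real-estate corpus), jieba's precise mode splits a sentence into words like this:

import jieba

sample = "未来房价走势如何"                     # hypothetical sentence
print(list(jieba.cut(sample, cut_all=False)))  # e.g. ['未来', '房价', '走势', '如何']

read_data simply does this for the whole corpus file and returns the resulting flat word list.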
Build a dictionary
import collections

vocabulary_size = 50000


def build_dataset(words, n_words):
    """Map the raw word list to indexes, keeping the n_words most common words."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary


data, count, dictionary, reverse_dictionary = build_dataset(vocabulary, vocabulary_size)
del vocabulary
Here we build a vocabulary of size 50,000. We count how often each word in the corpus occurs and keep the 49,999 most frequent words; count stores these words together with their frequencies so that a word's frequency can be looked up later. Next we build dictionary, which maps each word to its index, making it easy to look up a word's index position. All words in the corpus are then converted to indexes and stored in data; any word not among the 49,999 most frequent words is treated as an unknown word and given index 0, and the number of such unknown words is counted along the way. Finally, a reverse-index dictionary, reversed_dictionary, lets us look up a word by its index.
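A toy run (made-up mini corpus, not the real one; the outputs shown as comments assume standard Counter tie ordering) makes the four return values concrete:

words = ['房价', '上涨', '房价', '下跌', '房价', '楼市']
data, count, dictionary, reverse_dictionary = build_dataset(words, n_words=3)
print(count)               # [['UNK', 2], ('房价', 3), ('上涨', 1)]
print(dictionary)          # {'UNK': 0, '房价': 1, '上涨': 2}
print(data)                # [1, 2, 1, 0, 1, 0]  -- '下跌' and '楼市' become UNK (index 0)
print(reverse_dictionary)  # {0: 'UNK', 1: '房价', 2: '上涨'}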
Obtaining batch data
import random

import numpy as np

data_index = 0


def generate_batch(batch_size, num_skips, skip_window):
    """Generate a batch of (centre word, context word) training pairs."""
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window  centre  skip_window ]
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        target = skip_window  # the centre word sits in the middle of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        if data_index == len(data):
            # deque does not support slice assignment, so refill it with extend
            buffer.extend(data[:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Step back a little so that words at the end of the data are not skipped
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
batch_size is the number of samples fetched at a time, and num_skips is the number of context words drawn for each centre word. For example, with a window size of 2 there are up to 4 words around a centre word, so it can form up to 4 training samples; if you only want 2 of them, set num_skips to 2. skip_window sets the window size. The sliding window moves over the whole word sequence to draw samples, and the returned batch and labels contain the dictionary indexes of the words, which is convenient for the later computation. See the small check below.
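A small sanity check (assuming the globals data, reverse_dictionary and data_index from the code above are in place) shows what one batch looks like:

data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]], '->',
          labels[i, 0], reverse_dictionary[labels[i, 0]])
# Each centre word appears num_skips times, paired with context words drawn at
# random from within +/- skip_window positions of it.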
Build the graph
import math

graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    with tf.device('/cpu:0'):
        # Word vectors for the whole vocabulary, initialised uniformly in [-1, 1)
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Weights and biases for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=train_labels,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Normalise the embeddings and compute the similarity between the
    # validation words and every word in the vocabulary
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    init = tf.global_variables_initializer()
train_inputs is an input placeholder of shape [batch_size]; it holds the indexes of a batch of input words. train_labels, of shape [batch_size, 1], holds the correct labels (context-word indexes) corresponding to that batch of inputs.
The embeddings variable holds the 128-dimensional vectors of all the words in the dictionary, and these vectors are updated during training. It is a matrix of shape [vocabulary_size, embedding_size], here [50000, 128] because we set the vocabulary to 50,000 words, and its elements are initialised between -1 and 1.
The embedding_lookup function then retrieves a batch of 128-dimensional input vectors, embed, according to train_inputs.
We then use NCE (noise-contrastive estimation) as the loss function, a negative-sampling style loss provided by TensorFlow; it needs the vocabulary size vocabulary_size and the word-vector dimension embedding_size, and you can also try other loss functions. nce_weights and nce_biases are the weights and biases used by the NCE loss; the loss is averaged over the batch and then minimised with gradient descent.
Finally, the embeddings are normalised to obtain unit-length word vectors, and the similarity (distance) between the validation words we selected and all word vectors is computed.
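The graph above relies on a few hyperparameters that are defined elsewhere in the script; the values below are illustrative placeholders (following the classic TensorFlow word2vec_basic.py defaults, adjusted to the 128-dimensional vectors and window size 2 used in this article), not necessarily the ones used for the real-estate corpus:

batch_size = 128
embedding_size = 128      # dimension of the word vectors
skip_window = 2           # words considered on each side of the centre word
num_skips = 4             # (input, label) pairs drawn per centre word
num_sampled = 64          # negative samples for the NCE loss

# Validation words: 16 random indexes among the 100 most frequent words,
# used only to print nearest neighbours during training.
valid_size = 16
valid_window = 100
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

num_steps = 100001        # training steps for the session below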
Create a session
with tf.Session(graph=graph) as session:
    init.run()
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            print('Average loss at step ', step, ':', average_loss)
            average_loss = 0
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
Create the session and start training, with the number of training steps given by num_steps. generate_batch fetches a batch of inputs and their corresponding labels, which are fed to the optimizer and loss operations to train the model; every 2,000 steps the average loss is printed so we can watch how it decreases, and every 10,000 steps the validation words are used to print their 8 nearest neighbours.
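Since the rows of final_embeddings are unit-normalised, nearest neighbours of any word can also be queried afterwards with plain NumPy. A minimal sketch (the query word is just an example and must exist in dictionary):

def nearest_words(word, k=8):
    idx = dictionary.get(word, 0)                            # fall back to UNK
    sims = np.dot(final_embeddings, final_embeddings[idx])   # cosine similarity
    return [reverse_dictionary[j] for j in (-sims).argsort()[1:k + 1]]

print(nearest_words('房价'))   # hypothetical query word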
Dimensionality reduction and plotting
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)


# Use a Chinese font so the word labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
plot_only = 300
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [reverse_dictionary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)
We take the first 300 words, use t-SNE to reduce their embeddings to two dimensions, and plot the result.
github
Github.com/sea-boat/De…