Using the embedding_lookup module to train, save, and simply reuse Word2Vec

  • Word2Vec overview

One-hot representation is a very simple word vector, but it has many problems. The biggest one is that vocabularies tend to be very large, often on the order of millions, so representing each word as a vector with a million dimensions is a memory disaster. In fact, every position in such a vector is 0 except for a single 1, which is a very inefficient encoding. Can we reduce the dimensionality of the word vector?

Distributed representation solves the problems of one-hot representation. The idea is to map each word, through training, to a short word vector. All of these word vectors together form a vector space, in which ordinary statistical methods can be used to study the relationships between words. How short is this word vector? Its dimension usually needs to be specified by ourselves at training time.
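To make the contrast concrete, here is a minimal sketch (using a made-up five-word vocabulary and an arbitrary embedding size of 3, not values from this post) of what the two representations look like:

import numpy as np

vocab = ['king', 'queen', 'man', 'woman', 'apple']  # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

# One-hot representation: one dimension per vocabulary word,
# all zeros except for a single 1.
one_hot = np.eye(len(vocab), dtype=np.float32)
print(one_hot[word2id['queen']])      # [0. 1. 0. 0. 0.]

# Distributed representation: a short dense vector per word.
# In practice these values are learned during training; here they are
# random, just to show the shape.
embedding_size = 3
embeddings = np.random.uniform(-1.0, 1.0, (len(vocab), embedding_size))
print(embeddings[word2id['queen']])   # e.g. [ 0.42 -0.17  0.88]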

This blog post explores training, saving, and simply using Word2Vec with TensorFlow's embedding_lookup module. On this basis, we can later use our own trained Word2Vec embeddings in RNN applications. The data set used for this exercise is text8.zip.

  • Introduction to tf.nn.embedding_lookup

    tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)

Finds the elements in params that correspond to the ids in ids; it can be understood as indexing, so the values in ids must not exceed the size of the first dimension of params. For example, with ids=[1, 3, 5], the rows of params with indices 1, 3 and 5 are looked up and returned as a matrix. embedding_lookup is not a simple table lookup: the looked-up vectors are trainable. The number of trainable parameters is category_num * embedding_size, which means the lookup behaves like a fully connected layer.

params: the complete embedding tensor, or a list of P tensors with the same shape except for the first dimension, representing a sharded embedding tensor. ids: a Tensor of type int32 or int64 containing the ids to look up in params.
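As a quick illustration of the indexing behaviour described above, the following minimal sketch (with a made-up 10 x 4 table, run under a TensorFlow 1.x session as in the rest of this post) gathers rows 1, 3 and 5 of params:

import numpy as np
import tensorflow as tf

# A toy "embedding table": 10 words, 4-dimensional vectors.
params = tf.constant(np.arange(40, dtype=np.float32).reshape(10, 4))
ids = tf.constant([1, 3, 5], dtype=tf.int32)

# Gather the rows of params whose indices appear in ids.
lookup = tf.nn.embedding_lookup(params, ids)

with tf.Session() as sess:
    print(sess.run(lookup))
    # [[ 4.  5.  6.  7.]
    #  [12. 13. 14. 15.]
    #  [20. 21. 22. 23.]]

When params is a trainable tf.Variable, as in the training code below, gradients flow back into the looked-up rows, which is why the lookup behaves like a fully connected layer.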

  • Word2Vec training and saving

Code part:

# -*- coding: utf-8 -*-
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import collections
import pickle
import math
import os
os.environ["KMP_DUPLICATE_LIB_OK"] ="TRUE"
import random
import zipfile

import numpy as np
import urllib.request
import tensorflow as tf

# Step 1: Download the data.
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urllib.request.urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception(
        'Failed to verify ' + filename + '. Can you get to it with a browser? ')
  return filename

filename = maybe_download('text8.zip', 31344016)


# Read the data into a list of strings.
def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data

words = read_data(filename)
print('Data size', len(words))

# Step 2: Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
del words  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

data_index = 0


# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

# Step 4: Build and train a skip-gram model.

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

graph = tf.Graph()
with graph.as_default():

  # Input data.
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Add variable initializer.
  init = tf.global_variables_initializer()

# Step 5: Begin training.
num_steps = 100001

with tf.Session(graph=graph) as session:
  # We must initialize all variables before we use them.
  init.run()
  saver = tf.train.Saver()
  print("Initialized")

  average_loss = 0
  for step in range(num_steps):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 1000 == 0:
      if step > 0:
        average_loss /= 1000
      # The average loss is an estimate of the loss over the last 1000 batches.
      print("Average loss at step ", step, ":", average_loss)
      average_loss = 0

    # Note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log_str = "Nearest to %s:" % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log_str = "%s %s," % (log_str, close_word)
        print(log_str)
  
  final_embeddings = normalized_embeddings.eval()
  saver_path = saver.save(session, './2RNN/3_1Word2Vec/MyModel')
  print("saver path: ",saver_path)
  with open('./2RNN/3_1Word2Vec/tf_128_2.pkl', 'wb') as fw:
    pickle.dump({'embeddings': final_embeddings, 'word2id': dictionary, 'id2word': reverse_dictionary}, fw, protocol=4)


# Step 6: Visualize the embeddings.

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
  assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
  plt.figure(figsize=(18, 18))  # in inches
  for i, label in enumerate(labels):
    x, y = low_dim_embs[i,:]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

  plt.savefig(filename)

# %%
try:
  from sklearn.manifold import TSNE
  import matplotlib.pyplot as plt

  tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
  plot_only = 200
  low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
  labels = [reverse_dictionary[i] for i in range(plot_only)]
  plot_with_labels(low_dim_embs, labels)

except ImportError:
  print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")

Running results:

Average loss at step  1000 : 149.03840727233887
Average loss at step  2000 : 86.77497396659851
Average loss at step  3000 : 61.10482195854187
...
Average loss at step  97000 : 4.575266252994537
Average loss at step  98000 : 4.605689331054688
Average loss at step  99000 : 4.6487927632331845
Average loss at step  100000 :
Nearest to or: and, agouti, microcebus, ssbn, dasyprocta, than, clodius, orthodox,
Nearest to agouti: when, microcebus, ssbn, bpp, amalthea, roshan, michelob,
Nearest to i: we, ii, you, t, subcode, they, tabula, g,
Nearest to they: there, he, we, you, it, these, not, who,
Nearest to and: or, but, microcebus, agouti, mucus, dasyprocta, while, michelob,
Nearest to zero: eight, five, seven, four, six, nine, dasyprocta, michelob,
Nearest to states: nations, bandanese, kingdom, absalom, dasyprocta, aediles, applescript, kv,
Nearest to have: had, has, are, were, be, klister, having, agouti,
Nearest to five: four, six, seven, eight, three, two, zero, nine,
Nearest to used: known, agouti, microcebus, iit, abitibi, spoken, dasyprocta, upanija,
Nearest to an: wernicke, riley, binds, oddly, tunings, rearranged, tamarin, apparition,
Nearest to between: with, within, into, from, in,through, jarman, saracens,
Nearest to time: reginae, year, callithrix, iit, albury, upanija, brahma, microcebus,
Nearest to it: he, she, this, there, they, which,amalthea, microcebus,
Nearest to from: into, through, during, in, within, between, dominican, with,
Nearest to six: four, seven, five, eight, nine, three, two, agouti,
saver path:  ./2RNN/3_1Word2Vec/MyModel

Results analysis:

After 100,000 training steps, the loss has dropped from about 149 to about 4.6, and each word has found a more suitable position in the vector space built from the corpus. For example, "Nearest to five: four, six, seven, eight, three, two, zero, nine" shows that the number words end up close to one another.
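To verify such a neighborhood numerically, one can compare two words with cosine similarity right after training, while final_embeddings and dictionary are still in memory. This is only a sketch: since the rows of final_embeddings are already L2-normalized, a dot product is enough, and the word 'apple' is just an assumed out-of-group example:

import numpy as np

def cosine(word_a, word_b):
    # Rows of final_embeddings are L2-normalized, so the dot product
    # of two rows is their cosine similarity.
    a = final_embeddings[dictionary[word_a]]
    b = final_embeddings[dictionary[word_b]]
    return np.dot(a, b)

print('five vs four :', cosine('five', 'four'))    # expected to be high
print('five vs apple:', cosine('five', 'apple'))   # expected to be lower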

  • Reuse of models

In the last part of the training process we also saved the training results to the tf_128_2.pkl file. What we need to do in this part is to load that saved data back.

Code section

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import pickle

with open('./2RNN/3_1Word2Vec/tf_128_2.pkl', 'rb') as fr:
    data = pickle.load(fr)
    final_embeddings = data['embeddings']
    word2id = data['word2id']
    id2word = data['id2word']


print("word2id:".type(word2id),len(word2id))
print("word2id one:".list(word2id.items())[0])
print("id2word:".type(id2word),len(id2word))
print("id2word one:".list(id2word.items())[0])
print("final_embeddings:".type(final_embeddings),final_embeddings.shape)
print("final_embeddings one:",final_embeddings[0])

The results

word2id: <class 'dict'> 50000
word2id one: ('UNK', 0)
id2word: <class 'dict'> 50000
id2word one: (0, 'UNK')
final_embeddings: <class 'numpy.ndarray'> (50000, 128)
final_embeddings one: [ 0.07824267 0.02380653 -0.04904078 -0.15769418 -0.03343008 -0.00123829 -0.00840652 0.11035322
 0.05255153 -0.01701773 -0.03454393 0.07412812 0.12529139 0.08700892 0.13564599 0.06016889
 -0.02242458 0.01967838 -0.08621006 0.19164786 0.05878171 0.15053993 0.15180601 0.11737475
 0.02684335 -0.02697461 0.02076019 -0.07443079 0.0905515 -0.00580214 -0.10034874 0.10663538
 0.10468851 -0.0018832 -0.03854908 -0.04377652 -0.07925367 -0.01276041 0.06139744 -0.04612593
 -0.0026719 -0.14129621 0.03356975 -0.08864117 0.03864674 0.06496057 -0.03393148 -0.18256697
 0.1531667 0.01806654 0.25479555 -0.0102073 -0.01091281 -0.13244723 0.03231056 -0.04288295
 0.00475867 -0.06387896 0.16555941 -0.1105833 0.16233324 -0.01569812 -0.03743415 0.11839435
 0.14104177 -0.06637108 -0.02597998 -0.05089493 0.05379589 0.02132376 -0.0230114 0.16737887
 -0.07722343 0.06376561 -0.06996173 0.07367135 -0.04434428 -0.05931331 0.13638481 -0.12992401
 0.05051441 0.10075318 0.1285995 0.03757066 -0.15496145 0.02049168 -0.02400574 0.04723364
 -0.05883536 0.20387387 -0.01346673 0.09482987 0.02737017 0.07975979 0.02752302 0.1652701
 -0.06379505 -0.01461394 -0.01188034 0.118714 -0.0942675 0.08787307 -0.06561033 0.04986798
 0.18926224 0.11162002 0.01565995 0.09576936 -0.02896462 0.03163688 0.08406845 0.07642328
 -0.04427774 -0.03355639 -0.07277506 -0.20906252 -0.00820385 -0.00606967 0.02557734 0.03273683
 0.04223491 0.04725773 -0.011081 -0.0294039 0.04183002 -0.00577809 0.13359077 -0.02493091]

Result analysis: word2id and id2word are both dictionaries with 50,000 entries each, used to convert between ids and words. final_embeddings is a 2-D array with 50,000 rows, each a 128-dimensional vector, much like the 784 pixels of an MNIST handwritten digit; this is the essence of the word vectors. Later we can use our own trained word vectors for semantic analysis and other processing.
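As a small follow-up sketch of that idea (the example sentence and the UNK fallback are my own choices, not part of the original post), the loaded dictionaries can turn a piece of text into ids and then into a matrix of word vectors, which is the form an RNN would consume:

import numpy as np

sentence = "the history of machine learning".split()

# Map words to ids, falling back to 0 ('UNK') for out-of-vocabulary words.
ids = [word2id.get(w, 0) for w in sentence]

# Index the embedding matrix: one 128-dimensional vector per word.
vectors = final_embeddings[ids]
print(ids)
print(vectors.shape)   # (5, 128)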