Preface
In the previous article, "Talk about the principle of Google Word2vec", the ideas behind Word2vec were explained in detail. In this article we will train word vectors with TensorFlow according to the Word2vec principle, using the Skip-Gram model.
Corpus preparation
For the corpus, we collected online articles about real estate news and concatenated them all into a single text file to form the corpus.
A brief description of Skip-Gram
The core idea of Skip-Gram can be seen from the following figure. Assume the window size is 2 and the text is "the quick brown fox jumps over the lazy dog." Training samples are generated as the window slides over the text: starting at "the" we get (the, quick) and (the, brown); moving one step to the right gives (quick, the), (quick, brown) and (quick, fox); another step gives (brown, the), (brown, quick), (brown, fox) and (brown, jumps); and so on, producing new training samples each time the window shifts to the right. That is the core idea of the Skip-Gram model.
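To make the sample generation concrete, here is a minimal sketch (not part of the original post, plain Python and independent of the TensorFlow code below) that enumerates the skip-gram pairs for the example sentence with a window size of 2:

# Enumerate (centre word, context word) pairs within a +/- 2 word window
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]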
Corpus loading & word segmentation
import codecs

import jieba
import tensorflow as tf


def read_data(filename):
    """Load the corpus file and segment it into a list of words with jieba."""
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        data = f.read()
    seg_list = jieba.cut(data, cut_all=False)
    text = tf.compat.as_str("/".join(seg_list)).split('/')
    return text


filename = "D:\\data6\\house_train\\result.txt"
vocabulary = read_data(filename)
This loads the corpus file and segments it into words. filename specifies the corpus file, and jieba performs the word segmentation; the function returns a list containing all the words in the corpus.
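As a quick illustration (the sentence below is made up and is not from the real-estate corpus), jieba's precise mode splits a sentence into words like this:

import jieba

sample = "未来房价走势如何"                     # hypothetical sentence
print(list(jieba.cut(sample, cut_all=False)))  # e.g. ['未来', '房价', '走势', '如何']

read_data simply does this for the whole corpus file and returns the resulting flat word list.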
Build a dictionary
import collections

vocabulary_size = 50000


def build_dataset(words, n_words):
    """Map the raw word list to indexes, keeping the n_words most common words."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary


data, count, dictionary, reverse_dictionary = build_dataset(vocabulary, vocabulary_size)
del vocabulary
Here we build a vocabulary of size 50,000. We count how often each word in the corpus occurs and keep the 49,999 most frequent words; count stores these words together with their frequencies so that a word's frequency can be looked up later. Next we build dictionary, which maps each word to its index, making it easy to look up a word's index position. All words in the corpus are then converted to indexes and stored in data; any word not among the 49,999 most frequent words is treated as an unknown word and given index 0, and the number of such unknown words is counted along the way. Finally, a reverse-index dictionary, reversed_dictionary, lets us look up a word by its index.
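A toy run (made-up mini corpus, not the real one; the outputs shown as comments assume standard Counter tie ordering) makes the four return values concrete:

words = ['房价', '上涨', '房价', '下跌', '房价', '楼市']
data, count, dictionary, reverse_dictionary = build_dataset(words, n_words=3)
print(count)               # [['UNK', 2], ('房价', 3), ('上涨', 1)]
print(dictionary)          # {'UNK': 0, '房价': 1, '上涨': 2}
print(data)                # [1, 2, 1, 0, 1, 0]  -- '下跌' and '楼市' become UNK (index 0)
print(reverse_dictionary)  # {0: 'UNK', 1: '房价', 2: '上涨'}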
Obtaining batch data
import random

import numpy as np

data_index = 0


def generate_batch(batch_size, num_skips, skip_window):
    """Generate a batch of (centre word, context word) training pairs."""
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window  centre  skip_window ]
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        target = skip_window  # the centre word sits in the middle of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        if data_index == len(data):
            # deque does not support slice assignment, so refill it with extend
            buffer.extend(data[:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Step back a little so that words at the end of the data are not skipped
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
batch_size is the number of samples fetched at a time, and num_skips is the number of context words drawn for each centre word. For example, with a window size of 2 there are up to 4 words around a centre word, so it can form up to 4 training samples; if you only want 2 of them, set num_skips to 2. skip_window sets the window size. The sliding window moves over the whole word sequence to draw samples, and the returned batch and labels contain the dictionary indexes of the words, which is convenient for the later computation. See the small check below.
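A small sanity check (assuming the globals data, reverse_dictionary and data_index from the code above are in place) shows what one batch looks like:

data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]], '->',
          labels[i, 0], reverse_dictionary[labels[i, 0]])
# Each centre word appears num_skips times, paired with context words drawn at
# random from within +/- skip_window positions of it.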
Build the graph
import math

graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    with tf.device('/cpu:0'):
        # Word vectors for the whole vocabulary, initialised uniformly in [-1, 1)
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Weights and biases for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=train_labels,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Normalise the embeddings and compute the similarity between the
    # validation words and every word in the vocabulary
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    init = tf.global_variables_initializer()
train_inputs is an input placeholder of shape [batch_size]; it holds the indexes of a batch of input words. train_labels, of shape [batch_size, 1], holds the correct labels (context-word indexes) corresponding to that batch of inputs.
The embeddings variable holds the 128-dimensional vectors of all the words in the dictionary, and these vectors are updated during training. It is a matrix of shape [vocabulary_size, embedding_size], here [50000, 128] because we set the vocabulary to 50,000 words, and its elements are initialised between -1 and 1.
The embedding_lookup function then retrieves a batch of 128-dimensional input vectors, embed, according to train_inputs.
We then use NCE (noise-contrastive estimation) as the loss function, a negative-sampling style loss provided by TensorFlow; it needs the vocabulary size vocabulary_size and the word-vector dimension embedding_size, and you can also try other loss functions. nce_weights and nce_biases are the weights and biases used by the NCE loss; the loss is averaged over the batch and then minimised with gradient descent.
Finally, the embeddings are normalised to obtain unit-length word vectors, and the similarity (distance) between the validation words we selected and all word vectors is computed.
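The graph above relies on a few hyperparameters that are defined elsewhere in the script; the values below are illustrative placeholders (following the classic TensorFlow word2vec_basic.py defaults, adjusted to the 128-dimensional vectors and window size 2 used in this article), not necessarily the ones used for the real-estate corpus:

batch_size = 128
embedding_size = 128      # dimension of the word vectors
skip_window = 2           # words considered on each side of the centre word
num_skips = 4             # (input, label) pairs drawn per centre word
num_sampled = 64          # negative samples for the NCE loss

# Validation words: 16 random indexes among the 100 most frequent words,
# used only to print nearest neighbours during training.
valid_size = 16
valid_window = 100
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

num_steps = 100001        # training steps for the session below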
Create a session
with tf.Session(graph=graph) as session:
    init.run()
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            print('Average loss at step ', step, ':', average_loss)
            average_loss = 0
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
Create the session and start training, with the number of training steps given by num_steps. generate_batch fetches a batch of inputs and their corresponding labels, which are fed to the optimizer and loss operations to train the model; every 2,000 steps the average loss is printed so we can watch how it decreases, and every 10,000 steps the validation words are used to print their 8 nearest neighbours.
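Since the rows of final_embeddings are unit-normalised, nearest neighbours of any word can also be queried afterwards with plain NumPy. A minimal sketch (the query word is just an example and must exist in dictionary):

def nearest_words(word, k=8):
    idx = dictionary.get(word, 0)                            # fall back to UNK
    sims = np.dot(final_embeddings, final_embeddings[idx])   # cosine similarity
    return [reverse_dictionary[j] for j in (-sims).argsort()[1:k + 1]]

print(nearest_words('房价'))   # hypothetical query word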
Dimensionality reduction and plotting
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)


# Use a Chinese font so the word labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
plot_only = 300
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [reverse_dictionary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)
We take the first 300 words, use t-SNE to reduce their embeddings to two dimensions, and plot the result.
github
Github.com/sea-boat/De…