Preface

Recently, my tutor assigned me the paper [Knowledge Graph Embedding Based Question Answering], which uses a knowledge graph as the data source to answer natural-language questions without requiring knowledge of the underlying data structure. This note only records the code fragments that may be useful while I read the paper's GitHub code. As an undergraduate who is still new to this area, I will write up the paper itself in a separate note.

I hope I can learn and make progress together with you. Let's keep at it!


Paper link:

delivery.acm.org/10.1145/330…

GitHub link:

Github.com/xhuang31/KE…


The notes below will be updated gradually.

Let's begin!

  1. If the question contains specific question words, delete them

For example, to remove "what is" from "what is your name" and get the result "your name", we can use the following code:

whhowset = [{'what', 'how', 'where', 'who', 'which', 'whom'},
            {'in which', 'what is', "what 's", 'what are', 'what was', 'what were',
             'where is', 'where are', 'where was', 'where were',
             'who is', 'who was', 'who are', 'how is', 'what did'},
            {'what kind of', 'what kinds of', 'what type of', 'what types of', 'what sort of'}]
question = ["what", "is", "your", "name"]
# try the longest question-word phrases first: whhowset[j - 1] holds the j-word phrases
for j in range(3, 0, -1):
    if ' '.join(question[0:j]) in whhowset[j - 1]:
        del question[0:j]
        continue
print(question)

output: ['your', 'name']

  2. Create an n-gram list from a sentence's word list

The following is Wikipedia's explanation of n-gram: an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.

n can be chosen freely, e.g. unigram (n = 1) or bigram (n = 2). Concrete examples of n-grams:

  • Character level: for the word "apple", the n-gram list is ['a', 'p', 'p', 'l', 'e', 'ap', 'pp', 'pl', 'le', 'app', 'ppl', 'ple', 'appl', 'pple', 'apple'] (a character-level sketch follows the word-level output below)
  • Word level: for the sentence "how are u", the n-gram list is ['how', 'are', 'u', 'how are', 'are u', 'how are u']
question = ["how"."are"."u"]
grams = []
maxlen = len(question)
for token in question:
    grams.append(token)

for j in range(2, maxlen + 1):
    for token in [question[idx:idx + j] for idx in range(maxlen - j + 1)]:
        grams.append(' '.join(token))

print(grams)

output: ['how', 'are', 'u', 'how are', 'are u', 'how are u']
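
The same sliding-window idea works at the character level. Below is a minimal sketch of my own (not from the paper's repo) that reproduces the "apple" example above:

word = "apple"
char_grams = []
# collect every substring of length 1 .. len(word), shortest first
for j in range(1, len(word) + 1):
    for idx in range(len(word) - j + 1):
        char_grams.append(word[idx:idx + j])
print(char_grams)

output: ['a', 'p', 'p', 'l', 'e', 'ap', 'pp', 'pl', 'le', 'app', 'ppl', 'ple', 'appl', 'pple', 'apple']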

  3. Write the output into a file
import os

mids = ["I", "I", "am", "a", "human"]
# set(mids) removes duplicates; enumerate numbers the remaining entries
with open(os.path.join('output.txt'), 'w') as outfile:
    for i, entity in enumerate(set(mids)):
        outfile.write("{}\t{}\n".format(entity, i))

Output: a file output.txt containing, for example:

human	0
a	1
am	2
I	3
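Because a set has no guaranteed iteration order, the numbering above can change between runs. A small variation of my own sorts the deduplicated entities first so the file is deterministic:

import os

mids = ["I", "I", "am", "a", "human"]
with open(os.path.join('output.txt'), 'w') as outfile:
    # sorted() fixes the iteration order (note: uppercase sorts before lowercase)
    for i, entity in enumerate(sorted(set(mids))):
        outfile.write("{}\t{}\n".format(entity, i))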
  4. argparse: Python's standard-library module (not part of PyTorch) that makes it easy to write user-friendly command-line interfaces; add_argument defines how a single command-line argument should be parsed.

function: parser.add_argument(name or flags... [, action][, nargs][, const][, default][, type][, choices][, required][, help][, metavar][, dest])

Parameters (cited from the Python documentation):

  • const – A constant value required by some action and nargs selections.
  • dest – The name of the attribute to be added to the object returned by parse_args().
  • action – The basic type of action to be taken when this argument is encountered at the command line.
import argparse

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')

args = parser.parse_args()
print(args.accumulate(args.integers))

output: python prog.py 1 2 3 4 -> 4 (the maximum); python prog.py 1 2 3 4 --sum -> 10 (the sum)

  5. Counter object: a counter tool provided by collections to support convenient and rapid tallies
from collections import Counter

cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)

output: Counter({'blue': 3, 'red': 2, 'green': 1})
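
A Counter can also be built directly from an iterable, and its most_common() method returns the tallies sorted by count, which is handy for vocabulary statistics. A quick sketch:

from collections import Counter

cnt = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue'])
print(cnt.most_common(2))  # the two most frequent words: [('blue', 3), ('red', 2)]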

  6. PyTorch manual_seed
import torch

torch.manual_seed(3)
print(torch.rand(3))

output: tensor([0.0043, 0.1056, 0.2858]). With the seed fixed, this tensor is the same on every run; without the manual_seed call, the output would be different every time.

  7. CuDNN deterministic: in some circumstances, when using the CUDA backend with CuDNN, an operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting torch.backends.cudnn.deterministic = True

Example:

torch.backends.cudnn.deterministic = True
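Putting this and the previous item together, a minimal reproducibility setup might look like the sketch below (my own combination; the exact flags needed can vary with the PyTorch version):

import torch

torch.manual_seed(3)                       # fix the RNG seed
torch.backends.cudnn.deterministic = True  # force deterministic CuDNN kernels
torch.backends.cudnn.benchmark = False     # disable CuDNN algorithm auto-tuning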
  8. Torchtext. Note: the following part is based on an introduction to Torchtext written by Lee on Zhihu; it is recorded here only as learning notes on handling text data.

Torchtext components:

  • Field: mainly holds the data-processing configuration, such as the tokenization method, whether to convert to lowercase, the start-of-sequence token, the end-of-sequence token, the padding token, and the vocabulary
  • Dataset: a dataset inheriting from PyTorch's Dataset that is used to load data. TabularDataset takes a path, a format, and Field information for easy data loading. Torchtext also provides pre-built Dataset objects that can be loaded and used directly, and the splits method loads the training, validation, and test sets simultaneously.
  • Iterator: mainly an iterator that feeds batches of data to the model and supports customized batching
  9. Field:
TEXT = data.Field(lower=True)

Here the preprocessing is configured to lowercase all text.
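
A Field can carry far more configuration than this. The sketch below (parameter values are my own illustration, using the legacy torchtext API that the paper's code builds on) shows the options mentioned in the component list above:

from torchtext import data

TEXT = data.Field(sequential=True,      # this field holds token sequences
                  tokenize=str.split,   # word segmentation method: simple whitespace split
                  lower=True,           # convert to lowercase
                  init_token='<sos>',   # start-of-sequence token
                  eos_token='<eos>',    # end-of-sequence token
                  pad_token='<pad>')    # padding token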

  10. Dataset

Torchtext's Dataset inherits from PyTorch's Dataset and provides a method for downloading and decompressing compressed data (supports .zip, .gz, and .tgz).

The splits method loads the training, validation, and test sets at the same time.

TabularDataset can easily read CSV, TSV, or JSON files

train = data.TabularDataset(path=os.path.join(args.output, 'dete_train.txt'), format='tsv',
                            fields=[('text', TEXT), ('ed', ED)])
dev, test = data.TabularDataset.splits(path=args.output, validation='valid.txt', test='test.txt',
                                       format='tsv', fields=field)

After the data is loaded, the vocabulary can be built; pre-trained word vectors can be attached while building it:

TEXT.build_vocab(train, vectors="glove.6B.100d")
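Once build_vocab has run, the vocabulary and the attached vectors can be inspected. A small sketch (attribute names are from the legacy torchtext API; the word 'name' is just a hypothetical lookup):

print(len(TEXT.vocab))           # vocabulary size
print(TEXT.vocab.stoi['name'])   # string-to-index lookup
print(TEXT.vocab.vectors.shape)  # (vocabulary size, 100) for 100-d vectors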
  11. Iterator

The Iterator feeds data from a Torchtext dataset to our model and provides general processing methods such as shuffling and sorting. The batch size can be modified dynamically, and the splits method again builds iterators for the training, validation, and test sets at the same time.

train_iter = data.Iterator(train, batch_size=args.batch_size, device=torch.device('cuda', args.gpu),
                           train=True, repeat=False, sort=False, shuffle=True, sort_within_batch=False)
dev_iter = data.Iterator(dev, batch_size=args.batch_size, device=torch.device('cuda', args.gpu),
                         train=False, repeat=False, sort=False, shuffle=False, sort_within_batch=False)
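Each batch yielded by an Iterator exposes one attribute per declared field. A minimal usage sketch (the field names text and ed follow the TabularDataset example above; model and criterion are placeholders for your own network and loss):

for batch in train_iter:
    # batch.text and batch.ed are tensors built from the 'text' and 'ed' columns
    predictions = model(batch.text)
    loss = criterion(predictions, batch.ed)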
  12. Floor division: the Python arithmetic operator // divides its operands and drops the digits after the decimal point when both operands are positive. If one of the operands is negative, the result is floored, i.e., rounded away from zero (toward negative infinity).
print(9 // 4)
print(-11 // 3)

output: 2 and -4
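
Equivalently, a // b gives the same result as math.floor(a / b) for these values; a quick check:

import math

print(math.floor(9 / 4))    # 2, same as 9 // 4
print(math.floor(-11 / 3))  # -4, same as -11 // 3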