Much of natural language processing involves machine learning, so it is useful to understand some of the basic tools and techniques of machine learning. Some of these tools have been discussed in earlier chapters and others have not; we gather all of them here.
D.1 Data selection and avoidance of bias
Data selection and feature engineering carry the risk of introducing bias (in the human sense). Once we bake our own biases into an algorithm by choosing a particular set of features, the model fits those biases and produces biased results. If we are lucky enough to discover the bias before going into production, a great deal of work is required to eliminate it. The entire pipeline must be rebuilt and retrained, for example, to take advantage of a tokenizer's new vocabulary. We must start over.
A well-known example of how data and feature selection matter is the Word2vec model. Word2vec was trained on a large corpus of news stories, and about a million N-grams were chosen from this corpus as the vocabulary (the features) of the model. The result is a model that excites data scientists and linguists, who can perform arithmetic on word vectors such as "king − man + woman = queen". But as the research went further, more problematic relationships emerged in the model.
For example, the answer "nurse" to the analogy "doctor − father + mother = ?" is not the unbiased, logical result one would hope for. A gender bias was inadvertently trained into the model. Similar racial, religious, and even geographic biases were prevalent in the original Word2vec model. The Google researchers had no intention of creating these biases; the biases were in the data, the word-usage statistics of the Google News corpus they used to train Word2vec.
Many news stories simply carry cultural bias because they are written by journalists motivated to keep their readers happy. These journalists write about a world with institutional biases and real-life biases in how events and people are treated. The word-usage statistics in Google News merely reflect that mothers are far more often nurses than doctors, while fathers are far more often doctors than nurses. The Word2vec model simply gives us a window into the world we have created.
Fortunately, models like Word2vec do not require labeled training data, so we are free to choose any text we like to train the model. We can choose a dataset that is more balanced and more representative of the beliefs and inferences we want our model to make. While others hide behind algorithms and claim they are only following the model, we can share with them datasets that more fairly represent a society in which we aspire to provide equal opportunity to everyone.
When training and testing models, you can rely on your innate sense of fairness to help decide when a model is ready to make predictions that affect users' lives. If the resulting model treats all users the way you would want it to, you can sleep well at night. It also helps to pay close attention to the needs of users who are unlike everyone else, especially those who are often socially disadvantaged. If you need a more formal justification for your choices, you can supplement your computer science skills by learning more about statistics, philosophy, ethics, psychology, behavioral economics, and anthropology.
As a natural language processing practitioner and machine learning engineer, you have the opportunity to train machines that can do better than humans do. Your bosses and colleagues will not tell you which texts to add to or remove from your training set; you yourself have the power to influence the behavior of the machines that shape your community and society as a whole.
We have offered some ideas on how to assemble a less biased, fairer dataset. Now we turn to fitting models to that data so that they are accurate and useful in the real world.
D.2 Degree of model fitting
A major challenge for all machine learning models is overcoming the tendency to perform too well. What does "too well" mean? Given sample data, any algorithm may do a very good job of finding the patterns in that particular dataset. But since we already know the label of every sample in the training set (if we did not know a sample's label, it would not be in the training set), predictions on the training samples are not particularly useful in themselves. Our real goal is to use those training samples to build a model that generalizes, one that correctly labels a new sample that resembles the training data but lies outside the training set. Predictive performance on new samples outside the training set is what we want to optimize.
A model that perfectly describes (and predicts) the training samples is said to be "overfit" (see Figure D-1). Such a model has little or no ability to describe new data. It is not a general-purpose model, and we should not expect it to do well when given a sample that is not in the training set.
Figure D-1 Overfitting on the training samples
Conversely, if our model makes many wrong predictions on the training samples and also does poorly on new samples, it is said to be "underfit" (see Figure D-2). In the real world, neither kind of model is of much use for prediction. So let's look at which techniques can detect these two fitting problems and, more importantly, how to avoid them.
Figure D-2 Underfitting on the training samples
D.3 Dataset division
In machine learning practice, if data is gold, then annotated data is raritanium. Our first instinct might be to take all of the annotated data and pass it to the model. More training data yields a more resilient model, right? But then we would have no way to test the model, only the hope that it produces good results in the real world. That is obviously impractical. The solution is to split the annotated data into two, and sometimes three, sets: a training set, a validation set, and, in some cases, a test set.
The training set is obvious. The validation set is a smaller portion of the annotated data that we hold out and keep hidden from the model during training. Good performance on the validation set is the first step toward verifying that the trained model will perform well on new data outside the training set. It is common to see a given annotated dataset split into training and validation sets in an 80/20 or 70/30 ratio. The test set is like the validation set: a subset of the annotated data used to test the model and measure its performance. But how does this test set differ from the validation set? Not at all in composition, only in how each is used.
While the model is trained on the training set, there will be several iterations with different hyperparameters, and the final model we choose is the one that performs best on the validation set. But here is the catch: how do we know we have not simply optimized a model that happens to fit the validation set particularly well? We have no way to verify that the model performs well on other data, and that is what our bosses or the readers of our paper care about most: how well does the model work on their data?
Therefore, if you have enough data, you should hold out a third portion of the annotated dataset as a test set. This gives readers (or bosses) more confidence that the model will perform well on data it has never seen during training and tuning. Once the trained model has been selected based on validation-set performance, and the model is no longer being trained or adjusted, you run predictions (inference) on each sample in the test set. If the model performs well on this third portion of the data, it generalizes well. To obtain this kind of high-confidence model validation, you will often see datasets split in a 60/20/20 training/validation/test ratio.
Note that it is important to shuffle the dataset before splitting it into training, validation, and test sets. We want each subset to be a representative sample of the "real world," so each needs to contain roughly the same proportion of every label we expect to see. If the training set has 25% positive samples and 75% negative samples, then the test set and validation set should also have 25% positive samples and 75% negative samples. If the original dataset lists all the negative samples first and the data is not shuffled before a 50/50 training/test split, the training set will end up with 100% negative samples and the test set with 50% negative samples. In that case, the model can never learn anything from the positive samples in the dataset.
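As a minimal sketch of a shuffled, stratified 60/20/20 split using scikit-learn's train_test_split (the arrays here are illustrative stand-ins for real features and labels):

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X = np.arange(100).reshape(100, 1)           # toy features
>>> y = np.array([0] * 75 + [1] * 25)            # 75% negative, 25% positive
>>> X_train, X_tmp, y_train, y_tmp = train_test_split(
...     X, y, test_size=0.4, shuffle=True, stratify=y, random_state=42)
>>> X_val, X_test, y_val, y_test = train_test_split(
...     X_tmp, y_tmp, test_size=0.5, shuffle=True, stratify=y_tmp, random_state=42)
>>> len(X_train), len(X_val), len(X_test)
(60, 20, 20)

The stratify argument keeps the 25/75 label ratio intact in all three subsets.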
D.4 Cross-validation
Another way to divide up the training and test data is cross-validation, or k-fold cross-validation (see Figure D-3). The idea behind cross-validation is very similar to the data splits we just discussed, but it lets you train on the entire labeled dataset. The process divides the dataset into k equal parts, or folds. The model is then trained with k − 1 folds as the training set and validated on the kth fold. Next, one of the folds used for training in the first round is held out as the validation fold, and the remaining k − 1 folds become the new training set for retraining; this repeats until every fold has served once as the validation set.
Figure D-3 K-fold cross-validation
This technique is valuable for analyzing the structure of the model and finding hyperparameters that perform well across all of the validation folds. Once the hyperparameters have been chosen, however, you still have to pick the best-performing trained model, and that choice is vulnerable to the kind of bias discussed in the previous section, so it is still advisable to keep a held-out test set in this process.
This approach also provides some new information about the reliability of the model. We can compute a p-value for the likelihood that the relationship the model found between the input features and the output predictions is statistically significant rather than the result of chance. If the training set really is a representative sample of the real world, this is an important new piece of information.
The trade-off for this extra confidence in the model is that k-fold cross-validation takes k times as long to train. So if you only need the 90% answer to a question, you can often get away with a single split, which is exactly the training/validation split described earlier. We cannot be 100% confident that the model is a reliable description of the dynamics of the real world, but if it performs well on the test set, we can be reasonably confident that it is a useful model for predicting the target variable. Machine learning models arrived at through this practical approach are sensible for most business applications.
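As a minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score (the model and data below are placeholders for whatever pipeline you are actually evaluating):

>>> import numpy as np
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(100, 5)                        # toy features
>>> y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labels
>>> model = LogisticRegression()
>>> scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
>>> scores.mean(), scores.std()                  # average score and its spread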
D.5 Holding the model back
During model.fit(), gradient descent is overzealous in its pursuit of reducing the error in the model. This can lead to overfitting, where the learned model works well on the training set but poorly on new, unseen samples (the test set). Therefore, we may want to "hold back" the model. Here are three ways to do it:
- Regularization;
- Random dropout;
- Batch normalization.
D.5.1 Regularization
Any machine learning model will eventually overfit. Fortunately, several tools exist to address this. The first is regularization, which adds a penalty to the learned parameters at each training step. It is usually, but not always, a function of the parameters themselves. The L1 norm and the L2 norm are the most common choices.
L1 regularization:
L1 is the sum of the absolute values of all the parameters (weights), multiplied by some λ (a hyperparameter), usually a small floating-point number between 0 and 1. This sum is applied to the weight updates; the idea is that larger weights produce a larger penalty, which encourages the model to rely on more of its weights, more evenly.
L2 regularization:
Similarly, L2 is a weight penalty but with a slightly different definition: it is the sum of the squares of the weights, multiplied by some λ value, again a separate hyperparameter chosen before training.
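As a minimal sketch, using NumPy and an illustrative λ value, of how the two penalty terms are computed from a weight vector:

>>> import numpy as np
>>> weights = np.array([0.5, -1.2, 0.0, 3.1])    # example weight vector
>>> lam = 0.01                                   # the lambda hyperparameter
>>> l1_penalty = lam * np.sum(np.abs(weights))   # L1: sum of absolute values
>>> l1_penalty
0.048...
>>> l2_penalty = lam * np.sum(weights ** 2)      # L2: sum of squares
>>> l2_penalty
0.113...

In Keras, a comparable penalty can be attached to a layer through its kernel_regularizer argument, for example keras.regularizers.l2(0.01).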
D.5.2 Dropout
Dropout is another technique for preventing overfitting in neural networks, one that seems almost magical at first glance. The idea is that, at any layer of the network, we turn off a certain percentage of the signal passing through that layer during training. Note that this happens only during training, never during inference. At each training pass, a subset of the neurons in the layer is "ignored": their output values are explicitly set to zero. Because they contribute nothing to the prediction, they receive no weight updates in the backpropagation step. In the next training step, a different subset of the layer's weights is selected and the others are zeroed out.
How can a network learn with 20% of its brain switched off at any given moment? The idea is that no single weight path can, on its own, fully define a property of the data. The model must generalize its internal structure so that it can process the data along any of its many paths through the neurons.
The percentage of the signal that is turned off is defined as a hyperparameter, a floating-point number between 0 and 1. In practice, a dropout of 0.1 to 0.5 is usually optimal, depending of course on the model. During inference, dropout is ignored, so the full power of the trained weights is brought to bear on the new data.
Keras provides a very simple implementation, as shown in the examples throughout this book and in Listing D-1.
Listing D-1 A Dropout layer in Keras reduces overfitting
>>> from keras.models import Sequential
>>> from keras.layers import Dropout, LSTM, Flatten, Dense
>>> num_neurons = 20
>>> maxlen = 100
>>> embedding_dims = 300
>>> model = Sequential()
>>> model.add(LSTM(num_neurons, return_sequences=True,
... input_shape=(maxlen, embedding_dims)))
>>> model.add(Dropout(.2))
>>> model.add(Flatten())
>>> model.add(Dense(1, activation='sigmoid'))
D.5.3 Batch normalization
A newer concept in neural networks, called batch normalization, can help regularize and generalize a model. The idea of batch normalization is that, much like the input data, the output of each layer should be normalized to values between 0 and 1. There is still some debate about how, why, and when this is beneficial, and under what conditions it should be used; we encourage you to explore that research yourself.
Keras's BatchNormalization layer provides a simple implementation, as shown in Listing D-2.
Listing D-2 BatchNormalization in Keras
>>> from keras.models import Sequential
>>> from keras.layers import Activation, Dropout, LSTM, Flatten, Dense
>>> from keras.layers.normalization import BatchNormalization
>>> model = Sequential()
>>> model.add(Dense(64, input_dim=14))
>>> model.add(BatchNormalization())
>>> model.add(Activation('sigmoid'))
>>> model.add(Dense(64, input_dim=14))
>>> model.add(BatchNormalization())
>>> model.add(Activation('sigmoid'))
>>> model.add(Dense(1, activation='sigmoid'))
D.6 Imbalanced training sets
Machine learning models are only as good as the data you feed them. Having a huge amount of data helps only if that data covers all the scenarios you expect at prediction time, and covering each scenario only once is not enough. Imagine we are trying to predict whether an image shows a dog or a cat, and we have a training set with 20,000 pictures of cats but only 200 pictures of dogs. A model trained on this dataset would likely learn simply to predict that any given image is a cat, regardless of the input. From the model's point of view, that seems acceptable, right? After all, it would be correct on 99% of the training samples. Of course, that argument is completely untenable, and the model is worthless. And, quite apart from the particular model used, the most likely cause of this failure is the imbalanced training set.
The model may pay too much attention to one part of the training set for a simple reason: the signal from the oversampled class in the labeled data overwhelms the signal from the undersampled class. The weights are updated far more often by errors on the majority-class signal, while the signal from the minority class is effectively ignored. An absolutely uniform representation of each class is not essential, because the model can overcome some noise on its own; the goal is just to bring the class proportions to a reasonably even level.
As with any machine learning task, the first step is to take a long, careful look at the data, learn its details, and gather some rough statistics about what the data actually represents: not just how much data there is, but how much of each kind.
So what do you do if the classes are not evenly represented from the start? If the goal is to make the class representation uniform (and it is), there are three main approaches to choose from: oversampling, undersampling, and data augmentation.
D.6.1 Oversampling
Oversampling is the technique of repeating samples from one or more underrepresented classes. Take the earlier dog/cat classification example (only 200 dogs to 20,000 cats). We could simply repeat each of the existing 200 dog images 100 times and end up with 40,000 samples, half dogs and half cats.
This is an extreme example, and it therefore causes its own problems: the network will likely get very good at recognizing those 200 specific dogs and may not generalize well to other dogs outside the training set. But under less extreme imbalances, oversampling certainly helps balance the training set.
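A minimal sketch of oversampling the minority class with NumPy (the arrays are stand-ins for real image features and labels):

>>> import numpy as np
>>> X_cats = np.random.randn(20000, 32)          # stand-in features for 20,000 cats
>>> X_dogs = np.random.randn(200, 32)            # stand-in features for 200 dogs
>>> idx = np.random.choice(len(X_dogs), size=20000, replace=True)
>>> X_dogs_oversampled = X_dogs[idx]             # resample the dogs with replacement
>>> X = np.vstack([X_cats, X_dogs_oversampled])
>>> y = np.array([0] * 20000 + [1] * 20000)      # 0 = cat, 1 = dog, now balanced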
D.6.2 Undersampling
Undersampling is the flip side of the same coin. Here, you remove samples from the overrepresented class. In the cat/dog example, we would randomly drop 19,800 cat images, leaving 400 samples, half dogs and half cats. A glaring problem with this approach, of course, is that we throw away most of the data and work from a much narrower base. Such an extreme cut is not ideal in the example above, but it can be a good solution when the underrepresented class itself contains a large number of samples. Having that much data is an absolute luxury, of course.
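A matching sketch of undersampling the majority class (again with stand-in arrays):

>>> import numpy as np
>>> X_cats = np.random.randn(20000, 32)          # stand-in features for 20,000 cats
>>> X_dogs = np.random.randn(200, 32)            # stand-in features for 200 dogs
>>> idx = np.random.choice(len(X_cats), size=200, replace=False)
>>> X_cats_undersampled = X_cats[idx]            # keep only 200 randomly chosen cats
>>> X = np.vstack([X_cats_undersampled, X_dogs])
>>> y = np.array([0] * 200 + [1] * 200)          # balanced, but only 400 samples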
D.6.3 Data augmentation
Data augmentation is trickier, but in the right circumstances it can help. Augmentation means generating new data, either by perturbing existing data or by regenerating it. affNIST is one such example. The famous MNIST dataset is a set of handwritten digits from 0 to 9 (see Figure D-4). affNIST tilts, rotates, and scales each digit in various ways while retaining the original label.
Figure D-4 The entries in the leftmost column are samples from the original MNIST; the other columns are affNIST data after affine transformations (image courtesy of "affNIST").
The purpose of that particular effort was not to balance a training set but to make networks such as convolutional neural networks more resilient to new data written in other ways; still, the concept of data augmentation applies here.
You must be careful, however: adding data that is not truly representative of what you are modeling can do more harm than good. Suppose your dataset is the earlier set of 200 dog photos and 20,000 cat photos, and suppose those images are high-resolution color photos taken under ideal conditions. Handing a box of crayons to 19,000 kindergarten teachers will not necessarily produce the augmented data you want. So consider how the augmented data will affect the model. The answer is not always clear, so if you must go down this path, keep the model's intended purpose in mind while validating it, and try to test around its edges to make sure you did not inadvertently introduce unexpected behavior.
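For image data like the dog/cat example, one common way to generate perturbed copies is Keras's ImageDataGenerator; the arrays and parameter values below are purely illustrative:

>>> import numpy as np
>>> from keras.preprocessing.image import ImageDataGenerator
>>> X_train = np.random.rand(200, 64, 64, 3)     # stand-in for 200 RGB dog images
>>> y_train = np.ones(200)
>>> datagen = ImageDataGenerator(
...     rotation_range=15,                       # rotate up to 15 degrees
...     width_shift_range=0.1,                   # shift horizontally up to 10%
...     height_shift_range=0.1,                  # shift vertically up to 10%
...     zoom_range=0.1)                          # zoom in or out up to 10%
>>> flow = datagen.flow(X_train, y_train, batch_size=32)
>>> X_batch, y_batch = next(flow)                # one batch of transformed copies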
Finally, a point that may seem obvious but is worth making: if the dataset is "incomplete," the first thing to consider is going back to the original data source for additional data. This is not always possible, but it should at least be an option.
D.7 Performance metrics
The most important part of any machine learning pipeline is the performance metric. If you don't know how well your machine learning model works, you can't make it better. The first thing we do when starting a machine learning pipeline is to set up a performance measure, such as ".score()" on any scikit-learn model. We then build a completely random classification/regression pipeline and compute the performance score at the end. This lets us make incremental improvements to the pipeline that gradually raise the score and bring us closer to the final goal. It is also a great way to reassure your boss and colleagues that you are on the right track.
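As a minimal sketch of such a random baseline, assuming scikit-learn is available (the data here is a stand-in for your own features and labels):

>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(100, 5)                        # stand-in features
>>> y = rng.randint(0, 2, size=100)              # stand-in labels
>>> baseline = DummyClassifier(strategy='uniform', random_state=0)
>>> baseline.fit(X, y)                           # predicts classes at random
>>> baseline.score(X, y)                         # roughly 0.5 for two balanced classes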
D.7.1 Classification metrics
For a classifier, we want it to get two things right: labeling things that truly belong to the class with that class's label, and not labeling things that don't belong to it. The counts of these two correct outcomes are called the true positives and the true negatives, respectively. If you have NumPy arrays containing the model's predictions and the true labels, you can count the correct predictions as shown in Listing D-3.
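The code for Listings D-3 and D-4 is reconstructed here as a sketch: it assumes the same example y_true and y_pred arrays that reappear in Listing D-8, and its counts agree with the confusion matrix shown below.

Listing D-3 Count the correct predictions

>>> import numpy as np
>>> y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])   # true class labels
>>> y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])   # model predictions
>>> true_positives = ((y_pred == y_true) & (y_pred == 1)).sum()
>>> true_positives                               # positive samples labeled correctly
4
>>> true_negatives = ((y_pred == y_true) & (y_pred == 0)).sum()
>>> true_negatives                               # negative samples labeled correctly
2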
The model can also be wrong in two ways: labeling a sample positive when it is actually negative (a false positive), or labeling it negative when it is actually positive (a false negative). Listing D-4 counts these wrong predictions.
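Listing D-4 Count the incorrect predictions (continuing the reconstruction sketch with the arrays above)

>>> false_positives = ((y_pred != y_true) & (y_pred == 1)).sum()
>>> false_positives                              # negative samples mislabeled positive
1
>>> false_negatives = ((y_pred != y_true) & (y_pred == 0)).sum()
>>> false_negatives                              # positive samples mislabeled negative
3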
Sometimes these four numbers are combined into a 2 × 2 matrix called the error matrix or confusion matrix. Listing D-5 shows what the predicted and true values look like in a confusion matrix.
Listing D-5 confusion matrix
>>> confusion = [[true_positives, false_positives],
... [false_negatives, true_negatives]]
>>> confusion
[[4, 1], [3, 2]]
>>> import pandas as pd
>>> confusion = pd.DataFrame(confusion, columns=[1, 0], index=[1, 0])
>>> confusion.index.name = r'pred \ truth'
>>> confusion
              1  0
pred \ truth
1             4  1
0             3  2
In a confusion matrix, we want the numbers on the diagonal (upper left and lower right) to be large and the numbers off the diagonal (upper right and lower left) to be small. However, the order of the positive and negative classes is arbitrary, so you may sometimes see the numbers in this table transposed. Always label your confusion matrix columns and rows. You may occasionally hear statisticians call this matrix a classifier contingency table, but you can avoid confusion by sticking with the name "confusion matrix."
For machine learning classification problems, there are two useful ways to combine some of these four counts into a single performance metric: precision and recall. Information retrieval (search engines) and semantic search are examples of such classification problems, because the goal is to classify documents (given an input query) as matching or not matching. In Chapter 2, we saw how stemming and lemmatization can improve recall while reducing precision.
Precision measures how many of the samples the model labeled as the class of interest (the positive class) actually belong to it, which is why it is also known as the positive predictive value. Since true positives are the positive-class samples predicted correctly, and false positives are the negative-class samples incorrectly labeled as positive, precision can be computed as shown in Listing D-6.
Listing D-6 Precision
>>> precision = true_positives / (true_positives + false_positives)
>>> precision
0.8
The confusion matrix in the example above gives a precision of 0.8: when the model predicted the positive class, it was correct 80% of the time.
Recall is similar to precision and is also known as sensitivity, the true positive rate, or the probability of detection. Since the total number of positive samples in the dataset is the sum of the true positives and the false negatives, recall, the fraction of positive samples the model correctly detected, can be computed as shown in Listing D-7.
Listing D-7 Recall
>>> recall = true_positives / (true_positives + false_negatives)
>>> recall
0.571...
This means that the model in the example above detected about 57% of the positive samples in the dataset.
D.7.2 Regression metrics
The two most common performance metrics for machine learning regression problems are root-mean-square error (RMSE) and the Pearson correlation coefficient. It turns out that classification problems are really regression problems underneath, so if your class labels have been converted to numbers (as we did in the previous section), you can use regression metrics on them. The code examples below reuse the predicted and true values from the previous section. RMSE is the most useful metric for most problems because it tells you how far your predictions are likely to be from the true values: RMSE is the standard deviation of the error, as shown in Listing D-8.
Listing D-8 Root-mean-square error (RMSE)
>>> y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
>>> y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])
>>> rmse = np.sqrt(np.sum((y_true - y_pred) ** 2) / len(y_true))
>>> rmse
0.632...
The Pearson correlation coefficient is another common performance metric for regression. The sklearn module attaches a closely related score (R²) to most regression models as the default .score() function. If you are unsure how these metrics are calculated, compute them manually once to get a feel for them. Listing D-9 shows the calculation of the correlation coefficient.
Listing D-9 correlation coefficient
>>> corr = pd.DataFrame([y_true, y_pred]).T.corr()
>>> corr[0][1]
0.218...
>>> (np.mean((y_pred - np.mean(y_pred)) * (y_true - np.mean(y_true)))
...     / np.std(y_pred) / np.std(y_true))
0.218...
So the correlation between our sample predictions and the true values is only about 22%.
D.8 Pro tips
Once you have mastered the basics, these simple tips will help you build good models faster:
- Work with a small random sample of the dataset to uncover possible defects in the pipeline;
- When ready to deploy the model into production, train the model with all the data;
- Try the approach you know best first; this applies to the feature extractors as well as the model itself;
- Use scatter plots and scatter matrices on low-dimensional features and targets to make sure you are not missing any obvious patterns;
- Plot high-dimensional data as a raw image to detect shifts across features;
- Try PCA on high-dimensional data (LSA for NLP data) when you want to maximize the differences between pairs of vectors;
- Use nonlinear dimensionality reduction, such as t-SNE, when you want to find matching pairs of vectors or perform regression in a low-dimensional space;
- Build an sklearn Pipeline object to improve the maintainability and reusability of your models and feature extractors;
- Automate hyperparameter tuning so that the model can spend its time learning about the data and you can spend yours learning machine learning.
Hyperparameter tuning
Hyperparameters are all of the values that determine the performance of your pipeline, including the model type and how it is configured. A hyperparameter can be the number of neurons and layers in a neural network, or the alpha value in an sklearn.linear_model.Ridge ridge regression model. Hyperparameters also include the values that control all of the preprocessing steps, such as the tokenization style, the list of ignored words, the minimum and maximum document frequencies for the TF-IDF vocabulary, whether to use lemmatization, the TF-IDF normalization method, and so on.
Hyperparameter tuning can be a slow process because each experiment requires training and validating a new model. Therefore, while searching a broad range of hyperparameters, it helps to shrink the dataset down to the smallest sample that is still representative. As the search approaches the final model that meets your requirements, you can increase the dataset size to use as much of the data as you need.
Optimizing a pipeline's hyperparameters is how you improve the performance of your model. Automating the tuning saves time that you can spend reading books like this one, or visualizing and analyzing your final results. You can still guide the tuning intuitively by setting the ranges of hyperparameters to try.
The most efficient algorithms for hyperparameter tuning are (from best to worst):
(1) Bayesian search;
(2) Genetic algorithm;
(3) Random search;
(4) Multi-resolution grid search;
(5) Grid search.
But by any measure, any computerized search algorithm that keeps working while you sleep is better than guessing new hyperparameters by hand.
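As a minimal sketch of automating such a search with scikit-learn's RandomizedSearchCV (the model, parameter range, and data are illustrative placeholders):

>>> import numpy as np
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.linear_model import Ridge
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(100, 5)                        # stand-in features
>>> y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.randn(100)
>>> search = RandomizedSearchCV(
...     Ridge(),
...     param_distributions={'alpha': np.logspace(-3, 3, 100)},
...     n_iter=20, cv=5, random_state=42)        # 20 random trials, 5-fold CV
>>> search.fit(X, y)
>>> search.best_params_                          # the best alpha found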
This article is excerpted from Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python, by Hobson Lane, Cole Howard, and Hannes Max Hapke, translated by Shi Liang, Lu Xiao, Tang Kexin, and Wang Bin.