It’s hard to know what’s not working, and sometimes making mistakes can help us become more professional.

In this post, data scientist Archy de Berker catalogs the pitfalls that he and his peers have fallen into while doing machine learning. These mistakes are common, and he hopes to walk you through some of the more interesting ones, the kind you only really get to see once you are working in the field.

This is not an entry-level article; it is best read alongside hands-on experience building (and breaking) models in PyTorch or TensorFlow.

This article focuses mainly on the errors in the green distribution of the original post's figure, though the purple and yellow distributions are also partly covered.

Common mistakes in machine learning

Berker categorized errors in machine learning into three broad categories, from lowest to highest in severity.

1. These mistakes will only waste your time

Two of the hardest things in computer science are naming things and cache invalidation, as a well-known tweet (embedded in the original post) puts it. In deep learning, the shape error is the scariest and most common of these: it is usually caused by multiplying matrices with mismatched dimensions.

This article won’t spend much time on these kinds of errors, because they are obvious: you find them, fix them, make them again, and fix them again. It’s a repetitive cycle.

2. These errors can lead to inaccurate results

This kind of error can cost you a lot because it can lead to inaccurate model results.

In 2015, an AirAsia flight bound from Sydney, Australia, to Kuala Lumpur, Malaysia, ended up landing at Melbourne airport instead after a navigation error. An inaccurate model is like that flight: a quiet technical problem, and you arrive at the wrong destination.

For example, suppose you add a feature to your model that also adds many parameters, and you compare against your previous results without re-tuning the hyperparameters. Performance looks worse after adding the feature, so you conclude that the feature hurts the model. That conclusion is not justified; the now more expressive model needs a more rigorous hyperparameter search before a fair comparison can be made.

The effects of such errors compound over time, making experimental results increasingly inaccurate, so catching them early is extremely valuable.

3. These errors can lead you to believe that your model is “perfect”

This is a serious mistake and can cause you to overestimate the performance of your model. Such errors are often hard to spot because we are viscerally reluctant to admit that a seemingly “perfect” model may be an illusion.

When a model does surprisingly poorly, we tend to disbelieve it and test it again, but when it does surprisingly well, we usually believe it and start gloating. This is what is known as Confirmation Bias, the tendency of individuals to support their own beliefs and assumptions, regardless of whether they are factual or not.

A seemingly “perfect” model is usually the result of overfitting, which means the training data is no longer representative of real-world data, or of choosing the wrong evaluation metric; both are explained in detail later in this article.

If you take only one thing away from this article, let it be this: nothing is more embarrassing and frustrating than discovering, too late, that your model’s real performance is actually bad.

The life cycle of machine learning

Machine learning works like a three-stage sausage machine: take in the data, feed it through the model, and then quantify the output with some metrics.

Next we’ll look at some seemingly silly mistakes at each stage.

Two, the output: evaluation metrics

Machine learning boils down to the process of constantly reducing the value of the loss function.

However, the loss function is never the ultimate goal of optimization; it is only a proxy. In classification, for example, we minimize cross-entropy on the training and validation sets, but what we actually trust are results on the test set, or metrics such as F1 and AUC. Treating evaluation casually in order to iterate faster is tolerable when the gap between the loss and the thing you actually care about is tiny, but it leads to serious problems as that gap grows.

In either case, if the loss function is no longer representative of the model’s true performance, you’re in big trouble.

Here are some common ways to get this wrong:

1) Mixing the training set and test set

Mixing the training and test sets is easy to do by accident and often produces impressively good numbers, but such models perform badly in messy real-world environments.

Therefore the training, validation, and test sets must not overlap; each needs to contain distinct samples. Think about what kind of generalization the model actually needs, because that is ultimately what the test-set performance is supposed to quantify.

Take store receipt data as an example: if you are forecasting from store receipts, the test set obviously needs to contain receipts the model has never seen, but should it also contain stores (or items) never seen during training, to make sure the model is not overfitting to specific stores?

The best approach is to split the data into training, validation, and test sets once, up front, keep them in separate folders, and give everything downstream unambiguous names such as TrainDataLoader and TestDataLoader.
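A minimal sketch of this split-once-up-front idea, assuming the data is a flat list of example files; the function name and folder layout are illustrative, not from the original post:

```python
# Split a list of data files once, deterministically, into disjoint
# train/val/test folders so every later run sees exactly the same split.
import random
import shutil
from pathlib import Path

def split_dataset(files, out_dir, val_frac=0.1, test_frac=0.1, seed=0):
    files = sorted(files)               # deterministic order before shuffling
    random.Random(seed).shuffle(files)  # fixed seed keeps the split reproducible
    n_test = int(len(files) * test_frac)
    n_val = int(len(files) * val_frac)
    splits = {
        "test": files[:n_test],
        "val": files[n_test:n_test + n_val],
        "train": files[n_test + n_val:],
    }
    for name, subset in splits.items():
        target = Path(out_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        for f in subset:
            shutil.copy(f, target / Path(f).name)  # copy; never touch the originals
    return splits
```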

2) Using the wrong loss function

Outright misuse of loss functions is actually rare, because countless tutorials show how to use them. The two most common mistakes are confusing whether a loss function expects probabilities or raw logits (i.e., whether a softmax should be applied first), and confusing regression losses with classification losses.
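A minimal PyTorch sketch of the logits-versus-probabilities confusion; nn.CrossEntropyLoss applies log-softmax internally, so it should be fed raw logits:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 5)             # (batch, num_classes), raw model outputs
targets = torch.randint(0, 5, (8,))

criterion = nn.CrossEntropyLoss()
loss_correct = criterion(logits, targets)   # correct: pass raw logits

probs = torch.softmax(logits, dim=-1)
loss_wrong = criterion(probs, targets)      # wrong: softmax is effectively applied twice,
                                            # no error is raised, training just degrades
print(loss_correct.item(), loss_wrong.item())
```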

Even in academia it is common to mix up regression and classification. The Amazon Reviews dataset, for example, is routinely used by top labs as a classification task, but that framing is questionable: a 5-star review is clearly more similar to a 4-star review than to a 1-star review, so ordinal regression is more appropriate.

2. Choosing the wrong evaluation metric

Different tasks use different loss functions, and when validating model performance we often combine several metrics with different emphases. For example, BLEU is the metric of choice for machine translation, ROUGE is used for automatic summarization, and accuracy, precision, or recall can be used for many other tasks.

In general, evaluation metrics are easier to interpret than loss functions. A good principle is to log as much as possible.

Think carefully about how to split disjoint training, validation, and test sets so that the metrics measure genuine generalization. Use evaluation metrics to monitor the model during training rather than waiting until the very end to run the test set; this gives a much better picture of how training is going and keeps problems from surfacing only at the last moment.

Pay extra attention to the choice of evaluation metric. For example, you cannot use simple positional accuracy to evaluate a sequence model, because a small misalignment between the predicted and reference sequences can drive accuracy to zero; edit distance is used instead to evaluate sequence outputs. Choosing the wrong metric is a very painful experience.
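A minimal sketch of scoring sequence predictions by edit (Levenshtein) distance rather than exact positional accuracy; this is a plain-Python illustration, not code from the original post:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

print(edit_distance("kitten", "sitting"))  # 3
```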

Again taking sequence models as an example, make sure to exclude all special tokens, usually the start-of-sequence, end-of-sequence, and padding tokens, from the evaluation. If you forget to exclude them, the metrics may look good while the model has really only learned to predict long runs of padding.
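A minimal PyTorch sketch of excluding padding positions from a token-accuracy computation; PAD_ID is an assumed padding index that would normally come from your vocabulary:

```python
import torch

PAD_ID = 0  # assumed padding index for this illustration

def masked_token_accuracy(predictions, targets):
    """predictions, targets: LongTensors of shape (batch, seq_len)."""
    mask = targets != PAD_ID                      # ignore padding positions entirely
    correct = (predictions == targets) & mask
    return correct.sum().float() / mask.sum().clamp(min=1).float()

preds = torch.tensor([[5, 7, 2, 0], [4, 4, 0, 0]])
gold  = torch.tensor([[5, 9, 2, 0], [4, 4, 0, 0]])
print(masked_token_accuracy(preds, gold))  # 4 correct out of 5 non-pad tokens = 0.8
```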

One mistake that stuck with the author involved semantic parsing: turning natural-language questions such as “How many flights are there from Montreal to Atlanta tomorrow?” into database (SQL) queries. To evaluate accuracy, they sent both the model’s predicted SQL and the ground-truth SQL to the database and checked whether the returned results matched. They had set things up so that the database returned “error” for any meaningless query. Then a bug corrupted both the predicted and the ground-truth queries, so both returned “error”, the results matched, and the model scored 100% accuracy.

That broke the comforting assumption that any mistake you make will only hurt performance. Keep checking what the model actually predicts, not just the metric values.

3. How to avoid evaluation-metric errors

1) Run all the metrics on an untrained model first

If the model performs well without any training, something is wrong.
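A minimal sketch of such a sanity check, with build_model, val_loader, and compute_metrics as assumed placeholders for your own pipeline; the point is that a randomly initialized model should score near chance:

```python
import torch

def sanity_check_untrained(build_model, val_loader, compute_metrics, chance_level):
    model = build_model()               # freshly initialized, no training at all
    model.eval()
    outputs, targets = [], []
    with torch.no_grad():
        for x, y in val_loader:
            outputs.append(model(x))
            targets.append(y)
    metrics = compute_metrics(torch.cat(outputs), torch.cat(targets))
    for name, value in metrics.items():
        if value > chance_level + 0.05:  # suspiciously good without training
            print(f"WARNING: {name}={value:.3f} on an untrained model; check for leakage or bugs")
    return metrics
```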

2) Log everything

Machine learning is a quantitative discipline, but numbers can sometimes be deceiving. Log every conceivable number, but in a way that’s easy to understand.

In NLP this usually means reversing the tokenization, which is fiddly but 100 percent worth it: the logs give you a qualitative picture of what the model is learning. For example, language models typically start out by emitting strings like “eeeeeee”, because these are the most common characters in the data.
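A minimal sketch of reversing the tokenization for logging, assuming a hypothetical id_to_token mapping and special-token ids:

```python
def detokenize(token_ids, id_to_token, special_ids=(0, 1, 2)):
    """Convert a list of token ids back into a readable string, dropping special tokens."""
    tokens = [id_to_token[i] for i in token_ids if i not in special_ids]
    return " ".join(tokens)

id_to_token = {0: "<pad>", 1: "<bos>", 2: "<eos>", 3: "the", 4: "cat", 5: "sat"}
print(detokenize([1, 3, 4, 5, 2, 0, 0], id_to_token))  # "the cat sat"
```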

If you are working on image tasks, logging is more awkward because you cannot log images as text. One trick during OCR training is to render the input images as ASCII art so they can still be inspected in the logs.

3) Study the validation set

Use your evaluation metrics to find the best- and worst-performing samples in the validation set. In regression tasks, residual analysis is helpful; more generally, inspecting individual samples together with some measure of confidence (such as softmax probabilities) helps you understand which parts of the distribution the model handles well and where it is likely to fail.

But remember, as Anscombe’s Quartet shows, averages can be misleading.

Anscombe’s Quartet: all four datasets have the same mean and variance and fit the same regression line, yet they are very different. So don’t lean too heavily on summary statistics; understand the data itself.

If you are dealing with a high-dimensional problem, try plotting the error against individual features to locate the cause. Are there regions of input space where the model performs very poorly? If so, you may need more data or data augmentation in those regions.

Consider how ablations and perturbations affect model performance. Tools such as LIME and Eli5 help simplify and interrogate models. Perturbation analysis is nicely described in the article below, which found that a CNN trained to classify X-rays was relying on markers introduced by the specific X-ray machine (whose use correlates with disease prevalence) rather than on clinically meaningful features to decide whether a patient has pneumonia:

medium.com/@jrzech/wha…

Three, models

Many courses and articles now focus on modeling. But in reality, as a machine learning practitioner, you spend most of your time crunching data and metrics, not researching innovative algorithms.

The vast majority of deep learning errors are shape errors, which at least fail loudly and obviously. The subtler model errors below are harder to notice.

1. Types of model errors

There are many types of model errors, as follows:

1) A model containing non-differentiable operations

In a deep learning model, everything must be differentiable end to end to support backpropagation. You might therefore expect non-differentiable operations to be clearly flagged in frameworks such as TensorFlow. They are not: Berker was particularly tripped up by the Keras Lambda layer, which can silently break backpropagation. One workaround is to inspect model.summary() and verify that most parameters are trainable; if you find layers with unexpectedly untrainable parameters, automatic differentiation may be broken.
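A complementary check, sketched here in PyTorch under assumed placeholder inputs: run one forward/backward pass on a freshly built model and flag any trainable parameter that received no gradient:

```python
import torch

def check_gradients(model, loss_fn, sample_input, sample_target):
    # Intended for a freshly constructed model, whose .grad fields start as None.
    loss = loss_fn(model(sample_input), sample_target)
    loss.backward()
    missing = [name for name, p in model.named_parameters()
               if p.requires_grad and p.grad is None]
    if missing:
        # Something between these parameters and the loss likely broke the graph.
        print("No gradient reached:", missing)
    return missing
```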

2) Forgetting to turn off dropout at test time

We all know that dropout must be turned off at test time, otherwise predictions become partly random. This can be very confusing, especially for whoever is deploying the model and running the test sets.

In PyTorch this is handled by calling eval() on the model before evaluation (and train() to switch back). Note also that dropout during training can produce the odd situation where the model appears more accurate on the validation set than on the training set: dropout is active when the training metrics are computed but disabled for validation, so the model can look like it is underfitting, which causes some head-scratching.
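A minimal PyTorch sketch of the train/eval switch; model and data_loader are assumed placeholders:

```python
import torch

def evaluate(model, data_loader):
    model.eval()                       # dropout off (and batch norm in inference mode)
    outputs = []
    with torch.no_grad():              # skip gradient bookkeeping during evaluation
        for x, _ in data_loader:
            outputs.append(model(x))
    model.train()                      # dropout back on before resuming training
    return torch.cat(outputs)
```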

3) Getting dimension conventions wrong

Different frameworks use different conventions for ordering the batch, sequence-length, and channel dimensions. Some let you change the convention, others do not, and mixing them up leads to errors.

Getting the dimension ordering wrong can cause strange behavior. For example, if you swap the batch and sequence-length dimensions, the model ends up ignoring information from some samples and fails to retain information across time steps.
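A small PyTorch illustration of the convention trap: nn.LSTM expects (seq_len, batch, features) unless batch_first=True, so asserting shapes early fails fast if the dimensions are swapped:

```python
import torch
import torch.nn as nn

batch, seq_len, features = 4, 10, 8
x = torch.randn(batch, seq_len, features)        # our data is batch-first

lstm = nn.LSTM(input_size=features, hidden_size=16, batch_first=True)
out, _ = lstm(x)
assert out.shape == (batch, seq_len, 16), out.shape   # fail fast if dims are swapped
```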

2. How to avoid model errors

1) Modular, testable code


Writing well-structured code and unit testing helps avoid model errors.

A model can be tested effectively by splitting the code into discrete blocks, each with a clearly defined responsibility. The tests should focus on verifying that the model behaves as expected when the batch size and the amount of input data change. Berker recommends a post by Chase Roberts that goes into detail on unit testing for ML code:

medium.com/@keeper6928…
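In that spirit, here is a minimal sketch (not taken from the linked post) of two unit tests: one checks output shapes across batch sizes, the other checks that a training step actually changes the weights. build_model stands in for your real model:

```python
import torch
import torch.nn as nn

def build_model():                 # assumed stand-in for your real model
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def test_output_shape():
    model = build_model()
    for batch in (1, 7, 64):       # vary the batch size
        out = model(torch.randn(batch, 32))
        assert out.shape == (batch, 10)

def test_parameters_update():
    model = build_model()
    before = [p.clone() for p in model.parameters()]
    loss = nn.functional.cross_entropy(model(torch.randn(8, 32)),
                                       torch.randint(0, 10, (8,)))
    loss.backward()
    torch.optim.SGD(model.parameters(), lr=0.1).step()
    # at least one parameter tensor should have changed after the step
    assert any(not torch.equal(b, a) for b, a in zip(before, model.parameters()))
```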

2) Dimension assertions

Berker likes to build dimension assertions into ML code, so the reader can see clearly which dimensions are allowed to change and which are not, and so the code throws an error as soon as something unexpected happens.

Expressive Tensorflow code, courtesy of Keith Ito. Note modularity and comments.

At the very least, get into the habit of adding dimension comments to your code so the reader does not have to hold every shape in their head. Keith Ito’s beautiful Tacotron code is an excellent example of such comments; see the address below:

Github.com/keithito/ta…
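A minimal sketch, using a hypothetical encoder rather than the Tacotron code itself, of combining shape comments with dimension assertions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len)
        assert tokens.dim() == 2, tokens.shape
        x = self.embed(tokens)          # (batch, seq_len, emb_dim)
        out, _ = self.rnn(x)            # (batch, seq_len, hidden)
        return out
```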

3) Overfit a simple model on a tiny amount of data

Tip: first make sure the model can overfit a very small subset of the dataset; this flushes out obvious bugs in very little time.

Make the model as configurable as possible through a configuration file, and define a test configuration with the fewest possible parameters. Then add a CI/CD step that checks the model can overfit a very small dataset and run it automatically; this catches code changes that break the model or the training pipeline.
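A minimal sketch of such a CI check, with build_model and tiny_batch as assumed placeholders for your own pipeline:

```python
import torch
import torch.nn as nn

def test_can_overfit_tiny_subset(build_model, tiny_batch, steps=200):
    x, y = tiny_batch                                  # e.g. 10 samples and their labels
    model = build_model()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    # a healthy model and pipeline should drive the loss to near zero on 10 samples
    assert loss.item() < 0.05, f"could not overfit tiny subset, loss={loss.item():.3f}"
```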

Four, data

1. First, understand the data

You should be sick of looking at your data before you start modeling.

Most machine learning models are trying to replicate some of the brain’s pattern-recognition abilities. Familiarize yourself with the data before you write any code: practice that pattern recognition yourself and your code will be much easier to write. Understanding the dataset helps with overall architecture choices and metric selection, and lets you quickly spot potential performance problems.

In general, the data itself can point to problems: class imbalance, file-type issues, or data bias. Data bias is hard to assess algorithmically unless you have a very “smart” model that notices it on its own: “All the cat pictures were taken indoors and all the dog pictures outdoors, so maybe I’m training an indoor/outdoor classifier rather than a classifier that recognizes cats and dogs?”

Karpathy built an annotation platform for ImageNet to evaluate his own performance and deepen his understanding of the data set.

As Karpathy points out, a data-probing setup lets you view, slice, and dice the data. In his talk at KDD 2018 in London, he stressed that many ML engineers spend their time writing code that optimizes data and labels, not code that optimizes models.

To understand the data, you need to understand the following three data distributions:

  • Distribution of input data, such as average sequence length, average pixel value, audio duration

  • Distribution of output data; class imbalance is a big problem

  • The distribution of outputs given inputs, which is usually the thing you are actually modeling (see the sketch below)
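A minimal sketch of inspecting the first two of these distributions, assuming a toy dataset of (text, label) pairs:

```python
from collections import Counter

def summarise(dataset):
    lengths = [len(text.split()) for text, _ in dataset]       # input distribution
    labels = Counter(label for _, label in dataset)            # output distribution
    print("avg length:", sum(lengths) / len(lengths),
          "min/max:", min(lengths), max(lengths))
    print("class balance:", {k: v / len(dataset) for k, v in labels.items()})

summarise([("the cat sat", "pos"), ("bad movie really bad", "neg"), ("great", "pos")])
```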

2. How to load data

Loading and preprocessing data efficiently is one of the more painful parts of machine learning engineering, often a trade-off between efficiency and transparency.

Proprietary data formats such as TensorFlow Records pack sequences of data points into big blobs, reducing frequent reads from disk, but at the cost of transparency: these structures make it hard to inspect or decompose the data further, and if you want to add or remove data you have to re-serialize everything.

At present, PyTorch’s Dataset and DataLoader strike a good balance between transparency and efficiency, with dedicated packages for particular domains: TorchText handles text datasets and TorchVision handles image datasets. These packages provide reasonably efficient loading, padding, and batching for each domain.
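A minimal sketch of the Dataset/DataLoader pattern, assuming the data is a small in-memory list of (features, label) pairs:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairsDataset(Dataset):
    def __init__(self, pairs):
        self.pairs = pairs                     # keep the raw data easy to inspect

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        x, y = self.pairs[idx]
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y)

pairs = [([0.1, 0.2], 0), ([0.3, 0.4], 1), ([0.5, 0.6], 0)]
loader = DataLoader(PairsDataset(pairs), batch_size=2, shuffle=True)
for xb, yb in loader:
    print(xb.shape, yb.shape)                  # (batch, 2), (batch,)
```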

3. Methods to speed up data loading

Here are some of the lessons Berker learned in his attempts to speed up data loading:

1) Not loading the data you think you are loading

Sooner or later you discover you have dropped data or loaded duplicates. Pitfalls Berker has hit include:

  • Writing a regular expression to load certain files from a folder, then failing to update it when new files are added, so the new files never get loaded

  • Miscalculating the number of steps in an epoch, so that some of the data gets skipped

  • Recursive symlinks in a folder causing the same data to be loaded multiple times (in Python, the recursion limit of 1,000 eventually stops this)

  • Failing to traverse the file hierarchy fully and therefore missing the data in subfolders (a more defensive listing pattern is sketched below)
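A minimal sketch of a more defensive listing routine (the helper name and file pattern are illustrative): it walks subfolders, sorts for determinism, de-duplicates by resolved path, and logs the file count on every run:

```python
from pathlib import Path

def list_data_files(root, pattern="*.txt"):
    seen, files = set(), []
    for path in sorted(Path(root).rglob(pattern)):   # walks subfolders too
        real = path.resolve()                        # file symlinks collapse to one entry
        if real not in seen:
            seen.add(real)
            files.append(path)
    print(f"found {len(files)} files under {root}")  # log the count every run
    return files
```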

2) Storing data poorly

Don’t put all your data in one directory.

If you have millions of text files in a single folder, everything becomes very, very slow. Even just listing or counting the files means waiting on an enormous directory, which kills productivity. It gets even worse when the data is not stored locally but sits in a remote data center accessed through an SSHFS-mounted directory.

The second pitfall is failing to back up data during preprocessing. The right approach is to save expensive preprocessing results to disk so you do not have to redo them on every run, while making sure you never overwrite the raw data and keeping track of which preprocessing code produced which data.

One way to do this is sketched below.
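A minimal sketch, an illustration rather than the post's original example, of caching preprocessing output to disk keyed by a hash of the preprocessing parameters, without ever touching the raw file:

```python
import hashlib
import json
import pickle
from pathlib import Path

def cached_preprocess(raw_path, preprocess_fn, params, cache_dir="cache"):
    # The cache key records which parameters produced which artifact.
    key = hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()[:12]
    cache_file = Path(cache_dir) / f"{Path(raw_path).stem}_{key}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())       # reuse earlier work
    processed = preprocess_fn(raw_path, **params)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(pickle.dumps(processed))         # raw file is never overwritten
    return processed
```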

3) Inappropriate preprocessing

Mangling the data during preprocessing is common, especially in NLP tasks.

Mishandling non-ASCII characters is a big pain point, and because such characters appear relatively rarely, the problem is hard to spot.

Tokenization is another frequent source of mistakes. If you use word-level tokenization, it is easy to build a vocabulary on one dataset and then discover that a large fraction of words are out of vocabulary when you move to another dataset. The model does not raise an error in this case; it simply performs poorly on the other data.

Vocabulary differences between the training set and the test set are also a problem, because words that appear only in the test set were never trained on.

Therefore, it is very valuable to know the data and catch these problems early.

4. How to avoid data-processing errors

1) Log as much as possible

Log sample data at every stage of processing: not just the model’s results, but the intermediate data as well.

2) Know your training numbers by heart

You should know these numbers off the top of your head:

  • How many samples are there?

  • How many samples are in one batch (the batch size)?

  • How many batches are there in an Epoch?

These numbers should also be logged, or checked with assertions, so that nothing is silently dropped.
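A minimal sketch of turning those numbers into assertions, with train_loader and the expected counts as assumed placeholders:

```python
def check_epoch_arithmetic(train_loader, expected_samples, batch_size):
    n_samples = len(train_loader.dataset)
    n_batches = len(train_loader)
    assert n_samples == expected_samples, f"{n_samples} samples, expected {expected_samples}"
    # with drop_last=False the last batch may be smaller, hence the ceiling division
    assert n_batches == -(-n_samples // batch_size), (n_batches, n_samples, batch_size)
    print(f"{n_samples} samples, batch size {batch_size}, {n_batches} batches per epoch")
```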

3) Save all state created during preprocessing

Some preprocessing steps use or create artifacts, so remember to save them. For example, if you normalize numerical data using the mean and variance of the training set, save that mean and variance so the exact same transformation can be applied at test time.

Similarly, in NLP, tokenization cannot be reproduced at test time unless the training-set vocabulary is saved. Building a new vocabulary and re-tokenizing at test time would be meaningless, because every word would be mapped to a completely different ID.
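A minimal sketch of saving and reusing preprocessing state, assuming numeric values to normalize and a word-level vocabulary with id 0 reserved for unknown tokens:

```python
import json
import numpy as np

def fit_and_save(train_values, train_tokens, path="preprocess_state.json"):
    state = {
        "mean": float(np.mean(train_values)),
        "std": float(np.std(train_values)),
        "vocab": {tok: i for i, tok in enumerate(sorted(set(train_tokens)), start=1)},
    }
    with open(path, "w") as f:
        json.dump(state, f)                 # the artifact lives next to the model
    return state

def load_and_apply(values, tokens, path="preprocess_state.json"):
    with open(path) as f:
        state = json.load(f)
    normalised = (np.asarray(values) - state["mean"]) / (state["std"] + 1e-8)
    token_ids = [state["vocab"].get(tok, 0) for tok in tokens]   # 0 = unknown token
    return normalised, token_ids
```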

4) Down-sampling

When the data is very large (such as images or audio), one temptation is to feed raw data into the neural network and let the model learn the most efficient preprocessing itself. That might work with unlimited time and compute, but in practical situations downsampling is the better option.

We do not need full-HD images to train a dog/cat classifier. We can let the network learn its own downsampling (for example with dilated convolutions), or simply downsample up front with traditional methods.

Downsampling saves time because it lets you fit and evaluate models much faster.
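A minimal sketch of downsampling up front with torchvision transforms (the 64x64 target size is an arbitrary illustration):

```python
from torchvision import transforms

downsample = transforms.Compose([
    transforms.Resize((64, 64)),     # dog/cat classification rarely needs full HD
    transforms.ToTensor(),
])
# e.g. dataset = torchvision.datasets.ImageFolder("data/train", transform=downsample)
```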

Five, the conclusion

Here are five guiding principles for machine learning:

  • Start small, and the experiment will go fast. Reducing cycle times allows problems to be detected early and hypotheses to be tested faster.

  • Know the data. You can’t model well without understanding the data. Don’t waste your time on fancy models; take your time and finish your data exploration.

  • Log as much as you can. The more information you have about your training process, the easier it is to identify anomalies and make improvements.

  • Focus on simplicity and transparency, not just efficiency. Don’t sacrifice code transparency to save a little time. The time wasted understanding opaque code is much greater than the running time of inefficient algorithms.

  • If the model performs unbelievably well, there may be a problem. There are many mistakes in machine learning that can “fool” you, and being a good scientist means being rational about finding and eliminating them.

Recommended reading:

Andrej Karpathy has also written an excellent blog post, A Recipe for Training Neural Networks, which likewise covers common machine learning mistakes; Karpathy focuses more on technical details and deep learning. The address is:

karpathy.github.io/2019/04/25/…

Via towardsdatascience.com/rookie-erro…

