From Pete Warden’s blog, by Pete Warden; translated by Heart of the Machine.
There is a big difference between deep learning research and production. In academic research, people tend to focus on model architecture design and work with relatively small, standard datasets. This article argues, from a production perspective, that deep learning projects need to pay far more attention to building good datasets, and draws on the author’s own development experience to share several simple, practical suggestions covering dataset characteristics, transfer learning, metrics, and visual analysis. These suggestions are valuable for researchers and developers alike.
This post was also retweeted by Andrej Karpathy.
About the author: Pete Warden is CTO of Jetpac Inc., author of the O’Reilly books Data Source Handbook and Big Data Glossary, and a founder of several open source projects such as OpenHeatMap and the Data Science Toolkit.
Andrej Karpathy showed this slide during his talk at Train AI (www.figure-eight.com/train-ai/) and I loved it! It perfectly captures the difference between deep learning research and actual production. Most academic papers focus on creating and improving models, drawing their training data from a small set of public datasets. But as far as I know, when people start using machine learning in real applications, they spend most of their time worrying about the training data.
There are many good reasons why researchers are so focused on model architecture, but it does mean that there are few resources available to people focused on applying machine learning in production environments. To address this, I gave a talk at the conference on “The Unreasonable Effectiveness of Training Data,” and in this blog post I want to expand on why data is so important and share some practical tips for improving it.
As part of my job, I work closely with many researchers and product teams. I’ve seen the massive gains they achieve when they concentrate on the data side of their model building, and that is what convinced me of the power of improving data. The biggest obstacle to applying deep learning in most applications is getting high enough accuracy in the real world, and as far as I know, the fastest way to improve accuracy is to improve the training set. Even if you are constrained in other ways, such as by latency or storage space, improving accuracy for a particular model lets you trade some of it back for those performance metrics by using a smaller architecture.
Speech data set
I can’t share most of my observations about production systems, but I have an open source example that illustrates the same pattern. Last year I created a simple speech recognition example for TensorFlow (www.tensorflow.org/tutorials/a…), and it turned out that none of the existing datasets could easily be used as training data. Thanks to the open speech recording site the AIY team helped me launch (aiyprojects.withgoogle.com/open_speech…), and with the generous help of many volunteers, I was able to collect 60,000 one-second audio clips of people saying short words. The model trained on this data worked, but it was still not as accurate as I wanted. To understand how much of that came from limitations in my own model design, I launched a Kaggle competition using the same dataset (www.kaggle.com/c/tensorflo…). The participants did much better than my simple model, but even across many different approaches, multiple teams plateaued at an accuracy of only around 91%. To me, this meant there was something fundamentally wrong with the data itself, and the contestants did find a lot of problems, such as incorrect labels and truncated audio. This encouraged me to fix the problems they found and to increase the size of the dataset.
I looked at the error metrics to see which problems the model ran into most often, and found that errors were especially likely in the “other” category (when speech is recognized but the word is not in the model’s limited vocabulary). To address this, I increased the number of different words we captured, to provide more diverse training data.
Since Kaggle contestants reported labeling errors, I added an extra validation pass in the form of crowdsourcing: asking people to listen to each clip and confirm that it matched the intended label. Since the Kaggle contest also turned up nearly silent or truncated files, I wrote a small utility to do some simple audio analysis (github.com/petewarden/…) and automatically remove particularly bad samples. In the end, despite deleting the bad files, we ended up with more than 100,000 speech samples, thanks to more volunteers and some paid crowdsourcing workers.
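The real script is the one linked above; for a sense of what that kind of check looks like, here is a rough, hypothetical sketch that flags clips that are shorter than expected or nearly silent (the directory layout and the thresholds are made-up values, not the ones the actual tool uses):

```python
# Illustrative sketch only: flag clips that are too short or too quiet.
# Assumes 16-bit mono WAV files laid out as speech_commands/<label>/<clip>.wav.
import wave
import audioop
import glob

EXPECTED_SECONDS = 1.0
MIN_RMS = 200  # hypothetical loudness floor for 16-bit samples

def check_clip(path):
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
        duration = w.getnframes() / w.getframerate()
        rms = audioop.rms(frames, w.getsampwidth())
    problems = []
    if duration < 0.9 * EXPECTED_SECONDS:
        problems.append("truncated")
    if rms < MIN_RMS:
        problems.append("near-silent")
    return problems

for path in glob.glob("speech_commands/*/*.wav"):
    issues = check_clip(path)
    if issues:
        print(path, ",".join(issues))
```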
To help others use the dataset (and to learn from my mistakes!) I wrote it all up, along with the latest results, in an arXiv paper (arxiv.org/abs/1804.03…). The most important conclusion was that we were able to improve top-1 accuracy from 85.4% to 89.7% without changing the model or the test data. That’s a huge improvement (over 4%), and it translates into a noticeably better experience when people use the model in the Android or Raspberry Pi sample applications. Even though the model I’m using is far from state of the art, I’m convinced I would not have achieved the same gain by spending that time tweaking the model instead.
I’ve seen this kind of improvement several times in production settings. When you want to do the same, it can be hard to know where to start. You can take some inspiration from the techniques I used on the speech data, but in the following sections I’ll walk through some specific methods that I’ve found useful.
First, look at your data
This may seem obvious, but the best first step is to randomly review the training data you’re going to use. Copy some files to your local machine and spend a few hours previewing them. If you’re working with images, scrolling through thumbnails in something like the macOS Finder lets you look over thousands of them quickly. For audio, you can play previews from the Finder, or dump random snippets of text to a terminal. Because I didn’t spend enough time doing this with the first version of the speech commands data, Kaggle contestants found a lot of problems as soon as they started working with it.
I always feel a bit silly going through this process, but I’ve never regretted it. Every time I’ve done it, I’ve found something critically important about the data: imbalances in the number of samples between categories, corrupted data (such as PNG files saved with JPG extensions), wrong labels, or just surprising combinations. Tom White made some remarkable discoveries while examining ImageNet, such as the label “sunglass” actually referring to an archaic device for amplifying sunlight. Andrej’s work manually classifying ImageNet (karpathy.github.io/2014/09/02/…) also taught me a lot about the dataset, including how hard it is to distinguish all the different dog breeds, even for a person.
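To go beyond eyeballing, a few lines of scripting can surface some of the issues above automatically. This is just an illustrative sketch, assuming a layout of one folder per class, and it only checks the two image types mentioned:

```python
# Illustrative sketch: count samples per class and flag files whose magic
# bytes don't match their extension (e.g. PNGs saved with a .jpg suffix).
import collections
import pathlib

counts = collections.Counter()
for path in pathlib.Path("dataset").glob("*/*"):
    if not path.is_file():
        continue
    counts[path.parent.name] += 1
    with path.open("rb") as f:
        header = f.read(4)
    is_png = header.startswith(b"\x89PNG")
    is_jpg = header.startswith(b"\xff\xd8")
    if path.suffix.lower() in (".jpg", ".jpeg") and is_png:
        print("PNG with JPG extension:", path)
    elif path.suffix.lower() == ".png" and is_jpg:
        print("JPG with PNG extension:", path)

# A quick look at class balance.
for label, n in counts.most_common():
    print(f"{label}: {n} samples")
```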
The actions you’ll take will depend on what you find, but you should always check this before you do any other data cleansing, because a visual understanding of the data set’s contents will help you make better decisions on the remaining steps.
Choose a model quickly
Don’t spend too much time choosing a model. If you’re working on image classification, check out AutoML; otherwise, look at TensorFlow’s model repository (github.com/tensorflow/…) or the examples collected by fast.ai (www.fast.ai/) to find a model for a problem similar to the one in your product. The important thing is to start iterating as quickly as possible, so that real users can try your model early and often. You can always swap in an improved model later and may well see better results, but you have to get the data right first. Deep learning still obeys the basic rule of “garbage in, garbage out,” so even the best model is limited by flaws in its training set. By picking a model and testing it, you’ll be able to see those flaws and start improving the data.
To speed up iteration even more, try starting from a model that has already been pre-trained on a large existing dataset, and use transfer learning to fine-tune it with the (possibly much smaller) set of data you’ve collected. This is usually much faster and gives better results than training on a small dataset from scratch, and it lets you quickly learn how to adjust your data-collection strategy. Most importantly, you can adapt the data collection (and processing) process based on feedback from the results, rather than treating data collection as a separate phase that happens before training.
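To make that concrete, here is a minimal transfer-learning sketch using tf.keras with MobileNetV2 as the pre-trained backbone; the dataset directory, image size, and class count are placeholder assumptions rather than anything from the original post:

```python
# Minimal transfer-learning sketch (TF 2.x / tf.keras assumed).
import tensorflow as tf

NUM_CLASSES = 5          # placeholder: however many labels you collected
IMG_SIZE = (160, 160)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "my_small_dataset", image_size=IMG_SIZE, batch_size=32)

# Start from weights learned on a large public dataset and freeze them.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)  # quick first iteration; tune later
```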
Fake it before you make it (annotate data manually)
The biggest difference between building models for research and for production is that research usually starts with a well-defined problem, while the requirements of a real application live in users’ heads and can only be teased out gradually over time. For example, at Jetpac we wanted to find good photos to show in automated travel guides for cities. We started by asking raters to label photos they considered good, but we ended up with lots of pictures of smiling faces, because that’s how they interpreted the question. We put this into a demo version of the product to test users’ reactions and found they weren’t impressed. To fix this, we changed the question to “Does this photo make you want to go to the place it shows?”. That greatly improved the quality of our results, though it turned out that workers in Southeast Asia were far more likely to rate conference photos, full of men in suits holding wine glasses in large hotels, as stunning. This mismatch was a reminder of the bubble we live in, but it was also a practical problem, because our product’s target users were Americans, who tend to find conference photos depressing rather than appealing. In the end, the six of us on the Jetpac team rated more than two million photos ourselves, because we knew the criteria better than anyone we could train to do it.
This is an extreme example, but it shows how much the labeling process depends on the application’s requirements. For most production use cases, a large part of the work is figuring out the right question for the model to answer, and getting that right is critical. If your model is answering the wrong question, you’ll never be able to build a reliable user experience on that shaky foundation.
I’ve found that the only way to tell whether you’re asking the right question is to mock up your application with humans in the loop, rather than with a machine learning model. This approach is sometimes called “Wizard-of-Oz-ing” because of the humans hidden behind the curtain. In Jetpac’s case, we had people manually select photos for some sample travel guides, rather than training a model, and used feedback from user tests to adjust the criteria for choosing pictures. Once we could reliably get positive feedback from those tests, we translated the photo selection rules we had worked out into a labeling manual, and used it to get millions of images labeled as a training set. We then trained a model on that data which could predict quality for billions of photos, but its DNA came from the original hand-crafted rules we had designed.
Train on real data
In the Jetpac case, the images we used to train the model came from the same sources as the images the model would run on (mostly Facebook and Instagram), but a common problem I see is training datasets that differ in important ways from the data the model eventually encounters in production. For example, I often see models trained on ImageNet run into problems when people try to apply them to drones or robots. That’s because ImageNet is mostly made up of photos taken by people, and those photos have a lot in common: they’re shot with a phone or camera, with a neutral lens, at roughly head height, in daylight or artificial light, with the labeled object centered and in the foreground, and so on. Robots and drones, on the other hand, use video cameras, often with high-field-of-view lenses, shooting from ground level or high in the air, usually in poor lighting, and with objects awkwardly cropped because there is no intelligent framing. These differences mean that if you simply train your model on ImageNet and deploy it on one of these devices, you won’t get good accuracy.
There can also be more subtle differences between the training data and what the model eventually sees. Imagine you’re training a wildlife camera to recognize animals using datasets collected from around the world. If you only plan to deploy it in the jungles of Borneo, the chance of a penguin label ever being correct is vanishingly small. If the training data includes photos from Antarctica, the model will have a much greater chance of mistaking other animals for penguins, and its overall accuracy will be lower than if you had left that data out.
There are ways to calibrate results based on known priors (for example, drastically reducing the probability of penguins in a jungle environment), but it is much easier and more effective to use a training set that reflects what the product will actually encounter. The best approach I’ve found is to always use data captured directly from the real application, which ties in nicely with the “Wizard of Oz” approach above: the humans in the loop can also label the incoming data as part of their feedback. Even if the number of labels gathered this way is small, they reflect real usage and are usually enough for some initial transfer-learning experiments.
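To make the “known priors” idea concrete, one generic option (not something from the original post) is to reweight the model’s output probabilities by the ratio of deployment-time to training-time class frequencies; all the numbers below are invented for illustration:

```python
# Sketch: adjust softmax outputs when deployment class frequencies differ
# from training class frequencies. Numbers are illustrative only.
import numpy as np

probs = np.array([0.60, 0.30, 0.10])           # model output: [penguin, monkey, hornbill]
train_prior = np.array([0.20, 0.40, 0.40])     # class frequencies in the training set
deploy_prior = np.array([0.001, 0.600, 0.399]) # expected frequencies in a Borneo jungle

adjusted = probs * (deploy_prior / train_prior)
adjusted /= adjusted.sum()
print(adjusted)  # the penguin probability collapses once the prior is applied
```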
Confusion matrix
Returning to the speech commands example, one of the reports I look at most often during training is the confusion matrix. Here is an example as it appears in the console:
```
[[258   0   0   0   0   0   0   0   0   0   0   0]
 [  7   6  26  94   7  49   1  15  40   2   0  11]
 [ 10   1 107  80  13  22   0  13  10   1   0   4]
 [  1   3  16 163   6  48   0   5  10   1   0  17]
 [ 15   1  17 114  55  13   0   9  22   5   0   9]
 [  1   1   6  97   3  87   1  12  46   0   0  10]
 [  8   6  86  84  13  24   1   9   9   1   0   6]
 [  9   3  32 112   9  26   1  36  19   0   0   9]
 [  8   2  12  94   9  52   0   6  72   0   0   2]
 [ 16   1  39  74  29  42   0   6  37   9   0   3]
 [ 15   6  17  71  50  37   0   6  32   2   1  19]
 [ 11   1   6 151   5  42   0   8  16   0   0  20]]
```
This may look intimidating, but it’s really just a table showing the details of the network’s mistakes. Here’s a more readable, labeled version:
Each row in the table represents a set of samples that share the same true label, and each column shows the counts for each predicted label. For example, the highlighted row represents all the audio samples that are actually silence, and reading from left to right you can see that the predictions are correct, because they all fall in the “silence” column. This tells us the model is very good at recognizing genuine silence, with no misses. Looking at it column-wise instead, the first column shows how many clips were predicted to be silence, and we can see that quite a few clips that are actually words are mistaken for silence, so there are a lot of false positives. This turned out to be very useful to know, because it led me to look more closely at the clips being mistaken for silence, and many of them were unusually quiet recordings. That helped me improve the quality of the data by removing low-volume clips, which I wouldn’t have known to do without the clues from the confusion matrix.
Almost any summary of the results can be useful, but I find the confusion matrix a good compromise: it gives more information than a single accuracy number, without burying you in more detail than you can handle. Watching the numbers change during training is also useful, because it tells you which categories the model is struggling to learn, and gives you areas to concentrate on as you clean up and expand your dataset.
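Producing a matrix like the one above for your own model only takes a few lines; here is a minimal sketch using tf.math.confusion_matrix, with a placeholder label set and made-up predictions:

```python
# Sketch: build and print a confusion matrix from held-out predictions.
import tensorflow as tf

labels = ["silence", "unknown", "yes", "no"]   # placeholder label set
y_true = tf.constant([0, 0, 2, 3, 1, 2])       # ground-truth indices
y_pred = tf.constant([0, 1, 2, 3, 0, 3])       # model predictions

cm = tf.math.confusion_matrix(y_true, y_pred, num_classes=len(labels))
print(cm.numpy())
# Rows are the true labels, columns are the predictions, so reading across
# a row shows where clips of that class end up being classified.
```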
Visualize the model
Cluster visualization is one of my favorite ways of understanding how my network is interpreting the training data. TensorBoard provides good support for this kind of exploration, and although it is most often used for looking at word embeddings, I’ve found it works for almost any layer that behaves like an embedding. For example, the penultimate layer of an image classification network, just before the final fully connected or softmax layer, can be used as an embedding (this is how simple transfer-learning examples like TensorFlow for Poets work (codelabs.developers.google.com/codelabs/te…)). These aren’t embeddings in the strict sense, since nothing during training ensures they have the spatial properties you’d want from a true embedding, but clustering their vectors does produce interesting results.
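As a rough illustration of the workflow, the sketch below pulls penultimate-layer activations from a pre-trained network and writes them out for TensorBoard’s embedding projector. It assumes TensorFlow 2.x with the TensorBoard projector plugin, uses MobileNetV2 as a stand-in backbone, and feeds random arrays where your real images would go:

```python
# Sketch: treat pooled penultimate activations as an embedding and export
# them so TensorBoard's Projector tab can cluster them.
import numpy as np
import tensorflow as tf
from tensorboard.plugins import projector

base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                          weights="imagenet")

images = np.random.rand(512, 224, 224, 3).astype("float32")  # stand-in for your data
feats = base.predict(
    tf.keras.applications.mobilenet_v2.preprocess_input(images * 255.0))

# Save the vectors as a checkpointed variable that the projector can read.
log_dir = "logs/embeddings"
embedding_var = tf.Variable(feats, name="penultimate_features")
checkpoint = tf.train.Checkpoint(embedding=embedding_var)
checkpoint.save(f"{log_dir}/embedding.ckpt")

config = projector.ProjectorConfig()
emb = config.embeddings.add()
emb.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
projector.visualize_embeddings(log_dir, config)
# Then run: tensorboard --logdir logs/embeddings  and open the Projector tab.
```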
For example, a team I worked with previously was puzzled by the high error rate for certain animals in their image classification model. They used a cluster visualization to see how their training data was distributed across categories, and when they looked at “jaguar,” they could clearly see the data split into two distinct groups separated by some distance.
The diagram they saw is shown above. Once we looked at the images in each cluster, it became obvious: many vehicles made by Jaguar had been mislabeled as jaguars the animal. Once they knew about the problem, they were able to examine the labeling process and realized that the worker instructions and user interface had been confusing. With that information, they improved the labelers’ training, fixed the tool, and removed all the car images from the jaguar category, which made the model noticeably better on that class.
Clustering gives you the same kind of benefit as simply looking at your data, since it provides insight into what’s in your training set, but the network actually guides your exploration by grouping inputs according to its own learned understanding. As humans, we’re pretty good at spotting anomalies visually, so combining our intuition with a computer’s ability to process huge numbers of inputs offers a very scalable way of tracking down dataset quality problems. A full tutorial on doing this with TensorBoard is beyond the scope of this post, but if you’re serious about improving your results, I strongly recommend getting familiar with the tool.
Continuous data collection
I’ve never seen a case where collecting more data didn’t end up improving model accuracy, and it turns out there’s plenty of research to back up my experience.
The “Revisiting the Unreasonable Effectiveness of Data” post (ai.googleblog.com/2017/07/rev…) shows how image classification accuracy keeps increasing as the training set grows into the billions of samples. Facebook recently pushed this further, using billions of hashtag-labeled Instagram images to achieve a new state-of-the-art accuracy on ImageNet (www.theverge.com/2018/5/2/17…). This shows that increasing the size of the training set can improve results even on tasks that already have large, high-quality datasets.
This means that as long as any user would benefit from higher model accuracy, you need a strategy for continually improving your dataset. If you can, find creative ways to harness even weak signals to gather larger datasets; Facebook’s use of Instagram hashtags is a good example of this. Another approach is to make the labeling process smarter, for example by showing annotators the label predictions from an initial version of the model so they can make faster decisions. The risk is that this bakes in some bias early on, but in practice the benefits usually outweigh it. Simply hiring more people to label new training data is also often a worthwhile investment, though organizations that don’t traditionally budget for this kind of spending can struggle with it. If you’re a nonprofit, making it easy for your supporters to contribute data voluntarily through some kind of public tool can be a good way to grow your dataset without increasing costs.
Of course, the ideal for any organization is a product that naturally generates more labeled data as it’s used. I wouldn’t get too attached to this idea, though; it doesn’t fit many real-world scenarios, where people just want an answer as quickly as possible and don’t want the hassle of a complicated labeling process. It makes a great investment pitch for a startup, since it’s like a perpetual motion machine for model improvement, but there is almost always some per-unit cost in cleaning or augmenting the data that comes in, so the economics often end up looking more like a cheaper version of commercial crowdsourcing than something truly free.
Potential risks
Almost all model errors affect application users in ways the loss function can’t fully capture. You should think about the worst possible outcomes ahead of time, and try to engineer a backstop for the model that avoids them. This might be as simple as a blacklist of categories you never want the model to predict, because the cost of a false positive is too high, or a small set of algorithmic rules that ensure any actions taken stay within some bounds. For example, you might maintain a list of swear words you never want a text generator to output, even if they appear in the training data, because they would be completely inappropriate for your product.
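In code, that kind of backstop can be as simple as a post-processing filter over the model’s output; a hypothetical sketch (the blocklist contents and confidence threshold are invented):

```python
# Sketch: post-process predictions so blocked or low-confidence labels
# never reach the user. Values are illustrative assumptions.
BLOCKLIST = {"swear_word_1", "swear_word_2"}   # labels we never want to surface
MIN_CONFIDENCE = 0.6

def safe_prediction(label, confidence, fallback="unknown"):
    """Return the model's label only when it is allowed and confident enough."""
    if label in BLOCKLIST or confidence < MIN_CONFIDENCE:
        return fallback
    return label

print(safe_prediction("yes", 0.92))           # -> "yes"
print(safe_prediction("swear_word_1", 0.99))  # -> "unknown"
```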
It’s not always obvious in advance what the bad outcomes might be, so it’s important to learn from real-world mistakes. One of the simplest ways to do this is to build in error reporting as soon as you have a half-working product: make it easy for people using your application to report unsatisfactory results. Try to capture the full input to the model, but when the data is sensitive, just knowing what the bad output was can still guide your investigation. These reports can be used to decide which sources to collect more data from, and which classes to investigate for label quality. Once you’ve made changes to the model, run it separately over the previously problematic inputs in addition to the normal test set. Since a single metric can never capture everything people care about, this gallery of errors acts a bit like a regression test, and gives you a way to track how much you’re improving the user experience. By looking at a small set of examples that provoked strong reactions in the past, you get independent evidence that you really are providing a better service to your users. If the input data is too sensitive to capture, use dogfooding or internal experiments to work out which inputs produce these errors, and substitute those into your regression set instead.
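One lightweight way to treat that error gallery as a regression test is to re-run every previously reported failure after each retraining; the sketch below assumes a hypothetical CSV of inputs and expected labels and a hypothetical predict_one helper:

```python
# Sketch: re-run the model over previously reported failures and report
# how many are now handled correctly. Paths and the predict_one() call are
# placeholders for whatever your application actually uses.
import csv

def run_error_gallery(model, gallery_csv="error_gallery.csv"):
    fixed, total = 0, 0
    with open(gallery_csv) as f:
        for row in csv.DictReader(f):   # columns: input_path, expected_label
            total += 1
            predicted = model.predict_one(row["input_path"])  # hypothetical helper
            if predicted == row["expected_label"]:
                fixed += 1
    print(f"{fixed}/{total} previously bad inputs now handled correctly")
```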
In this post I’ve tried to convince you to spend more time on your data, and to give you some ideas about how to improve it. This area doesn’t get nearly enough attention, and I feel like I’m only scratching the surface with the advice here, so I’m grateful to everyone who has shared their strategies with me, and I hope to learn about effective approaches from many more people in the future. I think more and more organizations will dedicate teams of engineers to dataset improvement, rather than leaving it to machine learning researchers to drive progress. I look forward to seeing the whole field benefit from this work. I’m always amazed at how well models work even with deeply flawed training data, so I can’t wait to see what we can do as that data gets better!
Original post: petewarden.com/2018/05/28/…