Time spent on data and models in industry versus academia
Andrej Karpathy showed this slide in his Train AI talk, and it really hit home! It perfectly captures the difference between deep learning research and actual production. Academic papers focus almost entirely on new and improved models, while the datasets are typically drawn from a small set of public archives. By contrast, every real user I have spoken to was struggling to get deep learning into a real application and spent most of their time wrangling training data.
Why do researchers care so much about model architecture? There are good reasons, but one consequence is that very few resources exist to guide people deploying machine learning in production environments. To address this, I want to highlight the "unreasonable effectiveness of training data". In this article I'll expand on that idea, including practical tips on why data matters so much and how to improve it.
An important part of my daily work is collaborating closely with many researchers and product teams, and my belief in the power of data improvements comes from the enormous gains I've seen them achieve when they focus on that side of model building. In most applications, the biggest obstacle to using deep learning is reaching high enough accuracy in real-world scenarios, and improving the training set is the fastest route to better accuracy that I know of. Even if you are blocked on other constraints, such as latency or storage size, trading some of that extra accuracy away for a smaller architecture still leaves you ahead.
Space does not permit me to recount every example I have seen in production systems, but I can illustrate the same pattern with an open source example. Last year I created a simple speech recognition example for TensorFlow. As it turned out, there was no ready-made data set I could use for model training. Thanks to the generosity of many volunteers, and to the AIY team who helped me launch Open Speech Recording, I collected 60,000 one-second audio clips of people speaking short words. Using this data I built a model. Unfortunately, although the model worked, its accuracy fell short of what I had hoped for. To understand how much of that was down to my own limitations as a model designer, I started a Kaggle competition using the same data set. The competitors did much better than my simple model, and several teams, using many different approaches, pushed accuracy above 91%. To me, that suggested something was fundamentally wrong with the data itself, and indeed the competitors found numerous errors, such as incorrect labels and truncated audio. This gave me the impetus to fix the problems they had uncovered and release a new version of the data set with more samples.
I looked at the error metrics to see which problems the model hit most often, and found that the "other" category (where speech is recognized but the word is not in the model's limited vocabulary) came up most frequently. To address this, I increased the number of distinct words we captured, to provide more training data for that category.
Since Kaggle contestants reported labeling errors, I also added an extra label-verification pass, asking volunteers to listen to each audio clip and confirm that it matched its label. The Kaggle contestants also found files that were nearly silent or truncated, so I wrote a utility that performs simple audio analysis and automatically removes particularly poor samples. In the end, with the support of more volunteers and paid contributors, the data set grew to over 100,000 clips, even after many faulty files had been removed.
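Below is a minimal sketch of the kind of cleanup utility described above: it walks a directory of one-second WAV clips and flags files that look truncated or nearly silent. It is not the original tool; the directory name, sample rate, and thresholds are assumptions you would tune by listening to what gets flagged, and it assumes 16-bit mono PCM audio.

```python
import os
import wave

import numpy as np

SAMPLE_RATE = 16000              # expected rate for one-second clips (assumption)
MIN_SAMPLES = SAMPLE_RATE // 2   # clips shorter than half a second look truncated
SILENCE_RMS = 200                # RMS threshold on 16-bit samples; tune by ear


def check_clip(path):
    """Return a problem description for a WAV file, or None if it looks fine."""
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    # assumes 16-bit mono PCM; adapt the dtype if your recordings differ
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float32)
    if len(samples) < MIN_SAMPLES:
        return "truncated"
    rms = np.sqrt(np.mean(samples ** 2))
    if rms < SILENCE_RMS:
        return "near-silent"
    return None


def scan_dataset(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".wav"):
                path = os.path.join(dirpath, name)
                problem = check_clip(path)
                if problem:
                    print(f"{problem}: {path}")


if __name__ == "__main__":
    scan_dataset("speech_dataset")   # placeholder directory name
```

Flagged files can then be reviewed by hand or simply dropped from the training set.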
To help other users make use of the data set (and to learn from my mistakes), I wrote everything up in an arXiv paper along with updated accuracy results. The most important finding was that top-one accuracy improved by more than 4%, from 85.4% to 89.7%, without any change to the model or the test data. This is a dramatic improvement, and people reported much higher satisfaction when using the model in demo applications on Android or Raspberry Pi. I am convinced that I got more from the time spent improving data quality than I would have from an equivalent investment in model tweaking.
In production, I keep seeing people improve their results by repeating this kind of process. But if you want to follow a similar path, it can be hard to know where to start. You can take some inspiration from the techniques I used on the speech data, but to address the topic more explicitly, let's move on to the practical approaches.
This may sound obvious, but the first step should be to browse your training data in a fairly random way. Copy a portion of the files to your local machine and spend a few hours reviewing them. If you are working with images, use macOS's Finder to scroll through thumbnails and skim thousands of pictures quickly. For audio, you can use Finder's preview to play clips, or dump random snippets as text to the terminal. I did not spend enough time doing this with the speech data set, which is why the Kaggle contestants found so many problems with it.
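If it helps, here is a tiny sketch of how you might pull a random sample onto your machine for this kind of review, assuming your training data sits in a directory of labeled subfolders; the paths and sample size are placeholders.

```python
import random
import shutil
from pathlib import Path

SOURCE = Path("training_data")   # wherever the full dataset lives (assumption)
DEST = Path("review_sample")
SAMPLE_SIZE = 500

files = [p for p in SOURCE.rglob("*") if p.is_file()]
DEST.mkdir(exist_ok=True)
for path in random.sample(files, min(SAMPLE_SIZE, len(files))):
    # keep the label (parent folder name) in the copied filename so context isn't lost
    shutil.copy(path, DEST / f"{path.parent.name}__{path.name}")
```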
I always feel a bit silly going through this process, but I've never regretted it. Every time I finish, I discover something critically important about the data: an unbalanced number of examples across categories, corrupted data (such as PNG files saved with JPG extensions), mislabeled samples, or simply surprising combinations. Tom White made some remarkable discoveries while examining ImageNet in this way: the label "sunglass" actually refers to an archaic device for magnifying sunlight, "garbage truck" was applied to glamour shots, and "cloak" covered images of undead women. Andrej's work manually sorting images from ImageNet also taught me important lessons, including how hard it is, even for a person, to tell all the different breeds of dog apart.
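A couple of these checks are easy to script. The sketch below, which assumes an images/<label>/<file> layout, counts examples per class and flags files whose extension doesn't match their actual format (for example, PNG bytes hiding behind a .jpg extension); the directory name is a placeholder.

```python
from collections import Counter
from pathlib import Path

DATASET = Path("images")   # placeholder layout: images/<label>/<file>


def sniff_format(path):
    """Identify PNG or JPEG files by their magic bytes, ignoring the extension."""
    with path.open("rb") as f:
        header = f.read(8)
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header.startswith(b"\xff\xd8"):
        return "jpg"
    return "unknown"


counts = Counter()
for path in DATASET.rglob("*"):
    if not path.is_file():
        continue
    counts[path.parent.name] += 1
    actual = sniff_format(path)
    ext = path.suffix.lower().lstrip(".").replace("jpeg", "jpg")
    if actual != "unknown" and actual != ext:
        print(f"extension mismatch: {path} looks like {actual}")

print("examples per class:")
for label, count in counts.most_common():
    print(f"  {label}: {count}")
```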
The specific actions you take will depend on what you find, but this kind of inspection should always be the first step in cleaning up your data; the intuitive understanding of the data set you gain will help you make good judgments about what to do next.
Don't spend too much time choosing a model. If you are doing image classification, look at AutoML first; otherwise, look at TensorFlow's model repository or fast.ai's collection of examples, both of which provide models that can probably already solve a problem similar to yours. The most important thing is to start iterating as quickly as possible, getting your model in front of real users early and often. You can always improve the model later and get better results, but getting the data right comes first. Deep learning still obeys the basic computing principle of "garbage in, garbage out", so even the highest-quality model is limited by flaws in its training data. By picking a model and testing it, you will be able to see those flaws and start improving on them.
To speed up iteration, try starting from a model that has already been trained on an existing large data set and fine-tuning it with transfer learning using the (possibly much smaller) set of data you have collected. This usually gives better results than training directly on a small data set alone, and it is much faster, so you can quickly learn how to adjust your data collection strategy. Most importantly, you can feed what you learn back into the collection process and adapt it as you go, rather than treating collection as a one-off activity that happens before training.
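As a concrete illustration, here is a minimal transfer-learning sketch in Keras: an ImageNet-pretrained backbone is frozen and a small classification head is trained on your own images. The class count, image directory, and choice of MobileNetV2 are assumptions, not a prescription.

```python
import tensorflow as tf

NUM_CLASSES = 5   # however many labels your application defines (assumption)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "my_collected_data",            # placeholder: your own collected images
    image_size=(224, 224),
    batch_size=32,
)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False              # freeze the pretrained features

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```

Because only the small head is trained, each run is fast, which is exactly what you want while the data set is still changing underneath you.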
An essential difference between building models for research and for production is that research usually starts with a clearly defined problem statement, whereas real application requirements live in users' heads and only gradually become clear to developers over time. At Jetpac, for example, we wanted to show the best photos in our automated travel guides for each city, so we started by asking raters to label photos they thought were "good". We ended up with lots of smiling portraits, because that was what they took "good" to mean. We put these into a production model and tested how users reacted; they weren't impressed and the photos didn't resonate. To fix this, we changed the question to "Does this photo make you want to travel to the place it shows?". That made the content much better, but it turned out different groups interpreted the question very differently. Workers in Southeast Asia were notably drawn to conference-style photos of people in large hotels wearing formal suits and holding glasses of wine. It was an understandable misunderstanding, but our target audience in the US found those images depressing and had no desire to go. In the end, six members of the Jetpac team manually rated more than two million images ourselves, because we knew the criteria better than anyone else.
Yes, this is an extreme example, but it shows how much the labeling process depends on the specific needs of your application. For most production use cases, a large share of the effort goes into working out the right question for the model to answer, and that question directly determines whether the model can solve the problem. If the model is answering the wrong question, you will never be able to build a solid user experience on that poor foundation.
I have found that the only way to be sure you are asking the right question is to mock up your application without a machine learning model in the loop at all. This is sometimes called "Wizard-of-Oz-ing", because there is a human behind the curtain. Using Jetpac as an example again, we had people manually select photos for the travel guides (rather than training a model), and used feedback from user testing to adjust the criteria we used to pick images. Once we could reliably get positive feedback from those tests, the photo-selection rules we had developed were turned into a labeling script that could process millions of images for a training set. That material went on to train models capable of predicting the quality of billions of photos, but the DNA of the final model was set when we developed those original manual rules.
At Jetpac, the images we used to train the model came from the same sources as the images the model was applied to (mostly Facebook and Instagram), but a common problem I see is training data sets that differ in important ways from the inputs a model eventually sees in production. For example, I often see teams that train on ImageNet images only to discover that their model simply doesn't work on real-world data from drones or robots. This happens because ImageNet consists mostly of photos taken by people, and those photos share a lot of properties: they are shot with phones or cameras, use medium framing, are taken from roughly head height in daylight or artificial light, and place the labeled object in the center of the foreground. Robots and drones, by contrast, capture data with video cameras, often with wide-field-of-view lenses, from ground level or from the air, frequently in poor lighting, and without any intelligent framing of objects. These differences mean that simply training a model on ImageNet images and deploying it on a drone or robot will not deliver the accuracy you might expect.
Training data can also differ from real-world inputs in subtler ways. Suppose you are building cameras to identify wildlife and you train them on a data set of animals from around the world. If the cameras will only ever be deployed in the jungles of Borneo, the chance of one being pointed at a penguin is vanishingly small. So if the training data includes Antarctic photographs, the model may misidentify other animals as penguins far more often than it should. One way to reduce such errors is to remove those images from the data set in advance.
There are ways to calibrate your results against known priors (for example, drastically scaling down the probability of a penguin in a jungle environment), but it is far more convenient and effective to use a training set that reflects the conditions your product will actually encounter. I've found the best approach is always to use data captured directly from the real-world application, which fits nicely with the Wizard of Oz approach described above. The people in that loop can label the initial data set, and even if the number of labels is small, they will at least reflect reality and are usually enough for an initial transfer-learning experiment.
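For completeness, here is a rough sketch of the prior-calibration idea mentioned above: rescale the model's class probabilities by the ratio of the deployment prior to the training prior and renormalize. The class names and prior values are invented purely for illustration.

```python
import numpy as np


def recalibrate(probs, train_prior, deploy_prior):
    """Adjust softmax outputs for a shift in class frequencies between
    the training data and the deployment environment."""
    adjusted = probs * (deploy_prior / train_prior)
    return adjusted / adjusted.sum()


classes = ["orangutan", "hornbill", "penguin"]
train_prior = np.array([0.4, 0.3, 0.3])        # class frequencies in the training set
deploy_prior = np.array([0.55, 0.449, 0.001])  # penguins are vanishingly rare in a Bornean jungle

model_output = np.array([0.30, 0.25, 0.45])    # raw softmax scores for one image
print(dict(zip(classes, recalibrate(model_output, train_prior, deploy_prior))))
```

Even so, a training set that matches reality is usually the better fix, since recalibration cannot repair features the model never learned.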
Going back to the speech commands example, the report I look at most often is the confusion matrix generated during training. In the console it looks something like this:
This may look intimidating, but it's really just a table showing details about the mistakes the network makes. Here's a labeled version that's a bit easier to read:
Each row in the table represents a set of samples that share the same true label, and each column shows the counts for each predicted label. For example, the highlighted row covers all the audio samples that were actually silence, and reading from left to right you can see that they were all labeled correctly, since every one falls in the "silence" prediction column. This tells us that the model is very good at recognizing genuine silence, with no false negatives. If instead we read down the whole "silence" column to see how many samples were predicted as silence, we can see that quite a few samples that were actually words were mistaken for silence, giving a lot of false positives. That turned out to be very useful, because it prompted me to scrutinize the samples mistakenly reported as silence, and most of them turned out to be very low-volume recordings. Thanks to that cue from the confusion matrix, I was able to remove the lowest-volume samples and improve the data quality.
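If your training script doesn't already print one, a confusion matrix is easy to produce from any set of predictions. The sketch below uses scikit-learn and random stand-in label arrays just to show the shape of the output; in practice you would feed in the true and predicted labels from an evaluation run.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["silence", "unknown", "yes", "no", "up", "down"]

# stand-ins for the ground-truth and predicted label indices of an eval set
y_true = np.random.randint(0, len(labels), size=1000)
y_pred = np.random.randint(0, len(labels), size=1000)

matrix = confusion_matrix(y_true, y_pred, labels=range(len(labels)))

# rows = true labels, columns = predicted labels
print("".rjust(10) + "".join(l.rjust(10) for l in labels))
for name, row in zip(labels, matrix):
    print(name.rjust(10) + "".join(str(v).rjust(10) for v in row))
```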
Almost any analysis of your errors can be useful, but I have found the confusion matrix to be a good compromise: it gives more information than a single accuracy number without burying you in detail that is hard to analyze. It is also useful to watch the numbers change during training, to see which categories the model is struggling to learn and which parts of the data set deserve attention as you clean and expand it.
Visual clustering is one of my favorite ways to see how a network interprets the training data. TensorBoard provides good support for this kind of analysis, and while clustering is most often used to inspect word embeddings, I've found it works for almost any layer that behaves like an embedding. For example, image classification networks usually have a penultimate layer, just before the final fully connected or softmax unit, that can be used as an embedding (this is how simple transfer learning examples like TensorFlow for Poets work). These are not strictly embeddings, since nothing during training guarantees the desirable spatial properties you would expect from a true embedding space, but clustering their vectors does produce interesting results.
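Here is a rough sketch of how you might pull out those penultimate-layer vectors from your own Keras classifier and dump them as TSV files, which the standalone Embedding Projector can load for clustering. The model path, the layers[-2] assumption, and the load_sample_batch() helper are all placeholders for your own pipeline.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("my_classifier.keras")   # your trained model

# sub-model that stops at the layer just before the final Dense/softmax
# (assumes that layer is model.layers[-2]; adjust for your architecture)
penultimate = tf.keras.Model(inputs=model.input,
                             outputs=model.layers[-2].output)

# placeholder: load a few thousand training images and their label names
images, label_names = load_sample_batch()

vectors = penultimate.predict(images)

np.savetxt("vectors.tsv", vectors, delimiter="\t")
with open("metadata.tsv", "w") as f:
    f.write("\n".join(label_names))
```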
In one real case, a team I was working with was puzzled by the high error rate for certain animals in an image classification model. They used cluster visualization to see how the training data was distributed across its categories, and when they looked at "jaguar" they clearly saw the data split into two distinct groups with some distance between them.
Here's a snapshot of what they found. Once they looked at the photos in each cluster, the problem became obvious: many Jaguar cars had been mislabeled as jaguar cats. Knowing this, the team could examine the labeling process and recognized problems with the instructions and the interface given to the workers involved. Armed with that information, they improved the labeler training and fixed the tooling so that all the car images were removed from the jaguar category, and the model became significantly more accurate for that class.
Clustering gives you insight into what is in your training set, with benefits similar to simply looking at the data, but with the network effectively grouping the inputs according to its own learned understanding and guiding your exploration. As humans we are quite good at spotting anomalies visually, so combining our intuition with a computer's ability to process huge numbers of inputs is a scalable way to find quality problems in a data set. There's no room here for a full tutorial on using TensorBoard for this (this article is long enough already, thanks for sticking with it!), but if you really want to improve your metrics, I highly recommend getting familiar with the tool.
I have almost never seen gathering more data fail to improve a model's accuracy, and there is plenty of research to back up that experience.
This figure, from "Revisiting the Unreasonable Effectiveness of Data in the Deep Learning Era", shows that image classification accuracy keeps improving even as training set sizes grow into the hundreds of millions. Facebook recently went a step further, using billions of hashtag-labeled Instagram images for training and setting a new accuracy record on ImageNet. This shows that even for tasks with large, high-quality data sets, increasing the size of the training set still improves model accuracy.
This means that as long as higher model accuracy improves the user experience, you need a strategy for continually growing your data set. You can look for creative ways to use relatively weak signals to capture larger data sets; Facebook's use of Instagram hashtags is a good example. Another approach is to make the labeling process more efficient, for instance by showing labelers predicted labels from an earlier version of the model as suggestions. This carries a risk of baking in the model's preconceptions, but in practice the gain in efficiency usually outweighs that risk. Hiring more people to label more training data is also often a worthwhile investment, although some organizations never budget for it. For nonprofits, providing public tools that let supporters contribute data voluntarily can be a low-cost way to grow a data set.
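A sketch of that assisted-labeling idea follows: run the current model over unlabeled files and record its best guess and confidence next to each one, so labelers start from a suggestion rather than a blank field. The file layout, model path, class names, and preprocessing are assumptions about your own pipeline.

```python
import csv
from pathlib import Path

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("current_model.keras")   # placeholder path
class_names = ["cat", "dog", "other"]                        # placeholder label set

with open("labeling_queue.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "suggested_label", "confidence"])
    for path in sorted(Path("unlabeled").glob("*.jpg")):
        image = tf.keras.utils.load_img(path, target_size=(224, 224))
        batch = np.expand_dims(tf.keras.utils.img_to_array(image), 0)
        # note: apply whatever preprocessing your model was trained with here
        probs = model.predict(batch, verbose=0)[0]
        writer.writerow([path.name,
                         class_names[int(np.argmax(probs))],
                         f"{float(np.max(probs)):.2f}"])
```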
Of course, the ultimate dream for any organization is a product that naturally generates more labeled data as it is used. I wouldn't get too fixated on this idea, though; it is not a panacea, because in many situations people just want their problem solved as quickly as possible and have no interest in labeling data. It is a great investment highlight for a startup, since it acts like a perpetual motion machine for model improvement, but there is almost always some unit cost for cleaning or aggregating the data you collect, so a cheap commercial crowdsourcing option often ends up less expensive than a "completely free" one.

All the way to the perilous peak
There will always be model errors that hurt users more than the loss function suggests. You should think about the worst possible outcomes ahead of time and design a backstop into the model's behavior to avoid them. The backstop might be a blacklist of categories that are simply too costly to predict wrongly, or a simple set of algorithmic rules that ensures actions never exceed preset boundary parameters. For example, you might maintain a list of swear words that a text-generating application must never output, even if it picked them up from the training data, because such output would damage your product.
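A minimal sketch of that kind of backstop for a text-generating model is shown below: a hard filter that runs over the model's output before it reaches users. The word list and fallback string are placeholders for whatever your product actually requires.

```python
BLOCKED_WORDS = {"badword1", "badword2"}    # stand-ins for your curated list
FALLBACK_RESPONSE = "[response withheld]"


def safe_output(generated_text: str) -> str:
    """Refuse any model output containing a blocked word, regardless of model confidence."""
    tokens = {t.strip(".,!?").lower() for t in generated_text.split()}
    if tokens & BLOCKED_WORDS:
        return FALLBACK_RESPONSE
    return generated_text
```

The point is that the rule sits outside the model, so no amount of bad training data can override it.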
It is usually hard to predict every negative outcome in advance, so it's important to learn from mistakes in the real world. One of the simplest ways is to build error reporting in as soon as the product reaches the market: make it easy for people to tell you when the model's output disappoints them. Capture the full input to the model if you can, but if the data is sensitive, simply knowing what the bad output was still helps guide your investigation. These reports can be used to decide where to collect more data and which classes to examine for label quality. Once you make a new change to the model, run it not only on the normal test set but also, separately, on the set of inputs that previously produced bad outputs, and analyze that second set on its own. This is a bit like regression testing, and it gives you a way to gauge how much the user experience is actually improving, since a single accuracy metric never covers everything people care about. By checking the small number of cases that provoked strong reactions in the past, you get independent evidence that you really are making things better. If some of the input data is too sensitive to collect for analysis, use dogfooding or internal experiments to determine which data you already have on hand reproduces the bad results, and substitute that for user input in the regression set.
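Here is a rough sketch of what such a regression suite could look like, assuming you keep the previously reported inputs in a JSON file; the file name, record format, and predict_fn hook are assumptions about your own pipeline.

```python
import json


def run_regression_suite(model, predict_fn, cases_path="bad_experience_cases.json"):
    """Re-run the model on inputs that previously produced bad outputs and report progress."""
    with open(cases_path) as f:
        cases = json.load(f)   # e.g. [{"input": ..., "expected": ...}, ...]
    failures = []
    for case in cases:
        prediction = predict_fn(model, case["input"])
        if prediction != case["expected"]:
            failures.append((case["input"], prediction, case["expected"]))
    print(f"{len(cases) - len(failures)}/{len(cases)} previously-bad cases now handled")
    return failures
```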
I hope this article has convinced you to spend more time looking at your data, and I've tried to offer some advice on how to improve it. This area doesn't get nearly enough attention, and I've struggled to find much help or advice on it myself, so I'm grateful to all the practitioners who have shared their methods with me and hope to hear from many more of you about the approaches that have worked. I think more and more organizations will have teams of engineers dedicated to improving data sets, rather than leaving progress to ML researchers tweaking architectures, and I'd like to see the whole field develop in that direction. I'm constantly amazed at how well models can perform even when trained on deeply flawed data sets, so I can't wait to see what they can do once our data gets better!
Original link:
Why you need to improve your training data, and how to do it