This article is part of the notes for Andrew Ng's deep learning course [1].
Author: Huang Haiguang [2]
Main authors: Huang Haiguang, Lin Xingmu (all of Course 4; Course 5 Weeks 1 and 2), Zhu Yansen (all of Course 3 except the first three weeks), He Zhiyao (Course 5 Week 3), Wang Xiang, Hu Han, Xiaoxiao, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, Cao Yue, Lu Haoxiang, Qiu Muchen, Tang Tianze, Zhang Hao, Victor Chan, You Ren, Ze Lin, Shen Weichen, Gu Hongshun, Shi Chao, Annie, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian
Editorial staff: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jianyong, Wang Xiang, Xie Shichen, Jiang Peng
Note: The notes, assignments (including data and original assignment files), and videos can all be downloaded on GitHub [3].
I publish the course notes on the official account "Machine Learning Beginners"; please follow it.
Week 2: Machine Learning Strategy (ML Strategy (2))
2.1 Carrying out error analysis
Hello, and welcome back. If you want your learning algorithm to do a task that humans can do, and it hasn't yet reached human-level performance, then manually examining the mistakes your algorithm is making can give you insight into what to do next. This process is called error analysis. Let's start with an example.
Let's say you're debugging a cat classifier, and you've achieved 90% accuracy, or equivalently 10% error, on your development set, which is a long way from where you want to be. Maybe one of your team members looks at some examples the algorithm got wrong and notices it has classified some dogs as cats; when you look at those dog pictures, they do look a little like cats, at least at first glance. So your teammate suggests optimizing the algorithm for dog pictures. Imagine you could collect more pictures of dogs, or design algorithm components that deal specifically with dogs, or something like that, so your cat classifier does better on dog pictures and stops classifying dogs as cats. The question is, should you start a project to work specifically on dogs? The project could take months to make the algorithm err less on dogs; is it worth the effort? Rather than spending months on it only to find it was useless, here's an error analysis procedure that can tell you right away whether this direction is worth the effort.
Here's what I suggest you do. First, collect, say, 100 mislabeled dev set samples, then go through them manually, one at a time, and count how many of them are dogs. Suppose it turns out that only 5% of your 100 mislabeled samples are dogs; that is, out of 100 mislabeled dev set samples, 5 are dogs. That means that even if you completely solve the dog problem, you can only fix 5 of those 100 errors. To put it another way, if only 5% of your errors are dog pictures, then the best you can hope for, after spending a lot of time on dogs, is that your error rate drops from 10% to 9.5%: a 5% relative decrease in errors (0.5% in absolute terms, from a 10% error rate down to 9.5%). You can then judge whether this is a good use of your time; maybe it is, but at least the analysis gives you an upper bound on the benefit of continuing to work on the dog problem. In machine learning we sometimes call this a performance ceiling: it tells you the best you could possibly do, i.e., how much completely solving the dog problem could help you.
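As a quick sketch of this arithmetic (the variable names are mine, not from the course):

```python
# Upper bound ("performance ceiling") from error analysis:
dev_error = 0.10      # current dev set error rate
dog_fraction = 0.05   # fraction of examined errors that are dog pictures

ceiling = dev_error * (1 - dog_fraction)
print(f"Best achievable error if dogs are fully solved: {ceiling:.1%}")  # 9.5%
```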
But now suppose something else happens: you look at your 100 mislabeled dev set samples and find that 50 of the images are actually dogs. Now it would probably be well worth spending time on the dog problem. In this case, if you actually solve it, your error rate might drop from 10% to 5%. You might then decide that a direction that can halve your error rate is worth a try, and focus on reducing the number of misclassified dogs.
I know that in machine learning we sometimes look down on hand-engineering, or on using too much human judgment. But if you're building an application system, this simple manual counting procedure, error analysis, can save you a lot of time and quickly tell you which directions are most important, or most promising. In fact, if you look at 100 mislabeled dev set samples, it might take five to ten minutes to go through them yourself and count how many are dogs. Depending on the result, whether it's 5%, 50%, or something else, you'll have an estimate within five to ten minutes of how valuable that direction is, and that will help you make a much better decision about whether to spend the next few months solving the problem of misclassified dogs.
In this slide we described how error analysis can be used to evaluate a single idea: whether the dog problem in this example is worth solving. Sometimes you can also use error analysis to evaluate several ideas in parallel. For example, suppose you have several ideas for improving your cat detector. Maybe you can improve its performance on pictures of dogs. Or you notice that big cats such as lions, leopards, and cheetahs are often classified as kittens or house cats, so you could look for a way to fix that error. Or you find that some images are blurry, and you could design a system that handles blurry images better. Maybe you have some ideas for tackling each of these problems; to evaluate these three ideas, you can do error analysis.
What I'd do is create a table like this; I usually use a spreadsheet, but a plain text file is fine. On the far left, you manually go through the set of images you want to analyze, so the rows might run from 1 to 100 if you look at 100 images. Each column of the spreadsheet is one idea you want to evaluate: the dog problem, the big cat problem, the blurry image problem; I usually also leave space in the spreadsheet for comments. So remember, in error analysis you're just looking at dev set samples your algorithm got wrong. If you find that the first image it got wrong is a dog, I put a check in that column, and to help myself remember the images I'll sometimes add a comment, say, "pit bull". If the second photo is blurry, note that too. If the third is a lion at the zoo on a rainy day that was recognized as a cat, that's a big cat and the image is blurry, and the comment column says "raining at the zoo, the rain blurred the image", or something like that. Finally, after going through the whole set of images, I count up the percentage of each error type: how many were dogs, big cats, or blurry. Maybe 8% of the images you examined are dogs, maybe 43% are big cats, and 61% are blurry. You just scan down each column and count what fraction of the images have a check mark in it.
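To make the tallying step concrete, here's a minimal sketch in code; the category names and sample entries are illustrative assumptions, not course material:

```python
from collections import Counter

# Each entry: the set of error categories a human reviewer ticked for one
# mislabeled dev set image (a sample can belong to several categories).
reviewed_errors = [
    {"dog"},
    {"blurry"},
    {"big cat", "blurry"},   # e.g., a lion at the zoo on a rainy day
    # ... one entry per examined sample, typically ~100 in total
]

counts = Counter(tag for tags in reviewed_errors for tag in tags)
n = len(reviewed_errors)
for category, count in counts.most_common():
    print(f"{category}: {count / n:.0%} of examined errors")
```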
Partway through this process, you might notice other error types. For example, you might find that Instagram filters, those fancy image filters, are messing up your classifier. In that case you can add a new column for Instagram and Snapchat filters as you go, go through the samples again and count that problem too, and work out the percentage for this new error type. The results of this analysis will give you an estimate of whether each different error type is worth dealing with.
For example, in this case, a lot of the errors came from blurry images, and a lot came from big cat images. The outcome of this analysis is not that you must work on blurry images; it doesn't give you a rigid mathematical formula telling you what to do, but it gives you a sense of which directions to choose. It also tells you that, for example, no matter how well you handle dog pictures or Instagram pictures, you can address at most 8% or 12% of the errors in those categories. Whereas in the big cat category, or on blurry images, the potential for improvement is much larger; in those categories the ceiling on performance improvement is much higher. So it depends on how many ideas you have for improving performance on, say, big cat images or blurry images. Maybe you pick two of them; or, if you have enough people on your team, maybe you split into two groups, one trying to improve recognition of big cats and the other trying to improve recognition of blurry images. But this quick counting step, which you can do often and which takes at most a few hours, can really help you pick out high-priority tasks and see how much each approach could improve performance.
So to summarize: to carry out error analysis, you should take a set of mislabeled examples, perhaps from your dev set or test set, look at the samples your algorithm got wrong, and count up the number of errors that fall into each of several categories. In the process, you may be inspired to come up with new error categories, as we saw: if, going through the samples, you say, gee, there are so many Instagram or Snapchat filters interfering with my classifier, you can create a new error category along the way. In any case, counting what fraction of the errors falls into each category can help you decide which problems to tackle first, or give you inspiration for new directions to optimize. While doing error analysis, though, you might notice that some samples in your dev set are incorrectly labeled. What should you do about that? We'll talk about it in the next video.
2.2 Cleaning up incorrectly labeled data
The data for a supervised learning problem consists of inputs x and output labels y. What if, looking through your data, you find that some of the output labels y are wrong, so some of your data is incorrectly labeled? Is it worth the time to fix those labels?
Let's look at the cat classification problem: y = 1 if the picture is a cat, y = 0 if not. Suppose you're looking through some data samples and notice that this one is actually not a cat, so its label is wrong. I'll use the term "mislabeled examples" for cases where your learning algorithm outputs the wrong value of y. But when I say "incorrectly labeled examples", I'm referring to your data set: the label a human attached to this piece of data, in the training set or the test set, is actually wrong. This is actually a dog, so the label should really have been y = 0; maybe the labeler was careless. So what should you do if you find some incorrectly labeled examples in your data?
First, consider the training set. It turns out that deep learning algorithms are quite robust to random errors in the training set. As long as the incorrectly labeled examples are not too far from random, say, the labeler occasionally wasn't paying attention or accidentally hit the wrong key, then it's probably fine to leave those errors as they are and not spend too much time fixing them.
Of course it doesn't hurt to go through the training set, check the labels, and fix them. Sometimes fixing these errors is worthwhile, and sometimes it's fine to leave them, as long as the total data set is big enough that the actual error rate isn't too high. I've seen plenty of machine learning algorithms trained with known label errors in the training set that still turned out fine.
I should warn you that deep learning algorithms are robust to random errors, but much less robust to systematic errors. So if, for example, your labelers consistently label white dogs as cats, that's a problem, because your classifier will learn to classify all white dogs as cats. But random errors, or near-random errors, are not much of a problem for most deep learning algorithms.
Now, the discussion so far has focused on label errors in the training set. What about label errors in the dev set and test set? If you're worried about the impact of incorrectly labeled examples on your dev set or test set, I generally recommend adding an extra column during error analysis so you can also count the number of examples where the label y was wrong. For example, as you count up the impact over 100 mislabeled examples, you'll look through 100 samples on which your classifier's output disagrees with the dev set label, and for a few of those samples the disagreement is because the label is wrong, not because your classifier is wrong. So maybe for one example you notice that the labeler missed a cat in the background, so you put a check there to indicate that the label for example 98 was wrong. Maybe another image is actually a drawing of a cat rather than a real cat, and you'd want the labeler to have marked it y = 0 rather than y = 1, so you put a check mark there too. And once you've worked out the percentages of the other error types, as we saw in the previous video, you can also work out the percentage of errors due to incorrect labels, where the value of y in your dev set was wrong, which explains why your learning algorithm's prediction disagreed with the label in your data set.
So the question now is: is it worth fixing these 6% of incorrectly labeled examples? My advice is: if the label errors seriously affect your ability to evaluate algorithms on your dev set, then yes, take the time to fix them. But if they don't significantly affect the dev set's ability to estimate your algorithms' errors, then fixing them is probably not the best use of your valuable time.
Let me walk through an example of what I mean. I suggest you look at three numbers to decide whether it's worth manually correcting the label errors. First, look at the overall dev set error rate: in the earlier example we said our system had reached 90% overall accuracy, so 10% error. Second, look at the number or percentage of errors caused by incorrect labels: in this case 6% of the errors come from incorrect labels, and 6% of 10% is 0.6%. Third, look at the errors due to all other causes: if you have 10% error on your dev set and 0.6% of it is due to incorrect labels, then the remaining 9.4% is due to other causes, such as mistaking dogs for cats, or big cat images. So in this case I'd say there's a 9.4% error rate that you should focus on fixing, whereas the errors caused by incorrect labels are only a small fraction of the total. Fix the incorrect labels manually if you must, but it's probably not the most important task right now.
Now let's look at another example. Suppose you've made a lot of progress on the learning problem, so that the error rate is no longer 10%; say you've driven it down to 2%, but 0.6% of the overall error is still due to incorrect labels. Now, if you examine a set of mislabeled dev set images, 2% of the dev set data is misclassified, and a large portion of that, 0.6% divided by 2%, means 30% of your dev set errors, rather than 6%, are due to incorrect labels. So a large share of the measured errors is now caused by label errors, and the errors due to other causes are only 1.4%. When such a large fraction of your measured error comes from dev set label errors, it seems much more worthwhile to fix the incorrect labels in the dev set.
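Here's a small sketch of the two scenarios above, using the numbers from the text:

```python
def mislabel_share(total_dev_error, label_error):
    """Fraction of measured dev set error attributable to incorrect labels."""
    return label_error / total_dev_error

print(mislabel_share(0.10, 0.006))  # 0.06 -> 6% of errors: low priority
print(mislabel_share(0.02, 0.006))  # 0.30 -> 30% of errors: worth fixing
```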
The main purpose of the dev set, remember, is to help you choose between two classifiers A and B. So when you test classifiers A and B on the dev set, and one has a 2.1% error rate and the other 1.9%, you can no longer trust the dev set, because it can't tell you whether this classifier is really better than that one: the 0.6% error due to incorrect labels is larger than the gap between them. Now you have a good reason to fix the incorrect labels in your dev set, because in the example on the right, the label errors seriously affect your overall ability to evaluate algorithms. In the example on the left, the fraction of your measured error attributable to label errors is still relatively small.
If you do decide to go into the dev set, manually recheck the labels, and try to fix some of them, there are a few additional guidelines and principles worth considering. First, I encourage you to apply whatever correction process you use to both your dev set and your test set, because, as we discussed earlier, the dev and test sets must come from the same distribution. The dev set defines the target, and when you hit the target you want the algorithm to generalize to the test set; your team iterates more efficiently when the dev and test sets come from the same distribution. So if you're going to fix some labels in the dev set, do the same for the test set to make sure they continue to come from the same distribution. If you hire someone to check the labels, have them go through both the dev set and the test set.
Second, I strongly recommend you consider examining both the examples your algorithm got right and the ones it got wrong. It's easy to check only the wrong ones and fix those, but there may also be examples the algorithm got right whose labels need fixing. If you correct only the examples it got wrong, you end up with a biased estimate of your algorithm's error, and you give the algorithm a slightly unfair advantage. So double-check the examples it got wrong, but also the ones it got right, because the algorithm may have gotten some of them right by luck, and fixing those labels could flip them from right to wrong. This second part is not so easy to do, so it often isn't done. The reason is that if your classifier is fairly accurate, it gets far fewer examples wrong than right: with 2% error and 98% accuracy, checking the labels on 2% of the data is quick, while checking the labels on the other 98% takes much longer. So this usually isn't done, but it's something to keep in mind.
Finally, if you go into your dev and test sets to correct some of the labels, you may or may not also do the same for your training set. Remember from another video that correcting labels in the training set is relatively less important. You may well decide to correct labels only in the dev and test sets, since they're usually much smaller than the training set, and you might not want to put in all the extra effort needed to fix labels in the much larger training set; that's actually fine. Later this week we'll discuss some techniques for handling the case where your training data comes from a slightly different distribution than your dev and test data; learning algorithms are actually fairly robust to that. What matters most is that your dev and test sets come from the same distribution. A training set from a slightly different distribution is usually a reasonable thing to accept, and I'll talk about how to handle it later this week.
Let me conclude with a few suggestions:
First, deep learning researchers sometimes like to say, "I just feed the data to the algorithm, I trained it, and it worked." There's a lot of truth to that: more often than not we feed data to an algorithm and train it, with little human intervention and few human insights. But I think that when building practical systems, it usually takes more manual error analysis and more human insight than deep learning researchers like to admit.
Second, and I'm not sure why, I see some engineers and researchers who are reluctant to go and look at the examples themselves. Maybe it's boring to sit down and look at 100 or a few hundred samples and count the errors, but I do this a lot. When I lead a machine learning team, I want to know what mistakes it's making, so I'll look at the data myself and try to figure some of them out. Precisely because it takes only a few minutes, or a few hours, to count these statistics in person, it can really help you find the tasks that need to be prioritized. I find the time spent personally checking data very worthwhile, so I strongly recommend you do it if you're building a machine learning system and trying to decide which ideas or directions to try first.
That’s the error analysis process, and in the next video, I want to share how error analysis works in starting new machine learning projects.
2.3 Build your first system quickly, then iterate
If you’re developing a new machine learning application, the advice I usually give you is that you should prototype your first system as soon as possible and then iterate quickly.
Let me tell you what I mean. I've worked in speech recognition for years, and if you're thinking about building a new speech recognition system, there are actually a lot of directions you could go in and a lot of things you could prioritize.
For example, there are techniques that make speech recognition systems more robust to noisy backgrounds: the noise might be coffee-shop noise with many people talking in the background, or traffic noise from cars on a highway, or other kinds of noise. There are ways to make speech recognition more robust when dealing with accents. There are specific problems associated with the distance between the microphone and the speaker, known as far-field speech recognition. Children's speech presents special challenges, both in how the words sound and in the vocabulary children tend to use. Or maybe the speaker stutters, or says a lot of meaningless filler like "oh" and "ah"; there are many techniques you could choose from to make the transcript more readable. So there are lots of things you could do to improve a speech recognition system.
In general, for almost any machine learning application there may be 50 different directions you could go in, and each of them is a fairly reasonable way to improve your system. The challenge is choosing which one to focus on. Even though I've worked in speech recognition for years, if I were building a new system for a new application domain I'd still find it hard to pick a direction without taking time to think about it. So my advice, if you're starting a new machine learning application, is to build your first system quickly and then iterate. What I mean is this: I suggest you quickly set up a dev set, a test set, and a metric; that determines where you're aiming, and if you get the target wrong you can always move it later, but do set a target. Then I recommend you immediately build a prototype machine learning system, train it on your training set, and see how it does: start to understand how your algorithm performs on the dev set, on the test set, against your metric. Once you have that first system, you can immediately apply the bias/variance analysis we discussed earlier, and the error analysis from the last couple of videos, to decide what to prioritize next. In particular, if error analysis shows that most of your errors come from the speaker being far away from the microphone, which poses special challenges for speech recognition, then you'll have a good reason to focus on far-field speech recognition techniques, which essentially handle the case where the speaker is far from the microphone.
The whole point of building this initial system is that it can be a quick and dirty implementation; you know, don't overthink it. The value of the initial system is that, once you have a trained system, you can locate your bias and variance, decide what to prioritize next, and do error analysis: look at the mistakes and identify which of all the possible directions is actually the most promising.
So just to recap, I recommend you build your first system quickly, then iterate. This advice applies less if you already have a lot of experience in the application area. It also applies less when there's a large body of academic literature on essentially the same problem you're solving. For example, face recognition has an enormous academic literature; if you're building a face recognition system, you can build a more complex system from the start on the basis of that large existing literature. But if you're tackling a problem that's new to you, I really encourage you not to overthink it or make your first system too complicated. Build a quick and rough implementation, then use it to find the priorities for improving your system. I've seen a lot of machine learning projects, and some teams overthink their solution and build systems that are too complex; I've also seen some teams think too little and build systems that are too simple. On average, I see far more teams overthinking and building systems that are too complex.
So I hope these strategies help. If you're applying machine learning to a new application, and your main goal is to build a working system rather than to invent a new machine learning algorithm (which is a completely different goal), then I encourage you to build a quick and rough implementation first, use it for bias/variance analysis and error analysis, and use the results of those analyses to determine the next priority.
2.4 Training and testing on data from different distributions
Deep learning algorithms have a huge appetite for training data: they work best when you can collect enough labeled data to put into the training set. As a result, many teams take whatever data they can get and stack it into the training set to make it bigger, even if some, or even most, of that data comes from a different distribution than the dev and test sets. In the deep learning era, more and more teams train on data drawn from a different distribution than their dev and test sets. There are some subtleties here, and some best practices for dealing with differences between your training and dev/test distributions.
Let's say you're developing a mobile app where users upload photos taken with their phone, and you want to identify whether the pictures uploaded from the app are cats. Now you have two sources of data. One is the distribution you actually care about: images uploaded from the app, like the ones on the right; these photos tend to be more amateur, not framed very well, and some are even blurry, because they were taken by amateur users. The other source is the web: you can crawl web pages and directly download lots of professionally framed, high-resolution, professionally shot cat pictures. If your app doesn't have many users yet, maybe you've only got 10,000 user-uploaded photos, but by crawling the web you can download huge numbers of cat pictures, maybe over 200,000. What you really care about is how well your final system does on the distribution of images from the app, because in the end your users will upload images like the ones on the right, and your classifier must do well on that task. But now you're stuck: you have a relatively small data set, only 10,000 samples, from that distribution, and a much larger data set from another distribution, whose pictures don't look like what you actually need to handle. You don't want to use only the 10,000 images, because that makes your training set too small, and using the 200,000 images seems like it would help; but the dilemma is that they don't come from exactly the distribution you want. So what can you do?
One option here, one thing you can do is merge the two sets of data together, so you have 210,000 photos, and you can randomly assign those 210,000 photos to the training, development, and test set. To illustrate the point, let’s assume that you have determined that the development set and test set each contain 2,500 samples, so your training set has 205,000 samples. Setting up your data set this way now has some advantages and disadvantages. The advantage is that your training set, development set, and test set all come from the same distribution, which makes it easier to manage. But the downside, and it’s not a small downside, is that if you look at the development set, and you look at these 2,500 samples and a lot of the images are from web downloads, that’s not really the data distribution that you care about, you’re really dealing with images from the phone.
It turns out that because 200,000 of your 210,000 images were downloaded from the web, in expectation about 2,381 of the 2,500 dev set samples will be web images. That's the expected value; the exact number varies with the random assignment. But on average, only 119 of the dev images come from mobile uploads. Remember, the purpose of a dev set is to tell your team where to aim, and with this way of aiming, most of your energy goes into optimizing for web-downloaded images, which is not what you really want. So I really don't recommend the first option, because setting up the dev set this way tells your team to optimize for a different data distribution than the one you actually care about.
Instead, I suggest you do the following. The training set is again, say, 205,000 images: the 200,000 images downloaded from the web plus, if you like, 5,000 of the images uploaded from phones. Then for the dev and test sets, the dev and test sets are made up entirely of mobile app images. So the training set contains 200,000 web images and 5,000 app images, the dev set is 2,500 app images, and the test set is 2,500 app images. The advantage of splitting the data this way is that you're now aiming at the target you actually want to hit: you're telling your team that the dev set contains data uploaded from phones, which is the image distribution you really care about, so let's try to build a learning system that does well on that distribution. The disadvantage, of course, is that your training set is now distributed differently from your dev and test sets. But it turns out that splitting your data into train, dev, and test this way gives you better system performance in the long run. Later we'll discuss some special techniques for dealing with the training set's distribution differing from the dev and test distributions.
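As a minimal sketch of this second option (the file names and placeholder lists stand in for the real data; this is not course code):

```python
import random

web_images = [f"web_{i}.jpg" for i in range(200_000)]       # crawled photos
mobile_images = [f"mobile_{i}.jpg" for i in range(10_000)]  # app uploads
random.shuffle(mobile_images)

# Dev and test sets come ONLY from the distribution you care about.
dev_set = mobile_images[:2_500]
test_set = mobile_images[2_500:5_000]
# The remaining mobile images join the web images in the training set.
train_set = web_images + mobile_images[5_000:]              # 205,000 images
```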
Let's look at another example. Suppose you're developing a brand-new product, a voice-activated car rearview mirror; this is a real product in China, and it's making its way to other countries. You build a rearview mirror that replaces the ordinary one, and now you can talk to it and say, "Dear rearview mirror, please find me directions to the nearest gas station," and it will handle the request. So this is an actual product. Suppose you want to build it for your own country: how do you collect data to train the speech recognition module for this product?
Well, maybe you've worked on speech recognition for a long time, so you have a lot of data from other speech recognition applications that didn't come from a voice-activated rearview mirror. Here's how I'd allocate the training, dev, and test sets. For your training set, you can use all the speech data you have, collected from other speech recognition problems, for example data you've bought over the years from various speech recognition data vendors; today you can actually buy data in the form of (x, y) pairs of audio clips and transcripts. Or maybe you've worked on smart speakers or voice-activated keyboards and have data from those projects.
For example, maybe you've collected 500,000 recordings from those sources. For your dev and test sets, the data set might be much smaller and consist of data actually from the voice-activated rearview mirror. Because users are asking for navigation information or directions to various places, this data set contains lots of street addresses, right? "Please help me navigate to this street address," or "Please help me navigate to this gas station." So the distribution of this data is very different from the distribution on the left. But this really is the data you care about, because it's the data your product must work well on, so it's what you should set as your dev and test sets.
In this example, you'd set up your training set as the 500,000 utterances on the left, and your dev and test sets, which I'll abbreviate D and T, might each contain 10,000 utterances collected from the actual voice-activated rearview mirror. Alternatively, if you don't feel you need to put all 20,000 rearview mirror recordings into the dev and test sets, you could take half of them and put them into the training set. Then the training set would be 510,000 utterances, the 500,000 from elsewhere plus 10,000 from the rearview mirror, and the dev and test sets would each have 5,000 utterances. So of the 20,000 rearview mirror utterances, 10,000 go into the training set, 5,000 into the dev set, and 5,000 into the test set. This is another reasonable way to divide your data into train, dev, and test, and it gives you a much larger training set, over half a million utterances, than you'd have using only the voice-activated rearview mirror data.
So in this video you've seen a couple of examples of drawing your training set from a different distribution than your dev and test sets, which lets you have much more training data. In these examples, this will improve your learning algorithm.
Now, you might ask, should we use all the data we’ve collected? The answer is subtle, it’s not always yes, but let’s look at a counter example in the next video.
2.5 Analysis of Bias and Variance with mismatched data distributions
Estimating the bias and variance of a learning algorithm can really help you determine which direction to prioritize next. However, when your training set comes from a different distribution than your development set or test set, the way to analyze bias and variance may be different. Let’s see why.
Let's keep using the cat classifier example, and let's say humans are nearly perfect at this task, so the Bayes error rate, or Bayes optimal error rate, is nearly 0% in this case. To do error rate analysis, you'd usually look at the training error and the dev set error. Say, in this example, your training error is 1% and your dev error is 10%. If your dev set came from the same distribution as your training set, you'd say you have a large variance problem: your algorithm isn't generalizing well from the training set; it handles the training set well, but suddenly does much worse on the dev set.
But if your training data and dev data come from different distributions, you can no longer safely draw that conclusion. In particular, maybe the algorithm is actually doing just fine on the dev set; it's just that the training set was easy to recognize, because it consisted of high-resolution, very clear images, while the dev set is much harder. So maybe there's no variance problem at all, and the gap simply reflects that the dev set contains images that are harder to classify accurately. The problem with this analysis is that when you go from the training error to the dev error, two things change at once. First, the algorithm has seen the training data but not the dev data. Second, the dev data comes from a different distribution. Because two things changed at the same time, it's hard to know how much of the extra 9% of error is due to the algorithm never having seen the dev data, which is the variance part of the problem, and how much is due to the dev data simply being different.
To figure out which factor matters more, and don't worry if the two effects aren't clear yet, we'll go over them again in a minute, it makes sense to define a new set of data, which we call the training-dev set: a new subset of data that we carve out of the same distribution as the training set, but that you don't use to train your network.
I mean this: we've already set up training, dev, and test sets, where the dev and test sets come from the same distribution, but the training set comes from a different distribution.
What we do is randomly shuffle the training set and carve out a piece of it as the training-dev set. Just as the dev and test sets come from the same distribution as each other, the training set and the training-dev set also come from the same distribution as each other.
The difference is that now you train your neural network on the training set only; you don't run backpropagation on the training-dev set. For error analysis, what you then do is look at your classifier's error on the training set, on the training-dev set, and on the dev set.
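A minimal sketch of this setup, assuming hypothetical train_model and error_rate helpers (none of this is course code):

```python
import random

random.shuffle(train_set)            # train_set built as in section 2.4
train_dev_set = train_set[:5_000]    # held out, same distribution as training
train_set = train_set[5_000:]        # the network trains only on this part

model = train_model(train_set)       # hypothetical training routine
print("training error: ", error_rate(model, train_set))
print("train-dev error:", error_rate(model, train_dev_set))
print("dev error:      ", error_rate(model, dev_set))
```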
Say in this example the training error is 1%, the training-dev error is 9%, and the dev error is 10%, as before. You can conclude that when you go from the training data to the training-dev data, the error rate really goes up a lot. The only difference between the training data and the training-dev data is that the network trained directly on the first but not on the second. This tells you that your algorithm has a variance problem, because the training-dev error is measured on data from the same distribution as the training set. So even though your neural network does well on the training set, it isn't generalizing well to the training-dev data from the same distribution; it doesn't generalize well to data from the same distribution that it hasn't seen before. In this example, we really do have a variance problem.
Now let's look at a different example: say the training error is 1% and the training-dev error is 1.5%, but when you go to the dev set the error jumps to 10%. Now the variance problem is quite small, because when you go from the training data, which the network has seen, to the training-dev data, which it hasn't, the error only goes up a little. But it goes up dramatically on the dev set, so this is a data mismatch problem. The learning algorithm was never trained directly on either the training-dev set or the dev set, but those two sets come from different distributions: whatever the algorithm learned works well on the training-dev set but not on the dev set. Somehow your algorithm has learned to do well on a distribution different from the one you actually care about; we call this data mismatch.
Let's take a couple more examples; I'll write them on the next line since I'm running out of space. Suppose the training error is 10%, the training-dev error is 11%, and the dev error is 12%, and remember that the human-level proxy for the Bayes error rate is roughly 0%. If you're getting this level of performance, then you really have a bias problem, an avoidable bias problem, because the algorithm is doing much worse than human level: the bias here is really high.
As a final example, if your training error is 10%, your training-dev error is 11%, and your dev error is 20%, then you actually have two problems. First, the avoidable bias is quite high: you're not doing very well on the training set itself, since humans achieve close to 0% error while your algorithm gets 10%. The variance here seems small, but the data mismatch is a big problem. So for this example I'd say you have a large avoidable bias problem and, on top of that, a data mismatch problem.
So let's look at what we did on this slide and write down the general principles. The key numbers to look at are: the human-level error rate, your training set error, your training-dev set error (measured on data from the same distribution as the training set, but not trained on), and your dev set error. Depending on how far apart these error rates are, you can get a sense of how big your problems with avoidable bias, variance, and data mismatch are.
Say human-level error is 4%, your training error is 7%, your training-dev error is 10%, and your dev error is 12%. The gap between the first two gives you a sense of the avoidable bias, because you'd like your algorithm to at least match human-level performance on the training set. The next gap roughly indicates the variance: how well you generalize from the training set to the training-dev set. And the next gap tells you how big the data mismatch problem is. Technically you could add one more number, the test set error. You shouldn't be developing against the test set, because you don't want to overfit it; but if you look at the gap between dev set performance and test set performance, it indicates how much you've overfit the dev set. If there's a big gap, i.e., the algorithm does much better on the dev set than on the test set, then you've probably overfit the dev set, and maybe you need a bigger dev set. Remember, your dev and test sets come from the same distribution, so the only way to get a big gap here is if you've somehow tuned your system to do well on the dev set specifically. If that's the case, you might want to take a step back and collect more dev set data. In this example, the numbers get bigger as you go down the list.
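In code form, the decomposition looks like this; the 13% test error is an assumed figure added only for illustration:

```python
# Error decomposition from the example above.
human_level, train_err, train_dev_err, dev_err, test_err = 0.04, 0.07, 0.10, 0.12, 0.13

avoidable_bias  = train_err - human_level     # 3%: gap to human-level performance
variance        = train_dev_err - train_err   # 3%: generalization, same distribution
data_mismatch   = dev_err - train_dev_err     # 2%: dev distribution differs
dev_overfitting = test_err - dev_err          # 1%: how much you overfit the dev set
```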
Here's one more example, where the numbers don't keep getting bigger. Maybe human-level performance is 4%, the training error is 7%, and the training-dev error is 10%, but when you look at the dev set you find, surprisingly, that the algorithm does better on the dev set, maybe 6%. You might see this, for example, on a speech recognition task where the training data is actually much harder to recognize than your dev or test set. The first two numbers (7%, 10%) are evaluated on the training set distribution, while the other two (6%, 6%) are evaluated on the dev/test distribution. So sometimes, if your dev/test distribution is much easier than the data your application trains on, these error rates can genuinely go down. If you see something interesting like this, you may want a more general analysis than the one so far, which I'll explain quickly on the next slide.
So let's take the voice-activated rearview mirror as an example. It turns out the numbers we've been writing down fit into a table. On the horizontal axis I'll put the different data sets: for example, data from general speech recognition tasks, so you might have a pile of data from ordinary speech recognition problems, data from smart speakers, data you've purchased, and so on; and then the rearview mirror speech data, recorded in the car. That's one axis of the table, the different data sets. On the other axis I'll label different ways of examining the data, you could say different "algorithms".
First there's the human level: how accurately humans handle each of these data sets. Then there's the error rate on data the neural network was trained on, and then the error rate on data from that distribution that the network was not trained on. So, from the previous slide, human-level performance goes in this cell (second row, second column): how well humans do on the general speech recognition data, the tens of thousands of clips that go into your training set; in the example on the last slide, that was 4%. The next number (7%) is the training error rate: if your learning algorithm has seen these samples and run gradient descent on them, samples drawn from your training set distribution, i.e., the general speech data distribution, how well does it do on the data it was trained on? Then there's the training-dev set error rate, which is usually a bit higher: on general speech data from the same distribution, but that your algorithm was not trained on, how does it perform? That's what we call the training-dev set error rate.
Moving one cell to the right, this is the dev set error rate, or the dev/test set error rate, which in this example was 6%. The dev and test errors are actually two numbers, but both can go in this cell: if you have data from the rearview mirror, data actually recorded from the rearview mirror application in the car, on which your neural network never ran backpropagation, what error rate does it get?
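Filling in the numbers from this example, the table looks roughly like this (a reconstruction from the description; the two 6% entries in the right-hand column are the ones discussed below):

```
                                   General speech data     Rearview mirror speech data
Human level                        4%                      6%
Error on examples trained on       7% (training error)     6%
Error on examples not trained on   10% (train-dev error)   6% (dev/test error)
```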
The analysis we did on the last slide was to look at the differences between pairs of these numbers. The gap between human level (4%) and training error (7%) measures the size of the avoidable bias. The gap between training error (7%) and train-dev error (10%) measures the size of the variance. And the gap between train-dev error (10%) and dev/test error (6%) measures the size of the data mismatch problem.
It turns out the remaining two entries, human level on rearview mirror speech data (6%) and error on rearview mirror examples trained on (6%), are also useful. If human level turns out to be 6%, you get that number by having some people label rearview mirror speech recognition data themselves and measuring how well humans do on that task; maybe it turns out to be 6%. And in practice you might also collect some rearview mirror speech data, put it into the training set so the neural network learns on it, and then measure the error rate on that subset of the data. If you get 6% there too, that means you've already reached human-level performance on the rearview mirror speech data, so maybe you're doing about as well as you can on that data distribution.
Carrying out even more of this analysis doesn't necessarily show you a path forward, but sometimes it gives you insight into features of the problem. For example, comparing these two numbers, human level on general speech recognition (4%) versus on rearview mirror speech data (6%), tells us that the rearview mirror data is actually harder for humans than general speech data, since humans get 6% error rather than 4%. Looking at differences like these gives you a sense of the bias, the variance, and the degree of data mismatch. This more general analysis is something I've done only a few times; for many problems, checking the subset of entries we covered and looking at the differences between them is enough to send you in a relatively promising direction. But sometimes filling out the whole table gives you additional insight.
Finally, we've talked a lot about how to address bias and variance, but what about data mismatch? In particular, training on data from a different distribution than your dev and test sets lets you use much more training data, which can really help your learning algorithm. But instead of just bias and variance, you now have this new potential problem, data mismatch. What are good ways to handle it? To be honest, there aren't very general or systematic ways to address data mismatch, but there are a few things you can try that may help, and we'll look at them in the next video.
So, we've discussed using training data drawn from a different distribution than the dev and test sets, which gives you more training data and therefore helps improve your learning algorithm's performance. But instead of just bias and variance, doing so introduces a third potential problem: data mismatch. If you do error analysis and find that data mismatch is the source of a large share of your errors, how do you solve it? Unfortunately, it turns out there's no completely systematic way to address data mismatch, but there are a few things you can try that may help. Let's see them in the next video.
2.6 Addressing Data Mismatch
What do you do if your training set comes from a different distribution than your dev and test sets, and error analysis shows you have a data mismatch problem? There's no completely systematic solution, but let's look at some things you can try. If I find a serious data mismatch problem, what I usually do is carry out error analysis myself and try to understand the concrete differences between the training set and the dev/test sets. Technically, to avoid overfitting the test set, when doing error analysis you should manually look at the dev set rather than the test set.
As a concrete example, if you're building the voice-activated rearview mirror app, since the data is audio you might listen to samples from your dev set and try to figure out how the dev set differs from your training set. You might find, for example, that many dev set clips have a lot of car noise in the background; that's one way your dev set differs from your training set. Maybe you'll find other error patterns, for instance that the voice-activated rearview mirror often gets the street number wrong, and since many navigation requests contain street addresses, getting street numbers right is really important. When you understand the nature of your dev set errors, how the dev set may be different from or harder than your training set, you can try to make the training data more similar to the dev set, or collect more data similar to your dev and test sets. So, for example, if you find that car background noise is a major source of errors, you can simulate noisy in-car data, which I'll discuss in detail on the next slide. Or if you find the system misreading street numbers, maybe you can deliberately collect more audio of people saying numbers and add it to your training set.
Now, I know this slide gives only rough guidelines and lists some things you can try; it isn't a systematic process, and I can't guarantee it will lead to progress. But I've found that this kind of human insight, trying to gather more data that resembles the situations that really matter, often helps solve a lot of problems. So, if your goal is to make the training data more like your dev set, what can you actually do?
One technique you can use is artificial data synthesis; let's discuss it in the context of solving the car noise problem for a speech recognition system. Maybe you don't actually have much audio recorded with car or highway background noise. But it turns out you can synthesize it. Suppose you've recorded a lot of clean audio without car background noise, for instance a clip from your training set of someone saying "The quick brown fox jumps over the lazy dog." (By the way, this sentence is used a lot in AI testing because it contains every letter of the alphabet, so you'll see it often.) Given that recording, you can also collect a clip of car noise, the background sound inside a car when you drive along without saying anything. If you add the two audio clips together, you can synthesize what "The quick brown fox jumps over the lazy dog" would sound like spoken over car background noise. This is a relatively simple example of audio synthesis; in practice you might also synthesize other effects, such as reverberation, the sound bouncing off the interior of the car.
Through artificial data synthesis, you can very quickly create more training data that sounds like it was recorded inside a car, without actually driving around in a moving car recording tens of thousands of hours of audio. So if error analysis suggests you should make your data sound more like it was recorded in a car, it's quite reasonable to synthesize that kind of audio and feed it to your machine learning algorithm.
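As a minimal sketch of this kind of audio synthesis (assuming both clips are mono recordings at the same sample rate; the file names and mixing gain are illustrative assumptions):

```python
import soundfile as sf  # common audio I/O library

speech, rate = sf.read("quick_brown_fox.wav")  # clean, quiet-background speech
noise, _ = sf.read("car_interior_noise.wav")   # in-car background noise

# Assume the noise clip is at least as long as the speech; trim to match.
noise = noise[: len(speech)]
synthetic = speech + 0.3 * noise               # overlay noise at a chosen gain
sf.write("quick_brown_fox_in_car.wav", synthetic, rate)
```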
Now, there's a potential problem with artificial data synthesis. Suppose you have 10,000 hours of audio recorded against a quiet background, but only one hour of car background noise. One thing you could do is loop that one hour of car noise 10,000 times and overlay it onto the 10,000 hours of quiet-background audio. If you do that, the audio will sound fine to a human. But there's a risk that your learning algorithm will overfit to that one hour of car noise. In particular, if you imagine the set of all possible in-car noise recordings, then by recording just one hour of it, you may be simulating only a tiny fraction of that space; you're synthesizing data from a very small subset of all car noise.
To the human ear, all this audio sounds fine, because one hour of car noise sounds much like any other hour of car noise. But you'd be synthesizing data from a very small subset of the whole space, and the neural network could end up overfitting to your one hour of car noise. I don't know whether you could cheaply collect 10,000 hours of distinct car noise, so that instead of looping that single hour you'd have 10,000 hours of never-repeating car noise to overlay on 10,000 hours of never-repeating speech recorded against a quiet background. It's possible, but not guaranteed. Using 10,000 hours of unique car noise rather than one repeated hour would likely let the algorithm reach better performance. The challenge with artificial data synthesis is that the human ear can't tell that those 10,000 hours all sound like the same one hour, so you can end up producing training data drawn from a much smaller subset of the space, with very little raw diversity, without realizing it.
Here's another example of data synthesis. Suppose you're building a self-driving car, and you want to detect vehicles like this one and put a bounding box around it, like this. One idea that has been discussed a lot is using computer-generated images to simulate thousands of vehicles. In fact, a couple of the car pictures here (the last two below) were actually computer generated, and I'd say the synthesis is pretty realistic; with images like these, you could probably train a decent computer vision system to detect cars.
Unfortunately, the same caveat from the last slide applies. Picture the set of all cars: if you synthesize only a small subset of them, that might look fine to the human eye, but your learning algorithm may overfit to the small subset of cars you synthesized. In particular, many people have independently had the idea that once you find a video game with realistic vehicle rendering, you can grab screenshots and obtain an enormous data set of car images. It turns out that if you look closely at a video game, even if the game has only 20 distinct cars, it looks fine: you're driving around in the game and only ever see those 20 cars, so the simulation seems quite realistic. But in the real world there are far more than 20 different vehicle designs. If you train your system on synthetic photos of just 20 distinct vehicles, your neural network is likely to overfit to those 20 vehicles, even though a human could hardly tell the difference. The images may look photorealistic, yet you may really be covering only a tiny subset of the space of all possible vehicles.
So, in summary, if you think you have a data mismatch problem, I suggest you do error analysis, looking at the training set and the development set, and try to understand how these two data distributions differ, and then see whether you can find a way to collect more data that looks like the development set for training.
One approach we talked about is artificial data synthesis, and it does work: in speech recognition, I've seen artificial data synthesis significantly improve the performance of already-good speech recognition systems. But when you use synthetic data, be cautious, and keep in mind that you might be simulating data from only a small fraction of the possible space.
So that’s how to deal with data mismatches, and then I want to share with you some ideas about how to learn from multiple types of data at the same time.
2.7 Transfer Learning
One of the most powerful ideas in deep learning is that neural networks can sometimes learn from one task and apply that knowledge to a separate task. So for example, maybe you’ve trained a neural network to recognize objects like cats, and then you use that knowledge, or part of that learned knowledge, to help you read X-ray scans better, which is called transfer learning.
Let's look at an example. Say you've trained a neural network for image recognition: you first take a neural network and train it on $(x, y)$ pairs, where $x$ is an image and $y$ is some object, an image of a cat, a dog, a bird, or something else. If you want to take this neural network and adapt it, or transfer what it has learned, to a different task, say radiology diagnosis, meaning reading X-ray scans, what you can do is take the last output layer of the neural network, delete it, delete the weights feeding into that last layer, create a new set of randomly initialized weights for the last layer, and then train it on radiology data.
To be concrete, in the first phase of training, when you train on the image recognition task, you train all the usual parameters of the neural network, all the weights in all the layers, and you end up with a network that makes image recognition predictions. Having trained that network, to implement transfer learning you now swap in a new data set of $(x, y)$ pairs, where the inputs $x$ are now radiology images and $y$ is the diagnosis you want to predict, and you randomly initialize the weights of the last layer, call them $W^{[L]}$ and $b^{[L]}$.
Now you retrain the network on the new radiology data set. There are a couple of ways to do this. If your radiology data set is small, you might retrain only the weights of the last layer, $W^{[L]}$ and $b^{[L]}$, and keep all the other parameters fixed. If you have enough data, you can retrain all the layers of the neural network. The rule of thumb is: with a small data set, retrain only the layers just before the output, maybe the last one or two; with a lot of data, maybe retrain all the parameters in the network. If you retrain all the parameters, then the initial phase of training on image recognition data is sometimes called pre-training, because you're using the image recognition data to pre-initialize, or pre-train, the weights of the neural network. If you then update all the weights by training on the radiology data, that process is sometimes called fine-tuning. So if you read about pre-training and fine-tuning in the deep learning literature, this is what they mean. The terms pre-training and fine-tuning come from transfer learning.
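As a concrete illustration, here is a minimal sketch in Keras (my own framework choice; the course doesn't prescribe one) of both regimes: retraining only a new last layer when radiology data is scarce, and fine-tuning everything when it isn't. `radiology_images` and `radiology_labels` are hypothetical placeholders:

```python
from tensorflow import keras

# Pre-trained image recognition network, minus its original output layer.
base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   pooling="avg", input_shape=(224, 224, 3))

# Small radiology data set: freeze the pre-trained layers and train only a
# new, randomly initialized output layer (the new W^[L], b^[L]).
base.trainable = False
model = keras.Sequential([base, keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(radiology_images, radiology_labels, epochs=10)

# Plenty of radiology data: unfreeze and fine-tune all the weights,
# typically with a small learning rate.
# base.trainable = True
# model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
# model.fit(radiology_images, radiology_labels, epochs=10)
```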
What you're doing in this case is applying, or transferring, knowledge learned from image recognition to radiology. Why does this work? A lot of low-level features, such as detecting edges, detecting curves, detecting parts of objects, learning those from a very large image recognition data set may help your algorithm do better in radiology diagnosis. The network has learned a lot about the structure of images and what objects look like; having learned about lines, dots, curves, and small parts of objects, that knowledge could help your radiology diagnosis network learn faster, or learn with less data.
Here's another example. Suppose you've trained a speech recognition system, where the input $x$ is audio or an audio clip and the output $y$ is the transcript, so you've trained the system to output transcripts. Now say you want to build a wake word or trigger word detection system. Wake words, or trigger words, are the words we say to wake up voice-controlled devices at home: you say "Alexa" to wake up an Amazon Echo, "OK Google" to wake up a Google device, "Hey Siri" to wake up an Apple device, or "Hello Baidu" to wake up a Baidu device. To do this, you might take out the last layer of the neural network and create a new output node, but sometimes you could create not just one new node but several new layers, and feed the labels of the wake word detection problem into training. Again, depending on how much data you have, you might retrain only the new layers of the network, or you might need to retrain more layers of the network.
So when does transfer learning make sense? It makes sense when you have a lot of data for the problem you're transferring from and relatively little data for the problem you're transferring to. For example, say you have a million examples for an image recognition task: that's a lot of data for learning low-level features, for learning lots of useful features in the earlier layers of the neural network. But for the radiology task, maybe you have only a hundred examples, so the data for the radiology diagnosis problem is very scarce, maybe only 100 X-ray scans. A lot of the knowledge learned from image recognition can therefore be transferred and can really help boost performance on the radiology task, even though your radiology data is limited.
For speech recognition, maybe you've trained a speech recognition system on 10,000 hours of data, so you've learned a lot about what human voices sound like from that 10,000 hours, which really is a lot of data. But for trigger word detection, maybe you have only one hour of data, which is too little to fit many parameters. In this case, much of what was learned up front about human voices, about the components of human speech and so on, can help you build a pretty good wake word detector even though your data set is relatively small; for the wake word task, at least, the data set is far smaller.
So in both cases, you transfer from a problem with a lot of data to a problem with relatively little data. The other way around, transfer learning might not make sense. For example, suppose you trained an image recognition system on 100 images, and you have 100 or even 1,000 images for training a radiology diagnosis system. If you really want the radiology diagnosis system to do well, training on radiology images is likely more valuable than using pictures of cats and dogs. So the value per example here (the 100 or 1,000 images for the radiology system) is much greater than the value per example there (the 100 images for the image recognition system), at least for the purpose of building a well-performing radiology system. So if you already have more radiology data, then your 100 pictures of cats, dogs, or random objects won't help much, because each image from the cat-and-dog recognition task just isn't worth as much as an X-ray scan for building a good radiology diagnosis system.
So this is an example where transfer learning may not hurt, but you shouldn't expect it to deliver meaningful gains either. Similarly, if you trained a speech recognition system on 10 hours of data, and you actually have 10 hours or more, say 50 hours, of wake word detection data, then transfer learning may or may not help; transferring from those 10 hours probably won't hurt much, but you shouldn't expect meaningful gains.
So to sum up, when does transfer learning make sense? If you want to transfer some knowledge from task A to task B, transfer learning makes sense when task A and task B have the same input $x$. In the first example, the inputs of A and B are both images; in the second example, both are audio. Transfer learning makes more sense when task A has much more data than task B. All of this assumes you care about improving performance on task B; because each example of task B's data is more valuable for task B, task A usually needs a much larger amount of data for the transfer to be helpful, precisely because an individual example from task A is worth less than an individual example from task B. And transfer learning tends to make more sense if you believe the low-level features of task A could be helpful for learning task B.
And in these two previous examples, maybe learning image recognition teaches the system enough about images to allow it to do radiology diagnosis, or maybe learning speech recognition teaches the system enough about human language to help you develop trigger word or wake word detectors.
So to summarize, transfer learning is most useful when you're trying to do well on some task B, and that task B typically has relatively little data. For example, in radiology, you know it's hard to collect very many X-ray scans to build a high-performing radiology diagnosis system, so in that case you might find a related but different task A, such as image recognition, where you may have trained on a million images and learned a lot of low-level features from them. That can help the network do better on the radiology task, even though task B has less data. When transfer learning makes sense, it can significantly improve the performance of your learning task, but I've also sometimes seen it used when task A actually had less data than task B, and in those cases you shouldn't expect much of a gain.
Ok, so that's transfer learning, where you learn from one task and then try to transfer to a different task. There's another version of learning from multiple tasks, called multi-task learning, where you try to learn from several tasks in parallel, rather than in series, where you train on one task and then try to transfer to another. So in the next video, let's talk about multi-task learning.
2.8 Multi-task Learning
In transfer learning, the steps are sequential: you learn from task A and then transfer to task B. In multi-task learning, you learn the tasks simultaneously from the start, trying to have a single neural network do several things at once, hoping that each of these tasks helps all of the others.
Let’s take an example. If you’re developing a driverless car, your driverless car might need to detect different objects at the same time — pedestrians, vehicles, stop signs, traffic lights, all sorts of other things. For example, in this example on the left, there’s a stop sign in the image, and then there’s a car in the image, but no pedestrians, and no traffic lights.
If this is the input image $x^{(i)}$, then instead of one label $y^{(i)}$, it has four labels. In this example, there are no pedestrians, there is a car, there is a stop sign, and there are no traffic lights. If you were also trying to detect other objects, $y^{(i)}$ would have even more dimensions, but for now let's stick with four, so $y^{(i)}$ is a $4 \times 1$ vector. And if you look at the training set labels as a whole, as before, you stack them horizontally: $Y = \left[\, y^{(1)} \; y^{(2)} \; \cdots \; y^{(m)} \,\right]$. But now each $y^{(i)}$ is a $4 \times 1$ column vector, so $Y$ is a $4 \times m$ matrix, whereas before, when $y^{(i)}$ was a single real number, $Y$ was a $1 \times m$ matrix.
So what you can do now is train a neural network to predict these $y$ values, and you get a network that looks like this: input $x$, and now the output $\hat{y}$ is a four-dimensional vector. Notice I've drawn four output nodes: the first node predicts whether there is a pedestrian, the second predicts whether there is a car, the third whether there is a stop sign, and the fourth whether there is a traffic light, so $\hat{y}$ is four-dimensional.
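As a sketch, such a network might look like this in Keras (an illustrative framework and architecture choice with a made-up input size; the course only specifies the 4-unit sigmoid output):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 3)),          # hypothetical image size
    keras.layers.Conv2D(32, 3, activation="relu"),  # shared early features
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    # Four independent sigmoid units: pedestrian, car, stop sign, traffic light.
    keras.layers.Dense(4, activation="sigmoid"),
])
# binary_crossentropy on a 4-unit sigmoid output averages the per-task
# logistic losses, which is proportional to the cost function defined next.
model.compile(optimizer="adam", loss="binary_crossentropy")
```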
To train this neural network, you need to define its loss function. Given a predicted output $\hat{y}^{(i)}$, which is a 4-dimensional vector, the average loss over the entire training set is:

$$J = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{4} L\!\left(\hat{y}_j^{(i)}, y_j^{(i)}\right)$$

The inner sum over the four components covers the individual prediction losses for pedestrian, car, stop sign, and traffic light, and $L$ is the usual logistic loss, written like this:

$$L\!\left(\hat{y}_j^{(i)}, y_j^{(i)}\right) = -\,y_j^{(i)} \log \hat{y}_j^{(i)} - \left(1 - y_j^{(i)}\right)\log\!\left(1 - \hat{y}_j^{(i)}\right)$$
The main difference from the earlier cat classification examples is the inner sum over $j$ from 1 to 4. And the main difference between this and softmax regression is that softmax assigns a single label to each example, whereas here a single image can carry multiple labels.
So one image can have many different labels: it's not that every image is either a pedestrian picture, a car picture, a stop sign picture, or a traffic light picture. For each photo, you want to know whether it contains pedestrians, cars, stop signs, or traffic lights, and multiple objects can appear in the same image. In fact, in the last slide, the image contained both a car and a stop sign, but no pedestrians and no traffic lights. So rather than assigning a single label to the image, you go through the different object types and mark, for each one, whether that type of object appears in the image. That's what I mean by one image having multiple labels. If you train a neural network to minimize this cost function, you are doing multi-task learning, because you're building a single neural network that looks at each image and solves four problems: it tries to tell you whether each of these four objects appears in the image. Alternatively, you could have trained four separate neural networks, one per task. But if some of the earlier features of the neural network are shared across recognizing the different objects, then you'll find that training one neural network to do four things gives better performance than training four completely separate neural networks. That's the power of multi-task learning.
One other detail: the way I've described the algorithm so far, it's as if every image had all four labels. It turns out multi-task learning also works when only some of the labels are present. Say for the first training example, the labeler tells you there's a pedestrian and no car, but doesn't mark whether there's a stop sign or a traffic light. For the second example, perhaps there are pedestrians and a car, but again the labeler didn't mark whether there's a stop sign or a traffic light. Maybe some examples are fully labeled, while for others the labelers only marked whether there's a car, leaving the rest as question marks.
Even with a data set like this, you can train the algorithm to do the four tasks at once, even when some images have only a subset of the labels and the rest are question marks or whatever. The way you train the algorithm is: in the sum over $j$ from 1 to 4, you sum only over the labels whose values are 0 or 1; whenever a label is a question mark, you simply omit that term from the sum. That way you can still use the data set.
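Here is a minimal NumPy sketch of that masked loss, assuming (my own convention, not the course's) that missing labels, the question marks, are encoded as -1:

```python
import numpy as np

def multitask_loss(y_hat, y):
    """y_hat, y: (m, 4) arrays; rows are examples, columns the four tasks.
    Entries of y are 0, 1, or -1, where -1 means 'label missing'."""
    mask = (y == 0) | (y == 1)             # keep only the labeled entries
    eps = 1e-12                            # avoid log(0)
    per_entry = -(y * np.log(y_hat + eps)
                  + (1 - y) * np.log(1 - y_hat + eps))
    # Sum the logistic losses over labeled tasks, average over the m examples.
    return np.sum(per_entry[mask]) / y.shape[0]
```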
So when does multitasking make sense? When three things are true, it makes sense.
First, it makes sense if the group of tasks you train on together can share low-level features. For the self-driving example, it makes sense to recognize traffic lights, cars, and pedestrians at the same time: these objects share similar low-level features, and those might also help you recognize stop signs, because all of these are features of road scenes.
Second, and this rule is less hard-and-fast, so it doesn't always hold: what I've seen in many successful multi-task learning setups is that the amount of data per task is fairly similar. Recall that in transfer learning you learn from task A and transfer to task B, so if task A has a million examples and task B only a thousand, everything learned from the million examples can really help boost training for the task with the smaller data set. What about multi-task learning? In multi-task learning you usually have more than two tasks; we had four earlier, but suppose you have 100 tasks and you're doing multi-task learning, trying to recognize 100 different object types at once, with roughly 1,000 examples per task. Now if you focus on the performance of a single task, say the 100th one, doing that task alone you would have only its 1,000 examples to train on. But by simultaneously training on the other 99 tasks, which together contribute 99,000 examples, you can get a big knowledge boost that may significantly improve performance on that one task, whose 1,000-example training set might otherwise be too weak. And by symmetry, each of the 100 tasks can benefit the same way, each getting some data or knowledge from the other 99. So the second condition isn't an absolute rule, but what I usually look for is this: if you want a big performance boost on a single task from multi-task learning, the other tasks combined need to have much more data than that single task. One way to satisfy that is the example just given, where the amount of data in each task is similar; the point is that if a single task already has, say, 1,000 examples, then all the other tasks together had better add up to far more than 1,000 examples, so that their knowledge can help improve that task's performance.
Finally, multi-task learning tends to make more sense when you can train a neural network big enough to do well on all the tasks at once. The alternative to multi-task learning is to train a separate neural network for each task: instead of a single network that handles pedestrian, car, stop sign, and traffic light detection, you could train one network for pedestrian detection, one for car detection, one for stop sign detection, and one for traffic light detection. What researcher Rich Caruana found some years ago is that the only time multi-task learning hurts performance, compared to training separate neural networks, is when your neural network isn't big enough. If you can train a large enough neural network, multi-task learning should not, or should only rarely, hurt performance, and we hope it does better than training a separate network for each task individually.
So that's multi-task learning. In practice, it's used less often than transfer learning. I see many applications of transfer learning, where you need to solve a problem but have very little training data, so you find a related problem with a lot of data, learn on it first, and transfer the knowledge to the new problem. Multi-task learning is rarer: it requires a set of tasks that you want to do well on simultaneously and can train on all at once. Computer vision is perhaps one example: in object detection, I see more applications of multi-task learning, where one neural network detecting many objects works better than separate networks trained to detect different objects. But as I said, on average, transfer learning is used more often today than multi-task learning, although both can be powerful tools.
So in summary, multi-task learning lets you train one neural network to do many tasks, which can give you better performance than doing each task separately. But note that in practice transfer learning is used more often than multi-task learning: very commonly, if you want to solve a machine learning problem but your data set is relatively small, transfer learning can really help; you find a related problem with much more data, train your neural network on that, and then transfer to the task with little data.
Today we learned a lot about transfer learning and some applications of transfer learning and multi-task learning. Multi-task learning, I think, is used much less than transfer learning; perhaps the one exception is computer vision object detection, where people routinely train one neural network to detect many different objects at once rather than training separate networks. But on average, even though transfer learning and multi-task learning are similar kinds of tools, in practice I see transfer learning used more, and I think that's because it's hard to find that many similar tasks, each with comparable amounts of data, that you'd want to train with a single neural network. Again, object detection is the most notable exception in computer vision.
So that's multi-task learning. Multi-task learning and transfer learning are both important tools in your toolkit. Finally, I want to move on to end-to-end deep learning, so let's go to the next video and talk about end-to-end learning.
2.9 What is end-to-end deep learning?
One of the most exciting recent developments in deep learning has been the rise of end-to-end deep learning. So what is it? Briefly, some data processing systems or learning systems used to require multiple stages of processing; end-to-end deep learning ignores all those separate stages and replaces them with a single neural network.
Let's look at some examples. Take speech recognition, where the goal is to take an input $x$, say a piece of audio, and map it to an output $y$, the transcript of that audio. Traditionally, speech recognition involved many stages of processing. First you extract some features, some hand-designed audio features; maybe you've heard of MFCC, an algorithm for extracting a specific set of hand-designed features from audio. Having extracted the low-level features, you might apply a machine learning algorithm to find the phonemes in the audio clip. Phonemes are the basic units of sound; for example, the word "cat" consists of three phonemes, Cu-, Ah-, and Tu-, which the algorithm has to extract. You then string phonemes together to form individual words, and string the words together to form the transcript of the audio clip.
In contrast to this kind of pipeline with its many stages, end-to-end deep learning trains one huge neural network whose input is the audio and whose output is the transcript directly. One interesting sociological effect in AI is that as end-to-end deep learning started to work better, some researchers had spent a lot of time, or their entire careers, designing the individual steps of such pipelines, and not only in speech recognition, perhaps also in computer vision and other fields; they had written many papers, some devoting large parts of their careers to developing features and other pipeline components. End-to-end deep learning simply takes the training set $(x, y)$ and learns the function mapping from $x$ to $y$ directly, bypassing many of those steps. That has been hard for people in some disciplines to accept; they can't accept this way of building AI systems, because in some cases the end-to-end approach completely replaces the old system, and intermediate components that took years of work can become obsolete.
It turns out that one of the challenges of end-to-end deep learning is that you may need a lot of data to make it work well. For example, if you have only 3,000 hours of data to train a speech recognition system, the traditional pipeline works really well. It's when you have very large data sets, say 10,000 or 100,000 hours of data, that the end-to-end approach suddenly becomes very powerful. So with a smaller data set, the traditional pipelined approach works fine, often better; you need a big data set for the end-to-end approach to really shine. If you have a medium amount of data, there's also an intermediate approach: input the audio, bypass the hand-designed feature extraction, and have the neural network output phonemes directly, then use the phonemes in the later stages. That's a small step toward end-to-end learning, but not all the way there.
This picture shows a face recognition turnstile built by researcher Lin Yuanqing of Baidu. A camera photographs anyone approaching the gate, and if the system recognizes the person, the gate opens automatically and lets them through, so you don't have to swipe an RFID badge to enter the facility. These systems are being deployed in more and more Chinese offices, and hopefully in other countries as well: you walk up to the gate, and if it recognizes your face, it lets you through without an RFID badge.
So how do you build such a system? The first thing you might try is to take the picture the camera captures, right? I didn't draw it very well, but say this is the camera image of someone approaching the gate. One thing you could do is try to learn a direct function mapping from the image to the person's identity, but that turns out not to be the best approach. One problem is that people can approach the gate from many different directions: they could be at the green position, or at the blue position. Sometimes they're close to the camera, so the face in the image looks big, and sometimes they're farther away, so it looks smaller. When Lin Yuanqing actually built these turnstile systems, he didn't feed the raw photo directly into a neural network to determine a person's identity.
Instead, by far the best approach to date seems to be a multi-step one. First, you run a piece of software that detects the face and finds where it is. Having detected the face, you zoom into that part of the image and crop it so the face is centered (that's the red-framed picture here), and then you feed that into a neural network that learns to estimate, or determine, the person's identity.
The researchers found that, rather than learning everything in one shot, it's easier to break the problem into two steps: first, find where the face is; second, look at the face and determine who it is. With this two-step approach, each of the two learning algorithms solves a much simpler task, and the overall performance is better.
By the way, if you want to know how step 2 actually works, I'm omitting a lot of detail here. The way you train the step-2 network is to feed it two images: the network compares the two images you give it and decides whether they show the same person. Say you have 10,000 employee IDs; you can quickly compare the red-framed image against the photos of all 10,000 employees on record and check whether the person is one of them, to decide whether they should be allowed into the facility or the building. That's how a turnstile system lets employees into the workplace.
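To give a flavor of how step 2 can be wired up, here's a minimal sketch. It assumes some pre-trained face-embedding network `embed` (a placeholder of my own, not an API from the course) that maps a cropped face to a vector such that two photos of the same person have similar vectors:

```python
import numpy as np

def identify(face_img, enrolled, embed, threshold=0.7):
    """enrolled: dict mapping employee_id -> embedding of their photo on record."""
    v = embed(face_img)
    v = v / np.linalg.norm(v)          # normalize for cosine similarity
    best_id, best_sim = None, -1.0
    for emp_id, e in enrolled.items():
        sim = float(np.dot(v, e / np.linalg.norm(e)))
        if sim > best_sim:
            best_id, best_sim = emp_id, sim
    # Open the gate only if the best match is confident enough.
    return best_id if best_sim >= threshold else None
```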
Why does the two-step approach work better? There are actually two reasons. One is that each of the two problems you solve is actually much simpler. The second is that there is lots of training data for each of the two subtasks. Specifically, there's a lot of data for task 1, face detection: given an image, find the position of the face and put a box around it. There's a lot of labeled data $(x, y)$ where $x$ is a picture and $y$ gives the position of the face, so you can build a neural network that handles task 1 quite well. For task 2 there's also a lot of data: leading companies today have, say, hundreds of millions of face photos, so given a tightly cropped image, like the red-framed photo below, today's leading face recognition teams have at least hundreds of millions of images they can use to look at two pictures and try to judge whether they show the same person. So there's much more data for task 2 as well. In contrast, if you tried to learn everything in one step, the available set of $(x, y)$ data, where $x$ is the image captured by the turnstile camera and $y$ is the person's identity, would be far smaller. You don't have enough data to solve that end-to-end learning problem, but you do have enough data to solve subproblems 1 and 2.
In fact, splitting this into two sub-problems leads to better performance than a pure end-to-end deep learning approach. But if you have enough data to do end-to-end learning, maybe the end-to-end approach works better. But in practice today, it’s not the best way.
Let's look at a few more examples, like machine translation. Traditionally, machine translation systems also had long, complicated pipelines: you'd start with English text, do text analysis, extract a bunch of features from the text, and so on, and after many steps you'd end up with a translation of the English text into French. Because there are a lot of (English, French) sentence pairs for machine translation, end-to-end deep learning works really well there, since today you can collect large data sets of $x$-$y$ pairs, English sentences and their French translations. So in this case, end-to-end deep learning works great.
One last example: say you want to look at an X-ray of a child's hand and estimate the child's age. When I first heard this problem, I thought it was a really cool crime-scene-investigation task, where you tragically find a child's skeleton and try to figure out how old the child was. It turns out the typical application of estimating a child's age from an X-ray image is far less dramatic than my CSI imagination: it's what pediatricians use to check whether a child is developing normally.
One non-end-to-end way to handle this example is to take the image and segment out each of the bones, so you figure out where this bone is, where that bone is, and so on. Then, knowing the lengths of the different bones, you consult a table of the average bone lengths in a child's hand and use it to estimate the child's age. This approach actually works pretty well.
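As a toy sketch of that second step, here's a lookup-plus-interpolation estimate of age from a measured bone length; the table values are invented for illustration, not real pediatric data:

```python
import numpy as np

# Hypothetical table: average length (cm) of some hand bone at each age (years).
avg_length_cm = np.array([3.0, 4.2, 5.1, 5.9, 6.6])
age_years     = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def estimate_age(measured_length_cm):
    # Linearly interpolate between the table entries.
    return float(np.interp(measured_length_cm, avg_length_cm, age_years))

print(estimate_age(5.5))  # ~7.0 under these made-up numbers
```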
By contrast, if you are trying to determine a child’s age directly from an image, you need a lot of data to train directly. As far as I know, that still doesn’t work today because there isn’t enough data to train the task in an end-to-end way.
You can see how to break this problem into two steps: the first step is a simpler problem that may need less data, maybe not that many X-ray images, to segment the bones; and for the second step, gathering statistics on bone lengths in children's hands, you don't need much data to build a fairly accurate table. So this multi-step approach looks very promising, maybe more promising than the end-to-end approach, at least until you have more data for end-to-end learning.
So end-to-end deep learning systems can work, they can perform well, and they can simplify the architecture so you don’t have to build so many individual components that are designed by hand, but it’s not a panacea, it doesn’t work every time. In the next video, I want to share with you a more systematic description of when you should and shouldn’t use end-to-end deep learning, and how to assemble these complex machine learning systems.
2.10 Whether to use end-to-end deep learning?
Let’s take a look at some of the pros and cons of end-to-end deep learning, so that you can use some criteria to determine whether your application is promising to use end-to-end methods.
Here are some of the benefits of applying end-to-end learning. First, end-to-end learning really just lets the data speak. If you have enough $(x, y)$ data, then whatever function best maps from $x$ to $y$, if you train a big enough neural network, hopefully the network will figure it out for itself. With a pure machine learning approach, a neural network trained directly from input to output may be able to capture whatever statistics are in the data, rather than being forced to reflect human preconceptions.
For example, in speech recognition, earlier systems had this notion of a phoneme, a basic unit of sound, such as the Cu-, Ah-, and Tu- of the word "cat". I think phonemes are, in a sense, an invention of human linguists; I actually suspect phonemes are a fiction of experts, a reasonable way to describe language, but you shouldn't force your learning algorithm to think in phonemes. If you let your learning algorithm learn whatever representation it wants, rather than forcing it to use phonemes, its overall performance may well be better.
The second benefit of end-to-end deep learning is that there are fewer components that need to be manually designed, so maybe it simplifies your design workflow, you don’t have to spend as much time manually designing features, manually designing these intermediate representations.
What about the disadvantages? There are some. First, end-to-end learning may require a lot of data. To learn the $x$-to-$y$ mapping directly, you may need a lot of $(x, y)$ data. We saw in an earlier video that you can often collect a lot of data for subtasks: for face recognition, there's a lot of data for finding the face in an image, and once the face is found, a lot of data for identifying it; but for the full end-to-end task there's much less data available. $x$ is the input end and $y$ is the output end of the system, and you need a lot of this data on both ends to train such systems. That's why we call it end-to-end learning: you learn directly from one end of the system to the other.
Another disadvantage is that it rules out potentially useful hand-designed components. Machine learning researchers tend to look down on things designed by hand, but if you don't have much data, your learning algorithm can't gain much insight from a tiny training set, so hand-designing components can be a way to inject human knowledge directly into the algorithm, and that's not always a bad thing. I think of a learning algorithm as having two main sources of knowledge: one is the data, and the other is whatever you design by hand, whether components, features, or something else. When you have tons of data, the hand-designed things matter less, but when you don't have much data, a carefully hand-designed system lets you inject a lot of human knowledge of the problem directly into the algorithm, and that can be quite helpful.
So one of the downsides of end-to-end deep learning is that it excludes potentially useful hand-designed components. Hand-designed components can be very helpful, but they can also hurt your algorithm's performance: for example, forcing your algorithm to think in phonemes may do worse than letting it discover a better representation on its own. So they're a double-edged sword: they can hurt and they can help, but they tend to help more when the training set is small.
If you're building a new machine learning system and deciding whether to use end-to-end deep learning, I think the key question is: do you have enough data to learn a function of the complexity needed to map $x$ to $y$? I don't have a formal definition of the phrase "complexity needed", but intuitively, if you want to learn a function that, say, looks at an image and finds the positions of all the bones in it, that's a relatively simple problem, and maybe the system doesn't need that much data to learn it. Or, given a photo of a person, finding the face in the photo isn't that hard either, so you probably don't need too much data, or at least you can find enough data to solve it. In contrast, a function mapping directly from the X-ray image of a hand to the child's age seems intuitively much more complex, and learning it with a pure end-to-end approach would take a lot of data.
Let me end the video with a more complex example. As you may know, I've been spending a lot of time helping Drive.ai, a company working on autonomous driving. One thing you could do here, and this is not end-to-end deep learning, is take as input an image of what's in front of your car, plus readings from radar, lidar, or other sensors. To keep things simple, let's just say you take a picture of what's in front of or around the car. To drive safely, you then need to detect nearby cars, and you also need to detect pedestrians and other things; of course, this is a highly simplified example.
Once you know where the other cars and pedestrians are, you need to plan your own route. In other words, given the positions of the other cars and the pedestrians, you decide how to steer for the next several seconds. Having decided on a particular path, maybe this is an overhead view of the road, this is your car, and you decide to follow that route, then you need to turn the steering wheel to the right angle and issue the right acceleration and braking commands. So going from the sensor or image input to detecting pedestrians and vehicles is something deep learning does very well; but once you know where the other cars and pedestrians are and what they're doing, choosing the path your car takes is usually done not with deep learning but with what's called motion planning software. If you've taken a robotics course, you'll know motion planning: it determines the path your car will follow. Then there are other algorithms, say a control algorithm, that generate the precise decisions, exactly how far to turn the steering wheel and how much force to apply to the accelerator or brake.
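Schematically, the multi-step pipeline described here decomposes like this (stubs only, sketching the architecture; only the first stage would be deep learning):

```python
from typing import List, Tuple

def detect_objects(image) -> List[Tuple[str, Tuple[float, float]]]:
    """Perception stage: a deep network returning (label, position) pairs
    for nearby cars, pedestrians, and so on."""
    raise NotImplementedError

def plan_path(detections) -> List[Tuple[float, float]]:
    """Motion-planning stage (a classical algorithm, not deep learning):
    waypoints for the next few seconds."""
    raise NotImplementedError

def control(path) -> Tuple[float, float]:
    """Control stage: the exact steering angle and accelerator/brake command."""
    raise NotImplementedError

def drive_step(image):
    # Image -> detections -> planned path -> (steering, throttle) commands.
    return control(plan_path(detect_objects(image)))
```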
What this example shows is that if you want to use machine learning or deep learning to learn individual components, then when applying supervised learning you should choose carefully what type of $x$-to-$y$ mapping to learn, depending on which tasks you can actually collect data for. By contrast, a pure end-to-end deep learning approach sounds very exciting: you input the image and get the steering angle out directly. But given the data we can collect today, and the kinds of functions we can learn with today's neural networks, that is not actually the most promising approach, nor the approach the best teams have converged on. In my view, this pure end-to-end method is less promising than the more complex multi-step approach, because of the limits on the data we can collect and on our ability to train neural networks today.
So that's end-to-end deep learning; it sometimes works remarkably well, but you also need to be aware of when to use it. Finally, thank you, and congratulations on making it this far. If you've studied last week's and this week's videos, I think you've already become smarter and more strategic, better able to prioritize tasks and move your machine learning projects forward, perhaps better than many machine learning engineers and researchers I see in Silicon Valley. So congratulations on getting here, and I hope you'll work through this week's assignment, which should give you another chance to practice these ideas and make sure you've mastered them.
References