This article is part of the notes for Andrew Ng’s deep learning course [1].
Author: Huang Haiguang [2]
Main authors: Huang Haiguang, Lin Xingmu (all notes for Course 4; Course 5 weeks 1 and 2), Zhu Yansen (all notes for Course 3), He Zhiyao (Course 5 week 3 notes), Wang Xiang, Hu Han, laughing, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, the cao, Lu Haoxiang, Qiu Muchen, Tang Tianze, Zhang Hao, victor chan, endure, jersey, Shen Weichen, Gu Hongshun, when the super, Annie, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian
Editorial staff: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jianyong, Wang Xiang, Xie Shichen, Jiang Peng
Note: Notes and assignments (including data and original assignment files) and videos can be downloaded on Github [3].
I will publish the course notes on the official account “Machine Learning Beginners”; please follow it for updates.
Structuring Machine Learning Projects
Week 1 Machine Learning (ML) Strategy (1)
1.1 Why ML Strategy? (Why ML Strategy?)
Hello, and welcome to this course on how to structure your machine learning projects, that is, machine learning strategy. I hope that in this course you will learn how to optimize your machine learning systems more quickly and efficiently. So what is a machine learning strategy?
Let’s start with an illuminating example. Let’s say you’re debugging your cat classifier, and after some tweaking, your system reaches 90% accuracy, but it’s not good enough for your application.
You may have many ideas for improving the system. For example, you might want to collect more training data. Or you might say that the training set isn’t diverse enough: you should gather more cat pictures in different poses, or a more diverse set of negative examples. Or you might train longer with gradient descent. Or you might try a completely different optimization algorithm, such as Adam. Or try a larger or a smaller neural network. Or try dropout or L2 regularization. Or change the architecture of the network, such as the activation functions or the number of hidden units, and so on.
When you’re trying to optimize a deep learning system, there are usually a lot of ideas you could try, but the problem is that if you make the wrong choice, you can spend six months going in the wrong direction and only realize it afterwards. For example, I’ve seen teams spend six months collecting more data, only to find that the extra data barely improved the performance of their system. So, assuming you don’t have six months to waste on a project, it would be very helpful to have quick and effective ways to determine which ideas are sound and worth trying, or even to come up with new ones, and which can be safely discarded.
I hope that in this course, you’ll get some strategies, some approaches to analyzing machine learning problems that will lead you in the most promising directions. In this course, I’m going to share with you some of the lessons I’ve learned from building and deploying a number of deep learning products that I think are unique to this course. For example, many college deep learning courses rarely mention these strategies. In fact, machine learning strategies are changing in the era of deep learning, because the things that deep learning algorithms can do are very different now than the previous generation of machine learning algorithms. I hope these strategies will help you become more efficient and put your deep learning systems to work faster.
1.2 Orthogonalization
One of the challenges of building machine learning systems is that there are so many things you can try and change, including, for example, all the hyperparameters you could tune. One of the things I’ve noticed about highly effective machine learning experts is that they think very clearly: they have a very clear sense of what to adjust to achieve a particular effect. This is called orthogonalization, and I’ll explain what it means.
Consider an old-style television picture: there are lots of knobs for adjusting different properties of the image. On these old TV sets, there might be one knob to adjust the vertical height of the image, another to adjust its width, another for the trapezoidal angle, another to shift the image left and right, and another to rotate it, and so on. Television designers spent a lot of time designing the circuits, often analog circuits, to make sure each knob had one relatively specific function: one knob to adjust the height, one for the width, one for the trapezoidal angle, and so on.
In contrast, imagine a knob that adjusted some combination of the height of the image, the width, the trapezoidal angle, and the horizontal position of the image all at once. If turning one knob changes the height, width, trapezoidal angle, and horizontal shift simultaneously, it becomes almost impossible to center the picture on the screen.
So in this context, orthogonalization means that the TV designers built the knobs so that each knob adjusts only one property. That makes it much easier to tune the television image and get the picture centered.
Now for another example of orthogonalization: think about learning to drive. A car has three main controls. The first is the steering wheel, which determines how much you turn left or right; the other two are the accelerator and the brake. One control governs direction, and the other two govern speed, so it is relatively easy to understand how different actions on different controls affect the car’s motion.
Imagine instead that someone built a car controlled by a game pad, where one stick controls some mixture of steering angle and speed, and another stick controls a different mixture of steering angle and speed. In theory, by adjusting these two controls you could still get the car to the angle and speed you want, but it is much harder than having one control for steering and a separate control for speed.
So the idea of orthogonalization is that you can think of one dimension of control as adjusting the steering angle, and another dimension as adjusting the speed: you want one knob that only controls the steering angle, and another knob, in driving the accelerator and brake, that only controls the speed. If instead you have a control that mixes the two, affecting both your steering angle and your speed at the same time, it is hard to get the car to the speed and angle you want. After orthogonalization (orthogonal means at 90 degrees), with controls that ideally line up with the properties you actually want to control, tuning becomes much easier: you can adjust the steering angle separately from the gas and brakes, and make the car move the way you want.
So what does this have to do with machine learning? To make a supervised learning system work, you usually need to adjust the knobs on your system.
You usually need to make sure four things hold. First, you must make sure the system does well at least on the training set, as measured by some evaluation on the training set: performance there must reach an acceptable level. For some applications this might mean reaching human-level performance, but it depends on your application; next week we will talk more about comparing with human-level performance. Second, after doing well on the training set, you want the system to do well on the development set. Third, you want it to do well on the test set. Finally, you want the system, judged by its cost function performance on the test set, to perform satisfactorily in practice, for example so that the users of your cat app are happy.
If your TV image is too wide or too narrow, you want one knob to adjust it; you don’t want to carefully fiddle with five different knobs that each also affect other image properties. You want a single knob that just changes the width of the TV image.
Similarly, if your algorithm does not fit the training set well on the cost function, you want one knob, or a specific set of knobs, that you can use to make the algorithm fit the training set well. The knobs you would use for this might be training a bigger network, or switching to a better optimization algorithm, such as the Adam optimization algorithm, and so on. We’ll discuss some other options this week and next.
In contrast, if you find that the algorithm fits the development set poorly, then there should be a separate set of knobs for that. Say your algorithm does well on the training set but not on the development set: then you have a set of knobs around regularization that you can adjust to try to satisfy this second condition. To continue the analogy, if the height of the TV image is not right, you want a different knob to adjust it, and you want that knob to affect the width as little as possible. Getting a bigger training set is another knob you can use, one that helps your learning algorithm generalize better to the development set.
What if it doesn’t meet the third criterion? What if the system does well on the development set, but not well on the test set? If so, then the knob you need to adjust may be a larger development set. Because if it does well on the development set, but not on the test set it probably means that you overfit the development set, and you need to take a step back and use a bigger development set.
Finally, if it does a great job on the test set but doesn’t give your cat app users a great experience, that means you need to go back and change the development set or the cost function. Because if the system does a good job on the test set according to a cost function, but it doesn’t reflect how your algorithm performs in the real world, it means that either your development set distribution is set incorrectly, or your cost function is measuring the wrong metrics.
We’re going to go through each of these examples in a minute, and we’re going to go into more detail about these particular knobs later this week and later next week. So if you don’t understand all the details right now, don’t worry, but I want you to get a sense of the orthogonalization process. You have to be very clear about which of the four problems it is, and what different things you can adjust to try to solve that problem.
When I train neural networks, I tend not to use early stopping. It’s not a bad technique, and a lot of people use it. But I personally find early stopping a little hard to analyze, because this knob affects two things at once: it affects how well you fit the training set, since stopping early means you fit the training set less well, and it is also used to improve development set performance. So it is less orthogonal, like a knob that changes the width and the height of the TV picture at the same time. That’s not to say you shouldn’t use it; you can if you want. But if you have more orthogonal controls, like the others I’ve described, tuning the network becomes much easier.
So I want you to have a sense of what orthogonalization means, just like when you look at pictures on television. If you say, my TV image is too wide, so I’m going to adjust this knob (width knob). Or it’s too high, so I have to adjust that knob. Or it’s too trapezoidal, so I’m going to adjust this knob, and that’s good.
In machine learning, it is very useful if you can look at your system and say: this part is wrong, it doesn’t do well on the training set, or it doesn’t do well on the development set, or it doesn’t do well on the test set, or it does well on the test set but not in the real world. Then you figure out exactly what is going wrong, and you have the right knob, or the right set of knobs, to fix the problem that is limiting the performance of the machine learning system.
That’s what we’re going to talk about this week and next week, how to diagnose where the system performance bottlenecks are. And finding a specific set of knobs that you can use to tweak your system to improve specific aspects of its performance, so let’s start talking about that process in detail.
1.3 Single number Evaluation Metric
Whether you’re tweaking hyperparameters, trying different learning algorithms, or trying different things while building a machine learning system, you’ll find that having a single real-number evaluation metric lets you make progress much faster: it tells you quickly whether the new thing you just tried is better or worse than before. So when teams start machine learning projects, I often recommend that they set up a single real-number evaluation metric for their problem.
Let’s take an example. You’ve heard me say before that applying machine learning is a very empirical process, where we usually take an idea, program it, run experiments, see how it works, and then use those experiments to improve your idea, and then continue the cycle of improving your algorithm.
For example, for your cat classifier, suppose you previously built some classifier A, and by changing the hyperparameters, changing the training set, and so on, you have now trained a new classifier B. One reasonable way to evaluate your classifiers is to look at their precision and recall.
The exact details of precision and recall are not important for this example. In a nutshell, precision is: of the examples that your classifier labels as cats, what fraction actually are cats. So if the classifier has 95% precision, that means that when your classifier says there’s a cat in the picture, there’s a 95% chance it really is a cat.
Recall is: of all the images that really are cats, what percentage does your classifier correctly identify? That is, how many of the pictures that actually contain cats are recognized by the system? If the classifier has 90% recall, it means that of all the real cat images, say in your development set, the classifier correctly identifies 90% of them.
So you don’t have to think too much about the definition of precision and recall. As it turns out, there is often a trade-off between precision and recall, with both indicators taken into account. What you want to get is that when your classifier says something is a cat, there’s a good chance it’s a cat, but for all the pictures that are cats, you also want the system to be able to classify most of them as cats, so it makes sense to evaluate the classifier using precision and recall.
But the problem with using precision and recall as your metrics is that if classifier A does better on recall and classifier B does better on precision, you can’t tell which classifier is better overall. If you’re trying lots of different ideas and lots of different hyperparameters, you want to be able to quickly experiment with not just two classifiers but maybe a dozen, and quickly pick the “best” one so you can keep iterating from there. With two metrics, it is hard to quickly choose one out of two, let alone one out of ten, so I don’t recommend using the pair of metrics, precision and recall, to select a classifier. You need a new metric that combines precision and recall.
In the machine learning literature, the standard way to combine precision and recall is the so-called F1 score; the details of the F1 score aren’t too important. Informally, you can think of it as an average of precision $P$ and recall $R$. Formally, the F1 score is defined by this formula:

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$$
In mathematics, this is called the harmonic mean of precision and recall. Informally, you can still think of it as a kind of average of precision and recall, except that instead of the arithmetic mean you take the harmonic mean defined by this formula. This metric has some advantages in trading off precision against recall.
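To make these definitions concrete, here is a minimal Python sketch (not from the original notes; the function and variable names are illustrative) that computes precision, recall, and the F1 score from 0/1 predictions and labels:

```python
def precision_recall_f1(y_pred, y_true):
    """Precision: of the images predicted to be cats (1), what fraction really are cats.
    Recall: of the images that really are cats, what fraction were predicted to be cats.
    F1: the harmonic mean of precision and recall."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)  # true positives
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)  # false positives
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision > 0 and recall > 0 else 0.0
    return precision, recall, f1

# For example, precision 0.95 and recall 0.90 give F1 = 2 / (1/0.95 + 1/0.90) ≈ 0.924.
```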
In this example, you can then see right away that, say, classifier A has the higher F1 score. Assuming the F1 score is a reasonable way to combine precision and recall, you can quickly pick classifier A and eliminate classifier B.
That’s what I’ve seen in a lot of machine learning teams: you have a well-defined development set on which you measure precision and recall, plus a single numerical evaluation metric, which I sometimes call a single real-number evaluation metric, that lets you quickly tell whether one classifier is better than another. With a development set like this, plus a single real-number evaluation metric, you can iterate fast, and it speeds up the iterative process of improving your machine learning algorithm.
Let’s look at another example. Say you’re developing a cat app to serve cat lovers in four geographic regions: the United States, China, India, and the rest of the world. Suppose your two classifiers get different error rates on data from the four regions, for example one algorithm has a 3% error rate on images uploaded by users in the US, and so on.
Keeping track of how your classifiers perform in different markets and geographies can be useful, but by tracking four numbers it is hard to quickly tell which algorithm is better just by looking at them. If you’re testing a lot of different classifiers, it’s hard to scan a lot of numbers and quickly pick the best one. So in this example, I suggest that in addition to tracking the classifier’s performance in the four regions, you also compute an average. Assuming average performance is a reasonable single real-number metric, you can make quick judgments by looking at the average.
Whichever algorithm has the lowest average error rate is the one you would then continue with. You have to pick an algorithm and keep iterating, and your machine learning workflow tends to be: you have an idea, you implement it, and you see whether it was a good idea.
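As a small illustration of this averaging idea, here is a sketch that picks the algorithm with the lowest average error rate across regions; the per-region error rates below are made-up numbers, not from the notes:

```python
# Hypothetical per-region error rates for three classifiers (illustrative numbers only).
error_rates = {
    "A": {"US": 0.03, "China": 0.07, "India": 0.05, "Other": 0.09},
    "B": {"US": 0.05, "China": 0.06, "India": 0.05, "Other": 0.10},
    "C": {"US": 0.02, "China": 0.03, "India": 0.04, "Other": 0.05},
}

def average_error(per_region):
    """Average the error rates across regions to get a single real-number metric."""
    return sum(per_region.values()) / len(per_region)

best = min(error_rates, key=lambda name: average_error(error_rates[name]))
print(best, round(average_error(error_rates[best]), 3))  # here "C" with average 0.035
```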
So what this video shows is that having a single real-number evaluation metric can really improve your efficiency, or your team’s efficiency, in making these decisions. We haven’t fully discussed how to set up the metric effectively yet. In the next video, I’m going to show you how to set up optimizing and satisficing metrics. Let’s go on to the next video.
1.4 Satisficing and Optimizing Metrics
It’s not always easy to combine everything you care about into a single real-number metric. In those cases, I’ve found it is sometimes useful to set up satisficing and optimizing metrics; let me tell you what I mean.
Suppose you have decided that you care about the classification accuracy of your cat classifier; this could be the F1 score or some other measure of accuracy. But in addition to accuracy, you also care about running time: how long it takes to classify an image. Say classifier A takes 80 milliseconds, classifier B takes 95 milliseconds, and classifier C takes 1500 milliseconds, that is, 1.5 seconds to classify an image.
One thing you could do is combine accuracy and running time into a single overall metric, for example defining an overall cost as something like accuracy minus 0.5 times running time. But this kind of combination may feel too artificial: just using a formula, a linear weighted sum of the two values, to merge accuracy and running time.
Instead, you can do something else: choose the classifier that maximizes accuracy, subject to the requirement that the running time must be at most 100 milliseconds per image. In this case, we say accuracy is the optimizing metric, because you want to maximize it, to be as accurate as possible; and running time is what we call a satisficing metric, meaning it just has to be good enough, here under 100 milliseconds, and once it meets that threshold you don’t care how much better it gets, or at least you don’t care very much. This is a fairly reasonable way to trade off, or combine, accuracy and running time. In reality, as long as the running time is under 100 milliseconds, your users won’t care much whether it is 100 milliseconds or 50 milliseconds or even faster.
Defining an optimizing metric and satisficing metrics gives you a clear way to choose the “best” classifier. In this case, classifier B is the best, because it has the best accuracy of all the classifiers with running time under 100 milliseconds.
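Here is a minimal sketch of this “optimize one metric subject to satisficing constraints” selection rule; the running times follow the example above, while the accuracies are illustrative assumptions:

```python
# Each candidate: (name, accuracy, running_time_ms).
# Accuracy is the optimizing metric; running time <= 100 ms is the satisficing metric.
candidates = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

feasible = [c for c in candidates if c[2] <= 100]   # keep only classifiers meeting the threshold
best = max(feasible, key=lambda c: c[1])            # among those, maximize accuracy
print(best)  # ('B', 0.92, 95): best accuracy among classifiers under 100 ms
```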
More generally, if you are considering several metrics, it sometimes makes sense to pick one of them to optimize: you want to do as well as possible on that metric. The rest are then satisficing metrics: as long as they reach some threshold, say running faster than 100 milliseconds, you don’t care how far beyond the threshold they go, but they do have to reach it.
Here’s another example. Suppose you’re building a system to detect wake words, also known as trigger words, for voice-controlled devices. For the Amazon Echo you say “Alexa,” you say “Okay Google” to wake up a Google device, for Apple devices you say “Hey Siri,” and for some Baidu devices you wake them with “Hello Baidu.”
Right, these are the wake words, which wake up these voice-controlled devices and listen to what you want to say. So you might care about the accuracy of the trigger word detection system, so what is the probability that someone will wake up your device when they say one of the trigger words.
You might also have to take into account the number of false positives: what are the chances that the device wakes up randomly when no one actually said the trigger word? In this case, a sensible way to combine these two measures might be to maximize accuracy, so that when someone says the wake word you maximize the probability that the device wakes up, subject to having at most one false positive every 24 hours, so that on average the device wakes up randomly at most once per day when no one is actually talking to it. In this case, accuracy is the optimizing metric, and at most one false positive every 24 hours is the satisficing metric: you only need the false positive rate to meet that threshold.
To summarize: if you have multiple metrics, you pick one optimizing metric that you want to push as far as possible, and then one or more satisficing metrics that just need to meet some threshold. You then have a fully automated way to look at multiple candidates and pick the “best” one. These metrics must be computed on a training set, a development set, or a test set. So one more thing you need to do is set up a training set, a development set, and a test set. In the next video, I’d like to share some guidelines for setting up training, development, and test sets, and we’ll continue there.
1.5 Training/Development/Test Set Distributions (Train/dev/test distributions)
The way you set up your training, development, and test sets greatly affects how fast you or your team can make progress in building a machine learning application. Even within a large company, setting these data sets up badly can really slow a team down rather than speed it up, so let’s look at how to set them up to maximize your team’s effectiveness.
In this video, I want to focus on how to set up the development set and the test set. The dev set is also called the development set, or sometimes the hold-out cross-validation set. The typical machine learning workflow is that you try out a lot of ideas, train different models on the training set, use the development set to evaluate the different ideas and pick one, keep iterating to improve development set performance until you have a model you’re happy with, and then evaluate it on the test set.
Now, for example, you want to develop a cat classifier, and then you operate in these regions, the United States, the United Kingdom, other European countries, South America, India, China, other Asian countries and Australia, so how do you set up development sets and test sets?
One way to do it is, you can pick four of the regions, and I’m going to use those four (the first four), but you can also pick random regions and say, the data from those four regions make up the development set. And then the other four areas, and I’m going to use these four, or I can pick four at random, make up the test set.
This turns out to be a pretty bad idea, because in this example your development and test sets come from different distributions. I recommend you not do this; instead, have your development and test sets come from the same distribution. Here is what I mean. Remember that setting up your development set, together with a single real-number evaluation metric, is like placing a target and telling your team where the bull’s-eye is. Once the development set and metric are set up, the team can iterate rapidly, trying different ideas, running experiments, and quickly using the development set and the metric to evaluate classifiers and pick the best one. Machine learning teams are generally very good at using different approaches to close in on a target, iterating and getting closer to the bull’s-eye. So you end up optimizing for the metric on the development set.
The problem with setting up the development and test sets as on the left is this: your team may spend months optimizing on the development set, and when you finally evaluate the system on the test set, the data from the other four countries or regions (the test set) may differ greatly from the development set data. You may get an unpleasant “surprise”: after months of optimizing against the development set, the system does not perform well on the test set. So if your development set and test set come from different distributions, it is as if you set a target, your team spent months getting closer to the bull’s-eye, and then after all that work you said, “wait a minute, for the test I’m moving the target over here,” and the team says, “Okay, why did you make us spend months aiming at that bull’s-eye, only to suddenly move it somewhere else?”
So, to avoid this, what I recommend is that you randomly shuffle all your data into the development and test sets, so that both the development and test sets have data from eight regions, and both the development and test sets come from the same distribution, which is all your data mixed together.
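A minimal sketch of that shuffling step (the helper name and the 50/50 split fraction are illustrative, not prescribed by the notes):

```python
import random

def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Randomly shuffle the pooled data from all regions, then split it,
    so that the dev set and the test set come from the same distribution."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)   # mix data from all eight regions together
    cut = int(len(examples) * dev_fraction)
    return examples[:cut], examples[cut:]   # (dev set, test set)
```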
Here’s another example, this is a true story, but some details have changed. So I know there was a machine learning team that spent months optimizing on a development set that had loan approval data for middle-income zip codes. So the specific machine learning question is, the input is the loan application, can you predict the output, will they be able to repay the loan? So the system helps the bank decide whether to approve a loan. So the development set comes from loan applications, and those loan applications come from middle-income zip codes, zip codes for the United States. But after a few months of training on this, the team suddenly decided to test it out on low-income zip code data. Of course, the middle income and low income zip codes in this distribution are very different, and they spend a lot of time optimizing their classifiers for the first set of data, resulting in poor performance for the second set. So this particular team actually wasted three months and had to go back and redo a lot of the work.
What actually happened here is that the team spent three months aiming at one target, and then three months later the manager suddenly said, “How about you try aiming at that target?” The new target position was completely different, so it was very devastating for the team.
So I would suggest that when you set up your development set and your test set, you choose them to reflect the data you expect to get in the future, the data you consider important and need to do well on, and in particular, the development set and the test set should come from the same distribution. Whatever data you expect to get in the future once your algorithm is working well, try to collect data like that and, whatever it is, randomly assign it to the development set and the test set. That way, you are aiming at the target you actually care about, and your team can iterate efficiently toward that same goal.
We haven’t talked about setting up a training set yet, and we’ll talk about setting up a training set in a future video, but the point of this video is that setting up a development set and evaluating metrics really defines what you want to aim for. We hope that by having a development set and a test set in the same distribution, you can target what you want your machine learning team to target. And the way you set up your training set affects how fast you get to that goal, but we’ll talk about that in another lecture. I know of machine learning teams that could save months of work if they followed this guideline, so I hope it helps you, too.
And then, actually the size of your development set and your test set, how you choose to size them, is changing in the era of deep learning, and we’ll talk about that in the next video.
1.6 Size of dev and test sets
In the last video you saw why your development and test sets have to come from the same distribution, but how big should they be? In the era of deep learning, the guidelines for setting up development and test sets are changing, so let’s take a look at some of the best practices.
You’ve probably heard of the rule of thumb in machine learning, which is to divide all the data you get into training sets and test sets on a 70/30 scale. Or if you had to set up a training set, a development set, and a test set, you would do 60% training set, 20% development set, and 20% test set. In the early days of machine learning, this was quite reasonable, especially since data sets were much smaller in the past. So if you have a total of 100 samples, then the rule of thumb for 70/30 or 60/20/20 is pretty reasonable. If you had thousands of samples or you had 10,000 samples, it would still make sense.
But in modern machine learning, we are used to working with much larger data sets. If you have, say, a million training examples, it might make more sense to split it 98% training set, 1% development set (dev), and 1% test set. With 1 million examples, 1% is 10,000 examples, which is probably plenty for the dev set and for the test set. So in the modern deep learning era, when we sometimes have much larger data sets, it is quite reasonable to use much less than 20% or 30% of the data for the dev and test sets. And because deep learning algorithms have such a huge appetite for data, for problems with large data sets we tend to see a higher fraction of the data going into the training set. What about the test set?
Keep in mind that the purpose of the test set is to help you evaluate the performance of your final system once development is finished. The guideline is to make the test set large enough to assess overall system performance with high confidence. So unless you need a very precise measurement for the final production system, the test set generally does not need millions of examples. For your application, 10,000 examples might give you enough confidence in the performance estimate, or perhaps 100,000, which could be far less than, say, 30% of the total data set, depending on how much data you have.
For some applications, you may not need a highly confident estimate of overall system performance; you may only need a training set and a development set, and I think skipping a separate test set can be acceptable. In fact, in practice some people split their data only into a training set and a “test set” and then actually iterate on that “test set.” In that case there is really no test set: what they have is a training set and a development set. If you are actually tuning on that data set, it is better called a development set.
In the history of machine learning, not everyone has been rigorous about this terminology; sometimes what people call a development set is really being used as a test set, and vice versa. But if all you need is data to train on and data to tune on, and you plan to deploy the final system without worrying much about an unbiased estimate of its actual performance, then it is fine to just call them the training set and the development set, and to be clear that you have no test set. Is that a little unusual? I definitely don’t recommend skipping the test set when building a system, because I feel more comfortable with a separate test set: you can use it to get an unbiased measure of the system’s performance. But if your development set is very large, so that you are unlikely to overfit it too badly, having only a training set and a development set is not entirely unreasonable. Still, I generally don’t recommend it.
In summary, in the era of big data the old 70/30 rule of thumb no longer applies. The trend is to put most of the data into the training set and a small fraction into the development set and the test set, especially when you have a very large data set. The guideline is to make the development set big enough for its purpose, which is to help you evaluate different ideas and decide which is better. The purpose of the test set is to give you a high-confidence, unbiased estimate of your final system’s performance, and you only need a test set large enough to do that, which may be far less than 30% of your total data.
So I hope this video gives you some guidance and advice on how to set up development and test sets in the era of deep learning. Sometimes, in the course of working on a machine learning problem, you may need to change your evaluation metric, or change your development set or your test set; we’ll talk about when to do that next.
1.7 When is it time to change development/test sets and metrics? (When to change dev/test sets and metrics)
You’ve learned how to set development sets and metrics, like setting goals in a certain location for your team to aim at. But sometimes in the middle of a project, you may realize that the target is in the wrong place. In this case, you should move your target.
Let’s take an example. Say you’re building a cat classifier that tries to find lots of pictures of cats to show to your cat-loving users, and the metric you decide to use is classification error. So algorithm A has a 3% error rate and algorithm B has a 5% error rate, and algorithm A seems to be doing better.
But when you actually try these algorithms out, you find that algorithm A, for some reason, is classifying a lot of pornographic images as cats. If you deploy algorithm A, users will see more cat images, because it only has a 3% error rate, but it will also show some pornographic images to users, which is totally unacceptable to your company and to your users. By comparison, algorithm B has a 5% error rate, so it misclassifies more images, but it does not let pornographic images through. So from your company’s point of view, and from the users’ point of view, algorithm B is actually the better algorithm, because it doesn’t let any pornographic images through.
So in this case, algorithm A does better on the metric, with a 3% error rate, but it is actually the worse algorithm. The metric together with the development set both prefer algorithm A, because they say: look, algorithm A has the lower error rate according to the metric you set for yourself. But you and your users prefer algorithm B, because it doesn’t classify pornographic images as cats. When this happens, when your metric no longer correctly ranks the algorithms, and here the original metric incorrectly says that algorithm A is better, that is a signal that you should change the metric, or change the development set or the test set. In this case, the classification error metric you have been using can be written like this:

$$\text{Error} = \frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} \mathbb{1}\{ y_{pred}^{(i)} \neq y^{(i)} \}$$
Here $m_{dev}$ is the number of examples in your development set, $y_{pred}^{(i)}$ is the predicted value, which is 0 or 1, and the symbol $\mathbb{1}\{\cdot\}$ is an indicator function that counts the examples for which the expression inside is true, so this formula counts the misclassified examples. The problem with this metric is that it treats pornographic and non-pornographic images the same, whereas you really want the classifier not to mislabel pornographic images. If it classifies a pornographic image as a cat and shows it to an unsuspecting user, that user will be very unhappy.
One way to modify the evaluation metric is to add a weight term $w^{(i)}$, i.e.:

$$\text{Error} = \frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} w^{(i)} \, \mathbb{1}\{ y_{pred}^{(i)} \neq y^{(i)} \}$$
We define the weight as

$$w^{(i)} = \begin{cases} 1 & \text{if } x^{(i)} \text{ is not pornographic} \\ 10 & \text{if } x^{(i)} \text{ is pornographic} \end{cases}$$

where the weight for pornographic images could even be 100. Giving pornographic images a much larger weight makes the error rate grow much faster when the algorithm classifies a pornographic image as a cat. In this example, misclassifying a pornographic image as a cat is penalized 10 times more heavily.
If you want the error rate to remain normalized between 0 and 1, then technically you should divide by the sum of the weights rather than by $m_{dev}$, i.e.:

$$\text{Error} = \frac{1}{\sum_{i=1}^{m_{dev}} w^{(i)}} \sum_{i=1}^{m_{dev}} w^{(i)} \, \mathbb{1}\{ y_{pred}^{(i)} \neq y^{(i)} \}$$
The details of the weighting are not that important. What matters is that to use such a weighting, you have to go through your development and test sets yourself and label the pornographic images, so that the weighting function can be applied.
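As a sketch of the weighted error above (assuming you have already labeled which dev set images are pornographic; the function and argument names are illustrative):

```python
import numpy as np

def weighted_error(y_pred, y_true, is_porn, porn_weight=10.0):
    """Weighted misclassification rate: mistakes on pornographic images count
    porn_weight times more, and the result is normalized by the total weight
    so it stays between 0 and 1."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    w = np.where(np.asarray(is_porn, dtype=bool), porn_weight, 1.0)  # w_i = 10 for porn, 1 otherwise
    mistakes = (y_pred != y_true).astype(float)                      # indicator of misclassification
    return float(np.sum(w * mistakes) / np.sum(w))
```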
The rough conclusion is that if your metric no longer ranks algorithms in the order you actually prefer, you should spend the time to define a new metric; the weighting above is just one possible way to define it. The point of an evaluation metric is to tell you, precisely, which of two classifiers is more suitable for your application. For this video, you don’t need to focus on exactly how the new error metric is defined. The key point is that if you are unhappy with your old error metric, don’t keep using a metric you don’t trust; define a new one that better captures your preferences and better reflects which algorithm is actually more suitable.
You may have noticed that so far we’ve only discussed how to define a metric to evaluate classifiers: we defined a metric that helps us better rank classifiers by how well they avoid misclassifying pornographic images. This is actually an example of orthogonalization.
I think when you’re dealing with machine learning, you should break it up into separate steps. One step is to figure out how to define a metric that measures the performance of what you want to do, and then we can think separately about how to improve the system’s performance on that metric. You need to think of machine learning tasks as two separate steps, and to use the goal metaphor, the first step is to set a goal. So defining what you want to aim at, that’s a completely separate step, that’s a knob that you can adjust. How you set the target is a completely separate problem, but think of it as a single knob, a knob that can debug the performance of the algorithm, how you aim accurately, how you hit the target, defining the metrics is the first step.
Then the second step is a separate problem: how to actually do well on this metric. As you aim at the target, your learning algorithm typically optimizes some cost function, for example minimizing the loss on the training set:

$$J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$

One thing you can do is modify this cost function to introduce the weights, and also change the normalization constant:

$$J = \frac{1}{\sum_{i} w^{(i)}} \sum_{i=1}^{m} w^{(i)} \, \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$
Again, the exact definition is not important; the key is the idea of orthogonalization: setting the target is one step, and aiming at and hitting the target is an independent second step. In other words, I encourage you to treat defining the metric as one step, and then, once the metric is defined, to figure out separately how to optimize the system to improve its score on that metric, for example by changing the cost function that your neural network optimizes.
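In the same spirit, here is a sketch of a weighted training cost, a per-example weighted cross-entropy normalized by the total weight; this is only an illustration of the idea, not the course’s exact implementation:

```python
import numpy as np

def weighted_log_loss(y_hat, y, w):
    """Weighted cross-entropy: examples with larger weight w (e.g. pornographic
    images) contribute more to the cost; dividing by sum(w) keeps the scale stable."""
    y_hat = np.clip(np.asarray(y_hat, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return float(np.sum(w * losses) / np.sum(w))
```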
Let’s do one more example before we move on. Suppose your two cat classifiers A and B have, respectively, a 3% and a 5% error rate when evaluated on your development set, or even on a test set of high-quality, professionally framed images downloaded from the Internet. But when you deploy the algorithm in your product, you might find that algorithm B actually looks better, even though A did better on the development set. That can happen because you have been training and evaluating on beautiful high-quality images downloaded from the Internet, but when you deploy to the mobile app, the algorithm runs on images uploaded by users: photos that are not professionally shot, poorly framed, sometimes not even showing the whole cat, with cats making strange expressions, or simply blurry. When you test the algorithm in the real world, you find that algorithm B actually performs better.
This is another example where metrics and dev sets and test sets go wrong, and the problem is, you’re doing your evaluation with beautiful high resolution dev sets and test sets, and professional picture framing. But what your users really care about is whether the images they upload can be correctly identified. They’re probably not professional photos, a little blurry, amateur framing.
So the guideline is: if doing well on your metric, on your current development set or dev/test distribution, does not correspond to doing well on the application you actually care about, then you need to change your metric or your dev/test sets. In other words, if your dev/test sets are full of high-quality images but the evaluations you do on them do not predict how your application will actually perform, because your application has to handle lower-quality images, then it is time to change your dev/test sets so that your data better reflects what you actually need to handle.
But the general guideline is that if your current metrics and the data you are currently evaluating are not relevant to what you really care about doing well, you should change your metrics or your development test set to better reflect the data your algorithm needs to process.
Having an evaluation metric and a development set lets you make faster decisions about whether one algorithm is better than another, which really speeds up iteration for you and your team. So my advice is: even if you can’t define the perfect evaluation metric and development set, set something up quickly and use it to drive your team’s iteration speed. If you later find that you made a bad choice and have a better idea, you can change it then. For most teams, I recommend not running for too long without an evaluation metric and a development set, because that can slow down your team’s ability to iterate and improve the algorithm. This video was about when you need to change your evaluation metric and dev/test sets; I hope these guidelines help your whole team set a clear target that you can iterate toward efficiently.
1.8 Why Compare with Human-level Performance? (Why human-level performance?)
Over the past few years, more machine learning groups have been talking about how machine learning systems compare with human performance. Why?
I think there are two main reasons. First, because of advances in deep learning, machine learning algorithms have suddenly gotten much better, and in many applications algorithms are starting to compete with human-level performance. Second, it turns out that when you’re trying to get a machine to do something humans can do, you can design the workflow of a machine learning system much more efficiently, so it is natural in these situations to compare machines with humans, or to want machines to imitate human behavior.
Let’s look at what this tends to look like. In many machine learning tasks, think of the horizontal axis as time, possibly many months or even years, during which some team or research community works on a problem. Progress tends to be rapid as you approach human-level performance. But over time, once the algorithm performs better than humans, the gains in accuracy slow down. Performance may keep improving after exceeding the human level, but the slope of the curve, the rate at which accuracy improves, becomes flatter and flatter, and it approaches some theoretical limit that can never be exceeded, known as the Bayes optimal error rate. The Bayes optimal error rate is generally considered to be the best possible error rate in theory: there is no way to design a function that beats that level of accuracy.
For example, for speech recognition, if it’s an audio clip, some of the audio is so noisy that it’s almost impossible to know what’s being said, so perfect accuracy may not be 100%. Or for cat recognition, maybe some images are so blurry that neither a human nor a machine can tell if a cat is in the picture. So, perfect accuracy may not be 100%.
The Bayes optimal error rate is sometimes written simply as the Bayes error rate, omitting “optimal”; it is the error rate of the theoretically best possible function mapping from x to y, and it can never be surpassed. So it should come as no surprise that no matter how many years you work on a problem, you will never beat the Bayes error rate, the Bayes optimal error rate.
It turns out that machine learning progress tends to be quite fast until you surpass human-level performance, and once you do, it sometimes slows down. I think there are two reasons for this. One is that on many tasks, human-level performance is not far from the Bayes optimal error rate: people are very good at looking at images and telling whether there is a cat, or at transcribing audio. So once you surpass human-level performance, there may not be much room left to improve. The second reason is that as long as your performance is worse than human performance, there are certain tools you can use to improve it; once you go beyond human-level performance, those tools no longer work as well.
What I mean is this: for tasks that humans are quite good at, including looking at pictures and recognizing objects, transcribing audio, or reading language, humans tend to be very good at processing this kind of natural data. For such tasks, as long as your machine learning algorithm is still worse than humans, you can have people label data for you, or pay people to label examples for you, so you have more data to feed to the learning algorithm. We’ll talk about error analysis next week, but as long as humans still outperform the algorithm, you can have people look at the examples your algorithm gets wrong and try to understand why a human gets them right while the algorithm gets them wrong; as we’ll see next week, this helps improve the algorithm’s performance. You can also do a better analysis of bias and variance, which we’ll discuss shortly. As long as your algorithm is still worse than humans, you have these important strategies for improving it; once your algorithm is better than humans, all three of these strategies become much harder to exploit. So that is perhaps another benefit of comparing with human-level performance, especially on tasks that humans do well.
That is why machine learning algorithms tend to be very good at imitating things humans can do, and then catching up to and sometimes surpassing human-level performance. In particular, even when you know what the bias and the variance are, knowing how well humans do on a task helps you decide whether you should focus on reducing bias or on reducing variance. I’ll give you an example of that in the next video.
1.9 Avoidable Bias
We talked about how you want your learning algorithm to do well on the training set, but sometimes you don’t actually want it to do too well. Knowing what human-level performance is can tell you exactly how well, or how badly, the algorithm should be expected to perform on the training set; let me show you what I mean.
We’ll use the cat classifier as our example again. Suppose humans have near-perfect accuracy on this task, so human-level error is 1%. In that case, if your learning algorithm has an 8% training error rate and a 10% development error rate, you probably want to do better on the training set. The fact that your algorithm performs so much worse on the training set than human-level performance suggests that it is not even fitting the training set well. So from the perspective of bias-and-variance-reduction tools, in this case I would focus on reducing bias: for example, train a bigger neural network, or run gradient descent longer, and see if you can do better on the training set.
Now let’s keep the same training error rate and development error rate, but suppose human-level performance is different: say the human-level error is actually 7.5%, perhaps because the images in your data set are so blurry that even humans cannot tell whether there is a cat in the picture. This example is a little contrived, because humans are actually very good at looking at photos and telling whether there is a cat; but suppose, for the sake of the example, that your images are so blurry and low resolution that even humans have a 7.5% error rate. In this case, even though your training and development error rates are the same as before, you know your system is doing roughly fine on the training set: it is only slightly worse than human-level performance. In this second example, you would want to focus on reducing the gap between training error and development error, that is, on reducing the variance of the learning algorithm, perhaps by trying regularization to bring your development error rate closer to your training error rate.
In our earlier discussion of bias and variance, we basically assumed tasks where the Bayes error rate is close to zero. To explain what is happening here: for this cat classifier, think of the human-level error rate as an estimate of, or a stand-in for, the Bayes error rate, the Bayes optimal error rate. For computer vision tasks this substitution is quite reasonable, because humans are very good at computer vision, so human-level performance is not far from the Bayes error rate. By definition, the human-level error rate is a little higher than the Bayes error rate, since the Bayes error rate is the theoretical limit, but it is usually not far off. So what is surprising here is how much your decision depends on what the human-level error rate is, or, treating it as really close to the Bayes error rate as we will assume, on what level of error we believe is achievable.
In the two cases, with the same training and development error rates, we decided to focus on a bias-reduction strategy or a variance-reduction strategy respectively. What is happening in the example on the left? An 8% training error rate is really high if you believe it could be brought down to around 1%, so bias-reduction measures could help. In the example on the right, if you believe the Bayes error rate is about 7.5% (here we use human-level error as a proxy for the Bayes error rate), then you know there is not much room to keep reducing your training error rate: you don’t want it to go much below 7.5%, because the only way to get there would be to overfit. On the other hand, there is much more room for improvement in closing the 2% gap between the training error and the development error, and variance-reduction techniques such as regularization, or collecting more training data, should help.
I’m going to give these concepts names; this isn’t widely used terminology, but I find it a helpful way to think. The difference between the Bayes error rate, or our estimate of it, and the training error rate is called the avoidable bias. You want to keep improving your training set performance until you get close to the Bayes error rate, but you don’t actually want to do better than the Bayes error rate, which in theory is impossible without overfitting. And the difference between the training error rate and the development error rate roughly tells you how much room for improvement there is on the variance problem.
The term “avoidable bias” acknowledges that there is some bias, some minimum error rate, that simply cannot be beaten: if the Bayes error rate is 7.5%, you don’t actually want to get below that level. So you wouldn’t say that your 8% training error rate measures the bias in that example; rather, the avoidable bias is about 0.5%, and that 0.5% is the measure of avoidable bias, while the 2% gap is the measure of variance. There is therefore much more room for improvement in reducing the 2% than the 0.5%. In contrast, in the example on the left, the 7% measures the avoidable bias and the 2% measures the variance, so there is much more potential in focusing on reducing the avoidable bias.
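A minimal sketch of this diagnosis, using human-level error as a stand-in for the Bayes error rate and the numbers from the two scenarios above:

```python
def diagnose(human_error, train_error, dev_error):
    """Compare avoidable bias with variance to decide where to focus next."""
    avoidable_bias = train_error - human_error   # gap to the (estimated) Bayes error
    variance = dev_error - train_error           # gap between train and dev performance
    focus = "reduce bias" if avoidable_bias > variance else "reduce variance"
    return avoidable_bias, variance, focus

print(diagnose(0.010, 0.08, 0.10))  # human level 1%:   bias 7%   > variance 2% -> reduce bias
print(diagnose(0.075, 0.08, 0.10))  # human level 7.5%: bias 0.5% < variance 2% -> reduce variance
```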
So once you understand the human-level error rate, and hence your estimate of the Bayes error rate, you can focus on different strategies in different scenarios: a bias-reduction strategy or a variance-reduction strategy. There is more subtlety in how to take human-level performance into account when deciding what to focus on, so in the next video we’ll dig into what human-level performance really means.
1.10 Understanding Human-level Performance
Human level performance is a term that gets thrown around a lot in papers, but I’m going to give you a more accurate definition of it, and in particular, using human level performance can help you move your machine learning projects forward. Remember in the last video, we used the term “human-level error rate” to estimate bayesian error, which is the lowest theoretical error rate, the lowest that any function, whether it’s now or in the future, can go to. So let’s keep that in mind and then look at the medical image classification example.
Suppose you need to look at radiology images like these and make a classification diagnosis. Suppose a typical, untrained human achieves a 3% error rate on this task; a typical doctor, say a typical radiologist, achieves 1%; an experienced doctor does better, with a 0.7% error rate; and a team of experienced doctors, that is, a group of experienced doctors who each examine the image, discuss it, and debate it, reaches a consensus that is wrong about 0.5% of the time. So my question to you is: how should you define the human-level error rate? Is it 3%, 1%, 0.7%, or 0.5%?
You can also pause the video and think about it, and to answer this question, I want you to remember that one of the most useful ways to think about human-level error rates is as a substitute or an estimate for Bayesian error rates. If you want, you can pause the video and think about it.
I'll go straight to how I would define the human-level error rate here: if your goal is to substitute for or estimate the Bayes error rate, then the relevant number is the 0.5% that a team of experienced doctors can reach after discussion and debate. We know the Bayes error rate is no higher than 0.5%, because some system, namely this team of doctors, achieves a 0.5% error rate, so by definition the optimal error rate has to be 0.5% or lower. We don't know how much lower; perhaps an even larger team of even more experienced doctors could do a bit better than 0.5%. But since we know the optimal error rate cannot be above 0.5%, in this setting I can use 0.5% as my estimate of the Bayes error rate. So I would define human-level performance as 0.5%, at least if you intend to use the human-level error rate to analyze bias and variance as we did in the last video.
Now, for the purpose of publishing a research paper or deploying a system, the appropriate definition of the human-level error rate might be different. Perhaps 1% is the right reference, as long as your goal is to exceed the performance of a typical doctor; if you can surpass a single radiologist, the system may already be useful enough to deploy in some settings.
The point of this video is to be clear about your goal when you define the human-level error rate. If you want to show that you can surpass a single human, and that in some settings this justifies deploying your system, then that definition may be appropriate. But if your goal is a proxy for the Bayes error rate, then the definition above (a team of experienced doctors, 0.5%) is the appropriate one.
To see why this matters, let's work through an error analysis example. Say that in the medical image diagnosis example your training error rate is 5% and your development error rate is 6%. From the last slide, human-level performance, which I treat as a substitute for the Bayes error rate, is 1%, 0.7%, or 0.5% depending on whether you define it as the performance of a typical individual doctor, an experienced doctor, or a team of doctors. Also recall from the previous video that the difference between the Bayes error rate (or its estimate) and the training error rate measures the avoidable bias, while the gap between the training error rate and the development error rate measures, or estimates, how much of a variance problem your learning algorithm has.
So in this first example, whichever definition you choose, the avoidable bias is roughly 4% to 4.5%: if you take 1% as human level it is 4%, and if you take 0.5% it is 4.5%. The gap between the training error and the development error is 1%. In this case, no matter how you define the human-level error rate (a single typical doctor, a single experienced doctor, or a team of experienced doctors), an avoidable bias of 4% or 4.5% is clearly a much bigger problem than the variance, so you should focus on bias-reducing techniques such as training a larger network.
Now look at the second example: your training error rate is 1% and your development error rate is 5%. Here it hardly matters, it is more of an academic question, whether human-level performance is 1%, 0.7%, or 0.5%, because whichever definition you use, the avoidable bias (the gap between the human level and the training error rate) is somewhere between 0% and 0.5%, while the gap between the training and development error rates is 4%. That 4% is far larger than the avoidable bias under any definition, which suggests you should mainly use variance-reducing tools such as regularization or collecting a larger training set.
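Applying the same arithmetic to these two cases gives a rough rule of thumb for where to focus first. The helper below is a sketch of my own, using 0.5% as the human-level proxy; it is not from the course.

```python
def next_focus(human_level, train_error, dev_error):
    """Rough rule of thumb: attack whichever gap is larger."""
    avoidable_bias = train_error - human_level
    variance = dev_error - train_error
    return "reduce bias" if avoidable_bias > variance else "reduce variance"

print(next_focus(0.005, 0.05, 0.06))  # 4.5% bias vs 1% variance -> 'reduce bias'
print(next_focus(0.005, 0.01, 0.05))  # 0.5% bias vs 4% variance -> 'reduce variance'
```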
So when does the choice of definition really matter?
Suppose your training error rate is 0.7%, so you are already doing quite well, and your development error rate is 0.8%. In this case it really does matter that you use 0.5% as your estimate of the Bayes error rate, because then the avoidable bias you measure is 0.2%, twice the size of the 0.1% variance. This suggests both bias and variance are issues, but the avoidable bias is the bigger one. As discussed on the last slide, 0.5% is the best estimate of the Bayes error rate here, because a team of human doctors achieves it. If instead you used 0.7% as your Bayes error rate proxy, the measured avoidable bias would be essentially 0% and you might ignore it, when in fact you should be trying to do better on the training set.
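To make the sensitivity concrete, here is the same subtraction with the numbers just quoted (training error 0.7%, development error 0.8%), once with 0.5% and once with 0.7% as the Bayes-error proxy; the loop and the rounding are mine, not the course's.

```python
train_error, dev_error = 0.007, 0.008
for proxy in (0.005, 0.007):  # team of experienced doctors vs a single experienced doctor
    avoidable_bias = round(train_error - proxy, 4)
    variance = round(dev_error - train_error, 4)
    print(f"proxy={proxy}: avoidable bias={avoidable_bias}, variance={variance}")
# proxy=0.005: avoidable bias=0.002, variance=0.001 -> bias is the bigger issue
# proxy=0.007: avoidable bias=0.0,   variance=0.001 -> you might wrongly ignore bias
```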
I hope that gives you a little bit of an idea of why it gets harder and harder to make progress on machine learning problems, as you approach the human level.
In this example, once you are close to a 0.7% error rate, unless you estimate the Bayes error rate very carefully you may have no way of knowing how far you are from it, and therefore how hard you should push to reduce the avoidable bias. In fact, if all you knew was that a single typical doctor achieves a 1% error rate, it would be hard to tell whether you should keep trying to fit the training set better. This problem only arises once your algorithm is already doing very well, only when you are at 0.7% or 0.8%, close to human level.
In the two examples on the left, when you are still far from human level, it is easier to decide whether to target bias or variance. That explains why, as you approach human level, it becomes harder to tell whether the problem is bias or variance, and why progress on a machine learning project gets harder once you are already doing well.
To summarize what we have been discussing: if you want to understand bias and variance on a task that humans do very well, you can estimate the human-level error rate and use it to estimate the Bayes error rate. The gap between your Bayes error rate estimate and the training error rate then tells you how large the avoidable bias problem is, and the gap between the training error rate and the development error rate tells you how large the variance problem is, that is, whether your algorithm generalizes from the training set to the development set.
The big difference from earlier lectures is that previously you compared the training error rate against 0% and used that difference directly to estimate the bias. In this video, by contrast, the analysis is more nuanced: it does not assume you should reach a 0% error rate, because sometimes the Bayes error rate is nonzero and it is essentially impossible to get below a certain threshold. In earlier lectures we measured the training error rate, looked at how much higher it was than 0%, and used that gap to estimate the bias. That works for problems where the Bayes error rate is nearly zero, such as recognizing cats, where humans are close to perfect and so the Bayes error rate is close to zero. But when the data is very noisy, for example speech recognition on audio with a lot of background noise, it is sometimes nearly impossible to hear what was said and transcribe it correctly. For such problems, a better estimate of the Bayes error rate is needed so that you can better estimate the avoidable bias and the variance, and thus make better decisions about whether to pursue a bias-reduction or a variance-reduction strategy.
As a review, having a rough estimate of human-level performance gives you an estimate of the Bayes error rate, which lets you decide more quickly whether to focus on reducing your algorithm's bias or its variance. This decision procedure tends to work well until your system starts to outperform humans, at which point you no longer have a good Bayes error estimate to help you make these decisions quickly.
Now, one of the exciting developments in deep learning is that for an increasing number of tasks, our systems can actually outperform humans. In the next video, let’s continue talking about the process of surpassing the human level.
1.11 Surpassing Human-level Performance
A lot of teams get excited when machines outperform humans at certain recognition and classification tasks, so let’s talk about that and see if you can do it yourself.
We talked about machine learning progress, which gets slower and slower as it approaches or exceeds the human level. Let’s give an example of why that might be.
Suppose you have a problem where a team of human experts, after thorough debate, achieves a 0.5% error rate, a single human expert achieves 1%, and your trained algorithm has a 0.6% training error rate and a 0.8% development error rate. What is the avoidable bias in this case? This one is relatively easy to answer: 0.5% is your estimate of the Bayes error rate, so the avoidable bias is 0.6% minus 0.5%, about 0.1%. You don't use the 1% figure as the reference; you use this gap. The variance is then 0.8% minus 0.6%, or 0.2%, so there may be more room to reduce the variance than to reduce the avoidable bias.
But now consider a harder example, where the team of human experts and the single expert perform the same as before, but your algorithm reaches a 0.3% training error rate and a 0.4% development error rate. Now what is the avoidable bias? It is genuinely hard to say: given that your training error rate is 0.3%, does that mean you have overfit by 0.2%, or is the Bayes error rate actually 0.1%? Or maybe it is 0.2%, or 0.3%? You really don't know. Based on the information in this example, you do not have enough to decide whether to focus on reducing bias or reducing variance when optimizing your algorithm, so you will make progress less efficiently. Moreover, once your error rate is already lower than that of a well-debated team of human experts, it is hard to rely on human intuition to tell you how the algorithm could be improved further. So in this example, once you cross the 0.5% threshold, there is no longer an obvious option or direction for further optimizing your machine learning problem. That doesn't mean you can't make progress; you may still make significant progress. But some of the tools that point you in the right direction are no longer as useful.
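In code terms, once the training error drops below your human-level proxy, the avoidable-bias estimate goes negative and stops being informative, while the variance can still be read off as before. The snippet below is my own illustration using the numbers from this second example (team of experts 0.5%, training error 0.3%, development error 0.4%).

```python
human_proxy, train_error, dev_error = 0.005, 0.003, 0.004
print(round(train_error - human_proxy, 4))  # -0.002: the proxy no longer bounds the Bayes error usefully
print(round(dev_error - train_error, 4))    #  0.001: the variance estimate still works as before
```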
Now, there are many machine learning problems where performance goes well beyond the human level. For example, in online advertising, estimating how likely a user is to click on an ad, learning algorithms are probably already better than any human. The same goes for product recommendations, recommending movies or books to you; I think today's websites do this better than even your closest friends. Also logistics predictions: how long it takes to drive from A to B, or how long it will take a delivery vehicle to arrive. Or predicting whether someone will repay a loan, so you can decide whether to approve the application. On all of these problems, I think machine learning today far surpasses the performance of a single human.
Note that all four of these examples learn from structured data: a database of user click histories, a database of purchase histories, a database of travel times from A to B, or a database of previous loan applications and their outcomes. These are not natural perception problems; they are not computer vision, speech recognition, or natural language processing tasks. Humans tend to be extremely good at natural perception tasks, so it can be much harder for computers to surpass them there.
Finally, on all of these problems the machine learning team has access to a huge amount of data; the best systems in those four applications have probably seen far more data than any human could, which makes it relatively easy to build systems that exceed human-level performance. Computers can sift through so much data that they can spot statistical patterns more keenly than humans can.
Beyond these problems, there are now speech recognition systems that exceed human-level performance, and some computer vision and image recognition tasks where computers have surpassed humans as well. But because humans are so good at these natural perception tasks, it has been much harder for computers to reach that level. There are also medical tasks, such as reading ECGs or diagnosing skin cancer, and certain areas of radiology, where computers do very well, perhaps beyond the level of a single human.
One of the exciting aspects of recent progress in deep learning is that even on natural perception tasks, computers can already outperform humans in some cases, although it is certainly harder, because humans are generally very good at these natural perception tasks.
So surpassing human-level performance is not always easy, but with enough data there are already many deep learning systems that outperform humans on single supervised learning problems, so there is a real chance that the application you are developing can get there too. I hope that one day you will build deep learning systems that exceed human level.
1.12 Improving Your Model Performance
You have learned about orthogonalization, how to set up development and test sets, how to use human-level error rates to estimate the Bayes error rate, and how to estimate avoidable bias and variance. Let's now pull all of this together into one set of guidelines for improving the performance of your learning algorithm.
Getting a supervised learning algorithm to work well basically means being able to do two things. First, the algorithm fits the training set well; roughly speaking, it achieves a low avoidable bias. Second, its performance on the training set carries over to the development set and the test set, meaning the variance is not too large.
In the spirit of orthogonalization, there is one set of knobs for fixing avoidable bias problems, such as training a larger network or training for longer, and a separate set of techniques for dealing with variance problems, such as regularization or gathering more training data.
To summarize the steps from the last few videos: if you want to improve the performance of your machine learning system, I suggest first looking at the gap between the training error rate and the Bayes error rate estimate, which tells you how much avoidable bias there is, in other words how much better you think you can do on the training set. Then look at the gap between the development error rate and the training error rate to see how large the variance problem is, in other words how much work is needed to carry the algorithm's performance over from the training set to the development set, which the algorithm was not trained on.
If you want to do everything you can to reduce avoidable bias, I suggest strategies such as using a larger model so that the algorithm fits the training set better, or training for longer. You can use better optimization algorithms, such as Momentum or RMSprop, or a better algorithm such as Adam. You can also look for a better neural network architecture or better hyperparameters: change the activation functions, change the number of layers or hidden units (although this may make the model larger), or try other architectures such as recurrent neural networks and convolutional neural networks, which we will cover later in the course. It is sometimes hard to tell in advance whether a new architecture will fit your training set better, but sometimes a different architecture gives much better results.
When variance is the problem, there are also many techniques you can try, including the following: you can collect more data, because training on more data helps you generalize to development set data the system has not seen. You can try regularization, including L2 regularization, dropout, or data augmentation, as discussed in earlier lectures. You can also experiment with different neural network architectures and run a hyperparameter search to see whether that helps you find an architecture better suited to your problem.
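As a compact memory aid, the sketch below restates this guideline as a lookup from the dominant problem to the techniques listed above; the function name, the tie-break rule, and the wording are mine, not the course's.

```python
TACTICS = {
    "avoidable bias": [
        "train a bigger model",
        "train longer / use a better optimizer (Momentum, RMSprop, Adam)",
        "search for a better architecture or hyperparameters",
    ],
    "variance": [
        "collect more data",
        "regularization (L2, dropout, data augmentation)",
        "search for a better architecture or hyperparameters",
    ],
}

def suggest(human_level, train_error, dev_error):
    """Return which gap dominates and the corresponding techniques."""
    avoidable_bias = train_error - human_level
    variance = dev_error - train_error
    key = "avoidable bias" if avoidable_bias >= variance else "variance"
    return key, TACTICS[key]

print(suggest(0.005, 0.05, 0.06))  # bias-dominated example from earlier in this week
print(suggest(0.005, 0.01, 0.05))  # variance-dominated example
```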
I think these concepts of bias, avoidable bias and variance are easy to learn and hard to master. If you can systematically and comprehensively apply the concepts in this week’s lecture, you can actually improve the performance of machine learning systems more efficiently, systematically, and strategically than many existing machine learning teams.
References