Structured machine learning projects, week 1. Orthogonalization (convenient tuning parameters): set up your tuning knobs along orthogonal dimensions, so that adjusting one parameter does not, or almost does not, affect the parameters in the other dimensions. In a machine learning project, this makes it much easier and faster to tune each parameter to a better value.

Iteration is the activity of repeating a feedback process, usually in order to get closer to a desired goal or result. Each pass through the process is one iteration, and the result of each iteration serves as the starting value for the next.

A single-number evaluation metric speeds up the iterative process of improving a machine learning algorithm. The F1 score is one such metric: the harmonic mean of precision P and recall R.
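A minimal sketch of the F1 score as a single-number metric (the example precision/recall values are made up):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A classifier with 90% precision but only 10% recall gets a low F1,
# unlike the arithmetic mean (0.5), which would hide the weak recall.
print(round(f1_score(0.9, 0.1), 3))  # 0.18
```

The harmonic mean punishes an imbalance between the two rates, which is why it is preferred over a simple average when combining precision and recall into one number.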

Optimizing metric vs. satisficing metrics: when you need to weigh N different criteria, it can make sense to treat N-1 of them as satisficing metrics, each only required to meet a particular threshold, and to define the remaining one as the optimizing metric, to be made as good as possible (for example, maximize accuracy subject to an acceptable running time).
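A small sketch of model selection with one optimizing metric (accuracy) and one satisficing metric (running time); the models and numbers are hypothetical:

```python
# Each candidate model: (name, accuracy, running_time_ms).
models = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),  # most accurate, but far too slow
]

MAX_RUNTIME_MS = 100  # satisficing metric: must run within 100 ms

# Keep only models that meet the satisficing threshold, then pick
# the one with the best optimizing metric (accuracy).
feasible = [m for m in models if m[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda m: m[1])
print(best[0])  # B
```

Model C wins on raw accuracy but fails the satisficing constraint, so B is selected.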

The training set is used to run your learning algorithm and train models: you try different methods and ideas by training different models on it. The development set (also called the hold-out cross-validation set) is used to tune parameters, select features, and make other decisions about the learning algorithm: you select the best model on the dev set and improve dev-set performance through continuous iteration. The test set is used to evaluate the final performance of the algorithm, but not to decide which learning algorithm or parameters to use.

If the development set and test set come from different distributions, this can lead to: 1. overfitting the development set; 2. the test set being harder to classify than the development set, in which case your algorithm may already be doing as well as can be expected and is unlikely to improve significantly further; 3. the test set being not necessarily harder, just different, so that doing well on the development set does not mean doing well on the test set, and effort spent improving development-set performance may be wasted. It is therefore recommended to choose development and test sets from the same distribution; this makes your team more effective.

In general the traditional splits were 60% training, 20% development, 20% test (or 70/30). These no longer apply in modern machine learning, where we usually work with much larger data sets: with, say, a million training examples, it may make more sense to split 98% training, 1% development, 1% test. We abbreviate the development and test sets as dev and test.
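The 98/1/1 split for a large data set can be sketched as follows (a minimal illustration; real pipelines would split indices or file lists rather than in-memory data):

```python
import random

def split_dataset(examples, train_frac=0.98, dev_frac=0.01, seed=0):
    """Shuffle and split into train / dev / test; test gets the remainder."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_train = round(len(examples) * train_frac)
    n_dev = round(len(examples) * dev_frac)
    return (examples[:n_train],
            examples[n_train:n_train + n_dev],
            examples[n_train + n_dev:])

# With a million examples, 98/1/1 still gives dev and test sets of
# 10,000 examples each, plenty for comparing models reliably.
train, dev, test = split_dataset(range(1_000_000))
print(len(train), len(dev), len(test))  # 980000 10000 10000
```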

One way to modify the evaluation metric: add a weight term w(i).

If image i is not pornographic, w(i) = 1; if it is pornographic, w(i) can be 10 or even 100. Pornographic images thus get a large weight, so when the algorithm misclassifies a pornographic image as a cat, the error term grows very quickly. To get a normalized metric, divide by the sum of all the weights instead of the example count, so that the error rate still lies between zero and one.
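The weighted error metric can be sketched like this (labels and weights are an invented toy example, with y = 1 meaning "cat"):

```python
def weighted_error(predictions, labels, weights):
    """Weighted misclassification rate, normalized by the total weight
    so the result stays between 0 and 1."""
    total = sum(weights)
    wrong = sum(w for p, y, w in zip(predictions, labels, weights) if p != y)
    return wrong / total

# The third example is a pornographic image (weight 100); misclassifying
# it as a cat dominates the metric, unlike an unweighted error rate.
labels      = [1, 0, 0, 1]
predictions = [1, 0, 1, 1]   # one mistake: the porn image labeled "cat"
weights     = [1, 1, 100, 1]
print(round(weighted_error(predictions, labels, weights), 3))  # 0.971
```

With unweighted error this single mistake would cost only 1/4; the weighting makes it cost 100/103.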

If you are not satisfied with your old error metric, do not keep using it: define a new metric instead (let the actual application define the metric).

Bayes optimal error rate: the best error rate that is theoretically possible; no function mapping from x to y, however designed, can exceed this level of accuracy.

For tasks that humans are good at, as long as your machine learning algorithm is still worse than humans, you can have people label examples for you (or pay them to), so that you have more data to feed to the learning algorithm.

Avoidable bias: the difference between the training error rate and the Bayes error rate (or an estimate of it, such as human-level error).

Human-level error is a little higher than the Bayes error rate, since the Bayes error rate is the theoretical limit, but it is usually not far above it. If your training error is already close to the Bayes error rate, you know there is not much room for improvement: you cannot keep reducing the training error, and you should not expect to do much better than the Bayes error rate. If instead there is a larger gap between the training error and the development error, there is more room for improvement, and that gap can be narrowed with variance-reducing measures, such as regularization or collecting more training data.

The difference between the training error rate and your estimate of the Bayes error rate tells you how big the avoidable-bias problem is; the difference between the training error rate and the development error rate tells you how big the variance problem is, i.e., how well your algorithm generalizes from the training set to the development set.
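These two gaps can be computed directly; a tiny sketch with invented error rates, using human-level error as the proxy for Bayes error:

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance), with human-level error
    used as an estimate of the Bayes error rate."""
    return train_error - human_error, dev_error - train_error

# Training error far above human level: reducing bias comes first.
bias, variance = diagnose(human_error=0.01, train_error=0.08, dev_error=0.10)
print(round(bias, 3), round(variance, 3))  # 0.07 0.02
```

Here avoidable bias (7%) dwarfs variance (2%), so bias-reduction tactics should take priority.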

Summary: to make a supervised learning algorithm practical: 1. the algorithm fits the training set well, which means the avoidable bias is low; 2. good performance on the training set carries over to the development and test sets, which means the variance is not too large.

To reduce avoidable bias: 1. use a larger model, so that the algorithm performs better on the training set, or train for longer; 2. use a better optimization algorithm, such as Momentum, RMSprop, or Adam; 3. try better new neural network architectures, or better hyperparameters.

When variance is the problem: collect more data, because more training data helps you generalize to development-set data the system has not seen; try regularization, such as L2 regularization or dropout; or experiment with different neural network architectures and hyperparameter search, to see whether you can find an architecture better suited to your problem.

The second week.

Performance ceiling: for each candidate direction of improvement, estimate the best the algorithm could do if that category of errors were completely fixed; this tells you which way of improving the algorithm's performance is most worthwhile.

Perform error analysis: take a set of misclassified samples, perhaps 100, from your development set or test set, examine them by hand, and count how many fall into each error category (for example, false positives versus false negatives). In the process you may be inspired to recognize new categories of errors.
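The counting step amounts to a small spreadsheet; a sketch with hypothetical hand-assigned tags for ten misclassified images (in practice you would inspect around 100, and one example can carry several tags):

```python
from collections import Counter

# Hand-assigned error tags for misclassified dev-set examples.
error_tags = [
    ["blurry"], ["dog"], ["blurry", "dog"], ["great_cat"],
    ["dog"], ["mislabeled"], ["blurry"], ["dog"],
    ["great_cat"], ["dog"],
]

counts = Counter(tag for tags in error_tags for tag in tags)
total = len(error_tags)
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{total} = {100 * n / total:.0f}% of errors")
```

If "dog" accounts for half the errors, fixing dog confusions has a high performance ceiling; a category covering 5% of errors does not.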

If labeling errors seriously affect your ability to evaluate algorithms on your development set, then it is worth spending time fixing the incorrect labels.

The main purpose of the development set is to let you choose between two classifiers A and B.

Whatever label fixes you apply, apply them to both the development set and the test set, so that they continue to come from the same distribution.

When building a new machine learning system, set up a first system quickly and then start iterating. The whole point of the initial system is that once you have a trained system, you can measure its bias and variance to decide what the next priority should be, and you can perform error analysis, look at some of the mistakes, and figure out which of all the possible directions is actually the most promising.

You can let your training set come from a different distribution than your development and test sets; this gives you access to much more training data.

If the development and test sets come from the same distribution but the training set comes from a different one, randomly shuffle the training set and carve out part of it as a training-development set; the training-development set then comes from the same distribution as the training set. Suppose training error is 1% and training-development error is 9%: since both sets come from the same distribution, the algorithm has a variance problem. Suppose instead training error is 1% and training-development error is 1.5%, but the error rate jumps to 10% when you move to the development set: the error rises dramatically only on the differently distributed data, so this is a data-mismatch problem.
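The full breakdown across the four error rates can be sketched as one small function (human-level error stands in for Bayes error; the numbers are the example figures above):

```python
def diagnose_gaps(human, train, train_dev, dev):
    """Break dev-set error into avoidable-bias, variance, and
    data-mismatch components."""
    return {
        "avoidable_bias": train - human,       # train vs. Bayes estimate
        "variance": train_dev - train,         # same distribution, unseen data
        "data_mismatch": dev - train_dev,      # distribution change only
    }

# Train 1%, train-dev 1.5%, dev 10%: the big jump happens only when
# the data distribution changes, so data mismatch dominates.
gaps = diagnose_gaps(human=0.005, train=0.01, train_dev=0.015, dev=0.10)
print(max(gaps, key=gaps.get))  # data_mismatch
```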

If there is a big gap between development-set performance and test-set performance, you may have overfitted the development set and need a larger development set.

Bias measures the gap between the learning algorithm's expected prediction and the true result, i.e., it describes the fitting ability of the learning algorithm itself. Variance measures the change in performance caused by changes in a training set of the same size, i.e., it describes the influence of perturbations in the data.
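The standard bias-variance decomposition of expected squared error (for regression) makes these two definitions precise; here f(x) is the true function, \hat{f}(x; D) the model trained on data set D, and \sigma^2 the irreducible label noise:

```latex
\mathbb{E}_{D}\!\left[\big(y - \hat{f}(x; D)\big)^{2}\right]
  = \underbrace{\big(f(x) - \mathbb{E}_{D}[\hat{f}(x; D)]\big)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\!\left[\big(\hat{f}(x; D) - \mathbb{E}_{D}[\hat{f}(x; D)]\big)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

The expectation over D is exactly the "change of training set of the same size" in the definition above, and the \sigma^2 term plays the role of the Bayes error floor.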

Data mismatch problem: do error analysis, looking at the training set and the development set, and try to understand the difference between the two data distributions; then see whether there is a way to collect more training data that looks like the development set. Artificial data synthesis can help, but be aware that you may end up simulating data from only a very small subset of the space of possibilities.

Transfer learning: the situation where it works is that you have a lot of data for the source problem you are transferring from, but not much data for the target problem you are transferring to. If you want to learn some knowledge on task A and transfer it to task B, transfer learning is meaningful when A and B have the same kind of input x (images, audio), and it makes the most sense when task A has much more data than task B. To do this, swap in a new data set, say radiology images paired with the diagnoses you want to predict, delete the last output layer of the network, and randomly re-initialize its weights before training on the new data. Pre-training: training all the parameters of the neural network on the initial data (e.g., image recognition). Fine-tuning: updating those weights on the new task's data.

Multi-task learning: if you can train a big enough neural network, multi-task learning will rarely or never lower performance compared with training separate networks.

End-to-end deep learning: ignore all the separate pipeline stages and replace them with a single neural network. A large amount of data is needed to make such a system perform well, and one drawback is that it excludes hand-designed components that might have been useful.
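The freeze-and-retrain idea can be sketched in miniature, with a fixed feature map standing in for the frozen pretrained layers and a logistic output layer retrained on a synthetic target task (everything here, including the data, is invented for illustration; the new layer starts at zeros rather than random values only to keep the run deterministic):

```python
import math

def features(x):
    """Frozen 'pretrained' layers: map raw input to a feature vector."""
    return [x, x * x, 1.0]  # stand-in for learned representations

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_last_layer(data, lr=0.1, epochs=500):
    """Retrain only the re-initialized output layer; features stay fixed."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            f = features(x)
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
            # Gradient step on the logistic loss, updating this layer only.
            w = [wi - lr * (p - y) * fi for wi, fi in zip(w, f)]
    return w

# Small target task: label 1 iff |x| > 1 (solvable via the x^2 feature).
data = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.2, 0), (0.6, 0), (1.4, 1), (2.1, 1)]
w = train_last_layer(data)
preds = [int(sigmoid(sum(wi * fi for wi, fi in zip(w, features(x)))) > 0.5)
         for x, _ in data]
print(preds == [y for _, y in data])  # retrained layer fits the new task
```

This mirrors the fine-tuning recipe above: the representation learned on the data-rich task is reused, and only the small task-specific head is learned from the small data set.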