1.1 Machine Learning (ML)
FIG. 1 Machine learning timeline
Ever since the earliest ideas of science, technology and artificial intelligence, scientists have followed in the footsteps of Blaise Pascal and Gottfried Wilhelm Leibniz in wondering whether there could be a machine with the same intelligence as humans. Famous authors such as Jules Verne, L. Frank Baum (The Wizard of Oz), Mary Shelley (Frankenstein), and George Lucas (Star Wars) fantasized about artificial beings with abilities similar to or greater than those of humans, able to act like humans in many different situations.
1.1.1 Definition of machine learning
Concept: Machine learning (ML) draws on many disciplines, including probability theory, statistics, approximation theory, convex analysis, and algorithmic complexity theory. It studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills, and how they can reorganize existing knowledge structures to continuously improve their own performance.
Subject orientation: Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications span every area of artificial intelligence, and it relies mainly on induction and synthesis rather than deduction.
Definition: The study and development of algorithms that enable computers to learn from data, build models, and make predictions on new inputs using those models, without being explicitly programmed for the task.
Arthur Samuel (1959): a field of study in which computers acquire the ability to learn without being explicitly programmed. Langley (1996): machine learning is a science of artificial intelligence; its main research object is how to improve the performance of specific algorithms through experience. Tom Mitchell (1997): machine learning is the study of computer algorithms that improve automatically through experience.
Learning from experience: a computer program is said to learn if its performance on a defined task T, as measured by a performance measure P, improves with the accumulation of experience E.
1.1.2 Development history of machine learning
Machine learning is a relatively young branch of artificial intelligence research, and its development process can be roughly divided into four periods.
The first stage, from the mid-1950s to the mid-1960s, was an enthusiastic warm-up period. The second stage, from the mid-1960s to the mid-1970s, is known as the cooling-off period of machine learning. The third stage, from the mid-1970s to the mid-1980s, is known as the renaissance period.
The latest phase of machine learning began in 1986, and the field has entered a new stage in the following respects:
(1) Machine learning has become a new interdisciplinary subject and a university course in its own right. It combines psychology, biology and neurophysiology with mathematics, automation and computer science to form the theoretical basis of machine learning.
(2) Research on integrated learning systems that combine various learning methods is on the rise. In particular, coupling connectionist learning with symbolic learning has helped solve the acquisition and refinement of knowledge and skills in continuous-signal processing.
(3) A unified view of the fundamental problems of machine learning and artificial intelligence is emerging. For example, combining learning with problem solving, together with learning-oriented knowledge representation, led to the chunking mechanism of the general intelligent system SOAR; case-based methods that combine analogical learning with problem solving have become an important direction of experience-based learning.
(4) The applications of the various learning methods keep expanding, and some have become commercial products. Knowledge-acquisition tools based on inductive learning are widely used in diagnostic classification expert systems; connectionist learning dominates speech and image recognition; analytical learning has been used to design integrated expert systems; genetic algorithms and reinforcement learning show good prospects in engineering control; and neural-network learning coupled with symbolic systems will play an important role in enterprise intelligent management and robot motion planning.
(5) Academic activity related to machine learning is more vigorous than ever. In addition to the annual machine learning symposium, there are also conferences on computational learning theory and on genetic algorithms.
1.1.3 Classification of machine learning
Machine learning is divided into several categories: supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and transfer learning.
Supervised learning trains an optimal model by adjusting a classifier's parameters on samples whose categories are already known (labeled samples). Supervised learning maps each input to a corresponding output and judges the output to achieve classification, so that unknown data can then be classified.
In layman’s terms, we give a computer a bunch of multiple choice questions (training samples), along with standard answers, and the computer tries to adjust its model parameters in the hope that its guesses match the standard answers as much as possible, so that the computer learns how to do these kinds of questions. We then ask the computer to help us with multiple choice questions (test samples) that don’t provide answers.
Unsupervised learning works with unlabeled, unclassified samples: we model the input data set directly, for example by clustering. The most intuitive example is the saying "birds of a feather flock together": we simply group together items with high similarity, and for a new sample we compute its similarity and assign it to a group accordingly.
In layman's terms, we give the computer a bunch of multiple choice questions (training samples) but provide no standard answers. The computer tries to analyze the relationships among the questions and group them into categories; it still does not know the answer to any group, but it assumes that questions in the same category should share the same answer.
Semi-supervised learning is a machine learning approach between supervised and unsupervised learning that considers how to train and classify using a small number of labeled samples and a large number of unlabeled samples. In layman's terms, we give the computer a bunch of multiple choice questions (training samples), but only some of them come with standard answers. The computer tries to analyze the relationships among the questions and, guided by the small number of labeled samples, works out answers for the rest.
Its application scenarios include regression and classification, and its algorithms include extensions of commonly used supervised learning algorithms that first try to model the unlabeled data and then use it when predicting the labeled data, such as graph inference or the Laplacian SVM. Since the term was coined, semi-supervised learning has mainly been used on synthetic data: most current semi-supervised methods assume sample data free of noise, whereas most real-world data is noisy, and pure samples are usually hard to obtain.
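As a rough illustration of the idea, here is a minimal sketch assuming scikit-learn is available; LabelPropagation is a graph-based method in the spirit of the graph inference mentioned above, not necessarily the exact algorithm the original had in mind. Unlabeled samples are conventionally marked with -1.

```python
# Semi-supervised sketch: a few labels are spread to many unlabeled points.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1     # hide 90% of the labels

model = LabelPropagation()
model.fit(X, y_partial)                     # trains on labeled + unlabeled data
print((model.transduction_ == y).mean())    # accuracy of the inferred labels
```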
Reinforcement learning: a so-called reinforcement learning system learns a mapping from the environment to actions so as to maximize a reward (reinforcement) signal. It differs from supervised learning in connectionism mainly in the teacher signal: in reinforcement learning, the reinforcement signal is an evaluation provided by the environment of how good or bad an action was (usually a scalar), rather than telling the reinforcement learning system how to produce the correct action.
In layman's terms, we give the computer a bunch of multiple choice questions (training samples) but provide no standard answers. The computer tries the questions, and we, acting as the teacher, grade its attempts: the more it gets right, the more reward it receives. The computer then readjusts its model parameters so that its guessed answers earn more reward. Loosely speaking, it can be understood as learning that is unsupervised at first and supervised afterwards.
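A toy tabular Q-learning sketch of this idea (my own illustrative example, not from the original): an agent on a 5-cell corridor receives a reward of +1 only when it reaches the rightmost cell, and it learns a policy from that scalar reward signal alone.

```python
# Tabular Q-learning on a tiny corridor environment.
import random

n_states, actions = 5, [-1, +1]              # move left or move right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Explore with probability epsilon (or when Q values are tied), else act greedily.
        if random.random() < epsilon or Q[s][0] == Q[s][1]:
            a = random.randrange(2)
        else:
            a = Q[s].index(max(Q[s]))
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0        # reward comes from the environment
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([row.index(max(row)) for row in Q])     # learned actions: 1 (move right) in non-terminal states
```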
Transfer learning
Since most data and tasks are related, transfer learning lets us share already-learned parameters with a new model, speeding up and optimizing its learning instead of starting from scratch as before. The parameters of a model that has already been learned and trained are transferred to the new model to help it train on the new data set.
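A common transfer-learning sketch, assuming PyTorch and torchvision are installed (the 10-class head and learning rate are illustrative choices): reuse the convolutional layers of a network pre-trained on ImageNet and train only a new output layer for the new task.

```python
# Transfer learning: freeze pre-trained layers, retrain only a new head.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)   # parameters learned on ImageNet
for p in model.parameters():
    p.requires_grad = False                             # freeze the transferred layers
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new head for a hypothetical 10-class task

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3)  # only the new head is trained
```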
Figure 2
1.1.4 Common algorithms of machine learning
1. The decision tree classifies data according to certain features. Each node asks a question; these questions are learned from existing data, and when you feed in new data, it is routed to the appropriate leaf according to the questions along the tree.
Figure 3
Advantages: low computational complexity, output results that are easy to understand, insensitive to missing intermediate values, able to handle irrelevant feature data.
Disadvantages: prone to over-fitting; this can be mitigated by limiting the tree depth and the number of leaf nodes.
Keywords: ID3 (information gain), C4.5 (information gain ratio), CART (Gini index).
Data requirements: nominal data, so numerical data must be discretized.
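A minimal sketch, assuming scikit-learn is available; note that scikit-learn's tree is CART-based and accepts numerical features directly, unlike the nominal-data requirement of classic ID3 mentioned above.

```python
# Decision tree: learn the "questions" from labeled data, then route new samples to a leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Limiting depth and leaf size is the usual guard against over-fitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=5)
tree.fit(X, y)
print(tree.predict(X[:2]))    # new samples are classified by the questions along the tree
```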
2. Random Forest video: www.youtube.com/watch?v=loN… Randomly sample data from the source data to form several subsets.
Figure 4.
The S matrix is the source data, containing rows 1 to N. A, B, and C are features, and the last column is the class label.
Generate M submatrices randomly from S.
The M subsets yield M decision trees. New data is put into all M trees, producing M classification results; the category that receives the most votes is taken as the final prediction.
Figure 5
Advantages: Good performance on many data sets, high accuracy; Not easy to overfit; You can get the importance order of variables; It can process both discrete data and continuous data without normalization. Able to deal with missing data well; Easy to parallelize.
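A random-forest sketch, assuming scikit-learn is available (the parameter choices are illustrative): M trees are each fit on a random subset of the data, their votes are counted for the final prediction, and feature_importances_ provides the variable-importance order mentioned above.

```python
# Random forest: many randomized trees, prediction by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:2]))          # majority vote over the 100 trees
print(forest.feature_importances_)    # importance ordering of the features
```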
3. Logistic regression videos: www.youtube.com/watch?v=gNh… www.youtube.com/watch?v=owI… When the prediction target is a probability, the output must lie between 0 and 1 inclusive. A simple linear model cannot guarantee this, because its output is unbounded and will exceed the required interval.
Figure 6.
So it would be good to have a model with this shape.
Figure 7.
So how do you get this model? It has to satisfy two conditions: be greater than or equal to 0, and less than or equal to 1. To keep it non-negative, you can choose an absolute value, a square, or an exponential function, which is always positive; to keep it below 1, divide that quantity by itself plus 1, and the ratio must be less than 1.
And if you do another transformation, you get the logistic regression model.
You can calculate the coefficients from the source data.
And you end up with the logistic graph.
Figure 8.
Advantages: low computational cost, easy to understand and implement.
Disadvantages: prone to underfitting; classification accuracy may not be high.
Keywords: Sigmoid function, Softmax for multi-class problems.
Applicable data types: numerical and nominal data.
Others: although logistic regression uses a nonlinear function, once the Sigmoid mapping is removed, the remaining steps are the same as in linear regression.
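A sketch of the idea above, assuming scikit-learn and NumPy are available: squeeze any real-valued score into (0, 1) with exp(z) / (1 + exp(z)), i.e. the Sigmoid function, and let LogisticRegression fit the linear coefficients from data.

```python
# Logistic regression: linear model + Sigmoid squashing into (0, 1).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return np.exp(z) / (1.0 + np.exp(z))        # numerator >= 0, ratio < 1

print(sigmoid(np.array([-5.0, 0.0, 5.0])))      # always between 0 and 1

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)  # multi-class handled internally (Softmax / one-vs-rest)
print(clf.predict_proba(X[:1]))                 # predicted class probabilities
```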
4. SVM video: www.youtube.com/watch?v=1Nx… To separate two classes we look for a hyperplane; the optimal hyperplane is the one that maximizes the margin between the two classes. The margin is the distance between the hyperplane and the points nearest to it. In the figure below, Z2 > Z1, so the green hyperplane is the better one.
Figure 9.
The hyperplane is represented as a linear equation: points of one class give a value greater than or equal to 1, and points of the other class give a value less than or equal to -1.
Figure 10.
The distance from point to surface is calculated according to the formula in the figure.
Figure 11.
Therefore, the total margin can be expressed as 2/||w||. The goal is to maximize the margin, so the denominator ||w|| must be minimized, and the task becomes an optimization problem.
For example, given three points, find the optimal hyperplane, defining the direction of the weight vector as (2,3) - (1,1).
Figure 12
The weight vector is therefore (a, 2a). Substituting the two points into the equation, with (2,3) giving the value 1 and (1,1) giving the value -1, we can solve for a and the intercept w0 and obtain the expression of the hyperplane.
Figure 13
Once a is solved and substituted back into (a, 2a), the support vectors are identified; substituting a and w0 into the hyperplane equation gives the support vector machine model.
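A quick numeric check of this worked example (my own arithmetic, not taken from the original figures): with w = (a, 2a), the constraints w·(2,3) + w0 = 1 and w·(1,1) + w0 = -1 give a = 0.4 and w0 = -2.2.

```python
# Verify the worked SVM example: the two support vectors land exactly on +1 and -1.
import numpy as np

a = 0.4                                   # from 8a + w0 = 1 and 3a + w0 = -1  =>  5a = 2
w, w0 = np.array([a, 2 * a]), -2.2

for point, target in [((2, 3), +1), ((1, 1), -1)]:
    print(point, np.dot(w, point) + w0, target)   # prints 1.0 and -1.0
```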
Advantages: suited to small sample sizes; can solve high-dimensional problems; rests on a relatively complete theoretical foundation whose mathematics is elegant; can improve generalization ability.
Disadvantages: When the amount of data is large, the memory resource consumption is large (storing training samples and kernel matrix) and time complexity is high. In this case, LR and other algorithms are better than SVM. There is no universal solution to nonlinear problems, and it is sometimes difficult to find a suitable kernel.
The use of kernel functions is indeed a highlight of SVM, but kernels are not exclusive to it: any algorithm involving inner-product computations can use a kernel function. SVM can also be adapted to many different scenarios, such as extension to multiple classes or imbalanced class labels.
Keywords: optimal hyperplane, maximum margin, Lagrange multipliers, dual problem, SMO solver, hinge loss, slack variables, penalty factor, multi-class classification.
Applicable data types: numerical and nominal data.
Parameters: Select kernel functions, such as radial basis functions (low-dimensional to high-dimensional), linear kernel functions, and parameters of kernel functions; Penalty factor.
Others: SVM is not better than other algorithms in any scene. SVM is not as good as logistic regression, KNN and Bayes in mail classification. It is a distance-based model and needs normalization.
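An SVM sketch, assuming scikit-learn is available (the kernel and C values are illustrative): because SVM is distance-based, the features are normalized first; the RBF kernel maps to a higher-dimensional space, and C is the penalty factor for the slack variables.

```python
# SVM with normalization and an RBF kernel.
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)
print(model.predict(X[:2]))
```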
5. Naive Bayes video: www.youtube.com/watch?v=DNv… For example, in NLP: given a passage of text, return its sentiment category, i.e. whether the attitude of the passage is positive or negative.
Figure 14
To solve this problem, just look at some of the words.
Figure 15
This text will only be represented by a few words and their count.
Figure 16
The original question is: given a sentence, which category does it belong to? Applying Bayes' rule turns this into a question that is much simpler to solve.
The question then becomes: what is the probability of this sentence appearing in each category? And of course, don't forget the other two probabilities in the formula, the prior probability of the category and the probability of the sentence itself.
Example: The probability of the word “love” occurring in positive situations is 0.1, and in negative situations 0.001.
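A tiny sentiment sketch of Naive Bayes, assuming scikit-learn is available (the example sentences and the prediction are made up for illustration): texts are reduced to word counts, and Bayes' rule combines the per-word probabilities with the class priors.

```python
# Naive Bayes sentiment classification from word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["I love this movie", "what a great film", "I hate it", "terrible and boring"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["love this great film"]))   # expected: ['positive']
```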
6. K Nearest Neighbours video: www.youtube.com/watch?v=zHb… Given a new sample, look at the k points closest to it; whichever category is most common among those k points is the category assigned to the new sample.
Example: to distinguish cats from dogs by the claw and sound features, where the circles and triangles are already classified, which category does the star represent?
Figure 17
When k = 3, the points connected by these three lines are the nearest three points, so there are more circles, so this star belongs to the cat.
Figure 18
Advantages: high precision, insensitive to outliers, no data input assumptions.
Disadvantages: high computational and space complexity (a KD-tree can speed up the neighbor search); sensitive to imbalanced data. Data types: numeric and nominal.
Others: how to choose the value of K; when the data is imbalanced, the classification tends toward the majority classes, which can be remedied by distance weighting.
Choosing K: when K is small, predictions are very sensitive to the nearest instance points and over-fitting occurs easily; when K is too large, the model tends toward the majority classes and is prone to underfitting. K is usually an integer no greater than 20 (see Machine Learning in Action).
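A KNN sketch, assuming scikit-learn is available (the parameter choices are illustrative): k is kept small, weights="distance" applies the distance weighting mentioned above for imbalanced data, and algorithm="kd_tree" uses a KD-tree for the neighbor search.

```python
# K nearest neighbours: "training" just stores the samples; prediction is a vote.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3, weights="distance", algorithm="kd_tree")
knn.fit(X, y)
print(knn.predict(X[:2]))     # distance-weighted vote among the 3 nearest neighbours
```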
7. K-means video: www.youtube.com/watch?v=zHb… Suppose a group of data is to be divided into three classes, where the pink values are large and the yellow values are small. First initialize: here the simplest choice, 3, 2 and 1, is taken as the initial center of each class. For the remaining data, compute the distance to each of the three initial centers and assign each point to the class of the nearest center.
Figure 19
After the points have been assigned to classes, compute the average of each class and use it as the center point for the next round.
Figure 20
After a few rounds, the group assignments no longer change, and you can stop.
Figure 21
Figure 22
Advantages: easy to implement. Disadvantages: the value of K is not easy to determine; sensitive to the initial centers; may converge to a local minimum. Keyword: KD-tree.
Data type: Numeric data.
Determining K: look for an inflection point (elbow) in a cluster metric such as radius or diameter. To keep the K-means algorithm from converging to a local minimum, the initial cluster centers must be chosen carefully:
K-means++ algorithm: choose initial cluster centers that are as far apart from each other as possible.
Bisecting K-means algorithm: start with all points in one cluster, then split it in two; afterwards repeatedly choose one cluster to split further, picking whichever split reduces the Sum of Squared Error (SSE) the most.
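A K-means sketch, assuming scikit-learn is available: init="k-means++" spreads the initial centers far apart, and n_init reruns the algorithm several times to reduce the risk of a poor local minimum.

```python
# K-means with k-means++ initialization and multiple restarts.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
km.fit(X)
print(km.cluster_centers_)    # final centers once the assignments stop changing
print(km.inertia_)            # within-cluster Sum of Squared Error (SSE)
```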
8. Adaboost video: www.youtube.com/watch?v=rz9… Adaboost is one of the boosting methods. Boosting combines several classifiers with poor individual performance into one classifier with good performance. Adaboost is an additive model: each new model is built according to the error rate of the previous one, paying more attention to the misclassified samples and less to the correctly classified ones; after successive iterations a fairly good model is obtained. In the figure below, the left and right decision trees are individually not very good, but feeding the same data into both and considering the two results together increases credibility.
Figure 23
As an Adaboost example, handwriting recognition can capture many features from the drawing board, such as the direction of the starting stroke and the distance between the starting and ending points.
Figure 24
During training, the weight of each feature will be obtained. For example, the beginning parts of 2 and 3 are very similar. This feature plays a small role in classification, so its weight will be small.
Figure 25
This alpha angle, by contrast, is highly discriminative, so this feature will receive a large weight. The final prediction is a weighted combination of the results of these features.
Figure 26
Advantages: Adaboost is a classifier with very high precision. Sub-classifiers can be built in various ways, with the Adaboost algorithm providing the framework. When simple classifiers are used, the results are easy to interpret, and constructing the weak classifiers is extremely simple; no feature selection is needed, and over-fitting rarely occurs.
Disadvantages: Sensitive to outliers.
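An AdaBoost sketch, assuming scikit-learn is available: scikit-learn's default weak learner is a depth-1 decision tree (a "stump"), and each new stump focuses on the samples the previous ones misclassified.

```python
# AdaBoost: a weighted vote of many weak classifiers.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
print(boost.predict(X[:2]))        # weighted vote of the 50 weak classifiers
```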
9. Neural network video: www.youtube.com/watch?v=P2H… Neural networks are suited to inputs that may fall into at least two categories. A NN consists of several layers of neurons and the connections between them. The first layer is the input layer and the last is the output layer; both the hidden layers and the output layer have their own classifiers.
Figure 27
The input is fed into the network and activated, and the computed scores are passed on, activating the next layer of neurons. Finally, the scores on the output-layer nodes represent the scores for each class; in the example below, the classification result is class 1. The same input is transmitted to different nodes, and the results differ because each node has different weights and biases. This is called forward propagation.
Figure 28
Advantages: high classification accuracy; strong parallel distributed processing ability, strong distributed storage and learning ability, strong robustness and fault tolerance to noisy data, and the ability to approximate complex nonlinear relationships arbitrarily well; has an associative-memory function.
Disadvantages: a neural network needs a large number of parameters, such as the network topology and the initial values of weights and thresholds; the learning process cannot be observed and the output is hard to interpret, which affects the credibility and acceptability of the results; training may take a very long time and may still fail to achieve the learning objective.
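A hand-written forward-propagation sketch, assuming NumPy is available (the weights are random toy values, not a trained network): each layer multiplies by its weights, adds a bias, applies an activation, and passes the scores to the next layer; the output node with the highest score gives the predicted class.

```python
# Forward propagation through one hidden layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])                 # input layer (3 features)
W1, b1 = np.random.randn(3, 4), np.zeros(4)    # input -> hidden (4 neurons)
W2, b2 = np.random.randn(4, 2), np.zeros(2)    # hidden -> output (2 classes)

hidden = sigmoid(x @ W1 + b1)                  # activate the hidden layer
scores = hidden @ W2 + b2                      # scores on the output nodes
print(scores.argmax())                         # predicted class (0 or 1)
```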
10. Markov video: www.youtube.com/watch?v=56m… Markov chains are composed of states and transitions. For example, from the sentence 'the quick brown fox jumps over the lazy dog' we can build a Markov chain: first make each word a state, then calculate the probability of switching between states.
Figure 29
These are the probabilities calculated from a single sentence. When you use a large amount of text for the statistics, you get a larger state-transition matrix, for example the words that can follow "the" and their corresponding probabilities.
Figure 30
The candidate suggestions offered by keyboard input methods work on the same principle, just with more advanced models.
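A Markov-chain sketch built from the example sentence (plain Python, illustrative only): each word is a state, and the transition probabilities are estimated from word-to-word counts. With a large corpus the same counting yields the bigger matrix described above, e.g. which words can follow "the".

```python
# Estimate word-to-word transition probabilities from one sentence.
from collections import Counter, defaultdict

words = "the quick brown fox jumps over the lazy dog".split()
counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    counts[current][following] += 1

transitions = {w: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
               for w, ctr in counts.items()}
print(transitions["the"])   # {'quick': 0.5, 'lazy': 0.5}
```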
Figure 31
Machine learning is applied to speech recognition, autonomous driving, language translation, computer vision, recommendation systems, unmanned aircraft, and spam detection.
Machine learning applications. Real-time translation: Speech Recognition Breakthrough for the Spoken, Translated Word. Link: https://www.youtube.com/watch?v=Nu-…
Self-driving car test: Steve Mahan. Link: www.youtube.com/watch?v=cdg… Unmanned aerial vehicle (quadcopter): Raffaello D'Andrea on the agile motion performance of quadcopters. Link: https://www.youtube.com/watch?v=w2i…
Reference: machine learning employment demand: blog.linkedin.com/2014/12/17/…
1.2 Deep Learning
1.2.1 What is Deep Learning
Deep learning is a newer field that grows out of machine learning. It originated from neural network algorithms inspired by the structure of the human brain, and a series of new algorithms have emerged along with improvements in big data and computing power.
Figure 32
1.2.2 Development process of deep learning
The concept was developed by Geoffrey Hinton and colleagues in papers published in 2006 and 2007 in Science and other journals.
Figure 33
1.2.3 What can deep learning be used for
Deep learning, as an extension of machine learning, is used in image processing and computer vision, natural language processing, and speech recognition. Since 2006, research and application of deep learning, pursued jointly by academia and industry, has made breakthrough progress in these fields. Take the classic ImageNet image object-recognition contest as an example: deep learning beat all traditional algorithms and achieved unprecedented accuracy.
Figure 34
1.2.4 Representative academic institutions and companies for deep learning
Universities represented by the University of Toronto, New York University, and Stanford University, and companies represented by Google, Facebook, and Baidu, are at the forefront of deep learning research and application. Google poached Hinton, Facebook poached LeCun, and Baidu's Silicon Valley lab poached Andrew Ng; Google acquired DeepMind, a start-up specializing in deep learning, in April last year for more than $500 million. As technology advances and talent remains scarce, the competition for deep learning talent has become fiercer than ever. Many large and small companies (such as Alibaba and Yahoo!) are also following suit and starting to enter the field, so demand for deep learning talent will continue to grow rapidly.
Figure 35
1.2.5 Deep learning will influence our life now and in the future
Google's voice recognition on current Android phones, Baidu's image recognition, and Google image search all already use deep learning technology. Last year, Facebook's DeepFace project came close to matching the accuracy of the human eye at face recognition for the first time (97.25% vs 97.5%). In the era of big data, the development of deep learning will have an inestimable impact on our future lives. Conservatively, many activities currently performed by humans will be taken over by machines as deep learning and related technologies develop, such as autonomous car driving, unmanned aircraft, and more capable robots. The development of deep learning lets us see and approach the ultimate goal of artificial intelligence for the first time.
Figure 36