BAT machine learning interview 1000 questions series

Compiled by: July, Yuan Chao, Lina, Dewei, Jia Ru, Wang Jian, AntZ, Meng Ying, and others. Most of the questions in this series come from the open Internet and are shared here for everyone's benefit. If you quote someone else in an answer, you must credit the original author and give the source link. In addition, many of the answers were reviewed by Han Xiaoyang, Dr. Guan, Zhang Yushi, Wang Yun, Dr. Chu and other teachers at July Online. Note: as the first AI question bank in China, this series was first published on the July Online Lab official account (Julyedulab), with part of it also updated on this blog. On Double 12 (December 12), 2017 it launched on the July Online official website, the July Online Android app and the July Online iPhone app, after which updates and maintenance of this article were suspended. The other nearly 3,000 questions have been added to the July Online app or to the question bank section of the July Online official website. You are welcome to practice there every day. Reposting is allowed as long as you include the source link.

 

 

preface

July here: I'm back again.

This blog previously collected thousands of interview questions from Microsoft and other companies, focusing on data structures, algorithms and massive data processing (see the Microsoft interview 100 questions series). This year, 2017, the team and I have been compiling the BAT machine learning interview 1000 questions series, which focuses on machine learning and deep learning. Through this series we will index most of the written-test and interview questions and knowledge points in machine learning and deep learning. It is meant to be a sufficiently large interview question bank and knowledge base for machine learning and deep learning, organized systematically and step by step.

In addition, four points should be emphasized:

  1. Although this series focuses on machine learning and deep learning and contains few questions of other types, that does not mean companies or interviewers ask only about these two areas when you apply for a machine learning or deep learning position. Even for data or AI roles, you must also master a basic language (such as Python), coding ability (for development work, coding ability cannot be overemphasized; for example, writing quick sort or binary search by hand), data structures, algorithms, computer architecture, operating systems, probability and statistics, and so on. For data structures and algorithms, I first recommend the earlier Microsoft interview 100 questions series (later organized into the book “Programming method: Interview and algorithm lessons”), and second, practicing on LeetCode. Reading 1000 questions is not as good as actually working through 100.
  2. In this series, we will try to group questions from the same area (e.g. models/algorithms) and the same direction (e.g. optimization algorithms) together, so that when preparing for written tests and interviews you can move from one question to the next and build a complete knowledge system.
  3. The answers in this series aim to be logical and easy to understand (if you don't understand something, chances are it is not because you are not smart enough, but because the material you are reading is not written clearly). If you have any suggestions, please feel free to discuss them in the comments.
  4. On how to learn machine learning, I most recommend the machine learning training camp series. It goes from Python basics, data analysis, crawlers, data visualization and Spark big data through to hands-on machine learning, deep learning and more.

In addition, this series will continue to be updated until it reaches a thousand questions, or even several thousand. Please leave a comment below to share questions you came across in your own written tests and interviews, or found online or bookmarked, to help more people around the world.

 

 

BAT machine learning interview 1000 questions series

1. Please briefly introduce SVM. (Machine learning, ML models, easy)

SVM stands for support vector machine. It is a data-oriented classification algorithm whose goal is to determine a separating hyperplane that divides the different classes of data. Further reading: for a detailed introduction to the principle and derivation of SVM, see “Popular Introduction to Support Vector Machines (Understanding the three levels of SVM)”. There is also a video on the derivation of SVM: “Pure whiteboard manual SVM”.

 

2. Please briefly introduce TensorFlow's computation graph. (Deep learning, DL frameworks)

@Han Xiaoyang & AntZ: TensorFlow is a programming system that expresses computation in the form of a computation graph, also called a data flow graph. You can think of a computation graph as a directed graph. Each node in the graph represents an operation, and the edges between nodes carry tensors: they describe the dependencies between computations (when the graph is defined) and the flow of data between the mathematical operations (when it is evaluated). It is shown in the following two figures:

a=x*y; b=a+z; c=tf.reduce_sum(b);
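To make the "define first, evaluate later" idea concrete, here is a minimal sketch of the same three operations as a toy graph in plain Python. This is not the TensorFlow API: the `Node` class, its `run` method and the operation names are hypothetical and exist only for illustration.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs it consumes."""

    def __init__(self, op, inputs=()):
        self.op = op          # name of the operation (hypothetical names)
        self.inputs = inputs  # edges: the upstream nodes this node depends on

    def run(self, feed):
        """Evaluate this node by first evaluating everything it depends on."""
        if self.op == "placeholder":
            return feed[self]
        args = [node.run(feed) for node in self.inputs]
        if self.op == "mul":
            return [u * v for u, v in zip(*args)]
        if self.op == "add":
            return [u + v for u, v in zip(*args)]
        if self.op == "reduce_sum":
            return sum(args[0])
        raise ValueError("unknown op: " + self.op)


# Definition time: only the graph structure is built, nothing is computed.
x, y, z = Node("placeholder"), Node("placeholder"), Node("placeholder")
a = Node("mul", (x, y))        # a = x * y
b = Node("add", (a, z))        # b = a + z
c = Node("reduce_sum", (b,))   # c = reduce_sum(b)

# Evaluation time: values flow along the edges from the inputs to c.
result = c.run({x: [1.0, 2.0], y: [3.0, 4.0], z: [0.5, 0.5]})
print(result)
```

Nothing is computed until `c.run(...)` is called, which is the separation between graph definition and graph evaluation that the answer describes.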



3. In k-means or kNN, we often use Euclidean distance to compute the distance between nearest neighbors, and sometimes Manhattan distance. Please compare these two distances. (Machine learning, ML models)

  • Euclidean distance is the most common way to express the distance between two points. Also known as the Euclidean metric, it is defined in Euclidean space. For two points x = (x1, …, xn) and y = (y1, …, yn), the distance is:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$



Although useful, Euclidean distance has a significant drawback: it treats differences in different attributes of a sample (that is, different indicators or variables) as equivalent, which sometimes does not meet practical requirements. In educational research, for example, where individuals are often analyzed and discriminated, different attributes of an individual matter to different degrees for telling individuals apart. Euclidean distance is therefore best suited to cases where all components of the vector are measured on a uniform scale.

  • Manhattan distance is formally defined as the L1 distance or city block distance: the sum of the lengths of the projections, onto the axes of a fixed rectangular coordinate system in Euclidean space, of the line segment between two points. For example, in the plane, the Manhattan distance between point P1 at coordinates (x1, y1) and point P2 at coordinates (x2, y2) is $|x_1 - x_2| + |y_1 - y_2|$. Note that Manhattan distance depends on the rotation of the coordinate system, but not on its translation or its reflection about a coordinate axis. When the axes are rotated, the distance between the same two points changes.

In layman's terms, imagine driving from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the “Manhattan distance”, which is where the name comes from; it is also known as city block distance.

Manhattan distance and Euclidean distance serve different purposes, and neither can replace the other. For a comparison of various distances, see “From k-nearest Neighbor algorithm, distance measurement to KD Tree, SIFT+BBF algorithm”.
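The rotation dependence mentioned above is easy to check numerically. The following is a small sketch in plain Python; the helper names and the two sample points are chosen only for illustration.

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """City block distance: sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rotate(p, theta):
    """Rotate a 2D point counter-clockwise by theta radians about the origin."""
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


p1, p2 = (0.0, 0.0), (1.0, 1.0)
r1, r2 = rotate(p1, math.pi / 4), rotate(p2, math.pi / 4)

print(euclidean(p1, p2), euclidean(r1, r2))  # same value before and after rotation
print(manhattan(p1, p2), manhattan(r1, r2))  # changes when the axes are rotated
```

Rotating the coordinate system by 45 degrees leaves the Euclidean distance at sqrt(2) but changes the Manhattan distance from 2 to sqrt(2).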

 

4. Is a CNN's convolution kernel single-layer or multi-layer? (Deep learning, DL models)

AntZ: For the definition and an intuitive understanding of convolution, see the article “CNN Notes: Popular Understanding of Convolutional Neural Networks”, Blog.csdn.net/v_july_v/ar…. Strictly speaking, the weight matrix of a convolution kernel should be rotated by 180 degrees before it is applied. Since we never need the weight matrix in its pre-rotation form, we simply take the rotated weight matrix as the convolution kernel. The advantage is that the discrete convolution operation becomes an element-wise multiply-and-sum of two matrices, that is, a dot product.

In general, a deep convolutional network is built layer upon layer. A layer is essentially a set of feature maps that stores the input data or the values of an intermediate representation. A group of convolution kernels is the set of network parameters connecting one layer to the next, and what training learns is the weights of each convolution kernel.

The terms channel count and feature map count are both used to describe the thickness of a layer in the network. By habit, the thickness of the front layer is called the number of input channels (for example, an RGB image has 3 input channels), and the thickness of the back layer, the output of the convolution, is called the number of feature maps.

A convolution filter is generally 3D, that is, multi-layer. Besides its area, such as 3×3, it has a thickness H (a 2D filter is regarded as having thickness 1). Its other property is the number of convolution kernels, N.

The thickness H of a convolution kernel generally equals the thickness M of the front layer (the number of input channels or feature maps). The special case is M > H.

The number of convolution kernels, N, generally equals the thickness of the back layer (because they are equal, the number of feature maps of the back layer is also denoted N).

A convolution kernel is usually regarded as belonging to the back layer: it gives the back layer its own perspective on the features of the front layer, and these perspectives are formed automatically during training.

When the thickness of the convolution kernel equals 1, it is a 2D convolution: corresponding points on the plane are multiplied and the products are summed, which is a dot product. Animations of various 2D convolutions can be seen at https://github.com/vdumoulin/conv_arithmetic



When the thickness of the convolution kernel is greater than 1, it is a 3D convolution (depth-wise): a 2D convolution is computed on each plane separately, and the per-plane results are then added up to give the 3D convolution result. A 1×1 convolution is a special case of 3D convolution (point-wise): it has thickness but no area, so the single points at the same position on each layer are multiplied by their weights and summed directly.

To summarize, a convolution takes a region, whether a one-dimensional segment, a two-dimensional square or a three-dimensional box, cuts out of the input a block with the same dimensions and shape as the convolution kernel, multiplies corresponding points and sums the products, condensing the block into a single scalar, that is, reducing it to zero dimensions. That scalar becomes the value of one point of an output feature map. It is like a fisherman pulling in a net.

It can be likened to a group of fishermen sitting in a fishing boat and casting nets. The fish pond is a multi-layer body of water with different fish in each layer.

The boat moves one stride at a time. At each stop every fisherman casts a net and hauls in a catch, then the boat moves another stride and they cast again, and so on until the whole pond has been covered.

Fisherman A watches the species of the fish, and after traversing the pond describes the distribution of fish species in it.

Fisherman B watches the weight of the fish, and after traversing the pond describes the distribution of fish weight in it.

The other N-2 fishermen each follow their own interests.

In the end there are N feature maps, which together describe everything about the fish pond!

2D convolution means that the fisherman's net is a net with a ring of floats, which only catches the fish in the top layer of water.

3D convolution means that the fisherman's net is a multi-layer nested net, from which the fish in the upper, middle and lower layers cannot escape.

1×1 convolution can be regarded as dropping a hook and line at each stride instead of casting a net.
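The shape bookkeeping described above (kernel thickness H equal to the number of input channels M, and N kernels giving N feature maps) can be sketched in plain Python. This is a minimal illustration with nested lists, assuming no padding and square kernels; the function name and the toy data are made up for the example.

```python
def conv_multichannel(layer, kernels, stride=1):
    """layer: M x rows x cols; kernels: N kernels, each H x k x k with H == M.
    Returns N feature maps (no padding)."""
    depth, rows, cols = len(layer), len(layer[0]), len(layer[0][0])
    feature_maps = []
    for kernel in kernels:                       # one fisherman per kernel
        assert len(kernel) == depth              # kernel thickness H equals M
        k = len(kernel[0])
        fmap = []
        for r in range(0, rows - k + 1, stride):
            out_row = []
            for c in range(0, cols - k + 1, stride):
                total = 0.0
                for m in range(depth):           # 2D multiply-and-sum per plane ...
                    for i in range(k):
                        for j in range(k):
                            total += layer[m][r + i][c + j] * kernel[m][i][j]
                out_row.append(total)            # ... condensed into one scalar
            fmap.append(out_row)
        feature_maps.append(fmap)
    return feature_maps


# A front layer with M = 2 channels of size 3 x 3.
layer = [[[1.0] * 3 for _ in range(3)],
         [[2.0] * 3 for _ in range(3)]]

ones_2x2 = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]   # thickness 2, area 2 x 2
pointwise = [[[1.0]], [[-1.0]]]                            # 1 x 1: thickness only

print(conv_multichannel(layer, [ones_2x2]))    # one 2 x 2 feature map
print(conv_multichannel(layer, [pointwise]))   # one 3 x 3 feature map
print(len(conv_multichannel(layer, [ones_2x2, ones_2x2, ones_2x2])))  # N = 3 maps
```

Each output value is a single scalar condensed from a block of the input, and the number of output feature maps equals the number of kernels.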

Here is the special case M > H:

In practice, apart from the input data, which has few channels, the intermediate layers have many feature maps, so computing convolutions over the intermediate layers is enormously expensive (the fish pond is too deep, every layer has to be fished, and the net becomes too heavy). Many deep convolutional networks therefore split the channels/feature maps into groups, and each convolution kernel sees only part of them (Fisherman A's net only fishes the deep water, and Fisherman B's only fishes the shallow water). The overall network architecture thus splits into parallel branches and later merges them again. Seen this way, the architecture of many network models is not entirely a whim but is forced by the amount of computation and the number of parameters. Especially now that AI applications need to run their computation (also called inference) on mobile devices, the number of model parameters must be smaller still, so there are many forms of convolution that reduce the model size, and most mainstream network architectures use them. AlexNet is one example:



In addition, here are Baidu's 2015 campus recruitment machine learning written test questions: www.itmian4.com/thread-7042…

5. Tell me about LR. (Machine learning, ML models, hard)

Rickjin: Walk through LR from head to toe: the modeling, an on-the-spot mathematical derivation, the principle behind each way of solving it, regularization, how LR relates to the Maxent model, and why LR is better than linear regression. Plenty of people can recite answers but get confused as soon as you ask about the logical details. They know all the principles? Then ask about engineering: how to parallelize it, how many ways there are to parallelize it, and which open source implementations they have read. If they can still answer, get ready to make an offer, and while you are at it, press them on the history of the LR model.

In addition, these two articles can serve as references: The Past life of Logistic Regression (Theory Part), and Machine Learning Algorithms and Python Practice (7) Logistic Regression.

 

6. How do you deal with overfitting? Dropout, regularization, batch normalization. (Machine learning, ML fundamentals)

@AntZ: Overfitting means the model fits the training data too closely; what it looks like is shown in the figure below. As training progresses and model complexity increases, the error on the training data gradually decreases, but the error on the validation set gradually increases. The trained network has overfitted the training set and does not work on data outside it, which is called poor generalization. Generalization performance is the primary goal when evaluating training. Without good generalization, training heads in the wrong direction and all the effort is wasted.





Overfitting is the opposite of generalization. It is like Granny Liu, the country woman in Dream of the Red Chamber, who is bewildered by everything when she enters the Grand View Garden, whereas the well-educated Lin Daiyu is not taken aback when she enters the Jia mansion. In practical training, the usual ways to reduce overfitting are as follows:

Regularization

L2 regularization: add the sum of squares of all weight parameters w to the objective function, which pushes every w as close to zero as possible without making it exactly zero. When a model overfits, the fitted function has to accommodate every single point, so it ends up fluctuating wildly: within some very small intervals the function value changes dramatically, which means some w are very large. L2 regularization therefore penalizes the tendency of the weights to grow.

L1 regularization: add the sum of absolute values of all weight parameters w to the objective function, which forces more of the w to be exactly zero (that is, it makes the weights sparse). L2 does not drive weights to zero as quickly as L1 because its derivative also tends to 0 as w does. A key reason sparse regularization is so popular is that it performs automatic feature selection. In general, most elements of xi (the features) are unrelated to the final output yi or provide no information about it. Taking these extra features into account when minimizing the objective function may give a smaller training error, but when predicting new samples, the weights on the useless features are also taken into account and interfere with the prediction of the correct yi. The sparse regularizer is introduced to carry out automatic feature selection: it learns to discard the useless features, that is, it sets the weights corresponding to them to 0.
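The different behavior near zero can be seen in a small sketch that applies only the penalty term, with no data loss at all. That simplification, the step size and the penalty strength are assumptions made for illustration. The L1 step is clipped at zero (a soft-threshold step) so it cannot overshoot.

```python
def l2_step(w, lr, lam):
    """One gradient step on (lam / 2) * w**2, whose gradient is lam * w."""
    return w - lr * lam * w


def l1_step(w, lr, lam):
    """One step on lam * |w|: move toward zero by a constant amount,
    stopping at exactly zero instead of jumping past it."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


w_l1 = w_l2 = 0.5
for _ in range(100):
    w_l1 = l1_step(w_l1, lr=0.1, lam=0.1)
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)

print(w_l1)  # reaches exactly zero
print(w_l2)  # shrinks by a constant factor each step but stays nonzero
```

The L1 weight moves toward zero by a fixed amount each step and lands on it, while the L2 weight shrinks proportionally and only approaches zero.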

Dropout

During training, each neuron is kept active with probability p, a hyperparameter (that is, it is set to 0 with probability 1-p). Every w therefore participates only at random, so no single w is indispensable, and the effect is similar to an ensemble of a large number of models.
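A minimal sketch of this in plain Python follows. Dividing the kept activations by p (so-called inverted dropout) is a common variant and is an assumption here, not something the answer specifies; the function name and the seed are also illustrative.

```python
import random


def dropout(activations, p, rng, training=True):
    """Keep each activation with probability p, zero it with probability 1 - p.
    Kept values are divided by p so the expected activation is unchanged
    (inverted dropout, an assumed variant)."""
    if not training:
        return list(activations)  # no units are dropped at inference time
    return [a / p if rng.random() < p else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 10000
dropped = dropout(acts, p=0.8, rng=rng)

kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(kept_fraction, mean_activation)
```

Roughly a fraction p of the units survive each pass, and the rescaling keeps the average activation close to its original value.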

Batch normalization

In this method, the output of each layer is normalized (equivalent to adding a linear transformation layer to the network), so that the input to the next layer is close to a Gaussian distribution. In effect, the w of the next layer are no longer trained on a skewed, unrepresentative input, so the generalization effect is very good.
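As a rough sketch, the normalization of one feature across a mini-batch looks like this. The scale `gamma`, shift `beta` and the small constant `eps` follow the usual formulation of batch normalization and are assumptions, since the answer does not name them.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch to zero mean and unit variance,
    then apply a linear transformation (scale gamma, shift beta)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


out = batch_norm([10.0, 12.0, 14.0, 16.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var)
```

Whatever the scale of the incoming values, the next layer sees inputs with mean close to 0 and variance close to 1.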

Early stopping

The number of theoretically possible local minima grows exponentially with the number of parameters, and reaching an exact minimum is a source of poor generalization. Practice shows that chasing a fine-grained minimum leads to high generalization error. This is intuitive: we usually expect the error function to be smooth, whereas the error surface around an exact minimum is highly irregular, and generalizing well requires giving up some precision in exchange for a smoother minimum. Many training methods therefore adopt an early stopping strategy. A typical approach is to stop early based on cross-validation: before each training run, the training data is divided into several parts, one of which serves as the test set while the rest form the training set. After each round of training, the model is immediately checked on the held-out test set. The approach is called cross-validation because each part gets one turn as the test set. The point where the cross-validation error rate is lowest can be taken as the point of best generalization performance. At that point training should be stopped, even though the training error rate is still falling.
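The stopping rule can be sketched as follows. The `patience` parameter (how many rounds without improvement to tolerate) is a common practical addition and an assumption here, and the validation errors are made-up numbers for illustration.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Scan validation errors round by round and stop once the error has not
    improved for `patience` consecutive rounds. Returns (best_epoch, best_error)."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break  # validation error stopped improving: terminate training
    return best_epoch, best_err


# Hypothetical validation error after each round of training.
val_errors = [0.50, 0.40, 0.33, 0.30, 0.31, 0.34, 0.29]

print(early_stopping_epoch(val_errors, patience=2))
print(early_stopping_epoch(val_errors, patience=5))
```

With a short patience the run stops at the first sustained rise in validation error; a longer patience waits through the bump at the cost of more training.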

 

7. The connections and differences between LR and SVM.

Connections:
  1. LR and SVM can both handle classification problems, and both are generally used for linear binary classification (with modifications, they can handle multi-class problems).
  2. Both methods can add different regularization terms, such as L1 and L2. So in many experiments, the results of the two algorithms are very close.

Differences:
  1. LR is a parametric model, while SVM is a non-parametric model.
  2. In terms of the objective function, logistic regression uses the logistic loss while SVM uses the hinge loss. Both loss functions aim to increase the weight of the data points that have a greater impact on classification and reduce the weight of the data points that matter less.
  3. SVM considers only the support vectors, that is, the few points most relevant to classification, when learning the classifier. Logistic regression, through a nonlinear mapping, greatly reduces the weight of points far from the classification plane and relatively increases the weight of the data points most relevant to classification.
  4. The logistic regression model is relatively simple and easy to understand, and is especially convenient for large-scale linear classification. Understanding and optimizing SVM are relatively complicated. Once SVM is transformed into its dual problem, classification only requires computing the distance to a few support vectors, which is a clear advantage when computing complex kernel functions and can greatly simplify the model and the computation.
  5. What logistic regression can do, SVM can also do, though there may be some problems with accuracy; some of what SVM can do, logistic regression cannot.

Source: blog.csdn.net/timcompp/ar…
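The two loss functions are easiest to compare as functions of the margin, here taken to be y·f(x) with labels y in {-1, +1} (this margin convention is an assumption made for the sketch).

```python
import math


def logistic_loss(margin):
    """Logistic loss log(1 + exp(-margin)): positive for every margin."""
    return math.log(1.0 + math.exp(-margin))


def hinge_loss(margin):
    """Hinge loss max(0, 1 - margin): exactly zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


for margin in (-1.0, 0.0, 1.0, 3.0):
    print(margin, round(logistic_loss(margin), 4), hinge_loss(margin))
```

Points classified correctly with margin at least 1 contribute nothing to the hinge loss, which is why only the support vectors matter to SVM. The logistic loss never reaches zero, so every point keeps a small, rapidly decaying influence.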

 

8. Tell me about the kernel functions you know. (Machine learning, ML fundamentals, easy)

People usually choose from a few commonly used kernel functions (depending on the problem and the data, different parameters are chosen, which in effect gives different kernels), for example:

  • Polynomial kernel $\kappa(x_1, x_2) = (\langle x_1, x_2 \rangle + R)^d$. Obviously the example just given is a special case of the polynomial kernel (R = 1, d = 2). Although it is cumbersome and unnecessary, the mapping corresponding to this kernel can actually be written out; the dimension of that space is $\binom{m+d}{d}$, where m is the dimension of the original space.
  • Gaussian kernel $\kappa(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$. This is the kernel mentioned at the beginning that maps the original space to infinite dimensions. However, if σ is chosen very large, the weights on the higher-order features decay very quickly, so that the kernel in effect (as a numerical approximation) corresponds to a low-dimensional subspace. Conversely, if σ is chosen very small, arbitrary data can be mapped to be linearly separable, which is not necessarily a good thing, since it can lead to severe overfitting. In general, though, by tuning the parameter σ the Gaussian kernel is quite flexible, and it is one of the most widely used kernel functions. The following example maps low-dimensional, linearly non-separable data to a higher-dimensional space using a Gaussian kernel:

  • Linear kernel $\kappa(x_1, x_2) = \langle x_1, x_2 \rangle$. This is simply the inner product in the original space. It exists mainly so that “the problem in the mapped space” and “the problem in the space before mapping” take the same form. That is, when writing code or formulas, we only need to write one template or general expression and then plug in different kernels; the two cases are unified in form, and there is no need to write one version for the linear case and another for the nonlinear case.
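The three kernels can be written as interchangeable functions, which is the "one template, different kernels" point made above. The sketch below also writes out the explicit mapping for the polynomial kernel with R = 1, d = 2 in two dimensions; the feature map `phi` is the standard one for that case and is added here for illustration.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, R=1.0, d=2):
    return (dot(x, y) + R) ** d


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def phi(x):
    """Explicit feature map for the polynomial kernel with R = 1, d = 2, m = 2."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0)


x, y = (1.0, 2.0), (3.0, -1.0)
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
print(dot(phi(x), phi(y)))        # equals the polynomial kernel value
print(len(phi(x)), math.comb(4, 2))  # mapped dimension is C(m + d, d)
```

The kernel value computed in the original 2D space matches the inner product in the 6-dimensional mapped space, without ever constructing that space during training.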

 

9. Differences and connections between LR and linear regression. (Machine learning, ML models, medium)

@AntZ: In industry, LR generally refers to logistic regression rather than linear regression. LR applies the sigmoid function to the real-valued output of linear regression, squashing it into the range 0~1. Its objective function accordingly changes from the sum of squared differences to the logarithmic loss, which provides the derivatives needed for optimization (the sigmoid function is the binary special case of the softmax function, and its derivative is f*(1-f), expressed in terms of the function value f). Note that LR is usually used to solve binary 0/1 classification problems; it is only because it is so tightly coupled to linear regression that it was, without much thought, given the name regression. For multi-class classification, replace sigmoid with the well-known softmax.

@Nishizhen: In my view, logistic regression and linear regression are, first of all, both generalized linear models. Second, the optimization objective of the classical linear model is least squares, while that of logistic regression is the likelihood function. In addition, linear regression predicts over the whole real line with uniform sensitivity everywhere, whereas classification needs outputs in [0,1]. Logistic regression is a regression model that narrows the prediction range, limiting the predicted value to [0,1]. For this kind of problem, therefore, logistic regression is more robust than linear regression.

The logistic regression model is essentially a linear regression model, and logistic regression is underpinned by linear regression theory. However, a linear regression model cannot produce the nonlinear form of the sigmoid, and it is the sigmoid that makes it easy to handle 0/1 classification problems.
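The claims about the sigmoid can be checked numerically with a short sketch. The test point z = 0.7 and the finite-difference step are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative expressed through the function value: f * (1 - f)."""
    f = sigmoid(z)
    return f * (1.0 - f)


def log_loss(y, p):
    """Logarithmic loss for a label y in {0, 1} and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


z = 0.7                      # output of the linear part, w.x + b
p = sigmoid(z)               # squashed into (0, 1)
numeric_grad = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
two_class = softmax([z, 0.0])[0]

print(p, sigmoid_grad(z), numeric_grad)
print(two_class)             # softmax over the scores (z, 0) reproduces sigmoid(z)
print(log_loss(1, p), log_loss(0, p))
```

The finite-difference slope agrees with f*(1-f), and a two-class softmax over the scores (z, 0) gives the same probability as the sigmoid.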

 

10. What is the difference between GBDT and XGBoost (and decision trees, Random Forest, Boosting, AdaBoost)? (Machine learning, ML models, hard)

@AntZ: What ensemble learning combines is learners. Bagging and Boosting are the two families of ensemble learning methods. Bagging trains each learner on a sample of the same size drawn with replacement and then combines the learners (simple voting). Boosting trains the learners one after another on all the samples (with adjustable weights) and combines them iteratively (smooth weighting).

The decision tree is one of the most commonly used learners. Learning a tree means building it from the root down, that is, deciding how to split each node. ID3/C4.5 decision trees use information entropy to find the optimal split, CART decision trees use the Gini index, and XGBoost decision trees use the coefficients of a second-order Taylor expansion.

Bagging: there is no strong dependence between learners, so they can be trained in parallel, and they are generally combined by voting. Random Forest is a representative of Bagging: on top of sampling with replacement, each learner optimizes over a randomly selected subset of the features.

Boosting: there is a strong dependence between learners, so they must be generated serially, and they are combined by a weighted sum. AdaBoost is a representative of Boosting; it uses the exponential loss function in place of the 0/1 loss function of the original classification task. GBDT is an outstanding representative of Boosting: it performs gradient descent to approximate the residual of the function, uses CART regression trees as learners, and the ensemble is a regression model. XGBoost is the culmination of Boosting: it performs gradient descent to approximate the residual of the function, uses second-order gradient information in each iteration, and the ensemble model can do both classification and regression. Because it can compute in parallel at feature granularity, and because its structural risk control and engineering implementation are heavily optimized, its generalization, performance and scalability are all better than GBDT's.

For decision trees, see the article “Decision Tree Algorithm”. Random Forest is a classifier containing multiple decision trees. AdaBoost is short for Adaptive Boosting; for details, read the article “Principle and Derivation of AdaBoost Algorithm”. GBDT (Gradient Boosting Decision Tree) is, in effect, a combination of decision trees and the gradient boosting algorithm.

@Xijun LI: XGBoost is like an optimized version of GBDT, in both accuracy and efficiency. Compared with GBDT, its specific advantages are as follows:
  1. The loss function is approximated with a second-order Taylor expansion, rather than with only the first derivative as in GBDT.
  2. The structure of the tree is regularized, which prevents the model from becoming too complex and reduces the chance of overfitting.
  3. Nodes are split differently: GBDT uses the Gini index, while XGBoost uses a criterion derived from its optimization.

For more, see: xijunlee.github.io/2017/06/03/…

 

11. Why does XGBoost use a Taylor expansion, and what are the advantages? (Machine learning, ML models, hard)

@AntZ: XGBoost uses both first and second partial derivatives; the second derivative helps gradient descent go faster and more accurately. By using the Taylor expansion to obtain a form that is second-order in the function as the independent variable, the leaf-split optimization can be computed from the derivative values on the input data alone, without fixing a specific form of loss function. In essence, this separates the choice of loss function from the optimization of the model algorithm and the choice of parameters. This decoupling increases XGBoost's applicability: the loss function can be chosen on demand, for classification as well as regression.
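As a sketch of what this expansion looks like, the following uses the notation of the XGBoost paper. The symbols $g_i$, $h_i$, $G_j$, $H_j$, $\lambda$, $\gamma$ and $T$ do not appear in the answer above and are introduced here as assumptions.

```latex
\begin{align}
\mathcal{L}^{(t)} &\approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big)
    + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2},
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda},
\qquad
\mathcal{L}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T,
\qquad
G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i .
\end{align}
```

Here $f_t$ is the tree added in round $t$, $T$ is its number of leaves, $w_j$ is the weight of leaf $j$ and $I_j$ is the set of samples falling into it. The loss function enters only through the numbers $g_i$ and $h_i$, which is the decoupling described above: changing the loss changes how $g_i$ and $h_i$ are computed, not how the tree is grown.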

How does XGBoost find the optimal feature? Is sampling done with or without replacement? (Machine learning, ML models, hard)

@AntZ: During training XGBoost computes a gain score for each feature, and the feature with the maximum gain is chosen as the basis for the split. This also lets the importance of each feature in training be recorded: the number of times a feature is used in splits from root to leaf, ranked, gives the feature's importance. XGBoost is a Boosting ensemble method in which samples are drawn without replacement, so no sample is repeated within a round. XGBoost does, however, support subsampling: each round need not use all the samples, which reduces overfitting. Furthermore, XGBoost also has column sampling: each round randomly samples a percentage of the features, which both speeds up computation and reduces overfitting.
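The gain-based choice of a split can be sketched as below. This is an exhaustive search over a toy data set, not XGBoost's actual implementation. The gain formula follows the XGBoost paper, and the regularization constants `lam` and `gamma`, the helper names and the data are assumptions for illustration. The example uses squared loss with an initial prediction of 0, so g = -y and h = 1.

```python
def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of a leaf with gradient sums G and H."""
    return G * G / (H + lam)


def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting one node into the samples where left_mask is True / False."""
    GL = sum(gi for gi, m in zip(g, left_mask) if m)
    HL = sum(hi for hi, m in zip(h, left_mask) if m)
    GR, HR = sum(g) - GL, sum(h) - HL
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
                  - leaf_score(GL + GR, HL + HR, lam)) - gamma


def best_split(X, g, h, lam=1.0):
    """Try every (feature, threshold) pair and return (gain, feature, threshold)."""
    best = (0.0, None, None)
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            mask = [row[j] < t for row in X]
            if not any(mask) or all(mask):
                continue  # a split must put samples on both sides
            gain = split_gain(g, h, mask, lam)
            if gain > best[0]:
                best = (gain, j, t)
    return best


X = [[1, 5], [2, 3], [3, 6], [4, 4]]   # two features per sample
y = [0.0, 0.0, 1.0, 1.0]
g = [0.0 - yi for yi in y]              # squared loss with prediction 0: g = pred - y
h = [1.0] * len(y)                      # second derivative of squared loss

gain, feature, threshold = best_split(X, g, h)
print(gain, feature, threshold)
```

Feature 0 separates the two groups of targets cleanly, so it receives the largest gain and is chosen for the split.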

What about discriminative and generative models? (Machine learning, ML fundamentals, easy)

Discriminative methods learn the decision function Y = f(X) or the conditional probability distribution P(Y|X) directly from the data and use it as the predictive model; this is the discriminative model. Generative methods learn the joint probability distribution P(X, Y) from the data and then derive the conditional probability distribution P(Y|X) as the predictive model; this is the generative model. A discriminative model can be obtained from a generative model, but a generative model cannot be obtained from a discriminative model.

Common discriminative models include: k-nearest neighbor, SVM, decision tree, perceptron, linear discriminant analysis (LDA), linear regression, traditional neural network, logistic regression, Boosting, and conditional random field.

Common generative models include: naive Bayes, hidden Markov model, Gaussian mixture model, document topic generation model (LDA), and restricted Boltzmann machine.

The difference between L1 and L2.

The L1 norm is the sum of the absolute values of the elements of a vector; using it as a penalty is also known as Lasso regularization. For example, for the vector A = (1, -1, 3), the L1 norm of A is |1| + |-1| + |3| = 5. In short:
  • L1 norm: the sum of the absolute values of the elements of the vector x.
  • L2 norm: the square root of the sum of squares of the elements of the vector x. The L2 norm is also known as the Euclidean norm or Frobenius norm.
  • Lp norm: the sum of the p-th powers of the absolute values of the elements of the vector x, raised to the power 1/p.

In the learning process of a support vector machine, the L1 norm is in effect part of solving for the optimum of the cost function. L1 norm regularization adds the L1 norm to the cost function so that the learned result is sparse, which makes it easier for people to extract features. The L1 norm makes the weights sparse and facilitates feature extraction. The L2 norm prevents overfitting and improves the generalization ability of the model.

@AntZ: Why does it make such a big difference whether you minimize the absolute value or the square? Look at the derivatives: one is 1 and the other is w. Near zero, L1 keeps descending toward zero at a constant rate, whereas L2 slows to a complete stop. This shows that L1 eliminates unimportant features (or features whose importance is not of the same order of magnitude) as quickly as possible, while L2 shrinks the contribution of a feature as much as possible without reducing it to zero. When the two are used together, the features whose importance is of the same, highest order of magnitude are put to work together as equals (in short, keep neither idlers nor supermen).
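The norm definitions above fit in one small function. A minimal sketch, using the vector A = (1, -1, 3) from the example:

```python
def lp_norm(x, p):
    """Sum of the p-th powers of the absolute values, raised to the power 1/p."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


A = (1, -1, 3)
l1 = lp_norm(A, 1)   # |1| + |-1| + |3|
l2 = lp_norm(A, 2)   # square root of 1 + 1 + 9
print(l1, l2)
```

Setting p = 1 gives the L1 norm and p = 2 gives the L2 (Euclidean) norm.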

14. What distributions do the priors behind L1 and L2 regularization follow? (Machine learning, ML fundamentals, easy)

Interviewers ask: what distributions do the priors behind L1 and L2 regularization follow? L1 corresponds to a Laplace distribution and L2 to a Gaussian distribution.

@AntZ: The prior is the starting line of optimization. The advantage of having a prior is that the model can generalize well even on a small data set, provided of course that the prior distribution is close to the true distribution.

Placing a Gaussian (normal) prior on the parameters is equivalent to L2 regularization, which will be familiar to everyone:



Placing a Laplace prior on the parameters is equivalent to L1 regularization, as shown in the following figure:



It can be seen from the above two figures that the L2 prior concentrates around zero, while the L1 prior concentrates at zero itself.
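The equivalence can be sketched with maximum a posteriori (MAP) estimation. The symbols $D$ for the data, $\sigma$ for the Gaussian scale and $b$ for the Laplace scale are assumptions introduced here; the priors are taken to be zero-mean and independent across the weights $w_j$.

```latex
\begin{align}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_{w} \big[ \log p(D \mid w) + \log p(w) \big]
   = \arg\min_{w} \big[ -\log p(D \mid w) - \log p(w) \big], \\
\text{Gaussian prior:}\quad
p(w) &= \prod_{j} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big( -\frac{w_j^2}{2\sigma^2} \Big)
  \;\Rightarrow\;
  -\log p(w) = \frac{1}{2\sigma^2} \sum_{j} w_j^2 + \text{const}, \\
\text{Laplace prior:}\quad
p(w) &= \prod_{j} \frac{1}{2b} \exp\!\Big( -\frac{|w_j|}{b} \Big)
  \;\Rightarrow\;
  -\log p(w) = \frac{1}{b} \sum_{j} |w_j| + \text{const}.
\end{align}
```

The negative log-prior is added to the negative log-likelihood (the data loss). A Gaussian prior therefore contributes an L2 penalty with strength $1/(2\sigma^2)$, and a Laplace prior contributes an L1 penalty with strength $1/b$.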

 

15. CNN's most successful applications are in CV, so why can many NLP and speech problems also be solved with CNN? Why is CNN also used in AlphaGo? What do these seemingly unrelated problems have in common, and by what means does CNN capture that commonality? (Deep learning, DL applications, hard)

@Xu Han, source: zhuanlan.zhihu.com/p/25005808

Deep Learning -Yann LeCun, Yoshua Bengio & Geoffrey Hinton

Learn TensorFlow and deep learning, without a Ph.D.

The Unreasonable Effectiveness of Deep Learning -LeCun 16 NIPS Keynote

What these seemingly unrelated problems have in common is the relationship between the local and the whole: low-level features are combined into high-level features, and the spatial correlation between different features is captured. As shown below, low-level features such as lines and curves are combined into different shapes, finally yielding a representation of a car.

CNN mainly uses four means to capture this commonality: local connections, weight sharing, pooling, and a multi-level structure.

Local connections enable the network to extract local features of the data. Weight sharing greatly reduces the training difficulty of the network: a filter extracts only one feature and is convolved over the whole image (or speech signal, or text). Pooling, together with the multi-level structure, reduces the dimensionality of the data and combines lower-level local features into higher-level features, so as to represent the whole image. See the diagram below:



In the figure above, if the same Filter is used for processing each point, it is full convolution; if different filters are used, it is Local-conv.
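Local connection, weight sharing and pooling can be sketched on a 1-D signal in plain Python. The signal, the two-element "rising edge" filter and the function names are hypothetical choices made for illustration; real CNN layers operate on multi-channel tensors and learn their filters.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation, as in most DL libraries).

    The same kernel (shared weights) is applied to every local window.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(features, size):
    """Non-overlapping max pooling: keep the strongest response per window."""
    return [
        max(features[i:i + size])
        for i in range(0, len(features) - size + 1, size)
    ]


signal = [0, 0, 1, 1, 1, 0, 0, 1, 1]
edge_filter = [-1, 1]                      # responds to a rising edge
features = conv1d(signal, edge_filter)     # one feature map for one filter
pooled = max_pool1d(features, 2)           # lower resolution, same feature

print(features)
print(pooled)
```

The filter has only two weights regardless of the signal length, which is the point of weight sharing; a Local-Conv layer would instead need a separate pair of weights for every position.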

Also, for more about CNN, see the article "CNN Notes: Popular understanding of convolutional neural networks".

 

16. Describe AdaBoost and its weight update formula. When the weak classifier is Gm and the sample weights are w1, w2, …, write down the final decision formula. Machine learning, ML models, difficult

Given a training dataset T = {(x1, y1), (x2, y2), …, (xN, yN)}, where each instance xi belongs to the instance space X and each label yi belongs to the label set {-1, +1}. The purpose of AdaBoost is to learn a series of weak classifiers (basic classifiers) from the training data and then combine these weak classifiers into a strong classifier.

The algorithm flow of AdaBoost is as follows:

  • Step 1. Initialize the weight distribution of the training data. Each training sample is initially given the same weight, 1/N:

$$D_1 = (w_{11}, w_{12}, \dots, w_{1N}), \quad w_{1i} = \frac{1}{N}, \quad i = 1, 2, \dots, N$$

  • Step 2. Iterate for m = 1, 2, …, M, where m is the index of the iteration round.

A. Use the training data set with weight distribution D_m to learn the basic classifier (choose the threshold with the lowest error rate to design the basic classifier):

$$G_m(x): \mathcal{X} \rightarrow \{-1, +1\}$$

B. Calculate the classification error rate of G_m(x) on the training data set:

$$e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} w_{mi}\, I(G_m(x_i) \neq y_i)$$

According to this formula, the error rate e_m of G_m(x) on the training data set is the sum of the weights of the samples misclassified by G_m(x).

C. Calculate the coefficient of G_m(x). α_m represents the importance of G_m(x) in the final classifier (objective: to get the weight of the basic classifier in the final classifier):

$$\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$$

According to this formula, when e_m <= 1/2, α_m >= 0, and α_m increases as e_m decreases, which means that a basic classifier with a smaller classification error rate plays a greater role in the final classifier.

D. Update the weight distribution of the training data set (objective: to obtain the new sample weights) for the next iteration:

$$D_{m+1} = (w_{m+1,1}, \dots, w_{m+1,N}), \quad w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp\big(-\alpha_m y_i G_m(x_i)\big)$$

In this way, the weights of the samples misclassified by the basic classifier G_m(x) increase, while the weights of the correctly classified samples decrease. The AdaBoost method can thus "focus" on the samples that are harder to classify.

Here Z_m is the normalization factor, which makes D_{m+1} a probability distribution:

$$Z_m = \sum_{i=1}^{N} w_{mi} \exp\big(-\alpha_m y_i G_m(x_i)\big)$$

  • Step 3. Combine the weak classifiers:

$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$

Thus, the final classifier is obtained as follows:

$$G(x) = \operatorname{sign}\big(f(x)\big) = \operatorname{sign}\Big(\sum_{m=1}^{M} \alpha_m G_m(x)\Big)$$
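The whole flow can be sketched in a few lines of Python with threshold classifiers ("stumps") as the weak learners. This is an illustrative sketch, not a reference implementation: the ten-point 1-D data set is the small textbook-style example commonly used to walk through AdaBoost, and the function names and tie-breaking rule are assumptions made here.

```python
import math


def best_stump(xs, ys, w):
    """Return (weighted error, threshold, polarity) of the best stump.

    A stump predicts `polarity` when x < threshold and `-polarity` otherwise.
    """
    values = sorted(set(xs))
    thresholds = (
        [values[0] - 0.5]
        + [(a + b) / 2 for a, b in zip(values, values[1:])]
        + [values[-1] + 0.5]
    )
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(
                wi for x, y, wi in zip(xs, ys, w)
                if (polarity if x < t else -polarity) != y
            )
            # Keep the first stump found among (near-)ties.
            if best is None or err < best[0] - 1e-12:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                                  # step 1: uniform weights
    model = []
    for _ in range(rounds):                            # step 2
        err, t, polarity = best_stump(xs, ys, w)       # A and B
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)        # C
        w = [                                          # D: reweight samples
            wi * math.exp(-alpha * y * (polarity if x < t else -polarity))
            for x, y, wi in zip(xs, ys, w)
        ]
        z = sum(w)                                     # normalization factor
        w = [wi / z for wi in w]
        model.append((alpha, t, polarity))
    return model


def predict(model, x):
    """Step 3: sign of the alpha-weighted vote of all stumps."""
    f = sum(a * (p if x < t else -p) for a, t, p in model)
    return 1 if f >= 0 else -1


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
print(model)
print([predict(model, x) for x in xs])
```

On this data no single stump does better than a 0.3 error rate, yet the weighted vote of three stumps classifies every training point correctly.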



For more, please see the article "Principle and derivation of Adaboost algorithm".

 

17. Derive the LSTM structure. Why is it better than RNN? Deep learning, DL models, difficult

Derive how the forget gate, input gate, cell state, hidden state, etc. change. Because the LSTM has gated input and output, and the current cell information is added to the cell state after being controlled by the input gate, whereas in a plain RNN the state is propagated by repeated multiplication, the LSTM can alleviate vanishing gradients (exploding gradients can still occur and are usually handled by gradient clipping).
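For reference, the gate updates referred to above can be written in the commonly used formulation below. The weight and bias names (W_f, b_f, etc.) and the concatenation notation [h_{t-1}, x_t] are standard textbook notation assumed here; the original answer does not give the equations.

```latex
\begin{aligned}
f_t &= \sigma\big(W_f [h_{t-1}, x_t] + b_f\big) && \text{(forget gate)} \\
i_t &= \sigma\big(W_i [h_{t-1}, x_t] + b_i\big) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\big(W_c [h_{t-1}, x_t] + b_c\big) && \text{(candidate cell information)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state, additive update)} \\
o_t &= \sigma\big(W_o [h_{t-1}, x_t] + b_o\big) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

Along the cell-state path the direct dependence is \(\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)\), so when the forget gate stays close to 1 the gradient passes through many steps almost unchanged. In a plain RNN with \(h_t = \tanh(W h_{t-1} + U x_t)\), the corresponding factor is \(\operatorname{diag}(1 - h_t^2)\, W\), and its repeated product over time is what shrinks or blows up.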

Anyone who searches the Internet a lot knows that when you accidentally type a word that does not exist, the search engine suggests the right one. For example, if you type "Julw" into Google, the search engine guesses that you meant "July", as shown below:

This is called spell checking. According to an article written by a Google employee, Google's spell checking is based on a Bayesian approach. Please explain your understanding of how Google uses Bayesian methods to implement "spell checking". Machine learning, ML applications, difficult

When a user types a word, it may be spelled correctly or incorrectly. Denote the correctly spelled word by c (for correct) and the misspelled word by w (for wrong). The job of spell checking is to infer c given that w was observed. In other words: given w, find among a number of candidates the most likely c, i.e. the c that maximizes P(c | w).

According to Bayes' theorem, we have:

$$P(c \mid w) = \frac{P(w \mid c)\, P(c)}{P(w)}$$

Since all the candidate c's correspond to the same w, their P(w) is the same, so we only need to maximize

$$P(w \mid c)\, P(c)$$

where:

  • P(c) is the "probability" of the correct word, which can be replaced by its "frequency". If we have a large enough text corpus, the frequency of each word in that corpus approximates its probability of occurrence. The higher the frequency of a word, the greater P(c). For example, if you type the wrong word "Julw", the system is more likely to guess that you meant "July" rather than "Jult", because "July" is more common.
  • P(w | c) is the probability of typing the misspelling w when trying to spell c. To simplify the problem, assume that the closer two words are in spelling, the more likely the misspelling, and hence the larger P(w | c). A one-letter spelling difference, for example, is more likely than a two-letter difference. If you want to spell July, you are more likely to misspell it as Julw (one letter different) than as Jullw (two letters different). This closeness is commonly measured by the "edit distance". See this blog post.

So we compare the frequencies in the text corpus of all words with similar spellings, and pick the one that appears most frequently as the word the user most likely wanted to type. See here for details of the calculation process and the drawbacks of this method.
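A minimal sketch of this procedure is given below. It makes two simplifying assumptions that are not from the original: P(c) is taken from a tiny made-up word-count table (`WORD_COUNTS` is hypothetical), and P(w | c) is treated as equal for every candidate within one edit of the typed word, so the choice among candidates is decided by frequency alone.

```python
from collections import Counter

# Hypothetical corpus counts standing in for P(c); a real system would
# count words in a large text corpus.
WORD_COUNTS = Counter({"july": 50, "june": 40, "jury": 10, "jult": 1})

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings at edit distance 1 (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit of `word`."""
    word = word.lower()
    if word in WORD_COUNTS:                 # already a known word
        return word
    candidates = [c for c in edits1(word) if c in WORD_COUNTS]
    if not candidates:                      # nothing close enough: give up
        return word
    # argmax over c of P(c), with P(w | c) assumed equal for all candidates
    return max(candidates, key=WORD_COUNTS.get)


print(correct("Julw"))
```

Both "july" and "jult" are one edit away from "julw"; "july" wins because it is far more frequent in the hypothetical counts. A fuller model would also weight candidates by an explicit P(w | c), for example by edit distance or by keyboard-adjacency error statistics.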

 

Why is naive Bayes so "naive"? Machine learning, ML models, easy

Because it assumes that all features in a dataset are equally important and mutually independent. As we all know, this assumption rarely holds in the real world, so naive Bayes really is "naive".

AntZ: The "naive" in the naive Bayes model means assuming, "very simply and naively", that the sample features are independent of one another. This assumption almost never holds exactly in reality, but there are plenty of cases where the correlations between features are small, so the model still works well.

 

19. Please give a rough comparison of pLSA and LDA. Machine learning, ML models, medium

  • In pLSA, once the topic distribution and the word distribution are determined, specific topics and words are selected with certain probabilities (P(z | d) and P(w | z), respectively) to generate the document. Then the topic distribution and word distribution are inferred backwards from the generated documents. Finally, the EM algorithm (following the idea of maximum likelihood estimation) is used to solve for the values of two unknown but fixed parameters: the topic distribution θ (converted from P(z | d)) and the word distribution φ (converted from P(w | z)).

    • The probability that document d generates topic z, and the probability that topic z generates word w, are both fixed values.

      • Take document d generating topic z as an example. Given a document d, its topic distribution is fixed: for instance {P(zi | d), i = 1, 2, 3} may be {0.4, 0.5, 0.1}, meaning that the probabilities of the three topics z1, z2, z3 being selected by document d are fixed values: P(z1 | d) = 0.4, P(z2 | d) = 0.5, P(z3 | d) = 0.1, as shown in the figure below (figure taken from Shen Bo's PPT):

  • However, in LDA, under the Bayesian framework, we no longer regard the topic distribution (the probability distribution of the topics occurring in a document) and the word distribution (the probability distribution of the words occurring in a topic) as uniquely determined values; they are random variables with many possible values. But a document must still have a topic distribution and a word distribution, so what do we do? LDA gives them two Dirichlet priors, from which a topic distribution and a word distribution are randomly drawn for a document.

    • The probability that document d generates topic z (to be precise, the Dirichlet prior generates the topic distribution θ for document d, and topic z is then generated according to θ), and the probability that topic z generates word w, are no longer two fixed values but random variables.

      • Again, take document d generating topic z as an example. Given a document d with several topics z1, z2, z3, their topic distribution {P(zi | d), i = 1, 2, 3} may be {0.4, 0.5, 0.1}, or it may be {0.2, 0.2, 0.6}; that is, the probabilities of these topics being selected by d are no longer regarded as fixed values. They may be P(z1 | d) = 0.4, P(z2 | d) = 0.5, P(z3 | d) = 0.1, or they may be P(z1 | d) = 0.2, P(z2 | d) = 0.2, P(z3 | d) = 0.6, and so on, and we cannot be sure which set of values the topic distribution takes. (Why not? This is the core idea of the Bayesian school, which regards unknown parameters as random variables rather than fixed values.) Its prior distribution, however, is a Dirichlet distribution, so a topic distribution can be randomly drawn from the infinitely many possible topic distributions according to the Dirichlet prior, as shown in the figure below (figure taken from Shen Bo's PPT):

In other words, LDA adds two prior distributions to these two parameters (P(z | d) and P(w | z)), making them Bayesian: a Dirichlet prior on the topic distribution, and a Dirichlet prior on the word distribution.

To sum up, LDA is really just a Bayesian version of pLSA. After the document is generated, both need to infer the topic distribution and word distribution from the document, but they use different parameter inference methods: pLSA uses maximum likelihood estimation to infer two unknown fixed parameters, while LDA turns these two parameters into random variables and adds Dirichlet priors.
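The difference in how a document obtains its topic distribution can be sketched with the standard library alone. This is only an illustration of the generative view: the prior hyperparameters `alpha`, the random seed and the function names are assumptions, and no inference (EM or otherwise) is performed.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha).

    Uses the standard construction: independent Gamma(alpha_i, 1) draws,
    normalized to sum to 1.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_topic(theta, rng):
    """Pick a topic index according to the topic distribution theta."""
    return rng.choices(range(len(theta)), weights=theta, k=1)[0]


rng = random.Random(0)

# pLSA view: the topic distribution of document d is a fixed parameter.
theta_plsa = [0.4, 0.5, 0.1]

# LDA view: the topic distribution is itself a random draw from a
# Dirichlet prior, so two draws give two different distributions.
alpha = [1.0, 1.0, 1.0]          # hypothetical prior hyperparameters
theta_lda_1 = sample_dirichlet(alpha, rng)
theta_lda_2 = sample_dirichlet(alpha, rng)

# In both views, a topic for each word is then drawn from theta.
topics_plsa = [sample_topic(theta_plsa, rng) for _ in range(5)]
topics_lda = [sample_topic(theta_lda_1, rng) for _ in range(5)]

print(theta_lda_1, theta_lda_2)
print(topics_plsa, topics_lda)
```

The word distribution of each topic would be handled the same way: a fixed vector in pLSA, a Dirichlet draw in LDA.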

For more information, see: "Popular Understanding of LDA Topic Models".

 

20. Please briefly describe the EM algorithm. Machine learning, ML models, medium

@tornadomeet, source of this analysis: www.cnblogs.com/tornadomeet… Sometimes the process that generates the samples involves latent variables (variables that cannot be observed). Model parameters are usually estimated by maximum likelihood, but when latent variables are present, the derivative of the likelihood function with respect to the parameters cannot be solved directly. In that case the EM algorithm can be used to estimate the model parameters (there may be several of them). The EM algorithm generally has two steps:

E-step: fix a set of parameters and compute the conditional probability of the latent variables under these parameters;

M-step: using the conditional probabilities of the latent variables obtained in the E-step, maximize the lower-bound function of the likelihood (in essence, an expectation) with respect to the parameters.

Repeat the above two steps until convergence.

The formula is as follows:

   

Derivation of the lower-bound function used in the M-step formula:
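The two steps and the lower bound can be written in the notation commonly used for EM. The symbols below (x_i for samples, z for the latent variable, θ for the parameters, Q_i for the distribution over z chosen in the E-step) are assumed standard notation and may differ from the formula images of the original post.

```latex
\begin{aligned}
\ell(\theta) &= \sum_i \log p(x_i; \theta)
  = \sum_i \log \sum_{z} Q_i(z)\, \frac{p(x_i, z; \theta)}{Q_i(z)} \\
  &\ge \sum_i \sum_{z} Q_i(z) \log \frac{p(x_i, z; \theta)}{Q_i(z)}
  \qquad \text{(Jensen's inequality, since } \log \text{ is concave)} \\[6pt]
\text{E-step:} \quad & Q_i(z) := p(z \mid x_i; \theta) \\
\text{M-step:} \quad & \theta := \arg\max_{\theta} \sum_i \sum_{z} Q_i(z) \log \frac{p(x_i, z; \theta)}{Q_i(z)}
\end{aligned}
```

Choosing Q_i as the posterior of z makes the bound tight at the current θ, so maximizing the bound in the M-step cannot decrease the likelihood.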

   

A common example of the EM algorithm is the GMM (Gaussian mixture model). Each sample may be generated by any of k Gaussians, but with a different probability for each. Each sample therefore has a corresponding Gaussian (one of the k), and the latent variable in this case is the Gaussian that each sample corresponds to.

The E step formula of GMM is as follows (calculate the probability of each sample corresponding to each Gaussian) :

   

The more specific calculation formula is:

  

The m-step formula is as follows (calculate the proportion, mean and variance of each Gaussian) :
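The E-step and M-step for a GMM can be sketched for 1-D data as follows. This is an illustrative sketch under stated assumptions: the data set is made up, the means are initialized evenly between the minimum and maximum (which requires k >= 2), and a small variance floor is added to avoid division by zero; none of these choices come from the original post.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, k=2, iters=20, var_floor=1e-6):
    lo, hi = min(data), max(data)
    # Arbitrary initialization: means spread evenly over the data range.
    mu = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    var = [1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E-step: probability of each sample belonging to each Gaussian.
        resp = []
        for x in data:
            p = [pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: update the proportion, mean and variance of each Gaussian.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(
                sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj,
                var_floor,
            )
    return pi, mu, var


data = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.1, 2.9, 3.2, 2.8]
pi, mu, var = em_gmm_1d(data)
print(pi, mu, var)
```

With two well-separated clusters the responsibilities become almost hard assignments after the first iteration, so the estimated means settle at the two cluster centers and the mixing proportions at about one half each.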

   

 

21. How do you choose K in KNN? Machine learning, ML models, easy

For what KNN is, see the article "from k-nearest Neighbor algorithm, distance measurement to KD tree, SIFT+BBF algorithm". The choice of K has a great influence on the results of the k-nearest neighbor algorithm. As Dr. Li Hang puts it in his book Statistical Learning Methods:

  1. If a small value of K is chosen, predictions are made from training instances in a small neighborhood. The approximation error of "learning" decreases, because only training instances close (similar) to the input instance affect the prediction; the problem is that the estimation error of "learning" increases. In other words, decreasing K makes the overall model more complex and prone to overfitting;
  2. If a large value of K is chosen, predictions are made from training instances in a large neighborhood. The advantage is that the estimation error of learning decreases, but the disadvantage is that the approximation error increases: training instances far from (dissimilar to) the input instance also affect the prediction and cause prediction errors. Increasing K makes the overall model simpler.
  3. K = N is completely undesirable, because whatever the input instance is, the prediction is simply the class with the most instances in the training set. The model is too simple and ignores a large amount of useful information in the training instances.

In practice, K is generally set to a relatively small value, and the optimal K is chosen by cross-validation (simply put, part of the samples is used as the training set and part as the validation set).
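Choosing K by cross-validation can be sketched with leave-one-out validation on a toy data set. The data, the candidate values of K and the function names are hypothetical; features are 1-D and the distance is the absolute difference, purely to keep the example short.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x.

    `train` is a list of (feature, label) pairs with 1-D features.
    """
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)


data = [(0, "a"), (1, "a"), (2, "a"), (3, "a"), (4, "a"),
        (10, "b"), (11, "b"), (12, "b"), (13, "b"), (14, "b")]

scores = {k: loo_accuracy(data, k) for k in (1, 3, 9)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With K = 9 every held-out point is outvoted by the other class (the K = N situation described in item 3), while small values of K classify every point correctly; ties in the score are resolved here in favor of the first candidate.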

 

22. Methods to prevent overfitting. Machine learning, ML basics, easy

Overfitting occurs because the learning capacity of the algorithm is too strong; because some assumptions (such as samples being independent and identically distributed) may not hold; or because the training samples are too few to estimate the distribution over the whole space. Treatment methods:

  • Early stopping: stop training if the model performance has not improved significantly after several iterations
  • Data augmentation: add original data, add random noise to the original data, resample
  • Regularization: limits the complexity of the model
  • Cross-validation
  • Feature selection / feature dimensionality reduction
  • Creating a validation set is the most basic way to prevent overfitting. The goal of the model we end up training is to perform well on the validation set, not on the training set.
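As one concrete case, the early stopping rule from the list can be sketched as a small function over a sequence of validation losses. The `patience` and `min_delta` parameters and the loss values are hypothetical; in a real training loop the losses would be computed epoch by epoch rather than given in advance.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss          # new best: reset the counter
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: ran all epochs


val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64]  # hypothetical
stop_at = early_stopping_epoch(val_losses, patience=3)
print(stop_at)
```

In practice one would also keep the model weights from the best epoch (index 2 in this example) rather than those from the epoch where training stopped.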

 

In machine learning, why do we normalize data so often? Machine learning, ML basics, medium

@zhanlijun, source of this analysis: www.cnblogs.com/LBSer/p/444…

Machine learning models are widely used in the Internet industry, for example in ranking (see: Sorting learning practices), recommendation, anti-cheating and positioning (see: Localization algorithm based on Naive Bayes). In general, most of the time in a machine learning application is spent on feature processing, and a key step there is normalizing the feature data. Why should we normalize? Many people are not clear about this. The explanation given by Wikipedia is: 1) normalization speeds up gradient descent in finding the optimal solution; 2) normalization may improve accuracy. Let's expand on these two points a little.

1. Why does normalization speed up gradient descent in finding the optimal solution?

The Stanford machine learning video explains it very well: class.coursera.org/ml-003/lect…

As shown in the figure below, the blue circles represent the contour lines of two features. In the figure on the left, the ranges of the two features X1 and X2 are very different: X1 ranges over [0, 2000] and X2 over [1, 5]. The contour lines they form are very elongated. When gradient descent is used to seek the optimal solution, it is likely to take a "zigzag" route (moving perpendicular to the contour lines), so that many iterations are needed before convergence;

The figure on the right shows the two original features after normalization. The corresponding contour lines are nearly circular, and gradient descent converges quickly.

Therefore, if a machine learning model uses gradient descent to find the optimal solution, normalization is often necessary; otherwise the model converges with difficulty or does not converge at all.

2. Normalization may improve accuracy

Some classifiers, such as KNN, need to calculate distances between samples (such as the Euclidean distance). If the value range of one feature is very large, the distance calculation depends mainly on this feature, which may be contrary to the actual situation (for example, when the feature with the small range is actually more important).

3. Types of normalization

1) Linear normalization

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

This normalization method is more suitable when the values are fairly concentrated. One drawback is that if max and min are unstable, the normalization result becomes unstable, and so does everything that uses it afterwards. In practice, empirical constants can be used instead of max and min.

2) Standard deviation standardization

The processed data have a mean of 0 and a standard deviation of 1 (if the original data are normally distributed, the result follows the standard normal distribution). The transformation function is:

$$x^{*} = \frac{x - \mu}{\sigma}$$

where μ is the mean of all sample data and σ is the standard deviation of all sample data.

3) Nonlinear normalization

It is often used when the data are widely spread, with some values very large and some very small. The original values are mapped through some mathematical function, such as log, exponential or tangent. The curve of the nonlinear function, such as log(V, 2) or log(V, 10), has to be chosen according to the distribution of the data.
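The three kinds of normalization can be sketched with the standard library. The function names and sample values are illustrative assumptions; the z-score version uses the population standard deviation, and the log version assumes strictly positive inputs.

```python
import math
import statistics


def min_max(values):
    """Linear normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standard deviation standardization: zero mean, unit std."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def log_scale(values, base=10):
    """Nonlinear normalization with a logarithm (inputs must be > 0)."""
    return [math.log(v, base) for v in values]


x1 = [0, 500, 1000, 2000]          # a feature with a wide range
print(min_max(x1))
print(z_score(x1))
print(log_scale([1, 10, 100, 10000]))
```

In a real pipeline the statistics (min, max, mean, standard deviation) would be computed on the training set only and then reused to transform validation and test data.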

Talk about normalization in deep learning. Deep learning, DL basics, easy



See this video for details: "Normalization in deep learning".

 

Which machine learning algorithms do not need to be normalized? Machine learning, ML basics, easy

Probability models do not need normalization, because they do not care about the values of the variables but about the distributions of the variables and the conditional probabilities between them; examples are decision trees and random forests (RF). Optimization-based methods such as AdaBoost, SVM, LR, KNN and KMeans do need normalization.

Dr. Guan: My understanding is that normalization and standardization mainly make the computation more convenient. For example, if two variables have different units, one may have much larger values than the other, and using them together as variables may cause numerical problems: the inverse of a matrix may be computed very inaccurately, or gradient descent may converge with difficulty. And if you need to compute Euclidean distances, you may have to adjust the units as well, so I think normalizing is certainly a good idea for LR and KNN. As for other algorithms, I also think standardizing is a good idea when the units of the variables differ a lot.

@Hanxiaoyang: I usually say "tree models" here. The "probability models" mentioned above probably mean something similar.

 

25. Why do tree-structured models not need normalization? Machine learning, ML basics, easy

Answer: Numerical scaling does not affect the position of the split points. The first step is to sort by feature value, and since the sorted order stays the same, the branches and split points do not change. For a linear model such as LR, suppose there are two features, one in the range (0, 1) and the other in (0, 10000). When gradient descent is used, the loss contours are elongated ellipses, so reaching the optimum takes many iterations. After normalization the contours are circles, SGD heads toward the center, and fewer iterations are needed. In addition, note that a tree model cannot be trained by gradient descent: a tree model is a step function, which is not differentiable at the steps, so taking derivatives is meaningless. A tree model (regression tree) therefore finds the optimum by searching for the best split point.

 

26. Reasons for standardizing (or normalizing; note that standardization and normalization are different) data. Machine learning, ML basics, easy

@I love big bubbles, source: blog.csdn.net/woaidapaopa… It should be emphasized that if normalization can be avoided, it is better not to normalize. Data are normalized because the features have different units, and whether to normalize depends on the situation.

  • For some models (such as SVM), scaling the dimensions non-uniformly changes the optimal solution so that it is no longer equivalent to the original one; such models need normalization.
  • For some models (such as LR), the optimal solution after scaling is equivalent to the original one, so normalization is not strictly needed. In practice, however, model parameters are often solved iteratively, and if the objective function is too flat (imagine a very flat Gaussian), the iterative algorithm may fail to converge, so it is better to normalize the data.

Supplement: In essence, the difference comes from the different loss functions. SVM uses the Euclidean distance, so a feature with large values dominates the other dimensions. In LR, the weights can adjust to the scaling so that the loss function stays unchanged.

Please briefly describe the process of a complete machine learning project.

1. Abstract the problem into a mathematical problem

Making the problem clear is the first step in machine learning. The training process of machine learning is usually very time-consuming, so the cost of random attempts is very high. Abstracting into a mathematical problem means knowing what kind of data we can get and whether the goal is a classification, regression or clustering problem, and, if it is none of these, whether it can be reduced to one of them.

2. Acquire data

Data determine the upper bound of machine learning results, and the algorithm only approaches this upper bound as closely as possible. The data must be representative, or the model will inevitably overfit. For classification problems, the data should not be too skewed: the numbers of samples in different categories should not differ by several orders of magnitude. In addition, the magnitude of the data should be assessed (how many samples, how many features) to estimate memory consumption and judge whether the data fit in memory during training. If not, consider improving the algorithm or using dimensionality reduction tricks. If the amount of data is too large, consider a distributed approach.

3. Feature preprocessing and feature selection

Good data are only effective when good features can be extracted from them. Feature preprocessing and data cleaning are very important steps that can often improve the performance of the algorithm significantly. Normalization, discretization, factorization, missing value handling, removal of collinearity and so on take up a lot of time in the data mining process. These tasks are simple and reproducible, with stable and predictable returns, and are essential basic steps of machine learning. Selecting salient features and discarding non-salient ones requires machine learning engineers to understand the business thoroughly. This has a decisive effect on many results. With good feature selection, very simple algorithms can produce good, stable results. This requires techniques of feature validity analysis, such as the correlation coefficient, chi-square test, average mutual information, conditional entropy, posterior probability and logistic regression weights.

4. Train and tune the model

Only at this step are the algorithms described above actually used for training. Many algorithms are now packaged as black boxes ready to use. The real test is tuning the (hyper)parameters of these algorithms to make the results better. This requires a deep understanding of how the algorithms work: the deeper the understanding, the easier it is to find the root of a problem and come up with a good tuning plan.

5. Model diagnosis

How do we determine the direction and approach of model tuning? This requires techniques for diagnosing the model. Judging overfitting and underfitting is a crucial step in model diagnosis; common methods include cross-validation and plotting learning curves. The basic approach to overfitting is to increase the amount of data and reduce the complexity of the model. The basic approach to underfitting is to improve the quantity and quality of features and increase the complexity of the model. Error analysis is also a crucial step in machine learning. By examining the error samples, we can analyze the causes of the errors comprehensively: is it a problem of parameters or algorithm selection, a problem of features, or a problem of the data themselves? A diagnosed model needs to be tuned, and the tuned model needs to be diagnosed again. This is a process of repeated iteration and gradual approximation that takes continuous attempts to reach the optimal state.

6. Model fusion

Generally speaking, model fusion improves the results to some extent, and it works well. In engineering, the main ways to improve the accuracy of the algorithm are to work on the front end of the model (feature cleaning and preprocessing, different sampling modes) and on the back end (model fusion), because these are more standard and replicable and their effect is more stable. Direct parameter tuning is done less, because training on large amounts of data is too slow and the results are hard to guarantee.

7. Online operation

This part is mainly related to engineering implementation. Engineering is result-oriented, and the effect of the model running online directly determines its success or failure. This includes not only accuracy and error, but also running speed (time complexity), resource consumption (space complexity), and whether stability is acceptable.

These work flows are mainly summarized from engineering practice. Not every project contains the complete process; this is only a guide. Only by practicing more and accumulating more project experience can we gain a deeper understanding of our own. For this reason, every ML algorithm class at July Online now includes feature engineering, model tuning and other related courses. For example, here is an open class video called "Feature Processing and Feature Selection".

 

28. Why should features be discretized in logistic regression? Machine learning, ML models, medium

@YanLin, source of this analysis: www.zhihu.com/question/31…

In the industry, continuous values are seldom directly taken as the feature input of the logistic regression model, instead, continuous features are discretized into a series of 0 and 1 features and given to the logistic regression model, which has the following advantages:

0. It is easy to increase and decrease discrete features, which is easy for rapid iteration of the model;

1. The inner product multiplication of sparse vectors is fast, and the calculation results are easy to store and expand;

2. The discretized features have strong robustness to abnormal data: for example, if a feature is age >30, it is 1; otherwise, it is 0. If features are not discretized, an abnormal data “300 years old” will cause great disturbance to the model.

3. Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N variables, each of them has its own weight, which is equivalent to introducing nonlinearity into the model; this improves the expressive power of the model and its ability to fit.

4. Feature crossing can be carried out after discretization, from M+N variables to M*N variables, which further introduces nonlinearity and improves expression ability;

5. After feature discretization, the model is more stable. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just by being one year older. Of course, samples right next to an interval boundary behave in exactly the opposite way, so how to divide the intervals is an art;

6. Feature discretization simplifies the logistic regression model and reduces the risk of model overfitting.
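The robustness argument in points 2 and 5 can be illustrated with a small bucketing function. The bucket edges and the function name are hypothetical choices for illustration; in practice the edges are chosen from the data or from business knowledge.

```python
import bisect


def one_hot_bucket(value, boundaries):
    """Map a continuous value to a one-hot vector.

    `boundaries` must be sorted; they define len(boundaries) + 1 buckets:
    (-inf, b0), [b0, b1), ..., [b_last, +inf).
    """
    index = bisect.bisect_right(boundaries, value)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]


age_boundaries = [20, 30, 40, 50]          # hypothetical bucket edges

print(one_hot_bucket(25, age_boundaries))   # falls in the 20-30 bucket
print(one_hot_bucket(26, age_boundaries))   # one year older: same bucket
print(one_hot_bucket(300, age_boundaries))  # outlier: just the last bucket
```

An abnormal value such as 300 lands in the same bucket as 55, so it cannot dominate a weighted sum the way a raw continuous feature would. Each bucket then gets its own weight in the logistic regression, which is the source of the added nonlinearity mentioned in point 3.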

Li Mu once said: whether a model should use discrete features or continuous features is actually a tradeoff between "a large number of discrete features + a simple model" and "a small number of continuous features + a complex model". You can discretize and use a linear model, or use continuous features with deep learning. It depends on whether you prefer to fiddle with features or with models. Generally speaking, the former is easier and can be done in parallel by n people, and there is successful experience with it; the latter looks great so far, but how far it will go remains to be seen.

 

The difference between new and malloc. C/C++, programming development

@Sommer_Xia, source: blog.csdn.net/shymi1991/a…

1. malloc and free are C/C++ standard library functions, while new/delete are C++ operators. Both can be used to allocate dynamic memory and to free it.
2. For objects of non-built-in types, malloc/free alone cannot meet the requirements of dynamic objects. The constructor must be executed automatically when an object is created, and the destructor automatically before the object dies. Since malloc/free are library functions rather than operators, they are outside the compiler's control, so the task of executing constructors and destructors cannot be imposed on malloc/free.
3. Therefore the C++ language needs an operator new that performs both dynamic memory allocation and initialization, and an operator delete that performs both cleanup and freeing of memory. Note that new/delete are not library functions. C++ programs often call C functions, and C programs can only manage dynamic memory with malloc/free.

 

30. Hash conflicts and solutions. Data structures/algorithms, medium

@Sommer_Xia, source: blog.csdn.net/shymi1991/a…

A hash conflict occurs when elements with different keys are mapped to the same address in the hash table. Solutions:

1) Open addressing: when a conflict occurs, a probing technique is used to form a probe sequence in the hash table. The sequence is searched cell by cell until the given key is found or an open address (an empty cell) is encountered. (When inserting, the new node can be stored in the cell at the open address.) If an open address is encountered during a search, the key is not in the table and the search fails.
2) Rehashing: construct several different hash functions at the same time, and use the next one when a conflict occurs.
3) Chaining (chained address method): all elements whose hash address is i are linked into a singly linked list called a synonym chain, and the head pointer of the list is stored in the i-th cell of the hash table, so search, insertion and deletion are mainly carried out within the synonym chain. Chaining is suitable when insertions and deletions are frequent.
4) Public overflow area: the hash table is divided into a basic table and an overflow table. All elements that conflict with the basic table are placed in the overflow table.
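The chained address method (solution 3) can be sketched as a small Python class. This is an illustrative sketch: the class and method names are made up here, Python lists stand in for the singly linked synonym chains, and there is no resizing.

```python
class ChainedHashTable:
    """Hash table that resolves conflicts by chaining."""

    def __init__(self, size=8):
        # One chain (list of (key, value) pairs) per cell of the table.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: append to the chain

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(size=4)
table.put(1, "one")
table.put(5, "five")    # 5 % 4 == 1 % 4, so this key conflicts with key 1

print(table.get(1), table.get(5))
print(len(table.buckets[1]))   # both entries live in the same chain
```

Search, insertion and deletion all touch only the chain of one cell, which is why chaining copes well with frequent inserts and deletions.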

 

31. Which of the following is not an advantage of the CRF model over the HMM and MEMM models? (B) Machine learning, ML models, medium

A. Flexible features  B. Fast speed  C. Can accommodate more context information  D. Global optimum

First, CRF, HMM (hidden Markov model) and MEMM (maximum entropy Markov model) are all commonly used to model sequence labeling. One of the biggest weaknesses of the HMM is its output independence assumption, which prevents it from taking context features into account and limits the choice of features. The MEMM solves this problem of the HMM and can use arbitrary features, but because it normalizes at every node, it can only find a local optimum, and it also brings the label bias problem: any situation that does not appear in the training corpus is ignored. Conditional random fields solve this problem well: instead of normalizing at every node, all features are normalized globally, so the global optimum can be obtained.

In addition, the Machine Learning Engineer class, 8th session, covers probabilistic graphical models.

 

What is entropy? Machine learning, ML basics, easy

Entropy sounds a little mysterious from its name alone. In fact, it is simply defined as the uncertainty of a random variable. The mystery probably comes from not knowing why it was given this name and how it is used.

The concept of entropy originated in physics as a measure of the disorder of a thermodynamic system. In information theory, entropy is a measure of uncertainty.

The introduction of entropy

The term entropy was introduced by the German physicist Rudolf Clausius, who expressed it as:

$$\Delta S = \frac{Q}{T}$$

It represents the most stable internal state of a system that is free from external interference. A Chinese scholar later translated entropy as 熵, a character that combines the fire radical with the character for quotient, because entropy is the quotient of heat Q and temperature T and is related to fire.

We know that the normal behavior of any particle is random, or “disordered”, motion, and that energy must be expended if the particle is to become “ordered”. Thus, temperature (heat energy) can be regarded as a measure of ordering, and entropy as a measure of disorder.

Without an external energy input, closed systems tend to get more and more chaotic (entropy gets higher and higher). For example, if the room is not cleaned, it will not get cleaner (ordered), but more chaotic (disordered). And to make a system more orderly, you have to have an input of external energy.

In 1948, Claude E. Shannon introduced information entropy, defining it in terms of the probabilities with which discrete random events occur. The more ordered a system is, the lower its information entropy; conversely, the more chaotic a system is, the higher its information entropy. Information entropy can therefore be regarded as a measure of how ordered a system is.

See “Mathematical Derivations in Maximum Entropy Models” for more.

 

Definitions of entropy, joint entropy, conditional entropy, relative entropy and mutual information. Machine learning, ML basics, medium

To follow the definitions below, keep this notation in mind:

  1. An uppercase letter X denotes a random variable, and a lowercase letter x denotes a specific value of the random variable X;
  2. P(X) denotes the probability distribution of the random variable X, P(X, Y) the joint probability distribution of X and Y, and P(Y|X) the conditional probability distribution of Y given X;
  3. p(X = x) denotes the probability that X takes the specific value x, abbreviated p(x);
  4. p(X = x, Y = y) denotes a joint probability, abbreviated p(x, y), and p(Y = y | X = x) denotes a conditional probability, abbreviated p(y|x), with p(x, y) = p(x) * p(y|x).

Entropy: if the possible values of a random variable X are X = {x1, x2, …, xn}, with probability distribution P(X = xi) = pi (i = 1, 2, …, n), then the entropy of X is defined as:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

Moving the leading minus sign inside the logarithm gives:

$$H(X) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}$$

Either of the two formulas for entropy above can be used; they are equivalent and have the same meaning (both will be used later).
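The definition H(X) = -Σ pi log pi is easy to check numerically. The sketch below is only an illustration: the function name `entropy`, the choice of base 2 and the sample distributions are assumptions made here.

```python
import math


def entropy(probs, base=2):
    """Shannon entropy -sum(p * log p); outcomes with p == 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


fair_coin = entropy([0.5, 0.5])    # the most uncertain two-outcome variable
biased_coin = entropy([0.9, 0.1])  # less uncertain, so lower entropy
certain = entropy([1.0])           # a constant has no uncertainty at all
print(fair_coin, biased_coin, certain)
```

The fair coin has the highest entropy of the three and the constant has none, which matches the reading of entropy as uncertainty.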

 

Joint entropy: two random variables X and Y have a joint distribution, and the entropy of that distribution is the joint entropy, denoted H(X,Y). Conditional entropy: the additional entropy contributed by Y once X is known is defined as the conditional entropy of Y given X, denoted H(Y|X); it measures the uncertainty of Y when X is known.

The following identity holds: H(Y|X) = H(X,Y) - H(X), that is, the entropy contained in (X,Y) together minus the entropy contained in X alone. The derivation is:

$$\begin{aligned}
& H(X,Y) - H(X) \\
&= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x} p(x)\log p(x) \\
&= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x}\Big(\sum_{y} p(x,y)\Big)\log p(x) \\
&= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x,y} p(x,y)\log p(x) \\
&= -\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)} \\
&= -\sum_{x,y} p(x,y)\log p(y|x)
\end{aligned}$$

A quick explanation of the steps. The derivation has 6 rows, where

  • The second row becomes the third because the marginal distribution p(x) equals the sum of the joint distribution p(x,y) over y;
  • The third row becomes the fourth by multiplying the common factor log p(x) into the inner sum and writing the sums over x and y together;
  • The fourth row becomes the fifth because both sums contain p(x,y), so p(x,y) is factored out and what remains, -(log p(x,y) - log p(x)), is written as -log(p(x,y)/p(x));
  • The fifth row becomes the sixth because p(x,y) = p(x) * p(y|x), so p(x,y)/p(x) = p(y|x). The last row is exactly the definition of H(Y|X).

Relative entropy: also known as mutual entropy, discrimination information, Kullback entropy, Kullback-Leibler divergence, etc. (some texts also loosely call it cross entropy). Let p(x) and q(x) be two probability distributions over the values of X; then the relative entropy of p with respect to q is:

$$D(p\|q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$$

To some extent, relative entropy measures the “distance” between two distributions, but it is not symmetric: in general D(p||q) ≠ D(q||p). It is also worth mentioning that D(p||q) is always greater than or equal to 0.

Mutual information: the mutual information of two random variables X and Y is defined as the relative entropy between their joint distribution and the product of their marginal distributions, denoted I(X,Y):

$$I(X,Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$$

That is, I(X,Y) = D(P(X,Y) || P(X)P(Y)). Now let’s compute H(Y) - I(X,Y):

$$\begin{aligned}
H(Y) - I(X,Y) &= -\sum_{y} p(y)\log p(y) - \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} \\
&= -\sum_{x,y} p(x,y)\log p(y) - \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} \\
&= -\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)} \\
&= -\sum_{x,y} p(x,y)\log p(y|x) \\
&= H(Y|X)
\end{aligned}$$

From this calculation we find that H(Y) - I(X,Y) = H(Y|X). So the definition of conditional entropy gives H(Y|X) = H(X,Y) - H(X), and expanding the definition of mutual information gives H(Y|X) = H(Y) - I(X,Y). Equating the two yields I(X,Y) = H(X) + H(Y) - H(X,Y), which is the form most of the literature uses as the definition of mutual information. See “Mathematical Derivations in Maximum Entropy Models” for more.
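The identities H(Y|X) = H(X,Y) - H(X) and I(X,Y) = H(X) + H(Y) - H(X,Y) can be checked on a small example. The joint distribution below is hypothetical, chosen only for illustration, and the helper names `H` and `marginal` are assumptions of this sketch.

```python
import math

# Hypothetical joint distribution p(x, y) with X and Y each taking values 0 or 1.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint_dist, axis):
    """Sum the joint distribution over the other variable."""
    out = {}
    for outcome, p in joint_dist.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# Conditional entropy from its definition: H(Y|X) = -sum p(x,y) log p(y|x).
h_y_given_x = -sum(
    p * math.log2(p / p_x[x]) for (x, y), p in joint.items() if p > 0
)

# Mutual information as a relative entropy: D(p(x,y) || p(x)p(y)).
mutual_info = sum(
    p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0
)

print(h_y_given_x, H(joint) - H(p_x))
print(mutual_info, H(p_x) + H(p_y) - H(joint))
```

Each printed pair agrees up to floating-point rounding, and the mutual information comes out non-negative, as a relative entropy must.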

 

What is maximum entropy? Machine learning, ML basics, easy

Entropy is a measure of the uncertainty of a random variable. The greater the uncertainty, the greater the entropy. If the random variable degenerates to a constant, its entropy is zero. Without outside interference, a random variable always tends toward disorder, and after evolving stably for long enough it should reach its maximum entropy.

To estimate the state of a random variable accurately, we customarily maximize entropy: among all possible probability models (distributions), the one with the maximum entropy is considered the best. In other words, given only partial knowledge, the most reasonable inference about an unknown distribution is the one that agrees with what is known and is otherwise as uncertain, or random, as possible. The principle is to accept what is known (knowledge) and to make no assumptions, free of any bias, about what is unknown.

For example, if you roll a die and are asked “what is the probability of each face coming up?”, you would say the faces are equally likely, that is, each has probability 1/6. Nothing is known about this die, so assuming that every face is equally likely is the most reasonable choice. From an investment point of view, this is the least risky approach, and from an information theory point of view, it preserves the maximum uncertainty, that is, it maximizes entropy.

3.1 The principle of no bias

Here’s another example that most articles on maximum entropy models like to use.

For example, if the word “learning” appears in a passage, is it the subject, the predicate or the object? It is known that “learning” can be a noun or a verb, and that it can be labeled as subject, predicate, object, attributive and so on.

  • Let x1 denote “learning” being labeled as a noun and x2 denote “learning” being labeled as a verb.
  • Let y1 denote “learning” being labeled as the subject, y2 as the predicate, y3 as the object, and y4 as the attributive.

The probabilities must sum to 1, i.e.

$$p(x_1) + p(x_2) = 1, \qquad \sum_{i=1}^{4} p(y_i) = 1$$

According to the unbiased principle, every value in each distribution is taken to be equally likely, so:

$$p(x_1) = p(x_2) = 0.5, \qquad p(y_1) = p(y_2) = p(y_3) = p(y_4) = 0.25$$

Since there is no prior knowledge, this judgment is reasonable. What if we have some prior knowledge?

Suppose we know that the probability of “learning” being labeled as the attributive is very small, only 0.05, that is, p(y4) = 0.05. The rest still follows the unbiased principle, which gives:

$$p(x_1) = p(x_2) = 0.5, \qquad p(y_1) = p(y_2) = p(y_3) = \frac{0.95}{3}$$

Further, suppose that when “learning” is labeled as the noun x1, the probability that it is labeled as the predicate y2 is 0.95, i.e. p(y2|x1) = 0.95. The unbiased principle should still be followed so that the probability distribution is as even as possible. But how do we obtain a distribution that is as unbiased as possible?

Both practical experience and theoretical calculation tell us that, with no constraints at all, the uniform distribution is the maximum entropy distribution. With constraints, the maximum entropy distribution is not necessarily uniform: for example, given a fixed mean and variance, the distribution with the highest entropy is the normal distribution.
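A small grid search illustrates the point for the “learning” example: once the attributive probability is pinned at 0.05, the entropy is highest when the remaining 0.95 is split evenly among the other three labels. This is a brute-force sketch rather than how maximum entropy models are actually trained; the grid resolution `STEPS` and all variable names are assumptions of the example.

```python
import math


def entropy(probs):
    """Entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


P_Y4 = 0.05           # the known constraint from the example
REMAINING = 1 - P_Y4  # probability mass left for y1, y2 and y3
STEPS = 300           # grid resolution (an arbitrary choice)

best = None
for i in range(STEPS + 1):
    for j in range(STEPS + 1 - i):
        p1 = REMAINING * i / STEPS
        p2 = REMAINING * j / STEPS
        p3 = REMAINING - p1 - p2
        candidate = (entropy([p1, p2, p3, P_Y4]), (p1, p2, p3))
        if best is None or candidate[0] > best[0]:
            best = candidate

best_entropy, best_split = best
print(best_split)  # each entry is close to 0.95 / 3
```

The search lands on the even split, which is the “as unbiased as possible” distribution that still respects the known constraint.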

So the problem becomes: find the distribution of X and Y that maximizes H(Y|X) while satisfying the following conditions:

$$p(x_1) + p(x_2) = 1, \qquad \sum_{i=1}^{4} p(y_i) = 1, \qquad p(y_4) = 0.05, \qquad p(y_2|x_1) = 0.95$$

This leads to the essence of the maximum entropy model: the problem it solves is, given X, to compute the probability of Y and make that probability as large as possible (in practice, X might be the context of a word and Y the probabilities of that word being translated as “me”, “I”, “us” or “we”), so that the unknown is inferred as accurately as possible from the information available. That’s what the maximum entropy model is all about.

Put as a formula, computing Y given X amounts to maximizing the following expression for H(Y|X):

$$\max H(Y|X) = -\sum_{x \in \{x_1, x_2\}} \; \sum_{y \in \{y_1, y_2, y_3, y_4\}} p(x,y)\log p(y|x)$$

subject to the four constraints listed above: the two normalization conditions, p(y4) = 0.05 and p(y2|x1) = 0.95.

A quick word on the difference between supervised and unsupervised learning. Machine learning, ML basics, easy. Supervised learning: learn from labeled training samples so as to classify or predict data outside the training set as well as possible (LR, SVM, BP, RF, GBDT). Unsupervised learning: learn from unlabeled samples to discover the structure hidden in them (KMeans, DL).

 

Explain your understanding of regularization. Machine learning, ML basics, easy. Regularization is a response to overfitting. The optimal model is usually found by minimizing the empirical risk; regularization adds a model complexity term to the empirical risk (the regularization term is a norm of the model’s parameter vector) and uses a rate coefficient to weigh model complexity against the empirical risk. The more complex the model, the larger the resulting structural risk, and the objective now becomes minimizing the structural risk, which keeps the trained model from becoming overly complex and effectively reduces the risk of overfitting. This is Occam’s razor: the best model is one that explains the known data well and is also very simple.
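As a concrete illustration, consider a one-parameter model y ≈ w*x with a squared-error empirical risk and an L2 penalty lam * w^2. The data points, the value of `lam` and the function names below are made up for this sketch; for this tiny model the minimizer has the closed form w = Σxy / (Σx² + lam).

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of sum((w*x - y)^2) + lam * w^2 for the model y ~ w*x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def regularized_risk(w, xs, ys, lam):
    empirical_risk = sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    penalty = lam * w ** 2  # squared L2 norm of the one-element parameter vector
    return empirical_risk + penalty


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_plain = ridge_weight(xs, ys, lam=0.0)  # empirical risk only
w_reg = ridge_weight(xs, ys, lam=10.0)   # empirical risk plus penalty
print(w_plain, w_reg)
```

The regularized weight is pulled toward zero: the penalty trades a slightly worse fit on the training data for a simpler model.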

 

What is the difference between covariance and correlation? Machine learning, ML basics, easy

Correlation is the standardized form of covariance. Covariances themselves are hard to compare. For example, if we compute the covariance of salary ($) and age (years), the two variables are measured in different units, so the resulting covariance depends on those units and cannot be compared with other covariances.

To solve this problem, we compute the correlation, which yields a value between -1 and 1 regardless of the units of measurement.
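The unit dependence is easy to see in a few lines of Python. The age and salary figures are invented for the example, and the functions use the sample (n - 1) convention, which is one of several common choices.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance standardized by the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


ages = [25, 32, 41, 50, 58]
salary_usd = [40000, 52000, 61000, 75000, 80000]
salary_kusd = [s / 1000 for s in salary_usd]  # same data in thousands of dollars

print(covariance(ages, salary_usd), covariance(ages, salary_kusd))
print(correlation(ages, salary_usd), correlation(ages, salary_kusd))
```

Changing the salary unit rescales the covariance by a factor of 1000 but leaves the correlation unchanged.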

 

The difference between linear and nonlinear classifiers, and their advantages and disadvantages. Whether a model is linear or nonlinear is judged with respect to the model parameters and the input features. For example, if the input is x and the model is y = ax + bx^2, the model is nonlinear; if x and x^2 are both treated as inputs, the same model is linear. A linear classifier has good interpretability and low computational complexity, but its fitting ability is relatively weak. A nonlinear classifier has strong fitting ability, but it overfits easily when data is insufficient, has high computational complexity and poor interpretability. Common linear classifiers include LR, Bayesian classification, the single-layer perceptron and linear regression. Common nonlinear classifiers include decision trees, RF, GBDT and the multi-layer perceptron. SVM can be either, depending on whether it uses a linear kernel or a Gaussian kernel.

 

Logical data storage structures (such as arrays, queues, trees, etc.) have a very important influence on software development. Briefly analyze the storage structures you know in terms of running speed, storage efficiency and applicable occasions. Data structure/algorithm, medium

Structure | Running speed | Storage efficiency | Applicable occasions
Array | Fast | High | Well suited to lookups and to things like matrices
Linked list | Fairly fast | Fairly high | Well suited to frequent insert and delete operations and dynamic memory allocation
Queue | Fairly fast | Fairly high | Well suited to task scheduling and the like
Stack | Average | Fairly high | Well suited to rewriting recursive procedures
Binary tree (tree) | Fairly fast | Average | Any hierarchical problem can be described with a tree
Graph | Average | Average | Besides classic uses such as minimum spanning tree, shortest path and topological sort, also used in artificial intelligence, e.g. neural networks

 

What is a distributed database? Computer fundamentals, database, easy. A distributed database system is developed on top of the mature technology of centralized database systems, but it is not simply a decentralized implementation of a centralized database; it has properties and characteristics of its own. Many concepts and techniques of centralized database systems, such as data independence, data sharing and reduced redundancy, concurrency control, integrity, security and recovery, take on different and richer meanings in a distributed database system.

Specifically, a cluster file system is a file system that runs on multiple computers, which communicate with one another to integrate and virtualize all storage resources in the cluster and provide file access services to the outside. It differs from local file systems such as NTFS and EXT in purpose: the former is built for scalability, while the latter runs in a stand-alone environment and purely manages the mapping between blocks and files as well as file attributes.

Cluster file systems come in several types. By how storage space is accessed, they can be divided into shared-storage cluster file systems and distributed cluster file systems. In the former, multiple computers see the same storage space and coordinate with one another to manage the files on it; this is also called a shared file system. In the latter, each computer contributes its own storage space, and the nodes coordinate to manage the files across all of them. Veritas VxFS/VCS, Quantum StorNext, Zhongke Blue Whale BWFS and EMC MPFS are shared-storage cluster file systems, whereas HDFS, Gluster, Ceph, Swift and other large-scale cluster file systems commonly used on the Internet are all distributed cluster file systems. Distributed cluster file systems scale better; the largest known deployments reach about 10K nodes.

By how metadata is managed, they can be divided into symmetric and asymmetric cluster file systems. In the former, every node has an equal role and the nodes manage file metadata together, synchronizing information and enforcing mutual exclusion over a high-speed network; the typical representative is Veritas VCS. In an asymmetric cluster file system, one or more dedicated nodes manage the metadata, and the other nodes must communicate with the metadata nodes frequently to obtain the latest metadata, such as directory listings and file attributes; typical representatives are HDFS, GFS, BWFS and StorNext. A cluster file system can be distributed + symmetric, distributed + asymmetric, shared + symmetric or shared + asymmetric, that is, any combination of the two classifications.

By file access mode, cluster file systems can be divided into serial access and parallel access; the latter is commonly known as a parallel file system. Serial access means that a client can reach the cluster’s file resources only through one node of the cluster, while parallel access means that a client can send and receive data directly to and from any node, or several nodes at once, achieving parallel data access and higher speed. Cluster file systems such as HDFS, GFS and pNFS support parallel access, but they require a dedicated client; traditional NFS or CIFS clients do not support parallel access.

 

40 Briefly explain Bayes’ theorem. Machine learning, ML model, easy. Before introducing Bayes’ theorem, let’s go over a few definitions:

  • Conditional probability (also known as posterior probability) is the probability that event A occurs given that another event B has already occurred. It is written P(A|B) and read “the probability of A given B”.

For example, let events (subsets) A and B belong to the same sample space Ω. If an element drawn at random from Ω belongs to B, then the probability that it also belongs to A is defined as the conditional probability of A given B: P(A|B) = |A∩B| / |B|. Dividing the numerator and denominator by |Ω| gives P(A|B) = P(A∩B) / P(B).

  • Joint probability is the probability that two events occur together. The joint probability of A and B is written P(A∩B) or P(A,B).
  • Marginal probability (also known as prior probability) is the probability that a single event occurs. It is obtained as follows: in the joint probability, the events that are not needed in the final result are merged into their total probability and thereby eliminated (by summing for discrete random variables and by integrating for continuous ones); this is called marginalization. For example, the marginal probability of A is written P(A), and the marginal probability of B is written P(B).

Next, consider a question: P(A|B) is the probability of A given that B has occurred.

  1. First, before event B occurs, we have a basic judgment of the probability of event A, called the prior probability of A, written P(A);
  2. Second, after event B occurs, we reassess the probability of event A; this is called the posterior probability of A, written P(A|B);
  3. Similarly, before event A occurs, we have a basic judgment of the probability of event B, called the prior probability of B, written P(B);
  4. Likewise, after event A occurs, we reassess the probability of event B; this is called the posterior probability of B, written P(B|A).

Bayes’ theorem rests on the following Bayes’ formula:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

The derivation of this formula is actually very simple: it follows from conditional probability.

By the definition of conditional probability, the probability of event A given that event B has occurred is

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Similarly, the probability of event B given that event A has occurred is

$$P(B|A) = \frac{P(A \cap B)}{P(A)}$$

Rearranging and combining the two equations above, we obtain:

$$P(A|B)\,P(B) = P(A \cap B) = P(B|A)\,P(A)$$

Dividing both sides by P(B), provided P(B) is non-zero, gives the expression of Bayes’ theorem:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

Therefore, Bayes’ formula can be derived directly from the definition of conditional probability: since P(A,B) = P(A)P(B|A) = P(B)P(A|B), we have P(A|B) = P(A)P(B|A)/P(B). See this article for more information: From Bayesian Methods to Bayesian Networks.
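A short worked example of the formula P(A|B) = P(A)P(B|A)/P(B). The numbers are hypothetical: A is “has the condition” with prior 0.01, B is “tests positive”, with P(B|A) = 0.99 and P(B|not A) = 0.05; the function name is also an assumption of this sketch. P(B) is obtained by marginalizing over A and not A.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) from the law of total probability."""
    evidence = (
        likelihood_b_given_a * prior_a
        + likelihood_b_given_not_a * (1 - prior_a)
    )
    return likelihood_b_given_a * prior_a / evidence


posterior = bayes_posterior(
    prior_a=0.01,
    likelihood_b_given_a=0.99,
    likelihood_b_given_not_a=0.05,
)
print(posterior)  # roughly 1/6
```

Even with an accurate test, the posterior stays far below 0.99 because the prior P(A) is small, which is exactly the reassessment from prior to posterior described above.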

 

41 What is the difference between #include <filename.h> and #include "filename.h"? Easy. Use the #include <filename.h> form to include standard library headers (the compiler starts searching in the standard library directories). Use the #include "filename.h" form to include non-standard headers (the compiler starts searching in the user’s working directory).

42 A supermarket studies its sales records and finds that people who buy beer are very likely to buy diapers as well. What kind of data mining problem is this? (A) Data mining, DM model, easy. A. Association rule discovery B. Clustering C. Classification D. Natural language processing

43 Integration, transformation, dimensionality reduction and numerosity reduction of raw data are tasks of which of the following steps? (C) Data mining, DM basics, easy. A. Frequent pattern mining B. Classification and prediction C. Data preprocessing D. Data stream mining

44 Which of the following is not a data preprocessing method? (D) Data mining, DM basics, easy. A. Variable substitution B. Discretization C. Aggregation D. Estimation of missing values

45 What is KDD? (A) A. Data mining and knowledge discovery B. Domain knowledge discovery C. Document knowledge discovery D. Dynamic knowledge discovery

46 When the labels are not known, which technique can separate data with similar labels from the other data? (B) Data mining, DM model, easy. A. Classification B. Clustering C. Association analysis D. Hidden Markov chains

47 Building a model that uses the known values of some variables to predict the value of another variable belongs to which type of data mining task? (C) Data mining, DM basics, easy. A. Retrieval by content B. Descriptive modeling C. Predictive modeling D. Finding patterns and rules

48 Which of the following is not a standard method for feature selection? (D) Data mining, DM basics. A. Embedded B. Filter C. Wrapper D. Sampling

49 Write a find_string function in Python that searches the text for a pattern and prints the matches. The wildcards asterisk (*) and question mark (?) must be supported. Python, easy. Example:

    >>> find_string('hello\nworld\n', 'wor')
    ['wor']
    >>> find_string('hello\nworld\n', 'l*d')
    ['ld']
    >>> find_string('hello\nworld\n', 'o.')
    ['or']

Answer:

    import re

    def find_string(text, pat):
        # Treat pat as a regular expression and match case-insensitively.
        return re.findall(pat, text, re.I)

 

Let’s talk about the five properties of red-black trees. Data structure, tree, easy

A red-black tree is a binary search tree in which each node carries one extra storage bit for its color, which can be red or black.

By constraining how the nodes on any path from the root to a leaf are colored, a red-black tree ensures that no such path is more than twice as long as any other, so the tree is approximately balanced.

As a binary search tree, a red-black tree satisfies the general properties of binary search trees. Let’s review those first.

Binary search tree: a binary search tree (also called a sorted binary tree) is either an empty tree or a binary tree with the following properties:

If the left subtree of any node is not empty, the value of all nodes in the left subtree is less than the value of its root node.

If the right subtree of any node is not empty, the value of all nodes in the right subtree is greater than the value of its root node.

The left and right subtrees of any node are binary search trees respectively.

No duplicate nodes have the same key value.

Since the height of a binary search tree built at random from n nodes is lg n, the basic operations on such a tree take O(lg n) time. However, if the binary search tree degenerates into a linear chain of n nodes, the worst-case running time of these operations is O(n).

A red-black tree is a binary search tree in essence, but it adds coloring and the related properties on top of the binary search tree so that the tree stays relatively balanced. This guarantees that search, insertion and deletion in a red-black tree take O(log n) time in the worst case.

But how does it guarantee that the height of a red-black tree with n nodes always stays at O(log n)? This leads to the five properties of red-black trees:

  1. Every node is either red or black.
  2. The root is black.
  3. Every leaf (a leaf here means a NIL pointer or NULL node at the bottom of the tree) is black.
  4. If a node is red, then both of its children are black.
  5. For any node, every path from that node down to a NIL leaf contains the same number of black nodes.

It is these five properties that keep the height of a red-black tree with n nodes at O(log n), which explains the conclusion that “search, insertion and deletion in a red-black tree take O(log n) time in the worst case”. See this article for more information: “Give you an idea of red black trees”.
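The five properties can be turned into a small checker. This is a sketch with assumed names (`Node`, `black_height`, `is_red_black`); it validates only the coloring rules and does not check binary search tree ordering or implement insertion and deletion. `None` plays the role of a NIL leaf.

```python
RED, BLACK = "red", "black"


class Node:
    def __init__(self, key, color, left=None, right=None):
        self.key, self.color, self.left, self.right = key, color, left, right


def black_height(node):
    """Return the black height of a subtree, or -1 if a coloring rule is broken."""
    if node is None:
        return 1  # NIL leaves count as black
    if node.color not in (RED, BLACK):  # every node is red or black
        return -1
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:  # red node, red child
                return -1
    left, right = black_height(node.left), black_height(node.right)
    if left == -1 or right == -1 or left != right:  # unequal black counts
        return -1
    return left + (1 if node.color == BLACK else 0)


def is_red_black(root):
    if root is not None and root.color != BLACK:  # the root must be black
        return False
    return black_height(root) != -1


valid = Node(10, BLACK, Node(5, RED), Node(15, RED))
red_root = Node(10, RED)
red_red = Node(10, BLACK, Node(5, RED, Node(2, RED)), Node(15, BLACK))
unbalanced = Node(10, BLACK, Node(5, BLACK), None)
print(is_red_black(valid), is_red_black(red_root),
      is_red_black(red_red), is_red_black(unbalanced))
```

Only the first tree passes; the others break the root rule, the red-child rule and the equal-black-count rule respectively.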

 

A brief description of the sigmoid activation function. Deep learning, DL basics, easy

Commonly used nonlinear activation functions include sigmoid, tanh and ReLU. The first two, sigmoid and tanh, are more common in fully connected layers, while ReLU is more common in convolutional layers. Here is a brief introduction to the most basic one, the sigmoid function (which, by the way, is mentioned at the beginning of the SVM article on this blog).

The sigmoid function is defined as

$$g(z) = \frac{1}{1 + e^{-z}}$$

where z is a linear combination, for example z = b + w1*x1 + w2*x2. Substituting a very large positive number or a very small negative number into g(z) shows that the result tends to 1 or 0 respectively.

Thus, the graph of the sigmoid function g(z) looks as follows (the horizontal axis is the domain z and the vertical axis is the range g(z)):

In other words, the sigmoid function is equivalent to squeezing a real number between 0 and 1. When z is a very large positive number, g of z tends to 1, and when z is a very small negative number, g of z tends to 0.

What’s the use of squeezing from zero to one? The usefulness of this is that the activation function can then be regarded as a “probability of classification”, such that an output of 0.9 of the activation function can be interpreted as a 90% probability of positive samples.

For example, here’s the picture (from the Stanford Machine Learning Open Course)

 

    z = b + w1*x1 + w2*x2, where b is the bias term, assumed to be -30, and w1 and w2 are both taken to be 20

  • If x1 = 0 and x2 = 0, then z = -30, and g(z) = 1/(1 + e^-z) tends to 0. This can also be seen from the graph of the sigmoid function above: when z = -30, g(z) tends to 0.
  • If x1 = 0 and x2 = 1, or x1 = 1 and x2 = 0, then z = b + w1*x1 + w2*x2 = -30 + 20 = -10, and again g(z) tends to 0.
  • If x1 = 1 and x2 = 1, then z = b + w1*x1 + w2*x2 = -30 + 20*1 + 20*1 = 10, and g(z) tends to 1.

In other words, only when x1 and x2 are both 1 does g(z) → 1, and the input is judged to be a positive sample; when x1 or x2 is 0, g(z) → 0 and the input is judged to be a negative sample, which achieves the purpose of classification.

To sum up, the sigmoid function is the squashing function of logistic regression. Its property is that it compresses the value of the separating plane to a number (or vector) in the interval [0, 1]. When the value of the linear separating plane is 0, the corresponding sigmoid value is exactly 0.5; a value greater than 0 corresponds to a sigmoid value greater than 0.5, and a value less than 0 corresponds to a sigmoid value less than 0.5, so 0.5 can be used as the classification threshold. The exp form is convenient when solving for extrema, and using the product form as the logistic loss function makes the loss function convex. The drawback is that the sigmoid function has a dead zone when its output approaches 0 or 1, so if it is not handled carefully, the gradient easily vanishes when the loss is propagated back in BP.
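The example with b = -30 and both weights equal to 20 can be reproduced in a few lines. The function names `sigmoid` and `and_unit` are assumptions of this sketch; the two-branch form of `sigmoid` is one common way to avoid overflow for large |z|.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), evaluated without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def and_unit(x1, x2, b=-30.0, w1=20.0, w2=20.0):
    """One sigmoid neuron with the bias and weights used in the example."""
    return sigmoid(b + w1 * x1 + w2 * x2)


outputs = {(x1, x2): and_unit(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, value in outputs.items():
    print(inputs, round(value, 5))
```

Only the input (1, 1) yields an output near 1, so with a threshold of 0.5 this single unit behaves like a logical AND.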

 

What is convolution? Deep learning, DL basics, easy

Taking the inner product (element-wise multiplication followed by summation) of image data (different data windows) with a filter matrix (a set of fixed weights; because the weights of each neuron are fixed, it can be viewed as a constant filter) is the so-called “convolution” operation, and it is where the name convolutional neural network comes from.

Loosely speaking, the red box in the figure below can be thought of as a filter, that is, a set of neurons with fixed weights. Multiple filters stacked together form a convolutional layer.

OK, let’s take a concrete example. In the figure below, the left part is the original input data, the middle part is the filter, and the right part is the new two-dimensional output data.

Breaking this down: the numbers at corresponding positions are multiplied and then summed.

The middle filter takes the inner product with the data window; the specific calculation is: 4*0 + 0*0 + 0*0 + 0*0 + 0*1 + 0*1 + 0*0 + 0*1 + -4*2 = -8
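The same arithmetic can be written as code. The 3×3 `kernel` and `window` below are read off the products listed above, assuming the first factor of each product is the filter weight and the second is the data value; the surrounding 4×4 `image` and the function names are made up for this sketch. As in most CNN libraries, the filter is slid without flipping it.

```python
def window_inner_product(window, kernel):
    """Multiply matching entries and add them up."""
    return sum(
        w * k
        for window_row, kernel_row in zip(window, kernel)
        for w, k in zip(window_row, kernel_row)
    )


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            window = [image_row[c:c + kw] for image_row in image[r:r + kh]]
            row.append(window_inner_product(window, kernel))
        out.append(row)
    return out


kernel = [[4, 0, 0], [0, 0, 0], [0, 0, -4]]
window = [[0, 0, 0], [0, 1, 1], [0, 1, 2]]
image = [[0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 2, 2], [0, 1, 2, 1]]

print(window_inner_product(window, kernel))  # one data window
print(convolve2d_valid(image, kernel))       # every window of the small image
```

Each entry of the output is one such inner product, computed for one position of the data window.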

 

53 What is the pooling layer of a CNN? Deep learning, DL model, easy

Pooling, in short, means taking the average or the maximum over a region, as shown in the figure below (from CS231n)

The figure above shows max pooling: in the left part of the figure, 6 is the largest value in the 2×2 block in the upper left corner, 8 is the largest in the 2×2 block in the upper right corner, 3 is the largest in the 2×2 block in the lower left corner, and 4 is the largest in the 2×2 block in the lower right corner. This gives the result in the right part of the figure: 6, 8, 3 and 4. Easy, isn’t it?
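Max pooling over 2×2 blocks takes only a few lines. The 4×4 input below is an assumption, chosen so that its block maxima are 6, 8, 3 and 4 as in the description; the function name is likewise made up for the sketch.

```python
def max_pool_2x2(matrix):
    """2x2 max pooling with stride 2; assumes even height and width."""
    return [
        [
            max(matrix[r][c], matrix[r][c + 1],
                matrix[r + 1][c], matrix[r + 1][c + 1])
            for c in range(0, len(matrix[0]), 2)
        ]
        for r in range(0, len(matrix), 2)
    ]


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Average pooling would replace `max` with the mean of the four values in each block.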

 

Briefly describe what generative adversarial networks are. Deep learning, DL extension, medium. A GAN is called adversarial because of an internal competition between its two parts. One is the generator, whose main job is to generate images that look as much as possible like they come from the training samples. The other is the discriminator, whose goal is to judge whether an input image is a real training sample. Put more plainly, think of the generator as a counterfeiter and the discriminator as a police officer. The generator’s aim is to make counterfeit money as realistic as possible, so that it fools the discriminator by producing samples that look as if they came from the real training data.

The two scenarios, left and right, are shown below:

See this course for more information: Generative Adversarial Network Class.

 

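What is the principle behind learning to paint like Van Gogh? For a hands-on walkthrough, see the tutorial on painting in Van Gogh’s style with deep learning (GTX 1070, CUDA 8.0, TensorFlow GPU version); for the underlying principle, see: NeuralStyle artistic images (the principles behind learning Van Gogh’s paintings).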

Given the 26 elements a through z, write a program that prints every combination of three of them (for example, print a b c, d y z, and so on). Mathematical logic, permutations and combinations. For an analysis, see: blog.csdn.net/lvonve/arti…
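One way to approach this in Python is the standard library’s `itertools.combinations`, which yields every unordered selection exactly once. This is only a sketch and not necessarily the approach taken in the referenced analysis; the function name is an assumption.

```python
from itertools import combinations
from string import ascii_lowercase


def three_letter_combinations(letters=ascii_lowercase):
    """All ways to choose 3 distinct letters, ignoring order."""
    return ["".join(combo) for combo in combinations(letters, 3)]


combos = three_letter_combinations()
print(len(combos))  # C(26, 3)
print(combos[:3], combos[-1])
```

There are C(26, 3) = 2600 combinations, so the sketch prints only the count and a few samples rather than the whole list.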

 

Talk about gradient descent. Machine learning, ML basics

@LeftNotEasy, source of this analysis: www.cnblogs.com/LeftNotEasy…

We use x1, x2, …, xn to describe the components of the feature vector, for example x1 = the area of the room, x2 = the orientation of the room, and so on. We can then write an estimation function:

$$h(x) = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here θ is called the parameter; it adjusts the influence of each component of the feature, that is, whether the area of the house or the lot of the house matters more. If we set x0 = 1, the function can be written in vector form:

$$h_\theta(x) = \theta^T x$$

We also need a mechanism to evaluate whether a given θ is good, so we need to evaluate the h function we have built. The function used for this evaluation is generally called the loss function; it describes how bad the h function is. Below we call it the J function.

Here we can use the following loss function, where m is the number of training samples:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2, \qquad \min_\theta J(\theta)$$

In other words, the loss is the sum of the squared differences between the estimate for x^(i) and the true value y^(i), and the factor 1/2 is there so that the coefficient disappears when we take the derivative.

There are many ways to adjust θ so that J(θ) reaches its minimum, including least squares, which is a purely mathematical approach, and gradient descent.

The algorithm flow of gradient descent method is as follows:

1) First assign a value to θ, either randomly, or let θ be an all zero vector.

2) Change the value of θ so that J(θ) decreases in the direction of gradient descent.

To make the description clearer, the following figure is given:

This is a plot of the error function J(θ) against the parameters θ. The red regions are where J(θ) is high; what we want is to make J(θ) as low as possible, which is the dark blue region. θ0 and θ1 are the two dimensions of the θ vector.

The first step of the gradient descent method described above is to give θ an initial value. Suppose the randomly chosen initial value is the point marked with a cross on the plot.

Then we adjust θ along the direction of gradient descent, which moves J(θ) toward lower values, as shown in the figure below. The algorithm stops when θ has descended to a point from which it cannot descend any further.

Of course, the point where gradient descent ends may not be the global minimum; it may be a local minimum, as shown in the figure below:

The figure above shows a local minimum that was reached after choosing a different initial point. So whether the algorithm gets stuck in a local minimum depends to a large extent on the choice of the initial point.

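Below I use an example to describe the gradient descent process. First take the partial derivative of the function J(θ) with respect to each component θ_j:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The update rule follows, in which θ_j moves in the direction in which J decreases fastest:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

The θ_j on the right-hand side is the value before the update, the term after it is the amount by which θ_j decreases along the gradient direction, and α is the step size, that is, how far each step moves in the descent direction.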

One important point deserves attention: the gradient has a direction. For a vector θ, a partial derivative can be computed for each component θ_j, and together these give one overall direction. When we update θ, moving along the direction of steepest decrease eventually reaches a minimum, whether it is local or global.
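The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions: a linear estimation function with x0 = 1, the loss J(θ) = 1/2 Σ (h(x) − y)², and a toy data set, step size and iteration count chosen here for the example (none of them come from the original post).

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already includes x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    """Minimize J(theta) = 1/2 * sum((h_theta(x) - y)^2) by gradient descent."""
    theta = [0.0] * len(xs[0])           # step 1: start from the all-zero vector
    for _ in range(iterations):
        # partial derivative of J with respect to each component theta_j
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            error = predict(theta, x) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj
        # step 2: move against the gradient, scaled by the step size alpha
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 1 + 2 * x1 (the leading 1.0 is x0).
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(xs, ys)
print(theta)   # should be close to [1.0, 2.0]
```

Because this loss is convex, the starting point does not matter here; on a non-convex surface a different start could end in a different local minimum, as the figures describe.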

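In simpler mathematical language, step 2) looks like this:

$$\theta := \theta - \alpha \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) = \left[ \frac{\partial J}{\partial \theta_0}, \dots, \frac{\partial J}{\partial \theta_n} \right]^T$$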

   

 

Does gradient descent always find the fastest direction of descent? Machine learning ML fundamentals

The gradient descent direction is not necessarily the fastest descent direction. It is only the direction in which the objective function decreases fastest within the tangent plane at the current point (for high-dimensional problems it is of course not literally a plane). In practice, the Newton direction (which takes the Hessian matrix into account) is generally considered the fastest descent direction and can reach superlinear convergence. The convergence rate of gradient descent is generally linear, or even sublinear (for some problems with complex constraints). By Lin Xiaoxi (www.zhihu.com/question/30…)

Gradient descent is usually explained by walking downhill. Suppose you are at the top of a mountain and have to reach the lake at the bottom (the lowest part of the valley). The trouble is that you are blindfolded and cannot tell where you are going. In other words, you cannot see which path down the mountain is fastest, as in the figure below (photo: http://blog.csdn.net/wemedia/details.html?id=45460):

The best approach is to take one step at a time. Probe the ground in every direction with your foot and feel which direction drops the most. In other words, each time you reach a position, compute the gradient at the current position and take a step in the negative gradient direction (downhill along the steepest slope from where you stand). In this way every step starts from the position reached by the previous step, picks the steepest and fastest downhill direction from there, and moves on, step by step, until we feel that we have reached the foot of the mountain. Of course, we may not actually reach the foot of the mountain, only a local low point. In other words, gradient descent may not find the global optimal solution; it may find only a local optimal solution. If the loss function is convex, however, the solution found by gradient descent is guaranteed to be globally optimal.

 

 

@ZBXZC (blog.csdn.net/u014568921/…)





The error over the training set is defined as

$$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

In the formula above, D represents the set of all input instances (the samples), d represents a single sample instance, o_d represents the output of the perceptron, and t_d represents the expected output.

So our goal is to find a set of weights that minimizes the error. An obvious approach is to take the derivative of the error with respect to the weights: the derivative provides a direction, and changing the weights along that direction makes the total error larger. This direction is more vividly called the gradient.







Since gradient determines the direction of E’s steepest ascent, the training rule for gradient descent is:

Gradient ascent and gradient descent are the same idea. Flipping the sign in front of the gradient term in the weight update above turns gradient descent into gradient ascent. Gradient ascent is used to maximize a function and gradient descent to minimize it.

So the direction of each move is determined, but the distance of each move is not. The distance is set by the step size (also known as the learning rate), denoted by α. Thus the weight adjustment can be expressed as:

$$w \leftarrow w - \alpha \nabla E(w)$$

In short, the optimization idea of gradient descent is to use the negative gradient direction at the current position as the search direction. Because this is the direction of fastest descent at the current position, the method is also known as the "steepest descent method". The closer steepest descent gets to the target value, the smaller the step and the slower the progress. The search iterations of gradient descent are shown in the figure below:

Because gradient descent converges noticeably more slowly near the optimal solution, solving a problem with it requires many iterations. In machine learning, two variants have been developed from the basic method: stochastic gradient descent and batch gradient descent. By @wtq1993, blog.csdn.net/wtq1993/art…

 

 

58 Stochastic gradient descent

The ordinary gradient descent algorithm traverses the whole data set every time it updates the regression coefficients; it is a batch method. When the training data set is extremely large, the following problems may occur:

1) Convergence may be very slow;

2) If there are multiple local minima on the error surface, there is no guarantee that the process will find the global minimum.

To solve this problem, in practice we use a variant of gradient descent called stochastic gradient descent.

The error in the above formula is obtained for all training samples, while the idea of stochastic gradient descent is to update the weight according to each individual training sample, so our gradient formula above becomes:

After deduction, we can get the final weight update formula:

 

With the weight update formula above, we can feed in a large number of sample instances and keep adjusting the weights toward the expected outputs. The result is a set of weights with which the algorithm gives a correct, or arbitrarily close, result for a new input sample.

Here’s a comparison

Let the cost function be

 

 

Batch gradient descent

 

 

Parameter updated to:

         

Here i is the index of the sample, j is the index of the feature dimension, m is the number of samples, and n is the number of features. So updating a single θj requires traversing the entire sample set.

 

Stochastic gradient descent

 

Parameter updated to:

        

 

Here i is the index of the sample, j is the index of the feature dimension, m is the number of samples, and n is the number of features. So updating a single θj takes only one sample.
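The update formulas for the two variants did not survive in this copy, so the following Python sketch is an assumption rather than a transcription: it takes a linear model and the cost J(θ) = 1/(2m) Σ (h(x) − y)², and contrasts one batch step (all m samples per update) with one stochastic step (a single sample per update). The data, step sizes and function names are illustrative.

```python
import random


def h(theta, x):
    """Linear hypothesis theta^T x, with x0 = 1 included in x."""
    return sum(t * xi for t, xi in zip(theta, x))


def batch_step(theta, xs, ys, alpha):
    """One batch update: every theta_j is computed from all m samples."""
    m = len(xs)
    errors = [h(theta, x) - y for x, y in zip(xs, ys)]
    return [
        t - alpha * sum(e * x[j] for e, x in zip(errors, xs)) / m
        for j, t in enumerate(theta)
    ]


def sgd_step(theta, x, y, alpha):
    """One stochastic update: every theta_j is computed from one sample."""
    error = h(theta, x) - y
    return [t - alpha * error * xj for t, xj in zip(theta, x)]


xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]            # generated from y = 1 + 2 * x1

theta_batch = [0.0, 0.0]
for _ in range(2000):
    theta_batch = batch_step(theta_batch, xs, ys, alpha=0.1)

rng = random.Random(0)               # fixed seed so the run is repeatable
theta_sgd = [0.0, 0.0]
for _ in range(2000):
    i = rng.randrange(len(xs))       # pick one sample at random
    theta_sgd = sgd_step(theta_sgd, xs[i], ys[i], alpha=0.05)

print(theta_batch, theta_sgd)        # both should approach [1.0, 2.0]
```

Each stochastic step costs a fraction of a batch step, which is why it is attractive on large data sets; on noisy data its path would fluctuate around the minimum rather than settle exactly.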

 

The following two figures give a vivid comparison of various optimization methods (image source: sebastianruder.com/optimizing-…)

Performance of SGD optimization methods on a loss surface

As can be seen from the figure above, Adagrad, Adadelta and RMSprop head in the right direction almost immediately and converge quickly on the loss surface. Momentum and NAG go off track at first. NAG, however, corrects its course quickly after deviating, because it improves responsiveness by correcting the gradient based on a look-ahead.

Performance of SGD optimization methods at saddle point of loss surface

 

59 What is the difference between Newton's method and gradient descent? Machine learning ML fundamentals

@wtq1993, blog.csdn.net/wtq1993/art…

1) Newton's method

Newton's method is a method for approximately solving equations over the real and complex numbers. It uses the first few terms of the Taylor series of the function f(x) to find the roots of the equation f(x) = 0. The greatest strength of Newton's method is that it converges quickly.

Specific steps:

First, choose an x0 close to a zero of the function f(x), and compute f(x0) and the slope of the tangent line f'(x0) (where f' denotes the derivative of f). Then compute the x-coordinate of the point where the line through (x0, f(x0)) with slope f'(x0) crosses the x-axis, that is, solve the following equation:

$$0 = (x - x_0) f'(x_0) + f(x_0)$$

We call the x-coordinate of this new point x1. In general x1 is closer than x0 to the solution of f(x) = 0, so we can start the next iteration from x1. The iteration can be written compactly as:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

It has been proven that if f' is continuous and the zero being sought is isolated, then there is a neighborhood around the zero such that Newton's method is guaranteed to converge whenever the initial value x0 lies in that neighborhood. Moreover, if f'(x) is not zero at the zero, Newton's method converges quadratically. Roughly speaking, this means that the number of correct significant digits doubles with each iteration.
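The iteration x_{n+1} = x_n − f(x_n) / f'(x_n) can be sketched in a few lines of Python. The example function f(x) = x² − 2, the starting point and the tolerance are choices made here for illustration only.

```python
def newton(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Find a root of f by Newton's method, starting from x0."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:            # close enough to a zero of f
            break
        x = x - fx / f_prime(x)      # intersect the tangent line with the x-axis
    return x


# Example: the positive root of f(x) = x^2 - 2, i.e. the square root of 2.
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)
```

Starting from x0 = 1 the iterates are 1.5, 1.41666..., 1.4142156..., which shows the roughly doubling number of correct digits. The sketch does not guard against f'(x) = 0 or against a poor starting point.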

Because Newton’s method is based on the tangent line of the current position to determine the next position, Newton’s method is also vividly called the “tangent line method.” The search path of Newton’s method (two-dimensional case) is shown in the figure below:

Comparison of efficiency between Newton method and gradient descent method:

A) In terms of convergence speed, Newton's method converges at second order while gradient descent converges at first order, so Newton's method is faster. However, Newton's method is still a local algorithm; it simply looks at the local behavior in more detail. Gradient descent considers only the direction, whereas Newton's method considers both the direction and the size of the step.

B) According to the explanation on Wikipedia, geometrically speaking Newton's method fits the local surface at the current position with a quadratic surface, while gradient descent fits it with a plane. A quadratic surface usually fits better than a plane, so the descent path chosen by Newton's method follows the true optimal path more closely.

Note: The red iteration path of Newton’s method and the green iteration path of gradient descent method.

Summary of advantages and disadvantages of Newton method:

Advantages: second order convergence, fast convergence;

Disadvantages: Newton method is an iterative algorithm, and every step needs to solve the inverse matrix of the Hessian matrix of the objective function, so the calculation is complicated.

What are Quasi-Newton Methods? Machine learning ML fundamentals

@wtq1993, blog.csdn.net/wtq1993/art…

The quasi-Newton method is one of the most effective methods for solving nonlinear optimization problems. It was proposed in the 1950s by W. C. Davidon, a physicist at Argonne National Laboratory in the United States. At the time, Davidon's algorithm was regarded as one of the most creative inventions in nonlinear optimization. Before long, R. Fletcher and M. J. D. Powell showed that the new algorithm was much faster and more reliable than other methods, and the field of nonlinear optimization leapt forward almost overnight.

The essential idea of the quasi-Newton method is to remedy the drawback that Newton's method must compute the inverse of a complicated Hessian matrix at every step. It uses a positive definite matrix to approximate the inverse of the Hessian, which reduces the computational cost. Like steepest descent, a quasi-Newton method only requires the gradient of the objective function at each iteration. By measuring how the gradient changes, it builds a model of the objective function that is good enough to produce superlinear convergence. Such methods are far superior to steepest descent, especially on difficult problems. In addition, because quasi-Newton methods do not need second-derivative information, they are sometimes more efficient than Newton's method. Today, optimization software contains a large number of quasi-Newton algorithms for solving unconstrained, constrained, and large-scale optimization problems.

Specific steps:

The basic idea of the quasi-Newton method is as follows. First, construct a quadratic model of the objective function at the current iterate x_k:

$$m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p$$

Here B_k is a symmetric positive definite matrix. We take the minimizer of this quadratic model as the search direction and obtain a new iterate:

$$p_k = -B_k^{-1} \nabla f(x_k), \qquad x_{k+1} = x_k + \alpha_k p_k$$

where the step size α_k is required to satisfy the Wolfe conditions. This iteration is similar to Newton's method, except that the approximate Hessian matrix B_k is used instead of the true Hessian matrix. So the most critical part of the quasi-Newton method is how the matrix B_k is updated in each iteration. Now suppose we have obtained a new iterate x_{k+1} and a new quadratic model:

$$m_{k+1}(p) = f(x_{k+1}) + \nabla f(x_{k+1})^T p + \frac{1}{2} p^T B_{k+1} p$$

We use as much information from the previous step as possible to choose B_{k+1}. Specifically, we require

$$\nabla f(x_{k+1}) - \nabla f(x_k) = \alpha_k B_{k+1} p_k$$

which gives

$$B_{k+1} (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$$

This formula is called the secant equation. The DFP and BFGS algorithms are common quasi-Newton methods.
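In one dimension the secant equation fixes B_{k+1} completely, which makes for a compact illustration. The Python sketch below is a deliberate simplification and not DFP or BFGS: it assumes a scalar variable, a fixed step size of 1 instead of a Wolfe line search, and a made-up function name `quasi_newton_1d`.

```python
def quasi_newton_1d(grad, x0, b0=1.0, tol=1e-10, max_iter=100):
    """Minimize a 1-D function using only its derivative `grad`."""
    x, b = x0, b0                     # b plays the role of B_k (a 1x1 "matrix")
    g = grad(x)
    for _ in range(max_iter):
        if abs(g) < tol:
            break
        x_new = x - g / b             # p_k = -B_k^{-1} grad f(x_k), step size 1
        if x_new == x:                # no further progress is representable
            break
        g_new = grad(x_new)
        # secant equation: B_{k+1} (x_{k+1} - x_k) = grad f(x_{k+1}) - grad f(x_k)
        b_new = (g_new - g) / (x_new - x)
        if b_new > 0:                 # keep the curvature estimate positive
            b = b_new
        x, g = x_new, g_new
    return x


# Example: f(x) = x^4 / 4 + x^2 / 2 - x, whose derivative is x^3 + x - 1.
x_min = quasi_newton_1d(lambda x: x ** 3 + x - 1, x0=0.0)
print(x_min)
```

No second derivative is ever evaluated; the curvature estimate comes only from how the gradient changed between two iterates, which is the point of the method.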

 

What are the problems and challenges of stochastic gradient descent? Machine learning ML fundamentals



So how can stochastic gradient descent be improved? For details, see: Detailed analysis of gradient descent optimization algorithms (video and PPT download).

What about the conjugate gradient method? Machine learning ML fundamentals

@wtq1993, blog.csdn.net/wtq1993/art…

The conjugate gradient method lies between steepest descent and Newton's method. It uses only first-derivative information, yet it overcomes the slow convergence of steepest descent and avoids the need in Newton's method to store and compute the Hessian matrix and its inverse. The conjugate gradient method is not only one of the most useful methods for solving large systems of linear equations, it is also one of the most efficient algorithms for large-scale nonlinear optimization. Among the various optimization algorithms, the conjugate gradient method is a very important one. Its advantages are small storage requirements, finite-step convergence, high stability, and no need for any external parameters.

The following figure shows the path comparison between conjugate gradient method and gradient descent method for searching optimal solutions:

 

Note: green is gradient descent method, red is conjugate gradient method
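As a small illustration of the linear-system use mentioned above, here is a plain Python sketch of the conjugate gradient iteration for A x = b, assuming A is symmetric positive definite. The helper names and the 2×2 example are chosen for this sketch; a real implementation would use an array library.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-12, max_iter=None):
    """Solve A x = b for a symmetric positive definite matrix A."""
    n = len(b)
    max_iter = max_iter or n          # at most n steps in exact arithmetic
    x = [0.0] * n
    r = list(b)                       # residual b - A x for the start x = 0
    p = list(r)                       # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)   # exact step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # next direction: new residual made conjugate to the previous direction
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print(x)   # close to [1/11, 7/11]
```

Only matrix-vector products and a few vectors are needed, which is where the small storage requirement comes from.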

 

For all optimization problems, is it possible to find a better algorithm than the one now known? Machine learning ML fundamentals

@Abstract monkey, source: www.zhihu.com/question/41…

No free lunch theorem:

For the training samples (black dots), different algorithms A and B perform differently on different test samples (white dots). This means that for a learning algorithm A, if it is better than learning algorithm B on some problems, there must be other problems on which B is better than A.

That is: averaged over all problems, no matter how smart learning algorithm A is and how clumsy learning algorithm B is, they have the same expected performance.

However, the no free lunch theorem assumes that all problems occur with the same probability. In practice, different scenarios have different problem distributions. Therefore, when optimizing an algorithm, analyzing the specific problem is the core of the work.

 

What is the least squares method? Machine learning ML fundamentals

In everyday speech we often say "generally speaking" or "on average". For example: on average, non-smokers are healthier than smokers. The word "average" is used because there are exceptions to everything; there is always someone who smokes but, thanks to regular exercise, is healthier than his non-smoking friends. One of the simplest examples of least squares is the arithmetic mean.

The least squares method (also known as the method of least squares) is a mathematical optimization technique. It finds the best function match for the data by minimizing the sum of squared errors. With least squares, unknown quantities can be obtained easily such that the sum of squared errors between the fitted values and the actual data is minimized. Expressed as a function:

The method of finding an estimate by minimizing the sum of squared errors (the error being, of course, the difference between the observed value and the actual value) is called the least squares method, and the estimate obtained this way is called the least squares estimate. Of course, taking the sum of squares as the objective function is only one of many possible choices.

The general form of the least square method can be expressed as:

 

The effective least square method was published by Legendre in 1805. The basic idea is that there is an error in measurement, so the cumulative error of all equations is

 

We can solve the parameter leading to the minimum cumulative error:

 

 

Legendre explained the advantages of the least square method in his paper:

  • The least square method minimizes the sum of the squares of errors and establishes a balance between the errors of the various equations, thus preventing any one extreme error from dominating
  • In the calculation, only the partial derivative is required to solve the linear equations, and the calculation process is clear and convenient
  • The least square results in the arithmetic mean as an estimate

This last point is a statistically important property. The reasoning is as follows: suppose the true value is θ, x1, ..., xn are the values of n measurements, and the error of each measurement is e_i = x_i − θ. According to the least squares method, the accumulated error is

$$L(\theta) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (x_i - \theta)^2$$

Solving for the θ that minimizes L(θ) gives exactly the arithmetic mean:

$$\hat{\theta} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
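The claim can be checked numerically. The Python sketch below uses five made-up measurements and a brute-force grid search (both are assumptions for illustration) and compares the grid minimizer of the accumulated squared error with the arithmetic mean.

```python
def sum_squared_error(theta, xs):
    """Accumulated error: sum of (x_i - theta)^2 over all measurements."""
    return sum((x - theta) ** 2 for x in xs)


measurements = [9.8, 10.1, 10.0, 10.3, 9.9]
mean = sum(measurements) / len(measurements)

# Brute-force search over a fine grid between the smallest and largest value.
lowest = min(measurements)
candidates = [lowest + k * 0.001 for k in range(501)]
best = min(candidates, key=lambda t: sum_squared_error(t, measurements))

print(mean, best)   # the two values agree up to the grid spacing
```

Analytically, setting the derivative −2 Σ (x_i − θ) to zero gives the same answer without any search.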

Since the arithmetic mean is a time-tested method, and the reasoning above shows that the arithmetic mean is a special case of least squares, this demonstrates the merit of the least squares method from another angle and gives us more confidence in it. One of the principles behind the least squares method: when the estimation error is normally distributed, least squares is equivalent to maximum likelihood estimation. Let y = f(x) + e, where y is the target, f(x) is the estimate, and e is the error term; if e is normally distributed, the two estimates coincide. For details see: www.zhihu.com/question/20…

The least squares method was quickly accepted and widely used in data analysis practice. However, history has sometimes credited Gauss with inventing it, and here is how that came about. Gauss also published the method of least squares in 1809 and claimed to have been using it for years. Gauss invented a mathematical method for locating asteroids, used least squares in the data analysis, and accurately predicted the position of Ceres. By the way, what does the least squares method have to do with SVM? See Popular Introduction to Support Vector Machines (Understanding the Three-tier realm of SVM).

64 Your T-shirt says "Life is short, I use Python". Can you tell me what kind of language Python is? You can answer by comparing it with other technologies or languages. Python, Python basics, easy. @David 9, nooverfit.com/wp/15%E4%B8…

Here are some key points:

  • Python is an interpreted language. This means that, unlike C and similar languages, Python code does not need to be compiled before it runs. Other interpreted languages include PHP and Ruby.
  • Python is dynamically typed, which means you don't need to specify a type when you declare a variable. You can first write x = 111 and then x = "I'm a string".
  • Python is an object-oriented language, so classes can be defined and combined through inheritance and composition. Python has no access specifiers such as public and private in C++; the design trusts programmers, on the assumption that every programmer is an "adult".
  • In Python, functions are first-class citizens. This means they can be assigned to variables, returned from other functions, and passed as arguments. Classes are also first-class citizens.
  • Writing Python code is fast, but it runs slower than compiled languages. Fortunately, Python allows C extensions, so bottlenecks can be dealt with. The NumPy library is a good example: much of its code is not written in Python itself, so it runs quickly.
  • Python is used in many scenarios: web application development, big data applications, data science, artificial intelligence, and more. It is also often seen as a "glue" language that joins components written in different languages.
  • Python makes difficult things easy, so programmers can focus on designing algorithms and data structures rather than on low-level implementation details.
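A few of the points above can be seen directly in code. The following short Python sketch is illustrative only; the names `apply_twice`, `add_one`, `make_instance` and `Point` are made up for the example.

```python
# Dynamic typing: the same name can be bound to values of different types.
x = 111
x = "I'm a string"


# Functions are first-class: they can be assigned, passed and returned.
def add_one(n):
    return n + 1


def apply_twice(func, value):
    """Call func on value, then call func again on the result."""
    return func(func(value))


increment = add_one                   # bind the function object to a new name
result = apply_twice(increment, 5)    # pass the function as an argument


# Classes are first-class too: a class can be passed around like any value.
class Point:
    def __init__(self, px, py):
        self.px = px
        self.py = py


def make_instance(cls, *args):
    """Build an instance of whatever class object is handed in."""
    return cls(*args)


p = make_instance(Point, 1, 2)
print(type(x).__name__, result, p.px, p.py)
```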

July: Python has become the first language of the AI age. To help you learn Python, data analysis and web crawlers, July Online has launched a series of Python courses. If you need them, check out the Python Data Analysis Training Camp.

 

How does Python manage memory? Python, Python basics. @tom_junsong, source: www.cnblogs.com/tom-gao/p/6…

The answer has three parts: reference counting, garbage collection, and the memory pool mechanism.

Reference counting. Python uses reference counting internally to keep track of objects in memory. All objects have a reference count.

The reference count increases when:

1) an object is assigned a new name;

2) an object is placed in a container (such as a list, tuple, or dictionary).

The reference count decreases when:

1) a name for the object is explicitly destroyed with the del statement;

2) a reference goes out of scope or is reassigned.

The sys.getrefcount() function returns the current reference count of an object. In most cases the reference count is much larger than you might guess. For immutable data such as numbers and strings, the interpreter shares memory between different parts of the program to save memory.

Garbage collection.

1) When an object's reference count reaches zero, it is disposed of by the garbage collection mechanism.

2) When two objects a and b refer to each other, the del statement reduces the reference counts of a and b and destroys the names used to refer to the underlying objects. However, because each object contains a reference to the other, the reference counts do not drop to zero and the objects are not destroyed (which would leak memory). To solve this problem, the interpreter periodically runs a cycle detector that searches for cycles of unreachable objects and removes them.

Memory pool mechanism. Python provides garbage collection for memory, but it puts unused memory into a memory pool rather than returning it to the operating system.

1) The pymalloc mechanism. To speed up execution, Python introduced a memory pool mechanism to manage the allocation and release of small blocks of memory.

2) All objects smaller than 256 bytes use the allocator implemented by pymalloc, while large objects use the system malloc.

3) Python objects such as integers, floating point numbers, and lists each have their own private memory pool, and objects of different types do not share pools. That is, if you allocate and free a large number of integers, the memory used to cache them cannot be reused for floating point numbers.
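Reference counting and the cycle detector can be observed from Python itself. The sketch below assumes the CPython interpreter (other implementations manage memory differently); the class `Node` and the two demo functions are made up for illustration.

```python
import gc
import sys
import weakref


class Node:
    """Tiny object that can point at another Node."""

    def __init__(self):
        self.other = None


def refcount_demo():
    obj = []
    base = sys.getrefcount(obj)      # includes the temporary reference made by the call
    alias = obj                      # a new name for the object: +1
    container = [obj]                # the object is placed in a container: +1
    grown = sys.getrefcount(obj)
    del alias                        # the extra name is destroyed: -1
    container.clear()                # the object leaves the container: -1
    shrunk = sys.getrefcount(obj)
    return base, grown, shrunk


def cycle_demo():
    a, b = Node(), Node()
    a.other, b.other = b, a          # a and b refer to each other
    probe = weakref.ref(a)           # lets us observe whether a is still alive
    del a, b                         # the counts stay above zero because of the cycle
    gc.collect()                     # run the cycle detector explicitly
    return probe() is None           # True once the cycle has been reclaimed


print(refcount_demo())
print(cycle_demo())
```

The absolute numbers returned by `sys.getrefcount` vary between interpreter versions, so the sketch only relies on the differences between them.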

 

66 Write Python code to remove duplicate elements from a list. Python, Python development. @tom_junsong, www.cnblogs.com/tom-gao/p/6…

Answer:

1) Use the set function: set(list)

2) Use the dictionary method fromkeys:

>>> a = [1, 2, 4, 2, 4, 5, 6, 5, 7, 8, 9, 0]
>>> b = {}
>>> b = b.fromkeys(a)
>>> c = list(b.keys())
>>> c

67 Sort the list with the sort method, then check elements starting from the last one. Python, Python development. @tom_junsong, www.cnblogs.com/tom-gao/p/6…

a = [1, 2, 4, 2, 4, 5, 7, 10, 5, 5, 7, 8, 9, 0, 3]
a.sort()
last = a[-1]
for i in range(len(a) - 2, -1, -1):
    if last == a[i]:
        del a[i]
    else:
        last = a[i]
print(a)
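The approaches above can be wrapped into two small helpers. This is a sketch with made-up function names; it relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on.

```python
def dedupe_keep_order(items):
    """Drop repeats while keeping the first occurrence of each element."""
    return list(dict.fromkeys(items))   # dict keys are unique and keep insertion order


def dedupe_sorted(items):
    """Drop repeats and return the remaining elements in ascending order."""
    return sorted(set(items))


a = [1, 2, 4, 2, 4, 5, 6, 5, 7, 8, 9, 0]
print(dedupe_keep_order(a))
print(dedupe_sorted(a))
```

Both helpers need hashable elements; for unhashable items such as nested lists, the sort-and-scan approach from question 67 still works as long as the items can be compared.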

68 How do you generate random numbers in Python? Python, Python development. @tom_junsong, www.cnblogs.com/tom-gao/p/6…

Answer: use the random module.

random.randint(a, b): returns a random integer x with a <= x <= b.

random.randrange(start, stop[, step]): returns a random integer from range(start, stop, step), excluding the end value.

random.random(): returns a floating point number between 0 and 1.

random.uniform(a, b): returns a floating point number in the specified range.

More Python written interview questions: python.jobbole.com/85231/
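A short usage sketch of the four calls listed above. It uses a seeded `random.Random` instance so that repeated runs give the same numbers; the seed and the ranges are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)             # seeded generator, repeatable across runs

die = rng.randint(1, 6)             # integer with 1 <= die <= 6
even = rng.randrange(0, 10, 2)      # one of 0, 2, 4, 6, 8 (10 is excluded)
unit = rng.random()                 # float with 0.0 <= unit < 1.0
scaled = rng.uniform(1.5, 2.5)      # float between 1.5 and 2.5

print(die, even, unit, scaled)
```

The module-level functions `random.randint`, `random.randrange`, `random.random` and `random.uniform` behave the same way but share one hidden global generator.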

What are the common loss functions? Machine learning, ML fundamentals, easy

For a given input X, f(X) gives the corresponding output. This predicted value f(X) may or may not agree with the true value Y (remember that some loss or error is sometimes unavoidable), and a loss function is used to measure the degree of prediction error. The loss function is denoted L(Y, f(X)).

The commonly used loss functions are as follows (basically quoted from Statistical Learning Methods) :

      

Thus, SVM has a second understanding, that is, optimization + minimum loss, or as @Xiafen_ Baidu said, “SVM, Boosting, LR and other algorithms may have different gains from the perspective of loss function and optimization algorithm”. For more understanding of SVM, please refer to: Popular Introduction to Support Vector Machines (Understanding the three layers of SVM)
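The list of loss functions itself did not survive in this copy. As a hedged sketch, the Python code below implements the four losses commonly listed in Statistical Learning Methods (0-1, squared, absolute, and logarithmic loss), plus the hinge loss that usually accompanies the SVM discussion; whether the original figure showed exactly these is an assumption.

```python
import math


def zero_one_loss(y, fx):
    """0-1 loss: 1 if the prediction is wrong, 0 if it is right."""
    return 0 if y == fx else 1


def squared_loss(y, fx):
    """Squared (quadratic) loss: (y - f(x))^2."""
    return (y - fx) ** 2


def absolute_loss(y, fx):
    """Absolute loss: |y - f(x)|."""
    return abs(y - fx)


def log_loss(prob_of_true_label):
    """Logarithmic loss: -log P(y | x), given the probability assigned to the true label."""
    return -math.log(prob_of_true_label)


def hinge_loss(y, score):
    """Hinge loss for labels y in {-1, +1}: max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)


print(zero_one_loss(1, 0), squared_loss(3, 1), absolute_loss(3, 1))
print(log_loss(0.5), hinge_loss(1, 2.0), hinge_loss(-1, 0.5))
```

The hinge loss is zero only when a sample is on the correct side with a margin of at least 1, which is how the loss view connects to SVM.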

 

70 Briefly introduce logistic regression. Machine learning, ML models, easy

Logistic regression aims to learn a 0/1 classification model from features. The model takes a linear combination of the features as its independent variable, and since that variable ranges from minus infinity to plus infinity, the logistic function (or sigmoid function) is used to map it into (0, 1). The mapped value is interpreted as the probability that y = 1.

The hypothesis function is

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where x is the n-dimensional feature vector and the function g is the logistic function.

The graph of $g(z) = \frac{1}{1 + e^{-z}}$ is

 

 

 

 

 

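As you can see, the function maps the whole range from minus infinity to plus infinity into (0, 1).

The hypothesis function is then the probability that a sample with the given features belongs to the class y = 1:

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$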

 

 

Thus, when we want to determine which class a new sample belongs to, we just need to compute $h_\theta(x)$: if the value is greater than 0.5 the class is y=1, otherwise the class is y=0.

In addition, $h_\theta(x)$ depends only on $\theta^T x$: if $\theta^T x > 0$ then $h_\theta(x) > 0.5$. g(z) only performs the mapping; what really decides the class is still $\theta^T x$. Moreover, when $\theta^T x \gg 0$, $h_\theta(x) \approx 1$, and conversely when $\theta^T x \ll 0$, $h_\theta(x) \approx 0$. If we start only from $\theta^T x$, the goal of the model is to make $\theta^T x \gg 0$ for the training features with y=1 and $\theta^T x \ll 0$ for the features with y=0. Logistic regression is all about learning θ so that the features of positive examples are much greater than 0 and the features of negative examples are much less than 0, and this goal should be achieved on all training examples.

Next, try transforming logistic regression. First, replace the result labels y = 0 and y = 1 with y = -1 and y = 1. Then, in $\theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ (with $x_0 = 1$), replace $\theta_0$ with b, and finally replace $\theta_1 x_1 + \dots + \theta_n x_n$ with $w^T x$ (i.e. $w_1 x_1 + \dots + w_n x_n$). This gives $\theta^T x = w^T x + b$. In other words, apart from y=0 becoming y=-1, the linear classification function is no different from the formal representation of logistic regression, $h_\theta(x) = g(\theta^T x) = g(w^T x + b)$.

Further, we can simplify g(z) in the hypothesis function $h_{w,b}(x) = g(w^T x + b)$ by mapping it directly onto y=-1 and y=1. The mapping is as follows:

$$g(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}$$

Finally, if the distributions of two sets of points in n-dimensional space obey multivariate normal distribution, then logistic regression is equivalent to classifying points in space using maximum likelihood estimation. Details can be found at blog.sciencenet.cn/blog-508318…
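The prediction side of logistic regression is short enough to sketch in Python. The parameter values below are made up for illustration and are not learned from data; x is assumed to carry x0 = 1 as its first component.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x): the modelled probability that y = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


def predict_label(theta, x):
    """Class 1 if the probability exceeds 0.5 (i.e. theta^T x > 0), else class 0."""
    return 1 if predict_proba(theta, x) > 0.5 else 0


theta = [-1.0, 2.0]                       # hypothetical parameters: theta_0 = -1, theta_1 = 2
print(predict_proba(theta, [1.0, 2.0]))   # theta^T x = 3, probability well above 0.5
print(predict_label(theta, [1.0, 2.0]), predict_label(theta, [1.0, 0.0]))
```

For very negative z, `math.exp(-z)` can overflow; production code usually branches on the sign of z to avoid that.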

 

Since you work in computer vision: which CV frameworks are you familiar with, and can you describe the development of CV over the last five years? Deep learning, DL applications, difficult

Source: adeshpande3.github.io. Author: Adit Deshpande, UCLA CS graduate student. Translators: Xinzhiyuan (Wen Fei, Hu Xiangjie). Translation link: mp.weixin.qq.com/s?__biz=MzI…

The structure of this answer is as follows:

AlexNet (2012)

ZF Net (2013)

VGG Net (2014)

GoogLeNet (2015)

Microsoft ResNet (2015)

Regional CNN(R-CNN-2013, Fast R-CNN-2015, Faster R-CNN-2015)

Generative Adversarial Network (2014)

Generating Image Descriptions (2014)

Spatial Transformer Networks (2015)

AlexNet (2012)

It all started here (though some would say Yann LeCun's 1998 paper really ushered in the era). The paper, titled "ImageNet Classification with Deep Convolutional Neural Networks", has been cited 6,184 times and is widely regarded as one of the most important in the field. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created a "large, deep convolutional neural network" that won the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge). For context, this competition is regarded as the annual Olympics of computer vision: teams from all over the world compete to see whose vision model performs best. 2012 was the first year in which a CNN achieved a top-5 error rate of 15.4% (the top-5 error rate is the fraction of images whose correct label is not among the five labels the model considers most likely); the second-best entry had an error rate of 26.2%. Needless to say, this result shocked the computer vision community. Arguably, it was only from then on that CNNs became a household name.
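The top-5 error rate mentioned above is easy to compute once a model has produced a score per class. The Python sketch below uses made-up scores for six classes purely to illustrate the definition; it is not related to the actual ILSVRC evaluation code.

```python
def top5_error_rate(scores_per_image, true_labels):
    """Fraction of images whose true label is not among the 5 highest-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
        if label not in ranked[:5]:
            misses += 1
    return misses / len(true_labels)


# Two images, six classes each; the scores are invented for the example.
scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],   # true class 4 has the lowest score: a miss
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],   # true class 2 has the highest score: a hit
]
labels = [4, 2]
print(top5_error_rate(scores, labels))
```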

In the paper, the authors discuss the architecture of the network (named AlexNet). Compared with modern architectures they used a relatively simple layout: the whole network consists of five convolutional layers, max pooling layers, dropout layers, and three fully connected layers. The network classifies images into 1,000 possible categories.

  

 

AlexNet architecture: it looks a bit strange because there are two "streams". This is because training was done on two GPUs; the amount of computation was too large for one, so the work had to be split across them.

Main points

The network was trained on ImageNet data. The ImageNet database contains more than 15 million labeled images in more than 22,000 categories.

ReLU was used instead of the traditional tanh function to introduce nonlinearity (ReLU is several times faster than tanh and reduces training time).

Data augmentation techniques such as image translation, horizontal reflection and patch extraction were used.

Dropout layers were used to address overfitting to the training data.

The model was trained with batch stochastic gradient descent, with specific values for momentum and weight decay.

Training ran on two GTX 580 GPUs for 5 to 6 days.

Why is it important?

The neural network developed by Krizhevsky, Sutskever and Hinton in 2012 was the big debut of CNNs in computer vision. It was the first time a model had performed this well on the notoriously difficult ImageNet dataset. The techniques proposed in the paper, such as data augmentation and dropout, are still in use today. The paper truly demonstrated the strengths of CNNs and backed them up with record-breaking competition results.

ZF Net (2013)

After AlexNet took the limelight in 2012, a large number of CNN models appeared at ILSVRC 2013. The winner that year was ZF Net, a network designed by Matthew Zeiler and Rob Fergus of New York University, with an error rate of 11.2%. ZF Net is more like a fine-tuned version of the AlexNet architecture, but it still introduced some key ideas for improving performance. Another reason it matters is that the paper is very well written: the authors spend a lot of time explaining the intuition behind convolutional neural networks and showing how to visualize filters and weights correctly.

In their paper, "Visualizing and Understanding Convolutional Networks", Zeiler and Fergus begin by explaining how large data sets and GPU computing power have revived interest in CNNs. They point out that "developing a better model is actually a process of trial and error". Although we understand a little more now than we did three years ago, the questions raised in the paper still exist today. The main contributions of the paper are a model slightly better than AlexNet, described in detail, and several methods for visualizing feature maps.

  

 

Main points

The overall architecture is very similar to AlexNet, except for some minor modifications.

AlexNet training uses 15 million images, while ZFNet only uses 1.3 million.

AlexNet uses 11×11 filters in the first layer, while ZF Net uses 7×7 filters and a smaller stride. The reasoning behind this change is that a smaller filter in the first convolutional layer helps retain more of the original pixel information in the input. An 11×11 filter skips a lot of relevant information, which matters especially because this is the first convolutional layer.

As the network gets deeper, the number of filters used increases.

ReLU was used as the activation function, cross-entropy loss as the error function, and batch stochastic gradient descent for training.

Training took 12 days on a GTX 580 GPU.

Developed a visualization technique called a Deconvolutional Network, which helps examine different feature activations and their relation to the input space. It is called a “deconvnet” because it maps features back to pixels (the opposite of what a convolutional layer does).

DeConvNet

The basic idea behind DeConvNet is that a “deconvnet” is attached to each layer of the trained CNN, providing a path back to the image pixels. An image is fed into the CNN and activations are computed at each layer; this is the forward pass. Now suppose we want to examine the activation of a certain feature in the fourth convolutional layer. We keep the activations of that one feature map, set all other activations in the layer to 0, and pass the feature maps as input to the deconvnet. The deconvnet has the same filters as the original CNN. The input then goes through a series of unpooling (the reverse of max pooling), rectification, and filtering operations for each preceding layer until the input pixel space is reached.
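
To make the unpooling step more concrete, here is a minimal sketch in plain Python. It assumes 2×2 max pooling with stride 2 on a feature map with even height and width, and it records where each maximum came from (the “switches”) during the forward pass so that the value can be put back in the same place on the way down. The function names and the toy feature map are made up for illustration and are not the authors’ code.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2; also records where each maximum came from."""
    rows, cols = len(fmap), len(fmap[0])
    pooled, switches = [], []
    for r in range(0, rows, 2):
        pooled_row, switch_row = [], []
        for c in range(0, cols, 2):
            # Values and positions inside the current 2x2 window.
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool_2x2(pooled, switches, rows, cols):
    """Approximate inverse of max pooling: put each value back at its switch."""
    restored = [[0.0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            restored[r][c] = value
    return restored


# Toy 4x4 feature map (hypothetical values).
feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, switches = max_pool_2x2(feature_map)
restored = unpool_2x2(pooled, switches, 4, 4)
for row in restored:
    print(row)
```

Only the positions that won the max survive in `restored`; everything else stays at zero, which is why unpooling is only an approximate inverse of pooling.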

The logic behind this process is that we want to know what structure is activating a feature map. Let’s look at the visualization of the first and second layers.

  

 

The first layer of a ConvNet is always a low-level feature detector, in this case detecting simple edges and colors. The second layer detects more rounded features. Let’s look at the third, fourth and fifth layers.

  

 

These layers show more advanced features, such as dog faces and flowers. It is worth mentioning that after the first convolutional layer we usually have a pooling layer that downsamples the image (for example, from 32×32×3 to 16×16×3). The effect is that the second layer gets a wider view of the original image. You can read the paper for more details.

Why is it important?

ZF Net was not only the winner of the 2013 competition, it also provided excellent intuition about how CNNs work and showed more ways to improve performance. The visualization approach described in the paper helps explain the inner workings of CNNs and also provides insight for improving network architectures. The deconv visualization and the occlusion experiments make this paper a personal favorite of mine.

VGG Net (2014)

Simplicity and depth: that is what VGG Net, a 2014 model with a 7.3% error rate (though not the ILSVRC 2014 winner), is about. Karen Simonyan and Andrew Zisserman of the University of Oxford created a 19-layer CNN that strictly uses 3×3 filters (stride = 1, pad = 1) and 2×2 max-pooling layers (stride = 2). Simple, right?

  

 

Main points

The 3×3 filters used here are quite different from the 11×11 filters in AlexNet’s first layer and the 7×7 filters in ZF Net. The authors reason that two stacked 3×3 convolutional layers have an effective receptive field of 5×5. This simulates a larger filter while keeping the benefits of small filter sizes, one of which is fewer parameters. In addition, with two convolutional layers we get to use two ReLU layers instead of one.

Three 3×3 convolutional layers stacked back to back have an effective receptive field of 7×7.

The number of filters doubles after each max-pooling layer. This reinforces the idea of shrinking the spatial dimensions while growing the depth.

The model worked well on both image classification and localization tasks.

The model was built with the Caffe toolbox.

Scale jittering was used as a data augmentation technique during training.

A ReLU layer was used after each convolutional layer, and the model was trained with batch gradient descent.

Training took two to three weeks on four Nvidia Titan Black GPUs.
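
The receptive field and parameter arguments in the points above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: stride-1 convolutions, the same number of channels going in and out of every layer, and no bias terms. The channel count of 64 is just an illustrative value.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each stride-1 layer widens the view by (k - 1)
    return field


def conv_weights(kernel, channels, layers=1):
    """Weights in `layers` conv layers with `channels` inputs and outputs (no biases)."""
    return layers * kernel * kernel * channels * channels


CHANNELS = 64  # hypothetical channel count, for illustration only

print(stacked_receptive_field([3, 3]))        # two stacked 3x3 layers
print(stacked_receptive_field([3, 3, 3]))     # three stacked 3x3 layers
print(conv_weights(3, CHANNELS, layers=3))    # three 3x3 layers
print(conv_weights(7, CHANNELS))              # one 7x7 layer
```

Under these assumptions, three 3×3 layers cover the same 7×7 region as a single 7×7 layer while using 27·C² weights instead of 49·C², where C is the channel count.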

Why is it important?

In my opinion, VGG Net is one of the most important models because it reinforced the idea that a CNN must be deep enough for its hierarchical representation of visual data to work. Deep and simple.

GoogLeNet (2015)

Remember the idea of simplicity in network architecture that we were just talking about? With the Inception module, Google more or less threw it out the window. GoogLeNet is a 22-layer convolutional neural network that won ILSVRC 2014 with a top-5 error rate of 6.7%. As far as I know, it was the first CNN architecture to really depart from the general approach of simply stacking convolutional and pooling layers on top of each other in a sequential structure. The authors also stress that the new model pays attention to memory and power usage. This is an important point that I often forget myself: stacking all those layers and adding a large number of filters has a computational and memory cost, and increases the risk of overfitting.

  

 

Another way to look at GoogLeNet:

  

 

Inception module

When we first look at the structure of GoogLeNet, we immediately notice that not everything happens sequentially, unlike the architectures we have seen before. Some parts of the network run in parallel.

  

 

This box is called the Inception module. Let’s take a closer look at what it is made of.

  

 

The green box at the bottom is the input and the one at the top is the output of the module (rotate this image 90 degrees to the right and it lines up with the picture of the whole network). Basically, at each layer of a traditional convolutional network you have to choose between a pooling operation and a convolution operation (and also choose the filter size). What the Inception module lets you do is perform all of these operations in parallel. In fact, this was exactly the “naive” idea the authors started with.

  

 

Now, why does the naive version not work? It leads to far too many outputs, and we end up with an extremely large depth channel in the output volume. The authors address this by adding 1×1 convolution operations before the 3×3 and 5×5 layers. The 1×1 convolution (or network-in-network layer) provides a way to reduce dimensionality. For example, say you have an input volume of 100×100×60 (not necessarily the dimensions of an image, just the input to some layer of the network). Applying 20 filters of 1×1 convolution reduces the volume to 100×100×20. This means that the 3×3 and 5×5 convolutions do not have to deal with as large a volume. This can be thought of as “pooling of features”, because we are reducing the depth of the volume, similar to how a normal max-pooling layer reduces height and width. Another thing to note is that these 1×1 convolutional layers are followed by ReLU units, which certainly does not hurt.
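
The saving from the 1×1 reduction can be estimated with simple counting. The sketch below uses the 100×100×60 input and the 20 1×1 filters from the example above; the number of 5×5 filters (32) is a hypothetical value chosen for illustration, and stride 1 with “same” padding is assumed so that the spatial size does not change.

```python
def conv1x1_pixel(pixel, filters):
    """Apply 1x1 filters at one spatial position: one dot product per filter, then ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(f, pixel))) for f in filters]


def conv_multiplies(height, width, in_channels, out_channels, kernel):
    """Multiply count for a stride-1, same-padded convolution."""
    return height * width * out_channels * kernel * kernel * in_channels


H, W, C_IN = 100, 100, 60   # input volume from the text
C_REDUCED = 20              # number of 1x1 filters from the text
C_OUT = 32                  # hypothetical number of 5x5 filters

# One 60-channel position goes in, one 20-channel position comes out.
reduced_pixel = conv1x1_pixel([1.0] * C_IN, [[0.01] * C_IN for _ in range(C_REDUCED)])
print(len(reduced_pixel))

# 5x5 convolution applied directly versus after the 1x1 reduction.
direct = conv_multiplies(H, W, C_IN, C_OUT, 5)
with_reduction = (conv_multiplies(H, W, C_IN, C_REDUCED, 1)
                  + conv_multiplies(H, W, C_REDUCED, C_OUT, 5))
print(direct, with_reduction)
```

Even counting the extra 1×1 layer, the reduced path needs far fewer multiplications here, which is the point the authors make about keeping the module computationally affordable.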

You may ask, “What is the use of this architecture?” Well, the module consists of a network-in-network layer, a medium-sized filter convolution, a large filter convolution, and a pooling operation. The network-in-network convolution can extract fine-grained detail from the input volume, while the 5×5 filter covers a large receptive field of the input and can therefore extract its information as well. The pooling operation reduces the spatial size and helps combat overfitting. On top of that, there is a ReLU after each convolutional layer, which improves the nonlinearity of the network. Basically, the network can perform all these different operations while remaining computationally considerate. The paper also gives higher-level reasoning that involves topics such as sparsity and dense connections (see sections 3 and 4 of the paper).

Main points

Nine Inception modules are used in the whole architecture, with over 100 layers in total. That’s pretty deep…

No fully connected layers are used. Average pooling is used instead, going from a 7×7×1024 volume to a 1×1×1024 volume, which saves a large number of parameters.

The network uses 12 times fewer parameters than AlexNet.

During testing, multiple crops of the same image are created and fed into the network, and the softmax probabilities are averaged to give the final result.

Concepts from R-CNN are used in the detection model.

There are updated versions of the Inception module (versions 6 and 7).

The model was trained on “a few high-end GPUs” within a week.

Why is it important?

GoogLeNet was the first model to introduce the idea that CNN layers do not always have to be stacked sequentially. With the Inception module, the authors showed that a creative arrangement of layers can lead to better performance and computational efficiency. This paper really laid the groundwork for some of the amazing architectures we are likely to see in the next few years.

Microsoft ResNet(2015)

Imagine a deep CNN architecture. However deep you imagine it, it is probably not as deep as the ILSVRC 2015 winner, Microsoft’s 152-layer ResNet. Besides setting a new record for the number of layers, ResNet achieved a surprisingly low error rate of 3.6%, whereas humans are at around 5 to 10%.

Why is it important?

A 3.6% error rate should be enough to convince you. ResNet is the best CNN architecture at present and a great innovation built on the idea of residual learning. The error rate has been falling every year since 2012, but I doubt it will keep falling at ILSVRC 2016. I believe we have reached the point where stacking more layers no longer yields significant performance gains. We will have to create new architectures.

Region-based CNNs: R-CNN (2013), Fast R-CNN (2015), Faster R-CNN (2015)

Some might argue that the emergence of R-CNN was more influential than any of the previous papers on new network architectures. The first R-CNN paper has been cited more than 1,600 times. Ross Girshick and his team at UC Berkeley made one of the most influential advances in computer vision. As the titles suggest, Fast R-CNN and Faster R-CNN made the model faster and better suited to modern object detection tasks.

R-CNN aims to solve the problem of object detection: given an image, we want to draw bounding boxes around all the objects in it. The process can be split into two components, region proposal and classification.

The authors note that any class-agnostic region proposal method should fit. Selective Search is the one used for R-CNN. It generates about 2,000 different regions that have the highest probability of containing an object. Once the region proposals have been generated, each one is warped to a fixed image size and fed into a trained CNN (AlexNet in the paper), which extracts a feature vector for each region. This vector is then used as input to a set of linear SVMs, one trained per class, which output a classification. The vector is also fed into a bounding box regressor to obtain the most accurate coordinates.

  

 

Non-maximum suppression is then used to suppress bounding boxes that overlap significantly with each other.
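
The suppression step, usually called non-maximum suppression (NMS), can be sketched in a few lines. This is the common greedy version written in plain Python, not the code from the R-CNN paper: boxes are assumed to be (x1, y1, x2, y2) tuples, and the overlap threshold of 0.5 and the example boxes are hypothetical values.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two heavily overlapping detections of one object plus one separate detection.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap anything and is kept.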

Fast R-CNN

The original model was improved because of three main problems: training required multiple stages, was computationally expensive, and was very slow. Fast R-CNN solves the speed problem by sharing the convolutional computation across different proposals and by swapping the order of generating region proposals and running the CNN.

  

 

Faster R-CNN

Faster R-CNN tackles the complexity of the training pipeline that both R-CNN and Fast R-CNN exhibit. The authors insert a region proposal network (RPN) after the last convolutional layer. This network produces region proposals by looking only at the last convolutional feature map. From that stage on, the same pipeline as R-CNN is used.

  

 

Why is it important?

Being able to determine that a certain object is in an image is one thing, but being able to determine its exact location is a huge leap in what a computer knows. Faster R-CNN has become the standard for object detection today.

Generative Adversarial Networks (2014)

According to Yann LeCun, generative adversarial networks could be the next big thing in deep learning. Suppose there are two models, a generative model and a discriminative model. The task of the discriminative model is to determine whether an image is real (from the dataset) or machine-generated, and the task of the generative model is to create images that fool the discriminative model. The two models are “adversaries” of each other and eventually reach an equilibrium in which the generated images are indistinguishable from real ones, so the discriminator cannot tell the two apart.

  

 

The left column shows images from the dataset, that is, real images, and the right column shows machine-generated images. They look basically the same to the naked eye, but they look very different to a CNN.

Why is it important?

It sounds simple, but this is a model that can only be built by understanding the “internal representation of the data”. You can train the network to understand the difference between a real image and a machine-generated one, so the model can also be used for feature extraction in a CNN. In addition, you can use generative adversarial models to create realistic images.

Generating Image Descriptions (2014)

What happens when you combine CNNs with RNNs? This paper by Andrej Karpathy and Fei-Fei Li uses a combination of CNNs and bidirectional RNNs to generate natural language descriptions of different image regions. In short, the model takes in an image and outputs this:

  

 

It’s amazing. With a traditional CNN, each image in the training data has a single explicit label. The model described in this paper is trained on images that each come with a sentence (a caption). This kind of label is called a weak label. Using this training data, one deep neural network “deduces latent alignment between a part of a sentence and the region it describes,” and another takes an image as input and generates a text description.

Why is it important?

Combining the seemingly unrelated RNN and CNN models creates a very useful application that joins computer vision and natural language processing. The paper offers new ideas for building models that handle tasks crossing different fields.

Spatial Transformer Networks (2015)

Finally, let’s look at a recent paper in this field, written a year ago by a team at Google DeepMind. Its main contribution is the Spatial Transformer module. The basic idea is that this module transforms the input image so that subsequent layers have an easier time classifying it. Rather than changing the main CNN architecture itself, the authors change the image before it reaches a particular layer. The module aims to correct two things: pose normalization (scenes in which the object is tilted or scaled) and spatial attention (focusing on the right object in a crowded image). With a traditional CNN, if you want the model to be invariant to images of different scales and rotations, you need a large number of training examples for the model to learn from. Let’s look at how this module helps solve that problem.

In a traditional CNN, the max-pooling layer is what deals with spatial invariance. The reasoning is that once we know a particular feature is present in the input volume (wherever activation values are high), its exact location matters less than its location relative to other features. The new spatial transformer is dynamic: it produces different behavior (a different distortion or transformation) for each input image, rather than being simple and predefined like a traditional max-pooling layer. Let’s take a look at how the module works. It consists of:

A localization network, which takes in the input volume and outputs the parameters of the spatial transformation to apply. For an affine transformation the parameters are 6-dimensional.

A sampling grid, produced by warping a regular grid with the affine transformation (theta) created by the localization network.

A sampler, whose purpose is to warp the input feature map.
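
The last two components can be sketched in plain Python. This is a simplified illustration, not the DeepMind implementation: theta is assumed to be a fixed 2×3 matrix instead of the output of a localization network, coordinates are normalized to [-1, 1], the feature map has a single channel and at least two rows and columns, and bilinear interpolation with zeros outside the image is used as the sampler. The function names are made up for this example.

```python
import math


def affine_grid(theta, height, width):
    """Map every output position (normalized to [-1, 1]) to a source position.

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]].
    """
    grid = []
    for i in range(height):
        y_t = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            x_t = -1.0 + 2.0 * j / (width - 1)
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Read the input feature map at the grid positions (zeros outside the map)."""
    height, width = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < height and 0 <= c < width else 0.0

    output = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            # Convert normalized coordinates back to pixel coordinates.
            x = (x_s + 1.0) * (width - 1) / 2.0
            y = (y_s + 1.0) * (height - 1) / 2.0
            x0, y0 = math.floor(x), math.floor(y)
            wx, wy = x - x0, y - y0
            out_row.append((1 - wx) * (1 - wy) * pixel(y0, x0)
                           + wx * (1 - wy) * pixel(y0, x0 + 1)
                           + (1 - wx) * wy * pixel(y0 + 1, x0)
                           + wx * wy * pixel(y0 + 1, x0 + 1))
        output.append(out_row)
    return output


feature_map = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mirror = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # horizontal flip

same = bilinear_sample(feature_map, affine_grid(identity, 3, 3))
flipped = bilinear_sample(feature_map, affine_grid(mirror, 3, 3))
print(same)
print(flipped)
```

With the identity matrix the feature map comes back unchanged, and with the mirror matrix each row is reversed. In the real module theta is predicted per input, and because the sampling is differentiable the localization network can be trained together with the rest of the CNN.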

  

 

This module can be dropped into a CNN at any point and helps the network learn how to transform feature maps in a way that minimizes the cost function during training.

  

 

Why is it important?

Improvements to CNNs do not have to come from drastic changes to the network architecture. We do not need to create the next ResNet or Inception module. This paper implements the simple idea of applying an affine transformation to the input image so that the model becomes more invariant to translation, scaling and rotation. For more, see the article on the top 10 CNN papers.

 

What are the frontiers of deep learning in the field of vision? (Deep learning, DL applications; difficulty: hard.) @Yuan Feng; source of this analysis: zhuanlan.zhihu.com/p/24699780

Introduction

At this year’s NIPS 2016, a leading neural networks conference, Professor Yann LeCun, one of the three giants of deep learning, gave an interesting metaphor for supervised, unsupervised and reinforcement learning in machine learning. He said: if intelligence is a cake, then unsupervised learning is the cake itself, supervised learning is only the icing on the cake, and reinforcement learning is the cherry on top (Figure 1).

 

Figure 1. Yann LeCun’s metaphor for the value of supervised, reinforcement, and unsupervised learning

 

 

1. Progress of deep supervised learning in computer vision

1.1 Image Classification

Ever since Alex and his mentor Hinton (the godfather of deep learning) won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) with a top-5 accuracy of 83.6%, nearly 10 percentage points ahead of the runner-up (74.2%, using traditional computer vision methods), deep learning has been hot and the convolutional neural network (CNN) has become a household name. Top-5 accuracy on ImageNet rose from 83.6% for AlexNet in 2012 to 88.8% for the 2013 winner, then to 92.7% for VGG and 93.3% for GoogLeNet in 2014. Finally, in 2015, the residual network (ResNet) proposed by Microsoft reached a top-5 accuracy of 96.43% on 1000-class image recognition. Top-5 accuracy means that, given an image, the model outputs its 5 most likely labels, and the prediction counts as correct as long as the true label is among those 5.

Figure 2. Evolution trend of image recognition error rate of ILSVRC contest from 2010 to 2015
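
The definition of top-5 accuracy given above is easy to express in code. The following is a minimal sketch in plain Python; the class scores and labels are made-up values for two images and six classes, not results from any real model.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for class_scores, label in zip(scores, labels):
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 2 images over 6 classes.
scores = [
    [0.05, 0.12, 0.40, 0.20, 0.15, 0.08],  # true class 3 is ranked second
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5 is ranked last
]
labels = [3, 5]

print(top_k_accuracy(scores, labels, k=5))
print(top_k_accuracy(scores, labels, k=1))
```

The first image counts as correct under top-5 even though the top prediction is wrong, which is why top-5 accuracy is always at least as high as top-1 accuracy.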

 

1.2 Image Detection

Along with image classification there is a more challenging task, image detection, which means classifying the objects in an image while also enclosing them in rectangular boxes. From 2014 to 2016, well-known frameworks such as R-CNN, Fast R-CNN, Faster R-CNN, YOLO and SSD appeared one after another. Their mean average precision (mAP) on PASCAL VOC, a well-known computer vision dataset, rose from 53.3% for R-CNN to 68.4% for Fast R-CNN and 75.9% for Faster R-CNN, and the latest experiments show that Faster R-CNN combined with a residual network (ResNet-101) reaches 83.8%. Detection is also getting faster and faster: from more than 2 seconds per image for the original R-CNN, to 198 milliseconds per image for Faster R-CNN, to 155 frames per second for YOLO (whose drawback is low accuracy, only 52.7%), and finally to SSD, which is both accurate and fast at 75.1% and 23 frames per second.

 

Figure 3. Example of image detection

 

1.3 Semantic Segmentation

Image segmentation is another interesting research field. Its goal is to segment the different objects in an image, marking each with a different color, as shown in the figure below. The mean accuracy (mIoU, the intersection of the predicted region and the ground-truth region divided by their union) has risen from 62.2% for the original FCN model (Fully Convolutional Networks for Semantic Segmentation, which was among the best papers at CVPR 2015, a top computer vision conference), to 72.7% for the DeepLab framework, and to 74.7% for Oxford’s CRF as RNN. The field is still developing and has a lot of room for improvement.

Figure 4. Example of image segmentation
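
The mIoU measure mentioned in this section can be computed directly from two label maps. Below is a minimal sketch in plain Python, assuming each pixel holds an integer class id; the tiny 2×3 label maps are hypothetical. Benchmarks usually accumulate the counts over a whole dataset, while this sketch scores a single image.

```python
def mean_iou(predicted, target, num_classes):
    """Mean over classes of (prediction AND truth) / (prediction OR truth)."""
    ious = []
    for cls in range(num_classes):
        intersection = union = 0
        for pred_row, true_row in zip(predicted, target):
            for p, t in zip(pred_row, true_row):
                in_pred, in_true = p == cls, t == cls
                intersection += in_pred and in_true
                union += in_pred or in_true
        if union:  # skip classes that appear in neither map
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Hypothetical ground truth and prediction with two classes (0 and 1).
target = [[0, 0, 1],
          [0, 1, 1]]
predicted = [[0, 1, 1],
             [0, 1, 1]]

print(mean_iou(predicted, target, num_classes=2))
```

Here class 0 has an IoU of 2/3 and class 1 an IoU of 3/4, so a single mislabeled pixel lowers the score of both classes it touches.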

 

 

1.4 Image Captioning

Image captioning is a compelling research field. The goal is simple: you give the model a picture and it gives back a passage describing it. In the figure, the description generated automatically for the first picture is “a man riding a motorcycle on a dusty dirt road”, and for the second picture “two dogs playing on the grass”. Because of its commercial value (for example in image search), this research has been pursued in recent years by companies such as Baidu, Google and Microsoft in industry, and in academia by UC Berkeley and the University of Toronto, a major center of deep learning research.

Figure 5. Image captioning: generating descriptive text from a picture

 

1.5 Image Generation: Text to Image

Image captioning is only half of the loop: since we can generate descriptive text from images, we can also generate images from text. As shown in Figure 6, for the first column the model automatically generated 16 images from the text “a large passenger plane flying in the blue sky”. The third column is interesting: “a herd of elephants walking on dry grass” (which runs against common sense, since elephants usually live in rainforests and do not walk on dry grass), and the model still generated corresponding images. The quality of the generated images is not great yet, but they are roughly on the right track.



Figure 6. Generate images from text

 

 

2. Reinforcement Learning

In supervised learning tasks, we are given samples with fixed labels and then train a model on them. In real environments, however, it is hard to provide labels for every sample, and this is where reinforcement learning comes in. Simply put, we give the model rewards or penalties and let it learn by trial and error, optimizing for a higher score. AlphaGo, which became famous in 2016, was trained with reinforcement learning and mastered its strategy through continual self-play and trial and error. Reinforcement learning has also been used to play Flappy Bird, reaching scores in the tens of thousands.

Figure 7. Reinforcement learning to play Flappy Bird
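
To give a feel for learning from rewards by trial and error, here is a minimal tabular Q-learning sketch on a toy problem. It is not the method used by AlphaGo or by the Flappy Bird agent, which rely on deep networks; the five-position environment, the reward of 1 at the goal, the discount factor and the purely random exploration are all assumptions made for this illustration.

```python
import random

N_STATES = 5          # positions 0..4 on a line; position 4 is the goal
ACTIONS = (-1, +1)    # step left or step right
GAMMA = 0.9           # discount factor (hypothetical value)


def step(state, action):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=300, seed=0):
    """Q-learning with random exploration; learning rate 1 suits a deterministic world."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(2)  # pure trial and error
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][a] = target
            state = next_state
    return q


q_table = train()
# Greedy policy: index 0 means "left", index 1 means "right".
policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is correct. It only sees a reward when it happens to reach the goal, and the learned values then propagate backwards until the greedy policy points right from every position.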

 

Another classic game is Breakout. DeepMind proposed a model that takes only pixels as input and has no other prior knowledge; in other words, the model does not know what the ball is or what game it is playing. After 240 minutes of training, it not only learned to catch the ball and hit the bricks correctly, it even learned to keep hitting the same spot, since the faster the bricks are cleared, the sooner the game is won (and the higher the reward). Video links: YouTube (not directly accessible from mainland China), Youku

 

Figure 8. Playing Atari Breakout using deep reinforcement learning

Reinforcement learning has great application potential in robotics and autonomous driving, and new papers appear on arXiv every few days. Letting machines learn the best behavior through trial and error may be the best path for the evolution of artificial intelligence, and is probably the only path to strong artificial intelligence.

 

3. Deep Unsupervised Learning: Predictive Learning

Compared with the limited amount of labeled data for supervised learning, nature offers an unlimited amount of unlabeled data. If artificial intelligence could learn automatically from this vast supply, would that not open a new era? Unsupervised learning is probably the most promising research area right now, which is why Yann LeCun likens it to the cake itself in his metaphor for artificial intelligence.

After Ian Goodfellow proposed generative adversarial networks in 2014, the field grew hotter and hotter and became one of the hottest research areas of 2016. Yann LeCun once said that “adversarial networks are the most exciting thing since sliced bread”, which is enough to show how important generative adversarial networks are.

A simple explanation of generative adversarial networks is as follows. Suppose there are two models, a generative model (G) and a discriminative model (D). The task of the discriminative model is to decide whether an instance is real or generated by the model. The task of the generative model is to produce instances that fool the discriminative model. The two models compete with each other, and as training goes on they reach a balance: the instances produced by the generative model are no different from real ones, and the discriminative model can no longer tell natural instances from generated ones. Take an art forger as an example. The forger (generative model) produces fake Picasso paintings to fool the expert (discriminative model D). The forger keeps improving his forgeries to fool the expert, while the expert keeps studying real and fake Picassos to sharpen his judgment. In the end the forger copies Picasso so well that the expert can hardly tell a genuine painting from a fake. The following are some images generated in Goodfellow’s paper on generative adversarial networks.

It can be seen that the generated images are still quite different from real ones. However, that paper dates from 2014, and progress in 2016 was very fast: Conditional Generative Adversarial Nets, InfoGAN and Deep Convolutional Generative Adversarial Networks (DCGAN) emerged one after another and, more importantly, generative adversarial networks have now reached into video prediction. As we all know, humans understand the world mainly through video sequences, and still images are only a very small part of that. When AI learns to understand video, it will really start to show its power.

Here is a review paper written by Ian Goodfellow at the beginning of 2017, based on his talk at NIPS 2016: NIPS 2016 Tutorial: Generative Adversarial Networks

 

 

 

 

 

Figure 9. Some images generated by a generative adversarial network. The last column shows the training-set images closest to the adjacent generated images
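
The contest between the two models in this section can be written down as two loss functions. The sketch below is plain Python and works on the discriminator’s output probabilities only; it does not train any network. The generator loss uses the commonly used non-saturating form, which is an assumption of this sketch, and all probability values are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """D wants D(x) near 1 for real samples and D(G(z)) near 0 for generated ones."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """G wants D(G(z)) near 1 (non-saturating form, an assumption of this sketch)."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Early in training: the discriminator easily spots the fakes.
early_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
early_g = generator_loss(d_fake=[0.1, 0.2])

# At the balance point described above, D outputs 0.5 for everything.
balanced_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
balanced_g = generator_loss(d_fake=[0.5, 0.5])

print(early_d, early_g)
print(balanced_d, balanced_g)
```

When the discriminator can no longer tell real from generated samples, its loss settles at 2·ln 2, which corresponds to the balance between forger and expert in the Picasso example.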

 

3.1 Conditional Generative Adversarial Nets (CGAN)

Generative adversarial networks generally generate instances of a given kind of image from random noise, whereas conditional generative adversarial networks constrain the output according to some input, for example generating specific instances from a few descriptive words, which is somewhat similar to the text-to-image generation introduced in Section 1.5. The figure below is from the Conditional Generative Adversarial Nets paper and shows images generated from specific text descriptions. (Note: the descriptions for the left column do not appear in the training set, that is, the model generates images from descriptions it has never seen, while the descriptions for the right column do appear in the training set.)

Figure 10. Generate images from text

Another interesting paper on conditional generative adversarial networks is about image-to-image translation. The model proposed in that paper takes an input image and generates a corresponding output image. The figure below is from the paper. The first pair in the upper left corner is very interesting: the model takes an image segmentation result as input and generates a realistic street scene, which is like reverse engineering image segmentation.

Figure 11. Generate some interesting output images based on specific inputs

The SRGAN model was proposed in 2016. It takes a downsampled version of an original high-definition image and tries to restore it with a generative adversarial network, producing a more natural image that is closer to the original. The original image is at the far right of the figure below. After downsampling, the image upscaled with bicubic interpolation is blurry, the version produced with a residual network (SRResNet) is much cleaner, and the image generated by SRGAN looks more realistic still.

 

Figure 12. Super-resolution example using generative adversarial networks. The original image is at the far right

Another influential paper on generative adversarial networks is the Deep Convolutional Generative Adversarial Network (DCGAN). The authors combine convolutional neural networks with generative adversarial networks and point out that this framework learns good feature representations of objects. An interesting result is: man with glasses - man without glasses + woman without glasses = woman with glasses. In other words, the model gives images a vector-like representation.

 

 

 

 

 

 

Figure 13. Example diagram from DCGAN paper

Work on generative adversarial networks is developing too quickly to list in one article. If you are interested in this topic, you can search for the relevant papers yourself. OpenAI has a blog post describing generative adversarial networks that is very good; because Ian Goodfellow works at OpenAI, the quality of this post is well assured. The link is: OpenAI generative adversarial networks blog

 

3.2 Video prediction

Yann LeCun has also proposed replacing the term “unsupervised learning” with “predictive learning”. Predictive learning means observing and understanding how the world works and then predicting how it will change: the machine learns to perceive changes in the world and infers its future state. At this year’s NIPS, Vondrick et al. of MIT published a paper entitled Generating Videos with Scene Dynamics, which proposes a model that, given a static picture, automatically predicts what happens next. For example, given a picture of a person standing on a beach, the model automatically produces a short video of the waves rolling in. The model was trained in an unsupervised way on a large number of videos, and the results show that it can learn useful features from video automatically. The videos in the figure below were generated automatically by the model. The pictures are not perfect, but they represent the scene quite well. If the figure does not display properly, see the video generation examples on the official website.


 

 

Figure 14. Randomly generated video of waves surging on the beach and trains running

Conditional video generation: as shown in the figure below, given a static picture as input, the model automatically generates a short video.

 

 


 

 

Figure 15. According to a static picture of grass, the model automatically predicts the movement scene of people. The picture is a GIF, if you cannot view it, please visit

Figure 16. A railway map is given. The model automatically predicts the appearance of a train running by

MIT's CSAIL lab also published a blog post called "Teaching Machines to Predict the Future". After the model was trained on YouTube videos and TV shows such as The Office and Desperate Housewives, if you show it a picture of the moment before a kiss, it can automatically predict the hugging and kissing that follow, as shown in the picture below.

Figure 17. Given a static picture, the model automatically predicts the next action

Lotter et al. of Harvard University proposed PredNet, which was trained on the KITTI dataset. Given the preceding video, the model can predict the next few frames of a dashcam recording. The model is built on long short-term memory (LSTM) networks. A specific example is shown in the figure below: after the model is given the first few dashcam frames, it automatically predicts the next five frames. The image is a GIF. If you can't view it properly, please visit the author's blog

 



Figure 18. Given the first few dashcam images, the model automatically predicts the following five frames of the scene. The figure is a GIF; if it cannot be viewed, please visit the

 

4 Summary

There are simply too many papers on generative adversarial networks and unsupervised video prediction, and my energy is limited. Interested readers can check arXiv every day, in particular the sections on computer vision and pattern recognition, neural and evolutionary computing, and artificial intelligence, where new papers appear almost daily. Image detection and segmentation, reinforcement learning, generative adversarial networks and predictive learning are all hot directions in artificial intelligence. I hope those of us who are interested in deep learning can achieve something in these areas. Thank you for reading. Friends interested in deep unsupervised learning are welcome to study and exchange ideas; please message me privately.

5 References

While writing this article, I tried to attach the address of each paper as a link in the body of the text. Most of the blogs and papers referred to in this article are listed below for your convenience and for my own future research.

Reference blogs

  1. Yann LeCun: Replacing Unsupervised Learning with predictive Learning
  2. Eleven milestones in computer vision and CNN
  3. Generative Models
  4. Generating Videos with Scene Dynamics
  5. Teaching machines to predict the future

Reference papers

  1. Image classification surpassing human-level recognition: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
  2. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
  3. Conditional Random Fields as Recurrent Neural Networks
  4. Show and Tell: A Neural Image Caption Generator
  5. Text-to-image generation: Generative Adversarial Text to Image Synthesis
  6. Flappy Bird: Using Deep Q-Network to Learn How to Play Flappy Bird
  7. Playing Atari with Deep Reinforcement Learning
  8. Generative Adversarial Networks
  9. Conditional Generative Adversarial Nets
  10. GAN for image super-resolution: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
  11. DCGAN: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
  12. Generating Videos with Scene Dynamics
  13. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

73 What is the difference between HashMap and Hashtable? Data structures, hash table, medium.

HashMap and Hashtable are both hash-table-based map implementations. The difference is that HashMap is unsynchronized and allows null keys and values, while Hashtable is synchronized and allows neither null keys nor null values. Source: oznyang.iteye.com/blog/30690. In addition, remember that containers with "hash" in their names, such as hashmap and hashset, are implemented on top of a hash table, while set and map without "hash" in their names are implemented on top of a red-black tree; the former are unordered and the latter are ordered. See Part 1 of the article "Teach you how to kill quickly: 99% of the massive data processing interview questions".

(Image source: July's slides from the interview & algorithm lecture at Shanghai Jiao Tong University on September 28, Vdisk.weibo.com/s/zrFL6OXKg…):

 

74 In classification problems, we often encounter unequal numbers of positive and negative samples. For example, there are 100,000 (10W) positive samples but only 10,000 (1W) negative samples. Which of the following processing methods is most appropriate? Machine learning, ML basics, medium.
A. Repeat the negative samples 10 times to obtain 10W samples, shuffle the order, and use them for classification.
B. Classify directly, which makes maximum use of the data.
C. Randomly select 1W of the 10W positive samples for classification.
Strictly speaking, each of the methods in the options has advantages and disadvantages, and the choice depends on the specific problem. One article analyzes the pros and cons of the various methods well; interested readers can refer to: www.analyticsvidhya.com/blog/2017/0…
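As a rough illustration of options A and C, the following sketch replicates the minority class or randomly subsamples the majority class with NumPy. The function names, the toy data and the 100-vs-10 class sizes are assumptions made for this example; they are not part of the original question.

```python
import numpy as np

def oversample_minority(X, y, majority_label, rng):
    """Option A: repeat minority rows until the classes are (roughly) balanced."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    repeats = len(maj_idx) // len(min_idx)      # e.g. 10 for a 10:1 ratio
    idx = np.concatenate([maj_idx, np.tile(min_idx, repeats)])
    idx = rng.permutation(idx)                  # shuffle the order
    return X[idx], y[idx]

def undersample_majority(X, y, majority_label, rng):
    """Option C: keep a random subset of majority rows, as many as minority rows."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = rng.permutation(np.concatenate([keep, min_idx]))
    return X[idx], y[idx]

# Toy data (hypothetical): 100 positive samples, 10 negative samples.
rng = np.random.default_rng(0)
y = np.array([1] * 100 + [0] * 10)
X = rng.normal(size=(110, 3))

X_over, y_over = oversample_minority(X, y, majority_label=1, rng=rng)
X_under, y_under = undersample_majority(X, y, majority_label=1, rng=rng)
```

Oversampling keeps all the data but repeats minority rows exactly, which can encourage overfitting to them; undersampling discards majority data. Which trade-off is acceptable depends on the problem.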

 

Questions 69 to 83 are from: blog.csdn.net/u011204487

75 Deep learning is a very popular machine learning approach, and it involves a lot of matrix multiplication. Suppose we need to compute the product ABC of three dense matrices A, B and C, whose dimensions are m×n, n×p and p×q respectively, and m
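The question above is cut off, but the general point behind it can be sketched: the two ways of parenthesizing ABC give the same result with different numbers of scalar multiplications, mnp + mpq for (AB)C and npq + mnq for A(BC). The sizes below are hypothetical values chosen for this example only.

```python
import numpy as np

def cost_ab_then_c(m, n, p, q):
    """Scalar multiplications for (AB)C: AB costs m*n*p, then (AB)C costs m*p*q."""
    return m * n * p + m * p * q

def cost_a_then_bc(m, n, p, q):
    """Scalar multiplications for A(BC): BC costs n*p*q, then A(BC) costs m*n*q."""
    return n * p * q + m * n * q

# Hypothetical sizes, not taken from the original question.
m, n, p, q = 10, 20, 30, 40
rng = np.random.default_rng(0)
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, p))
C = rng.normal(size=(p, q))

left = (A @ B) @ C      # same product, different cost
right = A @ (B @ C)
left_cost = cost_ab_then_c(m, n, p, q)
right_cost = cost_a_then_bc(m, n, p, q)
```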

 

76 Naive Bayes is a special Bayes classifier. The feature variable is X and the class label is C. One of its assumptions is: ( ) Machine learning, ML model, medium.
A. …
B. A normal distribution with mean 0 and standard deviation sqrt(2)/2
C. The dimensions of the feature variable X are class-conditionally independent random variables
D. P(X|C) is a Gaussian distribution
Correct answer: C
@BlackEyes_SGC: The basic assumption of naive Bayes is that the feature variables are independent of one another given the class.

 

77 Regarding the support vector machine (SVM), which of the following statements is incorrect? ( ) Machine learning, ML model, medium.
A. The role of the L2 regularization term is to maximize the classification margin, giving the classifier stronger generalization ability.
B. The role of the hinge loss function is to minimize the empirical classification error.
C. The classification margin is 1/||w||, where ||w|| is the norm of the vector w.
D. The smaller the parameter C, the larger the classification margin, the larger the classification error, and the more the model tends to underfit.
Correct answer: C
@BlackEyes_SGC:
A is correct. The reason for adding a regularization term is as follows: imagine a perfect data set in which y>1 is the positive class, y<-1 is the negative class and the decision surface is y=0. Add one positive noise sample at y=-30 and the decision surface becomes badly skewed, the classification margin shrinks and the generalization ability drops. With the regularization term, tolerance to the noise sample is increased; in the example above the decision surface is no longer so skewed, which makes the margin larger and improves generalization.
B is correct.
C is wrong. The margin should be 2/||w||. The second half of the sentence is correct: the norm of a vector usually means the L2 norm.
D is correct. In the soft-margin case, the effect of C on the optimization problem is to restrict the range of the multipliers a from [0, +inf] to [0, C]. The smaller C is, the smaller a becomes. Setting the derivative of the Lagrangian to zero gives w* = Σ_i a_i y_i x_i, so a smaller a makes w smaller, and the margin 2/||w|| becomes larger.

 

78 In an HMM, if both the observation sequence and the state sequence that generated it are known, which of the following methods can be used to estimate the parameters directly? ( ) Machine learning, ML model, medium.
A. EM algorithm
B. Viterbi algorithm
C. Forward-backward algorithm
D. Maximum likelihood estimation
Correct answer: D
@BlackEyes_SGC:
EM algorithm: learns the model parameters when only the observation sequence is available and there is no state sequence; this is the Baum-Welch algorithm.
Viterbi algorithm: uses dynamic programming to solve the prediction problem of the HMM; it is not a parameter estimation algorithm.
Forward-backward algorithm: used to compute probabilities.
Maximum likelihood estimation: a supervised learning method used to estimate the parameters when both the observation sequence and the corresponding state sequence are given.
Note that when the observation sequence and the corresponding state sequence are both given, the model parameters can be estimated by maximum likelihood. EM is only needed when there is no state sequence for the given observation sequence, that is, when the state sequence is unobservable hidden data.
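When both sequences are observed, maximum likelihood estimation for an HMM reduces to counting and normalizing. The sketch below assumes states and observations are encoded as small integers and applies no smoothing; the function name `hmm_mle` and the toy sequences are illustrative assumptions.

```python
import numpy as np

def hmm_mle(state_seqs, obs_seqs, n_states, n_symbols):
    """Estimate (pi, A, B) by counting, given paired state/observation sequences.

    Assumes every state appears at least once as a transition source and as an
    emitter; otherwise a row sum is zero and smoothing would be needed.
    """
    pi = np.zeros(n_states)               # initial state counts
    A = np.zeros((n_states, n_states))    # transition counts
    B = np.zeros((n_states, n_symbols))   # emission counts
    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1
        for t in range(len(states)):
            B[states[t], obs[t]] += 1
            if t + 1 < len(states):
                A[states[t], states[t + 1]] += 1
    # Normalize counts into probabilities.
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B

# Two short labeled sequences (hypothetical data).
state_seqs = [[0, 0, 1], [1, 0, 1]]
obs_seqs = [[0, 1, 1], [1, 0, 1]]
pi, A, B = hmm_mle(state_seqs, obs_seqs, n_states=2, n_symbols=2)
```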

 

79 Suppose a student using a Naive Bayes (NB) classification model accidentally duplicated two dimensions of the training data. Which statements about NB are correct? Machine learning, ML model, medium.
A. The decisive role of the duplicated feature in the model will be strengthened.
B. …
C. If all features are duplicated, the prediction of the model is the same as that of the model without duplication.
D. When two columns of features are highly correlated, the conclusion obtained for the case where the two columns are identical cannot be used to analyze the problem.
E. NB can be used for least squares regression.
F. …
The core of NB is that it assumes all components of the feature vector are independent of each other. In Bayesian theory there is an important conditional independence assumption: all features are assumed to be independent of one another, so that the joint probability can be factorized.

 

Which of the following methods cannot be used to classify text directly? Machine learning, ML model, easy.
A. K-means  B. Decision tree  C. Support vector machine  D. KNN
@BlackEyes_SGC: The answer is A. K-means is a clustering method, a typical unsupervised learning method. Classification is supervised learning, and B, C and D are all common classification methods.

Which of the following statements about principal component analysis (PCA) is incorrect? ( )
A. The optimality criterion of principal component analysis is that, when a set of data is decomposed over a set of orthogonal basis vectors and only the same number of components is kept, the truncation error is minimal in the mean-square-error sense.
B. After principal component decomposition, the covariance matrix becomes a diagonal matrix.
C. Principal component analysis is the K-L transform.
D. The principal components are obtained by computing the eigenvalues of the covariance matrix.
Correct answer: C
@BlackEyes_SGC: The K-L transform and the PCA transform are different concepts. The transformation matrix of PCA is the covariance matrix, while the transformation matrix of the K-L transform can be of many kinds (second-moment matrix, covariance matrix, total within-class scatter matrix, etc.). When the K-L transformation matrix is the covariance matrix, it is equivalent to PCA.

K-means complexity? Machine learning, ML model, easy.



Time complexity: O(tKmn), where t is the number of iterations, K is the number of clusters, m is the number of records, and n is the dimension. Space complexity: O((m+K)n), where K is the number of clusters, m is the number of records, and n is the dimension
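To see where the O(tKmn) time and O((m+K)n) space come from, here is a minimal NumPy sketch of the basic algorithm, written so that the loops mirror the symbols above. The random initialization and the toy two-blob data are assumptions of this example, not something the original specifies.

```python
import numpy as np

def kmeans(X, K, t, rng):
    """Basic K-means: t iterations, K clusters, m records, n dimensions."""
    m, n = X.shape
    # K centers of n values each, plus the m x n data: O((m + K) n) space.
    centers = X[rng.choice(m, size=K, replace=False)].copy()
    labels = np.zeros(m, dtype=int)
    for _ in range(t):                       # t iterations
        best = np.full(m, np.inf)
        for k in range(K):                   # K clusters
            # Squared distance of all m records to center k: O(m n) work.
            dist = ((X - centers[k]) ** 2).sum(axis=1)
            closer = dist < best
            labels[closer] = k
            best[closer] = dist[closer]
        for k in range(K):                   # recompute each center
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, labels

# Two well-separated blobs (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
centers, labels = kmeans(X, K=2, t=10, rng=rng)
```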

 

82 Which of the following statements about Logit regression and SVM is incorrect? (A) Machine learning, ML model, medium.
A. The objective function of Logit regression is to minimize the posterior probability.
B. Logit regression can be used to predict the probability that an event occurs.
C. The objective of SVM is to minimize structural risk.
D. SVM can effectively avoid model overfitting.
@BlackEyes_SGC:
A. Logit regression is essentially a method of maximum likelihood estimation of the weights based on the samples, whereas the posterior probability is proportional to the product of the prior probability and the likelihood function. Logit regression only maximizes the likelihood function; it does not maximize the posterior probability, let alone minimize it. A is wrong.
B. The output of Logit regression is the probability that the sample belongs to the positive class, so the probability can be calculated. B is correct.
C. The goal of SVM is to find the hyperplane that separates the training data as well as possible while maximizing the classification margin, which amounts to structural risk minimization.
D. SVM can control the complexity of the model through the regularization coefficient to avoid overfitting.

83 The input image size is 200×200. It passes through a convolution layer (kernel size 5×5, padding 1, stride 2), a pooling layer (kernel size 3×3, padding 0, stride 1), and then another convolution layer (kernel size 3×3, padding 1, stride 1). The size of the output feature map is: ( )
Sizes that do not divide evenly have so far only been encountered in GoogLeNet. Convolution rounds down and pooling rounds up.
(200-5+2*1)/2+1 = 99.5, rounded down to 99
(99-3)/1+1 = 97
(97-3+2*1)/1+1 = 97
So the output is 97×97. When the kernel is 3 and the padding is 1, or the kernel is 5 and the padding is 2, with stride 1, the size is the same before and after the convolution. The sizes in GoogLeNet are calculated in the same way.
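The arithmetic above can be checked with a small helper. This sketch follows the rounding convention stated in the answer (convolution rounds down, pooling rounds up); other frameworks may round pooling differently, and the function names are made up for this example.

```python
import math

def conv_out(size, kernel, padding, stride):
    """Output size of a convolution layer; rounds down."""
    return math.floor((size - kernel + 2 * padding) / stride) + 1

def pool_out(size, kernel, padding, stride):
    """Output size of a pooling layer; rounds up (convention used in the answer)."""
    return math.ceil((size - kernel + 2 * padding) / stride) + 1

size = 200
after_conv1 = conv_out(size, kernel=5, padding=1, stride=2)
after_pool = pool_out(after_conv1, kernel=3, padding=0, stride=1)
after_conv2 = conv_out(after_pool, kernel=3, padding=1, stride=1)
print(after_conv1, after_pool, after_conv2)
```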

84 The main factors affecting the results of a clustering algorithm are (B, C, D). Machine learning, ML model, easy. A. Quality of the samples of known classes; B. Classification criterion; C. Feature selection; D. Pattern similarity measure

 

85 In pattern recognition, the advantages of the Mahalanobis distance over the Euclidean distance are (C, D). Machine learning, ML model, easy. A. Translation invariance; B. Rotation invariance; C. Scale invariance; D. It takes the distribution of the patterns into account

 

86 The main factors affecting the basic K-means algorithm are (B, D). Machine learning, ML model, easy. A. Order in which the samples are input; B. Pattern similarity measure; C. Clustering criterion; D. Choice of the initial class centers

 

87 In statistical pattern classification, when the prior probabilities are unknown, one can use (B, D). Machine learning, ML model, easy. A. Minimum loss criterion; B. Minimax loss criterion; C. Minimum misclassification probability criterion; D. N-P (Neyman-Pearson) decision

 

88 If the correlation coefficient of the feature vectors is used as the pattern similarity measure, the main factors affecting the result of the clustering algorithm are (B, C). Machine learning, ML model, easy. A. Quality of the samples of known classes; B. Classification criterion; C. Feature selection; D. Units of measurement

 

89 The Euclidean distance has (A, B); the Mahalanobis distance has (A, B, C, D). Machine learning, ML basics, easy. A. Translation invariance; B. Rotation invariance; C. Scale invariance; D. Independence from the units of measurement
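The scale invariance of the Mahalanobis distance can be checked numerically. In this sketch the second feature is multiplied by 1000 (as if its unit had changed); the Euclidean distance between two points changes drastically, while the Mahalanobis distance, computed with the covariance of the rescaled data, stays the same. The tiny dataset and the helper name `mahalanobis` are assumptions for illustration.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x-y)^T cov^{-1} (x-y))."""
    d = x - y
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# Small hypothetical dataset with two features.
X = np.array([[1.0, 2.0], [3.0, 5.0], [2.0, 1.0], [4.0, 4.0], [0.0, 3.0]])
S = np.diag([1.0, 1000.0])          # change the unit of the second feature
X_scaled = X @ S

d_euclid = float(np.linalg.norm(X[0] - X[1]))
d_euclid_scaled = float(np.linalg.norm(X_scaled[0] - X_scaled[1]))

d_mahal = mahalanobis(X[0], X[1], np.cov(X, rowvar=False))
d_mahal_scaled = mahalanobis(X_scaled[0], X_scaled[1],
                             np.cov(X_scaled, rowvar=False))
```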

What tuning experience do you have with deep learning (RNN, CNN)? Deep learning, DL basics, medium. @bleak, source: www.zhihu.com/question/41…

Parameter initialization

Pick any one of the following methods; the results are much the same. But you must do it. Otherwise it may slow down convergence, affect the final result, or even cause a series of problems such as NaN.

In the following, n_in is the input size of the network, n_out is the output size of the network, and n is either n_in or (n_in+n_out)*0.5

Xavier’s original paper: jmlr.org/proceedings…

He initialization paper: arxiv.org/abs/1502.01…

  • Uniform distribution initialization: w = np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

    • Xavier initialization, for ordinary activation functions (tanh, sigmoid): scale = np.sqrt(3/n)
    • He initialization, for ReLU: scale = np.sqrt(6/n)
  • Normal (Gaussian) distribution initialization: w = np.random.randn(n_in, n_out) * stdev # stdev is the standard deviation of the Gaussian distribution, and the mean is set to 0

    • Xavier initialization, for ordinary activation functions (tanh, sigmoid): stdev = np.sqrt(1/n)
    • He initialization, for ReLU: stdev = np.sqrt(2/n)
  • SVD initialization: works well for RNNs. Reference paper: arxiv.org/abs/1312.61…
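The uniform and Gaussian recipes above can be wrapped in two small helpers. This is a sketch under the assumption that n is passed in by the caller (either n_in or the average of n_in and n_out, as described above); the function names and the `relu` flag are made up for this example. Note that a uniform distribution on [-scale, scale] has variance scale²/3, so scale = sqrt(3/n) and stdev = sqrt(1/n) give the same variance 1/n.

```python
import numpy as np

def uniform_init(n_in, n_out, n, rng, relu=False):
    """Uniform init: Xavier uses sqrt(3/n), He (for ReLU) uses sqrt(6/n)."""
    scale = np.sqrt((6.0 if relu else 3.0) / n)
    return rng.uniform(low=-scale, high=scale, size=(n_in, n_out))

def normal_init(n_in, n_out, n, rng, relu=False):
    """Gaussian init with mean 0: Xavier uses sqrt(1/n), He uses sqrt(2/n)."""
    stdev = np.sqrt((2.0 if relu else 1.0) / n)
    return rng.standard_normal((n_in, n_out)) * stdev

rng = np.random.default_rng(0)
n_in, n_out = 256, 256
n = 0.5 * (n_in + n_out)

w_uniform = uniform_init(n_in, n_out, n, rng)            # for tanh / sigmoid
w_normal = normal_init(n_in, n_out, n, rng)              # for tanh / sigmoid
w_relu = normal_init(n_in, n_out, n, rng, relu=True)     # for ReLU
```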

Data preprocessing mode

  • Zero-centering: used very often. X -= np.mean(X, axis=0) to zero-center, then X /= np.std(X, axis=0) to normalize.
  • PCA Whitening is rarely used.

Training tips

  • Be sure to do gradient normalization, that is, divide the computed gradient by the minibatch size.
  • Gradient clipping: limit the maximum gradient. Compute value = sqrt(w1^2+w2^2+…); if value exceeds the threshold, compute a decay coefficient so that value becomes equal to the threshold. Typical thresholds: 5, 10, 15.
  • Dropout is very effective at preventing overfitting on small data, and the value is generally set to 0.5. In most of my experiments on small data, dropout + SGD improved the results significantly, so try it if you can. The position of dropout matters a lot. For RNNs, it is recommended to place it at the input->RNN and RNN->output positions. For how to use dropout in RNNs, see this paper: arxiv.org/abs/1409.23…
  • Adam, Adadelta, etc.: on small data, they did not work as well as SGD in my experiments. SGD converges more slowly, but the final result is generally better. If you use SGD, you can start from a learning rate of 1.0 or 0.1 and check on the validation set every so often; if the cost has not decreased, halve the learning rate. I have read many papers that do this, and my own results are also good. Of course, you can also run one of the Ada-family optimizers first and switch to SGD to continue training once it is close to convergence, which also brings an improvement. Adadelta is said to generally work better on classification problems and Adam on generation problems.
  • Do not use sigmoid except in places such as gates, where the output needs to be limited to 0-1. Use activation functions such as tanh or ReLU instead. Reasons: 1. Outside a limited interval around zero, the gradient of sigmoid is close to 0, which easily causes the vanishing gradient problem. 2. Even when the input has zero mean, the output of the sigmoid function does not.
  • The hidden dim and embedding size of an RNN are generally tuned starting from around 128; the batch size is also generally tuned starting from around 128. Choosing a suitable batch size is what matters most; bigger is not always better.
  • Initializing with word2vec on small data not only speeds up convergence effectively but can also improve the results.
  • Shuffle the data as much as possible.
  • Initializing the forget gate bias of an LSTM to 1.0 or greater can give better results, according to this paper: jmlr.org/proceedings… In my experiments, setting it to 1.0 improved the convergence speed. In practice, different tasks may need different values.
  • Batch Normalization is said to improve performance, but I have not tried it; I suggest it as a last step for improving the model. Refer to the paper: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • If your model contains a fully connected layer (MLP) whose input and output sizes are the same, you can consider replacing the MLP with a Highway Network. In my experiments it improved the results a little, and I suggest it as a last step for improving the model. The principle is very simple: a gate is added to the output to control the flow of information. For a detailed introduction, please refer to the paper: arxiv.org/abs/1505.00…
  • A tip from @Zhang Xinyu: train one round with regularization and one round without, and repeat.

Ensemble

Ensembling is the ultimate nuclear weapon for boosting the results in a paper. In deep learning it is generally done in the following ways:

  • Same parameters, different initialization methods
  • Cross-validation selects the best set of parameters
  • The same parameters, different stages of model training, that is, different iterations of the model.
  • Linear fusion of different models. Examples are RNN and traditional models.

For more tips on deep learning, see Alchemist Lab – Zhihu

 

91 What is the principle of RNN? Deep learning, DL model.

When we prepare for the college entrance examination in the third year of high school, our knowledge at that point is built on what we learned in the second year and earlier; that is, our knowledge rests on earlier foundations and memory. In the same way, when "I am" appears in a movie subtitle, you will naturally associate it with something like "I am Chinese".



For RNN conditional generation, attention, LSTM and related topics, see the course "Deep Learning" [the best in its category, training DL engineers].

92 What is RNN? Deep learning, DL model.

@A bird in the sky, source: Blog.csdn.net/heyongluoya…

The purpose of RNNs is to process sequence data. In a traditional neural network model, data flows from the input layer to the hidden layer and then to the output layer; adjacent layers are fully connected, while the nodes within each layer are not connected to each other. Such ordinary neural networks are powerless for many problems. For example, to predict the next word in a sentence you usually need the preceding words, because the words in a sentence are not independent of each other. RNNs are called recurrent neural networks because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes earlier information and applies it to the computation of the current output: the nodes of the hidden layer are no longer unconnected but connected across time steps, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. In theory, RNNs can process sequence data of any length. In practice, however, to reduce complexity it is often assumed that the current state depends only on the previous few states. Here is a typical RNN:

 

From Nature 

RNNs contain input units, whose inputs are labeled {x_0, x_1, …, x_t, x_{t+1}, …}, and output units, whose outputs are labeled {y_0, y_1, …, y_t, y_{t+1}, …}. RNNs also contain hidden units, whose outputs we label {s_0, s_1, …, s_t, s_{t+1}, …}; these hidden units do most of the work. You can see in the diagram that there is a one-way flow of information from the input units to the hidden units, and another one-way flow of information from the hidden units to the output units. In some cases, RNNs break the latter restriction and feed information from the output units back to the hidden units; these are called "back projections". The input to the hidden layer also includes the state of the previous hidden layer, that is, the nodes in the hidden layer can be self-connected or interconnected.

The figure above unrolls the recurrent neural network into a full neural network. For example, for a five-word sentence, the unrolled network is a five-layer neural network, with each layer corresponding to one word. The computation of the network is as follows:

  • x_t is the input at step t, t = 1, 2, 3, … For example, x_1 is the one-hot vector of the second word (following the figure above, x_0 is the first word);
  • s_t is the state of the hidden layer at step t; it is the memory unit of the network. s_t is computed from the output of the current input layer and the hidden state of the previous step: s_t = f(U x_t + W s_{t−1}), where f is generally a nonlinear activation function such as tanh or ReLU. Computing s_0, the hidden state for the first word, requires s_{−1}, which does not exist and is generally set to a zero vector in implementations;
  • o_t is the output at step t, for example a vector of probabilities for the next word: o_t = softmax(V s_t). Read more here: RNN, Recurrent Neural Networks.
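The three steps above can be written as a short NumPy forward pass. This is a minimal sketch assuming tanh as the activation f, one-hot inputs and small random weights; the sizes (vocabulary of 4, hidden size 3, sentence of 5 words) and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """s_t = tanh(U x_t + W s_{t-1}),  o_t = softmax(V s_t)."""
    s = np.zeros(W.shape[0])          # s_{-1} is set to the zero vector
    states, outputs = [], []
    for x in xs:                      # the same U, W, V are used at every step
        s = np.tanh(U @ x + W @ s)
        states.append(s)
        outputs.append(softmax(V @ s))
    return np.array(states), np.array(outputs)

rng = np.random.default_rng(0)
vocab, hidden = 4, 3
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab, hidden))

word_ids = [0, 2, 1, 3, 2]            # a five-word sentence (hypothetical)
xs = np.eye(vocab)[word_ids]          # one-hot vectors x_0 .. x_4
states, outputs = rnn_forward(xs, U, W, V)
```

Each row of `outputs` is a probability distribution over the vocabulary for the next word at that step.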

 

93 How is an RNN constructed step by step from a single-layer network? Deep learning, DL model, hard. @He Zhiyuan, source of this analysis: zhuanlan.zhihu.com/p/28054589

1. Starting from a single-layer network

Before learning RNN, we should first understand the most basic single-layer network, whose structure is shown as follows:

The input is x; applying the transformation Wx+b and the activation function f gives the output y. I'm sure you're already familiar with this.

2. Classical RNN structure (N vs N)

In practical applications, we will also encounter a lot of sequential data:

Such as:

  • Natural language processing problems. X1 could be the first word, x2 could be the second word, and so on.
  • Speech processing. Now, x1, x2, x3… It’s the sound of each frame.
  • Time series. For example, daily stock prices and so on.

Sequential data is hard to process with an ordinary neural network. To model sequence problems, RNN introduces the concept of the hidden state h, which extracts features from sequential data and then transforms them into outputs. Let's start with the computation of h1:

The meanings of the symbols in the drawings are:

  • Circles or squares are vectors.
  • An arrow represents a transformation of the vector it starts from. As shown in the figure above, h0 and x1 each have an outgoing arrow, indicating that a transformation is applied to each of h0 and x1.

Similar notations will appear in many papers. It is easy to get confused at the beginning, but as long as you grasp the above two points, you can easily understand the meaning behind the diagrams.

h2 is computed similarly to h1. Note that the parameters U, W and b used at every step are the same; that is, the parameters are shared across steps. This is an important feature of RNNs and must be kept in mind.

Calculate the rest in turn (using the same parameters U, W, b) :

We’re just going to draw the case of length 4 for convenience, but in fact, this calculation can go on indefinitely.

Our RNN so far has no output. The output is obtained by computing directly from h:

As before, an arrow represents a transformation of the corresponding vector of the form f(Wx+b); the arrow here represents a transformation of h1 that gives the output y1.

The remaining outputs are computed similarly (using the same parameters V and c as for y1):

OK, done! This is the classic RNN structure, and we built it up like building blocks. Its inputs are x1, x2, …, xn and its outputs are y1, y2, …, yn; that is, the input and output sequences must have the same length.

Because of this limitation, the classical RNN has a relatively narrow range of application, but some problems are well suited to it, such as:

  • Calculate the classification label for each frame in the video. Because each frame is evaluated, the input and output sequences are of equal length.
  • The input is a character and the output is the probability of the next character. This is the well-known Char RNN (see The Unreasonable Effectiveness of Recurrent Neural Networks for details). Char RNN can be used to generate articles, poems and even code. This blog also has an experimental tutorial on automatic lyric generation: "Based on Torch learning Wang Feng's lyric Writing, chatbot, Image coloring/generation, Picture speaking, subtitles generation".

3. N vs 1

Sometimes the problem we are dealing with takes a sequence as input and produces a single value rather than a sequence as output. How do we model that? In fact, we only need to apply the output transformation to the last h:

This structure is usually used to deal with sequence classification problems. For example, input a text to determine its category, input a sentence to determine its emotional orientation, input a video and determine its category, and so on.

4. 1 vs N

What if the input is not a sequence but the output is a sequence? We can feed the input only at the first step of the sequence:

Another structure feeds the input information X into every step:

The following figure, which omits some X circles, is an equivalent representation:

This 1 VS N structure can handle the following problems:

  • Generating a caption from an image: the input X is the image feature, and the output sequence y is a sentence
  • Generating speech or music, etc., from a category

5. N vs M

Let’s introduce one of the most important variants of RNN: N vs M. This structure is also called Encoder-Decoder model, or Seq2Seq model.

The original N vs N RNN requires the sequence to be of equal length. However, most of the problems we encounter are of unequal length. For example, in machine translation, the sentences in the source language and the target language often do not have the same length.

To do this, the Encoder-Decoder structure first encodes the input data into a context vector C:

There are many ways to obtain c. The simplest is to assign the last hidden state of the Encoder to c; you can also apply a transformation to the last hidden state to obtain c, or apply a transformation to all the hidden states.

After c is obtained, it is decoded by another RNN, which is called the Decoder. One way to do this is to feed c into the Decoder as its initial state h0:

Another option is to use c as the input at every step:

Because this Encoder-Decoder structure does not limit the sequence length of input and output, it is widely used, such as:

  • Machine translation. This is the most classic application of Encoder-Decoder; in fact, the structure was first proposed in the field of machine translation.
  • Text summarization. The input is a text sequence, and the output is a summary sequence of that text.
  • Reading comprehension. Encode the input passage and the question separately, then decode to obtain the answer to the question.
  • Speech recognition. The input is a speech signal sequence and the output is a text sequence.
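The Encoder-Decoder idea can be sketched in a few lines of NumPy. This sketch uses the simplest variant described above: the context vector c is the last hidden state of the Encoder and is used as the initial state h0 of the Decoder, so an input of length N = 5 can produce an output of length M = 3. All sizes, weight names and the random weights are assumptions for illustration; a real model would learn them and would usually decide when to stop decoding.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, U, W, b):
    """Run the encoder RNN and return its last hidden state as the context c."""
    h = np.zeros(W.shape[0])
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)
    return h

def decode(c, steps, W_dec, b_dec, V, b_out):
    """Run the decoder RNN for `steps` steps, starting from h0 = c."""
    h = c
    ys = []
    for _ in range(steps):
        h = np.tanh(W_dec @ h + b_dec)
        ys.append(softmax(V @ h + b_out))
    return np.array(ys)

rng = np.random.default_rng(0)
in_dim, hidden, out_dim = 4, 8, 6
U = rng.normal(scale=0.1, size=(hidden, in_dim))
W = rng.normal(scale=0.1, size=(hidden, hidden))
b = np.zeros(hidden)
W_dec = rng.normal(scale=0.1, size=(hidden, hidden))
b_dec = np.zeros(hidden)
V = rng.normal(scale=0.1, size=(out_dim, hidden))
b_out = np.zeros(out_dim)

xs = rng.normal(size=(5, in_dim))      # input sequence of length N = 5
c = encode(xs, U, W, b)                # context vector
ys = decode(c, 3, W_dec, b_dec, V, b_out)   # output sequence of length M = 3
```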

 

94 Can only tanh, and not ReLU, be used as the activation function in an RNN? Deep learning, DL model. For the analysis see: www.zhihu.com/question/61…

Deep learning (CNN, RNN, Attention) for large-scale text classification. Deep learning, DL application, hard. See: zhuanlan.zhihu.com/p/25928551

How to solve the gradient explosion and vanishing gradient problems of RNNs? Source: CS224d, Language Model, RNN, LSTM and GRU

To solve the gradient explosion problem, Tomas Mikolov first proposed a simple heuristic solution: when the norm of the gradient exceeds a certain threshold, it is clipped to a smaller value, as described in Algorithm 1:

Algorithm 1: Clip the gradient when it explodes (pseudocode)

ĝ ← ∂E/∂W

if ‖ĝ‖ ≥ threshold then

    ĝ ← (threshold / ‖ĝ‖) · ĝ

end if
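The clipping rule can be written directly in NumPy. This is a minimal sketch of norm-based clipping for a single gradient array, assuming a positive threshold; the function name `clip_gradient` is made up for this example.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so that its norm is at most `threshold` (threshold > 0)."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        # Keep the direction, shrink the length to exactly `threshold`.
        g = g * (threshold / norm)
    return g

g = np.array([3.0, 4.0])                       # norm is 5
clipped = clip_gradient(g, threshold=1.0)      # norm becomes 1
unchanged = clip_gradient(g, threshold=10.0)   # below the threshold, untouched
```

Because the whole vector is rescaled, the direction of the update is preserved and only its length is limited.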


The following figure visualizes the effect of gradient clipping. It shows the decision surface of a small RNN with respect to its weight matrix W and its bias term b. The model consists of a single RNN unit run over a small number of time steps; the solid arrows show the training progress at each gradient descent step. When gradient descent hits a wall of high error in the objective function, the gradient pushes the parameters to a far-away location on the decision surface. The clipped model produces the dashed line, which instead pulls the update back to somewhere close to the original gradient landscape.

 


 

Visualization of gradient explosion and gradient clipping

 

To solve the vanishing gradient problem, we introduce two methods. The first is to replace the random initialization of W(hh) with an identity matrix initialization. The second is to replace the sigmoid function with ReLU (Rectified Linear Units). The derivative of ReLU is either 0 or 1, so the gradient flows through the neurons whose derivative is 1 without being attenuated as it is propagated back through the time steps.

 

97 How to understand LSTM networks? Deep learning, DL model, hard. @Not_GOD, source of this analysis: www.jianshu.com/p/9dc9f41f0…

Recurrent Neural Networks

Humans don't start their thinking from scratch every moment. As you read this article, you infer the meaning of the current word from your understanding of the words you have already seen. We don't throw everything away and start thinking again with a blank mind. Our thoughts have persistence. Traditional neural networks can't do this, and that seems like a major shortcoming. For example, suppose you want to classify the type of event happening at each point in time in a movie. It is hard for a traditional neural network to handle this problem, that is, to use its reasoning about earlier events in the movie to inform later ones. RNNs solve this problem. An RNN is a network containing loops, which allow information to persist.

RNN contains loops

In the example figure above, a module of the neural network, A, reads an input x_i and outputs a value h_i. The loop allows information to be passed from one step of the network to the next. These loops make RNNs seem mysterious. However, if you think about it, they are not so different from a normal neural network. An RNN can be thought of as multiple copies of the same network, each passing a message to its successor. Consider what happens if we unroll the loop:

 

An unrolled RNN

This chain-like nature reveals that RNNs are intimately related to sequences and lists. They are the natural neural network architecture for this kind of data. And they are certainly being used! In the past few years, RNNs have been applied with considerable success to speech recognition, language modeling, translation, image captioning and more, and the list is still growing. I recommend Andrej Karpathy’s blog post, The Unreasonable Effectiveness of Recurrent Neural Networks, for more interesting and successful uses of RNNs. The key to these successes is the use of LSTM, a special kind of RNN that performs better than the standard RNN on many tasks. Almost all the exciting results based on RNNs are achieved with LSTM. This post will focus on LSTM.

The problem of long-term dependencies

One of the appeals of RNNs is that they can connect previous information to the current task, for example using previous video frames to inform the understanding of the current frame. If RNNs could do this, they would be extremely useful. But can they? It depends. Sometimes we only need recent information to perform the current task. For example, consider a language model that predicts the next word based on the previous ones. If we are trying to predict the last word of “the clouds are in the sky”, we don’t need any further context; the next word is obviously sky. In such cases, where the gap between the relevant information and the place where it is needed is small, RNNs can learn to use the past information.

A small gap between the relevant information and the position where it is needed

But there are also more complex cases. Suppose we try to predict the last word of “I grew up in France… I speak fluent French”. Recent information suggests that the next word is probably the name of a language, but to figure out which language, we need the context of France, which appeared much earlier. The gap between the relevant information and the position being predicted can become very large. Unfortunately, as this gap grows, RNNs become unable to learn to connect the information.

A large gap between the relevant information and the position where it is needed

In theory, RNNs are absolutely capable of handling such long-term dependencies. One could carefully pick parameters by hand to solve toy problems of this form. In practice, however, RNNs do not seem to be able to learn them. Bengio, et al. (1994) studied this problem in depth and found some fairly fundamental reasons why training RNNs on it is difficult. Fortunately, LSTM doesn’t have this problem!

LSTM network

Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN that can learn long-term dependencies. As @hanxiaoyang said: LSTM and the baseline RNN are not particularly different in structure, but they use different functions to compute the hidden state. The “memory” of an LSTM is called a cell, and you can think of cells as black boxes that take the previous state h_{t-1} and the current input x_t as input. These cells decide which previous information and states should be kept and which should be erased. In practice, this turns out to preserve long-range information effectively.

LSTM was proposed by Hochreiter & Schmidhuber (1997) and was later refined and popularized by Alex Graves. It has achieved considerable success on many problems and is widely used. LSTM is explicitly designed to avoid the long-term dependency problem. Remembering information for long periods is practically its default behavior, not something it struggles to learn!

All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.

Repeating modules in standard RNN contain a single layer

 

LSTMs also have this chain-like structure, but the repeating module has a different internal structure. Instead of a single neural network layer, there are four, interacting in a very specific way.

The repeating module in the LSTM consists of four interactive layers

 

Don’t worry about the details for now. We will walk through the LSTM diagram step by step. First, let’s get familiar with the notation used for the various elements in the figure below.

Notation used in the LSTM diagrams

 

In the illustration above, each black line carries an entire vector from the output of one node to the inputs of others. The pink circles represent pointwise operations, such as vector addition, and the yellow boxes are learned neural network layers. Lines that merge denote concatenation of vectors, and lines that fork denote the content being copied, with the copies going to different locations.

The core idea of LSTM

The key to LSTM is the cell state, the horizontal line running across the top of the diagram. The cell state is like a conveyor belt. It runs straight down the entire chain, with only a few minor linear interactions. It is very easy for information to flow along it unchanged.

LSTM can remove information from or add information to the cell state through carefully designed structures called gates. Gates are a way to let information through selectively. They consist of a sigmoid neural network layer and a pointwise multiplication operation.

     

 

The sigmoid layer outputs values between 0 and 1 that describe how much of each component should be let through. 0 means “let nothing through” and 1 means “let everything through”!

The LSTM has three gates that protect and control the cell state.

Step by step to understand LSTM

The first step in our LSTM is to decide what information to discard from the cell state. This decision is made by a layer called the forget gate. The gate reads h_{t-1} and x_t, and outputs a value between 0 and 1 for each number in the cell state C_{t-1}. 1 means “keep this completely” and 0 means “discard this completely”. Let’s go back to the example of a language model that predicts the next word based on all the previous ones. In this case, the cell state may include the gender of the current subject, so that the correct pronoun can be chosen. When we see a new subject, we want to forget the gender of the old subject.

 

Decide to discard information


The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we will update. Then, a tanh layer creates a vector of new candidate values, \tilde{C}_t, that could be added to the state. In the next step, we will combine these two to produce an update to the state.

In the case of our language model, we want to add the gender of the new subject to the cell state, replacing the old one that we are forgetting.

Identify the updated information

It is now time to update the old cell state C_{t-1} into the new cell state C_t. The previous steps have already decided what to do; we just need to do it. We multiply the old state by f_t, discarding the information we decided to discard. Then we add i_t * \tilde{C}_t. These are the new candidate values, scaled by how much we decided to update each state value. In the case of the language model, this is where we actually drop the gender information of the old subject and add the new information, as decided in the previous steps.

Update cell state

Finally, we need to decide what to output. The output is based on our cell state, but is a filtered version of it. First, we run a sigmoid layer that decides which parts of the cell state to output. Then we pass the cell state through tanh (to get values between -1 and 1) and multiply it by the output of the sigmoid gate, so that we output only the parts we decided to. In the language model example, since the model has just seen a subject, it might want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know what form the verb should be conjugated into if a verb follows.

Output information
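The four steps above (forget, choose what to input, update the cell state, filter the output) can be put together in one function. The following is a minimal sketch in plain Python, assuming each gate is a single layer with weights W and bias b acting on the concatenation of h_{t-1} and x_t; the names `lstm_step`, `affine` and the toy parameter values are illustrative choices, not taken from the original post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, b) pair."""
    z = h_prev + x_t                                      # concatenate [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]     # forget gate f_t
    i = [sigmoid(a) for a in affine(*params["i"], z)]     # input gate i_t
    g = [math.tanh(a) for a in affine(*params["c"], z)]   # candidate values ~C_t
    o = [sigmoid(a) for a in affine(*params["o"], z)]     # output gate o_t
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]  # filtered cell state
    return h_t, c_t

# Toy setup: hidden size 2, input size 1, all weights zero.
# A large forget bias and a very negative input bias keep the old cell state.
hidden, inputs = 2, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {
    "f": (zeros, [10.0, 10.0]),
    "i": (zeros, [-10.0, -10.0]),
    "c": (zeros, [0.0, 0.0]),
    "o": (zeros, [0.0, 0.0]),
}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [0.5, -1.0], params)
```

With the forget gate close to 1 and the input gate close to 0, `c_t` stays almost equal to the previous cell state, which is the conveyor-belt behavior described earlier.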

Variants of LSTM

So far we have described a fairly standard LSTM. But not all LSTMs are the same. In fact, almost every paper involving LSTMs uses a slightly different version. The differences are minor, but a few are worth mentioning. One popular LSTM variant, proposed by Gers & Schmidhuber (2000), adds “peephole connections”. That is, we let the gate layers look at the cell state as well.

Peephole connection

In the illustration above, we added peepholes to all the gates, but many papers add peepholes to only some of them.

Another variation is to use coupled forget and input gates. Instead of deciding separately what to forget and what new information to add, we make the two decisions together. We only forget when we are going to input something in its place, and we only input new values into the state when we forget something older.

 

Coupled forget gate and input gate

Another, more substantially modified variant is the Gated Recurrent Unit (GRU), proposed by Cho, et al. (2014). It combines the forget gate and the input gate into a single update gate. It also merges the cell state and the hidden state, and makes some other changes. The resulting model is simpler than the standard LSTM and is a very popular variant.

 

GRU
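For comparison with the LSTM, here is a minimal sketch of one GRU step in plain Python. It follows the commonly cited formulation with an update gate z_t and a reset gate r_t, and assumes bias terms on every layer; the names `gru_step`, `affine` and the toy parameters are illustrative, not from the original post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(x_t, h_prev, params):
    """One GRU time step; there is no separate cell state, only h."""
    z_in = h_prev + x_t                                    # concatenate [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(*params["z"], z_in)]   # update gate
    r = [sigmoid(a) for a in affine(*params["r"], z_in)]   # reset gate
    gated = [rk * hk for rk, hk in zip(r, h_prev)] + x_t   # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in affine(*params["h"], gated)]
    # A single gate decides both how much to forget and how much to write.
    return [(1.0 - zk) * hk + zk * ck for zk, hk, ck in zip(z, h_prev, h_tilde)]

hidden, inputs = 2, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {
    "z": (zeros, [-10.0, -10.0]),   # update gate almost closed
    "r": (zeros, [0.0, 0.0]),
    "h": (zeros, [0.0, 0.0]),
}
h_t = gru_step([1.0], [0.3, -0.7], params)
```

With the update gate almost closed, the new hidden state stays close to the old one; opening it moves the state toward the candidate values.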


These are only a few of the popular LSTM variants. There are many others, such as the Depth Gated RNN proposed by Yao, et al. (2015). There are also completely different approaches to the long-term dependency problem, such as the Clockwork RNN proposed by Koutnik, et al. (2014).

Which variant is best? Do the differences really matter? Greff, et al. (2015) compared popular variants and concluded that they are all about the same. Jozefowicz, et al. (2015) tested more than 10,000 RNN architectures and found that some of them work better than LSTM on certain tasks.

Screenshots of paper by Jozefowicz et al

Conclusion

At the beginning, I mentioned the remarkable results achieved with RNNs. Essentially all of them are achieved using LSTM, which really does perform better on most tasks! Written down as a set of equations, LSTM looks rather intimidating; hopefully the step-by-step explanation in this article makes it more approachable.

LSTM was a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? The prevailing view among researchers is: “Yes! There is a next step, and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you use an RNN to generate a caption for an image, it might select a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this, and it might be a good starting point if you want to explore attention in depth! There has been a lot of exciting research using attention, and it looks like there is much more to explore.

Attention is not the only exciting direction in RNN research. For example, the Grid LSTM proposed by Kalchbrenner, et al. (2015) also looks promising. Work using RNNs in generative models, such as Gregor, et al. (2015), Chung, et al. (2015) and Bayer & Osendorfer (2015), is also very interesting. The past few years have been an exciting time for RNN research, and the coming ones promise to be even more so!

Again, this explanation is essentially taken from Not_GOD’s translation of Christopher Olah’s blog post “Understanding LSTM Networks”. Thank you.

 

98 Differences between RNN, LSTM, and GRU. Deep learning, DL models, hard. @I love Big Bubbles; source of this analysis: blog.csdn.net/woaidapaopa…

  • RNN: introduces recurrence, but in practice information from early time steps fades over time. This is the long-term dependencies problem, which is why LSTM was introduced.
  • LSTM: LSTM has input and output gates, and the current cell information is added to the cell state after being controlled by the input gate, whereas a plain RNN multiplies repeatedly. This additive update is the key change that lets LSTM mitigate vanishing or exploding gradients; the figure below makes it easy to remember:

  • GRU: a variant of LSTM that combines the forget gate and the input gate into a single update gate.

When machine learning performance hits a bottleneck, how do you optimize? Machine learning, ML applications. You can try four directions: improve the data, try different algorithms, tune the algorithm, and combine models (model fusion). How much detail you can go into depends on your experience. Here is a reference list: the Machine Learning Performance Improvement cheat sheet.

 

How to improve the performance of deep learning? Deep learning, DL applications, hard. See: blog.csdn.net/han_xiaoyan…

What kind of machine learning projects have you done? For example, how to build a recommendation system from scratch. See the open class “Recommendation System”; another recommended course is the machine learning project class (10 project walkthroughs, entirely hands-on).

 

What kinds of datasets are not suitable for deep learning? Deep learning, DL applications, hard. @abstract monkey; source: www.zhihu.com/question/41…

  1. When the data set is too small and the data samples are insufficient, deep learning has no obvious advantages over other machine learning algorithms.
  2. The data set has no local correlation structure. The fields where deep learning currently performs well are mainly image, speech and natural language processing, and what they have in common is local correlation: pixels in an image form objects, phonemes in a speech signal form words, and words in text form sentences. Once the arrangement of these elements is scrambled, the meaning changes as well. Deep learning is not well suited to data sets without such local correlation. For example, when predicting a person’s health from parameters such as age, occupation, income and family status, the order of the features can be shuffled without affecting the result.

How can generalized linear models be used in deep learning? Deep learning DL model

@Xu Han, source: www.zhihu.com/question/41…

A Statistical View of Deep Learning (I): Recursive GLMs

From a statistical point of view, deep learning can be regarded as a recursive generalized linear model.

Compared with the classical linear model ($y = wx + b$), the core of the generalized linear model is the introduction of a link function $g(\cdot)$, so that $y = g^{-1}(wx + b)$.

The activation function of a neuron corresponds to the link function of the recursive generalized linear model in deep learning. The logistic function of logistic regression (a kind of generalized linear model) is the sigmoid function among neuron activation functions. Many similar methods have different names in statistics and in neural networks, which easily confuses beginners (mainly me, here). Below is a comparison table:

 

What theoretical knowledge should you know when preparing for a machine learning interview? Machine learning, ML models. @MuWen, source: www.zhihu.com/question/62…

 

Read on: the answers to these questions are basically all in this BAT machine learning interview 1000 questions series.

 

102 What is the difference between standardization and normalization? Machine learning, ML basics, easy. @ai huafeng; source of this analysis: www.zhihu.com/question/20…

Normalization:
  1. Turn the numbers into decimals in (0, 1). This is mainly done for convenience of data processing: mapping the data to the range 0 to 1 makes processing more convenient and faster.
  2. Turn a dimensional expression into a dimensionless one. This simplifies calculation: after the transformation the quantity no longer carries a unit and becomes a pure number.

Standardization: scale the data proportionally so that it falls into a small, specific interval. For example, the indicators in a credit index system are measured in different units, so before they can take part in an evaluation calculation they must be put on a common scale by mapping their values to a given interval through a function transformation.
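As a concrete illustration, the sketch below contrasts two widely used transformations: min-max scaling, which maps values into [0, 1], and z-score scaling, which gives mean 0 and standard deviation 1. Treating these as "normalization" and "standardization" respectively is one common reading and an assumption here; the function names and the sample data are made up for the example.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Map values linearly into [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Shift to mean 0 and scale to standard deviation 1; assumes non-constant data."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
normalized = min_max_normalize(heights_cm)
standardized = z_score_standardize(heights_cm)
```

Both results are dimensionless: the centimeter unit cancels out, so features measured in different units become comparable.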

How does random forest deal with missing values? Machine learning, ML models.

Method 1 (na.roughfix) is simple and crude: within the training data of the same class, a missing categorical variable is filled with the mode, and a missing continuous variable is filled with the median.

Method 2 (rfImpute) is computationally expensive, and whether it is better than method 1 is hard to judge. First fill the missing values with na.roughfix, then build the forest and compute the proximity matrix. Then go back to the missing values: for a continuous variable, the missing value is replaced by the proximity-weighted average of the non-missing observations. Iterate this 4 to 6 times. The idea behind this imputation is similar to KNN.
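The first method is easy to sketch. The code below is a rough Python analogue of the per-class median/mode fill described above, not the R implementation itself; the function name `rough_fill`, the use of None for missing entries and the sample data are assumptions for illustration.

```python
from collections import Counter, defaultdict
from statistics import median

def rough_fill(values, labels, categorical):
    """Fill None entries with the per-class mode (categorical) or median (continuous)."""
    observed_by_class = defaultdict(list)
    for v, y in zip(values, labels):
        if v is not None:
            observed_by_class[y].append(v)
    fill = {}
    for y, observed in observed_by_class.items():
        if categorical:
            fill[y] = Counter(observed).most_common(1)[0][0]  # mode
        else:
            fill[y] = median(observed)
    # Note: a class whose values are all missing has no fill value (KeyError).
    return [fill[y] if v is None else v for v, y in zip(values, labels)]

ages = rough_fill([20, None, 40, 60, None, 80],
                  ["a", "a", "a", "b", "b", "b"], categorical=False)
colors = rough_fill(["red", "red", None, "blue"],
                    ["a", "a", "a", "b"], categorical=True)
```

The second method would start from exactly this fill and then refine the values using proximities from a trained forest, which is where the extra computation comes from.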

 

How does random forest assess the importance of features? Machine learning, ML models.

There are two ways to measure variable importance, Decrease GINI and Decrease Accuracy:
  1) Decrease GINI: for regression problems, argmax(Var - VarLeft - VarRight) is used directly as the criterion, that is, the variance Var of the training set at the current node minus the variance VarLeft of the left child node and the variance VarRight of the right child node.
  2) Decrease Accuracy: for a tree Tb(x), use the OOB samples to get test error 1. Then randomly permute the j-th column of the OOB samples, keeping the other columns unchanged, and get error 2. The increase from error 1 to error 2 measures the importance of variable j. The basic idea is that if variable j is important enough, changing it will greatly increase the test error; conversely, if changing it does not increase the test error, the variable is not that important.
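The Decrease Accuracy idea can be sketched without any library. In the example below, a hypothetical `predict` function stands in for a trained tree and a small random data set stands in for the OOB samples; both are assumptions made purely for illustration.

```python
import random

def error_rate(predict, rows, labels):
    wrong = sum(1 for r, y in zip(rows, labels) if predict(r) != y)
    return wrong / len(rows)

def permutation_importance(predict, rows, labels, j, seed=0):
    """Increase in error after shuffling column j while other columns stay fixed."""
    rng = random.Random(seed)
    error_1 = error_rate(predict, rows, labels)
    column = [r[j] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
    error_2 = error_rate(predict, shuffled, labels)
    return error_2 - error_1

# Hypothetical stand-in for a trained tree: it only looks at feature 0.
def predict(row):
    return int(row[0] > 0.5)

data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [predict(r) for r in rows]

imp0 = permutation_importance(predict, rows, labels, j=0)  # feature the model uses
imp1 = permutation_importance(predict, rows, labels, j=1)  # feature the model ignores
```

Shuffling the feature the model relies on raises the error substantially, while shuffling the ignored feature leaves the error unchanged, which matches the intuition in the text.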

 

104 How to optimize KMeans? Machine learning, ML models. Use a kd-tree or ball tree: build all the observation instances into a kd-tree. Previously, every cluster center had to compute its distance to every observation point in turn; with the kd-tree, each cluster center only needs to compute distances within a nearby local region.

 

105 Choosing the initial cluster centers for KMeans. Machine learning, ML models.

The basic idea of choosing initial seeds with the k-means++ algorithm is that the initial cluster centers should be as far from each other as possible.
  1. Randomly select one point from the input data set as the first cluster center.
  2. For each point x in the data set, compute its distance D(x) to the nearest already-selected cluster center.
  3. Select a new data point as a new cluster center, such that points with larger D(x) have a higher probability of being selected.
  4. Repeat steps 2 and 3 until k cluster centers have been selected.
  5. Run the standard k-means algorithm with these k initial cluster centers.
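Steps 1 to 4 can be sketched as follows. The selection probability here is proportional to D(x) squared, which is the weighting used in the original k-means++ paper; the text above only requires that larger D(x) be more likely. The function name `kmeans_pp_init` and the sample points are assumptions for illustration.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers; far-away points are more likely to be chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]                       # step 1
    while len(centers) < k:                              # step 4
        # Step 2: squared distance from each point to its nearest chosen center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        # Step 3: sample proportionally to D(x)^2 (already chosen points have weight 0).
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
centers = kmeans_pp_init(points, k=3)
```

The returned centers would then be handed to a standard k-means loop (step 5). The sketch assumes the data contains at least k distinct points; otherwise all weights can become zero.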

 

Explain the concept of duality. An optimization problem can be viewed from two perspectives: the primal problem and the dual problem. In general, the dual problem gives a lower bound on the optimal value of the primal (minimization) problem. When strong duality holds, this bound is tight, so the optimal value of the primal problem can be obtained from the dual problem. The dual problem is a convex optimization problem and can be solved well. In SVM, the primal problem is converted into its dual for solving, which is also what leads to the idea of kernel functions.
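The relationship can be written down for a generic inequality-constrained problem. This is a sketch in standard textbook notation; the symbols $f$, $g_i$, $\lambda$, $p^*$ and $d^*$ are introduced here for illustration and do not come from the original answer.

```latex
\begin{aligned}
\text{primal:}\quad & p^* = \min_x f(x) \quad \text{s.t. } g_i(x) \le 0,\ i = 1,\dots,m \\
\text{Lagrangian:}\quad & L(x,\lambda) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x), \quad \lambda_i \ge 0 \\
\text{dual:}\quad & d^* = \max_{\lambda \ge 0}\ \min_x L(x,\lambda) \\
\text{weak duality:}\quad & d^* \le p^* \qquad \text{strong duality:}\quad d^* = p^*
\end{aligned}
```

The inner minimization over $x$ is a pointwise minimum of functions that are linear in $\lambda$, so the dual objective is concave in $\lambda$ regardless of whether the primal problem is convex.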

 

107 How to do feature selection? Machine learning, ML basics. Feature selection is an important data preprocessing step, mainly for two reasons: first, to reduce the number of features and the dimensionality, which improves generalization and reduces over-fitting; second, to improve our understanding of the features and their values. Common feature selection methods:
  1. Remove features with small variance.
  2. Regularization. L1 regularization produces sparse models. L2 regularization behaves more stably, because useful features tend to get non-zero coefficients.
  3. Random forest. For classification problems, Gini impurity or information gain is usually used; for regression problems, variance or least-squares fit. In general, tedious steps such as feature engineering and parameter tuning are not required. It has two main problems: first, important features may receive very low scores (the correlated-features problem); second, the method favors features with more categories (the bias problem).
  4. Stability selection. This is a newer method that combines subsampling with a selection algorithm, which can be regression, SVM or another similar method. The main idea is to run the feature selection algorithm on different subsets of the data and of the features, repeat this many times, and then aggregate the selection results. For example, we can count how often a feature is judged important (the number of times it is selected as important divided by the number of times the subsets containing it were tested). Ideally, important features get a score close to 100%, somewhat weaker features get a non-zero score, and the least useful features get a score close to 0.

 

108 Data preprocessing. Machine learning, ML basics, easy.
  1. Missing values (fillna): i. discrete: None; ii. continuous: mean; iii. if too many values are missing, drop the column.
  2. Discretize continuous values where needed: some models (such as decision trees) require discrete values.
  3. Binarize quantitative features. The core is to set a threshold: values greater than the threshold become 1, values less than or equal to it become 0. This is common, for example, in image processing.
  4. Compute Pearson correlation coefficients and remove highly correlated columns.

 

A brief description of feature engineering. Machine learning ML fundamentals



Source:www.julyedu.com/video/play/…

 

What do you know about data processing and feature engineering processing? Machine learning ML applications



For more information, please refer to Lesson 7, Feature Engineering, of the Machine Learning Engineer # 8 course.

 

111 Please compare the three activation functions Sigmoid, Tanh and ReLu. Deep learning DL basics

The sigmoid function, also called the logistic function, is used in logistic regression. The purpose of logistic regression is to learn a 0/1 classification model from the features. The model takes a linear combination of the features as its independent variable, whose range is from negative infinity to positive infinity. The logistic function is therefore used to map this variable to (0, 1), and the mapped value is interpreted as the probability that y = 1.

Hypothesis function:

$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$

where x is the n-dimensional feature vector and g is the logistic function,

$g(z) = \dfrac{1}{1 + e^{-z}}$

whose graph is:

 

 

 

 

 

So you can see that the whole real line is mapped into (0, 1).

The hypothesis function is then the probability that y = 1 given the features.

Thus, to decide which class a new sample belongs to, we only need to evaluate the hypothesis function: a value greater than 0.5 means y = 1, otherwise y = 0.



See more:Mp.weixin.qq.com/s/7DgiXCNBS…

Therefore, because the sigmoid function maps the output into the range 0 to 1, the output can be regarded as a probability, which is why it is the activation function of the logistic regression model.

However, the sigmoid function has the following disadvantages:

The forward calculation includes exponential calculation, and the derivative of backpropagation also includes exponential calculation and division operation, so the computational complexity is very high.

The mean of the output is non-zero. This makes the network prone to gradient disappearance or gradient explosion. This is also an issue that Batch Normalization addresses.

If the sigmoid function is f(x), then f'(x) = f(x)(1 - f(x)). Since the output of f(x) lies between 0 and 1, f'(x) is always greater than 0. As a result, the signs of all the gradients are determined by the gradient coming from the loss function, so they are all positive or all negative together. This easily makes training unstable, because the parameters can only rise together or fall together.

Similarly, since f'(x) = f(x)(1 - f(x)) and f(x) lies between 0 and 1, f'(x) also lies between 0 and 1 (its maximum is 0.25, at x = 0). In a deep network, the derivative at the bottom layers is a product of many such numbers between 0 and 1, which leads to the vanishing gradient problem.
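A few lines of Python make the size of this effect concrete. This is a small numerical sketch; the variable names are chosen for the example, and the 10-layer chain is a best case that ignores the weight matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # f'(x) = f(x)(1 - f(x))

max_grad = sigmoid_grad(0.0)      # the derivative peaks at x = 0
chain_10 = max_grad ** 10         # best-case product across 10 sigmoid layers
saturated = sigmoid_grad(6.0)     # deep in the saturation region
```

Even in the best case, ten stacked sigmoid derivatives shrink the gradient by a factor of about a million, and saturated units make it much worse.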

Tanh is similar to sigmoid, but its output lies between -1 and 1 with a mean of 0, which is its improvement over sigmoid. However, because the output lies between -1 and 1, it cannot be interpreted as a probability.



ReLU has the following advantages over sigmoid and tanh:

There are no exponent or division operations.

It does not saturate, because the derivative equals 1 whenever x is greater than 0.

It converges quickly. In practice, its convergence has been reported to be about 6 times faster than sigmoid.

ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters and alleviates over-fitting.

But ReLU also has a drawback:

If a particularly large gradient passes through a unit and the resulting update makes its input less than 0 from then on, the unit will never be updated again, because the derivative is 0 whenever the input is less than 0. This creates a lot of dead units.

 

112 What are the shortcomings or deficiencies of the three activation functions Sigmoid, Tanh and ReLu? Are there any improved activation functions? Deep learning DL basics

@Zhang Yushi: The shortcomings of Sigmoid, Tanh and ReLU were explained in the previous question. To fix the dead-unit problem of ReLU, Leaky ReLU was invented: when the input is less than 0 it does not output 0 but multiplies the input by a small coefficient, which keeps the derivative non-zero. ELU serves the same purpose, as shown in the following diagram.



Another activation function is Maxout, which uses two sets of parameters (w, b) and outputs the larger of the two results. Maxout can essentially be seen as a generalization of ReLU, because if one set of w and b is all zeros, it reduces to an ordinary ReLU. Maxout overcomes the drawbacks of ReLU, but doubles the number of parameters.
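The activation functions discussed here are short enough to write out for a scalar input. This is a minimal sketch; the default coefficients (0.01 for Leaky ReLU, 1.0 for ELU) are common choices assumed for the example, and `maxout` is shown with two scalar linear pieces only.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Small slope alpha for negative inputs keeps the derivative non-zero."""
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    """Smooth exponential curve for negative inputs, bounded below by -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def maxout(x, w1, b1, w2, b2):
    """Larger of two linear pieces; twice the parameters of a single unit."""
    return max(w1 * x + b1, w2 * x + b2)

# With one piece fixed at zero, maxout behaves like ReLU applied to w1 * x + b1.
same_as_relu = [maxout(x, 1.0, 0.0, 0.0, 0.0) == relu(x) for x in (-3.0, 0.0, 2.0)]
```

For negative inputs, `leaky_relu` and `elu` return small negative values instead of 0, so a gradient still flows and the unit can recover.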

I love Big Bubbles, source:Blog.csdn.net/woaidapaopa…

 

Why is XGBoost able to handle missing values, while some models (such as SVM) are sensitive to them? Machine learning, ML models. See: www.zhihu.com/question/58…

Why introduce nonlinear excitation functions? Deep learning DL basics

@Zhang Yushi: First, each layer of a neural network computes f(Wx + b) = f(W'x). A linear activation function is effectively f(x) = x, so with linear activations each layer just multiplies x by a matrix. By the rules of matrix multiplication, the product of several matrices is itself a single matrix. Therefore, with linear activation functions a multi-layer network is equivalent to a one-layer network. For example, for a two-layer network, f(W1*f(W2x)) = W1W2x = Wx.
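The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked matrices (assumed values, chosen only for the example) and plain Python lists.

```python
def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in a]

W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[3.0, 0.0], [1.0, 1.0]]
x = [1.0, -2.0]

two_layers = matvec(W1, matvec(W2, x))   # f(W1 f(W2 x)) with f(x) = x
one_layer = matvec(matmul(W1, W2), x)    # W x with W = W1 W2
```

Both computations give the same vector, so the second layer adds no expressive power unless a nonlinearity sits between the two.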

Second, nonlinear transformation is one of the reasons deep learning works. Nonlinearity amounts to transforming the space; after the transformation the problem space is effectively simplified, and a problem that was not linearly separable becomes separable.

The picture below illustrates this: the data on the left cannot be separated by a single line, but after a series of transformations it becomes linearly separable.

Begin Again, source:www.zhihu.com/question/29…

If you don’t use an activation function (which is equivalent to f(x) = x), the output of each layer is a linear function of the input from the previous layer. It is then easy to verify that no matter how many layers the network has, the output is a linear combination of the input, which is equivalent to having no hidden layer at all. This is the most primitive perceptron.

For these reasons, nonlinear functions are introduced as activation functions, so that a deep neural network becomes meaningful: it is no longer a linear combination of its inputs and can approximate arbitrary functions. The earliest choices were the sigmoid and tanh functions, whose outputs are bounded and can easily serve as the input to the next layer (some people also give them a biological interpretation).

 

115 May I ask why ReLu is better than TANH and Sigmoid function in artificial neural network? Deep learning DL basics

Sigmoid, tanh and ReLU:



Begin Again, source:www.zhihu.com/question/29…

 

 

First, computing activation functions such as sigmoid involves exponentiation, which is expensive. When back-propagation computes the error gradient, the derivative also involves division and exponentiation, so the cost is relatively high. Using the ReLU activation function saves a lot of computation throughout.

 

Second, in deep networks the gradient of the sigmoid function vanishes easily during back-propagation: near the saturation region the function changes too slowly and the derivative tends to 0, which causes information loss. This phenomenon is called saturation, and it prevents the training of deep networks from completing. ReLU, on the other hand, does not tend to saturate and does not suffer from such very small gradients. Third, ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters and alleviates over-fitting (some people also offer biological interpretations). There are also improved versions of ReLU, such as PReLU and randomized ReLU, which bring some gains in training speed or accuracy on different data sets; see the relevant papers for details.

 

 

As an added note, the mainstream approach now is to add a batch normalization step, which tries to keep the input distribution of every layer the same [1]. The more recent paper [2] found that, after adding bypass connections, changing the position of batch normalization gives better results. Take a look if you are interested.

[1] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[2] He, Kaiming, et al. “Identity Mappings in Deep Residual Networks.” arXiv preprint arXiv:1603.05027 (2016).

 

116 Why are there both sigmoid and tanh activation functions in the LSTM model? Deep learning, DL models, hard.

Why mix the two instead of using only sigmoid or only tanh? What is the purpose of this?



Source: www.zhihu.com/question/46…

@Beanfrog: They serve different purposes.

Sigmoid is used on the gates to produce values between 0 and 1, which is the most direct way to express how much of a signal is let through.

Tanh is used for the state and the output, that is, for transforming the data itself; other activation functions may work here too.

HHHH: see also Section 4.1 of A Critical Review of Recurrent Neural Networks for Sequence Learning, which states that both tanh functions can be replaced with something else.
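
To make the division of labour concrete, here is a minimal sketch of one step of a scalar LSTM cell. The parameter names in `params` and the value 0.5 are hypothetical and chosen only for illustration; a real LSTM uses weight matrices rather than scalars.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p holds illustrative (hypothetical) weights."""
    # Gates use sigmoid: values in (0, 1) act as soft switches.
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])  # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])  # output gate
    # The candidate state and the output use tanh: values in (-1, 1).
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])
    c = f * c_prev + i * g      # gates scale how much is kept and written
    h = o * math.tanh(c)        # the output gate scales what is exposed
    return h, c, (f, i, o, g)


names = ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")
params = {name: 0.5 for name in names}
h, c, (f, i, o, g) = lstm_step(1.0, 0.0, 0.0, params)
```

The three gates only ever scale other quantities, which is why a (0, 1) range suits them, while the candidate `g` carries signed content.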

 

117 How do you measure how good a classifier is? Machine learning ML foundation, medium difficulty. @I love big bubbles, source: blog.csdn.net/woaidapaopa… First we need the four counts TP, FN (a true sample judged false), FP (a false sample judged true) and TN, which can be drawn as a table. Several commonly used indicators:

  • Precision = TP/(TP+FP) = TP/~P (~P is the number of samples predicted true)
  • Recall = TP/(TP+FN) = TP/P (P is the number of samples that are actually true)
  • F1 value: 2/F1 = 1/recall + 1/precision
  • ROC curve: ROC space is a plane with the false positive rate (FPR) as the X-axis and the true positive rate (TPR) as the Y-axis, where TPR = TP/P = recall and FPR = FP/N (N is the number of samples that are actually false). For more detail see: siyaozhang.github.io/2017/04/04/…
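
The indicators above can be computed directly from the four counts. The following is a small sketch on made-up labels; it assumes binary labels coded as 0 and 1 and that none of the denominators is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def classifier_metrics(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # also the true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                     # false positive rate
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Illustrative data: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classifier_metrics(y_true, y_pred)
```

Each (FPR, TPR) pair obtained at one decision threshold is a single point in ROC space; sweeping the threshold traces the curve.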

 

118 What is the physical significance of AUC in machine learning and statistics? Machine learning ML foundation, medium difficulty. See: www.zhihu.com/question/39…

 

119 Looking at the gain formula, the larger alpha and gamma are, the smaller the gain. Right? Machine learning ML fundamentals

AntZ: XGBoost chooses split points by maximizing the gain. Because the traditional greedy method of enumerating every possible split point of every feature is inefficient, XGBoost also implements an approximate algorithm: it proposes candidate split points from the percentiles of each feature, computes the gain for each candidate, and picks the one with the maximum gain. The gain formula has four terms and is controlled by the regularization parameters (lambda is the coefficient on the sum of squared leaf weights, gamma is the coefficient on the number of leaves):







The first term is the score of the left child after the split, the second term is the score of the right child, the third term is the score of the unsplit node, and the last term is the complexity cost of introducing one more leaf.

According to the formula, the larger gamma is, the smaller the gain. A larger lambda, however, may make the gain either smaller or larger.
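
For reference, the gain in the XGBoost paper has the form $\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, where $G$ and $H$ are the sums of first- and second-order gradients of the loss over the samples in each child. The sketch below evaluates it on made-up gradient sums to check both claims; the function names are illustrative and not part of the XGBoost API.

```python
def leaf_score(g, h, lam):
    """Score of a node with gradient sum g and Hessian sum h."""
    return g * g / (h + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    left = leaf_score(g_left, h_left, lam)
    right = leaf_score(g_right, h_right, lam)
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    return 0.5 * (left + right - parent) - gamma


# gamma is subtracted directly, so a larger gamma always lowers the gain.
gain_gamma0 = split_gain(-4, 3, 5, 4, lam=1, gamma=0)
gain_gamma1 = split_gain(-4, 3, 5, 4, lam=1, gamma=1)

# lambda appears in every denominator, so its effect is not monotonic.
falls_with_lambda = (split_gain(-4, 3, 5, 4, lam=1, gamma=0),
                     split_gain(-4, 3, 5, 4, lam=10, gamma=0))
rises_with_lambda = (split_gain(1, 1, 1, 1, lam=1, gamma=0),
                     split_gain(1, 1, 1, 1, lam=10, gamma=0))
```

In the first pair the gain drops when lambda goes from 1 to 10; in the second pair (two identical children) it rises, because the parent term shrinks more slowly than the child terms.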

The original question asks about alpha rather than lambda. Alpha is not mentioned in the paper but is present in the XGBoost implementation. The above is my answer based on the paper; the following comes from a search:

Zhidao.baidu.com/question/21…

lambda [default 1]: L2 regularization term on the weights (similar to Ridge regression). This parameter controls the regularization part of XGBoost. Although many data scientists rarely tune it, it is still useful for reducing overfitting. alpha [default 0]: L1 regularization term on the weights (similar to Lasso regression). It can be used with very high-dimensional data to make the algorithm run faster.

gamma [default 0]: a node is split only if the split reduces the value of the loss function. Gamma specifies the minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm.

 

120 What causes the vanishing gradient problem? Can you derive it? Deep learning DL foundation, medium difficulty. @xu han, source: www.zhihu.com/question/41…

  • Yes you should understand backprop, Andrej Karpathy
  • How does the ReLu solve the vanishing gradient problem?
  • When training a neural network, we change the weights of the neurons so that the network output gets as close to the label as possible, reducing the error. The BP algorithm is commonly used for training. Its core idea is to compute the loss between the output and the label, then compute the gradient of the loss with respect to each weight and use it to update the weights iteratively.
  • A vanishing gradient makes the weights update slowly and makes the model harder to train. One cause is that many activation functions squeeze their output into a very small interval, and their gradient is close to 0 over most of the domain at both ends, which causes learning to stop.

    In short, the derivative of the sigmoid function f(x) is f(x)*(1-f(x)). Since f(x) lies between 0 and 1, as the depth increases, the derivative passed down from the top is multiplied at every layer by two numbers less than 1 (whose product is at most 1/4) and soon becomes extremely small.
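
A quick numerical check of this bound, as a minimal sketch: the sigmoid derivative peaks at 0.25 when x = 0, so even in the best case the activation factors alone shrink the gradient by 4x per layer. The helper `best_case_factor` is an illustrative name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)  # the largest value the derivative can take


def best_case_factor(depth):
    """Upper bound on the product of `depth` sigmoid derivatives."""
    return peak ** depth


samples = [sigmoid_grad(x) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)]
ten_layers = best_case_factor(10)
```

After 10 layers the bound is already below one millionth, before the weights are even taken into account.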

 

121 What are vanishing gradients and exploding gradients? Deep learning DL foundation, medium difficulty

@Hanxiaoyang: The chain rule in backpropagation produces a long product. If the factors are very small and tend to 0, the result is extremely small (the gradient vanishes); if the factors are large, the result may be very large (the gradient explodes).

@bike, source of the following passage: zhuanlan.zhihu.com/p/25631496 A neural network with more layers also runs into problems during training, including the vanishing gradient problem and the exploding gradient problem. Both generally become more obvious as the number of layers increases.

For example, for the neural network with three hidden layers shown in the following figure, when the gradient vanishes, the weights of hidden layer 3, which is close to the output layer, update fairly normally, but the weights of hidden layer 1 at the front update very slowly, so that they barely change and stay close to their initial values. Hidden layer 1 then acts as a fixed mapping layer that applies the same mapping to every input, and training the deep network becomes equivalent to training a shallow network made up of only the later layers.

 

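Why does this problem arise? Take the backpropagation in the following figure as an example, assuming each layer has only one neuron and that for each layer $y_i = \sigma(z_i) = \sigma(w_i x_i + b_i)$, where $\sigma$ is the sigmoid function.

It follows that

$$\frac{\partial C}{\partial b_1} = \frac{\partial C}{\partial y_4}\,\sigma'(z_4)\,w_4\,\sigma'(z_3)\,w_3\,\sigma'(z_2)\,w_2\,\sigma'(z_1)$$

The derivative of the sigmoid, $\sigma'(x)$, is shown in the following figure.

As can be seen, the maximum value of $\sigma'(x)$ is $\frac{1}{4}$, and the network weights $|w|$ we initialize are usually less than 1, so $|\sigma'(z)\,w| \le \frac{1}{4}$. In the chain-rule product above, the more layers there are, the smaller the derivative $\frac{\partial C}{\partial b_1}$ becomes, which causes the gradient to vanish.

The cause of the exploding gradient problem is then also clear: it occurs when $|\sigma'(z)\,w| > 1$, that is, when $w$ is large. This is less common with the sigmoid activation function, because the size of $\sigma'(z)$ also depends on $w$ (through $z = wx + b$), unless the input value $x$ of that layer stays within a relatively small range.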

In fact, exploding and vanishing gradients are both caused by the network being too deep and the weight updates being unstable; in essence, both come from the multiplicative effect in gradient backpropagation. For the more common vanishing gradient problem, consider replacing the sigmoid activation function with ReLU. In addition, the structure of the LSTM also mitigates vanishing gradients in RNNs.
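
The multiplicative effect can be seen in a chain of one-neuron layers. The sketch below, under the simplifying assumptions of zero biases and the same weight 0.9 in every layer, computes the derivative of the final output with respect to the first bias for sigmoid and for ReLU.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def first_layer_gradient(weights, x, activation):
    """d(output)/d(b_1) for a chain of one-neuron layers with zero biases."""
    grad = 1.0
    for k, w in enumerate(weights):
        z = w * x
        if activation == "sigmoid":
            x = sigmoid(z)
            local = x * (1.0 - x)          # sigma'(z)
        else:                              # "relu"
            x = max(z, 0.0)
            local = 1.0 if z > 0 else 0.0  # relu'(z)
        # The first layer contributes act'(z_1); every later layer
        # contributes w_k * act'(z_k), as in the chain-rule product.
        grad *= local if k == 0 else w * local
    return grad


weights = [0.9] * 10
sigmoid_grad_first = first_layer_gradient(weights, 1.0, "sigmoid")
relu_grad_first = first_layer_gradient(weights, 1.0, "relu")
```

With ReLU the only shrinkage comes from the weights (0.9 to the ninth power), whereas with sigmoid every layer also multiplies in a factor of at most 1/4.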

 

122 How do you deal with vanishing gradients and exploding gradients? Deep learning DL basics

(1) Vanishing gradients: by the chain rule, if the partial derivative of each layer's neurons with respect to the output of the previous layer, multiplied by the weight, is less than 1, then even if that value is 0.99, the partial derivative of the error with respect to the input layer tends to 0 after propagating through enough layers. The ReLU activation function effectively mitigates vanishing gradients, and Batch Normalization can also address the issue. On why Batch Normalization works well in deep learning, see: www.zhihu.com/question/38…

(2) Exploding gradients: by the chain rule, if the partial derivative of each layer's neurons with respect to the output of the previous layer, multiplied by the weight, is greater than 1, the partial derivative of the error with respect to the input layer tends to infinity after propagating through enough layers. This can be addressed through the choice of activation function or with Batch Normalization.

123 Backpropagation. Deep learning DL foundation, hard. @I love big bubbles, source: blog.csdn.net/woaidapaopa…

First, we need to understand the basic principle of back propagation, which is the chain rule of derivatives.



Reflected in the neural network:



The following formula is used to derive the loss function.

Backpropagation is the method used to compute the derivative of the loss function L with respect to the parameters W. It takes the derivative of the parameters layer by layer through the chain rule. Note that the parameters should be initialized randomly rather than all set to zero; otherwise every hidden unit in a layer computes the same function of the input, which is known as failing to break symmetry.

The general process is as follows:

  • First, compute the activations and outputs of all nodes by forward propagation.

  • Compute the overall loss function:

  • Then compute the residual for each node of layer l (the term residual comes from UFLDL; it is essentially the derivative of the overall loss function with respect to the weighted input z of each layer). To get the derivative with respect to W, multiply the residual by the derivative of z with respect to W.
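
The steps above can be sketched on the smallest possible network: one sigmoid hidden unit, a linear output, and a squared-error loss. The network shape, the loss and the parameter values are assumptions made for illustration; the analytic gradients are checked against central finite differences.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1
    a1 = sigmoid(z1)          # hidden activation
    y = w2 * a1 + b2          # linear output
    return z1, a1, y


def loss(params, x, t):
    return 0.5 * (forward(params, x)[2] - t) ** 2


def backprop(params, x, t):
    w1, b1, w2, b2 = params
    z1, a1, y = forward(params, x)
    delta2 = y - t                          # residual of the output node
    delta1 = delta2 * w2 * a1 * (1.0 - a1)  # residual of the hidden node
    # Multiply each residual by d(z)/d(parameter) to get the gradients.
    return [delta1 * x, delta1, delta2 * a1, delta2]


def numerical_grad(params, x, t, eps=1e-6):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, t) - loss(down, x, t)) / (2 * eps))
    return grads


params = [0.3, -0.1, 0.8, 0.2]   # w1, b1, w2, b2 (arbitrary values)
analytic = backprop(params, 1.5, 1.0)
numeric = numerical_grad(params, 1.5, 1.0)
```

Comparing `analytic` with `numeric` is a common sanity check when implementing backpropagation by hand.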

     

 

124 SVD and PCA. Machine learning ML model

The idea of PCA is to maximize the variance of the data after projection, so it suffices to find the projection vector that maximizes that variance. After subtracting the mean, SVD can be used to find this vector: choose the direction with the largest singular value (equivalently, the largest eigenvalue of the covariance matrix). In essence, PCA performs likelihood estimation for a distribution parameterized by a matrix, and SVD is an effective method of matrix approximation. See: www.zhihu.com/question/40…
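
As a small illustration, the sketch below finds the first principal direction of made-up 2D points. Note the swap: instead of a full SVD it runs power iteration on the covariance matrix, which converges to the same direction as the top right singular vector of the centred data matrix.

```python
import math


def center(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(p[0] - mx, p[1] - my) for p in points]


def covariance(centered):
    n = len(centered)
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    return sxx, sxy, syy


def first_component(points, iters=100):
    """Top eigenvector of the 2x2 covariance matrix via power iteration."""
    sxx, sxy, syy = covariance(center(points))
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    # Variance of the data after projection onto v.
    variance = sxx * v[0] ** 2 + 2 * sxy * v[0] * v[1] + syy * v[1] ** 2
    return v, variance


# Points scattered around the line y = x.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.1)]
direction, projected_variance = first_component(points)
axis_variances = covariance(center(points))
```

The projected variance along `direction` exceeds the variance along either coordinate axis, which is exactly the maximum-variance property PCA looks for.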

 

125 Data imbalance problem. Machine learning ML foundation is easy

This is mainly caused by the uneven distribution of classes in the data. Solutions include:

  • Sampling: oversample the minority class (optionally adding noise) and undersample the majority class
  • Data generation: use known samples to generate new samples
  • Apply class or sample weights, as in Adaboost or SVM
  • Use an algorithm that is insensitive to imbalanced data sets
  • Change the evaluation criterion: use AUC/ROC for evaluation
  • Use ensemble methods such as Bagging or Boosting
  • Take the prior distribution of the data into account when designing the model
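
Two of the remedies above, oversampling and class weights, can be sketched in a few lines. The function names and the weighting heuristic (total count divided by number of classes times class count) are illustrative choices, not the only options.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Illustrative data: 9 samples of class 0 and 1 sample of class 1.
samples = list(range(10))
labels = [0] * 9 + [1]
new_samples, new_labels = random_oversample(samples, labels)
weights = balanced_class_weights(labels)
```

Oversampling changes the data the model sees, while class weights leave the data alone and rescale each sample's contribution to the loss.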

 

126 Brief introduction to the development history of neural networks. In 1949, Hebb proposed the neuropsychological learning paradigm known as Hebbian learning theory. In 1952, Arthur Samuel at IBM wrote a checkers program. In 1957, the perceptron algorithm of Rosenblatt became the second machine learning model with a neuroscience background, and three years later Widrow made ML history by inventing the Delta learning rule, which was immediately applied to training perceptrons. The popularity of the perceptron was dashed by Minsky in 1969, who posed the famous XOR problem and demonstrated the weakness of perceptrons on linearly inseparable data such as XOR. Although the idea behind BP was put forward by Linnainmaa in the 1970s as the “reverse mode of automatic differentiation”, it was not applied to the multilayer perceptron (MLP) until Werbos did so in 1981, which started a new boom in NN research. The work of Hochreiter in 1991 and of Hochreiter et al. in 2001 both showed that gradients are lost once NN units saturate when the BP algorithm is used, and the field stagnated again. Finally, in the present, with the growth of computing resources and data volume, a new NN field, deep learning, has emerged. In short: MP model + sgn → single-layer perceptron (linear only) + sgn → Minsky trough → multi-layer perceptron + BP + sigmoid → (trough) → deep learning + pre-training + ReLU/sigmoid

 

127 Common methods of deep learning. Deep learning DL foundation, medium difficulty. @SmallisBig, source: blog.csdn.net/u010496169/…

Fully connected DNN (adjacent layers are fully connected and there are no connections within a layer): RBM → feature detector → stacked RBMs with greedy layer-wise training → DBN. Solving the full-connectivity problem of the fully connected DNN → CNN. Solving the inability of the fully connected DNN to model changes along a time series → RNN. Solving vanishing gradients along the time axis → LSTM.

@Zhang Yushi: DNN, CNN and RNN are now widely used in applications. DNN is the traditional fully connected network and can be used for ad click-through-rate prediction, recommendation and so on. It uses embeddings to encode many discrete features into the network, which can greatly improve results. CNN is mainly used in computer vision, where it solved the problem that DNNs have too many parameters for images. CNN has also developed a series of specific techniques and architectures such as convolution, pooling, Batch Normalization, Inception, ResNet and DeepNet, and has made great progress in classification, object detection, face recognition, image segmentation and other areas. CNN is not limited to images: it has also made great progress in natural language processing, and there are now CNN-based language models that outperform LSTM. ResNet is also one of the two basic algorithms in the latest AlphaZero. GAN is a training method for generative models and now has many applications in CV, such as image translation, image super-resolution and image inpainting. RNN is mainly used in natural language processing for sequence-to-sequence problems. Plain RNNs suffer from exploding and vanishing gradients, so the LSTM model is generally used in NLP today. Recently, attention has been introduced as a new technique in machine translation. Besides DNN, RNN and CNN, AutoEncoder, sparse coding, deep belief networks (DBN) and restricted Boltzmann machines (RBM) have also been studied.

128 The neural network model is named for its inspiration from the human brain. Deep learning DL foundation is easy

A neural network is composed of many neurons. Each neuron receives input, processes it and gives an output, as shown in the figure below. Which of the following statements about neurons is true?

  1. A Each neuron can have one input and one output
  2. B Each neuron can have multiple inputs and one output
  3. C Each neuron can have one input and multiple outputs
  4. D Each neuron can have multiple inputs and outputs
  5. E All of the above are true

Answer :(E)

Each neuron can have one or more inputs, and one or more outputs.

 

129 The figure below is a mathematical representation of a neuron. Deep learning DL foundation is easy

These components are expressed as follows:

– x1, x2, …, xN: the inputs of the neuron. These can be the actual observations from the input layer or intermediate values from one of the hidden layers

– w1, w2, …, wN: the weight of each input

– bi: the bias unit or offset, a constant term added to the input of the activation function, like an intercept

– a: the activation function of the neuron, applied to the weighted sum of the inputs plus the bias

– y: the output of the neuron

Considering the above notation, can the linear equation (y = mx + c) be considered a neuron?

A. Yes

B. No

Answer :(A)

The input has only one variable, and the activation function is linear. So it can be regarded as a linear regression function.
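
A minimal sketch of that correspondence: a neuron with a single input, weight m, bias c and an identity activation computes exactly y = mx + c. The values of `m` and `c` are arbitrary.

```python
def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def identity(z):
    return z


m, c = 2.0, -1.0


def line(x):
    return m * x + c


xs = [-2.0, 0.0, 3.5]
neuron_outputs = [neuron([x], [m], c, identity) for x in xs]
line_outputs = [line(x) for x in xs]
```

Swapping `identity` for a nonlinear function such as tanh is what turns this linear model into an ordinary neuron.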

130 Knowing the weights and bias of each neuron is the most important step in a neural network. If you knew the exact weights and biases of the neurons, you could approximate any function. But how do you find the weights and bias of each neuron? Deep learning DL foundation is easy

A Search every possible combination of weights and biases until the best values are found

B Assign initial values, check how far the result is from the best value, and keep adjusting the weights

C Assign them randomly and leave it to chance

D

Answer :(B)

Option B is a description of gradient descent.

 

131 What are the correct steps of the gradient descent algorithm? Deep learning DL foundation is easy

  1. Calculate the error between the predicted value and the true value
  2. Iterate until the optimal values of the network weights are obtained
  3. Pass the input into the network and get the output
  4. Initialize the weights and biases with random values
  5. For each neuron that contributes to the error, adjust the corresponding value to reduce the error

A. 1, 2, 3, 4, 5

B. 5, 4, 3, 2, 1

C. 3, 2, 1, 5, 4

D. 4, 3, 1, 5, 2

(D)
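
The order 4, 3, 1, 5, 2 maps directly onto a training loop. The sketch below fits a single weight to made-up data generated from y = 3x; the learning rate, epoch count and mean-squared-error loss are assumptions for illustration.

```python
import random


def train(xs, ys, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)                       # step 4: random initialization
    for _ in range(epochs):                          # step 2: iterate
        preds = [w * x for x in xs]                  # step 3: forward pass
        errors = [p - y for p, y in zip(preds, ys)]  # step 1: error
        # Gradient of the mean squared error with respect to w.
        grad = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        w -= lr * grad                               # step 5: adjust to reduce error
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]
learned_w = train(xs, ys)
```

With these settings each update shrinks the distance to the true weight by a constant factor, so the loop converges quickly.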

132 Given:

– The brain is made up of many cells called neurons, and a neural network is a simplified mathematical representation of the brain.

– Each neuron has inputs, processing functions, and outputs.

– Neurons combine to form a network that can fit any function.

– In order to get the best neural network, we use the gradient descent method to update the model continuously

Given the above description of neural networks, when is a neural network model called a deep learning model? Deep learning DL foundation is easy

Answer :(A) More layers means a deeper network. There is no strict definition of how many layers make a model deep; at present, a model with more than 2 hidden layers may also be called a deep model.

133 When a CNN is used, is it necessary to preprocess the input with operations such as rotation, translation and scaling? Answer :(A) A series of data preprocessing steps (rotation, translation, scaling) is required before the data is fed into the neural network; the network itself cannot perform these transformations.

134 Which of the following operations achieves an effect similar to Dropout in neural networks? Deep learning DL foundation

A Boosting

B Bagging

C Stacking

D Mapping

Answer :(B)

Dropout can be regarded as an extreme form of Bagging: each model is trained on its own data, and at the same time the model parameters are strongly regularized by sharing the corresponding parameters with the other models.

135 Which of the following introduces nonlinearity in a neural network? Deep learning DL foundation is easy

  1. A Stochastic gradient descent
  2. B Rectified linear unit (ReLU)
  3. C Convolution function
  4. D

Answer :(B)

The rectified linear unit is a nonlinear activation function.

 

136 When training the neural network, the loss function (loss) does not decrease during the first few epochs. What are the possible reasons? (D) Deep learning DL foundation easy

A The learning rate is too low

B The regularization parameter is too high

C It is stuck in a local minimum

D All of the above

137 Which of the following statements about model capacity is true? Deep learning DL foundation is easy

  1. A As the number of hidden layers increases, the model capacity increases
  2. B As the Dropout ratio increases, the model capacity increases
  3. C As the learning rate increases, the model capacity increases
  4. D

Answer :(A)

138 If the number of hidden layers of Multilayer Perceptron is increased, the classification error will be reduced. Is this statement true or false? Deep learning DL foundation is easy

  1. A True
  2. B False

Answer :(B)

Not always. Increasing the number of layers can lead to overfitting, which can increase the error.

139 Build a neural network that takes the output of the previous layer and its own output as input. Deep learning DL model easy

Which of the following architectures has feedback connections?

  1. A Recurrent neural network
  2. B Convolutional neural network
  3. C Restricted Boltzmann machine
  4. D None of the above

Answer :(A)

140 What is the correct sequence of steps in training a perceptron? Deep learning DL foundation is easy

1 Randomly initialize the weights of the perceptron

2 Go to the next batch of the data set

3 If the predicted value and the expected output are inconsistent, adjust the weights

4 For an input sample, compute the output value

A. 1, 2, 3, 4

B. 4, 3, 2, 1

C. 3, 1, 2, 4

D. 1, 4, 3, 2

Answer :(D)
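
The perceptron procedure (initialize, compute the output for a sample, adjust on a mismatch, move on to the next data) can be sketched as follows. The AND data set, the learning rate and the step threshold at 0 are illustrative assumptions.

```python
import random


def predict(weights, bias, x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0


def train_perceptron(data, lr=0.1, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    # Randomly initialize the weights.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in data:                      # move on to the next sample
            output = predict(weights, bias, x)      # compute the output
            if output != target:                    # adjust only on a mismatch
                error = target - output
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:                           # every sample is classified correctly
            break
    return weights, bias


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_weights, and_bias = train_perceptron(AND_DATA)
and_predictions = [predict(and_weights, and_bias, x) for x, _ in AND_DATA]
```

AND is linearly separable, so the loop stops once an epoch passes without a mistake; on XOR the same loop would simply run until `max_epochs`.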

141 Suppose you needed to adjust parameters to minimize cost function, which of the following techniques would be used? Deep learning DL foundation is easy

A. Exhaustive search

B. Random search

C. Bayesian optimization

D. Gradient descent

(D)

142 In which of the following situations does gradient descent not necessarily work correctly (it may get stuck)? Deep learning DL foundation is easy

 

D. None of the above is true

Answer :(B)

This is a classic example of gradient descent getting stuck at a saddle point. The question comes from: www.analyticsvidhya.com/blog/2017/0…

143 The following figure shows the relationship between the accuracy of trained 3-layer convolutional neural network and the number of parameters (the number of feature kernels). Deep learning DL foundation is easy

As can be seen from the trend in the figure, if the width of the neural network is increased, the accuracy will increase to a certain threshold and then begin to decrease. What are the possible reasons for this phenomenon?

  1. A Even if the number of convolution kernels is increased, only a small number of kernels are used for prediction
  2. B When the number of convolution kernels increases, the predictive power of the neural network decreases
  3. C When the number of convolution kernels increases, it leads to overfitting
  4. D

(C)

 

When the network scale is too large, noise in data may be learned, leading to overfitting

144 Suppose we have a hidden layer as shown below. The hidden layer performs a certain amount of dimensionality reduction in this network. Suppose we now replace this hidden layer with another dimensionality reduction method, such as principal component analysis (PCA). Deep learning DL foundation is easy

So, is the output the same?

A. Yes

B. No

Answer :(B)

PCA extracts the directions of largest variance in the data, whereas the hidden layer extracts features with predictive power

 

145 Which of the following functions cannot be used as an activation function? Deep learning DL foundation is easy

 

A. y = tanh(x)

B. y = sin(x)

C. y = max(x,0)

D. y = 2x

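Answer :(D)

A linear function cannot be used as an activation function: a stack of layers with linear activations collapses into a single linear map, so the network gains no expressive power.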
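
A short sketch of why: with y = 2x as the activation, a two-layer network is still a straight line in its input. The weights below are arbitrary illustrative values.

```python
def linear_act(z):
    return 2.0 * z


def two_layer(x, w1, b1, w2, b2, act):
    hidden = act(w1 * x + b1)
    return w2 * hidden + b2


w1, b1, w2, b2 = 0.5, 1.0, -3.0, 0.25

# Expanding w2 * 2 * (w1 * x + b1) + b2 gives one slope and one intercept.
slope = w2 * 2.0 * w1
intercept = w2 * 2.0 * b1 + b2

xs = [-1.0, 0.0, 2.5]
network_outputs = [two_layer(x, w1, b1, w2, b2, linear_act) for x in xs]
line_outputs = [slope * x + intercept for x in xs]
```

However many such layers are stacked, the result can always be rewritten as a single slope and intercept.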

146 Which of the following neural network structures will share weights? Deep learning DL model easy

A. Convolutional neural network

B. Recurrent neural network

C. Fully connected neural network

D. A and B

Answer :(D)

147 What are the benefits of Batch Normalization? Deep learning DL basics

A. It normalizes (rescales) all inputs before passing them to the next layer

B. It takes the normalized mean and standard deviation of the weights

C. It is a very efficient backpropagation (BP) method

D. None of these

Answer :(A)

148 Which of the following methods can be used to handle over-fitting in a neural network? Deep learning DL foundation is easy

A Dropout

B Batch Normalization

C Regularization

D All of the above

Answer :(D)

As for how batch normalization counters over-fitting: the normalized value of the same sample differs from batch to batch, which is equivalent to a form of data augmentation.

What happens if we use an excessive learning rate? Deep learning DL foundation is easy

A The neural network will converge

B Hard to say

C None of the above

D The neural network will not converge

Answer :(D)
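
The effect of the learning rate is easy to see on a one-dimensional toy problem. This sketch runs gradient descent on f(x) = x², an assumed example function, with a moderate and an excessive step size.

```python
def minimise(lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # each step multiplies x by (1 - 2*lr)
    return x


converged = minimise(0.1)   # factor 0.8 per step: shrinks towards 0
diverged = minimise(1.1)    # factor -1.2 per step: overshoots and grows
```

Once the per-step factor has magnitude above 1, every update overshoots the minimum by more than the previous error, so the iterate moves away instead of converging.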

The network shown below is trained to recognize the characters H and T, as follows (deep learning DL foundation easy):

What is the output of the network?



D. It may be A or B, depending on the weight settings of the neural network

(D)

Without knowing what the weights and biases of the neural network are, it is impossible to determine what output it will give.

150 Suppose we have trained a convolutional neural network on the ImageNet dataset (object recognition) and then feed it an all-white picture. The output probabilities for this input will be the same for every object class, right?

A) True

B) Do not know

C) It depends

D) False

Answer :(D)

The convolutional neural network has already been trained, and every neuron has been finely tuned. For an all-white input image, the activations passed layer by layer to the final fully connected layer are almost never all identical, so they will not be identical after the softmax either. The event that “every class is equally likely”, that is, that every term of the softmax is equal, therefore has an extremely low probability.

151 When a pooling layer is added to a convolutional neural network, invariance to transformations is preserved, right? Deep learning DL model, medium difficulty

A Do not know

B It depends

C Yes

D No

Answer :(C)

Pooling operations such as taking the maximum or the average give the same result when the input is shifted or rotated slightly within a pooling window, so this kind of invariance also holds after stacking multiple layers.
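
A minimal one-dimensional sketch of this invariance, using non-overlapping max pooling with a window of 2 (both assumptions for illustration): moving a peak inside its window leaves the pooled output unchanged, while moving it across a window boundary does not.

```python
def max_pool_1d(values, size):
    """Non-overlapping max pooling over a list."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


signal = [0, 9, 0, 0, 0, 0, 3, 0]
jittered = [9, 0, 0, 0, 0, 0, 0, 3]   # each peak moved within its own window
crossed = [0, 0, 9, 0, 0, 0, 3, 0]    # first peak moved into the next window

pooled_signal = max_pool_1d(signal, 2)
pooled_jittered = max_pool_1d(jittered, 2)
pooled_crossed = max_pool_1d(crossed, 2)
```

The invariance is therefore local: it covers small displacements, and stacking pooling layers widens the range of displacements that are absorbed.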

152 Which gradient descent method is more effective when the data is too large to be processed in RAM all at once? Deep learning DL foundation is easy

A Stochastic Gradient Descent

B Do not know

C Full Batch Gradient Descent

D Neither

Answer :(A)

Gradient descent comes in three variants: stochastic gradient descent (one sample at a time), mini-batch gradient descent (a small batch of samples is used to compute the loss, so the backpropagated gradient is a compromise), and full-batch gradient descent (all samples at once). In that order, each method gives a more accurate gradient direction than the previous one with respect to the loss surface over all samples. In engineering practice, however, memory and disk I/O throughput are limited, so one has to balance gradient accuracy against data transfer performance to minimize the actual computation time of gradient descent. Therefore, when the data is too large to be processed in RAM all at once and RAM can only hold one sample at a time, stochastic gradient descent is the only choice.

 

153 The figure below is a gradient descent plot for training a neural network with four hidden layers that uses the sigmoid function as its activation function. The network suffers from vanishing gradients. Which of the following statements is true? (A) Deep learning DL foundation, medium difficulty

A The first hidden layer corresponds to D, the second hidden layer to C, the third hidden layer to B, and the fourth hidden layer to A

B The first hidden layer corresponds to A, the second hidden layer to C, the third hidden layer to B, and the fourth hidden layer to D

C The first hidden layer corresponds to A, the second hidden layer to B, the third hidden layer to C, and the fourth hidden layer to D

D The first hidden layer corresponds to B, the second hidden layer to D, the third hidden layer to C, and the fourth hidden layer to A

As the back propagation algorithm enters the initial layer, the learning ability is reduced, which is the gradient disappearance. In other words, gradient disappearance means that the gradient gradually decreases to 0 in the forward propagation. According to the title of the figure, the four curves are the learning curves of the four hidden layers, so the gradient of the first layer is the highest (the loss function curve drops significantly), and the gradient of the last layer is almost zero (the loss function curve becomes flat and straight). So D is the first layer and A is the last layer.

154 For a classification task, suppose the weights of the neural network are not randomly initialized at the start but are all set to 0. Which of the following statements is true? Deep learning DL foundation is easy

A None of the other options is correct

B No problem, the neural network will start training normally

C The neural network can be trained, but all neurons end up learning the same thing

D The neural network will not start training, because there is no gradient change

Answer :(C)

Initializing all the weights to zero sounds like a reasonable idea, maybe even the best guess we have, but it turns out to be wrong: if every neuron computes the same output, then backpropagation computes the same gradient for each of them, and the parameter updates are identical (w = w − α∗dw). More generally, if the weights are all initialized to the same value, the network is symmetric and all the neurons end up learning the same thing.
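
The symmetry argument can be checked on a tiny network with two sigmoid hidden units and a linear output. The network shape, the single training sample, the learning rate and the non-zero starting weights are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_two_hidden(w1, w2, x=1.0, t=0.5, lr=0.1, steps=100):
    """Train input->hidden weights w1 and hidden->output weights w2 (no biases)."""
    w1, w2 = list(w1), list(w2)
    for _ in range(steps):
        a = [sigmoid(w * x) for w in w1]              # hidden activations
        y = sum(v * h for v, h in zip(w2, a))         # linear output
        delta = y - t                                 # d(loss)/dy for squared error
        grad_w2 = [delta * h for h in a]
        grad_w1 = [delta * v * h * (1.0 - h) * x for v, h in zip(w2, a)]
        w1 = [w - lr * g for w, g in zip(w1, grad_w1)]
        w2 = [v - lr * g for v, g in zip(w2, grad_w2)]
    return w1, w2


zero_w1, zero_w2 = train_two_hidden([0.0, 0.0], [0.0, 0.0])
rand_w1, rand_w2 = train_two_hidden([0.3, -0.2], [0.1, 0.4])
```

With identical starting values the two hidden units receive identical gradients at every step and never diverge from each other, whereas distinct starting values let them keep distinct weights.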

 

155 As shown in the figure below, at the start of training the error stayed high because the neural network was stuck in a local minimum before it moved towards the global minimum. To avoid this, which of the following strategies can be adopted? Deep learning DL foundation is easy

A Change the learning rate, for example keep varying it during the first few training epochs

B Start by reducing the learning rate by a factor of 10, and then use momentum

C Increase the number of parameters so that the neural network does not get stuck at a local optimum

D

Answer :(A)

Option A can pull the neural network out of the local minimum it is trapped in.

 

156 Which of the following neural networks is better suited to an image recognition problem (finding a cat in a photo)? Deep learning DL foundation is easy

A Recurrent neural network

B Perceptron

C Multilayer perceptron

D Convolutional neural network

Answer :(D)

Convolutional neural networks are better suited to image-related problems because they take into account the inherent relationship between nearby positions in an image.

 

157 Suppose that during training we suddenly run into a problem: after several epochs, the error suddenly drops

You suspect that something is wrong with the data, so you plot it and realize that the data values may be too large, which could be causing the problem.

What are you going to do to deal with the problem? Deep learning DL foundation is easy

A Normalize the data

B Take the logarithm of the data

C None of the above

D Apply principal component analysis (PCA) and normalize the data

Answer :(D)

First decorrelate the data, then zero-center it. Specifically, the error typically drops abruptly either because several strongly correlated samples are suddenly fitted together, or because samples with large variance are suddenly fitted. Applying principal component analysis (PCA) and normalizing the data can therefore alleviate the problem.

 

158 Which of the following decision boundaries is generated by a neural network? (E) Deep learning DL foundation easy

A A

B D

C C

D B 

E All of the above

A neural network can approximate any function, so all of the decision boundaries above can be produced by a neural network through supervised training.

 

159 In the figure below, we can observe many small “fluctuations” in the error. Should we be worried about this? Deep learning DL foundation is easy

A Yes, this may mean that there is a problem with the learning rate of the neural network

B No, as long as the error shows an overall decrease on the training set and the cross-validation set

C Do not know

D Hard to say

Answer :(B)

Option B is correct. To reduce these “ups and downs”, try increasing the batch size. Specifically, when the overall trend of the curve is downward, increasing the batch size narrows the swing in the direction of the aggregate batch gradient and so reduces the fluctuations. When the overall trend is flat but the fluctuations are considerable, try reducing the learning rate to converge further. When the curve is flat and the fluctuations are no longer significant, stop training early to avoid over-fitting

 

160 Which of the following factors should be considered when choosing the depth of a neural network? Deep learning DL foundation is easy

1 Type of neural network (e.g. MLP, CNN)

2 Input data

3 Computational power (determined by hardware and software capabilities)

4 Learning rate

5 The output function to be mapped

A 1,2,4,5

B 2,3,4,5

C All need to be considered

D 1,3,4,5

Answer :(C)

All of the above factors matter when selecting the depth of a neural network model. The more levels of feature extraction are required, the higher the dimension of the input data, and the more complex the nonlinear output function to be mapped, the greater the required depth. In addition, hardware computing power and the learning rate should be taken into account, so that the larger number of parameters brought by the extra depth can still be trained to good results in a reasonable time.

 

161 When working on a specific problem, you may have only a small amount of data. Fortunately, you have a neural network that was pre-trained on a similar problem. Which of the following methods can take advantage of this pre-trained network? (C) Deep learning DL basic easy

A Freeze all layers except the last layer and retrain the last layer

B Retrain the whole model with the new data

C Fine-tune only the last few layers

D Evaluate each layer of the model

Having a pre-trained neural network is equivalent to replacing random initialization with a reliable prior for every parameter of the network. If the small amount of new data comes from the previous training data (or the previous training data describes the data distribution well and the new data is sampled from exactly the same distribution), then freeze all the earlier layers and retrain only the last layer. In general, however, the distribution of the new data deviates from that of the previous training set. When the pre-trained network is not sufficient to fit the new data, most of the earlier layers can be frozen and only the last few layers trained (this is known as fine-tuning).

 

162 Does increasing the size of the convolution kernel necessarily improve the performance of a convolutional neural network? (C) No. Increasing the kernel size does not necessarily improve performance; it depends largely on the data set.

 

Briefly describe the history of neural networks. Deep learning DL foundation easy

@SIY.Z, analysis source: zhuanlan.zhihu.com/p/29435406 Sigmoid saturates and its gradient vanishes, hence ReLU. The negative half-axis of ReLU is a dead zone where the gradient becomes 0, hence LeakyReLU and PReLU. Emphasis on the stability of gradients and of the weight distribution led to ELU and, more recently, SELU. Networks became too deep for the gradient to propagate, hence highway networks. Then even the parameters of highway networks were dropped in favor of a plain residual connection, hence ResNet. Means and variances were forcibly stabilized, hence BatchNorm. Adding noise to the gradient flow gave Dropout. RNN gradients are unstable, so several paths and gates were added, giving LSTM. Simplifying LSTM gave GRU. The JS divergence of GAN causes the gradient to vanish or become meaningless, hence WGAN. The weight clipping in WGAN had problems of its own, hence WGAN-GP.

Talk about Spark’s performance tuning. Big data Hadoop/Spark https://tech.meituan.com/spark-tuning-pro.html

 

What are the common classification algorithms? Machine learning ML basic easy. SVM, neural networks, random forest, logistic regression, KNN, Bayesian classifiers

 

What are the common supervised learning algorithms? Machine learning ML basic. Perceptron, SVM, artificial neural networks, decision trees, logistic regression

 

166 Other things being equal, which of the following practices is likely to cause overfitting in machine learning?

A. Increasing the size of the training set

B. Reducing the number of nodes in the hidden layers of a neural network

C. Removing sparse features

D. Using a Gaussian/RBF kernel instead of a linear kernel in the SVM algorithm

Correct answer: D

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

In general, the more complex the system, the higher the probability of overfitting; a relatively simple model usually generalizes better.

On B: it is generally believed that increasing the number of hidden layers can reduce the network error (some literature argues that it may not reduce it effectively) and improve accuracy, but it also complicates the network, increasing training time and the tendency to over-fit. Reducing the number of hidden nodes works in the opposite direction.

On D: the Gaussian kernel of SVM gives a more complex model than the linear kernel and over-fits more easily. The radial basis function (RBF) kernel, or Gaussian kernel, maps the original space to an infinite-dimensional space. If its bandwidth parameter σ is chosen very large, the weights on the higher-order features decay very quickly, so the mapping is (numerically, approximately) equivalent to a low-dimensional subspace. Conversely, if σ is chosen very small, any data can be mapped to be linearly separable, which is not always a good thing, since it can lead to very serious overfitting. In general, however, the Gaussian kernel is quite flexible once its parameter is tuned, and it is one of the most widely used kernels.
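The effect of the kernel width described above can be illustrated with a small sketch. It assumes the common parameterization K(x, y) = exp(-(x - y)^2 / (2σ^2)); the sample points and the two σ values are hypothetical and chosen only for illustration.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian/RBF kernel exp(-(x - y)^2 / (2 * sigma^2)) for scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def gram(points, sigma):
    """Kernel (Gram) matrix of all pairwise similarities."""
    return [[rbf_kernel(a, b, sigma) for b in points] for a in points]

points = [0.0, 1.0, 2.0]      # hypothetical 1-D samples

narrow = gram(points, 0.05)   # very small sigma: each point resembles only itself
wide = gram(points, 100.0)    # very large sigma: all points look almost identical

print(narrow[0][1], wide[0][1])
```

With the very small σ the Gram matrix is numerically the identity, so a classifier can memorize any labelling of the training points, which is the overfitting risk the explanation mentions. With the very large σ every similarity is close to 1 and the model behaves almost like a much simpler one.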

 

167 Which of the following time series models is best suited to the analysis and prediction of volatility? Machine learning ML model easy

A. AR model

B. MA model

C. ARMA model

D. GARCH model

Correct answer: D

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

The AR model is a linear predictor: given N data points, the model can infer the data before or after the N-th point (say, up to point P), so in essence it is similar to interpolation.

The MA model (moving average model) uses the trend moving-average method to build a prediction model for a linear trend.

The ARMA model (autoregressive moving average model) is one of the high-resolution spectral analysis methods based on model parameters. It is a typical method for studying the rational spectrum of a stationary random process. Compared with the AR and MA models it gives more accurate spectral estimates and better spectral resolution, but its parameter estimation is more complicated.

The GARCH model, or generalized ARCH model, is an extension of the ARCH model developed by Bollerslev (1986). The GARCH(P,0) model is equivalent to the ARCH(P) model. GARCH is a regression model tailored to financial data: beyond what an ordinary regression model does, it also models the variance of the errors. It is especially suitable for the analysis and prediction of volatility, which can be an important guide for investors’ decisions and often matters more than the analysis and prediction of the value itself.

 

168 Which of the following are optimal criteria for a linear classifier? Machine learning ML model easy

A. Perceptron criterion function

B. Bayesian classification

C. Support vector machine (SVM)

D. Fisher criterion

Correct answer: ACD

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

There are three main kinds of linear classifier: the perceptron criterion function, SVM, and the Fisher criterion. The Bayesian classifier is not a linear classifier.

Perceptron criterion function: the criterion minimizes the sum of the distances from the misclassified samples to the decision boundary. Its advantage is that the classifier can be corrected using the information supplied by the misclassified samples. This criterion is the basis of the multilayer perceptron in artificial neural networks.

Support vector machine: the basic idea is that, for two linearly separable classes, the decision boundary is designed to maximize the margin between the two classes. Its starting point is to minimize the expected generalization risk.

Fisher criterion: more widely known as linear discriminant analysis (LDA). It projects all samples onto a line through the origin so that samples of the same class are as close together as possible and samples of different classes are as far apart as possible; concretely, it maximizes the “generalized Rayleigh quotient”. Based on the characteristics of the two classes of samples, it finds the best normal-vector direction for the linear classifier, so that the projections of the two classes in that direction are dense within each class and well separated between classes. This is measured with the within-class scatter matrix Sw and the between-class scatter matrix Sb.
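For reference, the “generalized Rayleigh quotient” mentioned for the Fisher criterion can be written out for the two-class case. This is the standard textbook form; the symbols w (projection direction), μ1 and μ2 (class means) and C1, C2 (the two sample sets) are notation introduced here, not taken from the original answer.

```latex
J(w) = \frac{w^{\top} S_b\, w}{w^{\top} S_w\, w},
\qquad
S_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{\top},
\qquad
S_w = \sum_{k=1}^{2} \sum_{x \in C_k} (x - \mu_k)(x - \mu_k)^{\top}

% The maximizer is the direction
w^{*} \propto S_w^{-1} (\mu_1 - \mu_2)
```

A large numerator means the projected class means are far apart, and a small denominator means each class is tightly clustered after projection, which matches the verbal description above.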

 

169 The advantages of the H-K algorithm, which is based on a quadratic criterion function, over the perceptron algorithm are ()? Deep learning DL foundation easy

A. It requires less computation

B. It can determine whether the problem is linearly separable

C. Its solution is fully applicable to the non-linearly-separable case

D. Its solution is more adaptable

Correct answer: BD

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

The idea of the H-K algorithm is simple: find the weight vector under the minimum mean square error criterion. Its advantage over the perceptron algorithm is that it applies to both the linearly separable and the non-separable case. For a linearly separable problem it gives the optimal weight vector; for a non-separable problem it can detect the situation and exit the iteration.

 

170 Which of the following statements are true? () Machine learning ML model

A. SVM is robust to noise (e.g. noise samples from other distributions)

B. In the AdaBoost algorithm, the weight update ratio is the same for all misclassified samples

C. Boosting and Bagging are both voting methods that combine multiple classifiers, and both determine the weight of a single classifier according to its accuracy

D. Given n data points, if half are used for training and half for testing, the difference between training error and test error decreases as n increases

Correct answer: BD

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

A. SVM has a certain robustness to noise, but experiments show that this holds only while the noise rate stays below a certain level; as the noise rate keeps increasing, the recognition rate of the classifier decreases.

B. In AdaBoost, different training sets are realized by adjusting the weight of each sample. At the beginning every sample has the same weight, 1/n, where n is the number of samples, and a weak classifier is trained under this sample distribution. The weights of misclassified samples are increased and the weights of correctly classified samples are reduced, so the misclassified samples stand out and a new sample distribution is obtained. A weak classifier is trained again under the new distribution, and so on. Finally all the weak classifiers are combined into a strong classifier.

C. Bagging and Boosting differ in how they sample: Bagging uses uniform sampling, while Boosting samples according to the error rate. Bagging’s prediction functions are unweighted, while Boosting’s are weighted. Bagging’s prediction functions can be generated in parallel, while Boosting’s can only be generated sequentially.

@antz:

A. SVM minimizes structural risk and handles empirical risk weakly, so it is sensitive to data noise.

B. In AdaBoost, each iteration trains one learner and derives the learner’s weight alpha from its misclassification rate. From alpha, two update ratios are computed to correct the weights of all samples: correctly classified samples are multiplied by exp(-alpha) and misclassified samples by exp(alpha). So all misclassified samples have the same weight update ratio.

C. Bagging’s learners are unweighted and it simply takes the voting result. Within Boosting, AdaBoost determines the weight from the misclassification rate, while GBDT uses a fixed small weight (also known as the learning rate), with the approximated pseudo-residual itself taking the place of the weight.

D. According to the central limit theorem, as n increases the difference between training error and test error will inevitably decrease. This is the rationale for training on big data.
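The AdaBoost reweighting step discussed in option B can be sketched as follows. This is a minimal illustration of the standard binary AdaBoost update, not code from the original answers; the five-sample outcome list is hypothetical, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: return (alpha, new normalized weights).

    weights: current sample weights, summing to 1
    correct: list of bools, True where the weak learner was right
    """
    # Weighted error of the weak learner (assumed to satisfy 0 < err < 1).
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)
    # Correct samples shrink by exp(-alpha); misclassified ones grow by exp(alpha).
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

n = 5
weights = [1 / n] * n                        # initial weights are all 1/n
correct = [True, True, True, False, False]   # hypothetical weak-learner outcome
alpha, new_weights = adaboost_reweight(weights, correct)
print(alpha, new_weights)
```

Every misclassified sample is multiplied by the same factor, which is the sense in which statement B holds. After normalization the misclassified samples together carry half of the total weight.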

An input image of size 200×200 passes through a convolution layer (kernel size 5×5, padding 1, stride 2), a pooling layer (kernel size 3×3, padding 0, stride 1), and another convolution layer (kernel size 3×3, padding 1, stride 1). The size of the output feature map is:

A. 95

B. 96

C. 97

D. 98

E. 99

F. 100

171 Deep learning DL basic easy, correct answer: C

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

First, we need the formula for the output size after a convolution or pooling layer:

$$\text{output}_w = \left\lfloor \frac{\text{image}_w + 2\,\text{padding} - \text{kernelsize}}{\text{stride}} \right\rfloor + 1$$

$$\text{output}_h = \left\lfloor \frac{\text{image}_h + 2\,\text{padding} - \text{kernelsize}}{\text{stride}} \right\rfloor + 1$$

Here, padding is the number of pixels added around each edge, and stride is the step length, that is, how far the kernel moves each time.

With this the problem is easy. The width and height are equal, so we only need to compute one dimension. The size after the first convolution is:

$$\left\lfloor \frac{200 + 2 \times 1 - 5}{2} \right\rfloor + 1 = 99$$

The size after the pooling layer is:

$$\left\lfloor \frac{99 + 2 \times 0 - 3}{1} \right\rfloor + 1 = 97$$

The size after the second convolution is:

$$\left\lfloor \frac{97 + 2 \times 1 - 3}{1} \right\rfloor + 1 = 97$$

The final result is 97.
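The same calculation can be checked with a few lines of Python. This is a small sketch of the floor-based size formula above; the function name `output_size` and the `layers` list are illustrative names, not part of the original question.

```python
def output_size(size: int, kernel: int, padding: int, stride: int) -> int:
    """Spatial size after one convolution or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 200
# (kernel, padding, stride) for conv -> pool -> conv, as given in the question
layers = [(5, 1, 2), (3, 0, 1), (3, 1, 1)]

sizes = []
for kernel, padding, stride in layers:
    size = output_size(size, kernel, padding, stride)
    sizes.append(size)

print(sizes)  # [99, 97, 97]
```

Integer division `//` plays the role of the floor bracket, which matters in the first layer, where 197 / 2 is not an integer.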

172 In the basic analysis module of SPSS, the function that “reveals the relationship between data in the form of a row-by-column table” is () Big data Hadoop/Spark easy A. … B. … C. … D. …

173 A prison face-recognition access system is used to identify people waiting to enter the prison. The system must distinguish four types of people: prison guards, thieves, food delivery personnel, and others. Which of the following learning methods best suits this requirement? () Machine learning

A. Binary classification problem

B. Multi-class classification problem

C. Hierarchical clustering problem

D. K-center clustering problem

E. Regression problem

F. Structural analysis problem

Correct answer: B

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

Binary classification: each classifier can only split samples into two classes. The samples here are guards, thieves, food delivery personnel and others, so binary classification alone will not work. The basic support vector machine proposed by Vapnik in 1995 is a binary classifier; training it amounts to solving an optimization problem (the dual problem) derived from the positive/negative two-class setting. To solve multi-class problems, binary classifiers can be cascaded in a decision tree; the VC dimension is the concept that describes the complexity involved.

Hierarchical clustering: builds a hierarchy that decomposes a given data set. The guards, thieves, food delivery personnel and others are peers rather than levels of a hierarchy, so this does not fit. The method has two modes of operation: top-down (divisive) and bottom-up (agglomerative).

K-center clustering: picks actual objects to represent the clusters, one per cluster. It is a partitioning rule built around central points, so it is not appropriate here.

Regression analysis: a statistical method for handling correlations between variables. There is no direct correlation among guards, thieves, food delivery personnel and others, so it does not apply.

Structural analysis: a statistical method that, on the basis of statistical grouping, computes the proportion of each component and then analyzes the internal structure of a phenomenon, the nature of that structure, and how it changes over time. Its basic form is the calculation of structural indices. It does not work here either.

Multi-class classification: train several different weak classifiers for different attributes, then combine them into one strong classifier. Here, a class is set up for each of guard, thief, food delivery person and other according to their characteristics, and people are then distinguished and identified.

174 Which of the following statements about Logit regression and SVM is incorrect? () Machine learning ML model easy

A. The objective function of Logit regression minimizes the posterior probability

B. Logit regression can be used to predict the probability that an event occurs

C. The objective of SVM is to minimize structural risk

D. SVM can effectively avoid model overfitting

Correct answer: A

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

A. Logit regression is essentially maximum likelihood estimation of the weights from the samples, whereas the posterior probability is proportional to the product of the prior probability and the likelihood function. Logit regression only maximizes the likelihood function; it does not maximize the posterior probability, let alone minimize it. Maximizing the posterior probability is what the naive Bayes classifier does. A is wrong.

B. The output of Logit regression is the probability that the sample belongs to the positive class, so the probability can be computed. B is correct.

C. The goal of SVM is to find the hyperplane that separates the training data as well as possible with the maximum classification margin, which is structural risk minimization.

D. SVM can control the complexity of the model through the regularization coefficient and so avoid overfitting.

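175 There are two sample points. The first point is a positive sample with feature vector (0,-1); the second point is a negative sample with feature vector (2,3). For a linear SVM classifier built from a training set consisting of these two points, the equation of the separating surface is () A. 2x+y=4 B. x+2y=5 C. x+2y=3 D. 2x-y=0

The problem is simplified here: for two points, the maximum-margin boundary is their perpendicular bisector, so it is enough to find the perpendicular bisector. Its slope is the negative reciprocal of the slope of the segment joining the two points, -1/((-1-3)/(0-2)) = -1/2, which gives y = -(1/2)x + c. The line passes through the midpoint ((0+2)/2, (-1+3)/2) = (1, 1), so c = 3/2, that is, x+2y=3. So choose C.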
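The perpendicular-bisector argument can be verified numerically. The sketch below assumes the two training points are (0, -1) and (2, 3), as implied by the midpoint calculation in the answer; the helper name `perpendicular_bisector` is illustrative.

```python
def perpendicular_bisector(p, q):
    """Return (a, b, c) such that a*x + b*y = c is equidistant from p and q."""
    # The segment direction q - p is the normal vector of the bisector.
    a, b = q[0] - p[0], q[1] - p[1]
    # The midpoint of the segment lies on the bisector.
    mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2
    return a, b, a * mx + b * my

a, b, c = perpendicular_bisector((0, -1), (2, 3))
print(a, b, c)  # 2 4 6.0, i.e. 2x + 4y = 6, which simplifies to x + 2y = 3
```

Dividing through by 2 gives x + 2y = 3, matching option C. With only one sample per class, both points are support vectors, which is why the maximum-margin boundary reduces to this geometric construction.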

176 Which of the following descriptions of a classification algorithm’s precision, recall and F1 value is wrong?

A. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures how precise the retrieval system is

B. Recall is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document library; it measures how complete the retrieval system is

C. Precision, recall and the F value all lie between 0 and 1; the closer the value is to 0, the higher the precision or recall

D. The F1 score was introduced to resolve the conflict between precision and recall

Correct answer: C

Analysis: precision and recall are commonly used to evaluate binary classification problems. Usually the class of interest is the positive class and the other class is the negative class. The classifier’s prediction on the test set is either correct or incorrect, giving four cases whose counts are denoted:

TP: the number of positive samples predicted as positive

FN: the number of positive samples predicted as negative

FP: the number of negative samples predicted as positive

TN: the number of negative samples predicted as negative

Precision is defined as P = TP/(TP + FP). Recall is defined as R = TP/(TP + FN). The F1 value is defined as F1 = 2PR/(P + R). Precision, recall and F1 all lie between 0 and 1, and when precision and recall are both high, F1 is high too. It is not true that a value closer to 0 is better; it should be that a value closer to 1 is better.
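The three definitions above translate directly into code. The counts used here (8 true positives, 2 false positives, 8 false negatives) are hypothetical and chosen only to show precision and recall pulling in different directions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # P = TP / (TP + FP)
    recall = tp / (tp + fn)      # R = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

Here precision is 0.8 and recall is 0.5, and F1 (about 0.615) falls between them, closer to the smaller value. This is why F1 is used as a single score that penalizes a model for neglecting either measure.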

177 Which of the following methods are discriminative models? () Machine learning ML model easy

1) Gaussian mixture model

2) Conditional random field model

3) Discriminative training

4) Hidden Markov model

A. 2,3 B. 3,4 C. 1,4 D. 1,2

Correct answer: A

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

Common discriminative models include: logistic regression, linear discriminant analysis, support vector machines, boosting (ensemble learning), conditional random fields, linear regression, and neural networks.

Common generative models include: Gaussian mixture models and other types of mixture model, hidden Markov models, AODE (averaged one-dependence estimators), latent Dirichlet allocation (the LDA topic model), and the restricted Boltzmann machine.

A generative model obtains its result by multiplying probabilities, whereas a discriminative model computes the result directly from the given input.

178 In SPSS, the data preparation functions are mainly concentrated in menus such as (). Big data Hadoop/Spark easy

A. Data

B. Direct marketing

C. Analyze

D. Transform

Correct answer: AD

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

Data preparation is done mainly in the Data and Transform menus.

 

179

Deep learning is a popular machine learning approach that involves a large number of matrix multiplications. Suppose the product ABC of three dense matrices A, B and C must be computed, and the dimensions of the three matrices are m∗n, n∗p and p∗q, with m < n < p < q. Which of the following computation orders is the most efficient? ()

A. (AB)C

B. AC(B)

C. A(BC)

D. The efficiency is the same in all cases

Deep learning DL foundation easy, correct answer: A

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

First, a basic fact about matrices: for A times B, the number of columns of A must equal the number of rows of B. That rules out option B, leaving A and C to compare. In option A, multiplying the m∗n matrix A by the n∗p matrix B yields the m∗p matrix AB, and each element of AB requires n multiplications and n-1 additions. Ignoring additions, this takes m∗n∗p multiplications in total. By the same reasoning, multiplying AB by C takes m∗p∗q multiplications. Therefore option A, (AB)C, requires m∗n∗p + m∗p∗q multiplications. In the same way, option C, A(BC), requires n∗p∗q + m∗n∗q. Since m∗n∗p < m∗n∗q and m∗p∗q < n∗p∗q, option A needs fewer multiplications, so choose A.
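The two multiplication counts can be compared directly. This sketch counts scalar multiplications only, as the analysis does; the concrete sizes 10, 20, 30, 40 are hypothetical values chosen to satisfy m < n < p < q.

```python
def cost_ab_then_c(m, n, p, q):
    """Scalar multiplications for (AB)C: AB costs m*n*p, then (AB)C costs m*p*q."""
    return m * n * p + m * p * q

def cost_a_then_bc(m, n, p, q):
    """Scalar multiplications for A(BC): BC costs n*p*q, then A(BC) costs m*n*q."""
    return n * p * q + m * n * q

m, n, p, q = 10, 20, 30, 40              # hypothetical sizes with m < n < p < q
left_first = cost_ab_then_c(m, n, p, q)   # 6000 + 12000
right_first = cost_a_then_bc(m, n, p, q)  # 24000 + 8000
print(left_first, right_first)            # 18000 32000
```

Multiplying from the left keeps the small dimension m in every intermediate product, which is why (AB)C is cheaper whenever the dimensions grow from left to right.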

 

180

Naive Bayes is a special Bayes classifier. The feature variable is X and the class label is C. One of its assumptions is: ()

A. The prior probabilities P(C) of all classes are equal

B. A normal distribution with mean 0 and standard deviation sqrt(2)/2

C. Each dimension of the feature variable X is a class-conditionally independent random variable

D. P(X|C) is a Gaussian distribution

Machine learning ML model, correct answer: C

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

Naive Bayes assumes that the feature variables are independent of one another given the class.

 

181

Regarding the support vector machine (SVM), which of the following statements is wrong? ()

A. The L2 regularization term maximizes the classification margin and gives the classifier stronger generalization ability

B. The hinge loss function minimizes the empirical classification error

C. The classification margin is 1/||w||, where ||w|| denotes the norm of the vector

D. The smaller the parameter C, the larger the classification margin, the more classification errors, and the more the model tends to underfit

Machine learning ML model, easy, correct answer: C

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

A is correct. The reason for adding a regularization term is as follows. Imagine a perfect data set in which y>1 is the positive class, y<-1 is the negative class, and the decision surface is y=0. Add one positive noise sample at y=-30: the decision surface becomes badly “distorted”, the classification margin shrinks, and generalization ability drops. With the regularization term added, tolerance to the noise sample improves. In the example above the decision surface is no longer so “crooked”, which makes the margin larger and improves generalization.

B is correct.

C is wrong. The margin should be 2/||w||. The second half of the sentence is correct: the norm of a vector usually means the 2-norm.

D is correct. With a soft margin, the effect of C on the optimization problem is to restrict the range of the Lagrange multipliers a from [0, +inf] to [0, C]. The smaller C is, the smaller a becomes. Setting the derivative of the Lagrangian to zero gives w∗ = sum of ai∗yi∗xi, so a smaller a makes w smaller, and the margin 2/||w|| therefore becomes larger.

 

182 In an HMM, if the observation sequence and the state sequence that generated it are both known, which of the following methods can be used to estimate the parameters directly? () Machine learning ML model easy

A. EM algorithm

B. Viterbi algorithm

C. Forward-backward algorithm

D. Maximum likelihood estimation

Correct answer: D

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

EM algorithm: learns the model parameters when there is only an observation sequence and no state sequence; for HMMs this is the Baum-Welch algorithm.

Viterbi algorithm: uses dynamic programming to solve the prediction problem of the HMM; it is not a parameter estimation method.

Forward-backward algorithm: used to compute probabilities.

Maximum likelihood estimation: a supervised learning method used to estimate the parameters when both the observation sequence and the corresponding state sequence are available.

Note that maximum likelihood estimation can be used to estimate the model parameters when the observation sequence and the corresponding state sequence are both given. EM is used only when the observation sequence is given without a corresponding state sequence, and the state sequence is then treated as unobservable hidden data.
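When the state sequence is known, maximum likelihood estimation of an HMM reduces to counting and normalizing. The sketch below illustrates this for transition and emission probabilities using a single labelled sequence; the state names, observation symbols and the function name `estimate_hmm` are all hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def estimate_hmm(states, observations):
    """Count-based MLE of transition and emission probabilities."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    # Count state-to-state transitions between consecutive positions.
    for prev, nxt in zip(states, states[1:]):
        trans_counts[prev][nxt] += 1
    # Count which observation each state emitted.
    for state, obs in zip(states, observations):
        emit_counts[state][obs] += 1

    def normalize(counts):
        # Turn each row of counts into relative frequencies.
        return {
            key: {x: c / sum(row.values()) for x, c in row.items()}
            for key, row in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)

states = ["H", "H", "C", "C", "H"]          # hypothetical hidden states
observations = ["3", "2", "1", "1", "3"]    # hypothetical observations
A, B = estimate_hmm(states, observations)
print(A)
print(B)
```

No iteration is needed because nothing is hidden. Baum-Welch (EM) is required only when the state sequence is unobserved and these counts have to be replaced by expected counts.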

 

183 Suppose a student using a Naive Bayes (NB) classification model accidentally duplicated two dimensions of the training data. Which of the following statements about NB are correct? () Machine learning ML model easy

A. The decisive role of the repeated feature in the model will be strengthened

B. The model will be less accurate than it would be without the repeated feature

C. If all features are repeated, the model’s predictions are the same as those of the model without repetition

D. When two columns of features are highly correlated, the conclusion obtained for two identical columns cannot be used to analyze the problem

E. NB can be used to do least squares regression

F. …

Correct answer: BD

Naive Bayes assumes that the variables are independent of one another. If a highly correlated feature is introduced into the model twice, its importance is inflated and performance degrades, because the data now contains highly correlated features. The correct approach is to evaluate the correlation matrix of the features and remove those that are highly correlated.
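A tiny numeric sketch shows how a duplicated feature is counted twice under the independence assumption. The likelihood values (0.8 and 0.4) and the equal priors are hypothetical numbers chosen for illustration.

```python
def nb_posterior(prior_pos, likelihoods_pos, likelihoods_neg):
    """P(positive | features) under the naive conditional-independence assumption."""
    score_pos = prior_pos
    score_neg = 1 - prior_pos
    # Naive Bayes multiplies one likelihood term per feature column.
    for lp, ln in zip(likelihoods_pos, likelihoods_neg):
        score_pos *= lp
        score_neg *= ln
    return score_pos / (score_pos + score_neg)

# Hypothetical: P(x=1 | pos) = 0.8, P(x=1 | neg) = 0.4, equal priors.
single = nb_posterior(0.5, [0.8], [0.4])             # feature used once
doubled = nb_posterior(0.5, [0.8, 0.8], [0.4, 0.4])  # same feature duplicated
print(single, doubled)
```

The evidence has not changed, yet the posterior moves from about 0.667 to 0.8 because the same likelihood ratio is applied twice. This overconfidence is the mechanism behind the loss of accuracy described above.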

 

184 L1 and L2 norm. Machine learning ML foundation easy

What would happen if both L1 and L2 norms were added in Logistic Regression? ()

A. Feature selection can be performed and over-fitting can be prevented to a certain extent

B. Can solve dimensional disaster problems

C. It can speed up the calculation

D. More accurate results can be obtained

Correct answer: ABC

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta… The L1 norm yields sparse solutions. Note, however, that a feature not selected by L1 is not necessarily unimportant, because only one of two highly correlated features may be retained. To determine which features are important, use cross-validation. L1 has the useful property of producing sparsity, driving many entries of W to zero. Besides its computational benefits, a sparse model is also more “interpretable”. So it speeds up computation and alleviates the curse of dimensionality, which makes B and C correct.

With the regularization term added to the cost function, L1 gives Lasso regression and L2 gives ridge regression. The L1 norm is the sum of the absolute values of the elements of a vector and is used for feature selection. The L2 norm is the square root of the sum of the squares of the elements and is used to prevent overfitting and improve the generalization ability of the model. So A is correct as well.

For detailed solutions to norm regularization in machine learning, i.e., L0,L1, and L2 norms, see Norm Regularization.

185 Regularization. What is the difference between L1 regularization and L2 regularization in machine learning?

A. Using L1 gives sparse weights

B. Using L1 gives smooth weights

C. Using L2 gives sparse weights

D. Using L2 gives smooth weights

Correct answer: AD

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

L1 regularization favors sparsity: it performs feature selection automatically and removes useless features by setting their weights to 0. The main role of L2 is to prevent overfitting: the smaller the parameters are required to be, the simpler the model, and the simpler the model, the smoother it tends to be, which prevents overfitting.

L1 regularization adds the L1 norm of the coefficient vector W to the loss function as a penalty term. Since the regularization term is non-zero, this forces the coefficients of weak features to zero. L1 regularization therefore tends to make the learned model sparse (many coefficients in W are 0), which makes it a good feature selection method.

L2 regularization adds the L2 norm of the coefficient vector to the loss function. Because the coefficients in the L2 penalty are squared, L2 differs from L1 in many ways. The most obvious is that L2 regularization evens out the coefficient values. For correlated features, this means they receive more similar coefficients. Take Y=X1+X2 as an example and assume X1 and X2 are strongly correlated. With L1 regularization the penalty is the same, 2alpha, whether the learned model is Y=X1+X2 or Y=2X1. With L2, the first model has a penalty of 2alpha but the second has a penalty of 4alpha. So when the sum of the coefficients is fixed, the penalty is smallest when the coefficients are equal, which is why L2 pushes the coefficients toward the same value.

L2 regularization is therefore a stable model for feature selection, unlike L1 regularization, whose coefficients fluctuate with small changes in the data. The two provide different kinds of value, and L2 regularization is more useful for understanding features: a feature with strong predictive power gets a non-zero coefficient. In short, L1 tends to keep a small number of features and set all the others to zero, while L2 keeps more features with coefficients that are all close to zero. Lasso is very useful for feature selection, whereas Ridge is just regularization. For details, see Feature selection in Machine learning and Regularization of Machine learning norms.
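The Y=X1+X2 versus Y=2X1 comparison can be reproduced in a few lines. This sketch follows the convention used in the text, where the L2 penalty is alpha times the sum of squared coefficients; the function names are illustrative and alpha is set to 1 for readability.

```python
def l1_penalty(coefs, alpha):
    """alpha times the sum of absolute coefficient values."""
    return alpha * sum(abs(c) for c in coefs)

def l2_penalty(coefs, alpha):
    """alpha times the sum of squared coefficient values."""
    return alpha * sum(c * c for c in coefs)

alpha = 1.0
model_a = [1.0, 1.0]   # Y = X1 + X2
model_b = [2.0, 0.0]   # Y = 2*X1

print(l1_penalty(model_a, alpha), l1_penalty(model_b, alpha))  # 2.0 2.0
print(l2_penalty(model_a, alpha), l2_penalty(model_b, alpha))  # 2.0 4.0
```

L1 is indifferent between the two models, so nothing stops it from putting all the weight on one of the correlated features. L2 charges twice as much for the concentrated solution, so it prefers to spread the weight evenly.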

 

186 Potential function method. Machine learning ML basic. The cumulative potential function K(x) of the potential function method plays a role equivalent to which of the following in Bayes decision? ()

A. Posterior probability

B. Prior probability

C. Class probability density

D. Product of the class probability density and the prior probability

Correct answer: AD

@Liu Xuan 320, question and analysis source: blog.csdn.net/column/deta…

In fact, A and D say the same thing. For a detailed explanation of potential functions, please refer to “potential function method”.

 

187 Hidden Markov. Machine learning ML model easy. Which of the following statements about the three basic problems of the hidden Markov model and their corresponding algorithms are correct? () A. Evaluation: forward-backward algorithm B. Decoding: Viterbi algorithm C. Learning: Baum-Welch algorithm D. … forward/backward algorithm

What kind of classifier should be chosen when the number of features is larger than the amount of data? (Machine learning, ML basics) A linear classifier: when the dimension is high, the data is generally sparse in that space and is likely to be linearly separable. From http://blog.sina.com.cn/s/blog_178bcad000102x70r.html

 

188 The following fall under the category of unsupervised learning: C. Maximum entropy D. Supervised learning

Which of the following is not an advantage of the CRF model over the HMM and MEMM models? () (Machine learning, ML models) The advantages of CRF are as follows. Speed is not one of them: CRF is slow. CRF does not have the strict independence assumptions of HMM, so it can accommodate arbitrary context information. At the same time, because CRF computes the conditional probability of the globally optimal output nodes, it overcomes the label-bias shortcoming of the maximum entropy Markov model (compared with MEMM). Given the observation sequence to be labeled, CRF computes the joint probability distribution of the whole label sequence, rather than defining the distribution of the next state given the current state (compared with ME

189 What methods are used to handle missing values in data cleaning? (Machine learning, ML basics, easy) A. Estimation B. Casewise deletion C. Variable deletion D. Pairwise deletion. Correct answer: ABCD. @Liu Xuan 320, source of the question and analysis: blog.csdn.net/column/deta… Because of survey, coding, and input errors, data may contain invalid and missing values that need proper treatment. Common methods are estimation, casewise deletion, variable deletion, and pairwise deletion.
Estimation. The simplest way is to replace invalid and missing values with the sample mean, median, or mode of the variable. This is simple but does not fully use the information already in the data, so the error may be large. Another approach is to estimate the value from the respondent's answers to other questions, through correlation analysis or logical inference between variables. For example, ownership of a certain product may be related to household income, so the probability of owning the product can be inferred from the respondent's household income.
Casewise deletion removes every sample that contains a missing value. Since many questionnaires may have missing values, this can greatly reduce the effective sample size, and the collected data is not fully used. It is therefore only suitable when key variables are missing, or when the proportion of samples with invalid or missing values is very small.
Variable deletion. If a variable has many invalid and missing values and is not particularly important to the problem under study, the variable can be removed. This reduces the number of variables available for analysis but does not change the sample size.
Pairwise deletion. A special code (usually 9, 99, 999, etc.) marks invalid and missing values, and all variables and samples in the data set are retained. Each specific calculation, however, uses only the samples with complete answers for the variables involved, so different analyses have different effective sample sizes. This is a conservative approach that preserves as much of the information in the data set as possible.
Different treatments may affect the results of the analysis, especially when the missing values do not occur at random and the variables are significantly correlated. Invalid and missing values should therefore be avoided as far as possible, to ensure the integrity of the data.
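The four treatments can be sketched on a tiny table. The rows, the column meanings, and the use of `None` as the missing-value marker are assumptions made for illustration only.

```python
MISSING = None  # stand-in for a missing answer (hypothetical encoding)

# Each row is one respondent: (income, age, owns_product)
rows = [
    (30.0, 25.0, 1.0),
    (MISSING, 40.0, 0.0),
    (50.0, MISSING, 1.0),
    (70.0, 35.0, MISSING),
]

def impute_mean(rows, col):
    """Estimation: replace missing values in one column with the column mean."""
    present = [r[col] for r in rows if r[col] is not MISSING]
    mean = sum(present) / len(present)
    return [r[:col] + (mean if r[col] is MISSING else r[col],) + r[col + 1:]
            for r in rows]

def casewise_delete(rows):
    """Casewise deletion: drop every row that has any missing value."""
    return [r for r in rows if MISSING not in r]

def drop_variable(rows, col):
    """Variable deletion: remove one column, keep every row."""
    return [r[:col] + r[col + 1:] for r in rows]

def pairwise_complete(rows, col_a, col_b):
    """Pairwise deletion: keep rows complete for the two columns in use."""
    return [(r[col_a], r[col_b]) for r in rows
            if r[col_a] is not MISSING and r[col_b] is not MISSING]

print(len(casewise_delete(rows)))          # rows left after casewise deletion
print(len(pairwise_complete(rows, 0, 1)))  # rows usable for income vs. age
```

On this toy data casewise deletion keeps a single row, while each pairwise analysis still has two usable rows, which is the sense in which pairwise deletion preserves more information.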

190 Which of the following statements about linear regression are correct? () (Machine learning, ML basics, easy) A. The basic assumptions include that the random disturbance term follows a standard normal distribution with mean 0 and variance 1. B. The basic assumptions include that the random disturbance term follows a homoscedastic normal distribution with mean 0. C. When the basic assumptions are violated, the ordinary least squares estimator is no longer the best linear unbiased estimator. D. When the basic assumptions are violated, the model can no longer be estimated. E. The DW test can be used to check whether the residuals are serially correlated. F. Multicollinearity decreases the variance of the parameter estimates. Correct answer: ACEF. @Liu Xuan 320, source of the question and analysis: blog.csdn.net/column/deta…

A, B: The basic assumptions of simple (one-variable) linear regression are: 1. The random error term is a random variable with expected value (mean) 0; 2. For all observed values of the explanatory variables, the random error terms have the same variance; 3. The random error terms are uncorrelated with each other; 4. The explanatory variables are deterministic, not random, and are independent of the random error term; 5. There is no exact (perfect) linear relationship between explanatory variables, that is, the matrix of sample observations of the explanatory variables has full rank; 6. The random error term is normally distributed. An econometric model that violates the basic assumptions can still be estimated, but ordinary least squares should not be used as is. When heteroscedasticity exists, ordinary least squares has the following problem: the parameter estimates are still unbiased, but they are no longer the minimum-variance linear unbiased estimates. The Durbin-Watson (DW) test is the most common method in statistical analysis for testing first-order autocorrelation of a series. F: Multicollinearity means that, because of exact or high correlation between explanatory variables in a linear regression model, the model estimates are distorted or difficult to estimate accurately. Its effects: (1) Under perfect collinearity the parameter estimator does not exist. (2) Under approximate collinearity the OLS estimator is inefficient: multicollinearity increases the variance of the parameter estimator, and 1/(1-R^2) is the variance inflation factor (VIF). (3) The economic meaning of the parameter estimates becomes unreasonable. (4) Significance tests of the variables lose meaning, and important explanatory variables may be excluded from the model. (5) The predictive function of the model fails.
The larger variance widens the interval of an interval prediction and can make the prediction meaningless. For the linear regression model, when the response variable is normally distributed and the error terms satisfy the Gauss-Markov conditions (zero mean, equal variance, no correlation), the least squares estimate of the regression parameters is the uniformly minimum variance unbiased estimate. Of course, these conditions are only an idealized assumption, made so that a mathematically mature conclusion is available. Most practical problems do not fully satisfy them. Much of the development of linear regression theory consists of new methods for the cases where the idealized conditions fail, such as weighted least squares, ridge estimation, shrinkage estimation, the Box-Cox transformation, and a series of other techniques. In practical work we must go beyond the idealized conditions in books.
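The variance inflation factor 1/(1-R^2) mentioned above can be illustrated for the two-predictor case, where the R^2 of one predictor regressed on the other equals their squared Pearson correlation. The data values and helper names below are made up for the sketch.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is simply r^2."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_nearly_collinear = [1.1, 1.9, 3.2, 3.9, 5.1]  # almost a copy of x1
x2_uncorrelated = [2.0, -1.0, -2.0, -1.0, 2.0]   # zero correlation with x1

print(vif_two_predictors(x1, x2_nearly_collinear))  # large: variance inflated
print(vif_two_predictors(x1, x2_uncorrelated))      # 1.0: no inflation
```

A VIF of 1 means the predictor's coefficient variance is not inflated at all; the closer the predictors are to collinear, the larger the factor becomes.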

192 The main factors that affect the result of a clustering algorithm are: () (Machine learning, ML basics, easy) A. Feature selection B. Pattern similarity measure C. Classification criterion D. Sample quality of the known classes. Correct answer: ABC. @Liu Xuan 320, source of the question and analysis: blog.csdn.net/column/deta… Analysis: D is wrong because clustering groups data that has no class labels and does not use labeled data. See also "Clustering analysis and comparison of various algorithms".

Which of the following is a common time series model? () (Machine learning, ML models, easy) A. RSI B. MACD C. ARMA D. KDJ. The modeling idea of ARMA can be summarized as follows: gradually increase the order of the model and fit higher-order models until a further increase in order no longer significantly reduces the residual variance. The other three are not at the same level. A. The Relative Strength Index (RSI) compares the average closing gain with the average closing loss over a period of time to analyze the intention and strength of buyers and sellers in the market, and thereby judge the future trend. B. MACD (Moving Average Convergence Divergence), built on the construction principle of moving averages, smooths the closing price of a stock and then computes the arithmetic mean; it is a trend indicator. D. The stochastic indicator (KDJ) generally starts from the highest price, lowest price, and last closing price within a specific period (usually 9 days, 9 weeks, etc.) and the proportional relationship among the three to compute the raw stochastic value RSV of the last period. The K, D, and J values are then computed with a smoothed moving average and plotted as curves to judge the stock trend.

194 Which of the following is not an SVM kernel function? () (Machine learning, ML models, easy) A. Polynomial kernel B. Logistic kernel C. Radial basis kernel D. Sigmoid kernel. Correct answer: B. @Liu Xuan 320, source of the question and analysis: blog.csdn.net/column/deta… SVM kernel functions include the linear kernel, polynomial kernel, radial basis kernel, Gaussian kernel, power exponential kernel, Laplace kernel, ANOVA kernel, rational quadratic kernel, multiquadric kernel, inverse multiquadric kernel, and Sigmoid kernel. The definition of a kernel function is not difficult. According to functional analysis, as long as a function $K(x_i, x_j)$ satisfies Mercer's condition, it corresponds to the inner product in some transformed space. Research on which functions are kernels has produced Mercer's theorem and the following commonly used kernels:
(1) Linear kernel: $K(x, x_i) = x \cdot x_i$
(2) Polynomial kernel: $K(x, x_i) = ((x \cdot x_i) + 1)^d$
(3) Radial basis function (RBF) kernel: $K(x, x_i) = \exp(-\|x - x_i\|^2 / \sigma^2)$. The Gaussian radial basis function is a kernel with strong locality, and its extrapolation ability decreases as the parameter $\sigma$ increases. Polynomial kernels have good global properties but poor locality.
(4) Fourier kernel: $K(x, x_i) = \frac{1 - q^2}{2(1 - 2q\cos(x - x_i) + q^2)}$
(5) Spline kernel: $K(x, x_i) = B_{2n+1}(x - x_i)$
(6) Sigmoid kernel: $K(x, x_i) = \tanh(\kappa (x \cdot x_i) - \delta)$
With the Sigmoid kernel, the support vector machine implements a kind of multi-layer perceptron neural network. With the SVM method, the number of hidden layer nodes (which determines the structure of the network) and the weights from input nodes to hidden layer nodes are determined automatically during design (training).
Moreover, the theoretical basis of the support vector machine means that it obtains the global optimum rather than a local minimum, and it also ensures good generalization to unknown samples without overfitting. Selection of the kernel function: when choosing a kernel for a practical problem, the following methods are usually used. The first is to use expert prior knowledge to choose the kernel in advance. The second is cross-validation: try different kernels in turn, and take the kernel with the smallest generalization error as the best one. For example, for the Fourier kernel and the RBF kernel on a function regression problem from signal processing, simulation experiments under the same data conditions show that the SVM with the Fourier kernel has a much smaller error than the SVM with the RBF kernel. The third is the hybrid kernel method proposed by Smits et al., which is the mainstream method for selecting kernels and another pioneering piece of work on how to construct kernels. The basic idea of the hybrid kernel method is that a combination of different kernels has better properties.
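For concreteness, here is a minimal sketch of four of the kernels listed above (linear, polynomial, RBF, Sigmoid) as plain Python functions. The parameter defaults (`d`, `sigma`, `kappa`, `delta`) and the sample vectors are arbitrary illustrative choices, and the RBF form follows the formula in this answer rather than any particular library's parameterization.

```python
import math

def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, d=2):
    return (dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    return math.tanh(kappa * dot(x, z) - delta)

x, z = [1.0, 2.0], [3.0, 0.5]
print(linear_kernel(x, z), polynomial_kernel(x, z))
print(rbf_kernel(x, z), sigmoid_kernel(x, z))
```

The RBF value depends only on the distance between the two points, which is why it is described as local, while the polynomial value grows with the inner product everywhere.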

195 Given the covariance matrix P of a group of data, which of the following statements about principal components is wrong? () (Data mining, DM basics, easy) A. The optimality criterion of principal component analysis is: decompose the data along a set of orthogonal bases so that, when only the same number of components is kept, the truncation error measured by mean square error is minimal. B. After principal component decomposition, the covariance matrix becomes a diagonal matrix. C. Principal component analysis is the K-L transform. D. Analysis: the K-L transform and the PCA transform are different concepts. The PCA transform matrix is the covariance matrix, whereas the K-L transform matrix can be of several kinds (second-moment matrix, covariance matrix, total within-class scatter matrix, and so on). When the K-L transform matrix is the covariance matrix, it is equivalent to PCA.

196 In classification problems we often encounter unequal numbers of positive and negative samples, for example 10W (100,000) positive samples and only 1W (10,000) negative samples. Which of the following are the most appropriate ways to handle this? () (Machine learning, ML basics, easy) A. Repeat the negative samples 10 times to produce 10W samples, shuffle the order, and train the classifier. B. Classify directly, to make maximum use of the data. C. Randomly select 1W of the 10W positive samples for classification. D. Set the weight of each negative sample to 10 and the weight of each positive sample to 1, and use these weights in training. Analysis: 1. Resampling (oversampling). Option A can be seen as a variant of resampling. Changing the data distribution to remove the imbalance may lead to overfitting. 2. Undersampling. Option C improves classification performance on the minority class but may lose important information about the majority class. If 1:10 counts as balanced, the majority class can be split into 1000 parts; each part is combined with the minority samples to train one classifier, and the 1000 classifiers are then combined into a single classifier with an ensemble method. Alternatively, if the goal is for the predicted distribution to match the training distribution, increase the penalty for distribution mismatch. 3. Weight adjustment. Option D is one example. Of course, these are only treatments at the data level; there are also corresponding methods at the algorithm level.
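The three data-level treatments (oversampling, undersampling, weighting) can be sketched as follows. The sample lists are scaled-down stand-ins for the 10W/1W example, and the function names are hypothetical.

```python
import random

def oversample(majority, minority):
    """Option A: repeat the minority class until it matches the majority."""
    repeats = len(majority) // len(minority)
    return list(majority) + list(minority) * repeats

def undersample(majority, minority, seed=0):
    """Option C: keep a random subset of the majority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def class_weights(n_majority, n_minority):
    """Option D: weight each minority sample by the imbalance ratio."""
    return {"majority": 1.0, "minority": n_majority / n_minority}

positives = [("pos", i) for i in range(1000)]  # stand-in for 10W positives
negatives = [("neg", i) for i in range(100)]   # stand-in for 1W negatives

print(len(oversample(positives, negatives)))   # 2000
print(len(undersample(positives, negatives)))  # 200
print(class_weights(len(positives), len(negatives)))
```

Oversampling and weighting keep every majority sample, while undersampling discards most of them, which is the information loss the analysis warns about.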

 

197

In statistical pattern recognition, when the prior probability is unknown, one can use () (Machine learning, ML basics, easy)

A. Minimum loss criterion

B. N-P decision

C. Minimax loss criterion

D. Minimum misjudgment probability criterion

Correct answer :BC

@Liu Xuan 320, source of the question and analysis: blog.csdn.net/column/deta… Option A: the minimum loss criterion requires the prior probability.

Option B: in Bayesian decision making, the prior probability p(y) is either known or unknown. 1. If p(y) is known, the posterior probability can be computed directly with Bayes' formula. 2. If p(y) is unknown, the Neyman-Pearson decision (N-P decision) can be used to compute the decision surface. The Neyman-Pearson decision reduces to finding a threshold a: if p(x|w1)/p(x|w2) > a, then x belongs to w1; if p(x|w1)/p(x|w2) < a, then x belongs to w2.

Option C: the minimax loss criterion mainly addresses the case where the prior probability required by the minimum loss criterion is unknown or difficult to compute.
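The likelihood-ratio rule of the N-P decision can be sketched with two one-dimensional Gaussian class densities. The means, standard deviations, and the threshold value are assumptions for illustration; in a real Neyman-Pearson test the threshold is chosen so that one error rate is held at a fixed level.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio_decide(x, threshold, pdf_w1, pdf_w2):
    """Assign x to w1 when p(x|w1)/p(x|w2) exceeds the threshold, else w2."""
    ratio = pdf_w1(x) / pdf_w2(x)
    return "w1" if ratio > threshold else "w2"

# Hypothetical class-conditional densities; no prior probabilities are used.
pdf_w1 = lambda x: gaussian_pdf(x, mean=0.0, std=1.0)
pdf_w2 = lambda x: gaussian_pdf(x, mean=2.0, std=1.0)

print(likelihood_ratio_decide(0.0, 1.0, pdf_w1, pdf_w2))  # w1
print(likelihood_ratio_decide(2.0, 1.0, pdf_w1, pdf_w2))  # w2
```

Note that only the class-conditional densities appear in the rule, which is why it remains usable when p(y) is unknown.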

198 The algorithm that solves the prediction problem in the hidden Markov model is () (Machine learning, ML models) A. Forward algorithm B. Backward algorithm C. Baum-Welch algorithm D. Viterbi algorithm. Correct answer: D. @Liu Xuan 320, source of the question and analysis: blog.csdn.net/column/deta… A, B: The forward and backward algorithms solve the evaluation problem: given a model, compute the probability of a particular observation sequence, which is used to assess which model best matches the sequence. C: The Baum-Welch algorithm solves the model training problem, that is, parameter estimation. It is an unsupervised training method implemented mainly through EM iterations. D: The Viterbi algorithm solves the problem of finding, given a model and a specific output sequence, the state sequence most likely to have produced that output. For example, inferring the weather (state sequence) from changes in seaweed (output sequence) is a prediction problem; in communications it is called a decoding problem.
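A compact sketch of the Viterbi algorithm on the weather/seaweed example follows. All probabilities below are invented for the illustration and are not taken from the source.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability) for obs."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        layer = {}
        for s in states:
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            layer[s] = (p * emit_p[s][o], back)
        best.append(layer)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for layer in reversed(best[1:]):
        path.append(layer[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

states = ("sunny", "rainy")
start_p = {"sunny": 0.6, "rainy": 0.4}
trans_p = {"sunny": {"sunny": 0.7, "rainy": 0.3},
           "rainy": {"sunny": 0.4, "rainy": 0.6}}
emit_p = {"sunny": {"dry": 0.6, "damp": 0.3, "soggy": 0.1},
          "rainy": {"dry": 0.1, "damp": 0.3, "soggy": 0.6}}

path, prob = viterbi(["dry", "damp", "soggy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

The dynamic program keeps, for every state and time step, only the best path that ends there, so the cost is linear in the sequence length rather than exponential.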

 

199 In general, the k-NN nearest neighbor method works better when () (Machine learning, ML models, easy) A. there are many samples but they are poorly representative B. there are few samples but they are highly representative C. the samples are distributed in clumps D. The k-nearest neighbor algorithm relies mainly on the surrounding points, so with too many samples it becomes hard to tell them apart; therefore B should be chosen. The wording "samples distributed in clumps" is confusing. It presumably means that the whole sample set forms clumps, in which case kNN cannot exploit its advantage of looking at neighbors. Overall, a smaller set of highly representative samples is more suitable.

200 Which of the following methods can be used for feature dimension reduction? () (Deep learning, DL models, easy) A. Principal component analysis (PCA) B. Linear discriminant analysis (LDA) C. Deep learning SparseAutoEncoder D. Matrix singular value decomposition (SVD) E. Least squares (LeastSquares). Dimension reduction by deep learning is relatively new. On reflection it is indeed a dimension reduction method: if the hidden layer has fewer neurons than the input layer, the data is reduced in dimension; if the hidden layer has more neurons than the input layer, it is not. The least squares method solves linear regression; it is in fact a projection, but it does not reduce dimensionality.

201 Which of the following are kernel-based machine learning algorithms? () (Machine learning, ML models, easy) A. Expectation Maximization (EM) B. Radial Basis Function (RBF) C. Linear Discriminant D. Support Vector Machine (SVM). The radial basis kernel is a very common kernel function. The conventional form of principal component analysis is linear, but when a nonlinear problem is encountered, the kernel method can be used to turn the nonlinear problem into a linear one. The kernel function is also very important when support vector machines deal with nonlinear problems.

 

202

Machine learning ML foundation is easy

203

Deep learning DL model

See:Blog.csdn.net/snoopy_yuan…

What is the real meaning of the activation function in a neural network? What properties must an activation function have? What other properties are good but not necessary? (Deep learning, DL basics, medium) @Hengkai Guo, source of the analysis: www.zhihu.com/question/67… Let me explain my understanding of a good activation function. Some points may not be fully rigorous; discussion is welcome. (Partly based on Activation Function.)

1. Nonlinearity: the derivative is not a constant. This condition is mentioned in many earlier answers. It is the basis of multilayer neural networks, ensuring that a multilayer network does not degenerate into a single-layer linear network. This is what the activation function is for.

2. Differentiability almost everywhere: differentiability guarantees that gradients can be computed during optimization. Traditional activation functions such as sigmoid are differentiable everywhere. Piecewise linear functions such as ReLU are only differentiable almost everywhere (that is, non-differentiable at a finite number of points). For SGD, since it almost never converges exactly to a position where the gradient is close to zero, a finite number of non-differentiable points has little effect on the optimization result [1].

3. Simple computation: as the question says, there are many nonlinear functions. At the extreme, a multilayer neural network can itself serve as the nonlinear function, as in Network In Network [2], which treats it as a convolution operation. However, in the forward pass the activation function is evaluated a number of times proportional to the number of neurons, so a simple nonlinear function is naturally better suited. This is one reason why ReLU and similar functions are more popular than activation functions that use operations such as exp.

4. Non-saturation: saturation means that the gradient is close to zero in some intervals (the vanishing gradient), so that the parameters cannot continue to be updated. The classic example is Sigmoid, whose derivative is close to zero for both large positive inputs and large-magnitude negative inputs. A more extreme example is the step function, whose gradient is 0 almost everywhere; it is saturated everywhere and cannot be used as an activation function. The derivative of ReLU is always 1 for x>0, so it does not saturate for arbitrarily large positive values. For x<0, however, its gradient is always 0, so saturation also occurs there (in this case usually called dying ReLU). Leaky ReLU [3] and PReLU [4] were proposed to address this problem.

5. Monotonicity: the sign of the derivative does not change. This holds for most activation functions, with exceptions such as sine and cosine. In my view, monotonicity makes the gradient direction at the activation function less likely to change, which makes training easier to converge.

6. Limited output range: a bounded output makes the network more stable for large inputs, which is why early activation functions, such as Sigmoid and TanH, were mainly of this kind. But it leads to the vanishing gradient problem mentioned above, and forcing the output of each layer into a fixed range limits expressive power. This type of function is therefore used only where a specific output range is required, such as probability outputs (where the log in the loss function can offset the vanishing gradient [1]) and the gate functions in LSTM.

7. Near identity: f(x) is approximately equal to x. The advantage is that the amplitude of the output does not grow significantly as depth increases, which makes the network more stable and lets the gradient propagate back more easily. This is somewhat at odds with nonlinearity, so activation functions usually satisfy it only partially. For example, TanH has a linear region only near the origin (its value is 0 and its derivative is 1 at the origin), while ReLU is linear only for x>0. This property also makes it easier to derive the range of the initialization parameters [5][4]. The idea of an identity transformation has also been borrowed by other network structure designs, such as ResNet [6] in CNNs and LSTM in RNNs.

8. Few parameters: most activation functions have no parameters. A function with a single parameter, such as PReLU, slightly increases the size of the network. Another exception is Maxout [7]. Although it has no parameters of its own, for the same number of output channels a k-way Maxout needs k times as many input channels as other functions, which means k times as many neurons. However, if the number of output channels is not a concern, this activation function can reduce the number of parameters by a factor of k.

9. Normalization: this concept was proposed recently. The corresponding activation function is SELU [8]. The main idea is to normalize the sample distribution automatically toward zero mean and unit variance, to stabilize training. Before this, the idea of normalization had already been used in network architecture design, for example Batch Normalization [9].

References:
[1] Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.
[2] Lin M, Chen Q, Yan S. Network in network[J]. arXiv preprint arXiv:1312.4400, 2013.
[3] Maas A L, Hannun A Y, Ng A Y. Rectifier nonlinearities improve neural network acoustic models[C]//Proc. ICML. 2013, 30(1).
[4] He K, Zhang X, Ren S, et al. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1026-1034.
[5] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010: 249-256.
[6] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
[7] Goodfellow I J, Warde-Farley D, Mirza M, et al. Maxout networks[J]. arXiv preprint arXiv:1302.4389, 2013.
[8] Klambauer G, Unterthiner T, Mayr A, et al. Self-normalizing neural networks[J]. arXiv preprint arXiv:1706.02515, 2017.
[9] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[C]//International Conference on Machine Learning. 2015: 448-456.
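The saturation behaviour described in the answer above can be seen by evaluating the derivatives of a few activation functions at small and large inputs. This is a minimal sketch; the Leaky ReLU slope of 0.01 is an assumed value, and the derivative of ReLU at exactly 0 is set to 0 by convention here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x)); vanishes for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 otherwise (dying ReLU region)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    """Derivative of Leaky ReLU: a small non-zero slope for x <= 0."""
    return 1.0 if x > 0 else slope

for x in (-10.0, 0.5, 10.0):
    print(x, sigmoid_grad(x), relu_grad(x), leaky_relu_grad(x))
```

At x = 10 the sigmoid gradient is already below 1e-4, while ReLU still passes a gradient of 1; at x = -10 ReLU passes nothing and Leaky ReLU keeps a small gradient.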

 

205 Neural networks trained with gradient descent tend to converge to local optima, so why are they so widely used? Deep learning DL basics

@Li Zhenhua,www.zhihu.com/question/68…

It is probably an illusion that deep neural networks “converge easily to local optimum”, but the reality is that we may never find a “local optimum”, let alone a global optimum.

Many people hold the view that "local optima are the main difficulty of neural network optimization". This comes from intuition about one-dimensional optimization problems. In the univariate case, the most obvious difficulty of an optimization problem is that there are many local extrema.

People intuitively imagine that in higher dimensions there are even more of these local extrema, exponentially more, so that reaching the global optimum becomes harder. However, an important difference between the univariate and multivariate cases is that in the univariate case the Hessian matrix has only one eigenvalue, so whether that eigenvalue is positive or negative, a critical point is a local extremum. In the multivariate case the Hessian has several eigenvalues, and their signs can be distributed in more complex ways: the Hessian may be indefinite (both positive and negative eigenvalues) or semidefinite with degenerate (zero) eigenvalues.

In the latter two cases it is difficult to find a local extremum, let alone the global optimum.
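The role of the Hessian eigenvalue signs can be made concrete for two variables. The sketch below classifies a critical point from the entries of a symmetric 2x2 Hessian [[a, b], [b, c]]; the three test functions named in the comments are illustrative choices.

```python
import math

def symmetric_2x2_eigenvalues(a, b, c):
    """Eigenvalues (smaller, larger) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius

def classify_critical_point(a, b, c, tol=1e-12):
    """Classify a critical point by the signs of the Hessian eigenvalues."""
    lo, hi = symmetric_2x2_eigenvalues(a, b, c)
    if lo > tol:
        return "local minimum"
    if hi < -tol:
        return "local maximum"
    if lo < -tol and hi > tol:
        return "saddle point"
    return "degenerate (some eigenvalue is zero)"

print(classify_critical_point(2.0, 0.0, 2.0))   # f = x^2 + y^2 at the origin
print(classify_critical_point(2.0, 0.0, -2.0))  # f = x^2 - y^2 at the origin
print(classify_critical_point(2.0, 0.0, 0.0))   # f = x^2 + y^4 at the origin
```

With one variable there is a single eigenvalue and only the first two outcomes are typical; with many variables, mixed signs become the common case, which is the saddle point situation discussed here.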

Now it seems that the difficulty of neural network training is mainly the saddle point problem. In practice, we probably never really encounter local extremes. Eigenvalues of the Hessian in Deep Learning (arxiv.org/abs/1611.07…

• Training stops at a point that has a small gradient. The norm of the gradient is not zero, therefore it does not, technically speaking, converge to a critical point.
• There are still negative eigenvalues even when they are small in magnitude.

On the other hand, the good news is that even where local extrema exist, the basin of attraction of local extrema with poor loss is very small. Perspective of Loss Landscapes. (arxiv.org/abs/1706.10…

For the landscape of loss function for deep networks, the volume of basin of attraction of good minima dominates over that of poor minima, which guarantees optimization methods with random initialization to converge to good minima.

So it’s possible that we actually stopped training when we didn’t find anything, took it to the test set and said, “Gee, it’s working fine.”

As an aside, these are experimental results. Theoretically, under various assumptions, the number of saddle points in the loss landscape of deep neural networks grows exponentially, while local extrema with poor loss are very few.

 

207 Compare the EM algorithm, HMM, and CRF. Machine learning ML model

These three do not quite belong together, but they are related, so they are discussed together here. Focus on the ideas behind the algorithms. (1) EM algorithm. The EM algorithm is used for maximum likelihood estimation or maximum a posteriori estimation of models with hidden variables. It consists of two steps: the E step (expectation) and the M step (maximization). In essence, EM is an iterative algorithm: it uses the parameters from the previous iteration to estimate the hidden variables, then re-estimates the parameters, and repeats until convergence. Note: EM is sensitive to initial values, and it maximizes the log-likelihood function by repeatedly maximizing a lower bound, so it is not guaranteed to find the global optimum. The derivation of EM should also be mastered. (2) HMM. The hidden Markov model is a generative model for labeling problems. It has the parameters (π, A, B): the initial state probability vector π, the state transition matrix A, and the observation probability matrix B, called the three elements of the hidden Markov model. The three basic problems of the hidden Markov model:

  • Probability calculation problem: given the model and an observation sequence, calculate the probability of the observation sequence under the model. Solved by the forward and backward algorithms.
  • Learning problem: given an observation sequence, estimate the model parameters by maximum likelihood. Solved by Baum-Welch (that is, the EM algorithm) and maximum likelihood estimation.
  • Prediction problem: given the model and an observation sequence, find the corresponding state sequence. Solved by the approximation algorithm (greedy algorithm) and the Viterbi algorithm (dynamic programming to find the optimal path).

(3) Conditional random field (CRF). Given a set of input random variables, a CRF models the conditional probability distribution of another set of output random variables. Conditional random fields assume that the output variables form a Markov random field. The linear-chain conditional random fields we usually see are discriminative models that predict the output from the input. They are fitted by maximum likelihood estimation or regularized maximum likelihood estimation. HMM is always compared with CRF because both rely on graphical models, but CRF is based on Markov random fields (undirected graphs), while HMM is based on Bayesian networks (directed graphs). CRF also has a probability calculation problem, a learning problem, and a prediction problem. The solution methods are broadly similar to those of HMM, except that the learning problem does not require the EM algorithm.

(4) The fundamental difference between HMM and CRF is conceptual: one is a generative model and the other is a discriminative model, which leads to different solution methods.
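The E step and M step described in (1) can be sketched on the simplest hidden-variable model, a mixture of two one-dimensional Gaussians, where the hidden variable is the component that generated each point. The synthetic data, the initial means, and the iteration count are assumptions chosen for the illustration.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(data, means, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture.

    Returns (weights, means, variances, log-likelihood per iteration)."""
    weights, variances = [0.5, 0.5], [1.0, 1.0]
    means = list(means)
    history = []
    for _ in range(n_iter):
        # E step: responsibility of each component for each data point.
        resp, loglik = [], 0.0
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k])
                     for k in range(2)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(loglik)
        # M step: re-estimate parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsed component
    return weights, means, variances, history

rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
        + [rng.gauss(3.0, 1.0) for _ in range(200)])
weights, means, variances, history = em_two_gaussians(data, means=[-1.0, 1.0])
print(means, history[0], history[-1])
```

The recorded log-likelihood never decreases from one iteration to the next, which reflects the lower-bound argument above; a different initialization can still end in a different local optimum.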

Several models commonly used by CNN. Deep learning DL model

Name: Characteristics
LeNet5: Nothing special, but as the first CNN it should be known.
AlexNet: Introduces ReLU and Dropout, data augmentation, and overlapping pooling; three convolutions, one max pooling, plus three fully connected layers.
VGGNet: Uses 1*1 and 3*3 convolution kernels and 2*2 max pooling to make the network deeper. VGGNet-16 and VGGNet-19 are commonly used.
Google Inception Net: Achieves better classification performance while controlling the amount of computation and the number of parameters. Compared with the networks above it has several major improvements: 1. It removes the last fully connected layer and replaces it with global average pooling. 2. It introduces the Inception Module, a structure that combines four branches. All branches use 1*1 convolution, because 1*1 convolution is very cost-effective and achieves a nonlinear feature transformation with very few parameters. Inception V2 replaces every 5*5 convolution with two 3*3 convolutions and introduces the well-known Batch Normalization. Inception V3 goes further, splitting a larger two-dimensional convolution into two smaller one-dimensional convolutions to speed up computation and reduce overfitting, and changes the structure of the Inception Module.
Microsoft ResNet (Residual Neural Network): The second version of ResNet changes the ReLU activation function into the linear function y=x.

 

208 Why can SVM with a kernel classify nonlinear problems? A kernel function is essentially the inner product of two feature vectors; in SVM it corresponds to mapping the inputs into a high-dimensional space. Note that the kernel is not the mapping itself: it only computes the inner product in the mapped space.

Kernel selection: start from the linear kernel. When there are many features there is no need for a Gaussian kernel; choose models from simple to complex.

Kernel condition: the kernel functions we usually talk about are positive definite kernels. K is a positive definite kernel if and only if, for any x_i in X, the Gram matrix corresponding to K is positive semidefinite.

Common kernels: the RBF (radial basis function) kernel depends only on the distance between points, so the Laplace kernel is also a radial basis kernel. The linear kernel is mainly used for the linearly separable case. The polynomial kernel is also commonly used.
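The statement that a kernel is only an inner product in a mapped space can be checked on a tiny case. The sketch below is an illustration, not part of the original answer: it assumes the homogeneous degree-2 polynomial kernel on 2-D inputs, and the names `poly_kernel` and `phi` are made up for this example.

```python
import math

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel K(x, z) = (x . z)^2 (assumed example kernel)."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map for that kernel: 2-D input -> 3-D feature vector."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, z)     # computed in the original 2-D space
mapped_value = dot(phi(x), phi(z))   # computed in the mapped 3-D space
print(kernel_value, mapped_value)
```

Both numbers agree, so the SVM never has to build `phi(x)` explicitly; it only needs the kernel value.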

Bagging and Boosting

(1) Random forest. Random forest mitigates the tendency of decision trees to overfit, mainly through two operations:

1) Bootstrap: training samples are drawn from the data set with replacement.

2) A fixed number of features is randomly selected each time (usually sqrt(n)). For classification, the trees vote and the most frequent class is selected; for regression, the results of the trees are averaged.

Common parameters: the number of trees; the minimum number of samples on a node.
Error analysis: OOB (out-of-bag) error. For each tree, the samples that were not drawn are used as test samples, and the resulting error rate is the estimate.
Advantages: can be computed in parallel; needs no separate feature selection; can report feature importance; can handle missing data; needs no additional test set.
Disadvantages: in regression the output is not truly continuous, since predictions are averages of leaf values and cannot go beyond the range of the training targets.

(2) Boosting: AdaBoost. Boosting is an additive model: it learns multiple classifiers by changing the weights of the training samples and combines them linearly. AdaBoost is additive model + exponential loss function + forward stagewise algorithm. It trains weak classifiers repeatedly, adjusting the weights (the probability distribution) of the data each round so that samples misclassified by the previous weak classifier get a higher weight. Finally, the classifiers vote, with different classifiers carrying different importance.

Boosting: GBDT. GBDT uses CART as the base learner: a binary regression tree for regression and a binary classification tree for classification. As with AdaBoost above, the regression tree uses the squared loss, and the classification problem can be defined with the exponential loss. But what about a general loss function? GBDT (gradient boosting decision tree) solves the optimization problem for general loss functions: it uses the negative gradient of the loss function at the current model as an approximation of the residual in the regression problem and fits the next tree to it. Note: since GBDT is prone to overfitting, the recommended tree depth for GBDT is no more than 6, while for random forest it can be above 15.
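The claim that the negative gradient plays the role of the residual can be verified numerically. This is a minimal sketch under stated assumptions: the squared loss is written as 0.5 * (y - f)^2, the classification loss is the log loss on a raw score f, and all function names are hypothetical.

```python
import math

def squared_loss(y, f):
    return 0.5 * (y - f) ** 2

def log_loss(y, f):
    """Log loss for a label y in {0, 1} and a raw score f."""
    p = 1.0 / (1.0 + math.exp(-f))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def neg_grad_squared(y, f):
    """Negative gradient of the squared loss: exactly the residual y - f."""
    return y - f

def neg_grad_logistic(y, f):
    """Negative gradient of the log loss: label minus predicted probability."""
    return y - 1.0 / (1.0 + math.exp(-f))

def numeric_neg_grad(loss, y, f, eps=1e-6):
    """Central finite-difference estimate of -dL/df, used only as a sanity check."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

print(neg_grad_squared(3.0, 2.5), numeric_neg_grad(squared_loss, 3.0, 2.5))
print(neg_grad_logistic(1.0, 0.3), numeric_neg_grad(log_loss, 1.0, 0.3))
```

For the squared loss the pseudo-residual is the ordinary residual; for other losses it is a generalized residual, which is what each new tree is fitted to.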

(4) Xgboost has the following features:

  • Supports linear classifiers as base learners
  • The loss function can be customized, and second-order derivatives are used
  • Regularization terms are added: the number of leaf nodes and the L2-norm of the score output by each leaf node
  • Supports feature sampling
  • Supports parallelism in certain cases, but only in the tree construction stage: at each node, the search for the best split feature can run in parallel.

Logistic regression related issues

(1) You must be able to derive the formulas.

(2) Basic concepts of logistic regression. This is best analyzed from the perspective of generalized linear models: logistic regression assumes that Y follows a Bernoulli distribution.

(3) L1-norm and L2-norm. Sparsity really calls for the L0-norm, which uses the number of nonzero parameters directly as the regularization term, but that is hard to optimize in practice, so the L1-norm is introduced instead. In essence, the L1-norm assumes a Laplace prior on the parameters, while the L2-norm assumes a Gaussian prior. This is the principle behind the pictures commonly used online to explain the problem. However, L1-norm problems are harder to solve; coordinate descent or least angle regression can be used.
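Why L1 produces sparse weights while L2 does not can be seen on a one-dimensional toy problem. The sketch assumes the simplified objectives 0.5*(w - a)^2 + lam*|w| and 0.5*(w - a)^2 + 0.5*lam*w^2, where `a` stands for the unregularized weight; these objectives and the function names are illustrative assumptions, not the full logistic regression loss.

```python
def l1_solution(a, lam):
    """Minimizer of 0.5*(w - a)^2 + lam*|w|, known as soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero

def l2_solution(a, lam):
    """Minimizer of 0.5*(w - a)^2 + 0.5*lam*w^2: every weight is shrunk, none becomes zero."""
    return a / (1.0 + lam)

unregularized = [3.0, 0.4, -0.2, -1.5]
lam = 0.5
l1_weights = [l1_solution(a, lam) for a in unregularized]
l2_weights = [l2_solution(a, lam) for a in unregularized]
print(l1_weights)
print(l2_weights)
```

The L1 result contains exact zeros for the two small weights, whereas the L2 result only scales all weights down.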

(4) Comparison between LR and SVM

First, the biggest difference between LR and SVM is the loss function: LR uses log loss (also called logistic loss), while SVM uses hinge loss.

Second, both are linear models.

Finally, SVM only considers the support vectors (the few points that determine the classification boundary).
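The two loss functions can be compared directly as functions of the margin m = y * f(x), with y in {-1, +1}. This is a small illustrative sketch; the function names are hypothetical.

```python
import math

def logistic_loss(margin):
    """Log loss written in terms of the margin: log(1 + exp(-m))."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Hinge loss: max(0, 1 - m)."""
    return max(0.0, 1.0 - margin)

for m in (-1.0, 0.0, 1.0, 3.0):
    print(f"margin={m:+.1f}  logistic={logistic_loss(m):.4f}  hinge={hinge_loss(m):.4f}")
```

The hinge loss is exactly zero once the margin reaches 1, so points far from the boundary stop influencing the solution, which matches the remark about support vectors. The logistic loss stays positive for every point.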

(5) Difference between LR and random forest

Random forest and other tree algorithms are nonlinear, while LR is linear. LR focuses more on global optimization, while tree model is mainly local optimization.

(6) Common optimization methods

Logistic regression has no closed-form solution, so iterative methods such as gradient descent are used.

First-order methods: gradient descent, stochastic gradient descent, mini-batch gradient descent. Stochastic gradient descent is not only faster per update than batch gradient descent, but its noise also helps, to some extent, to escape poor local optima.

Second-order methods: Newton's method and quasi-Newton methods:

Here is a description of the basic principle of Newton's method and how it is applied. For root finding, Newton's method repeatedly moves to the point where the tangent line at the current point crosses the x-axis, until it reaches the point where the curve itself crosses the x-axis, which is the solution of the equation. In practice we usually want to solve a convex optimization problem, that is, to find the point where the first derivative of the function is 0, and Newton's method can be applied to that equation. It picks a starting point, takes a second-order Taylor expansion there, moves to the point where the derivative of the expansion is 0, and repeats until the stopping requirement is met. Because it uses second-order information, it converges faster than first-order methods. When x is a multidimensional vector, the second derivative becomes the Hessian matrix (the matrix of second partial derivatives with respect to x). Disadvantages: the basic Newton iteration has a fixed step with no step-size factor, so it cannot guarantee that the function value decreases steadily, and it can even fail in bad cases. Newton's method also requires the function to be twice differentiable, and computing the inverse of the Hessian matrix is expensive.
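The iteration described above can be sketched in one dimension, where the Hessian is just the second derivative. The example function f(x) = e^x - 2x is an assumption chosen for illustration because it is convex and its minimizer ln 2 is known; `newton_minimize` is a hypothetical helper name.

```python
import math

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Find a stationary point by iterating x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)  # full Newton step, no step-size factor
        x -= step
        if abs(step) < tol:
            break
    return x

# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x)
x_star = newton_minimize(lambda x: math.exp(x) - 2.0, lambda x: math.exp(x), x0=0.0)
print(x_star, math.log(2.0))
```

On this well-behaved function a handful of iterations is enough. On non-convex functions the same full step can overshoot, which is the instability mentioned in the text.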

Quasi-Newton methods: methods that build a positive definite symmetric approximation of the Hessian matrix without computing second partial derivatives are called quasi-Newton methods. The idea is to approximate the Hessian matrix or its inverse in a way that satisfies the quasi-Newton condition. The main ones are DFP (approximates the inverse of the Hessian), BFGS (approximates the Hessian directly), and L-BFGS (reduces the storage required by BFGS).

 

209 Demonstrate the principle of Dropout using Bayesian probability

Source: @ xu han, zhuanlan.zhihu.com/p/25005808

Dropout as a Bayesian Approximation: Insights and Applications

(mlg.eng.cam.ac.uk/yarin/PDFs/…

Why do many face recognition papers end with a locally connected convolution layer?

Source: @ xu han, zhuanlan.zhihu.com/p/25005808

Take Facebook's DeepFace as an example:

DeepFace first applies two ordinary (weight-sharing) convolutions plus one pooling layer to extract low-level edge/texture features, followed by three locally connected convolution (Local-Conv) layers. Local-Conv is used because different regions of a face have different features (the positions of the eyes, nose, and mouth are relatively fixed). When local features are not distributed identically across the whole image, Local-Conv is better suited to feature extraction.

 

210 What is collinearity and what does it have to do with overfitting?

@ abstract monkey, source: www.zhihu.com/question/41…

Collinearity: in multivariable linear regression, high correlation between variables makes the regression estimates inaccurate.

Collinearity creates redundancy and leads to overfitting.

Solution: remove the correlated variables or add weight regularization.

 

211 Why can poor local optima be avoided when the network is deep enough?

See also: The Loss Surfaces of Multilayer Networks (arxiv.org/pdf/1412.02…

 

212 Positive and negative samples in machine learning

For classification problems this is relatively easy to understand. Take face recognition: the positive samples are simply face images, while the choice of negative samples depends on the scenario. If you want to recognize the faces of students in a classroom, the negative samples are the classroom's windows, walls, and so on. In other words, negative samples should not be random scenes unrelated to the problem you are studying; such negatives are meaningless. Negative samples can be generated from the background, so sometimes there is no need to look for additional ones. In general, 3,000-10,000 positive samples need 5,000,000-100,000,000 negative samples for learning. In internet finance, the ratio of positive to negative samples is generally adjusted to between 3:1 and 5:1 by sampling before modeling.

 

213 What are the engineering methods for feature selection in machine learning?

Data and features determine the upper limit of machine learning; models and algorithms only approach that limit.

1. Compute the correlation between each feature and the response variable: common engineering tools are the Pearson coefficient and the mutual information coefficient. The Pearson coefficient only measures linear correlation, while the mutual information coefficient measures many kinds of correlation well but is more expensive to compute. Fortunately, many toolkits include it (such as MINE). Once you have the correlations, you can rank the features and select from them;

2. Build a model on each single feature and rank the features by model accuracy;

3. Select features by L1 regularization: L1 regularization yields sparse solutions, so it naturally performs feature selection. Note, however, that a feature not selected by L1 is not necessarily unimportant, because of two highly correlated features only one may be kept. To determine which feature is important, cross-check with L2 regularization;

4. Train a pre-selection model that can score features: RandomForest and Logistic Regression can both score features; train the final model after obtaining the scores;

5. Select features after feature combination: for example, combine the user id with user features to obtain a larger feature set, then select from it. This practice is common in recommendation and advertising systems and is the main source of the so-called hundred-million or even billion-level feature sets. The reason is that user data is sparse, and combined features can serve both the global model and the personalized model. This topic can be discussed in more detail another time.

6. Feature selection through deep learning: this approach is becoming common with the popularity of deep learning, especially in computer vision, because deep learning can learn features automatically, which is also why it is sometimes called unsupervised feature learning. After taking the features of some layer from a deep learning model, they can be used to train the final target model.
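The remark in method 1 that the Pearson coefficient only measures linear correlation can be illustrated with two synthetic features. The data below is made up for this sketch: one target depends linearly on x, the other depends on x only through x squared.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [x / 10.0 for x in range(-20, 21)]   # symmetric around 0
linear = [3.0 * x + 1.0 for x in xs]      # linear dependence
quadratic = [x * x for x in xs]           # strong but nonlinear dependence

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(r_linear, r_quadratic)
```

The quadratic target is fully determined by x, yet its Pearson coefficient is essentially zero, so ranking features by Pearson alone would discard it. That is the gap a mutual-information-style measure is meant to cover.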

 

214 The best method to detect outliers in an n-dimensional space is () (machine learning, ML basics, easy)
A.
B. Box plot
C. Mahalanobis distance
D. Scatter plot
Answer: C

Mahalanobis distance is a statistical method for measuring multivariate outlier points based on the Chi square distribution.

Given m sample vectors X1~Xm with covariance matrix S and mean vector μ, the Mahalanobis distance between a sample vector X and μ is:

D(X) = sqrt((X - μ)^T S^(-1) (X - μ))

(Each element of the covariance matrix is the covariance Cov(X, Y) between two vector components, Cov(X, Y) = E{[X - E(X)][Y - E(Y)]}, where E is the mathematical expectation.)

The Mahalanobis distance between vectors Xi and Xj is defined as:

D(Xi, Xj) = sqrt((Xi - Xj)^T S^(-1) (Xi - Xj))

If the covariance matrix is the identity matrix (the components of the sample vectors are independent and identically distributed), the formula becomes:

D(Xi, Xj) = sqrt((Xi - Xj)^T (Xi - Xj))

That's the Euclidean distance.

If the covariance matrix is diagonal, the formula becomes the normalized Euclidean distance.

(2) Advantages and disadvantages of Mahalanobis distance: dimensionless, excluding the interference of correlation between variables.

See more here and “Various Distances”.
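A small two-dimensional sketch shows how the Mahalanobis distance accounts for correlation. The covariance matrix and the test points are assumed values chosen for illustration, and `mahalanobis_2d` is a hypothetical helper that inverts the 2x2 matrix by hand.

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance sqrt((x - mu)^T S^-1 (x - mu)) for 2-D vectors."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # inverse of a 2x2 matrix
    dx = (x[0] - mu[0], x[1] - mu[1])
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

mu = (0.0, 0.0)
identity = ((1.0, 0.0), (0.0, 1.0))
correlated = ((1.0, 0.9), (0.9, 1.0))  # the two variables move together

euclid_case = mahalanobis_2d((3.0, 4.0), mu, identity)    # reduces to Euclidean distance
on_trend = mahalanobis_2d((1.0, 1.0), mu, correlated)     # follows the correlation
off_trend = mahalanobis_2d((1.0, -1.0), mu, correlated)   # violates the correlation
print(euclid_case, on_trend, off_trend)
```

The two test points are equally far from the mean in Euclidean terms, but the one that contradicts the correlation gets a much larger Mahalanobis distance, which is why the measure is useful for multivariate outlier detection.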

 

215 What is the difference between logistic regression and general regression analysis?
A. Logistic regression is designed to predict the likelihood of events
B. Logistic regression can be used to measure how well the model fits
C. Logistic regression can be used to estimate regression coefficients
D.
A: This is mentioned in this article. Logistic regression is in fact designed to solve classification problems. B: Logistic regression can be used to test how well the model fits the data. C: Although logistic regression is used to solve classification problems, once the model is built the regression coefficients can be estimated from the independent features. As far as I'm concerned, this is only an estimate of the regression coefficients and cannot be used directly as a regression model.

 

216 What does Bootstrap data mean? (Bootstrap and Boosting; machine learning, ML model)
A. Sample m features from a total of M features with replacement
B. Sample m features from a total of M features without replacement
C. Sample n samples from a total of N samples with replacement
D. Sample n samples from a total of N samples without replacement
Bootstrap literally refers to pulling oneself up by one's bootstraps. It samples the samples (rather than the features) with replacement, drawing as many times as there are samples. Because of this random sampling, the final sample, after removing duplicates, covers about 1 - 1/e (roughly 63.2%) of the original samples.
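The share of distinct samples in a bootstrap draw can be checked by simulation. This is a minimal sketch: the sample size and the random seed are arbitrary choices, and `unique_fraction` is a hypothetical helper name.

```python
import math
import random

def unique_fraction(n, seed=0):
    """Draw n indices from range(n) with replacement and return the share of distinct indices."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

n = 10000
simulated = unique_fraction(n)
expected = 1.0 - (1.0 - 1.0 / n) ** n   # exact expectation for finite n
limit = 1.0 - 1.0 / math.e              # limit as n grows
print(simulated, expected, limit)
```

The remaining share, about 1/e of the data, is never drawn for a given tree. Those are the out-of-bag samples that random forest uses for its error estimate.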

 

217 "Overfitting occurs only in supervised learning, not in unsupervised learning." This statement is () (machine learning, ML basics)
We can evaluate unsupervised learning methods using metrics designed for unsupervised learning. For example, a clustering model can be evaluated with the adjusted Rand score.

 

218 () (machine learning, ML basics)
A. A larger k is not necessarily better, and choosing a larger k increases evaluation time
B. A larger k gives a smaller bias (because the training set is closer to the full data set)
C. When choosing k, the variance between data sets should be minimized
D. All of the above
The larger k is, the smaller the bias and the longer the training time. During training, the principle of keeping the variance between data sets small should also be considered. For example, for a binary classification problem with 2-fold cross-validation, if all the data in the test set belong to class A and all the data in the training set belong to class B, the test result will obviously be poor. If you don't understand the concepts of bias and variance, be sure to refer to the following links: Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning; Understanding the Bias-Variance Tradeoff

 

219 There is multicollinearity in a regression model. How do you solve this problem? (machine learning, ML model)
1. Remove both collinear variables
2. Remove one of the collinear variables first
3. Compute the VIF (variance inflation factor) and take corresponding measures
4. To avoid losing information, use regularization methods such as ridge regression and lasso regression
A. 1
B. 2
C. 2 and 3
D. 2, 3 and 4
Answer: D
To deal with multicollinearity, you can use the correlation matrix to remove variables with a correlation above 75% (a somewhat subjective threshold). VIF can also be used: a VIF <= 4 indicates that the correlation is not very high, and a VIF >= 10 indicates high correlation. We can also use ridge regression or lasso regression, which add a penalty term. We can also add random noise to some variables to make them differ from each other, but this should be used carefully because it may hurt prediction.
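The VIF thresholds above can be tried on a toy case. The sketch assumes only two predictors, where regressing one on the other gives R^2 equal to the squared Pearson correlation, so VIF = 1 / (1 - r^2). The data and helper names are made up for illustration.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with exactly two predictors R^2 is the squared correlation."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [float(i) for i in range(1, 11)]
wiggle = [(-1.0) ** i for i in range(10)]                  # alternating +1, -1 pattern
x2_collinear = [a + 0.1 * w for a, w in zip(x1, wiggle)]   # almost a copy of x1

vif_collinear = vif_two_predictors(x1, x2_collinear)
vif_unrelated = vif_two_predictors(x1, wiggle)
print(vif_collinear, vif_unrelated)
```

The near-copy of x1 produces a VIF far above 10, while the unrelated pattern stays close to 1. With more than two predictors, R^2 would come from regressing each predictor on all the others.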

 

220 What does high bias in a model mean, and how can we reduce it?
A. Reduce the features in the feature space
B. Increase the features in the feature space
C.
Answer: B
High bias indicates that the model is too simple and the data dimension is not enough to predict the data accurately, so raise the dimension.

 

221 When training a decision tree and splitting on attribute nodes, which attribute in the figure below has the maximum information gain? () (machine learning, ML model, easy)



A. Outlook

B. Humidity

C. Windy

D. Temperature

Answer: A. Information gain increases the average purity of the subsets. For details, see the following links:

A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)

Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio

 

222 For information gain when splitting decision tree nodes, which of the following statements are true? () (machine learning, ML model, easy)
1. Nodes with high purity need more information to distinguish
2. Information gain can be obtained as "1 bit minus entropy"
3. If an attribute with many category values is selected, the information gain is biased
A. 1
B. 2
C. 2 and 3
D. All of the above
Answer: C
For details, see the following links: A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python); Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
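Entropy and information gain are easy to compute by hand or in a few lines. The figure for question 221 is not available here, so the sketch below assumes the classic 14-row "play tennis" data set, whose attributes match the answer options (Outlook, Humidity, Windy, Temperature). The class counts are that data set's (yes, no) counts, not values taken from this document.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class-count tuple."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = (9, 5)  # assumed: 9 "yes" and 5 "no" rows
splits = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],
    "Humidity": [(3, 4), (6, 1)],
    "Windy": [(6, 2), (3, 3)],
    "Temperature": [(2, 2), (4, 2), (3, 1)],
}
gains = {name: information_gain(parent, children) for name, children in splits.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

Under that assumption Outlook has the largest gain, consistent with answer A. A pure child node such as (4, 0) contributes zero entropy, which is why high-purity nodes need no further information.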

 

223 If the SVM model underfits, which of the following can improve the model? () (machine learning, ML model)

A. Increase the value of penalty parameter C

B. Reduce the value of penalty parameter C

C. Reduce the kernel coefficient (Gamma parameter)

@David 9, nooverfit.com/wp/12-%E6%9…

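Answer: A. If the SVM model underfits, we can increase the value of parameter C to increase the complexity of the model. In LibSVM, the objective function of SVM (C-SVC) is:

min over w, b, ξ of (1/2) w^T w + C Σ_i ξ_i
subject to y_i (w^T φ(x_i) + b) >= 1 - ξ_i, ξ_i >= 0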



The gamma parameter comes with the radial basis function when it is selected as the kernel. It implicitly determines the distribution of the data after mapping to the new feature space.

The parameter gamma is independent of the parameter C. The higher the parameter gamma is, the more complex the model is.

 

224 The figure below shows the same SVM model with different gamma parameters of the radial basis kernel, g1, g2, and g3 in order. Which of the following orderings is correct?



A. g1 > g2 > g3

B. g1 = g2 = g3

C. g1 < g2 < g3

D. g1 >= g2 >= g3

E. g1 <= g2 <= g3

Answer: C

 

225 Suppose we want to solve a binary classification problem and have built a model whose output is 0 or 1. The initial threshold is 0.5: a probability estimate above 0.5 is classified as 1, otherwise as 0. If we now use a threshold greater than 0.5, which statements about the model are true?
1. The recall of the model will decrease or remain unchanged
2. The recall of the model will increase
3. The precision of the model will increase or remain unchanged
4. The precision of the model will decrease
A. 1
B. 2
C. 1 and 3
D. 2 and 4
E.
Answer: C
This article describes the influence of thresholds on precision and recall: Confidence Splitting Criterions Can Improve Precision And Recall in Random Forest Classifiers

 

226 99% of people don't click and 1% of people click, so it is a very lopsided data set. Suppose we have built a classification model with 99% prediction accuracy. We can conclude that:
A. The prediction accuracy of the model is very high, so we don't need to do anything more
B. The prediction accuracy of the model is not high, so we need to do something to improve the model
Answer: B
99% accuracy may only mean that you are accurate in predicting the people who don't click (99% of people don't click, which is easy to predict). It doesn't mean that the model is accurate for the people who do click, so for an unbalanced data set like this we should focus on the small part of the data, the people who click. Please refer to this article for details
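Both points, the effect of the threshold and the weakness of accuracy on unbalanced data, can be reproduced with a few hypothetical predictions. The labels and scores below are invented for illustration. Returning 0.0 for precision when nothing is predicted positive is a convention chosen for this sketch.

```python
def metrics(labels, scores, threshold):
    """Return (accuracy, precision, recall); a score above the threshold predicts class 1."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = len(labels) - tp - fp - fn
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical scores: raising the threshold from 0.5 to 0.7
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.8, 0.65, 0.6, 0.55, 0.4, 0.3, 0.2]
at_05 = metrics(labels, scores, 0.5)
at_07 = metrics(labels, scores, 0.7)

# Hypothetical click data: 1 clicker in 100, model always predicts "no click"
rare_labels = [1] + [0] * 99
always_negative = metrics(rare_labels, [0.0] * 100, 0.5)
print(at_05, at_07, always_negative)
```

With the higher threshold, recall drops from 0.75 to 0.5 on this data. The always-negative model reaches 99% accuracy with zero recall on the clickers, which is why accuracy alone says little on such data.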

 

227 A KNN algorithm with k = 1 is used for the binary classification problem in the figure below, where "+" and "o" represent the two classes. What is the error rate of leave-one-out cross-validation?



A. 0%

B. 100%

C. Between 0% and 100%

[D]

The KNN algorithm looks at the k samples nearest to a given sample and, if most of them belong to class A, classifies that sample as class A. Obviously, KNN with k = 1 is not a good choice for the figure above: the leave-one-out classification error rate is always 100%.
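Leave-one-out cross-validation with 1-NN is short enough to write out. The original figure is not available, so the point sets below are hypothetical: in the first, every point's nearest neighbour belongs to the other class, which is the kind of layout that yields a 100% error rate.

```python
def loocv_error_1nn(points):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier on (x, y, label) points."""
    errors = 0
    for i, (xi, yi, label) in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]  # hold out point i
        nearest = min(others, key=lambda p: (p[0] - xi) ** 2 + (p[1] - yi) ** 2)
        if nearest[2] != label:
            errors += 1
    return errors / len(points)

# Hypothetical layout: "+" and "o" always sit next to each other in pairs
interleaved = [(0.0, 0.0, "+"), (1.0, 0.0, "o"), (5.0, 0.0, "+"),
               (6.0, 0.0, "o"), (0.0, 5.0, "o"), (1.0, 5.0, "+")]
# Hypothetical layout: two well-separated clusters
clustered = [(0.0, 0.0, "+"), (0.0, 1.0, "+"), (10.0, 0.0, "o"), (10.0, 1.0, "o")]

print(loocv_error_1nn(interleaved), loocv_error_1nn(clustered))
```

The result depends entirely on the geometry: the interleaved layout is always wrong, the clustered one is always right.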

 

228 We want to train a decision tree on a large data set. To use less time, we can:
A. Increase the depth of the tree
B. Increase the learning rate
C. Reduce the depth of the tree
D. Reduce the number of trees
Answer: C
Increasing the depth of the tree causes nodes to keep splitting until the leaf nodes are pure, so increasing the depth increases training time.

Decision trees have no learning rate parameter to tune (unlike ensemble learning and other step-size-based learning methods).

A decision tree is a single tree, not a random forest.

 

229 For neural networks, which of the following statements are correct?
1. Increasing the number of layers may increase the classification error rate on the test set
2.
3. Increasing the number of layers always reduces the classification error rate on the training set
A. 1
B. 1 and 3
C. 1 and 2
D. 2
Answer: A
The success of deep neural networks has shown that increasing the number of layers can improve the generalization ability of the model, that is, both the training set and the test set perform better. But more layers do not necessarily guarantee better performance (arxiv.org/pdf/1512.03…

 

230 If we use the objective function of the linearly non-separable SVM as the optimization target, how can we ensure that the model is linearly separable?
A. C = 1
B. C = 0
C. C = infinity
D. all linear inseparability is tolerable

231 After training the SVM model, we can discard the samples that are not support vectors and still classify:
A. Right
B. Wrong

In the SVM model, it is the support vectors that really affect the decision boundary

Which of the following algorithms can be constructed with neural networks?
1. KNN
2. Linear regression
3. Logistic regression
A. 1 and 2
B. 2 and 3
C. 1, 2 and 3
D.
1. The KNN algorithm has no parameters to train, while every neural network needs to train parameters, so a neural network cannot help here. 2. The simplest neural network, the perceptron, in effect trains a linear regression. 3. Logistic regression can be constructed with a single-layer neural network.

 

232 Please select the data sets below to which the hidden Markov model (HMM) can be applied:
A.
B. Movie browsing
C. Stock market
D.
Answer: D. If you have a problem involving time series, try HMM.

 

233 We build a machine learning model with 5000 features and 1 million data points. How can we handle training on such big data effectively?
A. Randomly select some samples and train on this smaller sample
B. Try an online machine learning algorithm
C. Apply PCA to reduce the dimension and the number of features
D. B and C
E. A and B
F. All of the above
Answer: F

 

234 We want to reduce the number of features in the data set, that is, reduce its dimension. Choose the appropriate options:

1. Use forward feature selection
2. Use backward feature elimination
3. First train a model on all the features and measure its performance on the test set. Then remove one feature, retrain, and use cross-validation to check performance on the test set
A. 1 and 2
B. 2, 3 and 4
C. 1, 2 and 4
D. All
Answer: D
1. Forward feature selection and backward feature elimination are common feature selection methods. 2. If they are not practical on big data, the third method can be used. 3. Using a correlation measure to remove redundant features is also a good idea. So D is correct.

 

235 For random forest and Gradient Boosting Trees, which of the following statements are true?
1. In random forest the individual trees depend on each other, while in Gradient Boosting Trees the individual trees are independent
2. Both models use random feature subsets to generate many individual trees
3. We can generate the individual trees of Gradient Boosting Trees in parallel, because there is no dependence between them
4. A trained Gradient Boosting Trees model always performs better than random forest
1. Random forest is based on Bagging, while Gradient Boosting Trees are based on Boosting. In random forest there is no dependence between the individual trees, while in Gradient Boosting Trees the individual trees depend on each other. 2. Both models use random feature subsets to generate many individual trees. So A is correct.

 

236 "For PCA-transformed features, the independence assumption of naive Bayes always holds, because all principal components are orthogonal." This statement is:
This statement is false. First, "independent" and "uncorrelated" are two different things; second, the transformed features can also be correlated.

 

237 Which of the following is true about PCA?
1. We must normalize the data before using PCA
2. We should choose the principal components with the largest variance
3. We should choose the principal components with the smallest variance
A. 1, 2 and 4
B. 2 and 4
C. 3 and 4
D. 1 and 3
E. 1, 3 and 4
Answer: A
1) PCA is sensitive to the scale of the data. For example, if the unit changes from km to cm, the change of scale may strongly affect the final PCA result (turning a less important component into a very important one). 2) We should always choose the principal components with the largest variance. 3) Sometimes, to plot data in low dimensions, we need PCA for dimensionality reduction.

 

238 What is the best number of principal components for the figure below?

A. 7

B. 30

C. 35

D. Can ‘t Say

Answer: B

  • Choose principal components so that the retained variance is as large as possible; given that, the fewer principal components the better.

 

239 Data scientists may use multiple algorithms (models) to make predictions at the same time, and eventually combine the results of these algorithms for the final prediction (ensemble learning). The following is correct for ensemble learning:

A. There is high correlation between individual models

B. There is low correlation between individual models

C. It is better to use a "weighted average" rather than "voting" in ensemble learning

D. Every individual model uses the same algorithm

Answer: B

  • Please refer to the following article for details:

    • Basics of Ensemble Learning Explained in Simple English
    • Kaggle Ensemble Guide
    • 5 Easy questions on Ensemble Modeling everyone should know

 

240 How can we use clustering methods in supervised learning?

1. We can first create clusters, and then apply supervised learning separately on each cluster

2. We can use the cluster "category ID" as a new feature, and then apply supervised learning

3. We cannot create clusters before supervised learning

4. We cannot use the cluster "category ID" as a new feature and then apply supervised learning

A. 2 and 4

B. 1 and 2

C. 3 and 4

D. 1 and 3

Answer: B

We can build different models for each cluster to improve the accuracy of prediction.

“Category ID” is trained as a feature item, which can effectively summarize the data characteristics.

So B is correct

 

241 Which of the following statements are true?

1. A machine learning model with high accuracy always means the classifier is good

2. If model complexity is increased, the test error rate of the model will always decrease

3. If model complexity is increased, the training error rate of the model will always decrease

A. 1

B. 2

C. 3

D. 1 and 3

Answer: C

This question tests the concepts of overfitting and underfitting.

 

242 For the Gradient Boosting tree algorithm, which of the following statements are correct?

1. Increasing the minimum number of samples required to split a node helps resist overfitting

2. Increasing the minimum number of samples required to split a node causes overfitting

3. Reducing the number of samples used to train each individual learner reduces variance

4. Reducing the number of samples used to train each individual learner reduces bias

A. 2 and 4

B. 2 and 3

C. 1 and 3

D. 1 and 4

Answer: C

  • The minimum number of samples required to split is a parameter used to control overfitting. Too high a value leads to underfitting, so it should be tuned with cross-validation.
  • The second point is based on the concepts of bias and variance.

 

243 Which of the following figures shows the decision boundary of the KNN algorithm?

A) B

B) A

C) D

D) C

E) None of them

Answer: B

The decision boundary of the KNN algorithm is generally not linear, so the straight boundaries can be ruled out. In addition, the algorithm classifies a point by looking at the classes of its k nearest samples, so the boundary must be bumpy.

 

244 If a trained model is 100% accurate on a test set, does that mean it will perform equally well on a new data set? :

A. Yes, this means the model generalizes well enough to support new data sets

B. No, there are still other factors that the model does not take into account, such as noise data

No model can always adapt to new data. We can’t be 100% accurate.

 

245 Consider the following cross-validation methods:

i. Bootstrap with replacement

ii. Leave-one-out cross-validation

iii. 5-fold cross-validation

iv. 5-fold cross-validation repeated twice

For a sample size of 1000, which ordering of execution time is correct?

A. i > ii > iii > iv

B. ii > iv > iii > i

C. iv > i > ii > iii

D. ii > iii > iv > i

Answer: B

  • The Bootstrap method is a traditional validation method with random sampling and a single validation. It only needs to train the model once, so it takes the least time.
  • Leave-one-out cross-validation requires n training runs (n is the number of samples). In this case, 1000 models are trained.
  • 5-fold cross-validation requires training 5 models.
  • 5-fold cross-validation repeated twice requires training 10 models.

So B is correct
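The ordering follows directly from counting how many models each scheme trains. A minimal sketch that takes the counts stated above as given; the dictionary keys are just labels for this example.

```python
def models_trained(n_samples):
    """Number of model fits per validation scheme, as described in the answer."""
    return {
        "i. bootstrap": 1,
        "ii. leave-one-out": n_samples,           # one fit per held-out sample
        "iii. 5-fold": 5,
        "iv. 5-fold repeated twice": 2 * 5,
    }

counts = models_trained(1000)
ranking = sorted(counts, key=counts.get, reverse=True)  # slowest first
print(ranking)
```

Sorting by the number of fits gives ii > iv > iii > i, assuming each fit costs roughly the same.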

 

246 Variable selection is used to select the best subset of predictors. If we care about model efficiency, what should we consider when selecting variables?

1. Whether multiple variables serve the same purpose
2. How useful a variable is for interpreting the model
3. The information carried by the features
4. Cross-validation

A. 1 and 4

B. 1, 2 and 3

C. 1, 3 and 4

D. All of the above

Answer: C

Note that this question is about model efficiency, so statement 2 (interpretability) is not considered.

 

247 For a linear regression model, when additional variables are included, which of the following may be true:

1. R-squared and Adjusted R-squared both increase

2. R-squared stays constant and Adjusted R-squared increases

3. R-squared decreases and Adjusted R-squared also decreases

4. R-squared decreases and Adjusted R-squared increases

A. 1 and 2

B. 1 and 3

C. 2 and 4

D. None of the above

Answer: D

R-squared cannot tell you whether the coefficient estimates and predictions are biased, which is why we examine the residuals. R-squared also has problems that Adjusted R-squared and Predicted R-squared are designed to address: every time you add a predictor to the model, R-squared increases or stays the same.

Please refer to this link: Discussion for details.
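To make the relationship concrete, here is a small sketch using the standard adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The sample size and R-squared values below are made-up numbers chosen purely for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Standard adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

n = 50  # illustrative sample size (assumption)

base = adjusted_r2(0.80, n, n_predictors=3)
# Add a useless predictor: R-squared stays at 0.80, the penalty grows.
same_r2 = adjusted_r2(0.80, n, n_predictors=4)
# Add a useful predictor: R-squared rises enough to beat the penalty.
better_r2 = adjusted_r2(0.85, n, n_predictors=4)

print(round(base, 4), round(same_r2, 4), round(better_r2, 4))
```

With R-squared unchanged, the adjusted value falls when a predictor is added, so statement 2 cannot hold; only a sufficiently large gain in R-squared raises the adjusted value.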

 

248 For the training of the following three models, the following statement is correct:

1. The training error in the first picture is the largest compared with the other two pictures. 2. The second graph is more robust than the first and third, and is the best model among the three. 4. 5. The three graphs are the same because we haven’t tested the data set yet

A. 1 and 3

B. 1 and 3

C. 1, 3 and 4

D. 5

Answer: C

 

249 Which of the following assumptions should we make about linear regression? :

2. Linear regression requires all variables to follow a normal distribution.

3. Linear regression assumes there is no multicollinearity in the data.

A. 1 and 2

B. 2 and 3

C. 1, 2 and 3

D. None of the above

Answer: D

  • Statement 1 is correct.
  • Statement 2: normality is not required, although it helps if the variables are normally distributed.
  • Statement 3: a small amount of multicollinearity is acceptable, but it should be avoided where possible.

 

250 When we build linear models, we pay attention to the correlations between variables. Suppose that, looking up the correlation matrix, we find that the correlation coefficients of three pairs of variables (Var1 and Var2, Var2 and Var3, Var3 and Var1) are -0.98, 0.45 and 1.23 respectively. What conclusions can we draw:

1. Var1 and Var2 are highly correlated.

2. Since Var1 and Var2 are highly correlated, we can remove one of them.

3. A correlation coefficient of 1.23 between Var3 and Var1 is impossible.

A. 1 and 3

B. 1 and 2

C. 1, 2 and 3

D. 1

Answer: C

  • The correlation coefficient between Var1 and Var2 is strongly negative (-0.98), which indicates multicollinearity, so we can consider removing one of them.
  • In general, a correlation coefficient greater than 0.7 or less than -0.7 indicates high correlation.
  • The correlation coefficient must lie in the range [-1, 1].

 

251 On a highly nonlinear and complex set of variables, a tree model may work better than an ordinary regression model. True or false:

A. True

B. False

Answer: A

 

252 For very low-dimensional features, should you choose a linear or a nonlinear classifier?

A nonlinear classifier. In a low-dimensional space many features may be mixed together, making the data linearly inseparable.

1. If the number of features is large, comparable to the number of samples, use LR or a linear-kernel SVM.

2. If the number of features is small and the number of samples is moderate, use an SVM with a Gaussian kernel.

3. If the number of features is small and the number of samples is large, manually add features so that it becomes the first case.

 

253 How to handle missing values in feature vectors

1. If a feature has many missing values, simply discard it; otherwise it may introduce too much noise and hurt the results.

2. If there are few missing values, with the remaining features each missing less than 10% of their values, they can be handled in several ways:

1) Treat NaN as a feature value in its own right, for example by representing it as 0;

2) Fill with the mean;

3) Predict the missing values with an algorithm such as random forest.
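As an illustration of option 2), filling with the mean, here is a minimal sketch in plain Python. The helper names `missing_ratio` and `impute_mean` and the sample column are made up for this example; a real pipeline would typically rely on a data-frame library instead.

```python
import math

def missing_ratio(column):
    """Fraction of entries in the column that are NaN."""
    return sum(math.isnan(v) for v in column) / len(column)

def impute_mean(column):
    """Replace NaN entries with the mean of the observed entries.

    Assumes at least one observed (non-NaN) value exists.
    """
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

ages = [22.0, float("nan"), 30.0, 26.0]  # made-up feature column
ratio = missing_ratio(ages)
filled = impute_mean(ages)
print(ratio, filled)
```

The ratio computed first is what you would compare against a threshold when deciding whether to drop the feature or impute it.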

 

254 Comparison of SVM, LR and decision trees.

Model complexity: SVM supports kernel functions and can handle both linear and nonlinear problems; LR is a simple model that trains quickly and suits linear problems; decision trees overfit easily and need pruning.

Loss function: SVM uses hinge loss; LR uses log loss, usually with L2 regularization; AdaBoost uses exponential loss.

Data sensitivity: SVM is tolerant of outliers because it only cares about the support vectors, but the data must be normalized first; LR is sensitive to points far from the boundary.

Data volume: use LR for large data volumes; use an SVM with a nonlinear kernel for small data volumes with few features.

 

255 What is an ill-conditioned problem? If, after training, a slight modification of the test samples produces very different results, the problem is ill-conditioned. The model predicts unseen data very poorly, that is, its generalization error is large.

 

256 Describe the process of the KNN nearest neighbor classification algorithm.

1. Calculate the distance between the test sample and each sample point in the training set (common distance measures include Euclidean distance and Mahalanobis distance);

2. Sort all of these distance values;

3. Select the k samples with the smallest distances;

4. Vote according to the labels of these k samples to obtain the final class.
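The four steps above can be sketched in a few lines of plain Python. This is an illustrative implementation using Euclidean distance, not a tuned one; the function name `knn_predict` and the toy points are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Step 1: distance from the query to every training sample (Euclidean).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Steps 2 and 3: sort by distance and keep the k nearest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote over the labels of the k nearest samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

pred_low = knn_predict(points, labels, (1.5, 1.5), k=3)
pred_high = knn_predict(points, labels, (6.5, 6.5), k=3)
print(pred_low, pred_high)
```

Sorting every distance costs O(n log n) per query, which is fine for a sketch; practical libraries use spatial indexes or partial sorts instead.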

 

257 What are the commonly used categories of clustering? List representative algorithms.

1. Partition-based clustering: K-means, K-medoids, CLARANS.

2. Hierarchical clustering: AGNES (bottom-up), DIANA (top-down), BIRCH (CF-tree), CURE.

3. Density-based clustering: DBSCAN, OPTICS.

4. Grid-based methods: STING, WaveCluster.

5. Model-based clustering: EM, SOM, COBWEB.

 

258 Which of the following descriptions of weak learners in an ensemble learning model is wrong?

A. They often do not overfit

B. They usually have high bias, so they cannot solve complex learning problems

Answer (C): A weak learner is only sure about a particular part of the problem, so it usually does not overfit; in other words, weak learners usually have low variance and high bias.

 

259 Which of the following descriptions of k-fold cross-validation is true?

1. Increasing K increases the time needed to obtain the cross-validation result.

2. A larger K gives more confidence in the cross-validation result than a smaller K.

3.

A. 1 and 2  B. 2 and 3  C. 1 and 3  D. 1, 2 and 3

Answer (D): A larger K means less bias toward overestimating the true expected error (the training folds are closer in size to the full data set) and a longer running time (as you approach the limit case, leave-one-out cross-validation). When choosing K we also need to consider the variance of the accuracy across the k folds.

 

260 The best-known dimensionality reduction algorithms are PCA and t-SNE. Applying these two algorithms to data “X” yields the data sets “X_projected_PCA” and “X_projected_tSNE” respectively. Which of the following descriptions of “X_projected_PCA” and “X_projected_tSNE” is correct?

A. X_projected_PCA can be interpreted in the nearest-neighbor space

B. X_projected_tSNE can be interpreted in the nearest-neighbor space

C. Both can be interpreted in the nearest-neighbor space

D. Neither can be interpreted in the nearest-neighbor space

Answer (B): The t-SNE algorithm reduces the dimension of the data by considering nearest-neighbor points, so after t-SNE the reduced dimensions can be interpreted in the nearest-neighbor space. PCA cannot.

 

261 Given three variables X, Y and Z, the Pearson correlation coefficients of (X, Y), (Y, Z) and (X, Z) are C1, C2 and C3 respectively. Now 2 is added to every value of X, 2 is subtracted from every value of Y, and Z is left unchanged. The correlation coefficients of (X, Y), (Y, Z) and (X, Z) after this operation are D1, D2 and D3 respectively. What is the relationship between D1, D2, D3 and C1, C2, C3?

A. D1 = C1, D2 < C2, D3 > C3

B. D1 = C1, D2 > C2, D3 > C3

C. D1 = C1, D2 > C2, D3 < C3

D. D1 = C1, D2 < C2,

Answer (E): Adding a constant to a feature or subtracting one from it does not change the correlation coefficients between features, so D1 = C1, D2 = C2 and D3 = C3.
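The shift invariance can be verified numerically. The sketch below implements the Pearson coefficient directly from its definition; the three small data lists are arbitrary values picked for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient computed from its definition."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Arbitrary illustrative data.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 6.0]
Z = [5.0, 3.0, 4.0, 1.0, 2.0]

C1, C2, C3 = pearson(X, Y), pearson(Y, Z), pearson(X, Z)

X_shifted = [x + 2 for x in X]  # all values of X plus 2
Y_shifted = [y - 2 for y in Y]  # all values of Y minus 2

D1 = pearson(X_shifted, Y_shifted)
D2 = pearson(Y_shifted, Z)
D3 = pearson(X_shifted, Z)

print(C1, D1)
print(C2, D2)
print(C3, D3)
```

The shift cancels because the formula only uses deviations from the mean, and adding a constant moves the mean by the same amount.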

 

262 What do you need to do in PCA to get the same projection as SVD? Answer (A): PCA gives the same projection as SVD when the data has a zero mean vector; otherwise you must center the data (subtract the mean) before applying SVD.

 

263 Suppose we have a data set that can be fit with 100% training accuracy by a decision tree of depth 6. Now consider the following two points and choose the right option based on them. Note: all other hyperparameters are the same and all other factors are unaffected.

1. A depth of 4 will give high bias and low variance.

2.

Answer (A): If you fit a decision tree of depth 4 to such data, it is more likely to underfit the data. In the underfitting case you get high bias and low variance.

 

264 In the K-means algorithm, which of the following can be used to obtain the global minimum?

A. Run the algorithm with different centroid initializations

B. Adjust the number of iterations

C. Find the optimal number of clusters

D. All of the above

Answer (D): All of these can be tuned in the search for the global minimum.

 

265 You are doing binary classification with L1-regularized logistic regression, where C is the regularization parameter and w1 and w2 are the coefficients of x1 and x2. Which of the following is true when you increase the value of C from 0 to a very large value?

A. First w2 becomes 0, and then w1 becomes 0

B. First w1 becomes 0, and then w2 becomes 0

C. w1 and w2 become 0 at the same time

D. Even after C becomes large, neither w1 nor w2 can become 0

Answer (C): Under the L1 regularization term, w1 and w2 can become exactly 0. Because w1 and w2 are symmetric, there is no situation in which one is 0 and the other is not.

 

266 Suppose you use the log-loss function as the evaluation criterion. Which of the following is a correct interpretation of log-loss as an evaluation criterion?

A. If a classifier is confident about an incorrect classification, log-loss penalizes it heavily.

B. If, for a particular observation, the classifier assigns a very small probability to the correct class, the corresponding contribution to the log-loss is very large.

C. The lower the log-loss, the better the model.
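A short sketch shows how quickly the penalty grows. It implements the usual binary log-loss, the mean of -[y log p + (1 - y) log(1 - p)]; the clipping constant `eps` and the example probabilities are choices made for this illustration.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# One positive example scored with three different predicted probabilities.
confident_right = log_loss([1], [0.99])
unsure = log_loss([1], [0.60])
confident_wrong = log_loss([1], [0.01])

print(confident_right, unsure, confident_wrong)
```

The confidently wrong prediction costs far more than the hesitant one, which is the behavior described in options A and B.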

 

267 Which of the following is deterministic? Answer (A): A deterministic algorithm is one whose output does not change from run to run. If we run the algorithm again, PCA gives the same result, but K-means (with random initialization) does not.

 

268 What are the normalization methods for feature vectors?

Linear (min-max) conversion: y = (x - minvalue) / (maxvalue - minvalue)

Logarithmic conversion: y = log10(x)

Arctangent conversion: y = arctan(x) * 2 / PI

Standardization (subtract the mean, divide by the standard deviation): y = (x - mean) / std
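The conversions listed above are easy to express with the Python standard library. The function names and the sample values below are illustrative assumptions; the z-score version divides by the population standard deviation.

```python
import math
import statistics

def min_max(values):
    """Linear conversion to the range [0, 1]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_scale(values):
    """Base-10 logarithm; only defined for positive values."""
    return [math.log10(v) for v in values]

def arctan_scale(values):
    """Arctangent conversion, mapping any real value into (-1, 1)."""
    return [math.atan(v) * 2 / math.pi for v in values]

def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature values
scaled = min_max(data)
standardized = z_score(data)
squashed = arctan_scale(data)
print(scaled)
print(standardized)
```

Min-max scaling is sensitive to a single extreme value, whereas the z-score keeps outliers visible as large standardized values; which one fits depends on the model that consumes the features.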

 

269 Optimization algorithms and their advantages and disadvantages?

Tip: when answering an interviewer's question, start from the big picture so that you do not get trapped in small technical details and talk yourself into a corner.

In brief:

1) Stochastic gradient descent. Advantage: it can escape local optima to some extent. Disadvantage: convergence is slow.

2) Batch gradient descent. Disadvantage: it easily falls into a local optimum.

3) Mini-batch gradient descent combines the advantages and disadvantages of stochastic and batch gradient descent as a compromise between the two.

4) Newton's method needs to compute the Hessian matrix at each iteration, which is difficult when the dimension is high.

5) Quasi-Newton methods were proposed to avoid computing the Hessian matrix at each iteration of Newton's method; they work with an approximation of the Hessian instead.

In more detail, distinguished by how much data each update uses:

Batch gradient descent: the whole data set is used for every update. Advantage: it reaches the optimal solution. Disadvantages: slow, and memory may be insufficient.

Mini-batch gradient descent. Advantages: fast training, no memory problems, less oscillation. Disadvantage: it may not reach the optimal solution.

Distinguished by optimization method:

Stochastic gradient descent (SGD). Disadvantages: it is hard to choose a suitable learning rate; the same learning rate is used for all parameters; it easily converges to a local optimum and may get trapped at saddle points.

SGD + Momentum. Advantages: accumulated momentum accelerates training; near a local extremum, momentum helps jump out of the trap; when the gradient direction changes, momentum damps the oscillation.

Nesterov Momentum is similar to Momentum. Advantages: it avoids moving forward too fast and improves responsiveness.

AdaGrad. Advantages: it controls the learning rate, giving each component its own learning rate, and suits sparse data. Disadvantages: it depends on a global learning rate; if the learning rate is set too large, its effect is too strong; later in training the accumulated denominator becomes too large, the learning rate becomes very low, and training ends prematurely.

RMSProp. Advantage: it solves AdaGrad's premature-ending problem. Disadvantage: it still depends on a global learning rate.

Adam combines AdaGrad's strength on sparse gradients with RMSProp's strength on non-stationary objectives, gives different parameters different adaptive learning rates, and works for most non-convex optimization problems, large data sets and high-dimensional spaces.
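To show what the plain and momentum updates look like in code, here is a minimal sketch on a toy objective f(w) = w^2. It uses the exact gradient rather than a sampled one, so it illustrates only the update rules, not the noise of real SGD. The learning rate, momentum coefficient and step count are assumed values.

```python
def grad(w):
    """Gradient of the toy objective f(w) = w**2."""
    return 2.0 * w

def gradient_descent(w, lr=0.1, steps=500):
    """Plain update: step against the current gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum_descent(w, lr=0.1, beta=0.9, steps=500):
    """Momentum update: step along a decaying sum of past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)
        w -= lr * velocity
    return w

w_plain = gradient_descent(5.0)
w_momentum = momentum_descent(5.0)
print(w_plain, w_momentum)
```

On this one-dimensional bowl both variants reach the minimum at 0; the benefits of momentum described above show up mainly on ill-conditioned or noisy objectives.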

 

270 What are the differences and connections between RF and GBDT?

1) Similarities: both consist of multiple trees, and the final result is determined jointly by those trees.

2) Differences:

A. The trees in a random forest can be classification trees or regression trees, while GBDT consists only of regression trees.

B. The trees in a random forest can be built in parallel, while the trees in GBDT are built serially.

C. The result of a random forest is decided by majority vote, while the result of GBDT is the accumulated sum of multiple trees.

D. Random forest is not sensitive to outliers, whereas GBDT is sensitive to outliers.

E. Random forest reduces the variance of the model, while GBDT reduces the bias of the model.

F. Random forest does not need feature normalization, whereas GBDT requires feature normalization.

 

271 The Pearson correlation coefficient of two variables is zero, but the values of the two variables can still be related. Answer (A): Pearson's correlation coefficient measures only linear correlation and cannot capture nonlinear relationships. For example, if y = x^2 with x distributed symmetrically around 0, the Pearson coefficient is 0 even though x and y have a strong nonlinear relationship.

 

272 Which of the following hyperparameter increases may cause overfitting of random forest data? The answer is (B) : Under normal circumstances, increasing the depth of the tree may cause overfitting of the model. The learning rate is not a hyperparameter of random forest. Increasing the number of trees may result in an underfit.

 

273 The 8 actual values of the target variable on the training set are [0,0,0,1,1,1,1,1]. What is the entropy of the target variable?

A. -(5/8 log(5/8) + 3/8 log(3/8))

B. 5/8 log(5/8) + 3/8 log(3/8)

C. 3/8 log(5/8) + 5/8 log(3/8)

D. 5/8 log(3/8) - 3/8 log(5/8)

Answer: (A)
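Option A can be evaluated directly. The sketch below computes the Shannon entropy of the label list; the question does not state the base of the logarithm, so base 2 (entropy in bits) is an assumption made here.

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """Shannon entropy of a list of labels (base 2 is an assumption)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

target = [0, 0, 0, 1, 1, 1, 1, 1]
h = entropy(target)

# The same value written out as in option A.
manual = -(5 / 8 * math.log2(5 / 8) + 3 / 8 * math.log2(3 / 8))
print(h, manual)
```

Both expressions agree, and the value is slightly below 1 bit because the two classes are not perfectly balanced.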

 

274 Which of the following descriptions of sequential pattern mining algorithms is wrong? (C)

A. The AprioriAll and GSP algorithms are both Apriori-like algorithms, and both need to generate a large number of candidate sequences

B. The FreeSpan and PrefixSpan algorithms do not generate a large number of candidate sequences and do not need to scan the original database repeatedly

C. In terms of time and space efficiency, FreeSpan is better than PrefixSpan

D. Compared with AprioriAll, GSP executes relatively efficiently

@CS green finch, source of the analysis: blog.csdn.net/ztf312/arti…

1. Apriori algorithm: the original algorithm for association analysis, used to discover frequent itemsets from candidate itemsets. Two steps: self-join and prune. Disadvantage: it has no notion of sequence.

AprioriAll algorithm: AprioriAll executes in the same way as Apriori; the difference lies in candidate generation, where the order of the last two elements must be distinguished.

AprioriSome algorithm: can be seen as an improvement on AprioriAll.

Comparison of AprioriAll and AprioriSome:

(1) AprioriAll computes all candidates Ck from the frequent sequences of the previous pass, whereas AprioriSome computes all candidates directly from the previous candidates; since the previous candidates contain the frequent sequences, AprioriSome produces more candidates.

(2) Although AprioriSome skips some candidate computations, it generates so many candidates that memory may fill up before the backtracking stage.

(3) If memory fills up, AprioriSome is forced to compute the last group of candidates.

(4) For low support and long large sequences, AprioriSome performs better.

2. GSP algorithm: an Apriori-like algorithm, used to discover frequent sequences from candidate itemsets. Two steps: self-join and prune. Disadvantages: the whole data set must be scanned every time support is computed; for long sequential patterns, the number of corresponding short patterns is too large for the algorithm to handle.

3. SPADE algorithm: an improved GSP algorithm that avoids scanning the whole data set D many times. It is basically the same as GSP but keeps an ID_LIST record, so that each round's ID_LIST is derived from the previous round's ID_LIST (and support is obtained from it). ID_LIST shrinks as pruning proceeds, which solves GSP's problem of scanning data set D repeatedly.

4. FreeSpan algorithm: sequential pattern mining with frequent pattern projection. The core idea is divide and conquer. The basic idea is to use frequent items to recursively project the sequence database into a set of smaller projected databases and to grow subsequence fragments in each projected database. This process partitions the data and the set of frequent patterns to be checked, restricting each check to the smaller projected database that corresponds to it. Advantage: it reduces the overhead of generating candidate sequences. Disadvantage: it may produce many projected databases, which is expensive.

5. PrefixSpan algorithm: derived from FreeSpan. Its projected databases shrink even faster than FreeSpan's.

 

275 Which of the following is not a commonly used feature selection algorithm for text classification? (D)

A. Chi-square test  B. Mutual information  C. Information gain  D. Principal component analysis

276 Commonly used feature selection methods. There are six common feature selection methods:

2) MI (Mutual Information): the mutual information method measures the amount of information shared between a feature word and a document category. If a feature has low frequency its mutual information score is high, so the method tends to favor low-frequency feature words. A word with relatively high frequency gets a lower score, and if that word carries a lot of information, mutual information becomes ineffective.

3) IG (Information Gain): the information gain method measures the importance of a feature word by how much information about the corpus is gained between the cases where the feature word is absent and present.

4) CHI (Chi-square): the chi-square test uses the basic idea of hypothesis testing in statistics. First assume that the feature word and the category are unrelated. The more the test statistic computed from the chi-square distribution exceeds the threshold, the more confidently we reject the null hypothesis and accept the alternative hypothesis: the feature word and the category are highly correlated.

5) WLLR (Weighted Log Likelihood Ratio): weighted log-likelihood ratio.

6) WFO (Weighted Frequency and Odds): weighted frequency and odds.

blog.csdn.net/ztf312/arti…

 

277 Among the class-domain boundary equation methods, which one cannot find an approximate or exact solution to the classification problem when the classes are linearly inseparable? (D)

A. Pseudo-inverse method: the training algorithm of radial basis function (RBF) neural networks, designed to handle the linearly inseparable case

B. H-K algorithm based on a quadratic criterion: the weight vector is obtained under the minimum mean square error criterion, and the quadratic criterion handles nonlinear problems

C. Potential function method: nonlinear

D. Perceptron algorithm: a linear classification algorithm

 

278 What are the possible methods for feature selection in machine learning? (E) A, CHI square B, information gain C, mean mutual information D, expected cross entropy E

 

279 The following methods that cannot be used for feature dimension reduction include (E)

A principal component analysis PCA

B linear discriminant analysis LDA

C SparseAutoEncoder for deep learning

D matrix singular value decomposition SVD

E LeastSquares

Feature dimension reduction methods mainly include:

PCA, LLE, Isomap

SVD is similar to PCA and can also be regarded as a dimension reduction method

LDA: Linear discriminant analysis, can be used for dimensionality reduction

AutoEncoder: The structure of an autoencoder is the same as a neural network with one hidden layer, consisting of input L1, output L2 and the weights connecting them. The autoencoder reconstructs the input as L3 from L2 and is trained to minimize the difference between L3 and L1, which yields the weights. With these weights, L2 preserves as much of the information in L1 as possible.

The dimension of the autoencoder output L2 is determined by the number of output neurons. When the output dimension is larger than that of L1, a sparsity penalty must be added to the training objective to prevent L2 from simply copying L1 (all weights equal to 1). Hence the name sparse autoencoder (proposed by Andrew Ng).

Conclusion: in most cases SparseAutoEncoder increases the dimension, so it is not accurate to call it a feature dimension reduction method.

 

280 In general, the k-NN nearest neighbor method works better in which of the following cases? (A)

A. Many samples but poor typicality

B. The samples are clustered in clumps

C. Few samples but good typicality

D. The samples are distributed in chains

Which of the following methods can be used to reduce the dimension of high-dimensional data:

A. LASSO  B. Principal component analysis  C. Cluster analysis  D. Wavelet analysis  E. Linear discriminant analysis  F. Laplacian eigenmaps

LASSO achieves dimensionality reduction by shrinking the number of parameters; PCA needs no explanation; linear discriminant analysis (LDA) finds a space that minimizes the within-class distance and maximizes the between-class distance, so it can be regarded as dimension reduction; wavelet analysis applies transforms that suppress other interference, which can also be seen as dimension reduction; for Laplacian eigenmaps, see http://f.dataguru.cn/thread-287243-1-1.html

 

281 Which of the following descriptions is correct? (D)

A. SVM is a classifier that searches for the hyperplane with the minimum margin, so it is often called the minimum margin classifier

B. In cluster analysis, the greater the similarity within clusters and the greater the difference between clusters, the worse the clustering

C. In a decision tree, when the number of nodes becomes too large, the test error begins to increase even though the training error keeps decreasing; this is caused by underfitting

D. Cluster analysis can be regarded as unsupervised classification

 

282 Which of the following statements is wrong? (C)

A. SVM is robust to noise (such as noise samples from other distributions)

B. In the AdaBoost algorithm, the weight update ratio is not the same for all misclassified samples

C. Boosting and bagging are both methods that combine the votes of multiple classifiers

D. Given n data points, if half are used for training and half for testing, the difference between training error and test error decreases as n increases

A: A soft-margin classifier is robust to noise. Please refer to http://blog.csdn.net/v_july_v/article/details/40718799

C: Boosting determines the weight of each classifier according to its accuracy; bagging does not.

D: Increasing the size of the training set improves the robustness of the model.

 

283 Which of the following statements about the normal distribution is false?

A. The normal distribution has central tendency and symmetry

B. The mean and variance of a normal distribution determine its position and shape

C. The skewness of a normal distribution is 0 and its kurtosis is 1

D. The mean of the standard normal distribution is 0 and its variance is 1

Answer: C. The kurtosis of a normal distribution is 3 (excess kurtosis 0), not 1.

 

284 In which of the following scenarios is an incorrect analysis method used?

A. Based on merchants' operation and service data from the last year, use a clustering algorithm to determine the level of Tmall merchants within their respective main categories

B. Based on merchants' transaction data from recent years, use a clustering algorithm to fit a formula for users' likely spending in the next month

C. Use an association rule algorithm to analyze whether buyers who have bought car seat cushions are suitable targets for recommending car floor mats

D. Based on information about the products a user has recently bought, use a decision tree algorithm to identify whether a Taobao buyer is likely male or female

 

285 What is a gradient explosion? The error gradient is the direction and magnitude computed during neural network training, used to update the network weights in the right direction and by the right amount. In deep networks or recurrent neural networks, error gradients can accumulate during updates and become very large, leading to large updates of the network weights and making the network unstable. In extreme cases, the weights become so large that they overflow, producing NaN values. Gradient explosion is caused by the exponential growth that results from repeatedly multiplying gradients with values greater than 1.0 across network layers.
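The exponential growth is easy to see with a toy calculation. The sketch below assumes, purely for illustration, that every layer multiplies the backpropagated gradient by the same constant factor; real networks have varying, matrix-valued factors.

```python
def backpropagated_scale(per_layer_factor, n_layers):
    """Scale of a gradient after passing back through n_layers layers,
    assuming each layer multiplies it by the same factor (a simplification)."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale

exploding = backpropagated_scale(1.5, 50)   # factor above 1.0
stable = backpropagated_scale(1.0, 50)      # factor exactly 1.0
vanishing = backpropagated_scale(0.5, 50)   # factor below 1.0

print(exploding, stable, vanishing)
```

A factor only moderately above 1.0 already produces a gradient hundreds of millions of times larger after 50 layers, while a factor below 1.0 produces the opposite problem, a vanishing gradient.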

 

286 What problems do gradient explosions cause? In deep multilayer perceptron networks, gradient explosions make the network unstable: at best it cannot learn from the training data, and at worst the weights become NaN and can no longer be updated. “The gradient explosion makes the learning process unstable.” (Deep Learning, 2016.) In recurrent neural networks, gradient explosions make the network unstable and unable to learn from the training data; at best, the network cannot learn from long input sequences.

How do I determine whether there is a gradient explosion? During training, gradient explosion may be accompanied by some subtle signals, such as:

  • The model cannot learn from the training data (e.g. poor loss).
  • The model is unstable, so the loss changes significantly from update to update.
  • The model loss becomes NaN during training.

If you see these problems, look carefully for gradient explosion. Some slightly more obvious signs also help confirm it:

  • The model gradients grow rapidly during training.
  • The model weights become NaN during training.
  • During training, the error gradient of each node and layer stays above 1.0.

 

287 How to fix the gradient explosion problem? There are many ways to address gradient explosion; this section lists some best practices.

1. Redesign the network model. In deep neural networks, gradient explosion can be addressed by redesigning the network with fewer layers. Using a smaller batch size also benefits training. In recurrent neural networks, updating over fewer previous time steps during training (truncated backpropagation through time) can alleviate gradient explosion.

2. Use the ReLU activation function. In deep multilayer perceptrons, gradient explosion may be caused by the activation function, such as the previously popular Sigmoid and Tanh functions. Using ReLU can reduce gradient explosion. Using ReLU is the newer best practice for hidden layers.

3. Use LSTM units. In recurrent neural networks, gradient explosion may arise from the instability inherent in training this type of network; for example, backpropagation through time essentially transforms the recurrent network into a deep multilayer perceptron. Gradient explosion can be reduced by using long short-term memory (LSTM) units and related gated neuron structures. Using LSTM units is the current best practice for sequence prediction with recurrent neural networks.

4. Use gradient clipping. In very deep multilayer perceptrons with large batch sizes and in LSTMs with long input sequences, gradient explosion can still occur. If it does, you can check and limit the size of the gradient during training. This is gradient clipping. “There is a simple and effective solution to dealing with gradient explosions: if gradients exceed a threshold, truncate them.” (Neural Network Methods in Natural Language Processing, 2017.) Specifically, check whether the value of the error gradient exceeds a threshold and, if so, clip the gradient to that threshold. Because the gradient is limited to the threshold before the gradient descent step is performed, gradient clipping mitigates the explosion problem to some extent. In the Keras deep learning library, you can use gradient clipping by setting the clipnorm or clipvalue parameter on the optimizer before training. Commonly suggested values are clipnorm=1.0 and clipvalue=0.5. See: keras.io/optimizers/…

5. Use weight regularization. If gradient explosion still occurs, another option is to check the size of the network weights and add a penalty to the loss function for large weight values. This process is known as weight regularization and usually uses either an L1 penalty (absolute weights) or an L2 penalty (squared weights). “Using L1 or L2 penalty terms for the recurrent weights helps mitigate gradient explosions.” (On the difficulty of training recurrent neural networks, 2013.) In the Keras deep learning library, you can regularize weights by setting the kernel_regularizer parameter on a layer and using an L1 or L2 regularizer.
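The two clipping modes mentioned above can be sketched in plain Python. This illustrates the idea only and is not how any particular library implements it; the function names and the example gradient are assumptions, and the thresholds reuse the values quoted in the text.

```python
import math

def clip_by_value(grads, clip_value):
    """Clamp each gradient component to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [3.0, -4.0]                            # example gradient with L2 norm 5
by_norm = clip_by_norm(grads, max_norm=1.0)    # direction kept, norm becomes 1
by_value = clip_by_value(grads, clip_value=0.5)

print(by_norm, by_value)
```

Clipping by norm preserves the direction of the update and only shortens it, whereas clipping by value can change the direction because each component is clamped independently.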

 

288 What are the inputs and outputs of an LSTM neural network? @YJango, source of the analysis: www.zhihu.com/question/41… “Recurrent Layers”, an article published on January 4, 2017

  • The first thing to make clear is that the units that neural networks deal with are all vectors

So why does training data usually appear as matrices and tensors?

  • Regular feedforward input and output: matrix

Input matrix shape: (n_samples, dim_input)

Output matrix shape: (n_samples, dim_output)

Note: At actual test/training time, the inputs and outputs of the network are just vectors. The n_samples dimension is added so that multiple samples can be trained at once and their average gradient used to update the weights, which is called mini-batch gradient descent. If n_samples equals 1, this update mode is called stochastic gradient descent (SGD). The input and output of a feedforward network are essentially single vectors.

  • Regular Recurrent (RNN/LSTM/GRU) inputs and outputs: the tensor

Input tensor shapes: (time_steps, n_samples, DIM_input) Output tensor shapes: (time_steps, n_samples, DIM_output) note: The training method of mini-Batch gradient Descent is also retained, but the difference is that the dimension of time step is added. In essence, the input of Recurrent at any given moment is a single vector, which feeds the network in sequence from different moments. So you might want to think of it as a bunch of vectors a sequence of vectors, or a matrix.

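The prediction can be expressed in Python code:

    import numpy as np

    # RNN, inputs, time_steps, n_samples, dim_input and dim_output are assumed
    # to be defined elsewhere; this is an illustration, not a complete program.

    # Hidden state accumulated so far; all zeros before the first vector is seen.
    hidden_state = np.zeros((n_samples, dim_input))

    # inputs.shape: (time_steps, n_samples, dim_input)
    outputs = np.zeros((time_steps, n_samples, dim_output))

    for i in range(time_steps):
        # Emit the output of the current step and update the accumulated hidden state.
        outputs[i], hidden_state = RNN.predict(inputs[i], hidden_state)
    # outputs.shape: (time_steps, n_samples, dim_output)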

Note, however, that the output of a recurrent network can also be a matrix rather than a three-dimensional tensor, depending on how you design it.

  1. If you want to predict one sequence from another, both the input and the output are tensors (for example speech recognition, or machine translation of a Chinese sentence into an English sentence, where each word counts as one vector). Machine translation is a special case: the two sequences may differ in length, so seq2seq is used.
  2. If you want to predict a single value from a sequence, the input is a tensor and the output is a matrix (for example sentiment analysis, which predicts the speaker's mood from a string of words).
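The two output designs in the list above can be sketched with a hand-written toy recurrent layer in plain Python. Everything here is an assumption made for illustration: the helper names `rnn_step` and `run_rnn`, the tiny weights, and the `return_sequences` flag, which borrows its name from the Keras option but is implemented by hand.

```python
import math

def rnn_step(x, h, w_in, w_rec):
    """One recurrent step for a single sample: h_new = tanh(w_in @ x + w_rec @ h)."""
    return [
        math.tanh(sum(wi * xi for wi, xi in zip(w_in[j], x))
                  + sum(wr * hi for wr, hi in zip(w_rec[j], h)))
        for j in range(len(w_rec))
    ]

def run_rnn(inputs, w_in, w_rec, return_sequences):
    """inputs is nested as (time_steps, n_samples, dim_input)."""
    n_samples, dim_hidden = len(inputs[0]), len(w_rec)
    hidden = [[0.0] * dim_hidden for _ in range(n_samples)]
    outputs = []
    for x_t in inputs:                       # feed one time step at a time
        hidden = [rnn_step(x, h, w_in, w_rec) for x, h in zip(x_t, hidden)]
        outputs.append(hidden)
    return outputs if return_sequences else outputs[-1]

# Toy sizes: time_steps=3, n_samples=2, dim_input=2, dim_output=1.
inputs = [[[0.1, 0.2], [0.3, 0.4]],
          [[0.5, 0.6], [0.7, 0.8]],
          [[0.9, 1.0], [1.1, 1.2]]]
w_in = [[0.5, -0.5]]     # shape (dim_output, dim_input)
w_rec = [[0.1]]          # shape (dim_output, dim_output)

seq_out = run_rnn(inputs, w_in, w_rec, return_sequences=True)    # tensor
last_out = run_rnn(inputs, w_in, w_rec, return_sequences=False)  # matrix
```

Here `seq_out` is nested as (time_steps, n_samples, dim_output), the sequence-to-sequence case, while `last_out` keeps only the final step and is nested as (n_samples, dim_output), the sequence-to-value case.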

What a feedforward network can do is map one vector to one vector;

a recurrent network extends this to sequence-to-sequence mapping.

But a single vector can also be treated as a sequence of length one, which gives the following types:

Except for one-to-one on the far left, which a feedforward network can do, everything to the right is what recurrent networks do.

 

If you want to know more

  • You can think of the recurrent (horizontal) operation as accumulating what has already happened, and of the LSTM memory cell as choosing what to remember or forget from that accumulated information in order to predict the output at a given moment.
  • From a probabilistic point of view, this means repeatedly conditioning on what has already happened, which keeps shrinking the sample space.
  • The idea of an RNN is that the current output depends not only on the current input but also on the previous state: the current output is computed from the current input and the previous hidden state, and after each computation some information is left in the hidden state for the next computation.

 

289 Which of the following statements about the PMF (probability mass function), PDF (probability density function) and CDF (cumulative distribution function) is wrong? A. The PDF describes the probability of a continuous random variable over a specific interval of values. B. The CDF is the integral of the PDF over a specific interval. C. The PMF describes the probability of a discrete random variable at a specific value. D. If a distribution has CDF H(x), then H(a) equals P(X <= a). Answer: A. The probability density function (PDF) is defined for continuous random variables. It is not itself a probability; it yields a probability only after being integrated over a range of values of the variable. The cumulative distribution function (CDF) completely describes the probability distribution of a real-valued random variable X; it is the integral of the PDF and, in contrast to the PDF, is defined as a probability for all real numbers x.

 

290 What are the basic assumptions of linear regression? (ABDE) A. The random error term is a random variable with expected value 0; B. For all observations of the explanatory variables, the random error term has the same variance; C. The random error terms are correlated with each other; D. The explanatory variables are deterministic rather than random, and are independent of the random error term; E. The random error term follows a normal distribution.

When handling categorical features, the distribution of the categorical variables in the test set is not known in advance, and one-hot encoding is to be applied to the categorical features. What difficulties may arise when applying one-hot encoding to the categorical variables of the training set? A. Not all categories of the categorical variable appear in the test set. B. The frequency distribution of the categories differs between the training set and the test set. C. The training set and the test set usually have the same distribution. Answer: A, B. If a category appears in the test set but not in the training set, one-hot encoding cannot encode it; this is the main difficulty. Extra care is also needed if the frequency distributions of the training and test sets differ.
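The unseen-category difficulty can be shown with a minimal hand-rolled encoder. The helper names `fit_categories` and `one_hot` are made up for this sketch, and mapping an unseen category to an all-zero vector is only one possible policy, chosen here as an assumption.

```python
def fit_categories(values):
    """Collect the distinct categories seen in the training data, in sorted order."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; a category unseen in training becomes an all-zero vector."""
    return [1 if value == c else 0 for c in categories]

train = ["red", "green", "red", "blue"]
test = ["green", "purple"]          # "purple" never appears in training

categories = fit_categories(train)  # ['blue', 'green', 'red']
encoded = [one_hot(v, categories) for v in test]
```

The encoder has no column for "purple", so that test row carries no information about its category. Alternatives include raising an error or reserving an explicit "unknown" column.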

 

291 Suppose you use activation function X in the hidden layers of a neural network. For some input to a particular neuron, you get the output "-0.0001". Which of the following activation functions could X be? A. ReLU B. tanh C. sigmoid D. None of the above. Answer: B. The activation function could be tanh, because its range is (-1, 1); ReLU outputs are never negative, and sigmoid outputs lie in (0, 1).
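A quick way to check the reasoning is to evaluate the three candidate activations on the same pre-activation. In this small sketch the input `z` is chosen, as an assumption, so that tanh returns the value from the question.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

# Pre-activation chosen so that tanh returns the value from the question.
z = math.atanh(-0.0001)
outputs = {"relu": relu(z), "sigmoid": sigmoid(z), "tanh": tanh(z)}
```

Only the tanh entry is negative; ReLU returns 0 for this negative input and sigmoid returns a value just below 0.5.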

 

292 Which of the following descriptions of Type I (Type-1) and Type II (Type-2) errors are correct? A. Type I is usually called a false positive, and Type II a false negative. B. Type II is usually called a false positive, and Type I a false negative. C. A Type I error usually occurs when a hypothesis is rejected although it is true. The answers are (A) and (C): in statistical hypothesis testing, a Type I error is the incorrect rejection of a true hypothesis, i.e. a false positive, while a Type II error is the incorrect acceptance of a false hypothesis, i.e. a false negative.

 

293 Which of the following figures show multicollinear features? A. The features in Figure 1 B. The features in Figure 2 C. The features in Figure 3 D. The features in Figures 1 and 2 E. The features in Figures 2 and 3 F. The features in Figures 1 and 3. The answer is (D): in Figure 1 the features are highly positively correlated, and in Figure 2 they are highly negatively correlated, so the features in both figures are multicollinear.

Suppose multicollinear features have been identified. What could the next step be? A. Remove both collinear variables B. Remove one of the two variables rather than both C. Removing correlated variables may lose information; a regression model with a penalty term (such as ridge or lasso regression) can be used instead. The answers are (B) and (C): removing both variables would lose all of their information, so we either remove only one feature or use a regularized algorithm (L1 or L2).
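One simple way to spot a collinear pair is the Pearson correlation coefficient: values near +1 or -1 signal the strong positive or negative correlation described above. The sketch below uses made-up feature columns `x1`, `x2`, `x3`; the helper `pearson` is written by hand for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]      # roughly 2 * x1: strong positive correlation
x3 = [9.9, 8.2, 5.9, 4.1, 2.0]      # roughly -2 * x1: strong negative correlation

r12 = pearson(x1, x2)
r13 = pearson(x1, x3)
```

Both `r12` and `r13` are close to 1 in absolute value, so either pair would count as collinear; dropping one feature of the pair, or fitting with an L1/L2 penalty, are the options named in the answer.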

 

294 What may adding an unimportant feature to a linear regression model cause? (A) After a feature is added to the feature space, R-squared usually increases, whether or not the feature is important.

 

295 Suppose the classes of the target variable are highly imbalanced, i.e. the majority class makes up 99% of the training data, and your model reaches 99% accuracy on the test set. Which of the following statements are true? A. Accuracy is not suitable for measuring imbalanced-class problems. B. Accuracy is suitable for measuring imbalanced-class problems. C. Precision and recall are suitable for measuring imbalanced-class problems. D. Precision and recall are not suitable for measuring imbalanced-class problems. The answers are (A) and (C).
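The point of the answer can be reproduced with a degenerate "model" that always predicts the majority class. The data below are synthetic and the helper names are made up for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the model actually finds."""
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return found / total

# 1000 samples: 990 of the majority class (0), 10 of the minority class (1).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000            # a "model" that always predicts the majority class

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
```

Accuracy comes out at 99% even though the model never finds a single minority sample, which is exactly what recall on the minority class exposes.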

 

296 What are bias and variance?

The generalization error can be decomposed into the squared bias plus the variance plus the noise. Bias measures how far the learning algorithm's expected prediction deviates from the true result and characterizes the fitting ability of the algorithm itself. Variance measures how much the learned performance changes when a training set of the same size changes and characterizes the effect of perturbations in the data. Noise expresses the lower bound on the expected generalization error that any learning algorithm can reach on the current task and characterizes the difficulty of the problem itself. In general, the more thoroughly a model is trained, the smaller its bias and the larger its variance, so the generalization error usually reaches its minimum somewhere in between.
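The decomposition can be written out as follows. This is the standard textbook form, given here as a sketch under assumed notation: f(x; D) is the model trained on training set D, y is the true label of x, y_D is its observed (possibly noisy) label in D, and the noise is assumed to have zero mean.

```latex
\begin{aligned}
\bar{f}(x) &= \mathbb{E}_D\big[f(x; D)\big] \\
\mathrm{bias}^2(x) &= \big(\bar{f}(x) - y\big)^2 \\
\mathrm{var}(x) &= \mathbb{E}_D\big[\big(f(x; D) - \bar{f}(x)\big)^2\big] \\
\varepsilon^2 &= \mathbb{E}_D\big[(y_D - y)^2\big] \\
\mathbb{E}_D\big[\big(f(x; D) - y_D\big)^2\big] &= \mathrm{bias}^2(x) + \mathrm{var}(x) + \varepsilon^2
\end{aligned}
```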

 

297 What are the solutions to high bias and high variance? High bias solutions: boosting, a more complex model (a nonlinear model, adding layers to a neural network), more features. High variance solutions: bagging, a simpler model, dimensionality reduction.

 

Which models are solved with the EM algorithm? Why not use Newton's method or gradient descent? Models typically solved with EM include GMM and collaborative filtering; k-means is in fact also an instance of EM. The EM algorithm is guaranteed to converge, but possibly only to a local optimum. Newton's method and gradient descent are avoided because the number of terms in the sum grows exponentially with the number of hidden variables, which makes the gradient troublesome to compute.
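As a concrete instance of EM on a GMM, here is a minimal sketch for a one-dimensional mixture of two Gaussians. The data, the initial parameters and the function names are all made up for illustration; it also records the log-likelihood at every iteration, which EM never decreases.

```python
import math

def gauss(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=20):
    """Run EM on a 1-D Gaussian mixture; return final parameters and log-likelihoods."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, history = len(means), []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp, loglik = [], 0.0
        for x in data:
            joint = [weights[j] * gauss(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        history.append(loglik)
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
    return means, variances, weights, history

data = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.0, 2.1, 1.9, 2.2, 1.8]
means, variances, weights, history = em_gmm_1d(
    data, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

Each step has a closed form, so no gradient of the log of a sum is ever needed. A different initialization can end in a different local optimum, which matches the caveat above.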

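How does XGBoost score features? During training, each split chooses the feature and split point by the gain in the training objective rather than by the Gini index. Under the split-count ("weight") importance, the more often a feature is chosen for a split, the higher its score.

    # Assumes `model` is an already fitted XGBoost model (for example an XGBClassifier).
    from matplotlib import pyplot
    from xgboost import plot_importance

    # Print the feature importances.
    print(model.feature_importances_)

    # Option 1: plot them manually as a bar chart.
    pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
    pyplot.show()

    # Option 2: use the built-in importance plot.
    plot_importance(model)
    pyplot.show()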

 

299 What is OOB? How is the OOB error computed in a random forest, and what are its advantages and disadvantages? In bagging, each bootstrap sample leaves out about 1/3 of the original samples; these samples do not appear in that bootstrap sample and take no part in building the corresponding decision tree. This 1/3 of the data is called out-of-bag (OOB) data and can be used in place of a test set for error estimation. The OOB error is computed as follows: once the random forest has been built, test it on the out-of-bag data. Let O be the number of out-of-bag samples. Feed these O samples into the trained random forest classifier, which returns a class for each of them. Because their true classes are known, compare the true classes with the classifier's output and count the number of misclassified samples, X. The out-of-bag error is then X/O. This estimate has been shown to be unbiased, so a random forest needs neither cross-validation nor a separate test set to obtain an unbiased estimate of the test error.
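The "about 1/3" figure comes from the chance that a given sample is never drawn in n draws with replacement, (1 - 1/n)^n, which tends to 1/e, roughly 0.368. A small simulation illustrates this; the helper name `oob_fraction`, the sample size and the seed are arbitrary choices for the sketch.

```python
import random

def oob_fraction(n_samples, rng):
    """Draw one bootstrap sample of size n_samples; return the share never drawn."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

rng = random.Random(0)                 # fixed seed so the run is repeatable
n = 10000
simulated = oob_fraction(n, rng)
theoretical = (1 - 1 / n) ** n         # tends to 1/e, about 0.368, as n grows
```

The simulated share lands close to the theoretical value, so each tree has roughly a third of the data available as its own held-out set.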

 

300 Suppose Zhang San has 1000 songs on his MP3 player and wants to design a random algorithm to play them. Unlike ordinary shuffle mode, Zhang San wants the probability of each song being picked to be proportional to its Douban score (0 to 10 points). For example, Pu Shu's "Ordinary Road" is rated 8.9 and Escape Plan's "The Brightest Star in the Night Sky" is rated 9.5, so he wants the ratio of hearing "Ordinary Road" to hearing "The Brightest Star in the Night Sky" to be 89:95. Given the Douban scores of the 1000 songs: (1) design a random algorithm that meets Zhang San's needs; (2) write code that implements the algorithm.

    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    using namespace std;

    // Binary search: return the first index whose cumulative score is >= rnd.
    int findIdx(const double sums[], int n, double rnd) {
        int left = 0, right = n - 1;
        while (left < right) {
            int mid = (left + right) / 2;
            if (sums[mid] >= rnd) right = mid;
            else left = mid + 1;
        }
        return left;
    }

    // Draw a point uniformly in [0, total score] and map it to a song index.
    int randomPlaySong(const double sum_scores[], int n) {
        double mx = sum_scores[n - 1];
        double rnd = rand() * mx / (double)RAND_MAX;
        return findIdx(sum_scores, n, rnd);
    }

    int main() {
        srand((unsigned)time(0));
        double scores[] = {5.5, 6.5, 4.5, 8.5, 9.5, 7.5, 3.5, 5.0, 8.0, 2.0};
        const int n = sizeof(scores) / sizeof(scores[0]);
        double sum_scores[n];
        sum_scores[0] = scores[0];
        // Prefix sums: sum_scores[i] = scores[0] + ... + scores[i].
        for (int i = 1; i < n; i++) sum_scores[i] = sum_scores[i - 1] + scores[i];
        // The source listing is truncated after the loop header above;
        // the lines below are an assumed minimal completion.
        for (int k = 0; k < 5; k++) cout << randomPlaySong(sum_scores, n) << endl;
        return 0;
    }
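The same prefix-sum plus binary-search idea can be sketched compactly in Python with the standard `bisect` and `itertools` modules. The three scores, the seed and the helper names `build_cumulative` and `pick_song` are made up for this illustration.

```python
import bisect
import itertools
import random

def build_cumulative(scores):
    """Prefix sums of the scores: cumulative[i] = scores[0] + ... + scores[i]."""
    return list(itertools.accumulate(scores))

def pick_song(cumulative, rng):
    """Return an index chosen with probability proportional to its score."""
    point = rng.random() * cumulative[-1]           # uniform in [0, total)
    idx = bisect.bisect_right(cumulative, point)    # first prefix sum > point
    return min(idx, len(cumulative) - 1)            # guard against rounding at the top end

scores = [8.9, 9.5, 1.0]
cumulative = build_cumulative(scores)
rng = random.Random(42)
counts = [0] * len(scores)
for _ in range(20000):
    counts[pick_song(cumulative, rng)] += 1
```

Building the prefix sums costs O(n) once, and each draw costs O(log n). Over many draws the ratio `counts[0] / counts[1]` approaches 8.9 / 9.5. For everyday use, `random.choices(population, weights=scores)` from the standard library does the same job.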


 

301 For the logistic regression problem prob(t|x) = 1/(1+exp(w*x+b)) with label y = 0 or 1, give the loss function, the update formula for the weight w, and the derivation.

The loss function of Logistic regression is log loss, and the formula is expressed as:



The update formula for w is obtained by minimizing the loss function, namely:

The part inside the braces is equivalent to the log-likelihood function of the logistic regression model, so the problem can also be solved by maximum likelihood. With gradient descent, the update formula is:
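The loss, its gradient and the resulting update can be written out as below. This is the textbook form rather than a copy of the original formulas. It assumes m training pairs (x_i, y_i), a learning rate α, and the usual sign convention for the sigmoid, which matches the formula in the question after flipping the signs of w and b.

```latex
\begin{aligned}
p_i &= P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\!\big(-(w^\top x_i + b)\big)} \\
J(w, b) &= -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big] \\
\frac{\partial p_i}{\partial w} &= p_i (1 - p_i)\, x_i \\
\frac{\partial J}{\partial w} &= -\frac{1}{m} \sum_{i=1}^{m} \Big[ \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \Big] p_i (1 - p_i)\, x_i
  = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)\, x_i \\
w &\leftarrow w - \alpha \, \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)\, x_i, \qquad
b \leftarrow b - \alpha \, \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)
\end{aligned}
```

Minimizing J is the same as maximizing the log-likelihood, since J is the negative log-likelihood divided by m.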

 

302 What is the relationship between the entropy of a parent node and that of a child node in a decision tree? A. The entropy of the parent node is larger B. The entropy of the child node is larger C. The two are equal D. It depends on the situation. During feature selection, the parent node should be given the feature with the largest information gain, computed as IG(Y|X) = H(Y) - H(Y|X), where H(Y|X) is the conditional entropy given feature X. The smaller H(Y|X) is, the "purer" the data are after splitting on that feature, the larger IG is, and the better the feature classifies. The larger H(Y|X) is, the more "disordered" the split is, the smaller IG is, and the less suitable the feature is as a splitting attribute. Because IG is never negative, the weighted average entropy of the children cannot exceed the entropy of the parent, although a single child node can still have higher entropy than its parent.
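The formula IG(Y|X) = H(Y) - H(Y|X) is easy to evaluate on a toy split. In the sketch below the labels and the split are made up, and entropy is measured in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - sum_k (|child_k| / |parent|) * H(child_k)."""
    n = len(parent)
    conditional = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - conditional

# 9 positives and 1 negative, split into a pure child and a perfectly mixed child.
parent = [1] * 9 + [0] * 1
children = [[1] * 8, [1, 0]]

h_parent = entropy(parent)
h_children = [entropy(ch) for ch in children]
gain = information_gain(parent, children)
```

In this split the mixed child has entropy 1 bit, higher than the parent's, yet the weighted average over both children is lower than the parent's entropy, so the gain is still positive.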

 

303 What are the causes of underfitting and overfitting, and how can each be avoided? Cause of underfitting: the model complexity is too low, so the model cannot fit the data well and the training error is large. To avoid underfitting, increase model complexity, for example by using a higher-order model (prediction) or introducing more features (classification). Cause of overfitting: the model complexity is too high or the training data are too few, so the training error is small but the test error is large. To avoid overfitting, reduce model complexity, for example by adding a regularization penalty such as L1 or L2, or increase the amount of training data.

 

304 Parameter estimation for language models often uses MLE (maximum likelihood estimation). One problem it faces is that items that never occur in the training data get probability 0, which makes the language model perform poorly. To solve this problem, use (A) A. smoothing B. denoising C. D. adding white noise
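Add-one (Laplace) smoothing is the simplest smoothing method and is used here only as an example of the idea. The sketch uses a made-up six-word corpus and a unigram model; the helper names are hypothetical.

```python
from collections import Counter

def mle_prob(word, counts, total):
    """Maximum likelihood estimate: relative frequency in the training text."""
    return counts[word] / total

def laplace_prob(word, counts, total, vocab_size):
    """Add-one (Laplace) smoothing: every vocabulary word gets one extra count."""
    return (counts[word] + 1) / (total + vocab_size)

corpus = "the cat sat on the mat".split()
vocab = set(corpus) | {"dog"}        # "dog" is in the vocabulary but never observed
counts = Counter(corpus)
total = len(corpus)

p_mle = mle_prob("dog", counts, total)
p_smooth = laplace_prob("dog", counts, total, len(vocab))
smoothed_sum = sum(laplace_prob(w, counts, total, len(vocab)) for w in vocab)
```

The MLE gives "dog" probability 0, while the smoothed estimate is 1/12, and the smoothed probabilities over the vocabulary still sum to 1.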

Update and maintenance of this article are now suspended. The other nearly 3000 questions have been moved to the July Online app or the question bank section of the July Online official website. In other words, for thousands of new BAT written-test and interview questions, see the July Online AI question bank.

 

 

Correction log

  • On February 2, 2017, the July Online lecturers team began reviewing all the answers and explanations. Because these questions would be published on the July Online official website and app and used by hundreds of thousands or even millions of people, every question needs an answer and an explanation, and both must be accurate. The division of labor is as follows: 1~20 AntZ, 21~40 Dr. Chu, 41~60 Dr. Liang Weiqi, 61~80 Dr. Guan, 81~100 Dr. Han Xiaoyang, 101~120 Dr. Zhao, 121~140 Zhang Yushi, 141~160 Wang Yun, 161~180 Liang Weiqi, 181~200 AntZ.
  • December 8, 2017: second round of review; began labeling each question with a category and difficulty level.
  • December 9 to December 11, 2017: third round of review; together with the operations team, began entering the questions one by one into the back-end system of the official website and app. The official website and the Android app went live on Double 12.
  • As of December 24, 2017, the BAT machine learning interview 1000 questions series has reached more than 300 questions; together with the questions on the July Online official website and Android app, the whole AI question bank holds thousands of questions. Turning the question bank into a product and continuing to add questions is well worth doing.
  • Important note: update and maintenance of this article have been suspended since the question bank launched on iOS on August 8. The other 3000 questions have been moved to the July Online app or the question bank section of the July Online official website.

 

 

Afterword.

To be honest, compared with written-test and interview questions on data structures and algorithms, machine learning questions are far harder to compile: so few of them are available online that compiling one ML question can take as much effort as compiling at least 10 data structure or algorithm questions.

The good thing is that we also learned a lot while compiling this series; compiling was itself a way of learning. Many topics, including the various optimization algorithms and RNNs, became clear bit by bit along the way. While working through a question I would, intentionally or not, dig deeper and keep asking myself related questions. Thinking through question after question in this way was learning and progress for me as well.

Let’s keep doing this until we get to 1,000 questions, or even thousands of questions, for one reason only: it’s good for everyone and it’s worth it in the long run.

Finally, if you are reading this, please leave a comment on the answer to the question, or share a question you already have on your hands (you can leave a comment directly in the comments of this article, or you are welcome to tweet @researcher July), share and help more people around the world, thanks.

July team, do not write the date, new questions please go to July online APP or July online official website.