Preface

From 2010 to 2013, for four consecutive years, this blog compiled the data structure and algorithm written-test and interview questions of the major companies. Around 2012, as AI heated up and companies began recruiting AI talent, many students started searching the Internet for all kinds of machine learning exam and interview questions; unlike data structures, however, very few AI question collections exist online.

Since 2017, my team and I have been compiling the "BAT Machine Learning Interview 1000 Questions" series, which nearly a million people have followed. The question bank on the July Online website/app has by now gathered some 4,000 AI written-test and interview questions. This article selects some of the machine-learning-related ones so that everyone can look them up at any time while job hunting or reviewing.

Generally speaking, getting into a top company requires the following three kinds of ability:

  1. Coding ability, the most basic skill, covering data structures and algorithms. To put it bluntly, with solid coding ability you will not do badly in either IT or AI, yet many people tend to neglect it.
  2. Machine learning and deep learning ability. Since 2016, with the sudden emergence of AlphaGo, deep learning has swept all fields. The key points include various models: decision tree, random forest, XGBoost, SVM, feature engineering, CNN, RNN, LSTM and so on.
  3. Technical ability for specific business scenarios, such as business understanding and modeling. Different directions naturally use different techniques, for example CV, NLP, and recommendation systems.

It is worth mentioning that in the Machine Learning Intensive Training Camp jointly run by CSDN and us, we designed eight stages and thirteen hands-on projects (three enterprise projects and ten practice projects) aimed at reaching an annual salary of 400,000 yuan within three months. The camp covers Python fundamentals and data analysis, machine learning principles, machine learning practice, deep learning principles, deep learning practice, BAT-scale industrial projects in the three directions of CV, recommendation, and NLP, as well as interview and job-search guidance.

If you want to audit the training camp's SVM and XGBoost sessions for free (see this link for camp details: t.csdnimg.cn/aTHV; they are explained very thoroughly, better than countless other materials), you can add the enterprise WeChat account below with the note: training camp free trial.

Due to limited space, this article does not include the reference answer to every question; only excerpts are given, and the complete analyses can be obtained by scanning the code. If you have any questions, feel free to leave a comment, discuss, or correct us. Thanks.

 

Machine learning interview 150 questions

Again: in this article each question comes with only a brief excerpt of its analysis; to receive the detailed analyses, please add the QR code below. In addition, we selected 100 questions from the 4,000-question bank and compiled them into a PDF, "100 AI Interview Questions"; to get it, also add the QR code below (note: when adding, mention that you want the 150 detailed analyses and the 100-question PDF).

  1. Please explain the principle of the Support Vector Machine (SVM) in detail. The support vector machine, commonly abbreviated SVM, is in general a binary classification model. Its basic form is a linear classifier with the maximum margin in feature space; its learning strategy is margin maximization, which can be formulated as a convex quadratic programming problem.
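
    As a quick reference, a minimal sketch of the hard-margin primal problem behind that statement (this standard formulation is not quoted from the original analysis): find the separating hyperplane with the largest margin by solving

    ```latex
    \min_{w, b} \ \frac{1}{2}\lVert w \rVert^2
    \quad \text{s.t.} \quad y_i \left( w^{\top} x_i + b \right) \ge 1, \quad i = 1, \dots, N
    ```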

  2. Which machine learning algorithms don’t need to be normalized?

    In practice, models that do need normalization include: 1. models based on distance computation, such as KNN; 2. models solved by gradient descent, such as linear regression, logistic regression, support vector machines, and neural networks.

    Tree models such as decision trees and random forests, however, do not need normalization, because they do not care about the absolute values of variables, only about their distribution and the conditional probabilities between variables.

  3. Why don't tree models need normalization? Because monotonic numerical scaling does not change the positions of the split points, it does not change the structure of the tree. Sorting by feature values gives the same order before and after scaling, so the branches and split points stay the same. In addition, tree models do not use gradient descent: a (regression) tree is built by searching for the optimal split point, so the model is a step function, the step points are not differentiable, taking derivatives is meaningless, and hence normalization is unnecessary.

  4. In k-means or kNN we usually use Euclidean distance to measure the distance between nearest neighbors, and sometimes Manhattan distance. Please compare the two. Euclidean distance, the most common notion of distance between two points, is also called the Euclidean metric and is defined in Euclidean space..
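
    For reference, the two distances for points x and y in n-dimensional space (standard definitions, not excerpted from the original analysis):

    ```latex
    d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
    \qquad
    d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert
    ```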

  5. Reasons for standardizing (or normalizing) data; note that normalization and standardization are not the same thing

    If possible, it is better not to normalize blindly. The reason for normalizing data is that the features have different scales (units); whether normalization is needed depends on the situation.

    For some models (such as SVM), scaling the dimensions unevenly changes the optimal solution, so normalization is required. For others, such as LR, the optimum is unchanged in principle, but in practice the parameters are solved iteratively; if the objective function is too flat (imagine a very elongated Gaussian), the iterative algorithm converges poorly, so it is still better to normalize the data.

  6. Please briefly describe the process of a complete machine learning project

    The first step of a machine learning project is to abstract the problem into a mathematical problem. Training is usually very time-consuming, and the time cost of blind trial-and-error is high. Abstracting the problem means being clear about what data we can obtain and whether the goal is a classification, regression, or clustering problem, and if not, whether it can be reduced to one of these.

    2. Acquire the data. Data determine the upper bound of what machine learning can achieve; algorithms only approach that bound. The data must be representative, otherwise overfitting is inevitable. For classification problems the class skew should not be too severe: the counts of the different classes should not differ by several orders of magnitude. In addition, estimate the scale of the data, such as how many samples and how many features, to judge the memory consumption during training and whether it will fit in memory. If not, consider improving the algorithm or using tricks to reduce the dimensionality; if the data volume is too large, consider a distributed approach.

    3. Feature preprocessing and feature selection. Good data should allow good features to be extracted..

  7. Why does logistic regression discretize continuous features?

    As the July Online instructor put it: ① Nonlinearity! Logistic regression is a generalized linear model with limited expressive power. After a single continuous variable is discretized into N binary variables, each variable gets its own weight, which effectively introduces nonlinearity into the model, improving its expressive power and its fit. Discrete features are also easy to add and remove, which makes rapid iteration easy.

    ② Speed! Inner products of sparse vectors are fast to compute, and the results are easy to store and scale.

    ③ Robustness! Discretized features are robust to abnormal data: for example, a feature "age > 30" is 1, otherwise 0. Without discretization, an outlier such as an age of 300 would greatly disturb the model.

    ④ Convenient feature crossing and combination: after discretization, feature crosses can turn M+N variables into M*N variables, further introducing nonlinearity and improving expressive power.

    ⑤ Stability: after discretization the model is more stable. For example, if user age is bucketed into 20-30, a user does not become a completely different sample just by turning one year older. Of course, samples right next to a boundary behave the opposite way, so how to choose the intervals is an art.

    ⑥ Simplified model..
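
    A minimal pandas sketch of the discretization described above: bucket a continuous age feature and one-hot encode the buckets (the column names and bin edges here are made up for illustration):

    ```python
    import pandas as pd

    df = pd.DataFrame({"age": [18, 25, 33, 47, 62, 300]})  # 300 is a deliberate outlier

    # Bucket the continuous feature into intervals, then one-hot encode the buckets.
    df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 200], labels=False)
    dummies = pd.get_dummies(df["age_bin"], prefix="age_bin", dummy_na=True)

    # The outlier 300 falls outside every bin and simply lands in the NaN bucket,
    # instead of dragging on a single weight for raw age in a linear model.
    print(pd.concat([df, dummies], axis=1))
    ```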

  8. A little bit about LR

    Rickjin: explain LR from head to toe: the modeling, the full mathematical derivation, the principle of each solver, regularization, and the relationship between LR and the maximum-entropy (MaxEnt) model. Many people can recite the answers, but get confused as soon as the logical details are probed.

    Principles all clear? Then the engineering side: how to parallelize LR, how many ways there are to parallelize it, and which open-source implementations you have read. If that goes well too, the interviewer is about ready to take you, and may ask, in passing, about the development history of the LR model.

    Although logistic regression carries "regression" in its name, it is really a binary classifier. Let's first get one concept straight: linear classifiers..

  9. How to deal with overfitting. Overfitting shows up intuitively like this: as training proceeds, model complexity increases and the error on the training data keeps decreasing, but the error on the validation set starts to rise, because the trained network fits the training set too closely and does not work on data outside it; this is called poor generalization. Generalization is the primary goal when evaluating training: without good generalization, everything else is effort in the wrong direction and useless.

  10. The relation and difference between LR and SVM

    Similarities: both LR and SVM can handle classification problems and are generally used for linear binary classification (with extensions, both can handle multi-class problems).

    Differences: 1. LR is a parametric model, while SVM is a non-parametric model; in SVM, the linear and RBF kernels correspond to handling linearly separable and non-separable data respectively. 2. In terms of the objective function, logistic regression uses the logistic loss while SVM uses the hinge loss; both losses increase the weight of the data points that matter most for classification and reduce the weight of the points that are less relevant.

    3..
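
    For reference, the two losses mentioned in point 2, written for labels y in {-1, +1} (standard forms, not quoted from the original analysis):

    ```latex
    \ell_{\text{logistic}}(y, f(x)) = \log\left(1 + e^{-y f(x)}\right),
    \qquad
    \ell_{\text{hinge}}(y, f(x)) = \max\left(0,\ 1 - y f(x)\right)
    ```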

  11. What is entropy? The name "entropy" sounds a little mysterious, but entropy simply measures the uncertainty of a random variable. The mystique probably comes from why that name was chosen and how it is used. The concept originated in physics as a measure of the disorder of a thermodynamic system; in information theory, entropy is a measure of uncertainty.

  12. Let’s talk about gradient descent

    What is gradient descent? It is an algorithm that appears constantly in machine learning optimization problems. So what exactly is it?

    Wikipedia defines gradient descent as a first-order optimization algorithm, commonly known as the steepest descent method. To find a local minimum of a function with gradient descent, one iteratively takes a step of a specified size from the current point in the direction opposite to the gradient (or approximate gradient) at that point. If instead the search steps in the direction of the gradient, it approaches a local maximum of the function; that process is called gradient ascent.
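
    A minimal sketch of the update rule just described, minimizing the simple quadratic f(x, y) = x^2 + 10y^2 (the function and step size are illustrative choices, not from the original answer):

    ```python
    import numpy as np

    def grad(p):
        # Gradient of f(x, y) = x^2 + 10*y^2
        x, y = p
        return np.array([2 * x, 20 * y])

    p = np.array([5.0, 3.0])   # starting point
    lr = 0.05                  # step size (learning rate)

    for _ in range(200):
        p = p - lr * grad(p)   # step opposite to the gradient

    print(p)  # close to the minimizer (0, 0)
    ```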

  13. What is the difference between Newton's method and gradient descent? Newton's method is a method for approximately solving equations over the real and complex numbers. It uses the first few terms of the Taylor series of a function f(x) to find roots of the equation f(x) = 0. The greatest strength of Newton's method is its fast convergence.
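
    For reference, the standard update rules: the root-finding form, and the second-order optimization form that distinguishes Newton's method from first-order gradient descent (textbook formulas, not excerpted from the original analysis):

    ```latex
    x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \quad \text{(roots of } f(x) = 0\text{)},
    \qquad
    x_{n+1} = x_n - \left[\nabla^2 f(x_n)\right]^{-1} \nabla f(x_n) \quad \text{(optimization)}
    ```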

  14. Definitions of entropy, joint entropy, conditional entropy, relative entropy, and mutual information. To follow the probability notation: a capital letter X denotes a random variable and a lowercase x a specific value of X; p(X) is the probability distribution of X, p(X, Y) the joint distribution of X and Y, and p(Y | X) the conditional distribution of Y given X; P(X = x) denotes the probability that X takes the specific value x, abbreviated p(x); P(X = x, Y = y) denotes the joint probability, abbreviated p(x, y), and P(Y = y | X = x) the conditional probability, abbreviated p(y | x), with p(x, y) = p(x) * p(y | x).
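
    The standard definitions the question asks for, collected here in the notation above (textbook forms, not excerpted from the original analysis):

    ```latex
    H(X) = -\sum_{x} p(x)\log p(x), \qquad
    H(X, Y) = -\sum_{x, y} p(x, y)\log p(x, y), \qquad
    H(Y \mid X) = -\sum_{x, y} p(x, y)\log p(y \mid x)
    ```

    ```latex
    D(p \,\|\, q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}, \qquad
    I(X; Y) = \sum_{x, y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)} = H(Y) - H(Y \mid X)
    ```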

  15. Usually people choose from a number of commonly used kernel functions (depending on the problem and the data, different parameters are chosen, which in effect gives different kernels), for example:

  16. What are Quasi-Newton Methods?

    Quasi-Newton methods are among the most effective methods for solving nonlinear optimization problems. They were proposed in the 1950s by W. C. Davidon, a physicist at Argonne National Laboratory in the United States. At the time, Davidon's algorithm was considered one of the most creative inventions in nonlinear optimization. Before long, R. Fletcher and M. J. D. Powell showed that the new algorithm was far faster and more reliable than existing methods, and the discipline of nonlinear optimization leapt forward overnight.

    The essential idea of quasi-Newton methods is to remedy the drawback of Newton's method, which must invert the complicated Hessian matrix at every step: a positive-definite matrix is used to approximate the inverse of the Hessian, which simplifies the computation. Like the steepest descent method, a quasi-Newton method only requires the gradient of the objective function at each iteration. By measuring the change of the gradient, it builds a model of the objective good enough to produce superlinear convergence. Such methods are vastly superior to steepest descent, especially on difficult problems.

  17. What is the complexity of k-means? Time complexity: O(tKmn), where t is the number of iterations, K the number of clusters, m the number of records (samples), and n the dimensionality. Space complexity: O((m + K)n), with m, K, and n as above..

  18. What are the problems and challenges of stochastic gradient descent? And how can stochastic gradient methods be improved? For details, see the first session of our public course on gradient descent and other optimization algorithms (video and PPT download available) (link: ask.julyedu.com/question/79…

  19. What about the conjugate gradient method? The conjugate gradient method sits between steepest descent (gradient descent) and Newton's method: it uses only first-derivative information, yet it overcomes the slow convergence of gradient descent and avoids Newton's method's need to store, compute, and invert the Hessian matrix. It is not only one of the most useful methods for solving large systems of linear equations, but also one of the most efficient algorithms for large-scale nonlinear optimization. Among optimization algorithms, the conjugate gradient method is very important: it needs little storage, converges step by step, is highly stable, and requires no external parameters.

  20. For all optimization problems, is it possible to find a better algorithm than the one we already know?

    No-free-lunch theorem: for the same training samples (black dots), different algorithms A and B perform differently on different test samples (white dots). This means that if learning algorithm A is better than learning algorithm B on some problems, then there must be other problems on which B is better than A. In other words, averaged over all problems, no matter how clever algorithm A is and how clumsy algorithm B is, their expected performance is the same.

    However, the no-free-lunch theorem assumes that all problems occur with equal probability. In practice, different scenarios have different problem distributions, so analyzing the concrete problem at hand is the core of algorithm optimization.

  21. What is maximum entropy

    Entropy measures the uncertainty of a random variable: the greater the uncertainty, the larger the entropy. If the random variable degenerates to a fixed value, its entropy is zero. Without outside interference, a random variable always tends toward disorder, and after enough time of stable evolution it should reach maximum entropy.

    To estimate the state of a random variable accurately, we customarily maximize entropy, believing that among all possible probability models (distributions) the one with maximum entropy is the best. In other words, given partial knowledge, the most reasonable inference about the unknown part of the distribution is the one that is most uncertain or most random while still conforming to what is known. The principle is to acknowledge what is known (the knowledge) and to make no assumptions whatsoever, no bias at all, about what is unknown.

  22. In industry, LR generally refers to Logistic Regression rather than Linear Regression. A sigmoid function is applied to the real-valued output of linear regression to squash it into the range 0~1, and the objective function accordingly changes from a sum of squared differences to a log loss, which provides the derivatives needed for optimization (the sigmoid is the binary special case of the softmax function, and its derivative is f*(1-f) in terms of the function value f). Note that LR is usually used for binary 0/1 classification, but it is so tightly coupled to linear regression that it has unwittingly been given the name "regression". For multi-class problems, replace the sigmoid with the well-known softmax.
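
    For reference, the sigmoid and the derivative identity mentioned above:

    ```latex
    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
    \sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right) = f\,(1 - f)
    ```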

  23. The difference between supervised and unsupervised learning. Supervised learning: learn from labeled training samples in order to classify or predict, as well as possible, data outside the training set (LR, SVM, BP, RF, GBDT). Unsupervised learning: train on unlabeled samples to discover structural knowledge in them (KMeans, PCA)..

  24. Keywords: decision tree; random forest; Boosting; AdaBoost; GBDT; XGBoost

    Ensemble learning combines individual learners. Bagging and Boosting are the two main families of ensemble methods. Bagging trains each learner on a bootstrap sample (drawn with replacement, the same size as the original set) and then combines them by simple voting; Boosting trains the learners sequentially on all samples with adjustable sample weights and combines them iteratively with weighted voting.

    The decision tree is one of the most commonly used base learners; its learning process builds the tree from the root, i.e. decides how to split at each node. ID3/C4.5 trees use information entropy to compute the optimal split, CART trees use the Gini index, and the trees in XGBoost use the coefficients of a second-order Taylor expansion to compute the optimal split..

  25. What exactly does regularization in machine learning mean?

    The error/loss term encourages the model to fit the training data as well as possible, so the final model has low bias; the regularization term encourages a simpler model. When the model is simple, the results of fitting finite data are less random and overfitting is less likely, which makes the final model's predictions more stable.

    But there hasn’t been a good article on what exactly regularization is.

    When it comes to regularization, we should start with the overfitting problem.

  26. What are the common loss functions? For a given input X, the model produces an output f(X), which may or may not agree with the true value Y (remember that some loss or error is unavoidable). A loss function measures the degree of prediction error: it is written L(Y, f(X)) and quantifies the inconsistency between the model's prediction f(X) and the true value Y..
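
    Some common concrete choices of L(Y, f(X)), as a reference (standard definitions, not excerpted from the original analysis):

    ```latex
    \text{0-1 loss: } L = \mathbb{1}\{Y \ne f(X)\}, \quad
    \text{squared loss: } L = (Y - f(X))^2, \quad
    \text{absolute loss: } L = \lvert Y - f(X) \rvert, \quad
    \text{log loss: } L = -\log P(Y \mid X)
    ```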

  27. Why does XGBoost use a Taylor expansion, and what are the advantages? XGBoost uses the first- and second-order derivatives of the loss, and the second-order information allows faster and more accurate descent. By taking a second-order Taylor expansion of the loss, the leaf-splitting optimization depends only on quantities computed from the input data, not on the specific form of the loss function; in essence, the choice of loss function is decoupled from the model/algorithm optimization and parameter selection. This decoupling widens XGBoost's applicability, letting it plug in a loss function on demand for either classification or regression..
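
    The second-order expansion referred to above, in the notation of the XGBoost paper, where g_i and h_i are the first and second derivatives of the loss with respect to the previous round's prediction (constant terms dropped):

    ```latex
    \mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),
    \qquad
    g_i = \partial_{\hat{y}^{(t-1)}} l\!\left(y_i, \hat{y}^{(t-1)}\right), \quad
    h_i = \partial^2_{\hat{y}^{(t-1)}} l\!\left(y_i, \hat{y}^{(t-1)}\right)
    ```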

  28. What is the difference between covariance and correlation? Correlation is a standardized form of covariance. Covariances by themselves are hard to compare: for example, if we compute the covariance of salary ($) and age (years), the two variables have different scales and units, so we get covariances on different scales that cannot be compared directly.
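
    The standardization mentioned above, written out: dividing the covariance by the two standard deviations gives a unit-free quantity between -1 and 1:

    ```latex
    \rho_{X, Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \in [-1, 1]
    ```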

  29. How does XGBoost find the optimal features? Is sampling done with or without replacement?

    XGBoost computes a gain score for each candidate feature during training, and the feature with the maximum gain is chosen as the split, so the importance of each feature is recorded as the model trains; the number of times a feature is involved in splits from root to leaf gives its importance ranking.

  30. What about discriminative and generative models? Discriminative approach: learn a decision function Y = f(X) directly from the data, or learn the conditional probability distribution P(Y | X), and use it as the prediction model; this is a discriminative model. Generative approach: learn the joint probability distribution P(X, Y) from the data and then derive the conditional distribution P(Y | X) as the prediction model; this is a generative model. A discriminative model can be obtained from a generative model, but not the other way around. Common discriminative models: k-nearest neighbors, SVM, decision tree, perceptron, linear discriminant analysis (LDA), linear regression, traditional neural networks, logistic regression, Boosting, conditional random fields. Common generative models: naive Bayes, hidden Markov model, Gaussian mixture model, the LDA topic model (latent Dirichlet allocation), restricted Boltzmann machine.

  31. The difference between linear and nonlinear classifiers, and their pros and cons. Linearity and nonlinearity are relative to the model parameters and the input features: for example, if the input is x, the model y = ax + bx^2 is nonlinear, but if the inputs are x and x^2, the model is linear. Linear classifiers are well interpretable and computationally cheap; their drawback is weaker fitting power. Nonlinear classifiers fit the data more powerfully, but they overfit easily when data are insufficient, are computationally more expensive, and are harder to interpret. Common linear classifiers: LR, Bayesian classification, single-layer perceptron, linear regression. Common nonlinear classifiers: decision tree, RF, GBDT, multi-layer perceptron. SVM can be either (linear kernel vs. Gaussian kernel).

  32. The difference between L1 and L2

    The L1 norm is the sum of the absolute values of the elements of a vector; it is also known as Lasso regularization. For example, for the vector A = (1, -1, 3), the L1 norm is |1| + |-1| + |3| = 5.

    L1 norm: the sum of the absolute values of the elements of the vector x. L2 norm: the square root of the sum of squares of the elements of x; it is also known as the Euclidean norm (or, for matrices, the Frobenius norm). Lp norm: the sum of the absolute values of the elements of x raised to the power p, all taken to the power 1/p..
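
    The three norms written out (standard definitions):

    ```latex
    \lVert x \rVert_1 = \sum_i \lvert x_i \rvert, \qquad
    \lVert x \rVert_2 = \Big(\sum_i x_i^2\Big)^{1/2}, \qquad
    \lVert x \rVert_p = \Big(\sum_i \lvert x_i \rvert^p\Big)^{1/p}
    ```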

  33. What prior distributions do the L1 and L2 regularizers correspond to? This comes up in interviews: the L1 regularizer corresponds to a Laplace prior on the parameters, and the L2 regularizer to a Gaussian prior.

  34. Give a brief introduction to logistic regression?

    Logistic Regression is a classification model in machine learning. Due to its simplicity and efficiency, it is widely used in practice. For example, in practical work, we may encounter the following problems:

    Predict whether a user will click on a particular product; determine a user's gender; predict whether a user will buy in a given category; determine whether a review is positive or negative.

    All of these can be viewed as classification problems, or more precisely binary classification problems. To solve them, existing classification algorithms such as logistic regression or support vector machines are often used. They are supervised learning methods, so a batch of labeled data must be collected as a training set before they can be applied. Some labels can be extracted from logs (clicks, purchases), some from information users fill in (gender), and some may need to be labeled by hand (review polarity).

  35. Talk about AdaBoost and its weight-update formula. Given weak classifiers Gm and sample weights w1, w2, …, please write down the final decision formula.

    Given a training dataset T={(x1,y1), (x2,y2)… (xN,yN)}..

  36. Those of you who do a lot of Internet searching know that when you accidentally type in a word that doesn’t exist, the search engine will prompt you to enter the right word. For example, if you type “Julw” into Google, the search engine will guess your intention: “July”

    When a user types a word, it may be spelled correctly or incorrectly. If we write c for the correct spelling and w for the wrong one, the job of spell checking is to infer c from w when w is observed. In other words: given w, find the most likely c among several candidates.
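
    In the Bayesian formulation this reads as follows, where P(c) is the prior over correct words and P(w | c) the error model (a standard way to write it, not quoted from the original analysis):

    ```latex
    \hat{c} = \arg\max_{c} P(c \mid w)
            = \arg\max_{c} \frac{P(w \mid c)\,P(c)}{P(w)}
            = \arg\max_{c} P(w \mid c)\,P(c)
    ```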

  37. Why is Naive Bayes so naive?

    Because it assumes that all features are equally important and independent in the data set. As we all know, this assumption is far from true in the real world, so to say naive Bayes is really naive.

    Naive Bayesian models are Naive in the sense of assuming that sample characteristics are independent of each other. This assumption is largely nonexistent in reality, but there are plenty of cases where feature correlations are small, so the model still works well.

  38. Please roughly compare pLSA and LDA. The difference between the two reflects the difference between the frequentist and Bayesian schools: the latter adds prior probability distributions over the parameters..

  39. Please elaborate on the EM algorithm

    What exactly is an EM algorithm? Wikipedia explains:

    The expectation-maximization (EM) algorithm finds maximum likelihood estimates, or maximum a posteriori estimates, of the parameters of a probabilistic model, where the model depends on unobservable latent (hidden) variables.

  40. How do I pick K in KNN?

    About what KNN is, you can check this article: “From k-nearest Neighbor algorithm, distance measurement to KD tree, SIFT+BBF algorithm” (link: blog.csdn.net/v_july_v/ar…

    If a small value of K is chosen, prediction uses training instances in a smaller neighborhood, so the approximation error of "learning" decreases: only training instances close (similar) to the input instance influence the prediction. The downside is that the estimation error of "learning" increases. In other words, a smaller K makes the overall model more complex and prone to overfitting. If a larger value of K is chosen, prediction uses training instances from a larger neighborhood. The advantage is a smaller estimation error, the disadvantage a larger approximation error: training instances far from (dissimilar to) the input also influence the prediction and make it wrong, and a larger K means a simpler overall model. When K = N, the model is completely inadequate: whatever the input instance is, it simply predicts the class that is most frequent in the training set, ignoring the large amount of useful information in the training instances. In practice, K is usually set to a relatively small value, and the optimal K is often chosen by cross-validation (roughly speaking, using part of the samples as a training set and part as a validation set).
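
    A minimal sketch of choosing K by cross-validation with scikit-learn (the dataset and the candidate K values are illustrative):

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Try several candidate values of K and score each by 5-fold cross-validation.
    param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_, search.best_score_)
    ```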

  41. Methods to prevent overfitting

    Causes of overfitting: the algorithm's learning capacity is too strong; some assumptions (such as the samples being independent and identically distributed) may not hold; the training set is too small to estimate the distribution of the whole space.

    Remedies: 1. Early stopping: stop training if model performance does not improve significantly after several iterations. 2. More data: augment the original data, add random noise, resample. 3. Regularization, which limits the complexity of the model. 4. Cross-validation. 5. Feature selection / feature dimensionality reduction. 6. Creating a validation set is the most basic way to prevent overfitting: the goal of the model we finally train is to perform well on the validation set, not just the training set.

  42. In machine learning, why do we normalize data so often

    Machine learning models are widely used in the Internet industry, for example for ranking (see the learning-to-rank practice post: http://www.cnblogs.com/LBSer/p/4439542.html), recommendation, anti-cheating, and positioning (see the naive-Bayes-based positioning algorithm: http://www.cnblogs.com/LBSer/p/4020370.html), and so on.

    In general, most of the time spent in machine learning applications is spent on feature processing, in which a key step is to normalize feature data.

    Why normalize? Many students do not really understand the explanation given by Wikipedia: 1) normalization speeds up gradient descent's search for the optimal solution; 2) normalization may improve accuracy.

  43. What’s the least square method?

    We often say, colloquially: generally speaking, on average. If, on average, non-smokers are healthier than smokers, the word "average" is used because there are exceptions to everything: there is always someone who smokes yet, thanks to regular exercise, is healthier than non-smoking friends. One of the simplest examples of least squares is the arithmetic mean.

    The method of least squares (also called the least-squares method) is a mathematical optimization technique. It finds the best-fitting function for the data by minimizing the sum of squared errors. Least squares makes it easy to obtain unknown parameters such that the sum of squared errors between the fitted values and the actual data is minimized.

  44. Does gradient descent necessarily find the fastest direction of descent?

    The negative gradient is not necessarily the globally fastest direction of descent; it is only the fastest descending direction of the objective function on the tangent plane at the current point (for higher-dimensional problems "plane" is of course not literal). In practical implementations, the Newton direction (which takes the Hessian into account) is generally considered the better descent direction and can reach superlinear convergence, while gradient descent algorithms generally converge linearly or even sublinearly (for some problems with complex constraints). By Lin Xiaoxi (www.zhihu.com/question/30…

  45. Before introducing Bayes' theorem, we need a few definitions. Conditional probability (also called posterior probability) is the probability that event A occurs given that another event B has already occurred. It is written P(A | B), read "the probability of A given B". For example, let events A and B be subsets of the same sample space Ω; if an element chosen at random from Ω is known to belong to B, then the probability that it also belongs to A is the conditional probability of A given B, so P(A | B) = |A ∩ B| / |B|; dividing numerator and denominator by |Ω| gives P(A | B) = P(A ∩ B) / P(B)..

  46. How should we understand that decision trees and XGBoost can handle missing values, while some models, such as SVM, are sensitive to them?

    Source of this analysis: www.zhihu.com/question/58…

    First, two points to clear up the confusion: the fact that a toolkit handles missing data automatically does not mean the specific algorithm itself can handle missing entries; and for missing data, tree-based models cope better than models that rely on distance measures.

    The full answer also covers how tree models such as random forest and XGBoost handle missing values, and it ends with some suggestions for choosing models when data have missing values.

  47. What is standardization and normalization

    In simple terms, standardization processes the data column by column in the feature matrix, converting the samples' feature values to the same scale via the z-score. The general formula is (x - mean) / std, where mean is the average and std the standard deviation.

    From the formula, standardization subtracts the mean of the data attribute by attribute (column by column) and then divides by the standard deviation.

    Geometrically, this moves the origin of each axis to the mean and then rescales, i.e. a translation followed by a scaling. The result is that for each attribute (each column) the data are centered around 0 with variance 1. Each attribute/column is handled separately.
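
    A minimal NumPy sketch of the column-wise z-score described above (the sample matrix is made up for illustration); sklearn.preprocessing.StandardScaler does the same thing:

    ```python
    import numpy as np

    X = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0]])

    # z-score per column: subtract the column mean, divide by the column std.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    print(X_std.mean(axis=0))  # ~0 for every column
    print(X_std.std(axis=0))   # 1 for every column
    ```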

  48. How does random forest deal with missing values?

    Yieshah: as we all know, there are many ways to deal with missing values in machine learning, but the question "how does random forest handle missing values" indicates that the key point is how random forest specifically deals with them, so let us first give a brief introduction to random forest.

    A random forest is composed of many decision trees. The first step is to build bootstrap data sets, i.e. sample from the original data with replacement to form new data sets, which therefore contain duplicates. A decision tree is then built on each data set, but not directly with all the features: at each step only a randomly chosen subset of features is considered. Building many such trees in this way yields a random forest. To make a prediction, the data are fed into every decision tree, the judgment of each tree is observed, the predictions of all the trees are tallied, and bagging aggregates the results to obtain the final output.

    So how does a random forest deal with missing values? Given the way a random forest is built and trained, its treatment of missing values is rather special.

  49. How does a random forest assess the importance of features? There are two ways to measure variable importance: decrease in Gini and decrease in accuracy.

  50. What about the optimization of Kmeans?

    Analysis of k-means: on big data it consumes a great deal of time and memory.

    Suggestions for optimizing k-means: 1. Reduce the number of clusters K, since every sample must compute its distance to each cluster center. 2. Reduce the feature dimensionality of the samples, for example via PCA. 3. Investigate other clustering algorithms and compare their performance on toy data. 4. Use a Hadoop cluster; the k-means algorithm parallelizes easily.

  51. The selection of k value of KMeans algorithm and the center point of initial class cluster

    KMeans is the most commonly used clustering algorithm. Its main idea: given K and K initial cluster centers, assign each point (data record) to the cluster whose center is nearest; once all points are assigned, recompute each cluster's center as the mean of the points in it; then iterate the assignment and center-update steps until the cluster centers change very little or a specified number of iterations is reached.

    The idea of KMeans itself is rather simple, but choosing K and the K initial cluster centers sensibly has a large influence on the clustering result.

  52. Explain duality. An optimization problem can be viewed from two sides: the primal problem and the dual problem. Normally the dual problem gives a lower bound on the optimal value of the primal problem; when the strong duality theorem holds, the dual problem attains the optimal value of the primal. The dual problem is a convex optimization problem and can be solved well. In SVM, the primal problem is converted into its dual for solving, which in turn makes it possible to introduce the idea of kernel functions.

  53. How do you do feature selection? Feature selection is an important data-preprocessing step, mainly for two reasons: first, it reduces the number of features (the dimensionality), which strengthens generalization and reduces overfitting; second, it improves our understanding of the features and their values. Common feature-selection methods: 1. Remove features with low variance. 2. Regularization: L1 regularization produces sparse models; L2 regularization is more stable, because useful features tend to receive non-zero coefficients. 3. Random forest: for classification, Gini impurity or information gain is usually used; for regression, variance or least-squares fit. It generally requires little feature engineering or parameter tuning. Its two main problems are that important features may score very low (the correlated-features problem) and that the method favors features with many categories (the bias problem). 4. Stability selection: a newer method combining subsampling with a selection algorithm, where the selection algorithm can be regression, SVM, or something similar. The main idea is to run the feature-selection algorithm on different subsets of the data and of the features, repeat many times, and finally aggregate the selection results, for example by counting how often a feature was deemed important (the number of times it was selected as important divided by the number of subsets it was tested on). Ideally, important features score close to 100%, weaker features receive non-zero scores, and the most useless features score close to zero.

  54. How do you judge how good a classifier is? First you must know the four quantities TP, FN (an actual positive judged negative), FP (an actual negative judged positive), and TN (drawing a table, i.e. a confusion matrix, helps).
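
    The usual metrics built from these four counts, for reference (standard definitions, not excerpted from the original analysis):

    ```latex
    \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
    \text{Precision} = \frac{TP}{TP + FP}, \quad
    \text{Recall} = \frac{TP}{TP + FN}, \quad
    F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    ```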

  55. What is the physical meaning of AUC in machine learning and statistics?

    AUC is one of the most common metrics for evaluating how good a model is. The analysis of this question comes from: www.zhihu.com/question/39…

    It has three parts: the first introduces the basics of AUC, including its definition, an intuitive explanation, and the algorithm and code; the second uses logistic regression as an example to show how to train by optimizing AUC directly; the third, entirely original content from @Li-big-cat, shows how to recover the true labels from an AUC value, in other words how to reverse-engineer the AUC.

  56. 1. Fillna: (i) discrete features: fill with None; (ii) continuous features: fill with the mean; (iii) if too many values are missing, drop the column entirely. 2. Some models (such as decision trees) require discrete values. 3. Binarization of quantitative features: the core is to set a threshold, mapping values greater than the threshold to 1 and values less than or equal to it to 0 (used, for example, in image operations). 4. Pearson correlation coefficient: remove highly correlated columns.

  57. Looking at the gain, is it true that the larger alpha and gamma are, the smaller the gain? XGBoost's criterion for choosing split points is to maximize the gain. Since the traditional greedy method of enumerating every possible split point of every feature is inefficient, XGBoost implements an approximate algorithm: roughly, it lists several candidate split points according to percentiles, computes the gain of each candidate, and picks the best split point by the maximum. The gain formula consists of four terms and can be adjusted through the regularization parameters (lambda is the coefficient of the sum of squared leaf weights, gamma the coefficient of the number of leaves)..
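
    The split-gain formula referred to above, in the notation of the XGBoost paper, where G and H are the sums of first- and second-order gradients over the samples falling into the left/right child:

    ```latex
    \text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
    ```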

  58. What causes the vanishing gradient problem? (See "Yes you should understand backprop" by Andrej Karpathy.) How does ReLU solve the vanishing gradient problem? In neural-network training, the weights of the neurons are adjusted so that the network's output gets as close to the label as possible, reducing the error. The BP (backpropagation) algorithm is commonly used: its core idea is to compute the loss between the output and the label, then compute the gradient of that loss with respect to each neuron and iterate the weights. A vanishing gradient makes the weight updates slow and model training difficult. One cause is that many activation functions squeeze their outputs into a very small interval, with a gradient of 0 over wide regions at both ends of the activation function's domain, causing learning to stop.

  59. What exactly is feature engineering? First, what do most machine learning practitioners mainly do in companies? Not mathematical derivations, and not inventing lots of fancy algorithms, but feature engineering, as shown in the figure below (figure from: www.julyedu.com/video/play/…

  60. What do you know about data processing and feature engineering processing?

  61. What theory should you know when preparing for a machine learning interview?

  62. Data imbalance problem

    This is mainly due to the uneven distribution of data. Solutions are as follows:

    Sampling: oversample the minority class (possibly adding noise), downsample the majority class, or use known samples to generate new ones. Special weighting of samples, e.g. in AdaBoost or SVM. Use algorithms that are insensitive to class imbalance. Change the evaluation criterion: evaluate with AUC/ROC. Use Bagging/Boosting/ensemble methods. Take the prior distribution of the data into account when designing the model.

  63. What kind of classifier should be chosen when there are more features than data points? A linear classifier, because in high dimensions the data are generally sparse in the feature space and very likely to be linearly separable.

  64. What are the common classification algorithms? What are their strengths and weaknesses?

    The advantages of Bayesian classification: 1) few parameters need to be estimated, and it is not sensitive to missing data; 2) it has a solid mathematical foundation and stable classification efficiency.

    Disadvantages: 1) it assumes that attributes are independent of each other, which often does not hold (one may like tomatoes and like eggs, yet dislike scrambled eggs with tomato); 2) the prior probabilities must be known; 3) classification decisions carry an error rate.

  65. What are the common supervised learning algorithms? Perceptron, SVM, artificial neural network, decision tree, logistic regression

  66. Describe common optimization algorithms and their advantages and disadvantages. 1) Stochastic gradient descent: advantage: fast convergence speed per update; disadvantage: it easily falls into a local optimum. 2) Batch gradient descent: advantage: it can alleviate the local-optimum problem to some extent.

  67. What are common normalization methods for feature vectors? Linear (min-max) scaling: y = (x - minvalue) / (maxvalue - minvalue); logarithmic transform: y = log10(x); arctangent transform: y = arctan(x) * 2 / PI; subtract the mean and divide by the standard deviation: y = (x - mean) / std.

  68. The difference and connection between RF and GBDT?

    1) Similarities: Both are composed of multiple trees, and the final result is determined by multiple trees.

    2) Differences: A. The trees in a random forest can be classification or regression trees, while GBDT uses only regression trees. B. The trees of a random forest can be built in parallel, while GBDT builds them serially. C. A random forest combines trees by majority vote, while GBDT sums the outputs of the trees. D. A random forest is insensitive to outliers, while GBDT is rather sensitive to them.

    F. GBDT accumulates the results of all trees, and this accumulation cannot be done with class labels, so the trees of GBDT are all CART regression trees rather than classification trees (GBDT can still be adapted for classification, but that does not make its trees classification trees).

  69. Try to prove the formula for the distance from an arbitrary point x in the sample space to the hyperplane (w, b)
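
    The formula to be proved, for reference; the proof projects x - x0 onto the unit normal w/||w||, where x0 is any point on the hyperplane (so that w⊤x0 + b = 0):

    ```latex
    r = \frac{\lvert w^{\top} x + b \rvert}{\lVert w \rVert}
    ```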

  70. Please compare the EM algorithm, HMM, and CRF. Putting these three together is not entirely appropriate, but they are related, so they are grouped here; focus on the idea behind each algorithm. (1) The EM algorithm is used for maximum likelihood estimation, or maximum a posteriori estimation, in models with hidden variables, and consists of two steps: the E step (expectation) and the M step (maximization). It is essentially an iterative algorithm: using the parameters from the previous iteration it estimates the hidden variables, then re-estimates the parameters, repeating until convergence. Note: EM is sensitive to initial values, and it maximizes the log-likelihood by repeatedly maximizing a lower bound, so it is not guaranteed to find the global optimum. The derivation of EM should also be mastered.

  71. Why can SVM with kernel classify nonlinear problems?

    The essence of a kernel function is the inner product of the two mapped inputs. The kernel implicitly maps the data into a high-dimensional space, where the originally nonlinear problem becomes a linear one; the hyperplane SVM obtains is a linear classification plane in that high-dimensional space.
  72. Please talk about common kernel functions and their conditions

    What we usually call a kernel function is a positive-definite kernel: K is a valid kernel if and only if, for any finite set of points taken from the input space X, the Gram matrix corresponding to K is positive semi-definite. The RBF kernel is a radial basis function, a class of functions that depend on the distance between points, so the Laplace kernel is also a radial basis kernel. A key issue in SVM is choosing the type of kernel function; the commonly used kernels are mainly the linear kernel, the polynomial kernel, the radial basis (RBF) kernel, and the sigmoid kernel.
  73. Talk about Boosting and Bagging

    (1) Bagging: random forest. Random forest remedies the decision tree's tendency to overfit, mainly through two operations: 1) bootstrap sampling of the training examples (drawing samples with replacement), and 2) randomly selecting a certain number of features each time (usually about sqrt(n)). For classification, bagging uses voting to pick the most frequent class; for regression, it simply averages the outputs of the trees.
  74. Logistic regression related issues

    (1) You must be able to follow the derivation of the formulas. (2) The basic concepts of logistic regression are best analyzed from the perspective of generalized linear models: logistic regression assumes that Y follows a Bernoulli distribution. (3) L1-norm and L2-norm: sparsity really arises because the L0 norm, which directly counts the number of non-zero parameters as the penalty, is hard to optimize in practice, so the L1 norm is introduced instead. In essence, the L1 norm assumes a Laplace prior on the parameters and the L2 norm a Gaussian prior; this is the principle behind the picture-based explanation often seen online. The L1-norm problem is harder to solve and can be handled with coordinate descent or least-angle regression. (4) Comparison of LR and SVM: first, the biggest difference lies in the choice of loss function, LR using the log (logistic) loss and SVM the hinge loss; second, both are linear models; finally, SVM considers only the support vectors (the few points relevant to the classification boundary). (5) Difference between LR and random forest: random forest and other tree algorithms are nonlinear, while LR is linear; LR focuses more on global optimization, while tree models are mainly local optimizations. (6)
  75. What is collinearity and what does it have to do with overfitting?

    Collinearity: in multivariate linear regression, high correlation between variables makes the regression estimates inaccurate. Collinearity creates redundancy and leads to overfitting. Solutions: eliminate the correlation between variables / add weight regularization.
  76. What are the engineering methods for feature selection in machine learning?

    Outline: 1 What is feature engineering? 2 Data preprocessing: 2.1 making features dimensionless (2.1.1 standardization, 2.1.2 interval scaling, 2.1.3 the difference between standardization and normalization); 2.2 binarization of quantitative features; 2.3 dummy coding of qualitative features; 2.4 imputation of missing values; 2.5 data transformation; 2.6 review. 3 Feature selection: 3.1 Filter (3.1.1 variance selection, 3.1.2 correlation coefficient, 3.1.3 chi-square test, 3.1.4 mutual information); 3.2 Wrapper (3.2.1 recursive feature elimination); 3.3 Embedded (3.3.1 penalty-based feature selection, 3.3.2 tree-model-based feature selection); 3.4 review. 4 Dimensionality reduction: 4.1 principal component analysis (PCA); 4.2 linear discriminant analysis (LDA); 4.3 review. 5 Summary. 6 References. 1 What is feature engineering? It is widely said in the industry that data and features determine the upper limit of machine learning, while models and algorithms only approximate that limit. So what exactly is feature engineering? As the name implies, it is essentially an engineering activity that aims to extract features from raw data as thoroughly as possible for use by algorithms and models. Summarizing, feature engineering includes the following aspects:
  77. Use Bayesian probability to illustrate the principle of Dropout

    Recall that with Bagging, we define k different models, construct k different data sets by sampling from the training set with replacement, and then train model i on data set i. The goal of Dropout is to approximate this process over an exponential number of neural networks. Dropout training differs from Bagging training: in Bagging, all models are independent; with Dropout, the models share parameters, each model inheriting a different subset of the parent neural network's parameters. Parameter sharing makes it possible to represent an exponential number of models within a limited memory budget. In Bagging, each model is trained to convergence on its own training set. With Dropout, most models are never explicitly trained; usually the network is so large that not all possible sub-networks could be sampled before the end of the universe. Instead, a tiny fraction of the sub-networks are each trained for a single step, and parameter sharing causes the remaining sub-networks to end up with good parameter settings.
  78. For very low-dimensional features, choose a linear or nonlinear classifier?

    A nonlinear classifier: in a low-dimensional space many samples may crowd together, so the data tend not to be linearly separable. More generally: 1. if the number of features is large, comparable to the number of samples, use LR or an SVM with a linear kernel; 2. if the number of features is small and the number of samples moderate, use an SVM with a Gaussian kernel; 3. if the number of features is small and the number of samples large, manually add some features to turn it into the first case.
  79. How to deal with missing values of feature vectors

    If a feature has many missing values, simply discard it, otherwise it may bring in too much noise and adversely affect the result. If, on the other hand, a feature has few missing values (like the other features, fewer than 10% missing), there are several options: 1) treat NaN directly as a feature value, for example representing it as 0; 2) fill with the mean; 3) predict and fill the missing values with an algorithm such as random forest.
  80. Comparison of SVM, LR and decision tree

    Model complexity: SVM supports kernel functions and can handle both linear and nonlinear problems; LR is simple and fast to train, suitable for linear problems; decision trees overfit easily and need pruning. Loss functions: SVM uses the hinge loss, LR the (L2-regularized) logistic loss, and AdaBoost the exponential loss. Sensitivity to the data: SVM, with its tolerance margin, is insensitive to outliers since it only cares about the support vectors, but the data need to be normalized first; LR is sensitive to outliers. Data volume: LR is used for large data sets; SVM with a nonlinear kernel is used for small data sets with few features.
  81. The process of KNN nearest neighbor classification algorithm is briefly described

    1. Compute the distance between the test sample and every training sample (common distance measures include Euclidean distance, Mahalanobis distance, etc.); 2. sort all these distances; 3. select the k samples with the smallest distances; 4. vote using the labels of those k samples to obtain the final class.
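
    A minimal NumPy sketch of exactly these four steps (Euclidean distance, majority vote); in practice one would use sklearn.neighbors.KNeighborsClassifier:

    ```python
    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=3):
        # 1. Distance from the test sample to every training sample (Euclidean).
        dists = np.linalg.norm(X_train - x_test, axis=1)
        # 2. + 3. Sort the distances and take the k nearest samples.
        nearest = np.argsort(dists)[:k]
        # 4. Majority vote over the labels of those k samples.
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y_train = np.array([0, 0, 0, 1, 1, 1])
    print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> 1
    ```
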
  82. What are the commonly used clustering divisions? Enumeration representative algorithm

    1. Partition-based clustering: K-means, K-medoids, CLARANS. 2. Hierarchical clustering: AGNES (bottom-up), DIANA (top-down), BIRCH (CF-tree). 3. Density-based clustering: DBSCAN, OPTICS, CURE. 4. Grid-based methods: STING, WaveCluster. 5. Model-based clustering: EM, SOM, COBWEB.
  83. What is bias and variance?

    The generalization error can be decomposed into the square of the bias, plus the variance, plus the noise. Bias measures how far the learner's expected prediction deviates from the true result, characterizing the fitting ability of the algorithm itself. Variance measures how much the learner's performance changes when the same-sized training set changes, characterizing the effect of data perturbations. Noise expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task, characterizing the difficulty of the problem itself. These are commonly referred to simply as bias and variance. Generally, the more thoroughly a model is trained, the smaller the bias and the larger the variance, so the generalization error usually has a minimum somewhere in the middle.
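
    The decomposition stated above, written out (the standard textbook form):

    ```latex
    E(f; D) = \mathrm{bias}^2(x) + \mathrm{var}(x) + \varepsilon^2
    ```
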
  84. What are the solutions to bias and variance problems?

    High bias: use boosting, use a more complex model (a nonlinear model, or add layers to a neural network), add more features. High variance: use bagging, simplify the model, or apply dimensionality reduction.
  85. What models are solved by EM algorithm? Why not use Newton method or gradient descent method?

    Models typically solved with the EM algorithm include GMM and collaborative filtering; K-means is in fact a special case of EM. The EM algorithm is guaranteed to converge, but it may converge to a local optimum. Newton's method and gradient descent are awkward here because the number of summation terms grows exponentially with the number of hidden variables, which makes the gradient troublesome to compute.
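    As a small concrete example of a model fitted by EM, here is a sketch using scikit-learn's GaussianMixture on synthetic 1-D data (the data and component count are assumptions for illustration).

```python
# GaussianMixture in scikit-learn is fitted with the EM algorithm;
# this sketch fits a 2-component GMM to synthetic 1-D data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300),
                    rng.normal(3, 1.0, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print("means:", gmm.means_.ravel())   # close to -2 and 3
print("converged:", gmm.converged_)   # EM converged (possibly to a local optimum)
```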
  86. How does XGBoost score features?

    In the training process, a CART tree selects split features by the Gini index: the more often a feature is chosen for a split, the higher its score. What about XGBoost? For how a leaf node is split, the XGBoost authors give two split-finding methods in the original paper (an exact greedy algorithm and an approximate algorithm), and feature scores are built from how often a feature is used for splits and how much gain those splits bring.
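    For instance, the xgboost library exposes several importance types ("weight" counts how many times a feature is used to split, "gain" averages the loss reduction those splits achieved, "cover" measures the samples affected). A sketch, assuming the xgboost package and a toy scikit-learn dataset are available:

```python
# Sketch: feature scores in xgboost. "weight" counts how often a feature is
# used to split; "gain" averages the loss reduction those splits achieved.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

booster = model.get_booster()
print(booster.get_score(importance_type="weight"))  # split counts per feature
print(booster.get_score(importance_type="gain"))    # average gain per split
```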
  87. What is OOB? How is OOB calculated in random forest, and what are its advantages and disadvantages?

  88. For naive Bayes classification, derive P(c | d), the probability that document d (consisting of several words) belongs to category c, and indicate which probabilities in the formula can be estimated from the training set

  89. Please write down the feature engineering operations you know of in machine learning and what each one means

  90. Please write down your understanding of the VC dimension

  91. How to determine the value of K in K-means clustering

  92. Implement linear regression in Python, and think about a more efficient way to do it

  93. How to understand “the various models of machine learning correspond to their respective loss functions”?

  94. You are given a training data set with 1000 columns and 1 million rows for a classification problem. The manager asks you to reduce the dimensionality of this data set to cut the model's computation time. Your machine has limited memory. What would you do? (You are free to make all kinds of practical assumptions)

  95. Is it necessary to do a rotation transformation in PCA? If so, why? What happens if you do not rotate the components?

  96. You are given a data set whose missing values spread within 1 standard deviation of the median. What percentage of the data will remain unaffected? Why?

  97. Here’s a data set for cancer detection. You’ve built the classification model with 96% accuracy. Why are you still dissatisfied with the performance of your model? What can you do?

  98. Explain the prior probability, likelihood estimation and marginal likelihood estimation in naive Bayes algorithm?

  99. You are working on a time series data set. The manager asks you to build a high-accuracy model. You start with a decision tree algorithm because you know it works well on all kinds of data. Later, you try a time series regression model and get better accuracy than the decision tree. Can this happen? Why?

  100. You have been assigned a new project about helping a food delivery company save more money. The problem is that the company's delivery team cannot deliver food on time, so their customers are unhappy, and in the end, to keep the customers happy, they have to give the meals away for free. Which machine learning algorithm can save them?

  101. You realize that your model suffers from low bias and high variance. Which algorithm should be used to solve the problem? Why is that?

  102. You are given a data set. It contains many variables, some of which you know are highly correlated. The manager has asked you to run PCA. Would you remove the correlated variables first? Why?

  103. After a few hours of work, you are now in a rush to build a high-accuracy model. You build five GBMs (gradient boosted models), thinking boosting will work its magic. Unfortunately, none of the models performs better than the benchmark model. Finally, you decide to combine these models. Although combined models are usually known for high accuracy, you are out of luck. Where did you go wrong?

  104. What is the difference between KNN and KMEANS clustering?

  105. What is the relationship between the true positive rate and recall? Write the equation.

  106. After analyzing your model, the manager tells you that your model suffers from multicollinearity. How would you verify that he is telling the truth? Can you build a better model without losing any information?

  107. When is Ridge regression superior to Lasso regression?

  108. How do I select important variables on a data set? Give an explanation.

  109. Both Gradient Boosting algorithm (GBM) and Random Forest are tree-based algorithms. What are the differences between them?

  110. It’s easy to run binary classification tree algorithms, but do you know how a tree splits, how the tree decides which variables to assign to which root and subsequent nodes?

  111. You have a data set where the number of variables p is greater than the number of observations n. Why is OLS a bad choice? What is the best technique to use? Why? (Other methods include subset regression and forward stepwise regression.)

  112. What is a convex hull? (Hint: think of SVM.)

  113. We know that the OneHotEncoder increases the dimension of the dataset. But LabelEncoder does not. Why is that?

  114. What cross-validation technique would you use on a time series data set? k-fold or LOOCV?

  115. You are given a data set in which some variables have more than 30% missing values; for example, 8 of the 50 variables have more than 30% of their values missing. How do you deal with that?

  116. “The customer who bought this also bought…….” Amazon’s advice is the result of what algorithm?

  117. How do you understand Type I and Type II errors?

  118. While solving a classification problem, you randomly split the training data into a training set and a validation set. You are confident that your model will perform well on unseen data because your validation accuracy is high. However, you are disappointed after it gets poor accuracy on new data. What went wrong?

  119. Please briefly describe the advantages and disadvantages of decision trees, regression, SVM, neural networks and other common algorithm families: regularization algorithms, ensemble algorithms, decision tree algorithms, regression algorithms, artificial neural networks, deep learning, support vector machines, dimensionality reduction algorithms, clustering algorithms, instance-based algorithms, Bayesian algorithms, association rule learning algorithms, and graphical models.

  120. What are the steps to correct and clean up data before applying machine learning algorithms?

  121. What is the K-means clustering algorithm?

  122. How to understand the over-fitting and under-fitting of models, and how to solve them?

  123. Please elaborate on text feature extraction

  124. Please elaborate on image feature extraction

  125. If you know XGBoost, please explain how it works

  126. Please explain the principle of the gradient boosting decision tree (GBDT) in detail

  127. Please talk about the principle and derivation of Adaboost algorithm

  128. What are the L0, L1, and L2 norms in machine learning?

  129. Please elaborate on the principle of decision tree construction

  130. How to determine the number of LDA topics?

  131. Does sklearn's random forest feature importance favor numerical variables? When I worked on Kaggle's Titanic problem, I used random forest and XGBoost and found that the importance of two numerical variables was very high, much higher than gender, which data analysis suggests should be important. According to the sklearn documentation, feature importance is ranked by each feature's contribution to the reduction of impurity. I also found a paper online suggesting that feature selection based on impurity reduction is biased towards variables with more categories. The sklearn documentation states that its decision trees are all CART trees, and in a CART tree a feature with more distinct values offers more candidate split points. Is this why random forests favor numerical features?
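    One way to check this suspicion is to compare the impurity-based importances with permutation importances, which are computed on held-out data and are not biased toward continuous or high-cardinality features. The sketch below uses scikit-learn on a synthetic data set (an informative binary "sex"-like feature and an uninformative continuous feature) rather than the actual Titanic data.

```python
# Compare impurity-based feature_importances_ with permutation importance.
# Impurity-based importance tends to credit features that offer many split
# points (continuous / high-cardinality); permutation importance does not.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
sex = rng.integers(0, 2, n)                     # informative binary feature
noise_cont = rng.normal(size=n)                 # uninformative continuous feature
y = (sex ^ (rng.random(n) < 0.1)).astype(int)   # label mostly follows `sex`
X = np.column_stack([sex, noise_cont])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("impurity importances  :", rf.feature_importances_)   # noise gets some credit
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importances:", perm.importances_mean)    # noise is near 0
```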

  132. Continuous features can be either discretized or scaled. In which scenarios is each of these two processing methods applicable?

  133. Can you explain geometrically why the Lagrange multiplier method yields the optimal value?

  134. In-depth understanding of the mathematics of A/B testing

  135. How to design a 100-day machine learning introduction plan more scientifically

  136. How to understand principal component analysis PCA

  137. How to understand LightGBM in general

  138. Linear regression requires the dependent variable to be normally distributed. Is that right?

  139. What are k-nearest Neighbor algorithms and KD trees?

  140. How to understand Bayesian methods and Bayesian networks intuitively?

  141. Mathematical derivation of maximum entropy model

  142. What are the advantages of XGBoost's use of a Taylor expansion? A second-order Taylor expansion of the loss yields an objective in which only the first and second derivatives of the loss (with respect to the current prediction) appear as coefficients, so leaf-split optimization can be carried out relying only on these values computed from the input data, without committing to a specific form of the loss function; in essence, the choice of loss function is decoupled from the model/algorithm optimization and parameter selection. Why can the leaf-split optimization be computed relying only on the input data, without selecting the specific form of the loss function?
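    For reference, the second-order expansion from the XGBoost paper, where g_i and h_i are the first and second derivatives of the loss with respect to the previous round's prediction:

```latex
% Second-order Taylor expansion of the objective at iteration t
% (constant terms in the previous-round loss are dropped):
\[
\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n}\Bigl[\, g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Bigr] + \Omega(f_t),
\quad
g_i = \partial_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr),\;
h_i = \partial^{2}_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)
\]
% For a fixed tree structure with leaves j = 1..T and instance sets I_j,
% the optimal leaf weight and objective need only the sums of g_i and h_i:
\[
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
\qquad
\mathrm{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{\bigl(\sum_{i \in I_j} g_i\bigr)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T
\]
```

    The split gain is just a difference of such objective values, so the whole split search needs only the g_i and h_i statistics; the loss merely has to be twice differentiable rather than of any particular analytic form.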

  143. Have you used other models yourself and tuned their hyperparameters? Can you describe the basic tuning process, taking XGBoost as an example?

  144. What are the differences between XGBoost and GBDT?

  145. How is the importance of XGB features judged?

  146. What does XGB’s pre-sort algorithm do?

  147. Which is more sensitive to outliers, RF or XGBoost?

  148. When does XGB stop splitting?

  149. Compare how XGB and LightGBM split nodes

  150. Briefly describe LightGBM's advantages and disadvantages compared with XGBoost

  151. Is XGBoost sensitive to missing feature values? How does it handle missing values, and what problems does this approach have?

  152. What are the differences between XGB and LGB in feature parallelism and data parallelism?

  153. Why doesn't XGBoost use post-pruning?

 

Afterword.

By working through these interview questions you can get started with machine learning. But if you want to go further, land a job, and get promoted, we again recommend the machine learning training camp: t.csdnimg.cn/aTHV. If you want to audit the training camp's SVM and XGBoost content for free, you can add this enterprise wechat (remark: training camp free audition).