By JP Tech

Compiled by: ronghuaiyang

Takeaway

These 12 questions are among the most popular in current deep learning interviews. They cover the fundamentals, yet they also reveal a candidate's depth and help distinguish levels of expertise. Both interviewers and job seekers should find them useful.

These are the questions I often ask when interviewing for AI engineer positions. Not every interview uses all of them, since the choice depends on the candidate's previous experience and projects. Through many interviews, especially with students, I have collected the 12 most popular deep learning interview questions, and I will share them with you in this post. I hope to receive many of your comments. All right, without further ado, let's get started.

1. Explain the significance of Batch Normalization

This is a very good question because it covers most of what a candidate needs to know when working with neural network models. You can answer it in different ways, but the following points need to be made:

Batch Normalization is an effective technique for training neural network models. Its objective is to normalize the features (the outputs of each layer after activation) to zero mean and unit standard deviation. The question, then, is how a non-zero mean affects model training:

  • First, with a non-zero mean the data is not distributed around zero; most values are either greater than or less than zero. Combined with high variance, the data becomes very large or very small, a problem that is common when training networks with many layers. When features are not kept in a stable interval, the optimization of the network suffers. As is well known, optimizing a neural network requires computing derivatives. For a simple layer y = Wx + b, the derivative of y with respect to W is dy/dW = x, so the value of x directly affects the value of the derivative (of course, gradients in a real neural network are not this simple, but in principle x still influences the derivative). If x changes in an unstable way, its derivatives may become too large or too small, making learning unstable. This also means that with Batch Normalization we can use higher learning rates during training.
  • Batch Normalization keeps the values of x from saturating the nonlinear activation functions, ensuring that activations are never too high or too low. This helps the weights get learned: weights that might otherwise rarely receive useful gradients can now be updated, and we become less dependent on the initial values of the parameters.
  • Batch Normalization also acts as a form of regularization that helps reduce overfitting. With Batch Normalization we do not need as much dropout, which makes sense because we do not have to worry about losing too much information; in practice, though, it is still advisable to combine the two techniques. A minimal sketch of the normalization step appears after this list.
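To make the mechanism concrete, here is a minimal sketch (my own illustration, not the article's code) of the core batch-normalization computation in NumPy; the names x, gamma, beta, and eps are illustrative:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)               # per-feature mean over the batch
    var = x.var(axis=0)                 # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Example: a batch of 4 samples with 3 features, deliberately far from zero mean
x = np.random.randn(4, 3) * 10 + 5
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))  # approximately 0 and 1
```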

2. Explain the tradeoff between bias and variance

What is bias? Bias is, simply put, the difference between the model's average prediction and the true value we are trying to predict. A model with high bias pays too little attention to the training data; it is too simple and fails to reach good accuracy on either the training or the test set. This phenomenon is known as underfitting.

Variance can be understood as the spread of the model's outputs around a data point. The larger the variance, the more closely the model fits the training data and the worse it generalizes to data it has never seen. The model then achieves very good results on the training set but very poor results on the test set; this is the phenomenon of overfitting.

The relationship between these two concepts can be seen in the following figure:

In the figure above (the classic bullseye diagram), the center of the circle represents a model that predicts the true values perfectly. In practice we never find such a model; as we move away from the center, our predictions get worse and worse.

We can adjust the model so that as many of its predictions as possible fall near the center of the circle. There is a balance to strike between bias and variance: if the model is too simple and has few parameters, it tends to have high bias and low variance.

On the other hand, if the model has a very large number of parameters, it tends to have high variance and low bias. This tradeoff is the basis for reasoning about model complexity when we design an algorithm.

3. If you already have 10 million face vectors from a deep learning model, how do you search for a new face by query?

This question is about applying deep learning algorithms in practice. The key point is the method used to index the data. This is the final step of applying One Shot Learning to face recognition, but it is the most important step for deploying the system in practice.

Basically, to answer this question you should first describe the general One Shot Learning approach to face recognition: each face is converted into a vector, and recognizing a new face means finding the stored vector that is closest (most similar) to the vector of the input face. Typically, a deep learning model trained with the Triplet Loss function is used to produce these vectors.

However, as the number of images grows, computing the distance to all 10 million vectors for every recognition request is not a sensible solution; it makes the system far too slow. To make the query efficient, we need to consider ways of indexing the data in the real vector space.

The main idea of these approaches is to divide the data into structures (perhaps similar to tree structures) that make it easy to query new data. When new data is available, querying in the tree helps to quickly find the closest vector.

Several methods can be used for this purpose, such as Locality Sensitive Hashing (LSH), approximate nearest neighbor indexing (for example, the Annoy library), and so on.
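As a hedged illustration (my own, not the article's code), here is how an approximate nearest-neighbor index could be built with the Annoy library; the 128-dimensional embeddings, the item count, and the file name are placeholder assumptions:

```python
from annoy import AnnoyIndex
import numpy as np

DIM = 128                               # hypothetical embedding dimension
index = AnnoyIndex(DIM, "angular")      # angular distance behaves like cosine similarity

# Add the stored face vectors (random placeholders here; in practice the 10 million embeddings)
for i in range(10_000):
    index.add_item(i, np.random.randn(DIM).tolist())

index.build(10)                         # build 10 trees; more trees give better recall, slower build
index.save("faces.ann")                 # persist the index to disk

# Query time: find the 5 stored faces closest to a new face embedding
query_vector = np.random.randn(DIM).tolist()
print(index.get_nns_by_vector(query_vector, 5))
```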

4. For classification problems, is accuracy a completely reliable metric? What metrics do you usually use to evaluate your models?

There are many ways to evaluate a classification model. For accuracy, the formula simply divides the number of correctly predicted data points by the total number of data points. That sounds reasonable, but in reality this quantity is not informative enough when the data is imbalanced. Suppose we are building a model that predicts network attacks (assuming attack requests account for about 1/100,000 of all requests).

If the model simply predicts that every request is normal, its accuracy is 99.999%, which shows why accuracy alone is often unreliable for classification. The accuracy figure tells us what percentage of the data is predicted correctly, but it says nothing about how each class is classified. Instead, we can use the confusion matrix. Basically, the confusion matrix shows how many data points actually belong to a class and how many are predicted to belong to it. It has the following form:

In addition, to show how the true positive and false positive rates change with the threshold that defines the classification, we can use the ROC curve. From the ROC curve, we can tell whether the model is effective or not.

The ideal ROC curve is the one closest to the upper left corner: the true positive rate is high while the false positive rate stays low.
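As a hedged example with scikit-learn (not from the original article), the metrics discussed above can be computed like this; the label and score arrays are made-up placeholders:

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Placeholder labels for an imbalanced binary problem: 1 = attack (rare class)
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]                      # hard predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))          # rows: actual class, columns: predicted class
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))
```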

5. How do you understand backpropagation? Explain how it works.

This question is designed to test your knowledge of how neural networks actually work. You need to make the following points clear:

  • The forward pass computes, layer by layer, the outputs produced with the current weights and finally yields a prediction y_p. At this point the value of the loss function is computed; it shows how good the model is. If the loss is not good enough, we need a way to reduce it: training a neural network is essentially minimizing the loss function. The loss L(y_p, y_t) expresses how far the model output y_p deviates from the actual label y_t.
  • To reduce the loss, we need derivatives. Backpropagation computes the derivatives for every layer of the network, and based on these per-layer derivatives, an optimizer (Adam, SGD, AdaDelta, ...) applies gradient descent to update the weights of the network.
  • Backpropagation applies the chain rule of derivatives to compute the gradient of each layer, proceeding from the last layer back to the first. A small worked sketch follows this list.
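As a minimal, hedged sketch (not the article's code), here is backpropagation for a single layer y = Wx + b with a squared-error loss, written out in NumPy; all shapes and values are illustrative:

```python
import numpy as np

x   = np.random.randn(4, 3)           # 4 samples, 3 input features
y_t = np.random.randn(4, 1)           # actual labels
W   = np.random.randn(3, 1)
b   = np.zeros(1)
lr  = 0.1

for step in range(100):
    # Forward pass: compute the model output and the loss L(y_p, y_t)
    y_p = x @ W + b
    loss = ((y_p - y_t) ** 2).mean()

    # Backward pass: chain rule from the loss back to each parameter
    grad_y = 2 * (y_p - y_t) / len(x)   # dL/dy_p
    grad_W = x.T @ grad_y               # dL/dW
    grad_b = grad_y.sum(axis=0)         # dL/db

    # Gradient descent step (what an optimizer such as SGD does with these gradients)
    W -= lr * grad_W
    b -= lr * grad_b

print("final loss:", loss)
```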

6. What is the meaning of the activation function? What is the saturation interval of an activation function?

The meaning of the activation function

Activation functions exist to break the linearity of a neural network. They can be understood simply as a filter that decides whether, and how strongly, information passes through a neuron. During training, the activation function plays an important role in shaping the gradients. Some activation functions, such as Sigmoid, Tanh, and ReLU, are discussed further in the following sections.

However, we need to understand that it is these nonlinear properties that allow neural networks to learn representations of functions far more complex than anything a purely linear model can express. Most activation functions are continuously differentiable.

These functions are continuous and differentiable (they have a derivative at almost every point of their domain), so a small change in the input produces a small change in the output. As mentioned above, the computation of derivatives is crucial; it determines whether our neurons can be trained at all. A few commonly mentioned activation functions are Sigmoid, Softmax, and ReLU.

The saturation interval of the activation function

Tanh, Sigmoid, ReLU and other nonlinear activation functions all have saturation intervals.

It is easy to understand: the saturation interval of an activation function is the range of inputs over which the output of the function stops changing even when the input changes. This interval causes two problems. First, during the forward pass, if the values feeding a layer fall into the saturation interval of the activation function, the layer gradually produces many identical outputs.

This leads to the same data flowing through the rest of the model, a phenomenon known as covariate shift. The second problem is that during backpropagation the derivative is zero in the saturated region, so the network learns almost nothing. This is why we want values to stay around zero mean, as discussed in the Batch Normalization section. The small experiment after this paragraph illustrates the effect.
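A hedged numerical illustration (my own, not from the article): the derivative of the sigmoid collapses toward zero once the input enters the saturation region, which is exactly why gradients vanish there:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # derivative of the sigmoid

for x in [0.0, 2.0, 5.0, 10.0, 20.0]:
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.6f}  gradient = {sigmoid_grad(x):.2e}")
# As x grows, sigmoid(x) saturates near 1 and the gradient shrinks toward 0.
```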

7. What are the hyperparameters of a model? How do they differ from parameters?

What are the parameters of the model?

Going back to the essence of machine learning: we need a data set to learn from, for how could the machine learn without data? Once the data is available, the machine has to find the connections between the inputs and outputs buried in that pile of data.

Let's say our data consists of weather measurements such as temperature and humidity, and what we ask the machine to do is find a connection between those factors and whether your wife is angry or not. It sounds unrelated, but machine learning can be surprisingly literal about such tasks. Now suppose we use a variable y to indicate whether the wife is angry, and variables x1, x2, x3, ... to represent the weather factors. We can boil the relationship down to a function f(x) of the following form:
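(The figure containing the formula is not reproduced here; the standard linear form it expresses is, roughly, y = w1*x1 + w2*x2 + w3*x3 + ... + b, where b is a bias term.)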

You see the coefficients w1, w2, w3, ...? These are exactly the relationships between the data and the factors we are looking for, the so-called model parameters. Therefore, we can define the model parameters as follows:

Model parameters are values generated from the training data that help express the relationships between the quantities in the data.

Therefore, when we say we have found the best model for a problem, we really mean that we have found the model parameters that best fit the problem on the existing data set. Model parameters have the following characteristics:

  • They are used to predict new data
  • They indicate the capability of the model, which is usually expressed through an evaluation metric such as accuracy
  • They are learned directly from the training data set
  • They usually do not need to be set manually

Model parameters can take many forms, such as weights of neural networks, support vectors in support vector machines, and coefficients in linear regression or logistic regression algorithms.

What are the hyperparameters of the model?

We often assume that a model hyperparameter looks just like a model parameter, but that is not true; the two concepts are completely separate. While model parameters are derived from the training data itself, model hyperparameters are entirely different: they live outside the model and do not depend on the training data. What are they for? They have the following roles:

  • They are used to help the model find the most suitable parameters during training
  • They are usually chosen by hand by the people training the model
  • They can be defined based on several heuristic strategies

For a particular problem, we have no way of knowing in advance what the best hyperparameter values are. In practice we therefore need techniques such as grid search to estimate a good range of values (for example, the coefficient k in the k-nearest neighbors model); a sketch appears after the examples below. Here are a few examples of model hyperparameters:

  • The learning rate when training an artificial neural network
  • The C and sigma parameters when training a support vector machine
  • The coefficient k in the k-nearest neighbors model
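As a hedged sketch (not from the article), a grid search over two SVM hyperparameters with scikit-learn might look like this; the data set and the candidate values are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values: chosen by hand, never learned from the data
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)              # the parameters (support vectors, coefficients) are learned here
print(search.best_params_)    # the best hyperparameter combination found by the search
```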

8. What happens when the learning rate is too high or too low?

When the learning rate is set too low, training proceeds very slowly because each update makes only a tiny change to the weights; many updates are needed before a local optimum is reached.

If the learning rate is set too high, the model may fail to converge because the weight updates are too large. In a single update step the model can jump right past the local optimum, making it hard to settle on the optimal point later; instead it keeps bouncing around it. A small numerical illustration follows.
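A hedged toy illustration (my own) of both failure modes, using gradient descent on the one-dimensional function f(w) = w^2, whose minimum is at w = 0:

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print("lr = 0.001:", gradient_descent(0.001))  # too low: w barely moves from 5.0
print("lr = 0.1:  ", gradient_descent(0.1))    # reasonable: w approaches 0
print("lr = 1.1:  ", gradient_descent(1.1))    # too high: w overshoots and diverges
```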

9. When the image size is doubled, by how many times does the number of CNN parameters increase? Why?

This question is deliberately misleading, because most candidates focus on computing how many times the number of parameters would grow. Instead, let's look at the architecture of a CNN:

We can see that the number of parameters in a CNN model depends on the number and size of the filters, not on the input image. Therefore, doubling the size of the image does not change the number of parameters in the model at all. The quick check below makes this concrete.
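A hedged check with PyTorch (used here purely for illustration): the parameter count of a convolutional layer is (kernel_h * kernel_w * in_channels + 1) * out_channels, independent of the input resolution:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # (3*3*3 + 1) * 16 = 448 parameters

small = torch.randn(1, 3, 32, 32)     # original image size
large = torch.randn(1, 3, 64, 64)     # image size doubled
conv(small); conv(large)              # both pass through the same layer
print(sum(p.numel() for p in conv.parameters()))  # still 448: unchanged by input size
```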

10. How to deal with unbalanced data?

This question tests how a candidate approaches real data problems. Real data and standard benchmark data sets often differ greatly in their attributes and volume (standard data sets usually need no adjustment). With real data sets, class imbalance is common, that is, the amount of data differs greatly between classes. We can then consider the following techniques:

  • Choose the right metrics to evaluate the model: using accuracy on an imbalanced data set is dangerous, as described in the previous sections. Appropriate metrics such as precision, recall, F1 score, and AUC should be chosen instead.
  • Resample the training data set: besides using different evaluation metrics, one can use several techniques to obtain a different data set. There are two ways to turn an imbalanced data set into a balanced one, namely undersampling and oversampling, with specific techniques such as repetition, bootstrapping, or SMOTE (a synthetic minority oversampling technique).
  • Ensemble many different models: creating more data to generalize the model is not always feasible in practice. For example, suppose you have two classes, a rare class with 1,000 samples and a large class with 10,000 samples. Instead of trying to find 9,000 extra samples for the rare class, we can consider training 10 models, each trained on the 1,000 rare-class samples plus a different 1,000 samples from the large class, and then combine their outputs with ensemble techniques.

  • Redesign the model and the loss function: use penalty techniques (for example, class weights in the cost function) so that mistakes on the rare class are penalized more heavily, which helps the model learn the rare class better and makes the loss function cover the classes more evenly. A small resampling sketch follows this list.
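A hedged sketch (not the article's code) of simple oversampling with scikit-learn's resample utility; the DataFrame and its column names are assumptions:

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: class 0 is the large class (10,000 rows), class 1 is the rare class (1,000 rows)
df = pd.DataFrame({"feature": range(11_000),
                   "label": [0] * 10_000 + [1] * 1_000})

majority = df[df.label == 0]
minority = df[df.label == 1]

# Oversample the rare class with replacement until it matches the majority size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced.label.value_counts())   # both classes now have 10,000 rows
```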

11. What do the concepts Epoch, Batch, and Iteration mean when training deep learning models?

These are very basic concepts for training neural networks, but in fact many candidates have trouble distinguishing between them. Specifically, you should cover the following:

  • Epoch: one full pass over the entire training data set.
  • Batch: when we cannot feed the entire data set into the network at once, we divide it into several smaller chunks; each chunk is a batch.
  • Iteration: the number of batches needed to complete one epoch. Suppose there are 10,000 images and the batch size (batch_size) is 200; then one epoch consists of 50 iterations (10,000 divided by 200), as the small check below shows.
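A trivial, hedged check of that arithmetic (an illustration, not part of the original article):

```python
import math

num_samples = 10_000
batch_size = 200

iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)   # 50 iterations make up one epoch
```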

12. What is the concept of a data generator? When do we need to use it?

Data generators are very important when writing training code: a generator function produces the data on the fly and feeds it to the model batch by batch during training.

Generator functions are especially helpful when training on big data, because the data set never needs to be fully loaded into RAM. Loading everything at once wastes memory, can overflow memory if the data set is too large, and makes processing the input data take longer. A minimal sketch follows.
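A minimal, hedged sketch of such a generator; load_image is a hypothetical helper, stubbed out so the example stays self-contained:

```python
import numpy as np

def load_image(path):
    """Hypothetical helper: read and preprocess one image file.
    Stubbed with random pixels here so the sketch runs on its own."""
    return np.random.rand(224, 224, 3)

def data_generator(image_paths, labels, batch_size=32):
    """Yield one batch at a time so the full data set never has to sit in RAM."""
    while True:                          # loop forever; the training loop decides when to stop
        for start in range(0, len(image_paths), batch_size):
            batch_paths = image_paths[start:start + batch_size]
            batch_labels = labels[start:start + batch_size]
            yield (np.stack([load_image(p) for p in batch_paths]),
                   np.asarray(batch_labels))

# Usage sketch (for example with Keras): model.fit(data_generator(paths, labels), steps_per_epoch=50)
```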

Conclusion

These are the 12 deep learning interview questions I often ask candidates. However, depending on the candidate, the questions are phrased differently, and some follow-up questions arise spontaneously from the candidate's own answers.

Although this article focuses on technical questions, attitude matters just as much in an interview; in my personal opinion it determines 50% of the outcome. So, in addition to accumulating knowledge and skills, always present yourself sincerely, with a desire to improve and with humility, and you are sure to succeed in any conversation. I hope you land the position you want soon.

— the END —

Original English article:

medium.com/@itchishiki…

medium.com/@itchishiki…

medium.com/@itchishiki…
