- The 10 Statistical Techniques Data Scientists Need to Master
- Originally written by James Le
- The Nuggets Translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: HearFishle
- Proofreader: Mymmon, Hu7May
Ten Statistical Techniques Data Scientists Need to Master
No matter where you stand on whether data science is "sexy," it's hard to ignore the fact that data, and our ability to analyze, organize, and contextualize it, are increasingly important. Drawing on its huge trove of job data and employee feedback, Glassdoor ranked data scientist No. 1 on its list of the 25 best jobs in the US. So while the role is here to stay, there is no doubt that the specific tasks data scientists perform will evolve. With technologies like machine learning becoming ever more common, and emerging fields like deep learning gaining significant traction among researchers, engineers, and the companies that hire them, data scientists will continue to ride the wave of innovation and technological progress.
While strong coding skills are important, data science isn't all about software engineering (in fact, if you're comfortable with Python, you're good to go). Data scientists live at the intersection of coding, statistics, and critical thinking. As Josh Wills put it, "a data scientist is a person who is better at statistics than any programmer and better at programming than any statistician." I personally know too many software engineers looking to transition into data science who blindly apply machine learning frameworks such as TensorFlow or Apache Spark to their data without a thorough understanding of the statistical theory behind them. Hence the need for statistical learning, a theoretical framework for machine learning drawing from the fields of statistics and functional analysis.
Why study statistical learning? It is important to understand the ideas behind the various techniques so that you know how and when to use them. One has to understand the simpler methods first in order to grasp the more sophisticated ones. It is also important to accurately assess a method's performance, to know how well or how badly it is working. In addition, this is an exciting research area with important applications in technology, industry, and finance. Ultimately, statistical learning is a fundamental ingredient in the training of a modern data scientist. Examples of statistical learning problems include:
- Identify risk factors for prostate cancer.
- Classify recorded phonemes based on a log-periodogram.
- Predict whether someone will have a heart attack based on demographic, dietary and clinical measurements.
- Customize an email spam detection system.
- Recognize handwritten zip codes.
- Classifying tissue samples into one of several types of cancer.
- Establish a relationship between salary and demographic variables in census data.
In my last semester of college, I did a self-study on data mining. The material covered three books: Intro to Statistical Learning (Hastie, Tibshirani, Witten, James), Doing Bayesian Data Analysis (Kruschke), and Time Series Analysis and Applications (Shumway, Stoffer). I did a lot of exercises on Bayesian analysis, Markov chains, hierarchical modeling, and supervised and unsupervised learning. This experience deepened my interest in the academic field of data mining and convinced me to specialize further in it. Recently, I completed the Intro to Statistical Learning online course on Stanford Lagunita, which covers all the material in the Intro to Statistical Learning book. Having been exposed to the content twice, I want to share the 10 statistical techniques from the book that I believe any data scientist should learn to work with large data sets more effectively.
Before I get started on these ten techniques, I want to distinguish between statistical learning and machine learning. I have written about machine learning before, so I am fairly confident I can justify the differences:
- Machine learning is a branch of artificial intelligence.
- Statistical learning is a branch of statistics.
- Machine learning has a strong emphasis on big data and predictive accuracy.
- Statistical learning emphasizes models and their interpretability, accuracy, and uncertainty.
- But the lines are blurring, and there is a lot of cross-discipline.
- Machine learning is more marketable!
1 — Linear regression:
In statistics, linear regression is a method for predicting a target variable by fitting the best linear relationship between the dependent and independent variables. The best fit is the one for which the sum of the distances between the fitted values and the actual observations at each point is as small as possible; the fit is "best" in the sense that, given the chosen shape, no other position would produce less error. The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression uses a single independent variable to predict a dependent variable by fitting the best linear relationship. Multiple linear regression uses more than one independent variable to predict a dependent variable by fitting the best linear function.
Pick any two related things from your daily life. For example, I have data on my monthly income, my monthly spending, and the number of trips I made per month for the past three years. Now I want to answer the following questions:
- What will my monthly expenses be next year?
- Which factor (monthly income or number of trips per month) is more important in determining my monthly expenses?
- What is the correlation between monthly income and monthly trips and monthly expenses?
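To make the questions above concrete, here is a minimal sketch in Python with scikit-learn (my addition, not from the original article, using made-up income/trip/expense numbers purely for illustration) of how a multiple linear regression would address them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly records: [income in USD, trips taken] -> expenses in USD
X = np.array([[5000, 2], [5200, 3], [4800, 1], [6000, 4], [5500, 2], [5100, 3]])
y = np.array([3200, 3400, 3000, 3900, 3500, 3350])

model = LinearRegression().fit(X, y)
print("coefficients (income, trips):", model.coef_)    # relative importance of each factor
print("intercept:", model.intercept_)
print("R^2 on this data:", model.score(X, y))           # strength of the linear relationship
print("predicted expenses for income=5800, trips=3:", model.predict([[5800, 3]]))
```

The fitted coefficients speak to the second and third questions (which factor matters more and how strongly they relate to spending), while the prediction line answers the first.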
2 — Classification:
Classification is a data mining technique that assigns categories to a collection of data in order to aid more accurate prediction and analysis. Sometimes called decision tree methods, classification is one of several methods intended to make the analysis of very large data sets effective. Two classification techniques stand out: logistic regression and discriminant analysis.
Logistic regression is the appropriate regression analysis to use when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Types of questions that logistic regression can examine include:
- How does the probability of getting lung cancer (yes or no) change for every additional pound of body weight and for every extra pack of cigarettes smoked per day?
- Do body weight, calorie intake, fat intake, and participant age have an influence on heart attacks (yes or no)?
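As a hedged illustration (my addition), the snippet below fits a logistic regression to scikit-learn's built-in breast cancer data set, a stand-in for any binary yes/no outcome like the ones above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary outcome (malignant vs. benign) stands in for any yes/no question
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("log-odds coefficients (first 5 predictors):", clf.coef_[0][:5])
```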
In discriminant analysis, two or more groups, clusters, or populations are known a priori, and one or more new observations are classified into one of the known populations based on their measured characteristics. Discriminant analysis models the distribution of the predictors X separately in each response class, and then uses Bayes' theorem to flip these into estimates of the probability of each response class given a value of X. Such models can be either linear or quadratic.
- Linear discriminant analysis (LDA) classifies observations by computing a discriminant score for each one; these scores are obtained by finding linear combinations of the independent variables. It assumes that the observations within each class are drawn from a multivariate Gaussian distribution and that the covariance of the predictor variables is common across all k levels of the response variable Y.
- Quadratic discriminant analysis (QDA) provides an alternative. Like LDA, QDA assumes that the observations in each class of Y are drawn from a Gaussian distribution. Unlike LDA, however, QDA assumes that each class has its own covariance matrix. In other words, the covariance of the predictor variables is not assumed to be common across the k levels of Y.
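To see how the two flavours differ in practice, here is a small sketch (my addition, using the iris data as a generic multi-class example) comparing LDA's shared-covariance assumption with QDA's per-class covariance:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis()      # assumes one covariance matrix shared by all classes
qda = QuadraticDiscriminantAnalysis()   # estimates a separate covariance matrix per class

print("LDA 5-fold accuracy:", cross_val_score(lda, X, y, cv=5).mean())
print("QDA 5-fold accuracy:", cross_val_score(qda, X, y, cv=5).mean())
```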
3 — Resampling methods:
Resampling is the method of drawing repeated samples from the original data sample. It is a nonparametric method of statistical inference: in other words, resampling does not involve using generic distribution tables to compute approximate p probability values.
Resampling generates a unique sampling distribution on the basis of the actual data. It uses experimental rather than analytical methods to generate this sampling distribution, and it yields unbiased estimates because it is based on unbiased samples of all the possible results of the data studied by the researcher. To understand the concept of resampling, you should understand the terms bootstrapping and cross-validation:
- Bootstrapping is used in a variety of contexts, such as validating the performance of a predictive model, ensemble methods, and estimating the bias and variance of a model. It works by sampling the original data with replacement and treating the data points that were not chosen ("out-of-bag" points) as test cases. We can repeat this several times and use the average score as an estimate of the model's performance.
- Cross-validation, on the other hand, validates model performance by splitting the training data into k parts. We use k − 1 parts as the training set and the held-out part as the test set. This is repeated k times, holding out a different part each time, and the average of the k scores is taken as the performance estimate.
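A compact sketch of both ideas (my addition, using scikit-learn's diabetes data set as a stand-in) might look like this:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# k-fold cross-validation: train on k-1 folds, score on the held-out fold, average
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("5-fold CV R^2:", cv_scores.mean())

# Bootstrap: resample rows with replacement, score on the rows that were never drawn
rng = np.random.default_rng(0)
n = len(y)
boot_scores = []
for _ in range(200):
    idx = rng.integers(0, n, n)                # sample indices with replacement
    oob = np.setdiff1d(np.arange(n), idx)      # "out-of-bag" rows act as the test set
    model.fit(X[idx], y[idx])
    boot_scores.append(model.score(X[oob], y[oob]))
print("bootstrap mean R^2:", np.mean(boot_scores))
```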
Usually, for linear models, ordinary least squares is the main criterion used to fit them to the data. The next three methods are alternative approaches that can provide better prediction accuracy and model interpretability when fitting linear models.
4 — Subset selection:
This approach identifies a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on this reduced set of features.
- Best subset selection: Here we fit a separate OLS regression for each possible combination of the p predictors and then look at the resulting model fits. The algorithm has two stages: (1) fit all models that contain k predictors, where k is the maximum size of the model; (2) select a single model using cross-validated prediction error. It is important to use validation or test error, not simply training error, to assess model fit, because RSS and R² increase monotonically as more variables are added. The best approach is to cross-validate and choose the model with the highest R² and lowest RSS on the test error estimates.
- Forward stepwise selection considers a much smaller set of models built from the p predictors. It starts with a model containing no predictors and then adds predictors one at a time until all of them are included. The order in which predictors are added is determined by how much each variable improves the fit, and variables are added until no remaining predictor improves the cross-validation error (a code sketch follows this list).
- Backward stepwise selection begins with all p predictors in the model and then iteratively removes the least useful predictor, one at a time.
- Hybrid methods follow the forward stepwise approach; however, after each new variable is added, the method may also remove variables that no longer contribute to the model fit.
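Best subset selection is exponential in p, but forward stepwise selection can be sketched compactly. The example below is my addition and uses scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24 onward):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Greedy forward selection: start from an empty model and add, one at a time,
# the predictor that most improves the cross-validated score
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=4,
                                direction="forward",
                                cv=5)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```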
5 — Shrinkage:
This approach fits a model involving all p predictors, but the estimated coefficients are shrunk toward zero relative to the least squares estimates. This shrinkage, also known as regularization, has the effect of reducing variance and preventing overfitting. Depending on the type of shrinkage performed, some of the coefficients may be estimated to be exactly zero, so this approach can also perform variable selection. The two best-known techniques for shrinking coefficient estimates toward zero are ridge regression and the lasso.
- Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that reduce the RSS, but it adds a penalty that grows as the coefficients move away from zero; this penalty term has the effect of shrinking the coefficient estimates toward zero. Without going into the math, it is useful to know that ridge regression shrinks most strongly in the directions of the column space with the smallest variance: like principal component analysis, ridge regression projects the data onto a set of directions and shrinks the coefficients of the low-variance components more than those of the high-variance components, which correspond to the smallest and largest principal components respectively.
- Ridge regression has at least one disadvantage: it includes all p predictors in the final model. The penalty term pushes the coefficients of many predictors close to zero, but never exactly to zero. This is usually not a problem for prediction accuracy, but it can make the model harder to interpret. The lasso overcomes this drawback and is able to force some coefficients to exactly zero, provided the budget s in its constrained formulation is small enough. Since s = 1 corresponds to ordinary OLS regression, the coefficients shrink toward zero as s approaches 0. Lasso regression therefore also performs variable selection.
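A minimal comparison of the two (my addition; RidgeCV and LassoCV pick the penalty strength by cross-validation, which plays the role of the budget discussed above):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, RidgeCV

X, y = load_diabetes(return_X_y=True)

ridge = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)  # shrinks all coefficients toward zero
lasso = LassoCV(cv=5).fit(X, y)                     # can set some coefficients exactly to zero

print("ridge coefficients:", ridge.coef_.round(2))
print("lasso coefficients:", lasso.coef_.round(2))  # the exact zeros are the dropped variables
```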
6 — Dimension reduction:
Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of estimating M + 1 coefficients, where M < p. This is done by computing M different linear combinations, or projections, of the variables. These M projections are then used as predictors to fit a linear regression model by least squares. Two approaches for this task are principal component regression and partial least squares.
- Principal component regression (PCR) can be seen as a way of deriving a low-dimensional set of features from a large set of variables. The first principal component direction of the data is the one along which the observations vary the most; in other words, the first principal component is the line that comes closest to the data, and in total p distinct principal components can be fitted. The second principal component is a linear combination of the variables that is uncorrelated with the first and has the largest variance subject to that constraint. The key idea is that the principal components capture the most variance in the data using linear combinations along mutually orthogonal directions. This method also lets us combine the effects of correlated variables to extract more information from the data, whereas in ordinary least squares we would have to drop one of the correlated variables.
- The PCR method described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, Y does not supervise the identification of the principal components, so there is no guarantee that the directions that best explain the predictors are also the best directions for predicting the response (even though this is often assumed). Partial least squares (PLS) is a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method: it first identifies a new, smaller set of features that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. Unlike PCR, PLS makes use of the response variable to identify the new features.
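One way to contrast the unsupervised PCR pipeline with the supervised PLS approach (my addition, again on the diabetes data, with M = 5 components chosen arbitrarily):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)

# PCR: project X onto M principal components (ignoring y), then run least squares
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
# PLS: chooses the M directions using both X and y
pls = PLSRegression(n_components=5)

print("PCR 5-fold R^2:", cross_val_score(pcr, X, y, cv=5).mean())
print("PLS 5-fold R^2:", cross_val_score(pls, X, y, cv=5).mean())
```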
7 — Nonlinear regression:
In statistics, nonlinear regression is a form of regression analysis in which the observed data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations. Below are a few important techniques for dealing with nonlinear models:
- A step function is a function on the real numbers that can be written as a finite linear combination of indicator functions of intervals. Informally speaking, a step function is a piecewise constant function with only finitely many pieces.
- A piecewise function is defined by multiple sub-functions, each applying to a certain interval of the main function's domain. Piecewise is actually a way of expressing the function rather than a property of the function itself, but with additional qualification it can describe the nature of the function. For example, a piecewise polynomial function is a function that is a polynomial on each of its sub-domains, but possibly a different polynomial on each.
- A spline is a special function defined piecewise by polynomials. In computer graphics, a spline is a piecewise polynomial parametric curve. Splines are popular because of the simplicity of their construction, their ease and accuracy of evaluation, and their capacity to approximate complex shapes through curve fitting and interactive curve design (a spline-fitting sketch follows this list).
- A generalized additive model is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.
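As a concrete, hedged example of the spline idea, the sketch below (my addition; SplineTransformer requires scikit-learn 1.0 or later) expands a single predictor into a cubic spline basis and fits it by least squares on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

# Synthetic, clearly nonlinear data: y = sin(x) plus noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

# Cubic spline basis expansion followed by ordinary least squares
spline_reg = make_pipeline(SplineTransformer(degree=3, n_knots=8), LinearRegression())
spline_reg.fit(X, y)
print("in-sample R^2:", spline_reg.score(X, y))
```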
8 — Tree-based methods:
Tree-based methods can be used for both regression and classification problems. They involve stratifying or segmenting the predictor space into a number of simple regions. Because the set of splitting rules used to segment the predictor space can be summarized in a tree, these approaches are known as decision tree methods. The methods below grow multiple trees which are then combined to yield a single consensus prediction.
- Bagging reduces the variance of a prediction by generating additional training data from the original data set: sampling with replacement is used to produce multisets of the same size as the original data. Enlarging the training set this way cannot improve the model's predictive power; it only decreases the variance, tuning the prediction more narrowly toward the expected outcome.
- Boosting computes the output using several different models and then averages the results using a weighted average. By combining the advantages and pitfalls of these approaches and varying the weighting formula, you can produce good predictive power over a wider range of input data, using different, more finely tuned models.
- The random forest algorithm is similar to bagging in that it draws random bootstrap samples from the training set. However, in addition to the bootstrap samples, a random subset of features is drawn to train each individual tree; in bagging, every tree is given the full set of features. Because feature selection is random, the trees are more independent of each other than in plain bagging, which typically results in better predictive performance (thanks to a better variance-bias trade-off) and faster training, since each tree learns from only a subset of the features.
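The three ensembles can be compared side by side in a few lines (my addition, on the breast cancer data; GradientBoostingClassifier stands in for boosting in general):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),  # bagging plus random feature subsets
    "boosting": GradientBoostingClassifier(random_state=0),  # trees fitted sequentially to correct earlier errors
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```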
9 — Support vector machines:
A support vector machine (SVM) is a commonly used supervised learning technique for classification. In layman's terms, it finds the hyperplane (a line in 2D space, a plane in 3D space, and a hyperplane in higher-dimensional space; more formally, a hyperplane is an (n − 1)-dimensional subspace of an n-dimensional space) that best separates two classes of points with the maximum margin. In essence, it is a constrained optimization problem in which the margin is maximized subject to the constraint that the data are classified perfectly (a hard-margin classifier).
The data points that "support" the hyperplane on either side are called support vectors; in the original article's figure, the filled blue circle and the two filled squares are the support vectors. When the two classes of data are not linearly separable, the points are projected into a higher-dimensional space where they become linearly separable. A problem involving several classes can be decomposed into multiple one-versus-one or one-versus-rest binary classification problems.
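A hedged sketch of a soft-margin SVM with a kernel (my addition; the RBF kernel handles the "project into a higher-dimensional space" step implicitly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space where a
# separating hyperplane exists; C controls how hard or soft the margin is.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```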
10 — Unsupervised learning:
So far we have only discussed supervised learning techniques, in which the classification of the data is known and the experience given to the algorithm is the relationship between entities and their classes. When the classes of the data are unknown, a different set of techniques is needed. These are called unsupervised because they must discover patterns in the data on their own. Clustering is one type of unsupervised learning, in which the data are divided into clusters based on how closely related the items are. Some of the most commonly used unsupervised learning algorithms are listed below:
- Principal component analysis helps produce a low-dimensional representation of a data set by identifying a set of linear combinations of features that have maximal variance and are mutually uncorrelated. This linear dimensionality-reduction technique can also be helpful for understanding latent variable interactions in an unsupervised setting.
- K-means clustering: The data is divided into K different clusters according to the distance from the cluster center.
- Hierarchical clustering: Build a multi-level hierarchy of clusters by creating cluster trees.
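All three algorithms take only a few lines in scikit-learn; the sketch below (my addition, on the iris features with the labels ignored) chains PCA with the two clustering methods:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # labels are ignored: this is unsupervised

# PCA: project onto the two orthogonal directions of maximum variance
X_2d = PCA(n_components=2).fit_transform(X)

# K-means: assign each point to the nearest of K cluster centres
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# Hierarchical (agglomerative) clustering: merge clusters bottom-up into a cluster tree
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_2d)

print("k-means cluster sizes:", [int((km_labels == k).sum()) for k in range(3)])
print("hierarchical cluster sizes:", [int((hc_labels == k).sum()) for k in range(3)])
```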
This was a basic run-down of some basic statistical techniques that can help data science program managers and executives better understand what is running under the hood of their data science teams. In truth, some data science teams run algorithms purely through Python and R libraries, and most of them never have to think about the underlying math. However, being able to understand the foundations of statistical analysis gives your team a better approach: insight into the smallest parts makes manipulation and abstraction easier. I hope this basic data science statistical guide gives you a decent understanding!
**You can get all the lecture slides and RStudio lessons from [my GitHub source](github.com/khanhnamle1…).**