Learn about ten machine learning algorithms you should know in order to become a data scientist.

Machine learning practitioners have different personalities. Some are of the "I'm an expert in X, and X can train on any kind of data" variety, where X is some algorithm; others are "right tool for the right job" people. Many also follow a "jack of all trades, master of one" strategy: deep expertise in one area plus a working understanding of several other areas of machine learning. That said, no one can deny that as practicing data scientists we have to know the basics of the common machine learning algorithms, which help us deal with the new problem domains we encounter. Here is a whirlwind tour of common machine learning algorithms, along with quick resources to help you get started with each.

Principal Component Analysis (PCA)/SVD

PCA is an unsupervised method for understanding the global properties of a data set made up of vectors. The covariance matrix of the data points is analyzed to find out which dimensions (mostly) or data points (sometimes) are more important, i.e., have high variance among themselves but low covariance with the others. One way to think of the top principal components of a matrix is as the eigenvectors with the highest eigenvalues. SVD is essentially another way of computing the same ordered components, but you can obtain it without forming the covariance matrix of the data points.

This algorithm helps you overcome the curse of dimensionality by producing reduced-dimension representations of the data points.
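
As a rough sketch of what this looks like in practice (synthetic data, an illustrative choice of two components), scikit-learn offers both routes:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

X = np.random.rand(100, 10)           # 100 synthetic data points in 10 dimensions

pca = PCA(n_components=2)             # keep the two directions of highest variance
X_reduced = pca.fit_transform(X)      # shape (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component

# TruncatedSVD finds the same kind of ordered components without ever
# forming the covariance matrix, which helps with large or sparse matrices.
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)
```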

Libraries

Docs.scipy.org/doc/scipy/r…

Scikit-learn.org/stable/modu…

Introductory tutorial

Arxiv.org/pdf/1404.11…

Least squares and polynomial fitting

Remember the numerical analysis class in college where you used to fit lines and curves to points to get an equation? For very small data sets with few dimensions, you can use the same idea to fit curves in machine learning. (For large data sets, or data with many dimensions, you may end up badly overfitting, so don't bother.) OLS has a closed-form solution, so you don't have to use complex optimization techniques.

Obviously, use this algorithm to fit simple curves and regressions.
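
A minimal NumPy sketch (the noisy quadratic data below is made up purely for illustration):

```python
import numpy as np

# Synthetic 1-D data: a noisy quadratic.
x = np.linspace(0, 10, 50)
y = 3 * x**2 - 2 * x + 1 + np.random.normal(scale=5.0, size=x.shape)

line_coeffs = np.polyfit(x, y, deg=1)   # ordinary least squares line fit
quad_coeffs = np.polyfit(x, y, deg=2)   # polynomial fit; closed-form, no iterative optimizer

y_hat = np.polyval(quad_coeffs, x)      # evaluate the fitted polynomial
```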

Libraries

Docs.scipy.org/doc/numpy/r…
Docs.scipy.org/doc/numpy-1…

Introductory tutorial

Lagunita.stanford.edu/c4x/Humanit…

Constrained linear regression

The least squares method can get confused by outliers, spurious fields, and noise in the data, so we need constraints to reduce the variance of the lines we fit to a data set. The right way to do this is to fit a linear regression model that ensures the weights do not misbehave. Models can carry an L1 norm penalty (LASSO), an L2 norm penalty (ridge regression), or both (elastic net). The mean squared loss is optimized.

Use these algorithms to fit regression lines with constraints, avoid overfitting, and mask noisy dimensions from the model.
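
Here is a small scikit-learn sketch of the three variants (random data; the alpha and l1_ratio values are only illustrative and would normally be tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

X = np.random.rand(200, 20)                               # 200 samples, 20 features
y = X @ np.random.rand(20) + np.random.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty: drives many weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty: shrinks all weights toward 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # a mix of both penalties

print(lasso.coef_, ridge.coef_, enet.coef_)
```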

Libraries

Scikit-learn.org/stable/modu…

Introductory tutorial

www.youtube.com/watch?v=5as…

www.youtube.com/watch?v=jbw…

K-Means clustering

Everyone's favorite unsupervised clustering algorithm. Given a set of data points in vector form, we can group them according to the distances between them. It is an expectation-maximization-style algorithm that iteratively moves the cluster centers and then assigns the points to the nearest cluster center. The inputs the algorithm takes are the number of clusters to generate and the number of iterations it will run while trying to converge.



As is obvious from the name, you can use this algorithm to create K clusters in a dataset.
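
A minimal scikit-learn sketch (random 2-D points; K = 3 and the iteration cap are the inputs you supply, chosen here purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                  # 300 synthetic 2-D points

# n_clusters is the K you must choose; max_iter bounds the iterative updates.
kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)              # cluster index assigned to every point
centers = kmeans.cluster_centers_           # final cluster centers
```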

Libraries

Scikit-learn.org/stable/modu…

Introductory tutorial

www.youtube.com/watch?v=hDm…

www.datascience.com/blog/k-mean…

Logistic regression

Logistic regression is constrained linear regression with a nonlinearity applied after the weighted sum (usually a sigmoid, though you can also use tanh), which squashes the outputs toward the two classes (1 and 0 in the case of a sigmoid). The cross-entropy loss function is optimized with methods such as gradient descent or L-BFGS. A note for beginners: logistic regression is used for classification, not regression. You can also think of logistic regression as a single-layer neural network. NLP folks often call it the maximum entropy classifier.

(Figure: the sigmoid curve, σ(x) = 1 / (1 + e^-x), which squashes any real-valued input into the range (0, 1).)

Use LR to train simple but very powerful classifiers.
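
A quick scikit-learn sketch (synthetic data; the solver and regularization strength C are illustrative defaults):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# L-BFGS minimizes the cross-entropy loss; C controls the regularization strength.
clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000).fit(X, y)

probs = clf.predict_proba(X[:5])   # sigmoid outputs, usable as class probabilities
preds = clf.predict(X[:5])         # probabilities thresholded at 0.5
```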

Libraries

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Introductory tutorial

www.youtube.com/watch?v=-la…

SVM (Support Vector Machines)

SVMs are linear models like linear/logistic regression; the difference is that they have a margin-based loss function (the derivation of support vectors is one of the most beautiful mathematical results I have seen, along with eigenvalue calculation). You can optimize the loss function using optimization methods such as L-BFGS or even SGD.

Another innovation in SVMs is the use of kernels on the data for feature engineering. If you have good domain insight, you can replace the good old RBF kernel with a smarter one and profit.

One unique thing that SVMs can do is learn one-class classifiers.

SVM can be used to train classifiers (even regressors).
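
A small scikit-learn sketch on synthetic data, showing the kernel swap mentioned above (C and gamma are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A linear-kernel SVM behaves much like the linear models above, but with a margin-based loss;
# switching to kernel="rbf" (or a custom kernel function) changes the feature space.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(linear_svm.score(X, y), rbf_svm.score(X, y))
```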

Library

Scikit-learn.org/stable/modu…

Introductory tutorial

www.youtube.com/watch?v=eHs…

Note: SGD-based training of both logistic regression and SVMs is available in scikit-learn, which I often use because it lets me check LR and SVM through a common interface. You can also train them on datasets larger than RAM using mini-batches.
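
A sketch of that common interface (the mini-batch loop below just generates random batches to stand in for reading chunks of a larger-than-RAM dataset; in recent scikit-learn versions the logistic loss is spelled loss="log_loss", in older ones loss="log"):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# loss="log_loss" gives SGD-trained logistic regression; loss="hinge" gives a linear SVM,
# both behind the same interface.
clf = SGDClassifier(loss="log_loss")

classes = np.array([0, 1])
for _ in range(100):                        # pretend each iteration reads one batch from disk
    X_batch = np.random.rand(32, 10)
    y_batch = np.random.randint(0, 2, size=32)
    clf.partial_fit(X_batch, y_batch, classes=classes)   # incremental mini-batch training
```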

Feedforward neural network

These are basically multi-layer logistic regression classifiers, with many layers of weights separated by nonlinearities (sigmoid, tanh, ReLU + softmax, and the cool new SELU). Another popular name for them is multi-layer perceptrons. FFNNs can be used for classification, or for unsupervised feature learning as autoencoders.

(Figures: a multilayer perceptron, and an FFNN acting as an autoencoder.)

FFNN can be used to train classifiers or extract features as autoencoders.
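
As a classifier, the scikit-learn version is a few lines (synthetic data; the layer sizes are illustrative, and an autoencoder would be the same idea trained to reconstruct its own input):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Two hidden ReLU layers followed by a softmax output: stacked logistic regressions, in effect.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu", max_iter=500).fit(X, y)
print(mlp.score(X, y))
```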

Libraries

Scikit-learn.org/stable/modu…

Scikit-learn.org/stable/modu…

Github.com/keras-team/…

Introductory tutorial

www.deeplearningbook.org/contents/ml…

www.deeplearningbook.org/contents/au…

www.deeplearningbook.org/contents/re…

Convolutional Neural Networks (Convnets)

Almost all of the state-of-the-art vision-based machine learning results in the world today have been achieved with convolutional neural networks. They can be used for image classification, object detection, and even image segmentation. Invented by Yann LeCun in the late 1980s and early 1990s, convnets feature convolution layers that act as hierarchical feature extractors. You can also use them on text (and even on graphs).

Use convnets for state-of-the-art image and text classification, object detection, and image segmentation.
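
A minimal Keras sketch of a small convnet (the 32x32x3 input shape, filter counts, and 10 classes are illustrative, roughly CIFAR-10-sized; you would plug in your own images and labels):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),  # hierarchical feature extractor
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),                                 # 10-way image classifier
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)   # supply your own image tensors and integer labels
```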

Libraries

developer.nvidia.com/digits

Github.com/kuangliu/to…

Github.com/chainer/cha…

Keras.io/application…

Introductory tutorial

cs231n.github.io/

Adeshpande3.github.io/A-Beginner%…

Recurrent Neural Networks (RNN)

An RNN models a sequence by recursively applying the same set of weights to the aggregated state at time t and the input at time t (given a sequence with inputs at times 0..t..T, the hidden state at each time t is the output of step t-1 of the RNN). Pure RNNs are rarely used today, but their counterparts such as LSTMs and GRUs are state of the art in most sequence-modeling tasks.

(In the unrolled RNN, the recurrent function f is nowadays usually an LSTM or GRU rather than a plain densely connected unit with a nonlinearity; LSTM/GRU cells replace the ordinary dense layer of a vanilla RNN.)

Use RNN for any sequence modeling task, especially text classification, machine translation, and language modeling.
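
A minimal Keras sketch of an LSTM text classifier (the vocabulary size, embedding width, and binary label are all illustrative; you would feed it your own padded token-id sequences):

```python
from tensorflow.keras import layers, models

vocab_size = 10000                       # illustrative vocabulary size
model = models.Sequential([
    layers.Embedding(vocab_size, 128),   # token ids -> dense vectors
    layers.LSTM(64),                     # swap in layers.GRU(64) for the GRU variant
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_token_ids, labels, epochs=3)   # supply your own sequence data and labels
```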

Libraries

Github.com/tensorflow/… (Many of Google’s cool NLP research papers are here.)

Github.com/wabyking/Te…

opennmt.net/

Introductory tutorial

cs224d.stanford.edu/

www.wildml.com/category/ne…

Colah.github.io/posts/2015-…

Conditional Random Field (CRF)

CRFs are probably the most frequently used models from the probabilistic graphical model (PGM) family. They are used for sequence modeling like RNNs and can also be used in combination with RNNs. Before neural machine translation systems came along, CRFs were state of the art in many sequence-labeling tasks, and on small data sets they still learn better than RNNs, which need much larger amounts of data to generalize. They can also be used in other structured prediction tasks such as image segmentation. A CRF models each element of a sequence (say, a sentence) so that neighbors influence the label of each component, rather than all labels being independent of one another.

Use CRF to tag sequences (in text, images, time series, DNA, etc.).
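
A toy sklearn-crfsuite sketch for part-of-speech-style tagging (the two sentences, the feature dictionaries, and the tag set are invented purely for illustration):

```python
import sklearn_crfsuite

# Each sentence is a list of per-token feature dicts; the labels are per-token tags.
X_train = [
    [{"word": "the", "is_capitalized": False}, {"word": "cat", "is_capitalized": False}],
    [{"word": "John", "is_capitalized": True}, {"word": "runs", "is_capitalized": False}],
]
y_train = [["DET", "NOUN"], ["PROPN", "VERB"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # per-token tag sequences; neighboring labels influence each other
```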

Library

sklearn-crfsuite.readthedocs.io/en/latest/

Introductory tutorial

Blog.echen.me/2012/01/03/…

www.youtube.com/watch?v=GF3…

Decision trees

Suppose I'm given an Excel spreadsheet with data on various fruits, and I have to tell which fruits are apples. What I will do is ask the question "Which fruits are red and round?" and split the fruits into those that answer yes and those that answer no. Now, not all red and round fruits are apples, and not all apples are red and round. So I will ask "Which fruits have a hint of red or yellow on them?" of the red and round fruits, and "Which fruits are green and round?" of the ones that are not red and round. Based on these questions, I can tell with reasonable accuracy which fruits are apples. This cascade of questions is a decision tree.

But this is a decision tree based on my intuition, and intuition does not work on high-dimensional or complex data. We have to come up with the cascade of questions automatically by looking at labeled data; that is what machine-learned decision trees do. Earlier versions such as CART trees were used on simple data, but as data sets grew larger and larger, the bias-variance tradeoff had to be addressed with better algorithms. The two common decision-tree algorithms used today are random forests (which build different classifiers on random subsets of attributes and combine their outputs) and boosting trees (which train a cascade of trees, each correcting the errors of the ones before it).

Decision trees can be used to classify (or even regress) data points.
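
A quick scikit-learn sketch of both ensemble flavors on synthetic data (the estimator counts and learning rate are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random forest: many trees grown on random subsets of samples/features, outputs combined by voting.
rf = RandomForestClassifier(n_estimators=100).fit(X, y)

# Boosted trees: trees trained in sequence, each one correcting the errors of those before it.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X, y)

print(rf.score(X, y), gb.score(X, y))
```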

Library

Scikit-learn.org/stable/modu…

Scikit-learn.org/stable/modu…

xgboost.readthedocs.io/en/latest/

catboost.yandex/

Introductory tutorial

Xgboost.readthedocs.io/en/latest/m…

Arxiv.org/abs/1511.05…

Arxiv.org/abs/1407.75…

Education.parrotprediction.teachable.com/p/practical…

TD algorithm

If you are still wondering how any of the methods above could solve a task like DeepMind beating the world Go champion: they can't. All of the algorithms we have discussed so far do pattern recognition, not strategy learning. To learn a strategy for a multi-step problem, such as winning a game of chess or playing on an Atari console, we need to let an agent loose in a world and let it learn from the rewards and penalties it faces. This type of machine learning is called reinforcement learning. A lot (though not all) of the recent success in this area comes from combining the perceptual abilities of a convnet or an LSTM with a set of algorithms called temporal-difference learning. These include Q-learning, SARSA, and several other variants.

These algorithms are mainly used for auto-playing games, but also for other applications in language generation and object detection.
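
To make the temporal-difference idea concrete, here is a bare-bones tabular Q-learning sketch. It assumes a tiny hypothetical environment exposing reset() -> state and step(action) -> (next_state, reward, done), in the spirit of the Gym-style interfaces the libraries below build on; the state/action counts and hyperparameters are illustrative:

```python
import numpy as np

n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1        # learning rate, discount, exploration rate
Q = np.zeros((n_states, n_actions))           # table of state-action values

def train(env, episodes=1000):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # temporal-difference update toward reward + discounted best future value
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
```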

Library

Github.com/keras-rl/ke…

Github.com/tensorflow/…

Introductory tutorial

Web2.qatar.cmu.edu/~gdicaro/15…

www.youtube.com/watch?v=2pW…

These are ten machine learning algorithms you can learn to become a data scientist.

