This article briefly introduces Python implementations of a variety of unsupervised learning algorithms, including k-means clustering, hierarchical clustering, t-SNE, and DBSCAN clustering.
Unsupervised learning is a family of machine learning techniques used to find patterns in data. The input data for an unsupervised learning algorithm is unlabeled: the data provides only the input variables (independent variables X) without the corresponding output variables (dependent variables). In unsupervised learning, the algorithm discovers interesting structure in the data on its own.
As AI researcher Yann LeCun explains: in unsupervised learning, machines are able to learn on their own, without being explicitly told whether what they are doing is right. This is the key to true artificial intelligence!
Supervised learning vs. unsupervised learning
In supervised learning, the system tries to learn from previously given examples; in unsupervised learning, the system tries to find patterns directly in the given examples. So if a dataset is labeled, it is a supervised learning problem; if the data is unlabeled, it is an unsupervised learning problem.
The figure above shows an example of supervised learning, which uses regression to find the best-fit curve through the features. In unsupervised learning, the input data is grouped by its features, and a prediction is made according to the cluster a data point belongs to.
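The same contrast can be shown in a few lines of scikit-learn code (a minimal sketch; the choice of linear regression and k-means here is mine, for illustration only):

# Supervised: both features X and labels y are provided
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])           # labels guide the fit

reg = LinearRegression().fit(X, y)           # learns from (X, y) pairs
print(reg.predict([[5.0]]))                  # predicts a label

# Unsupervised: only the features X are provided
km = KMeans(n_clusters=2, n_init=10).fit(X)  # finds structure in X alone
print(km.labels_)                            # discovered cluster assignments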
Important terms
- Feature: an input variable used to make predictions.
- Prediction: the model's output for a given input example.
- Example: a row of the dataset. An example contains one or more features and possibly a label.
- Label: the true outcome for the features; the counterpart of the prediction. (All four terms are illustrated in the sketch after this list.)
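A small sketch makes these terms concrete using the Iris data introduced below (the variable names are mine):

from sklearn import datasets

iris = datasets.load_iris()

features = iris.feature_names      # the input variables
example = iris.data[0]             # one example: a row of the dataset
label = iris.target[0]             # the true outcome for that example

print(features)                    # ['sepal length (cm)', ...]
print(example)                     # e.g. [5.1 3.5 1.4 0.2]
print(iris.target_names[label])    # 'setosa'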
Prepare data for unsupervised learning
In this article, we use the Iris dataset to make our first predictions. The dataset contains 150 records, each described by five attributes: petal length, petal width, sepal length, sepal width, and flower class. The classes are Iris setosa, Iris virginica, and Iris versicolor. Here, the four features of each iris are given to the unsupervised algorithm, which predicts which class the flower belongs to.
This article uses the Sklearn library in the Python environment to load the Iris dataset and uses Matplotlib for data visualization. The following code snippet is used to explore the data set:
# Importing Modules
from sklearn import datasets
import matplotlib.pyplot as plt
# Loading dataset
iris_df = datasets.load_iris()
# Available methods on dataset
print(dir(iris_df))
# Features
print(iris_df.feature_names)
# Targets
print(iris_df.target)
# Target Names
print(iris_df.target_names)
# Dataset Slicing
x_axis = iris_df.data[:, 0]  # Sepal Length
y_axis = iris_df.data[:, 2]  # Petal Length
# Plotting
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
['DESCR', 'data', 'feature_names', 'target', 'target_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['setosa' 'versicolor' 'virginica']
Purple: Setosa, green: Versicolor, yellow: Virginica
Clustering analysis
In cluster analysis, the data is divided into different groups. In short, the goal is to separate groups with similar traits from the overall data and assign them to clusters.
Visual examples:
As shown above, the raw, unclassified data is on the left, and the clustered data (grouped by the data's own characteristics) is on the right. When a new input is to be predicted, the model checks which cluster it belongs to based on its features and makes the prediction accordingly.
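Pictures like the one above can be reproduced in a few lines; this is a hedged sketch using synthetic data from make_blobs (my choice for illustration, not the article's data):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate unlabeled points (the "left panel"): the labels are discarded
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(X[:, 0], X[:, 1])             # raw, unclassified data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
ax2.scatter(X[:, 0], X[:, 1], c=labels)   # same data, colored by cluster
plt.show()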
Python implementation of K-means clustering
K-means is an iterative clustering algorithm that converges toward a local optimum over successive iterations. The number of clusters must be selected up front. Since we know the problem involves three flower classes, we instruct the k-means model to group the data into three classes by passing the parameter n_clusters=3. To start, the algorithm randomly picks three data points as initial centroids. Each input data point is then assigned to a cluster according to its distance from each centroid, after which the centroids of all clusters are recomputed. The process repeats until the assignments stabilize, as sketched below.
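The iteration just described can be sketched in plain NumPy (a simplified illustration of the idea, not scikit-learn's actual implementation; empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments stabilized: a local optimum
        centroids = new_centroids
    return labels, centroids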
The centroid of each cluster is a set of feature values that characterizes the resulting group; examining a centroid's feature values gives a qualitative description of what kind of group that cluster represents (we inspect these centroids after the code below).
We import the KMeans model from the sklearn library, fit it to the features, and make predictions.
Python implementation of the k-means algorithm:
# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans

# Loading dataset
iris_df = datasets.load_iris()

# Declaring Model
model = KMeans(n_clusters=3)

# Fitting Model
model.fit(iris_df.data)

# Predicting on a single, unseen example
predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])

# Prediction on the entire data
all_predictions = model.predict(iris_df.data)

# Printing Predictions
print(predicted_label)
print(all_predictions)
[0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2]
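Continuing from the code above, the fitted model's cluster_centers_ attribute (a standard scikit-learn KMeans attribute) holds the centroids discussed earlier:

import numpy as np

# Each row is one centroid; its four values are the mean sepal/petal
# measurements of that cluster's members
print(model.cluster_centers_)

# How the 150 examples were split across the three clusters
print(np.bincount(all_predictions))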
Hierarchical clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. At the start of the algorithm, every data point is its own cluster. Then, the two closest clusters are merged into one. The algorithm stops when all points have been merged into a single cluster.
The result of hierarchical clustering can be shown with a dendrogram. Next, let's look at an example of hierarchical clustering on a grain-seed dataset. Dataset link: raw.githubusercontent.com/vihar/unsup…
Python implementation of hierarchical clustering:
# Importing Modules
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import pandas as pd
# Reading the DataFrame
seeds_df = pd.read_csv(
    "https://raw.githubusercontent.com/vihar/unsupervised-learning-with-python/master/seeds-less-rows.csv")
# Remove the grain species from the DataFrame, save for later
varieties = list(seeds_df.pop('grain_variety'))
# Extract the measurements as a NumPy array
samples = seeds_df.values
"""
Perform hierarchical clustering on samples using the
linkage() function with the method='complete' keyword argument.
Assign the result to mergings.
"""
mergings = linkage(samples, method='complete')
"""
Plot a dendrogram using the dendrogram() function on mergings,
specifying the keyword arguments labels=varieties, leaf_rotation=90,
and leaf_font_size=6.
"""
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
           )
plt.show()
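To go from the dendrogram to flat cluster labels, SciPy's fcluster can cut the merge tree; cutting into three clusters here is an assumption for illustration:

import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Cut the tree so that at most 3 flat clusters remain
labels = fcluster(mergings, t=3, criterion='maxclust')

# Cross-tabulate the flat clusters against the known grain varieties
print(pd.crosstab(pd.Series(varieties), pd.Series(labels)))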
The difference between k-means and hierarchical clustering
- Hierarchical clustering does not handle big data well, whereas k-means does: the time complexity of k-means is linear, O(n), while that of hierarchical clustering is quadratic, O(n²).
- In k-means clustering, because the initial clusters are chosen at random, the results of multiple runs can differ considerably; the results of hierarchical clustering are reproducible (see the sketch after this list).
- K-means performs well when the clusters are shaped like hyperspheres (circles in two dimensions, spheres in three).
- K-means is not robust to noisy data, while hierarchical clustering can use noisy data directly for cluster analysis.
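A quick sketch of the reproducibility point from the list above (the random_state values are arbitrary; note that even identical partitions can carry permuted cluster numbers):

import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

X = datasets.load_iris().data

# Two k-means runs with different random initializations may disagree
a = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(X)
b = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(X)
print(np.array_equal(a, b))   # often False

# Hierarchical clustering is deterministic: same input, same merges
print(np.allclose(linkage(X, method='complete'),
                  linkage(X, method='complete')))  # True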
t-SNE clustering
This is a visualization-oriented method of unsupervised learning. t-SNE stands for t-distributed stochastic neighbor embedding. It maps a high-dimensional space into a two- or three-dimensional space that can be visualized. Specifically, it models each high-dimensional object as a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.
Python implementation of t-SNE for the Iris dataset:
# Importing Modules
from sklearn import datasets
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Loading dataset
iris_df = datasets.load_iris()
# Defining Model
model = TSNE(learning_rate=100)
# Fitting Model
transformed = model.fit_transform(iris_df.data)
# Plotting 2D t-SNE
x_axis = transformed[:, 0]
y_axis = transformed[:, 1]
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
Purple: Setosa, green: Versicolor, yellow: Virginica
Here, the Iris dataset, which has four features (four dimensions), is transformed into two dimensions and displayed as a two-dimensional image. Similarly, the t-SNE model can be applied to a dataset with n features.
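One caveat worth adding: t-SNE is stochastic, so repeated runs can produce different layouts. A hedged sketch of pinning down the result (the parameter values are illustrative, not recommendations):

from sklearn.manifold import TSNE

# perplexity roughly sets the effective neighborhood size;
# random_state fixes the otherwise random initialization
model = TSNE(n_components=2, perplexity=30, random_state=42)
transformed = model.fit_transform(iris_df.data)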
DBSCAN clustering
DBSCAN (density-based spatial clustering of applications with noise) is a popular clustering algorithm used as an alternative to k-means in predictive analytics. It does not require the number of clusters to be specified in order to run; however, two other parameters must be tuned.
The scikit-learn implementation of DBSCAN provides defaults for the "eps" and "min_samples" parameters, but in general these need tuning. "eps" is the maximum distance between two data points for them to be considered part of the same neighborhood. "min_samples" is the minimum number of data points in a neighborhood required to form a cluster.
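As a minimal tuning sketch (the eps values below are arbitrary trial points; a common heuristic is to pick eps from a k-distance plot):

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data

# Sweep eps and watch how the cluster and noise counts respond
for eps in (0.3, 0.5, 0.8):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")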
Python implementation of DBSCAN clustering:
# Importing Modules
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
# Load Dataset
iris = load_iris()
# Declaring Model
dbscan = DBSCAN()
# Fitting
dbscan.fit(iris.data)
# Transforming using PCA (to plot the 4-D data in 2-D)
pca = PCA(n_components=2).fit(iris.data)
pca_2d = pca.transform(iris.data)
# Plot based on Class
for i in range(0, pca_2d.shape[0]):
    if dbscan.labels_[i] == 0:
        c1 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='r', marker='+')
    elif dbscan.labels_[i] == 1:
        c2 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='g', marker='o')
    elif dbscan.labels_[i] == -1:
        c3 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='b', marker='*')
plt.legend([c1, c2, c3], ['Cluster 1', 'Cluster 2', 'Noise'])
plt.title('DBSCAN finds 2 clusters and Noise')
plt.show()
More techniques for unsupervised learning:
- Principal component analysis (PCA)
- Anomaly detection
- Autoencoders
- Deep belief networks
- Hebbian learning
- Generative adversarial networks (GANs)
- Self-organizing maps
Original article: https://towardsdatascience.com/unsupervised-learning-with-python-173c51dc7f03