By Chris Albon
Translator: Flying Dragon
License: CC BY-NC-SA 4.0
Dimensionality reduction on sparse feature matrices
```py
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np

# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)

# Make a sparse matrix
X_sparse = csr_matrix(X)

# Create a TSVD with 10 components
tsvd = TruncatedSVD(n_components=10)

# Fit the TSVD on the sparse matrix and transform it
X_sparse_tsvd = tsvd.fit(X_sparse).transform(X_sparse)

# Show results
print('Original number of features:', X_sparse.shape[1])
print('Reduced number of features:', X_sparse_tsvd.shape[1])

'''
Original number of features: 64
Reduced number of features: 10
'''

# Sum of the first three components' explained variance ratios
tsvd.explained_variance_ratio_[0:3].sum()

# 0.30039385372588506
```
Kernel PCA dimension reduction
```py
# Load libraries
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

# Create linearly inseparable data
X, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Apply kernel PCA with an RBF kernel
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
X_kpca = kpca.fit_transform(X)

print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kpca.shape[1])

'''
Original number of features: 2
Reduced number of features: 1
'''
```
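Note that the code above imports PCA but never uses it. For context, here is a minimal sketch (added here, not part of the original recipe) of what plain linear PCA does on the same concentric-circles data: the inner class's projection lands entirely inside the outer class's range, which is why a nonlinear (RBF) kernel is needed.

```py
# Linear PCA on the same linearly inseparable data, for comparison
from sklearn.decomposition import PCA
from sklearn.datasets import make_circles

# Keep the labels this time so we can inspect each class
X, y = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Project onto a single linear component
X_pca = PCA(n_components=1).fit_transform(X)

# The ranges overlap: the inner circle's projection is nested inside the
# outer circle's projection, so no threshold on this single linear
# feature can separate the two classes
print('Outer circle range:', X_pca[y == 0].min(), X_pca[y == 0].max())
print('Inner circle range:', X_pca[y == 1].min(), X_pca[y == 1].max())
```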
Dimensionality reduction with PCA
```py
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)

# Create a PCA that retains 99% of the variance
pca = PCA(n_components=0.99, whiten=True)

# Conduct PCA
X_pca = pca.fit_transform(X)

# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_pca.shape[1])

'''
Original number of features: 64
Reduced number of features: 54
'''
```
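As a quick sanity check on the recipe above (an addition, not in the original text), we can confirm that the retained components cover at least 99% of the variance, and inspect how many components PCA actually kept:

```py
# Number of components PCA selected to reach the 0.99 threshold
pca.n_components_

# Total variance explained by the retained components (should be >= 0.99)
pca.explained_variance_ratio_.sum()
```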
PCA feature extraction
Principal component analysis (PCA) is a common feature extraction technique in data science. Technically, PCA finds the eigenvectors of the covariance matrix that have the highest eigenvalues and then uses those eigenvectors to project the data into a new subspace of equal or smaller dimension. In practice, PCA transforms a matrix of n features into a new dataset with (probably) fewer than n features. That is, it reduces the number of features by constructing a smaller number of new variables that capture a significant portion of the information found in the original features. However, the purpose of this tutorial is not to explain the concept of PCA, which is done very well elsewhere, but to demonstrate PCA in a practical application.
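To make the eigenvector description above concrete, here is a rough NumPy sketch (added for illustration; not part of the original tutorial) of what PCA does under the hood: compute the covariance matrix of the centered data, take the eigenvectors with the largest eigenvalues, and project the data onto them.

```py
import numpy as np

def pca_numpy(X, n_components):
    """Bare-bones PCA: project X onto the top eigenvectors of its covariance matrix."""
    # Center the data
    X_centered = X - X.mean(axis=0)

    # Covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)

    # Eigendecomposition (eigh handles symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort eigenvectors by descending eigenvalue and keep the top n_components
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]

    # Project the data into the new, smaller subspace
    return X_centered @ components
```

Up to the sign of each component, this matches what scikit-learn's PCA produces; the rest of this section uses the scikit-learn implementation.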
```py
# Load libraries
import numpy as np
from sklearn import decomposition, datasets
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset
dataset = datasets.load_breast_cancer()

# Load the features
X = dataset.data
```
Note that the raw data contained 569 observations and 30 features.
```py
X.shape
# (569, 30)
```
Here’s what the data looks like
```py
X
'''
array([[  1.79900000e+01,   1.03800000e+01,   1.22800000e+02, ...],
       [  2.05700000e+01,   1.77700000e+01,   1.32900000e+02, ...],
       [  1.96900000e+01,   2.12500000e+01,   1.30000000e+02, ...],
       ...,
       [  7.76000000e+00,   2.45400000e+01,   4.79200000e+01, ...]])
'''

# Create a scaler object and standardize the features
sc = StandardScaler()
X_std = sc.fit_transform(X)
```
Note that PCA takes one parameter, the number of components. This is the number of output features and will need to be tuned.
```py
# Create a PCA object with two components
pca = decomposition.PCA(n_components=2)

# Fit PCA and transform the standardized feature matrix
X_std_pca = pca.fit_transform(X_std)
```
After PCA, the new data has been reduced to two features, with the same number of rows as the original feature matrix.
X_std_pca "" array([[9.19283683, 1.94858307], [2.3878018, -3.76817174], [5.73389628, -1.0751738], [1.25617928, -1.90229671], [10.37479406, 1.67201011], [-5.4752433, 0.67063679]] "'Copy the code
Group observations using KMeans clustering
```py
# Load libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import pandas as pd

# Create a simulated feature matrix
X, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# Create a DataFrame
df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])

# Create a KMeans clusterer with three clusters
clusterer = KMeans(n_clusters=3, random_state=1)

# Fit the clusterer
clusterer.fit(X)
'''
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)
'''

# Predict each observation's cluster and store it as a new feature
df['group'] = clusterer.predict(X)

# View the first few rows
df.head(5)
```
|   | feature_1 | feature_2 | group |
|---|---|---|---|
| 0 | 9.877554 | 3.336145 | 0 |
| 1 | 7.287210 | 8.353986 | 2 |
| 2 | 6.943061 | 7.023744 | 2 |
| 3 | 7.440167 | 8.791959 | 2 |
| 4 | 6.641388 | 8.075888 | 2 |
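If the group column is going to be fed into a downstream model, it is a categorical feature, so a common follow-up (not shown in the original recipe) is to one-hot encode it rather than leave it as the integers 0, 1, 2:

```py
# One-hot encode the cluster membership so a model does not treat
# the group labels 0, 1, 2 as ordered numeric values
df = pd.concat([df, pd.get_dummies(df['group'], prefix='group')], axis=1)
```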
Choose the best number of components for LDA
In scikit-learn, LDA is implemented using LinearDiscriminantAnalysis, which includes a parameter, n_components, indicating the number of features we want returned. To figure out what value to use for n_components (for example, how many components to keep), we can take advantage of the fact that explained_variance_ratio_ tells us the variance explained by each outputted feature and is a sorted array.
Specifically, we can run LinearDiscriminantAnalysis with n_components set to None to return the ratio of variance explained by every component feature, then calculate how many components are required to get above some threshold of explained variance (typically 0.95 or 0.99).
```py
# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris flower dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create and run an LDA
lda = LinearDiscriminantAnalysis(n_components=None)
X_lda = lda.fit(X, y)

# Array of explained variance ratios
lda_var_ratios = lda.explained_variance_ratio_

def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0

    # Set initial number of features
    n_components = 0

    # For the explained variance of each feature:
    for explained_variance in var_ratio:

        # Add the explained variance to the total
        total_variance += explained_variance

        # Add one to the number of components
        n_components += 1

        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break

    # Return the number of components
    return n_components

# Run the function
select_n_components(lda_var_ratios, 0.95)

# 1
```
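The explicit loop above is easy to follow; for reference, the same count can be computed in one line with NumPy's cumulative sum (a sketch assuming numpy is imported as np and that the threshold is actually reachable):

```py
import numpy as np

# Index of the first component where cumulative explained variance
# reaches the goal, plus one to turn the index into a count
n_components = int(np.argmax(np.cumsum(lda_var_ratios) >= 0.95)) + 1
```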
Choose the best number of components for TSVD
```py
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np

# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)

# Make a sparse matrix
X_sparse = csr_matrix(X)

# Create and run a TSVD with one fewer component than the number of features
tsvd = TruncatedSVD(n_components=X_sparse.shape[1]-1)
X_tsvd = tsvd.fit(X)

# List of explained variance ratios
tsvd_var_ratios = tsvd.explained_variance_ratio_

def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0

    # Set initial number of features
    n_components = 0

    # For the explained variance of each feature:
    for explained_variance in var_ratio:

        # Add the explained variance to the total
        total_variance += explained_variance

        # Add one to the number of components
        n_components += 1

        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break

    # Return the number of components
    return n_components

# Run the function
select_n_components(tsvd_var_ratios, 0.95)

# 40
```
Use LDA for dimensionality reduction
```py
# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris flower dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create an LDA that will reduce the data down to one feature
lda = LinearDiscriminantAnalysis(n_components=1)

# Run the LDA and use it to transform the features
X_lda = lda.fit(X, y).transform(X)

# Print the number of features
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_lda.shape[1])

'''
Original number of features: 4
Reduced number of features: 1
'''

# View the ratio of explained variance
lda.explained_variance_ratio_

# array([ 0.99147248])
```
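Unlike PCA, LDA is supervised: it uses the class labels y to pick directions that separate the classes. As an optional check (not part of the original recipe), one way to confirm that the single LDA feature still carries the class information is to cross-validate a simple classifier on it:

```py
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Accuracy of a basic classifier trained on just the single LDA feature
scores = cross_val_score(LogisticRegression(), X_lda, y, cv=5)
print('Mean CV accuracy using 1 LDA feature:', scores.mean())
```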