1. Ask questions:
- Make it clear whether it’s a classification problem or a regression problem
2. Understand the data:
2.1 Data Collection
- sklearn.datasets provides practice datasets (the data should be representative, and the amount of data should be appropriate)
2.2 Importing Data
- pd.read_csv(…)
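A minimal sketch of this import step, assuming a placeholder file name train.csv:

    import pandas as pd

    # Read the raw data from a CSV file; "train.csv" is a placeholder name.
    df = pd.read_csv("train.csv")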
2.3 Viewing Data Set Information
- df.shape shows the data shape; .shape[0] is the number of rows; .shape[1] is the number of columns
- df.head() shows the first few rows
- df.describe() shows descriptive statistics for the numerical columns
- df.info() reveals missing values (via the non-null count per column) and whether each column's data type is appropriate
- Understand the meaning of each field and which fields are the target and the features; you can also visualize the distribution of the data
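A quick sketch of these inspection calls on the DataFrame df loaded above:

    print(df.shape)       # (number of rows, number of columns)
    print(df.shape[0])    # number of rows
    print(df.shape[1])    # number of columns
    print(df.head())      # first five rows
    print(df.describe())  # descriptive statistics for numeric columns
    df.info()             # column dtypes and non-null counts (reveals missing values)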
3. Clean data
3.1 Data preprocessing
Includes missing-value handling, duplicate-value handling, data type conversion, and string normalization
- Missing value handling (the label column does not need missing values filled):
- df[A] = df[A].fillna(df[A].mean()) fills a numerical column A with its mean
- df[A].value_counts() shows the category counts; fill a categorical column with the most frequent category via df[A].fillna("most frequent category"), or with a placeholder such as df[A].fillna("U")
- Use a model to predict the missing values, for example k-NN (see the sketch below)
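A sketch of these fill strategies, assuming hypothetical Titanic-style columns (Age and Fare numeric; Embarked and Cabin categorical):

    from sklearn.impute import KNNImputer

    # Numeric column: fill missing values with the column mean
    df['Age'] = df['Age'].fillna(df['Age'].mean())

    # Categorical column: fill with the most frequent category
    most_common = df['Embarked'].value_counts().idxmax()
    df['Embarked'] = df['Embarked'].fillna(most_common)

    # Or fill with an explicit "unknown" placeholder
    df['Cabin'] = df['Cabin'].fillna('U')

    # Model-based alternative for numeric columns: k-NN imputation
    df[['Age', 'Fare']] = KNNImputer(n_neighbors=5).fit_transform(df[['Age', 'Fare']])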
- Data normalization/standardization:
- Scale-sensitive models such as SVM should be standardized so that extreme values do not dominate the model parameters; scale-invariant models such as logistic regression are also best standardized, since it speeds up training
- There are two common approaches to normalization/standardization (see the sketch below):
- Min-max scaling maps values into [0, 1]: (x - min(x)) / (max(x) - min(x)), via sklearn.preprocessing.MinMaxScaler; suitable when the data lie in a bounded range and the values are concentrated, but an unstable min/max will affect the results
- Z-score standardization maps values to mean 0 and variance 1: (x - mean(x)) / std(x), via sklearn.preprocessing.scale(); suitable when the maximum/minimum are unknown or there are outliers beyond the usual range
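A sketch of both scalers, applied to a toy feature matrix X (an assumption for illustration):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, scale

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy feature matrix

    # Min-max scaling into [0, 1]
    X_minmax = MinMaxScaler().fit_transform(X)

    # Z-score standardization: mean 0, variance 1
    X_std = scale(X)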
3.2 Feature extraction (Feature Engineering, Part 1) (see the Titanic Project)
- Numerical data: it can be used directly, or transformed into new features through arithmetic
- Family size = df.a + df.b + 1 (the person themselves); then map(lambda s: 1 if 2 <= s <= 4 else 0) to flag medium-sized families (see the sketch after this list)
- Categorical data:
- Two categories: df.A = df.A.map({"male": 1, "female": 0})
- More than two categories: one-hot encoding, dummies = pd.get_dummies(df.A, prefix='prefix'); df = pd.concat([df, dummies], axis=1)
- String - Name: each name contains a title; use split() to extract the title and strip() to remove spaces; group the titles into classes, define the corresponding dictionary, and replace with map(); then one-hot encode
- String - Cabin number: s[n] gets the n-th character of a string; one-hot encode after extraction
- Time series data (data collected periodically over a period of time): can be decomposed into year, month, and day
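A sketch of these transformations, assuming hypothetical Titanic-style columns (SibSp, Parch, Sex, Name, Cabin):

    # Numerical: family size = siblings/spouses + parents/children + oneself
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['MidFamily'] = df['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)

    # Two categories: map strings to 0/1
    df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})

    # String: extract the title from "Last, Title. First", then strip spaces
    df['Title'] = df['Name'].map(lambda n: n.split(',')[1].split('.')[0].strip())

    # Cabin: take the first character, then one-hot encode
    df['CabinClass'] = df['Cabin'].str[0]
    dummies = pd.get_dummies(df['CabinClass'], prefix='Cabin')
    df = pd.concat([df, dummies], axis=1)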
3.3 Feature selection (Feature Engineering, Part 2)
- Compute the correlation between each feature and the label: corr_df = df.corr()
- View the correlation coefficients against the label column: corr_df['label'].sort_values(ascending=False)
- Select the feature columns to use as model inputs according to the correlation coefficients
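A sketch of correlation-based selection, assuming the label column is named 'Survived' (a stand-in for whatever the target is) and an arbitrary 0.1 threshold:

    # Correlation of each numeric column with every other
    corr_df = df.corr(numeric_only=True)
    print(corr_df['Survived'].sort_values(ascending=False))

    # Keep features whose absolute correlation with the label passes the threshold
    corr_with_label = corr_df['Survived'].drop('Survived')
    features = corr_with_label[corr_with_label.abs() > 0.1].index.tolist()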
4. Model building:
4.1 Build the training and test data sets
- Select the features and label of the training and test data: .loc selects the feature columns and the label column; train_test_split does the partitioning, usually with 80% as the training set
- .shape shows the result of the split
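A sketch of the split, reusing the features list and 'Survived' label assumed above:

    from sklearn.model_selection import train_test_split

    X = df.loc[:, features]    # feature columns
    y = df.loc[:, 'Survived']  # label column

    # 80% of the rows go to the training set, the rest to the test set
    train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.8)
    print(train_X.shape, test_X.shape)  # check the partition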
4.2 Machine learning algorithm selection:
- Import the algorithm:
- Logistic Regression
- Random Forest
- Support Vector Machines
- Gradient Boosting Classifier
- K-Nearest Neighbors
- Gaussian Naive Bayes
- Dimensionality reduction: PCA, Isomap
- Classification: SVC; clustering: K-means
- Linear regression: LinearRegression
- Create the model: model = LinearRegression()
- Train the model: model.fit(train_X, train_y)
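A sketch with one of the classifiers listed above, logistic regression (chosen here only as an example, since the label assumed earlier is binary):

    from sklearn.linear_model import LogisticRegression

    # Create and train the model
    model = LogisticRegression(max_iter=1000)
    model.fit(train_X, train_y)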
5. Evaluate the model
- model.score(test_X, test_y): the evaluation metric differs by model; classification models report accuracy
- metrics.confusion_matrix: confusion matrix
- homogeneity_score: homogeneity, each cluster contains only members of a single class; range [0, 1], 1 means completely homogeneous
- completeness_score: completeness, all members of a given class are assigned to the same cluster; range [0, 1], 1 means complete
- v_measure_score: harmonic mean of homogeneity and completeness
- adjusted_rand_score: adjusted Rand index (ARI); range [-1, 1]; the higher the value, the more consistent the clustering is with the true clustering, indicating how well the two distributions coincide
- adjusted_mutual_info_score: adjusted mutual information (AMI); range [-1, 1]; the higher the value, the better the match with the actual labels
- fowlkes_mallows_score: geometric mean of precision and recall; range [0, 1]; the larger, the more similar
- silhouette_score: the closer the samples within a cluster and the farther apart the different clusters, the larger the coefficient
- calinski_harabasz_score: the smaller the within-class covariance and the larger the between-class covariance, the larger the score, and the better the clustering
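A sketch of the supervised metrics above, applied to the classifier trained earlier; the clustering scores follow the same call pattern with true and predicted cluster labels:

    from sklearn import metrics

    # Accuracy on the held-out test set
    print(model.score(test_X, test_y))

    # Confusion matrix of true vs. predicted labels
    pred_y = model.predict(test_X)
    print(metrics.confusion_matrix(test_y, pred_y))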
6. Program implementation
- model.predict(pred_X)
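A sketch of the final prediction step; pred_X stands for whatever unseen data needs scoring (test_X is reused here only as a placeholder):

    # Predict labels for new samples with the same feature columns as train_X
    pred_X = test_X
    predictions = model.predict(pred_X)
    print(predictions[:10])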
7. Report writing