
1. An introduction to data normalization

  • Data normalization is a standard data preprocessing step in machine learning. In traditional machine learning, features live on different scales, so features with large magnitudes can dominate while others are not learned effectively during modeling. Normalization rescales every feature into a common interval and thereby avoids this scale-induced learning bias. Normalized data also improves training efficiency, speeds up convergence, and stabilizes the model. On the other hand, traditional machine learning often requires interpretable models, and normalizing the data reduces the interpretability of the model itself (a minimal sketch of the 0-1 normalization mentioned below follows this list).
  • In deep learning, transforming the data into zero-centered data helps every layer of the model learn effectively and alleviates vanishing and exploding gradients. Moreover, deep learning usually does not demand interpretability, so there are few obstacles to standardizing the data.
  • The normalization techniques used in deep learning differ considerably from those of classical machine learning, but the underlying theory is essentially the same.
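For reference, here is a minimal sketch of the 0-1 (min-max) normalization that Z-Score is contrasted with below; the function name MaxMinNormalization and the toy tensor are illustrative only and do not appear elsewhere in this article.

import torch

def MaxMinNormalization(data):
    """Scale each column of a 2-D tensor into the [0, 1] interval."""
    min_ = data.min(0).values            # column-wise minimum
    max_ = data.max(0).values            # column-wise maximum
    return (data - min_) / (max_ - min_)

t = torch.tensor([[1., 200.], [2., 400.], [3., 600.]])
print(MaxMinNormalization(t))
# Each column now ranges from 0 to 1, so features with very different
# magnitudes are brought onto a comparable scale.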

1.1 Z-Score standardization

Unlike 0-1 normalization, Z-Score standardization uses the mean and standard deviation of the original data. It again works column by column: each column has its mean subtracted and is then divided by its standard deviation. The data produced this way is typical zero-centered data, and if the original data follows a normal distribution, the Z-Score result follows the standard normal distribution. The Z-Score formula is:

$z = \frac{x - \mu}{\sigma}$, where $\mu$ represents the mean and $\sigma$ represents the standard deviation. Below, the Z-Score standardization process is encapsulated as a function (Z_ScoreNormalization), after the imports and the self-built helper functions used in this article.

import random
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import numpy as np
import torch
from torch import nn,optim
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.utils.data import random_split
from torch.utils.tensorboard import SummaryWriter

The self-built helper functions used throughout this article:

# Regression data set generator
def tensorGenReg(num_examples = 1000, w = [2, -1, 1], bias = True, delta = 0.01, deg = 1):
    """Create a regression data set.

    :param num_examples: number of samples to generate
    :param w: coefficient vector, including the intercept (if present)
    :param bias: whether to include an intercept
    :param delta: scale of the noise term
    :param deg: degree of the equation
    :return: the generated feature tensor and label tensor
    """
    
    if bias == True:
        num_inputs = len(w) - 1                                                      # number of features
        features_true = torch.randn(num_examples, num_inputs)                       # feature tensor without the all-ones column
        w_true = torch.tensor(w[:-1]).reshape(-1, 1).float()                        # feature coefficients
        b_true = torch.tensor(w[-1]).float()                                        # intercept
        if num_inputs == 1:                                                         # with a single feature, matrix multiplication cannot be used
            labels_true = torch.pow(features_true, deg) * w_true + b_true
        else:
            labels_true = torch.mm(torch.pow(features_true, deg), w_true) + b_true
        features = torch.cat((features_true, torch.ones(len(features_true), 1)), 1) # append an all-ones column to the feature tensor
        labels = labels_true + torch.randn(size = labels_true.shape) * delta
                
    else: 
        num_inputs = len(w)
        features = torch.randn(num_examples, num_inputs)
        w_true = torch.tensor(w).reshape(-1, 1).float()
        if num_inputs == 1:
            labels_true = torch.pow(features, deg) * w_true
        else:
            labels_true = torch.mm(torch.pow(features, deg), w_true)
        labels = labels_true + torch.randn(size = labels_true.shape) * delta
    return features, labels
# Commonly used data handling classes
# Dataset wrapper for custom data sets
class GenData(Dataset):
    def __init__(self, features, labels):
        self.features = features                    # feature tensor
        self.labels = labels                        # label tensor
        self.lens = len(features)                   # number of samples

    def __getitem__(self, index):
        return self.features[index, :], self.labels[index]

    def __len__(self):
        return self.lens
def split_loader(features, labels, batch_size=10, rate=0.7):
    """Wrap, split and load a data set.

    :param features: input feature tensor
    :param labels: label tensor
    :param batch_size: number of samples in each mini-batch
    :param rate: proportion of data used for training
    :return: the train and test DataLoaders
    """
    data = GenData(features, labels)
    num_train = int(data.lens * rate)               # use the rate argument instead of a hard-coded 0.7
    num_test = data.lens - num_train
    data_train, data_test = random_split(data, [num_train, num_test])
    train_loader = DataLoader(data_train, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(data_test, batch_size=batch_size, shuffle=False)
    return (train_loader, test_loader)
class Sigmoid_class3(nn.Module):                                   
    def __init__(self, in_features=2, n_hidden1=4, n_hidden2=4, n_hidden3=4, out_features=1, BN_model=None):       
        super(Sigmoid_class3, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden1)
        self.normalize1 = nn.BatchNorm1d(n_hidden1)
        self.linear2 = nn.Linear(n_hidden1, n_hidden2)
        self.normalize2 = nn.BatchNorm1d(n_hidden2)
        self.linear3 = nn.Linear(n_hidden2, n_hidden3)
        self.normalize3 = nn.BatchNorm1d(n_hidden3)
        self.linear4 = nn.Linear(n_hidden3, out_features) 
        self.BN_model = BN_model
        
    def forward(self, x): 
        if self.BN_model == None:
            z1 = self.linear1(x)
            p1 = torch.sigmoid(z1)
            z2 = self.linear2(p1)
            p2 = torch.sigmoid(z2)
            z3 = self.linear3(p2)
            p3 = torch.sigmoid(z3)
            out = self.linear4(p3)
        elif self.BN_model == 'pre':
            z1 = self.normalize1(self.linear1(x))
            p1 = torch.sigmoid(z1)
            z2 = self.normalize2(self.linear2(p1))
            p2 = torch.sigmoid(z2)
            z3 = self.normalize3(self.linear3(p2))
            p3 = torch.sigmoid(z3)
            out = self.linear4(p3)
        elif self.BN_model == 'post':
            z1 = self.linear1(x)
            p1 = torch.sigmoid(z1)
            z2 = self.linear2(self.normalize1(p1))
            p2 = torch.sigmoid(z2)
            z3 = self.linear3(self.normalize2(p2))
            p3 = torch.sigmoid(z3)
            out = self.linear4(self.normalize3(p3))
        return out
def mse_cal(data_loader, net):
    """MSE computation function.

    :param data_loader: a loaded DataLoader
    :param net: the model
    :return: the MSE of the model on the given data
    """
    data = data_loader.dataset                # recover the underlying Dataset
    X = data[:][0]                            # recover the features
    y = data[:][1]                            # recover the labels
    yhat = net(X)
    return F.mse_loss(yhat, y)

def fit(net, criterion, optimizer, batchdata, epochs=3, cla=False):
    """Model training function.

    :param net: the model to train
    :param criterion: loss function
    :param optimizer: optimization algorithm
    :param batchdata: training DataLoader
    :param cla: whether this is a classification problem
    :param epochs: number of passes over the data
    """
    for epoch in range(epochs):
        for X, y in batchdata:
            if cla == True:
                y = y.flatten().long()          # for classification, labels must be converted to integers
            yhat = net.forward(X)
            loss = criterion(yhat, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()



def model_train_test(model, 
                     train_data,
                     test_data,
                     num_epochs = 20, 
                     criterion = nn.MSELoss(), 
                     optimizer = optim.SGD, 
                     lr = 0.03, 
                     cla = False, 
                     eva = mse_cal):
    """Model training and evaluation function.

    :param model: the model
    :param train_data: training DataLoader
    :param test_data: test DataLoader
    :param num_epochs: number of training rounds
    :param criterion: loss function
    :param lr: learning rate
    :param cla: whether this is a classification model
    :return: lists of the evaluation metric (MSE) on the train and test data
    """
    # containers for the evaluation results
    train_l = []
    test_l = []
    # training loop
    for epochs in range(num_epochs):
        model.train()
        fit(net = model, 
            criterion = criterion, 
            optimizer = optimizer(model.parameters(), lr = lr), 
            batchdata = train_data, 
            epochs = epochs, 
            cla = cla)
        model.eval()
        train_l.append(eva(train_data, model).detach())
        test_l.append(eva(test_data, model).detach())
    return train_l, test_l


def weights_vp(model, att="grad"):
    """Violin plot of the parameter values or gradients of each linear layer.

    :param model: the model to inspect
    :param att: plot the parameter gradients ("grad") or the parameter values ("weights")
    :return: the violin plot for the chosen attribute
    """
    vp = []
    for i, m in enumerate(model.modules()):
        if isinstance(m, nn.Linear):
            if att == "grad":
                vp_x = m.weight.grad.detach().reshape(-1, 1).numpy()
            else:
                vp_x = m.weight.detach().reshape(-1, 1).numpy()
            vp_y = np.full_like(vp_x, i)
            vp_a = np.concatenate((vp_x, vp_y), 1)
            vp.append(vp_a)
    vp_r = np.concatenate((vp), 0)
    ax = sns.violinplot(y = vp_r[:, 0], x = vp_r[:, 1])
    ax.set(xlabel='num_hidden', title=att)
    
    
class tanh_class2(nn.Module):                                   
    def __init__(self, in_features=2, n_hidden1=4, n_hidden2=4, out_features=1, BN_model=None):       
        super(tanh_class2, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden1)
        self.normalize1 = nn.BatchNorm1d(n_hidden1)
        self.linear2 = nn.Linear(n_hidden1, n_hidden2)
        self.normalize2 = nn.BatchNorm1d(n_hidden2)
        self.linear3 = nn.Linear(n_hidden2, out_features) 
        self.BN_model = BN_model
        
    def forward(self, x):
        if self.BN_model == None:
            z1 = self.linear1(x)
            p1 = torch.tanh(z1)
            z2 = self.linear2(p1)
            p2 = torch.tanh(z2)
            out = self.linear3(p2)
        elif self.BN_model == 'pre':
            z1 = self.normalize1(self.linear1(x))
            p1 = torch.tanh(z1)
            z2 = self.normalize2(self.linear2(p1))
            p2 = torch.tanh(z2)
            out = self.linear3(p2)
        elif self.BN_model == 'post':
            z1 = self.linear1(x)
            p1 = torch.tanh(z1)
            z2 = self.linear2(self.normalize1(p1))
            p2 = torch.tanh(z2)
            out = self.linear3(self.normalize2(p2))
        return out
    
# Classification data set generator
def tensorGenCla(num_examples = 500, num_inputs = 2, num_class = 3, deg_dispersion = [4, 2], bias = False):
    """Create a classification data set.

    :param num_examples: number of samples per class
    :param num_inputs: number of features
    :param num_class: number of label classes
    :param deg_dispersion: dispersion parameters; a list whose first element is the reference spacing of the class means and whose second element is the standard deviation of each class
    :param bias: whether to add an intercept column (for logistic-regression-style models)
    :return: the generated feature tensor (2-D float) and label tensor (2-D long)
    """
    
    cluster_l = torch.empty(num_examples, 1)                         # shape of the label tensor of each class
    mean_ = deg_dispersion[0]                                        # reference mean spacing of the classes
    std_ = deg_dispersion[1]                                         # standard deviation of each class
    lf = []                                                          # container for the feature tensor of each class
    ll = []                                                          # container for the label tensor of each class
    k = mean_ * (num_class - 1) / 2                                  # offset that centers the class means around zero (note: num_class - 1, not + 1)
    
    for i in range(num_class):
        data_temp = torch.normal(i * mean_ - k, std_, size=(num_examples, num_inputs))  # features of class i
        lf.append(data_temp)                                                            # append the features of class i
        labels_temp = torch.full_like(cluster_l, i)                                     # labels of class i
        ll.append(labels_temp)                                                          # append the labels of class i
        
    features = torch.cat(lf).float()
    labels = torch.cat(ll).long()
    
    if bias == True:
        features = torch.cat((features, torch.ones(len(features), 1)), 1)               # append an all-ones column to the feature tensor
    return features, labels
def Z_ScoreNormalization(data):
    stdDf = data.std(0)                   # column-wise standard deviation
    meanDf = data.mean(0)                 # column-wise mean
    normSet = (data - meanDf) / stdDf     # subtract the mean and divide by the standard deviation, column by column
    return normSet
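A quick sanity check, not part of the original code, confirms that the output of Z_ScoreNormalization is zero-centered with unit standard deviation, column by column:

t = torch.arange(12).reshape(6, 2).float()
t_norm = Z_ScoreNormalization(t)
print(t_norm.mean(0))   # approximately tensor([0., 0.])
print(t_norm.std(0))    # approximately tensor([1., 1.])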

2. The practice of normalization in deep learning

  • Train on the training set, test on the test set

Before modeling, two questions need to be clarified: whether the labels should be standardized (in a regression problem), and whether the features of the test set should be standardized. First, standardizing the labels has no real effect on modeling, so labels are generally left untouched. Second, since the data set is split into a training set and a test set, the column-wise mean and standard deviation are normally computed on the training features only, and the model is trained on the standardized training data. When the test set is fed in for evaluation, the column means and standard deviations computed on the training set are reused to standardize the test features before they are passed to the model. Note that these column means and standard deviations are effectively model parameters: they must be obtained from the training set, never from the test set.
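Below is a minimal sketch of this procedure, assuming the features have already been split; the names train_features and test_features are illustrative and are not defined elsewhere in this article.

# Hypothetical split into training and test features
train_features = torch.randn(700, 2) * 5 + 3
test_features = torch.randn(300, 2) * 5 + 3

# Statistics are estimated on the training set only
train_mean = train_features.mean(0)
train_std = train_features.std(0)

# Both sets are standardized with the *training* statistics
train_norm = (train_features - train_mean) / train_std
test_norm = (test_features - train_mean) / train_std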

2.1 Z-Score modeling experiment

The data is standardized and then fed into the model for training, to test the practical effect of Z-Score standardization on a deep learning model. Here the procedure of computing the mean and standard deviation on the training set and then applying them to the test set is simplified: the whole data set is normalized directly.

# Generate the regression data set and a Z-Score normalized copy of its features
features, labels = tensorGenReg(w=[2, -1], bias=False, deg=2)
features_norm = Z_ScoreNormalization(features)

# Wrap and split both the raw and the normalized data
train_loader, test_loader = split_loader(features, labels)
train_loader_norm, test_loader_norm = split_loader(features_norm, labels)

# Training hyperparameters (values assumed here; they match the model_train_test defaults)
num_epochs = 20
lr = 0.03
# Instantiate two identical models
sigmoid_model3 = Sigmoid_class3()
sigmoid_model3_norm = Sigmoid_class3()

# Xavier initialization of the linear layers of both models
for m in sigmoid_model3.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
for m in sigmoid_model3_norm.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# Train sigmoid_model3 on the raw data
train_l, test_l = model_train_test(sigmoid_model3,
                                   train_loader,
                                   test_loader,
                                   num_epochs=num_epochs,
                                   criterion=nn.MSELoss(),
                                   optimizer=optim.SGD,
                                   lr=lr,
                                   cla=False,
                                   eva=mse_cal)

# Train sigmoid_model3_norm on the Z-Score normalized data
train_l_norm, test_l_norm = model_train_test(sigmoid_model3_norm,
                                             train_loader_norm,
                                             test_loader_norm,
                                             num_epochs=num_epochs,
                                             criterion=nn.MSELoss(),
                                             optimizer=optim.SGD,
                                             lr=lr,
                                             cla=False,
                                             eva=mse_cal)

# Compare the training MSE curves
plt.plot(list(range(num_epochs)), train_l, label='train_mse')
plt.plot(list(range(num_epochs)), train_l_norm, label='train_norm_mse')
plt.legend(loc=1)

From the final results, the model trained on Z-Score normalized data converges faster and, in some cases, reaches a better result.

3. Limitations of Z-Score normalization

Z-Score normalization is not a data normalization method tailored to deep learning. In practical neural network modeling, it still has clear limitations, mainly the following two.

3.1 The disappearance of zero-centered characteristics

  • Although Z-Score normalization can keep gradients smooth to some extent, improving convergence speed and sometimes the final result, it behaves much like Xavier initialization: it only fixes the "initial value". As the parameters are updated over the iterations, the data flowing through the network gradually loses its zero-centered property.
  • Once both the parameters and the data received by each layer drift out of control, the gradients of each layer drift with them, and the intended gradient stability can no longer be guaranteed.
  • To verify this, we build a model with the tanh activation function (whose gradients are comparatively less stable) and inspect the per-layer gradients after 5 and after 40 training epochs.
# Instantiate two identical tanh models
tanh_model2_norm1 = tanh_class2()
tanh_model2_norm2 = tanh_class2()

# Xavier initialization of the linear layers of both models
for m in tanh_model2_norm1.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
for m in tanh_model2_norm2.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# Train tanh_model2_norm1 for 5 epochs on the normalized data
train_l, test_l = model_train_test(tanh_model2_norm1,
                                   train_loader_norm,
                                   test_loader_norm,
                                   num_epochs=5,
                                   criterion=nn.MSELoss(),
                                   optimizer=optim.SGD,
                                   lr=lr,
                                   cla=False,
                                   eva=mse_cal)

# Train tanh_model2_norm2 for 40 epochs on the normalized data
train_l_norm, test_l_norm = model_train_test(tanh_model2_norm2,
                                             train_loader_norm,
                                             test_loader_norm,
                                             num_epochs=40,
                                             criterion=nn.MSELoss(),
                                             optimizer=optim.SGD,
                                             lr=lr,
                                             cla=False,
                                             eva=mse_cal)

# Violin plot of the per-layer gradients after 5 epochs
weights_vp(tanh_model2_norm1, att='grad')

# Violin plot of the per-layer gradients after 40 epochs
weights_vp(tanh_model2_norm2, att='grad')

The gradients are relatively stable early in training, but a clear gradient explosion appears in the later iterations.

3.2 Limitations of zero-centered Data

Besides the gradual loss of the zero-centered property of the input data during training, Z-Score standardization has another limitation when applied to deep learning: the benefit of zero-centered data is itself limited. Even if the input data stayed perfectly zero-centered, that alone would not guarantee smooth gradients. The gradient of each layer is determined jointly by the activation function, the data the layer receives and the layer's parameters, so even with zero-mean input data the gradients can still become unstable because of the other factors. Centering the data is therefore not a complete solution on its own.
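A small illustration of this point (a sketch, not part of the original experiment): even when the input to a layer is exactly zero-centered, a sigmoid activation squeezes its output into (0, 1), so the data received by the next linear layer is no longer zero-centered.

torch.manual_seed(420)
x = torch.randn(1000, 4)                 # zero-centered input
print(x.mean().item())                   # close to 0
print(torch.sigmoid(x).mean().item())    # close to 0.5, no longer zero-centered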

4. Adjusting the input data to keep gradients stable

  • Three core factors affect gradient stability: the parameters of each layer, the data received by each linear layer, and the activation function. Apart from tuning the parameters, the only remaining levers are the choice of activation function and the adjustment of the data fed into each layer.
  • Batch Normalization (BN) is a proven and effective means of adjusting that data and optimizing the model (a minimal sketch follows this list).
  • Deep learning is an "empirical" technology: in many cases the model's performance is the paramount consideration. Methods such as BN, whose theoretical justification is still incomplete but whose practical effect is excellent, are therefore widely used in the field. That does not mean we should only discuss how to use them and ignore the theory behind them; a competent algorithm engineer should understand both the application and the theoretical background of these methods.
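As a preview, and independent of the models defined earlier, the sketch below shows what nn.BatchNorm1d does during training: it re-standardizes each mini-batch feature by feature, which is exactly the kind of per-layer data adjustment discussed above.

bn = nn.BatchNorm1d(2)                        # one BN layer for 2 features
batch = torch.randn(10, 2) * 3 + 5            # a batch that is clearly not zero-centered
out = bn(batch).detach()
print(out.mean(0))                            # approximately 0 for each feature
print(out.std(0))                             # approximately 1 for each feature (up to the biased/unbiased difference)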

4.1 Independence of normalization method and data distribution

The essence of any normalization is to translate and scale the data. Translation means adding or subtracting a constant uniformly from each column of the data set; in Z-Score this constant is the column mean. Scaling means multiplying or dividing each column by a constant; in Z-Score each column is divided by its standard deviation. Neither translation nor scaling changes the distribution of the data features.

# Set the random seed and create a two-class data set
torch.manual_seed(420)
features, labels = tensorGenCla(num_class=2, deg_dispersion=[6, 2])

# Scatter plot of the original features
plt.scatter(features[:, 0], features[:, 1], c=labels)

features

f=Z_ScoreNormalization(features)
f

Compare the distribution of the data set before and after normalization:

plt.subplot(121)
plt.scatter(features[:,0],features[:,1],c=labels)
plt.title('features distribution')
plt.subplot(122)
plt.scatter(f[:,0],f[:,1],c=labels)
plt.title('f distribution')

  • The distribution of the data is unchanged before and after normalization; only the absolute coordinates of the data in space change.
  • In fact, the distribution of the data embodies the law behind the data; when a model captures the law of the data, it is really learning that distribution. That normalization does not modify the distribution is the basic premise for using it: if a normalization method did change the distribution, it would artificially destroy the original law of the data and severely affect the subsequent model learning. A short numerical check follows this list.
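A short numerical check of this claim, using the features and f tensors created above (this check is an addition and assumes PyTorch 1.10+ for torch.corrcoef): each standardized column is an affine transformation of the original one, so their correlation is exactly 1 and the ordering of the samples is unchanged.

col = features[:, 0]
col_norm = f[:, 0]
# Correlation between the original and the standardized column is 1
print(torch.corrcoef(torch.stack([col, col_norm]))[0, 1])
# The ordering of samples is identical before and after standardization
print(torch.equal(col.argsort(), col_norm.argsort()))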