Model building in TensorFlow

Broadly speaking, common supervised machine learning problems fall into two categories: classification and regression. To solve either with TensorFlow, we have to build our own network model, and the different levels of the TensorFlow API lead to different ways of building it. The lower-level API is more flexible and lets you add whatever you want, but it is harder to code. The higher-level API is better encapsulated, so a model can be built in a few lines of code, at the cost of some flexibility. Today I’m going to walk through several API levels that can be used to build networks.

1. Regression problems

1.1 Data Generation

First we have to design a regression problem ourselves, which is to create an equation, and then train the network to fit it.

The well-known linear equation is $Y = WX + b$

Here we generate 200 samples: X is uniformly distributed in (-10, 10), W is (2, -2), b = 3, and Gaussian noise is added

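import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, losses, metrics, models, optimizers

# Set the random seed
tf.random.set_seed(0)
# Number of samples
n = 200

# Generate the data set
# Y = X @ w0 + b0 + noise, i.e. Y = 2*x1 - 2*x2 + 3 + noise
X = tf.random.uniform([n,2],minval=-10,maxval=10)
w0 = tf.constant([[2.0], [-2.0]])
b0 = tf.constant([[3.0]])
Y = X@w0 + b0 + tf.random.normal([n,1],mean = 0.0,stddev= 2.0)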

The generated data looks like this

plt.figure(figsize = (12,5))
ax1 = plt.subplot(121)
ax1.scatter(X[:,0],Y[:,0], c = "b")
plt.xlabel("x1")
plt.ylabel("y",rotation = 0)

ax2 = plt.subplot(122)
ax2.scatter(X[:,1],Y[:,0], c = "g")
plt.xlabel("x2")
plt.ylabel("y",rotation = 0)
plt.show()

For the sake of simplicity, I will not split the data into training and test sets; the whole data set is used directly for training and prediction. Next we need a data generator that yields one batch of X and Y (batch_size samples) at a time.

The general idea of the data generator is as follows:

  1. Start by randomly shuffling the sample indices

  2. Iterate over the data in steps of batch_size, taking a slice of the shuffled indices (of size batch_size) each time

  3. Use tf.gather() to pick the rows of X and Y at those shuffled indices, and yield them from the generator

The tf.gather(params, indices, axis=0) function returns the slices of params selected by indices along the given axis

# Build the data generator
def data_iter(features, labels, batch_size=8):
    num_examples = len(features)
    indices = list(range(num_examples))
    np.random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        indexs = indices[i: min(i + batch_size, num_examples)]
        yield tf.gather(features,indexs), tf.gather(labels,indexs)

# Test the data generator
batch_size = 8
(datas,labels) = next(data_iter(X,Y,batch_size))
print(datas)
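To see the shuffle-and-slice idea without TensorFlow, here is a minimal pure-Python sketch. The names gather and batch_iter are hypothetical helpers made up for this illustration; gather only mimics what tf.gather does along axis 0 for a plain list, and the toy data is invented.

```python
import random

def gather(params, indices):
    """Pick the rows of `params` listed in `indices` (like tf.gather with axis=0)."""
    return [params[i] for i in indices]

def batch_iter(features, labels, batch_size=8, seed=None):
    """Yield (features, labels) mini-batches in a freshly shuffled order."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield gather(features, batch), gather(labels, batch)

# Toy data: 20 samples, the label is just the sample id times 10.
toy_x = [[i, -i] for i in range(20)]
toy_y = [10 * i for i in range(20)]

batches = list(batch_iter(toy_x, toy_y, batch_size=8, seed=0))
batch_sizes = [len(xb) for xb, _ in batches]
seen_labels = sorted(y for _, yb in batches for y in yb)
print(batch_sizes)           # the last batch is smaller because 20 = 8 + 8 + 4
print(seen_labels == toy_y)  # every sample appears exactly once per epoch
```

Because features and labels are gathered with the same index slice, each row stays paired with its own label even though the order is random.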

At this point the data and the batch generator are ready. The next step is to define and train the model, which is the main topic for today.

1.2 High-level API implementation

First, I want to build the model in the most common way, with the high-level API, so that a network can be set up without worrying about too many details. It is also the easiest place to start.

Here I set the learning rate to 0.001, batch_size=20 and epochs=100, and choose SGD as the optimizer

I want to say a few extra words here about optimizers. As the key to parameter optimization, the optimizer plays an important role in model training. Most of the time we choose Adam, but in my own recent experiments Adam fit slowly and the results were not good (by results I mean the final loss): it needed more training iterations to reach the same loss, and its loss was often higher than with SGD. I then looked into optimizers proposed in recent years, such as AggMo, Apollo, diffGrad, Lamb and MADGRAD, gained a deeper understanding of optimizers, and wondered whether there was a better choice. Neither TensorFlow nor PyTorch ships APIs for these optimizers, but there is a third-party library called pytorch-optimizer that provides them, so you can try out the difference between the new optimizers and the classic ones.

Okay, back to today’s topic: building networks

We know that the equation is linear, so a single linear layer is enough to fit it, and no nonlinear activation function is needed

lr3=0.001
optimizer=optimizers.SGD(learning_rate=lr3)


model3=models.Sequential()
model3.add(layers.Dense(1,input_shape=(2,)))
model3.compile(optimizer=optimizer,loss='mse',metrics=['mae'])
model3.fit(X,Y,batch_size=20,epochs=100)

tf.print(f"w={model3.layers[0].kernel}")
tf.print(f"b={model3.layers[0].bias}")

The final loss is 3.62, and the fitted model looks like this

w,b = model3.variables

plt.figure(figsize = (12,5))
ax1 = plt.subplot(121)
ax1.scatter(X[:,0],Y[:,0], c = "b",label = "samples")
ax1.plot(X[:,0],w[0]*X[:,0]+b[0],"-r",linewidth = 5.0,label = "model")
ax1.legend()
plt.xlabel("x1")
plt.ylabel("y",rotation = 0)

ax2 = plt.subplot(122)
ax2.scatter(X[:,1],Y[:,0], c = "g",label = "samples")
ax2.plot(X[:,1],w[1]*X[:,1]+b[0],"-r",linewidth = 5.0,label = "model")
ax2.legend()
plt.xlabel("x2")
plt.ylabel("y",rotation = 0)

plt.show()

1.3 Mid-level API implementation

The most important feature of the high-level API is convenience and simplicity: it takes only a few lines of code to build and train a model. The mid-level API relies less on encapsulated interfaces, and some of the functionality, such as the training loop, is implemented by hand.

# Learning rate
lr2=0.001
# Batch size
batch_size2=30

model2=layers.Dense(1,input_shape=(2,))
model2.loss_func=losses.mean_squared_error
model2.optimizer=optimizers.SGD(learning_rate=lr2)

Unlike the high-level API implementation, the models module is no longer needed; a single layer is enough. With the loss function and optimizer set up, it’s time to write the training code ourselves.

First write a function that performs one training step on a batch, then call it in a loop over all batches for several epochs

@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features)
        loss = model.loss_func(tf.reshape(labels,[-1]), tf.reshape(predictions,[-1]))
    grads = tape.gradient(loss,model.variables)
    model.optimizer.apply_gradients(zip(grads,model.variables))
    return loss

# Test train_step effect
features,labels = next(data_iter(X,Y,batch_size2))
train_step(model2,features,labels)

With the help of automatic differentiation, each forward pass gives us the predictions and the loss, and the tape then computes the partial derivative of the loss with respect to each model parameter, which is that parameter’s gradient. Finally, the optimizer updates each parameter according to its gradient
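To make the three stages (forward pass, gradient, update) concrete, here is a framework-free sketch of one gradient-descent step for a two-feature linear model with MSE loss. The gradient is written out analytically, which is the quantity tf.GradientTape computes for us automatically, and it is checked against a finite-difference estimate. All names (predict, mse, mse_grads, sgd_step) and the tiny batch are hypothetical and not part of TensorFlow.

```python
def predict(w, b, x):
    """Forward pass of a linear model with two features."""
    return w[0] * x[0] + w[1] * x[1] + b

def mse(w, b, xs, ys):
    """Mean squared error over a batch."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grads(w, b, xs, ys):
    """Analytic partial derivatives of the MSE with respect to w[0], w[1] and b."""
    n = len(xs)
    errs = [predict(w, b, x) - y for x, y in zip(xs, ys)]
    dw0 = 2.0 / n * sum(e * x[0] for e, x in zip(errs, xs))
    dw1 = 2.0 / n * sum(e * x[1] for e, x in zip(errs, xs))
    db = 2.0 / n * sum(errs)
    return [dw0, dw1], db

def sgd_step(w, b, xs, ys, lr):
    """One gradient-descent update; returns the new parameters."""
    (dw0, dw1), db = mse_grads(w, b, xs, ys)
    return [w[0] - lr * dw0, w[1] - lr * dw1], b - lr * db

# A tiny noise-free batch generated from w = (2, -2), b = 3.
xs = [[1.0, 2.0], [-3.0, 0.5], [4.0, -1.0], [0.0, 2.5]]
ys = [2.0 * x[0] - 2.0 * x[1] + 3.0 for x in xs]

w, b = [0.0, 0.0], 0.0
loss_before = mse(w, b, xs, ys)

# Finite-difference check of d(loss)/d(b) at the starting point.
h = 1e-6
numeric_db = (mse(w, b + h, xs, ys) - mse(w, b - h, xs, ys)) / (2 * h)
analytic_db = mse_grads(w, b, xs, ys)[1]

w, b = sgd_step(w, b, xs, ys, lr=0.01)
loss_after = mse(w, b, xs, ys)
print(loss_before, loss_after)  # the loss drops after one step
```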

With the single training step in place, the model can be trained simply by looping over batches and epochs

def train_model(model,epochs):
    for epoch in tf.range(1,epochs+1):
        loss = tf.constant(0.0)
        for features, labels in data_iter(X,Y,batch_size2):
            loss = train_step(model,features,labels)
        if epoch%50==0:
            tf.print(f"=========================================================   {time.asctime(time.localtime(time.time()))}")
            tf.print("epoch =",epoch,"loss = ",loss)
            tf.print("w =",model.variables[0])
            tf.print("b =",model.variables[1])
train_model(model2,epochs = 200)

w,b = model2.variables

plt.figure(figsize = (12,5))
ax1 = plt.subplot(121)
ax1.scatter(X[:,0],Y[:,0], c = "b",label = "samples")
ax1.plot(X[:,0],w[0]*X[:,0]+b[0],"-r",linewidth = 5.0,label = "model")
ax1.legend()
plt.xlabel("x1")
plt.ylabel("y",rotation = 0)



ax2 = plt.subplot(122)
ax2.scatter(X[:,1],Y[:,0], c = "g",label = "samples")
ax2.plot(X[:,1],w[1]*X[:,1]+b[0],"-r",linewidth = 5.0,label = "model")
ax2.legend()
plt.xlabel("x2")
plt.ylabel("y",rotation = 0)

plt.show()

The final loss is around 3.46

1.4 Low-level API implementation

In the two implementations above, Dense saved us from building $Y = WX + b$ ourselves, and the optimizer saved us from implementing an optimization algorithm by hand. So now let’s use only the basic API and some basic knowledge, without any encapsulation.

First we build $Y = WX + b$ ourselves: declare two variables, w and b, then define the forward-propagation formula x@w + b, which is what Dense computes when the activation function is removed. The loss function is the mean squared error, $\frac{\sum (\text{groundtruth} - \text{predict})^2}{N}$, which is usually preceded by a factor of $\frac{1}{2}$

# Variables of the fitting function x@w + b
w = tf.Variable(tf.random.normal(w0.shape))
b = tf.Variable(tf.zeros_like(b0,dtype = tf.float32))

# Define model
class LinearRegression:
    # Forward propagation
    def __call__(self,x):
        return x@w + b

    # Loss function (mean squared error with the extra factor of 1/2)
    def loss_func(self,y_true,y_pred):
        return tf.reduce_mean((y_true - y_pred)**2/2)

model = LinearRegression()

In this way, a basic linear regression model is established. The next stage is to optimize the model parameters, that is, to train the model. The model doesn’t have many parameters, so we’re going to use the most basic gradient descent method. The gradient comes from automatic differentiation, and the update formula is $w = w - \alpha \frac{\partial}{\partial w} J(w, b)$

where $w$ is the weight, $\alpha$ is the learning rate (the step size of each descent update) and $J(w, b)$ is the loss function, that is, the mean squared error above. The rest of the training code is similar to the previous section
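For reference, the partial derivatives that automatic differentiation returns for this loss (mean squared error with the extra factor of 1/2, as in loss_func above) can also be written out by hand. This is only a sketch for a batch of N samples; the symbols N, x_i (a row of features), y_i (its label) and \hat{y}_i (the prediction) are notation introduced here.

```latex
\begin{aligned}
J(w,b) &= \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2, \qquad \hat{y}_i = x_i w + b \\
\frac{\partial J}{\partial w} &= -\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)x_i^{\top} \\
\frac{\partial J}{\partial b} &= -\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right) \\
w &\leftarrow w - \alpha\,\frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}
\end{aligned}
```

The factor of 1/2 in the loss cancels the 2 that comes from differentiating the square, which is why it is commonly included.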

# Learning rate
lr=0.001
# Batch size
batch_size=20

@tf.function
def train_step(model, features, labels):
    # For automatic differentiation
    with tf.GradientTape() as tape:
        predictions = model(features)
        loss = model.loss_func(labels, predictions)
    # Backpropagation: the partial derivative of the loss for each parameter
    dloss_dw,dloss_db = tape.gradient(loss,[w,b])
    # Gradient descent method to update parameters
    w.assign(w - lr*dloss_dw)
    b.assign(b - lr*dloss_db)
    return loss


def train_model(model,epochs):
    for epoch in tf.range(1,epochs+1):
        for features, labels in data_iter(X,Y,batch_size):
            loss = train_step(model,features,labels)
        if epoch%50==0:
            tf.print(f"=========================================================   {time.asctime(time.localtime(time.time()))}")
            tf.print("epoch =",epoch,"loss = ",loss)
            tf.print("w =",w)
            tf.print("b =",b)

train_model(model,epochs = 200)

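The final loss is only about 2.00. At first glance my simple hand-written gradient descent looks better than the optimizer-based versions, but this loss carries the extra factor of 1/2, so it corresponds to an MSE of about 4.00. That is roughly the variance of the noise we added (stddev 2.0) and comparable to the results above. Maybe the equation is simply too easy to tell the approaches apart.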

2. Classification problems

The process for a classification problem is roughly the same as for a regression problem: generate the data, then use the API to build and train a model. The difference is that our regression problem is linear, so it needs no activation function and plain gradient descent converges to a good fit. The classification problem here is nonlinear, so activation functions must be added to each layer, and the loss function used during training is no longer MSE but cross entropy.
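One way to see why the activation functions matter: without them, stacking layers adds nothing, because two linear layers in a row are still a single linear map. The sketch below checks this numerically in plain Python on a made-up 2-2-1 network; the helper names and all weights are hypothetical.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]

def relu(v):
    """Element-wise max(0, t)."""
    return [max(0.0, t) for t in v]

# Two stacked layers: 2 inputs -> 2 hidden units -> 1 output.
W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[2.0, -3.0]], [0.5]

# Collapse them by hand: W = W2 W1 and b = W2 b1 + b2.
W = [[sum(W2[0][k] * W1[k][j] for k in range(2)) for j in range(2)]]
b = [sum(W2[0][k] * b1[k] for k in range(2)) + b2[0]]

points = [[1.0, 2.0], [-2.0, 0.5], [3.0, -1.0]]
stacked = [linear(W2, b2, linear(W1, b1, x))[0] for x in points]
collapsed = [linear(W, b, x)[0] for x in points]
with_relu = [linear(W2, b2, relu(linear(W1, b1, x)))[0] for x in points]

print(stacked)
print(collapsed)  # same values as `stacked`
print(with_relu)  # differs from `stacked` wherever ReLU clips a hidden unit
```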

Why is MSE no longer appropriate?

Recall the curve of the sigmoid activation function and the curve of its derivative: the derivative is close to zero wherever the sigmoid saturates. If we use MSE as the loss function and the initial error is large (which is often the case), the sigmoid output starts in a saturated region where its derivative is almost zero, so gradient descent gets almost no gradient (the vanishing gradient problem), the parameters cannot be updated, and training eventually fails.
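A small numeric sketch of this effect, for a single sigmoid output unit with pre-activation z and label y. It compares the gradient with respect to z under a squared-error loss and under binary cross entropy. The function names are hypothetical, and the two derivative formulas are the standard ones for these losses, written out by hand rather than taken from TensorFlow.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def grad_bce(z, y):
    """d/dz of the binary cross entropy between y and sigmoid(z)."""
    return sigmoid(z) - y

# A badly wrong prediction: the label is 1 but the pre-activation is very negative.
z, y = -10.0, 1.0
print(sigmoid(z) * (1.0 - sigmoid(z)))  # sigmoid derivative, close to zero here
print(grad_mse(z, y))                   # tiny gradient although the error is large
print(grad_bce(z, y))                   # close to -1, so learning still proceeds
```

With cross entropy the sigmoid derivative cancels out of the gradient, so a large error produces a large update instead of a vanishing one.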

In this case, cross entropy is a good classification loss function


$$H(p,q) = -\sum_{i=1}^{n} p(x_i) \log(q(x_i))$$

where $p(x_i)$ is the true probability of the event and $q(x_i)$ is the predicted probability

For binary classification there are only two labels, 0 and 1, so the two true probabilities are p and 1-p and the two predicted probabilities are q and 1-q, and the formula above reduces to


$$\mathrm{Cross\_Entropy}(p,q) = -\left(p \log q + (1-p) \log(1-q)\right)$$
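As a quick sanity check, the general formula and the binary formula give the same number when there are only two classes. This is a plain-Python sketch; the function names and the value q = 0.8 are made up for the illustration.

```python
import math

def cross_entropy(p, q):
    """General form: H(p, q) = -sum_i p_i * log(q_i), skipping terms with p_i = 0."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def binary_cross_entropy(p, q):
    """Binary form for a label p in {0, 1} and a predicted probability 0 < q < 1."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

q = 0.8  # predicted probability of the positive class
general_pos = cross_entropy([1.0, 0.0], [q, 1 - q])  # true label 1
general_neg = cross_entropy([0.0, 1.0], [q, 1 - q])  # true label 0
print(general_pos, binary_cross_entropy(1, q))  # both are -log(q)
print(general_neg, binary_cross_entropy(0, q))  # both are -log(1 - q)
```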

With the basic principles covered, let’s start implementing them

2.1 Data Generation

# number of positive and negative samples
n_positive,n_negative = 2000,2000

# Generate positive samples, small circle distribution
r_p = 5.0 + tf.random.truncated_normal([n_positive,1],0.0,1.0)
theta_p = tf.random.uniform([n_positive,1],0.0,2*np.pi)
Xp = tf.concat([r_p*tf.cos(theta_p),r_p*tf.sin(theta_p)],axis = 1)
Yp = tf.ones_like(r_p)

# Generate negative sample, big circle distribution
r_n = 8.0 + tf.random.truncated_normal([n_negative,1],0.0,1.0)
theta_n = tf.random.uniform([n_negative,1],0.0,2*np.pi)
Xn = tf.concat([r_n*tf.cos(theta_n),r_n*tf.sin(theta_n)],axis = 1)
Yn = tf.zeros_like(r_n)

# Aggregate sample
X = tf.concat([Xp,Xn],axis = 0)
Y = tf.concat([Yp,Yn],axis = 0)


# visualization
plt.figure(figsize = (6,6))
plt.scatter(Xp[:,0].numpy(),Xp[:,1].numpy(),c = "r")
plt.scatter(Xn[:,0].numpy(),Xn[:,1].numpy(),c = "g")
plt.legend(["Positive sample","Negative sample"])
plt.show()

tf.random.truncated_normal() draws from a truncated normal distribution, i.e. it limits the randomly generated values to the range $(\mu - 2\sigma, \mu + 2\sigma)$, where $\mu$ is the mean and $\sigma$ is the standard deviation
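The idea can be sketched in plain Python with rejection sampling: draw from a normal distribution and redraw any value that falls more than two standard deviations from the mean. This only illustrates the behaviour and is not a claim about how TensorFlow implements it internally; the function name and the seed argument are hypothetical.

```python
import random

def truncated_normal(n, mean=0.0, stddev=1.0, seed=None):
    """Draw n normal samples, redrawing any that fall outside mean +/- 2*stddev."""
    rng = random.Random(seed)
    low, high = mean - 2.0 * stddev, mean + 2.0 * stddev
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if low <= value <= high:
            samples.append(value)
    return samples

# Radii of the "small circle" class: 5.0 plus truncated noise, so 3.0 <= r <= 7.0.
radii = [5.0 + e for e in truncated_normal(2000, 0.0, 1.0, seed=0)]
print(min(radii), max(radii))
```

By the same reasoning the radii of the big circle (8.0 plus truncated noise) lie between 6.0 and 10.0, so the two classes overlap slightly and cannot be separated perfectly.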

2.2 High-level API implementation

Using a high-level API is still just a few lines of code

model3=models.Sequential()
model3.add(layers.Dense(4,input_shape=(2,),activation='relu'))
model3.add(layers.Dense(8,activation='relu'))
model3.add(layers.Dense(1,activation='sigmoid'))

model3.summary()

optimizer = optimizers.SGD(learning_rate=0.001)
loss_func = tf.keras.losses.BinaryCrossentropy()
model3.compile(optimizer=optimizer,loss=loss_func,metrics=['acc'])
model3.fit(X,Y,batch_size=100,epochs=50)


fig, (ax1,ax2) = plt.subplots(nrows=1,ncols=2,figsize = (12,5))
ax1.scatter(Xp[:,0].numpy(),Xp[:,1].numpy(),c = "r")
ax1.scatter(Xn[:,0].numpy(),Xn[:,1].numpy(),c = "g")
ax1.legend(["positive","negative"]);
ax1.set_title("y_true");

Xp_pred = tf.boolean_mask(X,tf.squeeze(model3(X)>=0.5),axis = 0)
Xn_pred = tf.boolean_mask(X,tf.squeeze(model3(X)<0.5),axis = 0)

ax2.scatter(Xp_pred[:,0].numpy(),Xp_pred[:,1].numpy(),c = "r")
ax2.scatter(Xn_pred[:,0].numpy(),Xn_pred[:,1].numpy(),c = "g")
ax2.legend(["positive","negative"]);
ax2.set_title("y_pred")
plt.show()

2.3 Mid-level API implementation

Using the mid-level API essentially means defining your own deep neural network without relying on models.Sequential(), writing out its layers and its forward propagation yourself. After instantiation, binary cross entropy and an optimizer are used to train the model

class DNNModel2(tf.Module):
    def __init__(self,name = None):
        super(DNNModel2, self).__init__(name=name)
        self.dense1 = layers.Dense(4,activation = "relu") 
        self.dense2 = layers.Dense(8,activation = "relu")
        self.dense3 = layers.Dense(1,activation = "sigmoid")


    # Forward propagation
    @tf.function(input_signature=[tf.TensorSpec(shape = [None,2], dtype = tf.float32)])
    def __call__(self,x):
        x = self.dense1(x)
        x = self.dense2(x)
        y = self.dense3(x)
        return y

model2 = DNNModel2()
model2.loss_func = losses.binary_crossentropy
model2.metric_func = metrics.binary_accuracy
model2.optimizer = optimizers.Adam(learning_rate=0.001)

(features,labels) = next(data_iter(X,Y,batch_size))
predictions = model2(features)
loss = model2.loss_func(tf.reshape(labels,[-1]),tf.reshape(predictions,[-1]))
metric = model2.metric_func(tf.reshape(labels,[-1]),tf.reshape(predictions,[-1]))

tf.print("Initial loss :",loss)
tf.print("Initialization accuracy",metric)

The training of the model is similar to the regression problem above

@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features)
        loss = model.loss_func(tf.reshape(labels,[-1]), tf.reshape(predictions,[-1]))
    grads = tape.gradient(loss,model.trainable_variables)
    model.optimizer.apply_gradients(zip(grads,model.trainable_variables))

    metric = model.metric_func(tf.reshape(labels,[-1]), tf.reshape(predictions,[-1]))

    return loss,metric

# Test train_step effect
(features,labels) = next(data_iter(X,Y,batch_size))
train_step(model2,features,labels)

def train_model(model,epochs):
    for epoch in tf.range(1,epochs+1):
        loss, metric = tf.constant(0.0),tf.constant(0.0)
        for features, labels in data_iter(X,Y,batch_size):
            loss,metric = train_step(model,features,labels)
        if epoch%10==0:
            tf.print(f"=========================================================   {time.asctime(time.localtime(time.time()))}")
            tf.print("epoch =",epoch,"loss = ",loss, "accuracy = ",metric)
train_model(model2,epochs = 50)


fig, (ax1,ax2) = plt.subplots(nrows=1,ncols=2,figsize = (12,5))
ax1.scatter(Xp[:,0].numpy(),Xp[:,1].numpy(),c = "r")
ax1.scatter(Xn[:,0].numpy(),Xn[:,1].numpy(),c = "g")
ax1.legend(["positive","negative"]);
ax1.set_title("y_true");

Xp_pred = tf.boolean_mask(X,tf.squeeze(model2(X)>=0.5),axis = 0)
Xn_pred = tf.boolean_mask(X,tf.squeeze(model2(X)<0.5),axis = 0)

ax2.scatter(Xp_pred[:,0].numpy(),Xp_pred[:,1].numpy(),c = "r")
ax2.scatter(Xn_pred[:,0].numpy(),Xn_pred[:,1].numpy(),c = "g")
ax2.legend(["positive","negative"]);
ax2.set_title("y_pred")
plt.show()

2.4 Low-level API implementation

In the implementations above, every layer except the final classification layer does the same thing: it computes $Y = \mathrm{relu}(WX + b)$. The last layer replaces relu with sigmoid for classification, and everything else is unchanged. So in the low-level implementation we define the variables ourselves, run the forward propagation, compute the gradients by backpropagation and update the parameters. The loss_func, binary cross entropy, simply uses the simplified formula from the beginning of this section

class DNNModel(tf.Module):
    def __init__(self,name = None):
        super(DNNModel, self).__init__(name=name)
        self.w1 = tf.Variable(tf.random.truncated_normal([2,4]),dtype = tf.float32)
        self.b1 = tf.Variable(tf.zeros([1,4]),dtype = tf.float32)
        self.w2 = tf.Variable(tf.random.truncated_normal([4,8]),dtype = tf.float32)
        self.b2 = tf.Variable(tf.zeros([1,8]),dtype = tf.float32)
        self.w3 = tf.Variable(tf.random.truncated_normal([8,1]),dtype = tf.float32)
        self.b3 = tf.Variable(tf.zeros([1,1]),dtype = tf.float32)


    # Forward propagation
    @tf.function(input_signature=[tf.TensorSpec(shape = [None,2], dtype = tf.float32)])
    def __call__(self,x):
        x = tf.nn.relu(x@self.w1 + self.b1)
        x = tf.nn.relu(x@self.w2 + self.b2)
        y = tf.nn.sigmoid(x@self.w3 + self.b3)
        return y

    # Loss function (binary cross entropy)
    @tf.function(input_signature=[tf.TensorSpec(shape = [None,1], dtype = tf.float32),
                              tf.TensorSpec(shape = [None,1], dtype = tf.float32)])
    def loss_func(self,y_true,y_pred):
        # Limit predictions above 1E-7 and below 1-1E-7 to avoid log(0) errors
        eps = 1e-7
        y_pred = tf.clip_by_value(y_pred,eps,1.0-eps)
        bce = - y_true*tf.math.log(y_pred) - (1-y_true)*tf.math.log(1-y_pred)
        return  tf.reduce_mean(bce)

    # Evaluation indicators (accuracy)
    @tf.function(input_signature=[tf.TensorSpec(shape = [None,1], dtype = tf.float32),
                              tf.TensorSpec(shape = [None,1], dtype = tf.float32)])
    def metric_func(self,y_true,y_pred):
        y_pred = tf.where(y_pred>0.5,tf.ones_like(y_pred,dtype = tf.float32),
                          tf.zeros_like(y_pred,dtype = tf.float32))
        acc = tf.reduce_mean(1-tf.abs(y_true-y_pred))
        return acc
      
batch_size = 10
(features,labels) = next(data_iter(X,Y,batch_size))

# model instantiation
model = DNNModel()
predictions = model(features)

loss = model.loss_func(labels,predictions)
metric = model.metric_func(labels,predictions)

tf.print("Initial loss :",loss)
tf.print("Initialization accuracy",metric)

All trainable parameters, that is, the variables we defined ourselves, are trained according to loss_func with the gradient descent update $w = w - \alpha \frac{\partial}{\partial w} J(w, b)$

# Start training
lr=0.005

@tf.function
def train_step(model, features, labels):

    # Forward propagation for loss
    with tf.GradientTape() as tape:
        predictions = model(features)
        loss = model.loss_func(labels, predictions) 

    # Backpropagation: compute the gradients
    grads = tape.gradient(loss, model.trainable_variables)

    # Perform gradient descent
    for p, dloss_dp in zip(model.trainable_variables,grads):
        p.assign(p - lr*dloss_dp)

    # Calculate the metrics
    metric = model.metric_func(labels,predictions)

    return loss, metric


def train_model(model,epochs):
    for epoch in tf.range(1,epochs+1):
        for features, labels in data_iter(X,Y,150):
            loss,metric = train_step(model,features,labels)
        if epoch%100==0:
            tf.print(f"=========================================================   {time.asctime(time.localtime(time.time()))}")
            tf.print("epoch =",epoch,"loss = ",loss, "accuracy = ", metric)


train_model(model,epochs = 600)

fig, (ax1,ax2) = plt.subplots(nrows=1,ncols=2,figsize = (12,5))
ax1.scatter(Xp[:,0],Xp[:,1],c = "r")
ax1.scatter(Xn[:,0],Xn[:,1],c = "g")
ax1.legend(["positive","negative"]);
ax1.set_title("y_true");

Xp_pred = tf.boolean_mask(X,tf.squeeze(model(X)>=0.5),axis = 0)
Xn_pred = tf.boolean_mask(X,tf.squeeze(model(X)<0.5),axis = 0)

ax2.scatter(Xp_pred[:,0],Xp_pred[:,1],c = "r")
ax2.scatter(Xn_pred[:,0],Xn_pred[:,1],c = "g")
ax2.legend(["positive","negative"]);
ax2.set_title("y_pred")
plt.show()

Summary

Today was a review of how to build a model, touching on some important details such as the size of input_shape and the meaning of the value passed to Dense. These commonly used APIs will be discussed in detail in follow-up posts. Having reviewed so many ways of building a model, I am more fluent at tuning and understand better how models are implemented under the hood. Progress is still a little slower than I expected, but it should pick up.