An image is essentially a grid of pixel values, and this simple representation is what lets computer scientists and researchers build neural networks, loosely inspired by the human brain, that perform specific visual tasks. In some cases these networks even exceed human accuracy.


The image above is a good example of how an image is represented by pixel values. These small blocks of pixels are the basic input that a convolutional neural network operates on.


Convolutional neural networks are very similar to ordinary neural networks: both are made up of neurons with learnable weights and biases. Each neuron receives some inputs, computes a dot product (a scalar) with its weights, and optionally applies a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels at one end to class probabilities at the other. It still has a loss function (e.g. SVM or Softmax) on the last, fully connected layer, and the various training techniques developed for ordinary neural networks still apply.
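As a minimal illustration (the numbers below are made up), a single neuron's computation boils down to a dot product plus a bias, followed by an optional non-linearity:

import numpy as np

x = np.array([0.2, 0.5, 0.1])    # inputs to the neuron (made-up values)
w = np.array([0.4, -0.6, 0.9])   # learnable weights
b = 0.1                          # learnable bias

z = np.dot(w, x) + b             # dot product plus bias: a single scalar
output = max(0.0, z)             # optional non-linearity, here a ReLU
print(output)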


How convolution works. Each pixel is replaced by a weighted sum of surrounding pixels, and the neural network learns these weights.
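A minimal NumPy sketch of that weighted-sum idea (the 3×3 kernel below is hand-picked for illustration; in a ConvNet its values would be learned):

import numpy as np

def convolve2d(image, kernel):
    """Replace each pixel by a weighted sum of its neighbourhood (valid padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

image = np.random.rand(5, 5)              # a toy 5x5 grayscale image
kernel = np.array([[0., 1., 0.],
                   [1., -4., 1.],
                   [0., 1., 0.]])         # an example 3x3 weight matrix
print(convolve2d(image, kernel))          # a 3x3 feature map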


More recently, as data volumes and computing power have increased dramatically, ConvNets have performed well in areas such as face recognition, object recognition, traffic sign recognition, robotics and autonomous driving.



The following figure shows the four main operations in ConvNet:


1. Convolution

2. Nonlinearity (e.g. ReLU)

3. Pooling or sub-sampling

4. Classification


A picture of a car is passed through the ConvNet, and the fully connected layer outputs the "car" category.


The All Convolutional Network


Most modern convolutional neural networks (CNNs) for object recognition are built on the same principle: alternating convolution and max-pooling layers, followed by a small number of fully connected layers. The paper referenced below showed that max-pooling can simply be replaced by a convolution layer with a larger stride, without any loss of accuracy on image recognition benchmarks. Another interesting idea in the paper is replacing the fully connected layers with global average pooling.


If you want to learn more about the all-convolutional network, refer to the paper: https://arxiv.org/abs/1412.6806#
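As a rough sketch of the second substitution, here are two classification heads side by side, written in the same Keras 1 style API used later in this post (the input shape and layer sizes are arbitrary, and channels-first ordering is assumed, as in the model below):

from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense, Activation, GlobalAveragePooling2D

# Conventional head: flatten the feature maps and classify with a fully connected layer.
fc_head = Sequential()
fc_head.add(Convolution2D(10, 3, 3, border_mode='same', input_shape=(3, 8, 8)))
fc_head.add(Flatten())
fc_head.add(Dense(10))
fc_head.add(Activation('softmax'))

# All-convolutional head: one feature map per class, averaged over all spatial positions.
gap_head = Sequential()
gap_head.add(Convolution2D(10, 3, 3, border_mode='same', input_shape=(3, 8, 8)))
gap_head.add(GlobalAveragePooling2D())
gap_head.add(Activation('softmax'))

The global average pooling head adds no parameters of its own, which is part of its appeal.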


Getting rid of the fully connected layer is perhaps not a surprise, since the idea has been around for a while. Not long ago, Yann LeCun even wrote on Facebook that he never used fully connected layers in the first place.


This makes sense. The only difference between a fully connected layer and a convolution layer is that the neurons of the latter are connected only to a local region of the input, and many neurons in a convolution layer share parameters. Nevertheless, neurons in both kinds of layer still compute dot products, so their functional form is identical. It is therefore possible to convert between fully connected layers and convolution layers, and sometimes a fully connected layer can simply be replaced by a convolution layer.
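A tiny NumPy check of that equivalence for a single neuron (the sizes are made up): a fully connected neuron applied to a flattened patch computes exactly the same number as a convolution whose kernel covers the whole patch.

import numpy as np

patch = np.random.randn(4, 4)     # a 4x4 single-channel input
weights = np.random.randn(4, 4)   # one neuron's weights, reshaped to match the patch

fc_out = patch.ravel().dot(weights.ravel())   # fully connected: dot product with the flattened input
conv_out = np.sum(patch * weights)            # convolution with a 4x4 kernel and no padding

print(np.isclose(fc_out, conv_out))           # True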


As mentioned above, the next step is to remove spatial pooling from the network. This might cause some confusion, so let’s look at the concept in detail.


Spatial pooling, also known as subsampling or downsampling, reduces the dimensions of each feature map while retaining the most important information.



Take max pooling as an example: we define a spatial window and keep the maximum element of the feature map within it. Now recall the figure above (how convolution works). Intuitively, a convolution layer with a larger stride can serve as a subsampling layer, making the input representation smaller and more manageable. It also reduces the number of parameters and the amount of computation in the network, which helps control overfitting.
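A small sketch of that substitution, again in the Keras 1 style used below (channels-first input, arbitrary filter counts): both variants halve the spatial resolution, but the strided convolution learns how to do the downsampling.

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D

# Downsampling with max pooling
pooled = Sequential()
pooled.add(Convolution2D(96, 3, 3, border_mode='same', input_shape=(3, 32, 32)))
pooled.add(MaxPooling2D(pool_size=(2, 2)))

# Downsampling with a stride-2 convolution instead
strided = Sequential()
strided.add(Convolution2D(96, 3, 3, border_mode='same', input_shape=(3, 32, 32)))
strided.add(Convolution2D(96, 3, 3, border_mode='same', subsample=(2, 2)))

print(pooled.output_shape)   # (None, 96, 16, 16) with channels-first ordering
print(strided.output_shape)  # (None, 96, 16, 16)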


To reduce the size of the representation, using a larger stride in the convolution layer is often the better choice. Discarding pooling layers has also proven important for training generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs). It seems likely that future architectures will have very few or no pooling layers at all.


With all of the above points in mind, we have released a Keras implementation of the all-convolutional network on GitHub: https://github.com/MateLabs/All-Conv-Keras


Import libraries and dependencies



from __future__ import print_function
import tensorflow as tf
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Activation, Convolution2D, GlobalAveragePooling2D
from keras.layers import merge          # used below to concatenate the per-GPU outputs
from keras.utils import np_utils
from keras.optimizers import SGD
from keras import backend as K
from keras.models import Model
from keras.layers.core import Lambda
from keras.callbacks import ModelCheckpoint
import pandas



Train on multiple GPUs


For a multi-GPU implementation of the model, we have a custom function that distributes the training data across the available GPUs.


The computation is done on the GPUs, and the outputs are merged on the CPU to complete the model.


def make_parallel(model, gpu_count):
    def get_slice(data, idx, parts):
        # Cut out this GPU's share of the batch (old TF concat signature: axis comes first).
        shape = tf.shape(data)
        size = tf.concat(0, [shape[:1] // parts, shape[1:]])
        stride = tf.concat(0, [shape[:1] // parts, shape[1:] * 0])
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = []
    for i in range(len(model.outputs)):
        outputs_all.append([])

    # Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:
                inputs = []
                # Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape,
                                     arguments={'idx': i, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # Merge the per-GPU outputs on the CPU
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(merge(outputs, mode='concat', concat_axis=0))

        return Model(input=model.inputs, output=merged)


Set the batch size, the number of classes, and the number of epochs


Since we are using the CIFAR-10 dataset, which has 10 classes (types of objects), the number of classes is 10 and the batch size is 32. The number of epochs depends on your available time and your machine's computing power; here we train for 1000 epochs.


The image size is 32 × 32, with 3 color channels (RGB).


batch_size = 32
nb_classes = 10
nb_epoch = 1000
rows, cols = 32, 32
channels = 3


The dataset comes split into a training set and a test set; the test set is also used for validation during training.



(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_train.shape[1:])

# Convert class vectors to one-hot class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)


Build the model



model = Sequential()

model.add(Convolution2D(96, 3, 3, border_mode='same', input_shape=(3, 32, 32)))
model.add(Activation('relu'))
model.add(Convolution2D(96, 3, 3, border_mode='same'))
model.add(Activation('relu'))

# The next layer is the substitute for max pooling: a strided convolution layer
# that reduces the spatial dimensions of the image.
model.add(Convolution2D(96, 3, 3, border_mode='same', subsample=(2, 2)))
model.add(Dropout(0.5))

model.add(Convolution2D(192, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(192, 3, 3, border_mode='same'))
model.add(Activation('relu'))

# Again, a strided convolution layer replaces max pooling.
model.add(Convolution2D(192, 3, 3, border_mode='same', subsample=(2, 2)))
model.add(Dropout(0.5))

model.add(Convolution2D(192, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(192, 1, 1, border_mode='valid'))
model.add(Activation('relu'))
model.add(Convolution2D(10, 1, 1, border_mode='valid'))

# Global average pooling replaces the fully connected layers.
model.add(GlobalAveragePooling2D())
model.add(Activation('softmax'))

model = make_parallel(model, 4)

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])


Print a summary of the model. This gives you an overview of the model and is very helpful for checking its dimensions and the number of parameters.



print(model.summary())


Data augmentation



datagen = ImageDataGenerator(
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by the std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=0,                     # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # randomly flip images horizontally
    vertical_flip=False)                  # randomly flip images vertically

datagen.fit(X_train)


Save the model's best weights and add checkpoints



        filepath="weights.{epoch:02d}-{val_loss:.2f}.hdf5"Copy the code
        checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=False, mode='max')Copy the code
        callbacks_list = [checkpoint]Copy the code
        # Fit the model on the batches generated by datagen.flow().Copy the code
        history_callback = model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size), samples_per_epoch=X_train.shape[0], nb_epoch=nb_epoch, validation_data=(X_test, Y_test), callbacks=callbacks_list, verbose=0)Copy the code


Finally, get a log of the training session and save your model



pandas.DataFrame(history_callback.history).to_csv("history.csv")
model.save('keras_allconv.h5')


The above model easily reaches over 90% accuracy within the first 350 epochs. If you want to push the accuracy higher, you can trade extra computation time for more aggressive data augmentation.
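For example, one could widen the augmentation ranges along these lines (the values and the stronger_datagen name below are only a suggestion for illustration, not settings taken from the repository):

from keras.preprocessing.image import ImageDataGenerator

# A more aggressive augmentation setup (hypothetical values; tune to your time budget)
stronger_datagen = ImageDataGenerator(
    rotation_range=15,        # small random rotations
    width_shift_range=0.15,   # wider horizontal shifts
    height_shift_range=0.15,  # wider vertical shifts
    horizontal_flip=True)     # CIFAR-10 objects usually still look valid when mirrored

# Then pass stronger_datagen.flow(X_train, Y_train, batch_size=batch_size)
# to fit_generator exactly as above.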