An introduction to the object detection algorithm R-CNN

Author: Gao Yuzhuo

Introduction to Object Detection

The task of object detection is to find all objects of interest in an image and determine their categories and positions. Four major image-recognition tasks in computer vision are related: 1. Classification: answers "what is it?" Given an image or a video, determine what category of object it contains. 2. Localization: answers "where is it?", i.e. locates the position of the target. 3. Detection: answers "what is it, and where is it?", i.e. locates the target and identifies its category. 4. Segmentation: divided into instance-level and scene-level segmentation, it answers "which object or scene does each pixel belong to?"

Classification of current object detection algorithms

1. Two-stage detection algorithms first perform region proposal (RP), generating pre-selection boxes that may contain the objects to be detected, and then classify the samples with a convolutional neural network. Pipeline: feature extraction -> generate RPs -> classification/location regression. Common two-stage detectors include R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN and R-FCN.

2. One-stage detection algorithms do not use RPs; they extract features directly in the network to predict object class and location. Pipeline: feature extraction -> classification/location regression. Common one-stage detectors include OverFeat, YOLOv1, YOLOv2, YOLOv3, SSD and RetinaNet.

This article introduces the classic R-CNN algorithm and gives a corresponding code implementation.

R-CNN

R-CNN (Regions with CNN Features) is a milestone in applying CNNs to object detection. Leveraging the good feature extraction and classification performance of CNNs, it solves the detection problem through the region proposal method. The algorithm has four steps:

  1. Generate candidate regions (region proposals / RoIs) from the original image
  2. Feed each candidate region into a CNN for feature extraction
  3. Feed the extracted features into the SVM detector of each category to decide whether the region belongs to that category
  4. Obtain the precise target region through bounding-box regression

The forward flow chart of the algorithm is shown below (the numeric marks in the figure correspond to the four steps above). We will build the model following the same four steps, and then move on to model training. But before that, let's take a quick look at the dataset used for training.

Introduction to the dataset

The datasets used in the original paper are: 1. ImageNet ILSVRC (a large classification dataset): about 1.2 million images, 1,000 classes. 2. PASCAL VOC 2007 (a smaller detection dataset): about 10,000 images, 20 classes. The classification dataset is used for pre-training, and the detection dataset is then used to fine-tune the parameters and to evaluate the model.

Since the original datasets are large, training the model could take dozens of hours. To simplify training, we replace the training data. Similar to the original paper, the data we use has two parts: 1. flower pictures in 17 categories; 2. flower pictures in 2 categories.

Subsequently, we will use the 17-category data to pre-train the model, fine-tune it on the 2-category data to obtain the final detection model, and evaluate it on the 2-category images.
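For reference, the training-list files read by the code below are plain text, one sample per line. Based on how the loaders parse them, the assumed layout is roughly as follows; the paths follow this project's folder structure, and the box coordinates in the second example are illustrative placeholders only:

# train_list.txt (pre-training): <image path> <class index>
./17flowers/jpg/7/image_0591.jpg 7

# fine_tune_list.txt (fine-tuning / SVM / bbox data): <image path> <class index> <Gx,Gy,Gw,Gh of ground-truth box>
./2flowers/jpg/1/image_1282.jpg 1 90,126,350,434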

Model building

Step 1

In this step we need to complete the region proposal part of the R-CNN pipeline, using the selective search algorithm. The algorithm first initializes the original regions with a graph-based image segmentation method, i.e. the image is split into many small pieces. Then, using a greedy strategy, it computes the similarity of every pair of adjacent regions and merges the two most similar regions each time, until only one region covering the whole image is left. All image blocks generated in the process, including the merged ones, are saved as the final regions of interest (RoIs). Region merging adopts a variety of strategies; using a single strategy makes it easy to merge dissimilar regions by mistake. For example, if only texture is considered, regions of different colors are easily merged incorrectly. Selective Search therefore uses three diversification strategies to increase the number of candidate regions and ensure recall:

  • Multiple color spaces: RGB, grayscale, HSV and their variants are all considered
  • Multiple similarity measures: color similarity, texture, size, overlap and so on are taken into account
  • Multiple thresholds for initializing the original regions: the larger the threshold, the fewer regions the image is divided into

Selective Search is built into many machine learning frameworks.
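The code in this article calls a selective_search function directly; one option whose call signature matches the usage below is the standalone selectivesearch Python package, where each returned region is a dict with a 'rect' entry of the form (x, y, w, h) and a 'size' entry. A minimal usage sketch, assuming that package is installed:

import skimage.io
from selectivesearch import selective_search

# Run selective search on one of the flower images used later in this article
img = skimage.io.imread('./17flowers/jpg/7/image_0591.jpg')
img_lbl, regions = selective_search(img, scale=500, sigma=0.9, min_size=10)
print(len(regions), regions[0]['rect'], regions[0]['size'])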

Step 2

In this step we need to complete the next part of the pipeline. In Step 1 the selective search algorithm generated the region proposals, but their sizes are generally inconsistent. Since the region proposals will be fed into a ConvNet for feature extraction, all of them must first be resized to the standard input size of the ConvNet architecture. The relevant code is as follows:

import cv2
import numpy as np
import matplotlib.patches as mpatches

# Clip Image
def clip_pic(img, rect):
    x = rect[0]
    y = rect[1]
    w = rect[2]
    h = rect[3]
    x_1 = x + w
    y_1 = y + h
    # return img[x:x_1, y:y_1, :], [x, y, x_1, y_1, w, h]
    return img[y:y_1, x:x_1, :], [x, y, x_1, y_1, w, h]

# Resize Image
def resize_image(in_image, new_width, new_height, out_image=None, resize_mode=cv2.INTER_CUBIC):
    img = cv2.resize(in_image, (new_width, new_height), interpolation=resize_mode)
    if out_image:
        cv2.imwrite(out_image, img)
    return img

def image_proposal(img_path):
    img = cv2.imread(img_path)
    # selective_search as provided by e.g. the selectivesearch package shown above
    img_lbl, regions = selective_search(
                       img, scale=500, sigma=0.9, min_size=10)
    candidates = set()
    images = []
    vertices = []
    for r in regions:
        # excluding same rectangle (with different segments)
        if r['rect'] in candidates:
            continue
        # excluding small regions
        if r['size'] < 220:
            continue
        if (r['rect'][2] * r['rect'][3]) < 500:
            continue
        # crop the proposal region out of the image
        proposal_img, proposal_vertice = clip_pic(img, r['rect'])
        # Delete empty arrays
        if len(proposal_img) == 0:
            continue
        # Ignore regions with zero width or height
        x, y, w, h = r['rect']
        if w == 0 or h == 0:
            continue
        # Check if any 0-dimension exists
        [a, b, c] = np.shape(proposal_img)
        if a == 0 or b == 0 or c == 0:
            continue
        # resize to 224 * 224 for input
        resized_proposal_img = resize_image(proposal_img, 224, 224)
        candidates.add(r['rect'])
        img_float = np.asarray(resized_proposal_img, dtype="float32")
        images.append(img_float)
        vertices.append(r['rect'])
    return images, vertices

Let's select an image to examine the effect of the selective search algorithm:

import matplotlib.pyplot as plt
import skimage.io

img_path = './17flowers/jpg/7/image_0591.jpg'
imgs, verts = image_proposal(img_path)
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(6, 6))
img = skimage.io.imread(img_path)
ax.imshow(img)
for x, y, w, h in verts:
    rect = mpatches.Rectangle((x, y), w, h, fill=False, edgecolor='red', linewidth=1)
    ax.add_patch(rect)
plt.show()

After obtaining proposals of the same size, we can feed them into the ConvNet for feature extraction. The ConvNet architecture used here is AlexNet, whose network structure is as follows:

import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression

# Building 'AlexNet'
def create_alexnet(num_classes, restore=True):
    # Building 'AlexNet'
    network = input_data(shape=[None, 224, 224, 3])
    network = conv_2d(network, 96, 11, strides=4, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = conv_2d(network, 256, 5, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 256, 3, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = fully_connected(network, 4096, activation='tanh')
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation='tanh')
    network = dropout(network, 0.5)
    network = fully_connected(network, num_classes, activation='softmax', restore=restore)
    network = regression(network, optimizer='momentum',
                         loss='categorical_crossentropy',
                         learning_rate=0.001)
    return network

At this point, we have completed the ConvNet part of the architecture; with it we can extract a feature map from each proposal.

Steps 3 and 4

In this step we need to complete the remaining parts of the pipeline. The feature map extracted from each proposal is fed into the SVMs for classification. (Note that the number of SVM classifiers is not fixed: we need to train one SVM per category; for our dataset there are two flower categories to classify, so the number of SVMs is 2.) Proposals judged to be positive (non-background) are then passed to the bounding-box regressor (Bbox reg), which fine-tunes the box and outputs the final bounding-box prediction. Now that we know the whole flow of the algorithm, let's start training the model.

Model training

The training of the R-CNN model is divided into two steps:

  1. Initialize the ConvNet and pre-train it on the large dataset; then fine-tune the pre-trained model on the small dataset to obtain the final ConvNet.
  2. Feed images into the model, extract the feature map of each proposal with the ConvNet obtained in step 1, and use these feature maps to train the SVM classifiers and the bounding-box regressor (Bbox reg). (The ConvNet does not participate in learning during this process, i.e. its parameters remain unchanged.)

First, pre-training is performed on the large dataset. During training, the input X is the original image and the label Y is its class. The relevant code is as follows:

import codecs
import os
import pickle

def load_data(datafile, num_class, save=False, save_path='dataset.pkl'):
    fr = codecs.open(datafile, 'r', 'utf-8')
    train_list = fr.readlines()
    labels = []
    images = []
    for line in train_list:
        tmp = line.strip().split(' ')
        fpath = tmp[0]
        img = cv2.imread(fpath)
        img = resize_image(img, 224, 224)
        np_img = np.asarray(img, dtype="float32")
        images.append(np_img)

        index = int(tmp[1])
        label = np.zeros(num_class)
        label[index] = 1
        labels.append(label)
    if save:
        pickle.dump((images, labels), open(save_path, 'wb'))
    fr.close()
    return images, labels

def train(network, X, Y, save_model_path):
    # Training
    model = tflearn.DNN(network, checkpoint_path='model_alexnet',
                        max_checkpoints=1, tensorboard_verbose=2, tensorboard_dir='output')
    if os.path.isfile(save_model_path + '.index'):
        model.load(save_model_path)
        print('load model... ')
    for _ in range(5):
        model.fit(X, Y, n_epoch=1, validation_set=0.1, shuffle=True,
                  show_metric=True, batch_size=64, snapshot_step=200,
                  snapshot_epoch=False, run_id='alexnet_oxflowers17') # epoch = 1000
        # Save the model
        model.save(save_model_path)
        print('save model... ')

X, Y = load_data('./train_list.txt', 17)
net = create_alexnet(17)
train(net, X, Y, './pre_train_model/model_save.model')

After that, the pre-trained model is fine-tuned on the small dataset. This part differs from the previous one in two ways: 1. the input is the RoIs generated by region proposal rather than the original images; 2. the label Y of each RoI is determined by computing the IoU (Intersection over Union) between the RoI and the ground truth. IoU is calculated as follows:
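In formula form, IoU(A, B) = area(A ∩ B) / area(A ∪ B): the area where the two boxes overlap divided by the total area they cover together.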

It can be seen that IoU ∈ [0, 1], and the larger the value, the smaller the gap between the RoI and the ground truth. Candidate regions with an IoU greater than 0.5 are defined as positive samples and the rest as negative samples. The code for calculating IoU is as follows:

# IOU Part 1
def if_intersection(xmin_a, xmax_a, ymin_a, ymax_a, xmin_b, xmax_b, ymin_b, ymax_b):
    if_intersect = False
    if xmin_a < xmax_b <= xmax_a and (ymin_a < ymax_b <= ymax_a or ymin_a <= ymin_b < ymax_a):
        if_intersect = True
    elif xmin_a <= xmin_b < xmax_a and (ymin_a < ymax_b <= ymax_a or ymin_a <= ymin_b < ymax_a):
        if_intersect = True
    elif xmin_b < xmax_a <= xmax_b and (ymin_b < ymax_a <= ymax_b or ymin_b <= ymin_a < ymax_b):
        if_intersect = True
    elif xmin_b <= xmin_a < xmax_b and (ymin_b < ymax_a <= ymax_b or ymin_b <= ymin_a < ymax_b):
        if_intersect = True
    else:
        return if_intersect
    if if_intersect:
        x_sorted_list = sorted([xmin_a, xmax_a, xmin_b, xmax_b])
        y_sorted_list = sorted([ymin_a, ymax_a, ymin_b, ymax_b])
        x_intersect_w = x_sorted_list[2] - x_sorted_list[1]
        y_intersect_h = y_sorted_list[2] - y_sorted_list[1]
        area_inter = x_intersect_w * y_intersect_h
        return area_inter


# IOU Part 2
def IOU(ver1, vertice2):
    # vertices in four points
    vertice1 = [ver1[0], ver1[1], ver1[0]+ver1[2], ver1[1]+ver1[3]]
    area_inter = if_intersection(vertice1[0], vertice1[2], vertice1[1], vertice1[3], vertice2[0], vertice2[2], vertice2[1], vertice2[3])
    if area_inter:
        area_1 = ver1[2] * ver1[3]
        area_2 = vertice2[4] * vertice2[5]
        iou = float(area_inter) / (area_1 + area_2 - area_inter)
        return iou
    return False
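Note that the two arguments of IOU use different box formats: the first is an (x, y, w, h) rectangle as returned by selective search, while the second is the six-element list [x_min, y_min, x_max, y_max, w, h] produced by clip_pic. A quick sanity check with made-up boxes:

# Two hypothetical 100x100 boxes offset by 50 pixels, purely for illustration
box_a = [0, 0, 100, 100]                 # (x, y, w, h)
box_b = [50, 50, 150, 150, 100, 100]     # [x_min, y_min, x_max, y_max, w, h]
print(IOU(box_a, box_b))                 # 2500 / (10000 + 10000 - 2500) ≈ 0.143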

Before fine-tuning on the small dataset, let us first finish reading the relevant training data (the labels of the RoI set, the corresponding images, the box annotations, etc.). In the code below, we also read and save the data needed later for SVM training and bounding-box regression.

# Read in data and save data for Alexnet
def load_train_proposals(datafile, num_clss, save_path, threshold=0.5, is_svm=False, save=False):
    fr = open(datafile, 'r')
    train_list = fr.readlines()
    # random.shuffle(train_list)
    for num, line in enumerate(train_list):
        labels = []
        images = []
        rects = []
        tmp = line.strip().split(' ')
        # tmp0 = image address
        # tmp1 = label
        # tmp2 = rectangle vertices
        img = cv2.imread(tmp[0])
        # Select search to get candidate box
        img_lbl, regions = selective_search(
                               img, scale=500, sigma=0.9, min_size=10)
        candidates = set()
        ref_rect = tmp[2].split(',')
        ref_rect_int = [int(i) for i in ref_rect]
        Gx = ref_rect_int[0]
        Gy = ref_rect_int[1]
        Gw = ref_rect_int[2]
        Gh = ref_rect_int[3]
        for r in regions:
            # excluding same rectangle (with different segments)
            if r['rect'] in candidates:
                continue
            # excluding small regions
            if r['size'] < 220:
                continue
            if (r['rect'][2] * r['rect'][3]) < 500:
                continue
            # Intercept target area
            proposal_img, proposal_vertice = clip_pic(img, r['rect'])
            # Delete Empty array
            if len(proposal_img) == 0:
                continue
            # Ignore things contain 0 or not C contiguous array
            x, y, w, h = r['rect']
            if w == 0 or h == 0:
                continue
            # Check if any 0-dimension exist
            [a, b, c] = np.shape(proposal_img)
            if a == 0 or b == 0 or c == 0:
                continue
            resized_proposal_img = resize_image(proposal_img, 224, 224)
            candidates.add(r['rect'])
            img_float = np.asarray(resized_proposal_img, dtype="float32")
            images.append(img_float)
            # IOU
            iou_val = IOU(ref_rect_int, proposal_vertice)
            # x,y,w,h used for boundingbox regression
            rects.append([(Gx-x)/w, (Gy-y)/h, math.log(Gw/w), math.log(Gh/h)])
            # propasal_rect = [proposal_vertice[0], proposal_vertice[1], proposal_vertice[4], proposal_vertice[5]]
            # print(iou_val)
            # labels, let 0 represent default class, which is background
            index = int(tmp[1])
            if is_svm:
                # IoU less than the threshold means background, label 0
                if iou_val < threshold:
                    labels.append(0)
                else:
                    labels.append(index)
            else:
                label = np.zeros(num_clss + 1)
                if iou_val < threshold:
                    label[0] = 1
                else:
                    label[index] = 1
                labels.append(label)


        if is_svm:
            ref_img, ref_vertice = clip_pic(img, ref_rect_int)
            resized_ref_img = resize_image(ref_img, 224, 224)
            img_float = np.asarray(resized_ref_img, dtype="float32")
            images.append(img_float)
            rects.append([0, 0, 0, 0])
            labels.append(index)
        view_bar("processing image of %s" % datafile.split('\\')[-1].strip(), num + 1, len(train_list))

        if save:
            if is_svm:
                # strip() removes surrounding whitespace from the file name
                np.save((os.path.join(save_path, tmp[0].split('/')[-1].split('.')[0].strip()) + '_data.npy'), [images, labels, rects])
            else:
                # strip() removes surrounding whitespace from the file name
                np.save((os.path.join(save_path, tmp[0].split('/')[-1].split('.')[0].strip()) + '_data.npy'),
                        [images, labels])
    print(' ')
    fr.close()
    
# load data
def load_from_npy(data_set):
    images, labels = [], []
    data_list = os.listdir(data_set)
    # random.shuffle(data_list)
    for ind, d in enumerate(data_list):
        i, l = np.load(os.path.join(data_set, d),allow_pickle=True)
        images.extend(i)
        labels.extend(l)
        view_bar("load data of %s" % d, ind + 1.len(data_list))
    print(' ')
    return images, labels

import math
import sys
#Progress bar 
def view_bar(message, num, total):
    rate = num / total
    rate_num = int(rate * 40)
    rate_nums = math.ceil(rate * 100)
    r = '\r%s:[%s%s]%d%%\t%d/%d' % (message, ">" * rate_num, " " * (40 - rate_num), rate_nums, num, total,)
    sys.stdout.write(r)
    sys.stdout.flush()

With the above preparations complete, we can start the fine-tuning stage of training. The relevant code is as follows:

def fine_tune_Alexnet(network, X, Y, save_model_path, fine_tune_model_path):
    # Training
    model = tflearn.DNN(network, checkpoint_path='rcnn_model_alexnet',
                        max_checkpoints=1, tensorboard_verbose=2, tensorboard_dir='output_RCNN')
    if os.path.isfile(fine_tune_model_path + '.index'):
        print("Loading the fine tuned model")
        model.load(fine_tune_model_path)
    elif os.path.isfile(save_model_path + '.index'):
        print("Loading the alexnet")
        model.load(save_model_path)
    else:
        print("No file to load, error")
        return False

    model.fit(X, Y, n_epoch=1, validation_set=0.1, shuffle=True,
              show_metric=True, batch_size=64, snapshot_step=200,
              snapshot_epoch=False, run_id='alexnet_rcnnflowers2')
    # Save the model
    model.save(fine_tune_model_path)
        
data_set = './data_set'
if len(os.listdir('./data_set')) == 0:
    print("Reading Data")
    load_train_proposals('./fine_tune_list.txt', 2, save=True, save_path=data_set)
print("Loading Data")
X, Y = load_from_npy(data_set)
restore = False
if os.path.isfile('./fine_tune_model/fine_tune_model_save.model' + '.index'):
    restore = True
    print("Continue fine-tune")
# three classes include background
net = create_alexnet(3, restore=restore)
fine_tune_Alexnet(net, X, Y, './pre_train_model/model_save.model', './fine_tune_model/fine_tune_model_save.model')

Step 2

In this step we train the SVMs and the Bbox reg. First, we extract feature maps with the CNN model fine-tuned in step 1. Note that the ConvNet used here differs from the one used in training: its last softmax layer is removed, because what we need now are the features extracted from each RoI, and the softmax layer was only needed for classification during training. The relevant code is as follows:

def create_alexnet():
    # Building 'AlexNet'
    network = input_data(shape=[None, 224, 224, 3])
    network = conv_2d(network, 96, 11, strides=4, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = conv_2d(network, 256, 5, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 256, 3, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = fully_connected(network, 4096, activation='tanh')
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation='tanh')
    network = regression(network, optimizer='momentum',
                         loss='categorical_crossentropy',
                         learning_rate=0.001)
    return network

We need to train one SVM per classification category. There are two categories of flowers to classify, so we need to train 2 SVMs. The input for SVM training is the feature map extracted from each RoI, and the labels cover N+1 categories (the extra one being the background); for our dataset that means three label categories. The relevant code is as follows:

from sklearn import svm
import joblib  # in older scikit-learn versions: from sklearn.externals import joblib

# Construct cascade svms
def train_svms(train_file_folder, model):
    files = os.listdir(train_file_folder)
    svms = []
    train_features = []
    bbox_train_features = []
    rects = []
    for train_file in files:
        if train_file.split('.')[-1] == 'txt':
            X, Y, R = generate_single_svm_train(os.path.join(train_file_folder, train_file))
            Y1 = []
            features1 = []
            features_hard = []
            for ind, i in enumerate(X):
                # Extract features with the ConvNet
                feats = model.predict([i])
                train_features.append(feats[0])
                # Collect the features and labels of all positive and negative samples
                if Y[ind] >= 0:
                    Y1.append(Y[ind])
                    features1.append(feats[0])
                    # Proposals labeled as the object class also go into the bounding-box training set
                    if Y[ind] > 0:
                        bbox_train_features.append(feats[0])
                        rects.append(R[ind])  # keep the matching regression target
                view_bar("extract features of %s" % train_file, ind + 1, len(X))

            clf = svm.SVC(probability=True)

            clf.fit(features1, Y1)
            print(' ')
            print("feature dimension")
            print(np.shape(features1))
            svms.append(clf)
            # Serialize clf and save the SVM classifier
            joblib.dump(clf, os.path.join(train_file_folder, str(train_file.split('.')[0]) + '_svm.pkl'))

    # save the boundingBox regression training set
    np.save((os.path.join(train_file_folder, 'bbox_train.npy')),
            [bbox_train_features, rects])
    return svms

# Load training images
def generate_single_svm_train(train_file):
    save_path = train_file.rsplit('.', 1)[0].strip()
    if len(os.listdir(save_path)) == 0:
        print("reading %s's svm dataset" % train_file.split('\\')[-1])
        load_train_proposals(train_file, 2, save_path, threshold=0.3, is_svm=True, save=True)
    print("restoring svm dataset")
    images, labels,rects = load_from_npy_(save_path)

    return images, labels,rects

# load data
def load_from_npy_(data_set):
    images, labels ,rects= [], [], []
    data_list = os.listdir(data_set)
    # random.shuffle(data_list)
    for ind, d in enumerate(data_list):
        i, l, r = np.load(os.path.join(data_set, d),allow_pickle=True)
        images.extend(i)
        labels.extend(l)
        rects.extend(r)
        view_bar("load data of %s" % d, ind + 1.len(data_list))
    print(' ')
    return images, labels ,rects

The regressor is linear, and its input is N training pairs {(P^i, G^i)}, i = 1, 2, ..., N, where P^i and G^i are the box coordinates of the candidate region and of the ground-truth box respectively. The relevant code is as follows:

from sklearn.linear_model import Ridge

# Display BoundingBox on the image
def show_rect(img_path, regions):
    fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(6, 6))
    img = skimage.io.imread(img_path)
    ax.imshow(img)
    for x, y, w, h in regions:
        rect = mpatches.Rectangle(
            (x, y), w, h, fill=False, edgecolor='red', linewidth=1)
        ax.add_patch(rect)
    plt.show()
    

# Train boundingbox regression
def train_bbox(npy_path):
    features, rects = np.load((os.path.join(npy_path, 'bbox_train.npy')), allow_pickle=True)
    # features and rects were built with append, so copy them element by element into lists
    # before converting to numpy arrays, instead of calling np.array() on them directly
    X = []
    Y = []
    for ind, i in enumerate(features):
        X.append(i)
    X_train = np.array(X)

    for ind, i in enumerate(rects):
        Y.append(i)
    Y_train = np.array(Y)

    # Linear regression model training
    clf = Ridge(alpha=1.0)
    clf.fit(X_train, Y_train)
    # Serialize and save the bbox regressor
    joblib.dump(clf, os.path.join(npy_path, 'bbox_train.pkl'))
    return clf
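As a reminder of what this regressor is fitting: the targets stored in rects by load_train_proposals, and inverted during prediction further below, follow the bounding-box parameterization of the R-CNN paper, except that the code here uses the top-left corner (x, y) of each box rather than its center. A small self-contained sketch of the two transforms (the helper names here are just for illustration; the article's code inlines these expressions):

import math

# Proposal box p = (px, py, pw, ph) and ground-truth box g = (gx, gy, gw, gh),
# both given as (top-left x, top-left y, width, height) as in the code above.
def bbox_targets(p, g):
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return [(gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph)]

# Inverse transform applied to the regressor's output at prediction time
def apply_bbox_deltas(p, t):
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return [pw * tx + px, ph * ty + py, pw * math.exp(tw), ph * math.exp(th)]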

Now we start training the SVM classifiers and the bounding-box regressor.

train_file_folder = './svm_train'
# Build model, network
net = create_alexnet()
model = tflearn.DNN(net)
# Load fine tuning alexnet network parameters
model.load('./fine_tune_model/fine_tune_model_save.model')
# Load/train SVM classifier and boundingbox regressor
svms = []
bbox_fit = []
# whether the bounding-box regressor has already been saved
bbox_fit_exit = 0
# Load the SVM classifiers and bounding-box regressor if they exist
for file in os.listdir(train_file_folder):
    if file.split('_')[-1] == 'svm.pkl':
        svms.append(joblib.load(os.path.join(train_file_folder, file)))
    if file == 'bbox_train.pkl':
        bbox_fit = joblib.load(os.path.join(train_file_folder, file))
        bbox_fit_exit = 1
if len(svms) == 0:
    svms = train_svms(train_file_folder, model)
if bbox_fit_exit == 0:
    bbox_fit = train_bbox(train_file_folder)

print("Done fitting svms")

At this point the model has been trained.

Viewing the model results

Let's select an image and check how the model behaves, following the order of the model's forward pass. First, look at the RoIs generated by region proposal.

img_path = './2flowers/jpg/1/image_1282.jpg'  
image = cv2.imread(img_path)
im_width = image.shape[1]
im_height = image.shape[0]
# Extract the region proposals
imgs, verts = image_proposal(img_path)
show_rect(img_path, verts)

The RoIs are fed into the ConvNet to obtain features, which are then fed into the SVMs and the regressor; the samples that the SVMs classify as positive are kept for bounding-box regression.

# Extract RoI features from CNN
features = model.predict(imgs)
print("predict image:")
# print(np.shape(features))
results = []
results_label = []
results_score = []
count = 0
print(len(features))
for f in features:
    for svm in svms:
        pred = svm.predict([f.tolist()])
        # not background
        if pred[0] != 0:
            # boundingbox regression
            bbox = bbox_fit.predict([f.tolist()])
            tx, ty, tw, th = bbox[0][0], bbox[0][1], bbox[0][2], bbox[0][3]
            px, py, pw, ph = verts[count]
            gx = tx * pw + px
            gy = ty * ph + py
            gw = math.exp(tw) * pw
            gh = math.exp(th) * ph
            if gx < 0:
                gw = gw - (0 - gx)
                gx = 0
            if gx + gw > im_width:
                gw = im_width - gx
            if gy < 0:
                gh = gh - (0 - gy)
                gy = 0
            if gy + gh > im_height:
                gh = im_height - gy
            results.append([gx, gy, gw, gh])
            results_label.append(pred[0])
            results_score.append(svm.predict_proba([f.tolist()])[0][1])
    count += 1
print(results)
print(results_label)
print(results_score)
show_rect(img_path, results)

As you can see, the output may contain more than one bounding box. In this case we use NMS (non-maximum suppression) to keep the relatively best results. The code is as follows:

results_final = []
results_final_label = []

# Non-maximal suppression
# Remove candidate boxes with scores less than 0.5
delete_index1 = []
for ind in range(len(results_score)):
    if results_score[ind] < 0.5:
        delete_index1.append(ind)
num1 = 0
for idx in delete_index1:
    results.pop(idx - num1)
    results_score.pop(idx - num1)
    results_label.pop(idx - num1)
    num1 += 1

while len(results) > 0:
    # Find the highest score in the list
    max_index = results_score.index(max(results_score))
    max_x, max_y, max_w, max_h = results[max_index]
    max_vertice = [max_x, max_y, max_x + max_w, max_y + max_h, max_w, max_h]
    # This candidate box adds the final result
    results_final.append(results[max_index])
    results_final_label.append(results_label[max_index])
    # Remove the candidate box from Results
    results.pop(max_index)
    results_label.pop(max_index)
    results_score.pop(max_index)
    # print(len(results_score))
    # Delete other candidate boxes whose IoU with the selected box is greater than 0.5
    delete_index = []
    for ind, i in enumerate(results):
        iou_val = IOU(i, max_vertice)
        if iou_val > 0.5:
            delete_index.append(ind)
    num = 0
    for idx in delete_index:
        # print('\n')
        # print(idx)
        # print(len(results))
        results.pop(idx - num)
        results_score.pop(idx - num)
        results_label.pop(idx - num)
        num += 1

print("result:",results_final)
print("result label:",results_final_label)
show_rect(img_path, results_final)

Conclusion

So far we have built a rough R-CNN model. R-CNN made flexible use of the more advanced tools and techniques available at the time, absorbed them fully and reworked them within its own framework, and ultimately achieved a major breakthrough. But it also has a number of obvious downsides:

  1. Training is cumbersome: fine-tuning the network, training the SVMs and training the bounding-box regressor are separate stages that involve many disk read/write operations, which is inefficient.
  2. Every RoI has to pass through the CNN for feature extraction, which introduces a large amount of redundant computation (imagine two RoIs with an overlapping part: the overlap is convolved twice, although in theory once would be enough).
  3. Overall inference is slow; for example, feature extraction for each region is independent, and the selective search used for region proposal is itself time-consuming.

Fortunately, these problems were greatly alleviated in the subsequent Fast R-CNN and Faster R-CNN.

The project address

momodel.cn/workspace/5…