Introduction

In yesterday’s article, we showed you how to train an image classifier on your own images in PyTorch and then use it for image recognition. This article shows you how to use a pre-trained model to detect multiple objects in an image, and then track them across the frames of a video.

Object detection in images

There are many algorithms for object detection, among which YOLO and SSD are the most popular. In this article we will use YOLOv3. We won’t discuss YOLO in detail here, but if you want to learn more about it, you can check out the link below: pjreddie.com/darknet/yol…

Let’s get started, again by importing the required modules:

    from models import *
    from utils import *
    import os, sys, time, datetime, random
    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    from torch.autograd import Variable
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    import numpy as np  # needed below for np.array, np.linspace and np.where
    from PIL import Image

Next we load the pre-trained network configuration and weights, along with a few predefined values: the input image size, the confidence threshold, and the non-maximum suppression (NMS) threshold.

    config_path = 'config/yolov3.cfg'
    weights_path = 'config/yolov3.weights'
    class_path = 'config/coco.names'
    img_size = 416
    conf_thres = 0.8
    nms_thres = 0.4

    # Load model and weights
    model = Darknet(config_path, img_size=img_size)
    model.load_weights(weights_path)
    model.cuda()
    model.eval()
    classes = utils.load_classes(class_path)
    Tensor = torch.cuda.FloatTensor
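Here, conf_thres discards boxes the model is not confident about, while nms_thres controls how aggressively overlapping boxes for the same object are suppressed. The repo’s utils.non_max_suppression handles this for us across all 80 COCO classes; purely to illustrate the idea, below is a minimal single-class sketch (the helper names iou and simple_nms are ours, not part of the repo):

    import numpy as np

    def iou(box, boxes):
        # intersection-over-union between one box and an array of boxes,
        # all in (x1, y1, x2, y2) format
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (box[2] - box[0]) * (box[3] - box[1])
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area_a + area_b - inter)

    def simple_nms(boxes, scores, conf_thres=0.8, nms_thres=0.4):
        # drop low-confidence boxes, then greedily keep the highest-scoring
        # box and discard any remaining box that overlaps it too much
        mask = scores >= conf_thres
        boxes, scores = boxes[mask], scores[mask]
        order = scores.argsort()[::-1]
        keep = []
        while len(order) > 0:
            best, rest = order[0], order[1:]
            keep.append(best)
            order = rest[iou(boxes[best], boxes[rest]) < nms_thres]
        return boxes[keep]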

The following function returns the detections for a specified image.

    def detect_image(img):
        # scale and pad image to an img_size x img_size square,
        # preserving the aspect ratio
        ratio = min(img_size/img.size[0], img_size/img.size[1])
        imw = round(img.size[0] * ratio)
        imh = round(img.size[1] * ratio)
        img_transforms = transforms.Compose([
            transforms.Resize((imh, imw)),
            transforms.Pad((max(int((imh-imw)/2), 0), max(int((imw-imh)/2), 0),
                            max(int((imh-imw)/2), 0), max(int((imw-imh)/2), 0)),
                           (128, 128, 128)),
            transforms.ToTensor(),
        ])
        # convert image to Tensor
        image_tensor = img_transforms(img).float()
        image_tensor = image_tensor.unsqueeze_(0)
        input_img = Variable(image_tensor.type(Tensor))
        # run inference on the model and get detections
        with torch.no_grad():
            detections = model(input_img)
            detections = utils.non_max_suppression(detections, 80,
                                                   conf_thres, nms_thres)
        return detections[0]
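To make the scale-and-pad (“letterbox”) arithmetic above concrete, here is what it computes for a hypothetical 1280x720 frame: the longer side is scaled down to 416 and the shorter side is padded with gray (128, 128, 128) so the network receives the square input it expects:

    # letterbox arithmetic for a hypothetical 1280x720 frame
    img_w, img_h = 1280, 720
    ratio = min(416 / img_w, 416 / img_h)                  # 416/1280 = 0.325
    imw, imh = round(img_w * ratio), round(img_h * ratio)  # 416 x 234
    pad = max(int((imw - imh) / 2), 0)                     # (416 - 234) / 2 = 91
    # result: a 416x234 image with 91 rows of gray padding above and below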

Finally, let’s load an image, get the detection results, and display the image with bounding boxes drawn around the detected objects, using a different color for each class.

    # load image and get detections
    img_path = "images/blueangels.jpg"
    prev_time = time.time()
    img = Image.open(img_path)
    detections = detect_image(img)
    inference_time = datetime.timedelta(seconds=time.time() - prev_time)
    print('Inference Time: %s' % (inference_time))

    # get bounding-box colors
    cmap = plt.get_cmap('tab20b')
    colors = [cmap(i) for i in np.linspace(0, 1, 20)]

    img = np.array(img)
    plt.figure()
    fig, ax = plt.subplots(1, figsize=(12, 9))
    ax.imshow(img)

    # recover the padding applied in detect_image so the boxes
    # can be mapped back to the original image coordinates
    pad_x = max(img.shape[0] - img.shape[1], 0) * (img_size / max(img.shape))
    pad_y = max(img.shape[1] - img.shape[0], 0) * (img_size / max(img.shape))
    unpad_h = img_size - pad_y
    unpad_w = img_size - pad_x

    if detections is not None:
        unique_labels = detections[:, -1].cpu().unique()
        n_cls_preds = len(unique_labels)
        bbox_colors = random.sample(colors, n_cls_preds)
        # browse detections and draw bounding boxes
        for x1, y1, x2, y2, conf, cls_conf, cls_pred in detections:
            box_h = ((y2 - y1) / unpad_h) * img.shape[0]
            box_w = ((x2 - x1) / unpad_w) * img.shape[1]
            y1 = ((y1 - pad_y // 2) / unpad_h) * img.shape[0]
            x1 = ((x1 - pad_x // 2) / unpad_w) * img.shape[1]
            color = bbox_colors[int(np.where(unique_labels == int(cls_pred))[0])]
            bbox = patches.Rectangle((x1, y1), box_w, box_h,
                                     linewidth=2, edgecolor=color,
                                     facecolor='none')
            ax.add_patch(bbox)
            plt.text(x1, y1, s=classes[int(cls_pred)],
                     color='white', verticalalignment='top',
                     bbox={'color': color, 'pad': 0})

    plt.axis('off')
    # save image
    plt.savefig(img_path.replace(".jpg", "-det.jpg"),
                bbox_inches='tight', pad_inches=0.0)
    plt.show()

Here are some of our results:

Object tracking in video

Now you know how to detect different objects in an image. If you run detection frame by frame on a video, you’ll see the bounding boxes move. But when there are multiple objects in those frames, how do you know whether an object in one frame is the same object as in the previous frame? This is called object tracking: using detections across multiple frames to identify a particular object over time.

There are several algorithms that do this, and for this article we chose SORT (Simple Online and Realtime Tracking), which uses a Kalman filter to predict the trajectory of previously identified objects and then matches those predictions with new detections. It is both simple and fast.
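The sort module we import below implements all of this internally. Purely to illustrate the matching step, here is a rough sketch of how SORT-style association can work; the helper names box_iou and associate are ours, and the real library additionally manages track creation, aging, and deletion:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def box_iou(a, b):
        # intersection-over-union of two boxes in (x1, y1, x2, y2) format
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
        return inter / union if union > 0 else 0.0

    def associate(predicted_boxes, detected_boxes, iou_thres=0.3):
        # cost matrix: 1 - IoU between every Kalman-predicted track box
        # and every fresh detection
        cost = np.array([[1.0 - box_iou(p, d) for d in detected_boxes]
                         for p in predicted_boxes])
        track_idx, det_idx = linear_sum_assignment(cost)  # Hungarian algorithm
        # keep only pairs that overlap enough; unmatched tracks coast on
        # their Kalman prediction, unmatched detections start new tracks
        return [(t, d) for t, d in zip(track_idx, det_idx)
                if cost[t, d] <= 1.0 - iou_thres]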

Now for the code. The first three code segments will be the same as in single-image detection, since they deal with getting YOLO detections on a single frame. The difference comes in the last section: for each frame’s detections we call the update function of the Sort object, which returns the tracked objects in the image. So instead of the plain detections of the previous example (the bounding-box coordinates and class prediction), each tracked object also carries an object ID. We use OpenCV to read the video and display the frames.

    videopath = 'video/interp.mp4'

    %pylab inline
    import cv2
    from IPython.display import clear_output

    cmap = plt.get_cmap('tab20b')
    colors = [cmap(i)[:3] for i in np.linspace(0, 1, 20)]

    # initialize Sort object and video capture
    from sort import *
    vid = cv2.VideoCapture(videopath)
    mot_tracker = Sort()

    #while(True):
    for ii in range(40):
        ret, frame = vid.read()
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pilimg = Image.fromarray(frame)
        detections = detect_image(pilimg)

        img = np.array(pilimg)
        pad_x = max(img.shape[0] - img.shape[1], 0) * (img_size / max(img.shape))
        pad_y = max(img.shape[1] - img.shape[0], 0) * (img_size / max(img.shape))
        unpad_h = img_size - pad_y
        unpad_w = img_size - pad_x

        if detections is not None:
            # feed the detections to SORT and get back the tracked objects
            tracked_objects = mot_tracker.update(detections.cpu())
            unique_labels = detections[:, -1].cpu().unique()
            n_cls_preds = len(unique_labels)
            for x1, y1, x2, y2, obj_id, cls_pred in tracked_objects:
                box_h = int(((y2 - y1) / unpad_h) * img.shape[0])
                box_w = int(((x2 - x1) / unpad_w) * img.shape[1])
                y1 = int(((y1 - pad_y // 2) / unpad_h) * img.shape[0])
                x1 = int(((x1 - pad_x // 2) / unpad_w) * img.shape[1])
                # one color per object ID, so a track keeps its color
                # from frame to frame
                color = colors[int(obj_id) % len(colors)]
                color = [i * 255 for i in color]
                cls = classes[int(cls_pred)]
                cv2.rectangle(frame, (x1, y1), (x1+box_w, y1+box_h),
                              color, 4)
                cv2.rectangle(frame, (x1, y1-35), (x1+len(cls)*19+60, y1),
                              color, -1)
                cv2.putText(frame, cls + "-" + str(int(obj_id)),
                            (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX,
                            1, (255, 255, 255), 3)

        fig = figure(figsize=(12, 8))
        title("Video Stream")
        imshow(frame)
        show()
        clear_output(wait=True)


Let’s take a look at the results: