As COVID-19 continues, people need to wear masks in public. Here, we share a computer vision model that detects whether a person is wearing a mask.

Data set collection:

The model was built on a dataset of 331 training images and 17 test images, labelled with the classes mask and no mask, including side-view images of faces without masks.

Data annotation:

The images were annotated using makesense.ai, a free, open-source site that lets us annotate images and download the annotations in the format of our choice. Annotations can be exported in two formats:

1. .xml (PASCAL VOC) format: each bounding box is given by its x-min, y-min, x-max, y-max pixel coordinates

2. .txt (YOLO) format: each line contains the object class index and the box coordinates (centre x, centre y, box width, box height), normalized by the image width and height
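To make the difference concrete, here is a minimal sketch (not part of the original pipeline) that converts one PASCAL VOC box in pixel coordinates into a YOLO-format line; the class id used here is just an illustration.

def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h, class_id):
    # YOLO expects centre x, centre y, box width, box height,
    # all normalized by the image width and height
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    box_w = (xmax - xmin) / img_w
    box_h = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"

print(voc_to_yolo(120, 80, 260, 240, 640, 480, 0))
# -> 0 0.296875 0.333333 0.218750 0.333333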

Data preparation and modeling:

Because two different models are used here, data preparation is done in two different ways.

1. TensorFlow Object Detection with a pre-trained SSD ResNet:

A .csv file must first be created from the XML annotations.

import glob
import shutil
# ElementTree is needed to parse the XML annotation files before converting them to CSV
import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_csv(data):
    xml_list = []
    for files in data:
        if '.xml' in files:
            tree = ET.parse(files)
            root = tree.getroot()
            # one row per annotated object in the image
            for member in root.findall('object'):
                value = (root.find('filename').text,
                         int(root.find('size').find('width').text),
                         int(root.find('size').find('height').text),
                         member.find('name').text,
                         int(member.find('bndbox').find('xmin').text),
                         int(member.find('bndbox').find('ymin').text),
                         int(member.find('bndbox').find('xmax').text),
                         int(member.find('bndbox').find('ymax').text))
                xml_list.append(value)

    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df
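A minimal sketch of calling xml_to_csv on the train and test annotation folders and saving the CSVs used below (the annotation directory paths are assumptions):

train_files = glob.glob('/content/FaceMaskDetection/Annotations/train/*')
test_files = glob.glob('/content/FaceMaskDetection/Annotations/test/*')

# one row per bounding box, saved where the TFRecord script expects them
xml_to_csv(train_files).to_csv('/content/FaceMaskDetection/train.csv', index=False)
xml_to_csv(test_files).to_csv('/content/FaceMaskDetection/test.csv', index=False)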

After that, we need to create the training and test TFRecord files.

""" Usage: # From tensorflow/models/ # Create train data: python generate_tfrecord.py --csv_input=data/train_labels.csv --output_path=train.record # Create test data: python generate_tfrecord.py --csv_input=data/test_labels.csv --output_path=test.record """ from __future__ import division from __future__ import print_function from __future__ import absolute_import import os import io import pandas as pd import tensorflow as tf from PIL import Image import sys sys.path.append('.. /') from object_detection.utils import dataset_util from collections import namedtuple, OrderedDict # flags = tf.app.flags # flags.DEFINE_string('csv_input', '', '/content/Object_Detection/test_labels.csv') # flags.DEFINE_string('output_path', '', '/content/Object_Detection/Annotations/test.record') # flags.DEFINE_string('image_dir', '', '/content/Object_Detection/Images/test') # FLAGS = flags.FLAGS # TO-DO replace this with label map def class_text_to_int(row_label): if row_label == 'no mask': return 1 elif row_label == 'mask': return 2 def split(df, group): data = namedtuple('data', ['filename', 'object']) gb = df.groupby(group) return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)] def create_tf_example(group, path): with tf.io.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid: encoded_jpg = fid.read() encoded_jpg_io = io.BytesIO(encoded_jpg) image = Image.open(encoded_jpg_io) width, height = image.size #change here based on your image extension filename = group.filename.encode('utf8') image_format = b'jpeg' xmins = [] xmaxs = [] ymins = [] ymaxs = [] classes_text = [] classes = [] for index, row in group.object.iterrows(): xmins.append(row['xmin'] / width) xmaxs.append(row['xmax'] / width) ymins.append(row['ymin'] / height) ymaxs.append(row['ymax'] / height) classes_text.append(row['class'].encode('utf8')) classes.append(class_text_to_int(row['class'])) tf_example = tf.train.Example(features=tf.train.Features(feature={ 'image/height': dataset_util.int64_feature(height), 'image/width': dataset_util.int64_feature(width), 'image/filename': dataset_util.bytes_feature(filename), 'image/source_id': dataset_util.bytes_feature(filename), 'image/encoded': dataset_util.bytes_feature(encoded_jpg), 'image/format': dataset_util.bytes_feature(image_format), 'image/object/bbox/xmin': dataset_util.float_list_feature(xmins), 'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs), 'image/object/bbox/ymin': dataset_util.float_list_feature(ymins), 'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs), 'image/object/class/text': dataset_util.bytes_list_feature(classes_text), 'image/object/class/label': dataset_util.int64_list_feature(classes), })) return tf_example def main_train(): writer = tf.io.TFRecordWriter('/content/FaceMaskDetection/Annotations/train.record') path = os.path.join('/content/FaceMaskDetection/Images/train') examples = pd.read_csv('/content/FaceMaskDetection/train.csv') grouped = split(examples, 'filename') for group in grouped: tf_example = create_tf_example(group, path) writer.write(tf_example.SerializeToString()) writer.close() output_path = os.path.join(os.getcwd(),'/content/FaceMaskDetection/Annotations/train.record') print('Successfully created the TFRecords: {}'.format(output_path))Copy the code

We trained the model for 10,000 steps. For the last few steps, here is what the losses look like:

From the above we can see that the total loss is still quite high, so we don’t expect this model to perform well. For model inference, we use the following code.
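The inference snippet below relies on a detect_fn restored from the trained checkpoint and on a files dictionary holding the label map path. A minimal sketch of that setup (the paths and the checkpoint number are assumptions):

import os
import tensorflow as tf
from object_detection.utils import config_util
from object_detection.builders import model_builder

# paths and checkpoint number are assumptions for illustration
files = {
    'PIPELINE_CONFIG': '/content/FaceMaskDetection/pipeline.config',
    'LABELMAP': '/content/FaceMaskDetection/Annotations/label_map.pbtxt',
}
configs = config_util.get_configs_from_pipeline_file(files['PIPELINE_CONFIG'])
detection_model = model_builder.build(model_config=configs['model'], is_training=False)

# restore the latest training checkpoint
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore('/content/FaceMaskDetection/model/ckpt-11').expect_partial()

@tf.function
def detect_fn(image):
    image, shapes = detection_model.preprocess(image)
    prediction_dict = detection_model.predict(image, shapes)
    detections = detection_model.postprocess(prediction_dict, shapes)
    return detections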

import cv2
import numpy as np
from matplotlib import pyplot as plt
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as viz_utils

category_index = label_map_util.create_category_index_from_labelmap(files['LABELMAP'])

IMAGE_PATH = os.path.join('/content/FaceMaskDetection/Images/train/N47.jpeg')
img = cv2.imread(IMAGE_PATH)
image_np = np.array(img)

# add a batch dimension and run the detector
input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.float32)
detections = detect_fn(input_tensor)

num_detections = int(detections.pop('num_detections'))
detections = {key: value[0, :num_detections].numpy()
              for key, value in detections.items()}
detections['num_detections'] = num_detections

# detection_classes should be ints.
detections['detection_classes'] = detections['detection_classes'].astype(np.int64)

label_id_offset = 1
image_np_with_detections = image_np.copy()

viz_utils.visualize_boxes_and_labels_on_image_array(
    image_np_with_detections,
    detections['detection_boxes'],
    detections['detection_classes'] + label_id_offset,
    detections['detection_scores'],
    category_index,
    use_normalized_coordinates=True,
    max_boxes_to_draw=1,
    min_score_thresh=0.1,
    agnostic_mode=False)

newsize = (300, 300)
plt.figure(figsize=(15, 10))
# image_np_with_detections = image_np_with_detections.resize(newsize)
plt.imshow(cv2.cvtColor(image_np_with_detections, cv2.COLOR_BGR2RGB))
plt.show()

Let’s examine how the model performs.

It is clear from the above that this model is not performing well enough. We could either train for more steps or switch to a better model; however, the losses have not been decreasing much and may have reached a saturation point, so we opt for a better model.

2. YOLOv5s:

In this case, we use the .txt annotations to train the model. There are several variants of YOLOv5; we start with the YOLOv5s variant to check the performance of the model.

import yaml

with open("data.yaml", 'r') as stream:
    num_classes = str(yaml.safe_load(stream)['nc'])

# register a small cell magic that writes a templated file from a notebook cell
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))

%%writetemplate /content/yolov5/models/custom_yolov5s.yaml

# parameters
nc: 2  # number of classes  # CHANGED HERE
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, BottleneckCSP, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, BottleneckCSP, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, BottleneckCSP, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, BottleneckCSP, [1024, False]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, BottleneckCSP, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, BottleneckCSP, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, BottleneckCSP, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, BottleneckCSP, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
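With the custom configuration written, training is launched with YOLOv5's train.py from the repository root. A sketch using the batch size and epoch count mentioned below (the image size, data.yaml location and run name are assumptions):

!python train.py --img 640 --batch 80 --epochs 200 --data data.yaml --cfg ./models/custom_yolov5s.yaml --weights '' --name yolov5s_results --cache

Passing an empty --weights string trains from scratch; alternatively, --weights yolov5s.pt starts from the pretrained YOLOv5s checkpoint.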

Before actually running the training, make sure the file structure looks like this:

Make sure the image and label filenames match exactly (including case), otherwise an error will be raised. Now we train our model with a batch size of 80 for 200 epochs. For the last few epochs, this is what our losses look like:

Let’s examine how our model performs:

You can see that this model performs well. Let’s see how it does on some examples:

From the above we can observe that the model has not overfit at all.
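For completeness, inference on new images with the trained weights can be run with YOLOv5's detect.py; the weights path and source folder here are assumptions:

!python detect.py --weights runs/train/yolov5s_results/weights/best.pt --img 640 --conf 0.4 --source ../test_images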

Examples of computer vision techniques:

Drawing on many years of experience in video technology, TSINGSEE Black Rhino Video has been continuously developing in the field of AI plus video, integrating AI detection and intelligent recognition technology into various video application scenarios, such as security monitoring, face detection in video, traffic flow statistics, and detection of risky behaviors (climbing, falling, shoving, etc.). A typical example is the EasyCVR video fusion cloud service, which offers AI face recognition, license plate recognition, voice intercom, PTZ control, sound and light alarms, surveillance video analysis, and data aggregation capabilities.