Abstract: With the rapid development of AI, quickly preparing large amounts of high-quality data has become a very challenging problem in AI development.

This article is shared from the Huawei Cloud Community article “How to Quickly Prepare High-Quality AI Data?”, by Xu Bo.

1. Background

Generally speaking, the three elements of AI are data, algorithms, and computing power, and none of the three is dispensable; it is precisely because all three have fallen into place that the current AI boom has grown so quickly. Data quality directly affects model accuracy: in general, a large amount of high-quality data makes it more likely that a high-precision AI model can be trained. At present, many algorithms can reach 85% or 90% accuracy with conventional data, but commercial applications often have higher requirements. Raising model accuracy to 96% or even 99% requires a large amount of high-quality data that is more refined, scenario-specific, and specialized, and this is often the key condition for an AI model to break through its bottleneck.

In most AI and machine learning projects, data preparation and engineering tasks take up more than 80% of the time, with data cleaning and data labeling alone accounting for about 50% of the whole project. Data preparation is very labor-intensive, so how to quickly prepare a large amount of high-quality data has become a very challenging problem in AI development.

ModelArts is a one-stop development platform for AI developers. It supports the whole development workflow from data to AI application, including data processing, algorithm development, model training, and model deployment, and it provides an AI Gallery for sharing data, algorithms, models, and more with other developers. To help users quickly prepare large volumes of high-quality data, ModelArts data management provides the following key capabilities:

  • Data preview and multi-dimensional filtering, so that AI developers can quickly understand their data.

  • Data verification, automatic grouping, and other data processing functions that speed up data cleaning.

  • More than 12 annotation tools that help users annotate data for all kinds of scenarios.

  • Intelligent annotation, team annotation, and other functions that accelerate annotation and ensure annotation quality.

See ModelArts Data Management for more functions:

ModelArts Data Management provides the ability to prepare high quality AI data

This case uses the original traffic sign recognition dataset to demonstrate, with ModelArts:

  1. How to use the data validation function to quickly clean data;

  2. How to use the automatic grouping function to select the desired data from a large amount of data;

  3. How to use annotation tools to quickly complete annotation;

  4. How to use intelligent annotation and other functions to speed up data annotation.

With intelligent annotation, users only need to confirm or make small adjustments to complete annotation, which greatly improves annotation efficiency and saves annotation time.

When you’re done with this case, you’ll know how to use ModelArts to quickly prepare large amounts of high-quality data.

2. Preparation

Before starting, you need to complete the following preparations: Huawei Cloud account registration, real-name authentication, ModelArts global configuration, and OBS operations. For details, see this document.

3. Operation

This case is mainly divided into the following steps: ① download the dataset from AI Gallery into ModelArts data management; ② data validation: handle invalid data; ③ automatic grouping: delete unwanted data; ④ data annotation: label the data; ⑤ intelligent annotation: accelerate data annotation with AI; ⑥ publish the dataset: share the data.

Operation flow chart

1. Download the dataset

The dataset used in this case, “Traffic Sign Recognition Original Dataset”, has been uploaded to AI Gallery; the AI Gallery address is marketplace.huaweicloud.com/markets/aih… . After entering AI Gallery, select the Data tab and search for the dataset name “Traffic Sign Recognition Original Dataset”, or click the dataset link to download it directly.

Search for the dataset name “Traffic Sign Recognition Original Dataset”

Details of the “Traffic Sign Recognition Original Dataset”

Select this dataset to download and configure its target location (you first need to create a bucket and directory in OBS). Change the name to “Traffic Sign Recognition” and add a description if needed. After you confirm the download, the page jumps to the “My Data” page, where you can open the “My Downloads” tab to check the download progress.

Download the “Traffic Sign Recognition Original Dataset”

Download progress

Dataset details

2. Clean data

1) Data identification

After the data is downloaded, you need to get to know it: how much data there is, what kind of data it is, and whether it needs cleaning. Click “Start annotation” to preview the data and see the dataset’s sample list. There are 706 images in total: 500 traffic sign recognition images (100 already labeled and 400 not yet labeled), 200 plant images, and 6 other images. Each image in the sample list shows its label information, and the full label information of the dataset is displayed on the right. The existing labels are:

Label information

Sample list of the dataset

2) Data filtering

When viewing data, you often need to filter it to find the data you want to see. Click the expand button to the right of the filter criteria to select the relevant conditions. ModelArts data management supports filtering by label name, file name, annotator, sample attribute, hard-example information, and more, and multiple filter criteria can be combined.

Data filtering

For example, if you want to view the sample list with the label “green_go”, you can directly select the label name.

A list of samples labeled “green_go”.

In real application scenarios, data is often mixed with invalid data that needs to be cleaned. This dataset also contains such invalid data: 2 images with encoding errors (badencode1.jpg, badencode2.jpg), 2 images with wrong suffixes (badSuffix1.png, badSuffix2.png), and 2 single-channel images (badchannel1.jpg, badchannel2.jpg). For example, if you filter by the file name “badencode1.jpg”, you can see that the image fails to load because its encoding is wrong.

Viewing invalid data by file name badencode1.jpg
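These three kinds of problems can also be checked locally with a few lines of Python before or after uploading. The sketch below uses Pillow and a hypothetical local folder name; it only illustrates the idea and is not the ModelArts data validation implementation.

```python
from pathlib import Path
from PIL import Image, UnidentifiedImageError

def validate_images(folder):
    """Flag images that are undecodable, have a misleading suffix, or are single-channel."""
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        try:
            with Image.open(path) as img:
                img.verify()                       # raises on broken encodings (e.g. badencode1.jpg)
            with Image.open(path) as img:          # reopen: verify() leaves the image unusable
                fmt = (img.format or "").lower()
                if fmt == "jpeg" and path.suffix.lower() == ".png":
                    print(f"{path.name}: suffix is .png but the content is JPEG")
                if img.mode in ("L", "1"):         # single-channel images (e.g. badchannel1.jpg)
                    print(f"{path.name}: single-channel, fix with img.convert('RGB')")
        except (UnidentifiedImageError, OSError):
            print(f"{path.name}: cannot be decoded, candidate for deletion")

validate_images("./traffic_sign_raw")              # hypothetical local copy of the raw dataset
```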

3) Create a data processing job of the data validation type

ModelArts data processing provides a “data validation” function to check the data. Go to the Data processing page under the ModelArts home page to create a data processing job.

Data processing page

When creating the data processing job, change the job name to “Datavalidate”, select the scenario category “Object Detection”, set the data processing type to “Data Validation”, set the input to version V001 of the “Traffic Sign Recognition” dataset, and set the output to version V002 of the same dataset.

Create a data processing job of the Data Validation type

4) View the data validation result

Confirm the data validation result: wait a few minutes for the data processing job to finish. After the “Datavalidate” job completes, you can view the data. Select the output dataset, version V002 of “Traffic Sign Recognition”; you will then be prompted to switch versions. Clicking “Yes” switches the version and jumps to the dataset page, which shows the dataset details. If you do not switch the version, the dataset still displays the data from before validation, which may cause subsequent steps to fail. Checking the result, you can see that only 704 images remain: the 2 images with wrong encoding have been deleted, and the 2 images with wrong suffixes and the 2 single-channel images have been fixed. In other words, the dataset has been cleaned.

Select the output dataset version to view

Filtering by file name badencode1.jpg shows that the invalid data has been cleaned

3. Automatic grouping

1) Start the task

After checking the data, we find 500 traffic sign images, 200 plant images, and 4 other images. If you did not successfully obtain the data in the previous steps, you can download the validated dataset directly from AI Gallery: the validated dataset for traffic sign recognition. You can refer to the following figure to download the data processed at the corresponding stage:

Data processed at the corresponding stage

Selecting the data you want to annotate one by one, or deleting the data you do not want one by one, would be slow and time-consuming. Instead, you can start the automatic grouping function to separate the traffic sign data from the plant data. Go to the “All” tab and click “Auto Group” to start the task.

Start the automatic grouping task for data selection

To start the automatic grouping task, set the number of groups to 3 and the attribute name to “group” (you can also customize the attribute name), click OK, and wait for the task to run. The progress of automatic grouping tasks is displayed in the upper right corner.

To start the automatic grouping task, fill in the parameters

Automatic grouping progress view
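ModelArts does not document the internals of automatic grouping, but conceptually it behaves like unsupervised clustering of image features. The sketch below reproduces that idea locally with a pretrained torchvision backbone and k-means; the folder name is hypothetical, and this is only an illustration of the concept, not the ModelArts algorithm itself.

```python
# Conceptual sketch only: cluster images into 3 groups by visual similarity,
# analogous to the "group" sample attribute produced by automatic grouping.
from pathlib import Path
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import models, transforms

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep the 512-d pooled features as embeddings
backbone.eval()

paths, feats = [], []
with torch.no_grad():
    for p in sorted(Path("./traffic_sign_validated").glob("*.jpg")):   # hypothetical folder
        x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
        feats.append(backbone(x).squeeze(0).numpy())
        paths.append(p.name)

groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(np.stack(feats))
for name, group in zip(paths, groups):
    print(group, name)                      # each image gets a group id of 0, 1, or 2
```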

2) View the task results

After automatic grouping finishes, expand the filter criteria on the “All” tab, select the sample attribute “group”, and then select an attribute value to view the results: the samples whose “group” value is 0 or 1 are basically traffic sign recognition data (the two groups differ only in shooting scene), while the samples whose “group” value is 2 are basically plant data.

The sample attribute is “group” and the value is 0

The sample attribute is “group” and the value is 1

Filter result with sample attribute “group” and value 2

3) Delete data

The data has now been grouped, and the grouping results are quite accurate, so the plant data can be deleted in batches based on them. Click “Select Current Page” in the upper right corner of the image list to select all images on the page, then browse through them; if any selected image is one you want to keep, deselect it. Finally, click “Delete Image” to complete the batch deletion. After the deletion, essentially only traffic sign recognition data remains.

Delete unwanted images in batches

4. Data annotation

After data cleaning is complete and the unwanted data has been deleted, about 500 images remain to be annotated. If you did not successfully obtain the data in the previous steps, you can download the cleaned dataset directly from AI Gallery: the cleaned dataset for traffic sign recognition.

On the dataset’s sample list page, click the “Unlabeled” tab and set the filter criteria to sample attribute “group” with value 0; you will then see the data from the first scene in the traffic sign dataset. See the user guide for more information.

List of unlabeled samples whose “group” attribute value is 0

Annotation Tools

Click any image to enter its sample details page for annotation. The annotation page provides an annotation toolbar, image details display, image list, label list, image switching, and other functions, as shown below.

Image annotation page

Select the rectangle tool, hold the left mouse button and drag to draw a box around the target, and then select a label to complete the annotation. Clicking “Next” automatically saves the annotation result; you can also press the N shortcut key to switch to the next image.

Data annotation

5. Intelligent annotation

In actual use, you will notice that the annotation workload of an object detection task is very large and that manual annotation is not efficient. In this case, the intelligent annotation function can be used to speed things up.

Intelligent annotation automatically annotates unannotated data. Users only need to confirm or make minor adjustments to complete annotation.

The principle behind intelligent annotation (active learning) is to train a model with part of the existing labeled data and a built-in ModelArts algorithm, and then use that model to predict the remaining unlabeled images. The “fast” mode is a supervised algorithm that trains only on labeled data; the “precise” mode is a semi-supervised algorithm that trains on both labeled and unlabeled data. Users can also bring their own model for intelligent annotation by selecting the pre-annotation function, which likewise produces automatic prediction results. After prediction, a person only needs to check the accuracy of the results: accurate predictions are accepted directly as annotations, and inaccurate ones are corrected manually. This human-machine collaboration greatly improves annotation efficiency and saves the user’s annotation time.
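As a rough illustration of this human-machine loop (and not of the built-in ModelArts algorithms), the toy sketch below trains a simple scikit-learn classifier on stand-in feature vectors, predicts the unlabeled pool, auto-accepts confident predictions, and routes the rest back to manual annotation. The data and the confidence threshold are made up purely for illustration.

```python
# Toy active-learning loop: train on the labeled pool, predict the unlabeled pool,
# auto-accept confident predictions, and send the rest back for manual annotation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 16))     # stand-ins for features of the 100 labeled images
y_labeled = rng.integers(0, 4, size=100)   # stand-ins for traffic-sign labels
X_unlabeled = rng.normal(size=(402, 16))   # 402 unlabeled samples, as in this dataset

model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
proba = model.predict_proba(X_unlabeled)
pred = proba.argmax(axis=1)                # the auto-assigned label for each unlabeled sample
confidence = proba.max(axis=1)

auto_accepted = confidence >= 0.9          # the user only needs to confirm these
needs_review = ~auto_accepted              # these go back for manual correction
print(f"auto-labeled: {auto_accepted.sum()}, needs manual review: {needs_review.sum()}")
```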

1) Enable intelligent annotation

Before enabling intelligent annotation, you are advised to annotate at least 15 samples for each label so that the results are more precise. Click “Start Intelligent Annotation” in the upper right corner of the sample list, keep the default options, and click Submit to start intelligent annotation.

Intelligent annotation entry

Enabling intelligent annotation

2) Check the progress of intelligent annotation

After you submit an intelligent annotation task, the intelligent annotation progress page is displayed. You can also click the “To Confirm” tab to view the task progress.

Progress of the intelligent annotation task

3) Confirm the result of intelligent annotation

After intelligent annotation is complete, you can view the results on the “To Confirm” tab.

Intelligent annotation result list

402 images were unlabeled, and intelligent annotation produced results for all 402 of them. Click a specific image to enter its details page for confirmation and check whether the labels are accurate. If a label is correct, click “Confirm Label”; if it is not, adjust the result and then click “Confirm Label”.

Check the intelligent annotation result

6. Publish the dataset

1) Publish the dataset version

After data annotation is complete, you can publish a dataset version, optionally enabling data splitting and adding a description.

Publish the dataset version

After publishing completes, a fixed version is produced that records the total number of samples and how many of them are labeled. A manifest file is also generated; the manifest records the information of every sample and the storage locations of the annotation files. For object detection, there are also XML annotation files in Pascal VOC format. Please refer to the official documentation for a detailed description.

Version details
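To give a feel for the published artifacts, the hedged sketch below reads a manifest file (a JSON-lines file whose entries point to each sample and its annotation file) and a Pascal VOC XML annotation. The field names follow the documented ModelArts manifest format but should be checked against the official specification; the local file paths here are hypothetical.

```python
# Sketch of reading the published artifacts: manifest (JSON lines) and Pascal VOC XML.
import json
import xml.etree.ElementTree as ET

def read_manifest(manifest_path):
    """Yield (image URI, annotation URI) pairs from a ModelArts-style manifest."""
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            image_uri = entry["source"]                      # OBS path of the image
            for ann in entry.get("annotation", []):
                yield image_uri, ann.get("annotation-loc")   # OBS path of the VOC XML

def read_voc_boxes(xml_path):
    """Yield (label, (xmin, ymin, xmax, ymax)) for each object in a Pascal VOC file."""
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):
        label = obj.findtext("name")                         # e.g. "green_go"
        box = obj.find("bndbox")
        yield label, tuple(int(float(box.findtext(k))) for k in ("xmin", "ymin", "xmax", "ymax"))

# Usage with hypothetical local copies downloaded from OBS:
for image_uri, xml_uri in read_manifest("./V003.manifest"):
    print(image_uri, xml_uri)
# for label, box in read_voc_boxes("./local_annotation.xml"): print(label, box)
```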

2) Publish the dataset version to the AI Gallery

After a dataset version is published, you can select that version for training in ModelArts, or publish the dataset to AI Gallery and share it with other users. Go to the Data page in AI Gallery, click the “Publish” button, fill in the name of the published dataset (for example “HDC2021–Traffic Sign Recognition Dataset”), select the dataset “Traffic Sign Recognition” and version “V003”, set the data type to image, choose a license type, and click Publish.

AI Gallery publishes data sets

Publish data sets to the AI Gallery

After the dataset is published, you can click the edit button to complete the dataset information, including the dataset’s cover page.

Click Edit to complete the dataset information

At this point, the case is complete.

Click Follow to be the first to learn about Huawei Cloud’s latest technologies~