1. Problem analysis

1.1. Introduction to the challenge

The first “Malanshan Cup” International Audio and Video Algorithm Contest was guided by the Chinese Society of Industrial and Mathematical Applications, sponsored by the Hunan Internet Information Office and the Hunan Science and Technology Association, and hosted by the Malanshan Video Cultural and Creative Industry Park and Mango TV, China.

The contest was divided into three tracks: video specific-point tracking, video recommendation, and image quality damage repair. I took part in the video specific-point tracking track and won second place in both the preliminary round and the final. My approach was, roughly, image registration + image tracking.

Video specific-point tracking is the task of tracking, given a region annotated in the first frame of a video, the position of that region in all subsequent frames. One application of this technology is dynamic advertisement insertion in video, which can improve the monetization of a video platform and reduce the risk of blind spending by advertisers without hurting the viewing experience. Dynamic video advertising has become a popular form of advertising. It requires the natural and accurate fusion of the advertising visuals with the original video content, so that the inserted material is indistinguishable from the real scene. This in turn requires us to accurately estimate camera motion, accurately estimate light and shadow, and properly handle depth of field and object occlusion. Solving these problems, making the inserted advertisement look realistic, and automating and standardizing the insertion process as much as possible is extremely challenging.

Concretely, the target position is given in the first frame, and the task is to predict its position in all subsequent frames. The position is given as the four clockwise vertices of a quadrilateral, which may be irregular. Judging from participants’ feedback, most found the problem very interesting, and its application scenarios are broad. The sponsor’s main application is advertisement insertion: by pasting advertising posters or dynamic videos onto the tracked region, the advertisement is embedded into the content being watched, naturally and without affecting the viewing experience. The required tracking accuracy is therefore particularly high; according to the organizer, the MSE should be below 1 so as not to affect the viewing effect.

1.2. Data description

Research data: used to develop the solution. The first frame of each video is annotated with the region where the advertisement is to be inserted. The scenes are relatively simple and the clips are short (~100 frames). There are 2,000 video clips.

Validation set: 100 video clips, with the four feature points of the insertion region annotated in every frame. It is used to verify the insertion effect of the solution.

Test set: split into A and B leaderboards, each containing 200 video clips, with the four feature points of the insertion region annotated in every frame. It is used to evaluate the insertion effect of the solution. Contestants are given only the coordinates of the four feature points in the first frame.

1.3. Evaluation indicators

Let’s say there are N videos, and a video has M frames.

For a given frame F, an MSE is computed over the four predicted vertices; the MSE of a video is the average of its frame-level MSEs, and the final MSE over the whole set is the average of the per-video MSEs.
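The formulas themselves are not reproduced here; a plausible reconstruction, assuming the frame error is the mean squared Euclidean distance over the four annotated vertices (the organizer’s exact normalization may differ), is:

```latex
% Hedged reconstruction of the metric. p_i are the ground-truth vertices of frame F,
% \hat{p}_i the predicted vertices.
\mathrm{MSE}_F = \frac{1}{4}\sum_{i=1}^{4}\bigl\lVert \hat{p}_i - p_i \bigr\rVert_2^{2},
\qquad
\mathrm{MSE}_{\mathrm{video}} = \frac{1}{M}\sum_{F=1}^{M}\mathrm{MSE}_F,
\qquad
\mathrm{MSE} = \frac{1}{N}\sum_{v=1}^{N}\mathrm{MSE}_{\mathrm{video}}^{(v)}.
```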

2. Choosing the approach

2.1. Review of commonly used approaches

For general tracking applications, the first things that come to mind are traditional methods such as correlation filtering (KCF) and mean-shift, followed by deep-learning trackers. For tracking specific kinds of targets, industry often uses tracking-by-detection, for example in face tracking and pedestrian tracking: the target box is detected first, and the boxes across the sequence are then linked by appearance correlation or positional relationships. However, these methods cannot be applied directly to this competition; there are roughly two differences:

  • The box in the contest is an irregular quadrilateral, while the sampling window in KCF and the anchors of DNN detection models are generally axis-aligned rectangles, and high accuracy is required for all four vertices.

  • The content inside the box may be occluded or may change over time, and the target may not have any distinctive features.

In the end, I concluded that this challenge lies somewhere between image registration (commonly used in SLAM and medical imaging) and image tracking (VOT, MOT). Some scenes in this problem cannot be handled by image registration alone, and a scheme using only image tracking does not work either; I describe the details later. Therefore, when choosing a solution, I combined approaches from both directions with a simple strategy.

2.1.1. Image registration

The general process of image registration is to align two images through feature extraction, feature matching, and finally image warping. Simply put, we detect points of interest in both images, associate points in the reference image with their counterparts in the sensed image, and then transform the sensed image so that the two are aligned. SIFT + NN (nearest-neighbour) matching is the classic traditional method, and industry may use ORB features for efficiency. Deep learning also has strong work on registration; SuperPoint and SuperGlue are the mainstream choices. SuperPoint uses deep learning to extract feature points from images, while SuperGlue matches the feature points of two frames using a CNN and a GNN. There are also newer SOTA registration papers from 2020 that are worth studying.

In the end I adopted SIFT + NN. It is not that SuperPoint + SuperGlue is unattractive, but there was no training dataset, and features trained with deep learning may not generalize well here. Moreover, even though SIFT + NN produces more mismatches than SuperPoint + SuperGlue, a good transformation can still be obtained, since robust estimation filters out the outliers.
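As a minimal sketch of this choice (the function name, ratio threshold, and RANSAC parameters are mine, not from the original write-up), the SIFT + NN + robust homography step in OpenCV looks roughly like this:

```python
import cv2
import numpy as np

def register_quad_sift(ref_gray, cur_gray, ref_quad, ratio=0.75):
    """Match SIFT features between the reference and current frame with a nearest-neighbour
    ratio test, estimate a RANSAC homography, and project the annotated quad through it."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(ref_gray, None)
    kp2, des2 = sift.detectAndCompute(cur_gray, None)
    if des1 is None or des2 is None:
        return None

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test drops ambiguous matches; RANSAC below handles the rest.
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    if len(good) < 4:
        return None

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None:
        return None
    return cv2.perspectiveTransform(np.float32(ref_quad).reshape(-1, 1, 2), H).reshape(-1, 2)
```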

2.1.2. Image tracking

Image tracking is also a large field; here I will only briefly explain why I used SiamMask. Its mask head can produce a pixel-level segmentation of the target, which works well in most scenes, but it does not give the exact positions of the four vertices, so I only use it to calibrate the center position. I also borrowed some ideas from tracking to measure the similarity of candidate boxes.

2.2. Plan

The pipeline has two parts. The first part is position prediction, which recalls candidate boxes. The second part is similarity calculation, which selects the best box. The position prediction module predicts the position of the tracked box with several methods. The similarity calculation compares the content of each candidate in the current frame with the reference content and uses the result as the confidence of that candidate. Building the pipeline this way means my optimization direction was to recall the box position by every means possible and then pick the best box with an accurate similarity measure. A skeleton of the pipeline is sketched below.
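A hypothetical skeleton of this two-stage pipeline (the predictor and similarity callables are placeholders standing in for the concrete methods described in the next two subsections):

```python
from typing import Callable, List, Optional, Tuple
import numpy as np

Quad = np.ndarray  # (4, 2) array of clockwise vertices

def track_frame(ref_frame: np.ndarray, ref_quad: Quad, cur_frame: np.ndarray,
                predictors: List[Callable], similarity: Callable) -> Optional[Tuple[Quad, float]]:
    """Stage 1: recall candidate quads with every available predictor.
    Stage 2: score each candidate against the reference content and keep the best."""
    candidates = []
    for predict in predictors:  # e.g. full-image SIFT, local SIFT + corners, vertex templates
        quad = predict(ref_frame, ref_quad, cur_frame)
        if quad is not None:
            candidates.append(quad)
    if not candidates:
        return None
    scores = [similarity(ref_frame, ref_quad, cur_frame, q) for q in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```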

2.2.1. Position prediction

Feature points are extracted in partitions: the full image, a local window around the target, and the four vertex regions. For the full image, SIFT feature points are used and matched with NN (nearest-neighbour) matching. For the local region, SIFT feature points plus corner points are used; SIFT is matched with NN, while the corners are matched by optical flow estimation with reverse (forward-backward) verification. For the four vertex regions, SIFT features are used, and because there are few feature points there, template matching is used instead. A homography is then estimated from each region’s feature points, as well as from the fused set of all feature points, as sketched below.
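A sketch of the candidate generation under my reading of this paragraph (the region bookkeeping and thresholds are illustrative assumptions):

```python
import cv2
import numpy as np

def candidate_quads(src_pts, dst_pts, ref_quad, regions):
    """Estimate one homography per feature-point region plus one from the fused point set,
    and project the reference quad through each to obtain candidate boxes.

    src_pts, dst_pts: matched points in the reference/current frame, float32 of shape (K, 2).
    regions: mapping from a region name ('full', 'local', 'vertex_0', ...) to the indices
             of the matches that belong to that region.
    """
    def project(H):
        return cv2.perspectiveTransform(np.float32(ref_quad).reshape(-1, 1, 2), H).reshape(-1, 2)

    candidates = {}
    for name, idx in regions.items():
        if len(idx) < 4:                      # a homography needs at least 4 correspondences
            continue
        H, _ = cv2.findHomography(src_pts[idx], dst_pts[idx], cv2.RANSAC, 3.0)
        if H is not None:
            candidates[name] = project(H)
    if len(src_pts) >= 4:                     # fused candidate from all regions together
        H, _ = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 3.0)
        if H is not None:
            candidates['fused'] = project(H)
    return candidates
```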

2.2.2. Similarity calculation

The similarity calculation itself is a common scheme. The region is enlarged along its diagonal before comparison, so that errors in direction show up in the score; finally, the region is warped to a rectangle for the comparison, as in the sketch below.
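A sketch of how such a comparison could be implemented (the enlargement factor, patch size, and blur are assumptions; SSIM is the measure named later in the effect-tuning section):

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def enlarge_quad(quad, scale=1.3):
    """Scale the quadrilateral about its centre so that small coordinate/direction errors
    translate into visible content differences after warping."""
    center = quad.mean(axis=0, keepdims=True)
    return (quad - center) * scale + center

def quad_similarity(ref_gray, ref_quad, cur_gray, cand_quad, size=(128, 128)):
    """Warp the (enlarged) reference and candidate quads to the same rectangle
    and compare the two patches with SSIM."""
    rect = np.float32([[0, 0], [size[0] - 1, 0],
                       [size[0] - 1, size[1] - 1], [0, size[1] - 1]])
    patches = []
    for img, quad in ((ref_gray, ref_quad), (cur_gray, cand_quad)):
        M = cv2.getPerspectiveTransform(np.float32(enlarge_quad(np.float32(quad))), rect)
        patch = cv2.warpPerspective(img, M, size)
        patches.append(cv2.GaussianBlur(patch, (3, 3), 0))  # light filtering before comparison
    return ssim(patches[0], patches[1])
```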

3. Effect tuning

3.1. Analysis

Image registration

  • Advantages: ideally, after registration and alignment, the coordinates of the four vertices of the tracked region can be obtained accurately from the geometric relationship between the matched points. Even if the tracked region is occluded, its position can be inferred from the rest of the image.

  • Disadvantages: feature points extracted from local regions are matched only by their descriptors, so global position and structural information is lost; in scenes with several similar regions this causes mismatches and therefore inaccurate vertex positions.

Target tracking

  • Advantages: for single-target tracking, the predicted region is informed by global context through the receptive field, so the rough position is usually fairly accurate.

  • Disadvantages: irregular quadrilateral boxes are not supported, which degrades the MSE; when the target changes appearance or is occluded, the box becomes inaccurate, and there is almost no remedy.

3.2. Schemes that brought improvement

3.2.1. Partitioning feature points by location (obvious score improvement)

Motivation: even when feature points are matched accurately and are numerous, the predicted box position is not necessarily accurate. For example, after registration and alignment using global feature points, the images are not necessarily aligned within the tracked region. The ideal case would be to directly find matches for the four vertices of the tracked region, but in practice those four vertices are not necessarily feature points. Therefore, during tuning I partitioned the feature points so as to give more weight to points close to the tracked region, which both improves the accuracy of feature matching and makes the recalled candidate boxes conform to the goal of local tracking.

About 5% of the videos were heavily blurred. These cases are very difficult to optimize. Typically the video shakes because of the way it was shot, the target to be tracked is the blurred background, and the close-up foreground is a moving object. In this situation most feature points lie on the foreground, so the predicted position is biased towards it, while the ground truth is the shaking, blurred background. In the end I could roughly cover such cases with corner points (which compensate for the lack of feature points in low-frequency image regions) + partition priority (which suppresses the many biased foreground points) + SiamMask (tracking is less affected in such cases).

3.2.2. Similarity tuning (obvious score improvement)

Motivation: once image similarity exceeds a certain threshold it is no longer reliable; for example, a 96% candidate is not necessarily better than a 95% one. In this case the region is enlarged along its diagonal before comparison, which amplifies the similarity error caused by coordinate errors, mainly those coming from diagonal deviation, so it essentially introduces a directional error term. Filtering and SSIM were used to measure similarity. There was no time to test whether deep metric learning or a Siamese network would give a better metric.

3.2.3. Optical flow estimation tuning (obvious improvement)

Motivation: in some scenes the SIFT feature points are insufficient, so corner points are added and matched by optical flow estimation. Moreover, because optical flow depends on temporal context and suits scenes with little background change, the frame closest in time to the candidate frame is added to the search_window as a reference frame.
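A sketch of the corner + optical-flow matching with a forward-backward (reverse verification) check; the parameters are illustrative:

```python
import cv2
import numpy as np

def corner_matches_lk(prev_gray, cur_gray, roi_mask=None, fb_thresh=1.0):
    """Track Shi-Tomasi corners with pyramidal Lucas-Kanade optical flow and keep only
    the points that pass a forward-backward consistency check."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01,
                                  minDistance=5, mask=roi_mask)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, fwd, None)
    # Keep a corner only if tracking it back lands close to where it started.
    fb_err = np.linalg.norm(pts - bwd, axis=2).ravel()
    keep = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < fb_thresh)
    return pts[keep].reshape(-1, 2), fwd[keep].reshape(-1, 2)
```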

3.2.4. Rerank strategy (small score increase)

Motivation: first, once the similarity exceeds a certain threshold it is not accurate (a 96% candidate is not necessarily better than a 95% one); second, under heavy occlusion the similarity becomes very low and its estimate is unreliable. I therefore rerank the similarity ranking using the partition priority as a reference, which also hits the optimal box effectively, as sketched below.
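One way such a rerank could look (the tie margin, low-similarity threshold, and priority encoding are my own illustrative choices, not the exact rules from the write-up):

```python
def rerank(candidates, scores, priorities, sim_tie=0.02, low_sim=0.5):
    """Rerank candidate quads: when similarities are nearly tied, or all very low under
    heavy occlusion, fall back to the partition priority instead of the raw score.

    candidates: list of quads; scores: SSIM-like similarities in [0, 1];
    priorities: smaller value = more trusted source region (e.g. vertex < local < full).
    """
    best_score = max(scores)
    if best_score < low_sim:
        # Similarity is unreliable under occlusion: trust the priority ordering alone.
        pool = range(len(candidates))
    else:
        # Keep only candidates whose similarity is within a small margin of the best.
        pool = [i for i, s in enumerate(scores) if best_score - s <= sim_tie]
    best = min(pool, key=lambda i: (priorities[i], -scores[i]))
    return candidates[best]
```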

3.3. Other schemes without improvement

3.3.1. Use of other traditional feature points

I tried several other traditional feature points, such as ORB, but none worked as well as SIFT. For scenes with few points, optical flow on corner points is a good complement, while the other feature points did not perform particularly well. I therefore do not recommend trying other traditional feature-point schemes; instead, deep-learning features are worth trying, and fine-tuning on the competition data should work well.

3.3.2. Traditional target tracking such as KCF

KCF is not as effective as SiamMask, and for difficult scenes no amount of tuning helps. I do not recommend spending too much energy on target tracking, because the main work here is correcting the positions of matched feature points; it is hard to get good results from target tracking alone. The champion’s solution was also based on image registration.

4. Recommended tricks

4.1. Visualize and quantify the effect to speed up bad-case analysis

This is a common practice, but this problem has no ground truth for the test videos. From the very beginning I used the rule-based similarity calculation as a confidence score, which allowed me to iterate quickly during the ten-plus days between the preliminary round and the second round.
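For reference, a small sketch of the kind of debug overlay I mean (the file name, colors, and layout are arbitrary):

```python
import cv2
import numpy as np

def dump_debug_video(frames, quads, scores, out_path="debug.mp4", fps=25):
    """Overlay the predicted quad and its pseudo-confidence on every frame so that
    bad cases can be spotted quickly without ground truth."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame, quad, score in zip(frames, quads, scores):
        vis = frame.copy()
        if quad is not None:
            cv2.polylines(vis, [np.int32(quad).reshape(-1, 1, 2)], True, (0, 255, 0), 2)
        cv2.putText(vis, f"sim={score:.3f}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
        writer.write(vis)
    writer.release()
```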

4.2. The motivation behind some papers is worth borrowing

SIFT and other traditional features are local: SIFT matching alone carries no global position or structure information. That is why I introduced SiamMask: in theory, the CNN’s receptive field provides a globally informed reference tracking position. In the preliminary round this did improve the score, but in the final it dropped slightly and I removed it. I still feel it should help; with only three days in the final, I did not find the reason for the drop.

4.3. SuperPoint + SuperGlue

The SuperGlue paper is really strong. I feel SuperGlue improves a lot over NN matching, and SuperPoint + SuperGlue might raise the score. Other contestants did use SuperPoint + SuperGlue, but their ranking was not as good as mine; I still think the key was controlling the overall direction, since this problem is not only about image registration (see the effect tuning section). I also looked at Megvii’s CVPR 2020 SLAM Challenge solution: when matching points were few, they used dynamic threshold adjustment, although I do not know exactly how the dynamic adjustment was done. It could be tried as a trick as well.

References:

1. SuperPoint: Self-Supervised Interest Point Detection and Description

2. SuperGlue: Learning Feature Matching with Graph Neural Networks

3. SiamMask: the author’s own explanation of the paper

4. Megvii CVPR 2020 SLAM Challenge