On September 18, 2018, the World Artificial Intelligence Conference 2018 sub-forum “Visual Intelligence, Learning the Future”, a special session hosted by Qiniu Cloud, was held in the European Hall on the 5th floor of the Shanghai International Convention Center. Qi Tian, chief scientist of the Computer Vision Lab at Huawei’s Noah’s Ark Lab and professor of Computer Science at the University of Texas at San Antonio, spoke on “Challenges and Recent Progress in Pedestrian Re-identification”.
The following content is based on the on-site transcript of the lecture.

Distinguished guests, teachers, and students, it is a great honor to share our work with you here. The topic of my presentation is “Challenges and Recent Progress in Pedestrian Re-identification”. In today’s talk I will first introduce the background and challenges of pedestrian re-identification, then cover the latest progress in recent years along with our related work, and finally share some possible new research directions for the field.
Pedestrian re-identification has broad application prospects, including pedestrian retrieval, pedestrian tracking, street event detection, pedestrian behavior analysis, and so on. Behavior analysis also covers shoppers in a mall: estimating customers’ age and gender, which products they are interested in, how long they stay, and so on. This information helps retailers plan their sales strategies. Because the task is so important, more and more researchers and institutions have entered the field in recent years, a trend visible in the papers published through the Computer Vision Foundation: in 2013 only a handful of related papers appeared at the top vision conferences, but at the most recent conferences, 32 related papers were published at CVPR, the top conference on computer vision, and 19 at ECCV.

Pedestrian re-identification is a difficult problem, and solving it involves many challenges. These can be grouped into three categories: first, the need for large amounts of training data; second, the large variation in pedestrians’ visual appearance; and third, non-ideal scenarios.
The challenge of needing large amounts of training data shows up mainly in the following areas:
First, training data are limited. Compared with real-world data, the spatial and temporal coverage of the data collected for existing pedestrian re-identification datasets is very limited and local. The data scale is also small compared with other vision tasks: ImageNet, the large-scale image recognition dataset, has 1.25 million training images; Caltech, a pedestrian detection dataset, has 350,000 pedestrian bounding boxes; and COCO’s object detection training set has more than 123,000 images. By contrast, the datasets commonly used for pedestrian re-identification contain only some 30,000 pedestrian images.
Second, data acquisition is difficult. It is hard to collect pedestrian data across time spans, climates, and multiple scenes, and privacy concerns further hamper access to data.
Third, data annotation is difficult. For one thing, the amount of labeling is enormous: as you may know, annotating ImageNet, the massive image classification dataset, took about 48,000 crowdsourced workers nearly two years, so labeling is very costly in both time and money. For another, labeling itself is sometimes genuinely hard: it is easy to tell a dog from a cat, but it is difficult to distinguish, in video, two pedestrians of similar age and build who are wearing the same clothes.
The second challenge is that pedestrians appear in different poses, against complex backgrounds, under different lighting conditions, and from different shooting angles, all of which causes great trouble for pedestrian re-identification. The same pedestrian wearing different clothes, different hats or glasses, or a different hairstyle also poses huge problems.

Data is key. From the beginning, our team has worked to build standard datasets to drive pedestrian re-identification forward. At ICCV2015 we released Market-1501, the largest image-based pedestrian re-identification dataset at that time: six cameras, 1,501 labeled pedestrians, and more than 30,000 pedestrian images in total. It has become the benchmark dataset for pedestrian re-identification and has been cited more than 4,230 times since 2016. This year, in collaboration with Peking University, we presented a larger image-based dataset, MSMT17. To collect it, we deployed 15 cameras inside and outside a teaching building and recorded in the morning, at noon, and in the afternoon on four non-consecutive days; in the end more than 4,000 identities were collected and more than 120,000 pedestrian images were annotated. In addition, at ECCV2016 and CVPR2017 we publicly released MARS, a video-based pedestrian re-identification dataset, and PRW, an end-to-end pedestrian re-identification and retrieval dataset. So over the past few years we have built four pedestrian re-identification datasets, which have greatly promoted the development of the field.
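For readers who want to experiment with these datasets: Market-1501 encodes the person ID and camera ID directly in each image filename, so (identity, camera) labels can be recovered in a few lines. Below is a minimal, illustrative parser; `parse_market1501_name` is a hypothetical helper written for this article, not part of any official toolkit.

```python
import re

# Market-1501 filenames look like "0002_c1s1_000451_03.jpg":
# person ID 0002, camera 1, sequence 1, frame 000451, bounding box 03.
_PATTERN = re.compile(r"^(-?\d+)_c(\d+)")

def parse_market1501_name(filename: str):
    """Return (person_id, camera_id) from a Market-1501 image filename.

    A person ID of -1 marks "junk" images that the benchmark
    ignores during evaluation.
    """
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return int(match.group(1)), int(match.group(2))

print(parse_market1501_name("0002_c1s1_000451_03.jpg"))  # (2, 1)
```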
In addition to building larger and more realistic datasets, we can meet the challenge of large training-data demand by generating data. Data generation includes both traditional methods and deep learning methods. Traditional methods apply simple operations to the images, such as flipping, cropping, and constructing pyramid inputs, and are widely used. In recent years, deep learning methods have mainly been GAN-based. The first work applying generative adversarial networks (GANs) to pedestrian re-identification was published at ICCV2017: the authors used DCGAN to generate unlabeled pedestrian images for data augmentation. The problem with this work is that the pedestrian images DCGAN generates are of low quality. At CVPR2018, Ni Bingbing’s team at Shanghai Jiao Tong University used a conditional GAN to generate pedestrian images in different poses to enrich the pose variation in the training set, but the generated images again suffer from low quality. Also at CVPR2018, another team worked on camera style learning: for example, a real image from the first camera is transferred to the style of the sixth camera, or an image from the sixth camera is transferred to the style of the first. In this way the training set covers the camera styles in the scene more evenly, which yields better performance at test time.
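Returning to the traditional operations mentioned at the start of the previous paragraph, flipping and cropping are easy to reproduce. Here is a minimal sketch using torchvision, where the input size, padding, and normalization constants are typical choices for re-identification pipelines rather than values prescribed by any of the papers discussed here.

```python
from torchvision import transforms

# A typical traditional augmentation pipeline for re-ID training:
# random horizontal flips and random crops multiply the effective
# number of training views of each pedestrian.
train_transform = transforms.Compose([
    transforms.Resize((256, 128)),           # common re-ID input size
    transforms.RandomHorizontalFlip(p=0.5),  # flipping
    transforms.Pad(10),
    transforms.RandomCrop((256, 128)),       # cropping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```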
This year at CVPR2018 we proposed PTGAN (Person Transfer GAN), which targets cross-scene transfer. Suppose we have annotated training data in Beijing and want to use it in a scene in Shanghai: PTGAN can transfer the annotated data to the Shanghai scene, so that the transferred images look as if they had been shot in Shanghai. Training the pedestrian re-identification model on the transferred dataset then achieves better performance in the Shanghai scenario. PTGAN is built on two loss functions: style transfer and pedestrian preservation. The style-transfer term makes the style of the transferred image match the target scene as closely as possible, while the pedestrian-preservation term ensures that the pedestrians in the image do not change during the transfer. We have run experiments on different datasets and seen considerable performance improvements.
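To make the two loss terms concrete, here is a rough sketch of how they might combine, assuming a CycleGAN-style transfer loss and a foreground mask that marks the pedestrian region; the function names and the weight `lam` are illustrative assumptions, not taken from the PTGAN paper.

```python
import torch

def ptgan_identity_loss(real, transferred, fg_mask):
    """Pedestrian-preservation term: penalize changes inside the
    foreground (pedestrian) region only, so style transfer is free
    to restyle the background. `fg_mask` is a soft foreground mask
    in [0, 1] with shape (N, 1, H, W)."""
    return torch.mean(fg_mask * (transferred - real) ** 2)

def ptgan_total_loss(style_loss, id_loss_ab, id_loss_ba, lam=10.0):
    # Overall objective: a CycleGAN-style transfer loss plus the
    # weighted pedestrian-preservation losses in both transfer
    # directions. `lam` is an illustrative weight.
    return style_loss + lam * (id_loss_ab + id_loss_ba)
```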

Finally, the main way to deal with the challenge of non-ideal scenes is to detect and match human body parts. In ICCV2017 we proposed the Pose-Driven Convolution (PDC) method, which extracts fine-grained human body parts and corrects them. However, because it depends on very fine-grained parts, PDC is sensitive to occlusion and to errors in human keypoint detection. We therefore proposed the Global-Local Alignment Descriptor (GLAD) in MM17, which needs only three coarse-grained parts to achieve very good performance.
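As a simplified illustration of the global-local idea behind GLAD, the sketch below pools one global descriptor plus three coarse region descriptors (head, upper body, lower body) from a shared backbone and concatenates them. The fixed split ratios stand in for the keypoint-driven partition used in the actual method, so this is an assumption-laden approximation, not the published model.

```python
import torch
import torch.nn as nn
from torchvision import models

class GLADLikeDescriptor(nn.Module):
    """A rough sketch in the spirit of GLAD: one shared backbone
    yields a global descriptor plus descriptors for three coarse
    body regions, concatenated into the final representation."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep everything up to the final feature map.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                        # x: (N, 3, H, W)
        fmap = self.features(x)                  # (N, C, h, w)
        h = fmap.size(2)
        regions = [fmap,                         # global
                   fmap[:, :, : h // 5],         # head (illustrative ratio)
                   fmap[:, :, h // 5 : h // 2],  # upper body
                   fmap[:, :, h // 2 :]]         # lower body
        descs = [self.pool(r).flatten(1) for r in regions]
        return torch.cat(descs, dim=1)           # (N, 4 * C)
```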

As for future research directions in pedestrian re-identification, there are two aspects: data and methods. On the data side, we need to build larger and more realistic datasets, and we can also generate data with 3D-graphics-based methods. On the method side, we have so far used only visual information, but there is a lot of other information we can exploit in the real world: Wi-Fi access records, gait, GPS, and so on. Moreover, in practical applications pedestrian detection and re-identification are actually coupled and should be optimized within a unified framework. There is still little work in this area, and we will focus on this direction in the future.

Finally, I would like to briefly introduce Huawei’s Noah’s Ark Lab. Its research focuses on five areas: computer vision, natural language processing, decision making and reasoning, search and recommendation, and fundamental AI theory. Within computer vision, I mainly work on safe city, on-device vision, and related directions. The lab currently collaborates with more than 25 universities in ten countries, and it has branches in Shenzhen, Beijing, Shanghai, and Xi’an in China, and in Toronto, Silicon Valley, London, Paris, and Montreal overseas.