Author: Ren Xuqian, member of the Computer Vision Life community.

In visual SLAM, pose estimation is usually a recursive process: the pose of the current frame is computed from the pose of the previous frame, so errors are passed down frame by frame and accumulate. This is called cumulative error, or drift. An effective way to eliminate it is loop closure detection, which determines whether the robot has returned to a previously visited place; if a loop is detected, this information is passed to the back end for optimization. Loop closures provide constraints that are more compact and accurate than the accumulated frame-to-frame constraints, and they can enforce a topologically consistent trajectory and map. If closed loops can be detected and optimized, the results become considerably more accurate.

To detect loops, one could match every previous frame against the current frame and declare a loop whenever the match is good enough, but this makes the amount of computation far too large and matching far too slow, and without a good initial guess the number of candidate matches is enormous. Loop closure detection is therefore a hard problem in SLAM. Here we summarize several classical methods for reference.

Bag of Words (BoW)

Principle

Introduction: the most popular loop detection approach in existing SLAM systems combines feature points with a bag of words (e.g., ORB-SLAM, VINS-Mono). Bag-of-words methods preload a dictionary tree of visual words: each local feature descriptor extracted from an image is quantized, via the preloaded dictionary tree, into a word, and the dictionary contains all the words. Counting the words in the whole image yields a bag-of-words vector, and the distance between the bag-of-words vectors of two images represents how different the images are. During image retrieval, an inverted index can be used: first find the keyframes that share words with the current frame, compute the bag-of-words similarity between each of them and the current frame, discard the frames whose similarity is too low, and keep the rest as candidate keyframes, sorted from nearest to farthest bag-of-words distance [1].

The relationship between dictionary, words, and descriptors: the dictionary contains all the words, and each word corresponds to a cluster of similar descriptors.


Loop detection based on the bag-of-words model can therefore be divided into the following four steps [2]:

1. Feature extraction

2. Build a dictionary (a collection of all words)


3. Determine which words are present in a frame and form a bag-of-words vector (1 means the word is present, 0 means it is absent)


4. Compare the bag-of-words vectors of the two frames.

The following modules are introduced one by one:

Build a dictionary

Building the dictionary amounts to clustering the descriptors. The K-means algorithm can be used, or the bag-of-words model can be generated dynamically online from the features observed while exploring the environment [3].

(1) K-means algorithm

Given images collected offline, feature descriptors are extracted and clustered into a dictionary with K-means as follows:

1. Randomly select K center points from among the descriptors:


2. For each sample, compute its distance to every center point and assign it to the class of the nearest center.

3. Recalculate the center point of each class.

4. If the center points change little, the algorithm has converged; exit. Otherwise, return to step 2 and iterate.

Each resulting class is a word; a word is thus a cluster of descriptors that lie close together.
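To make steps 1-4 concrete, here is a minimal NumPy sketch of this clustering loop. The descriptor count, dimension, and K below are illustrative; real systems cluster millions of descriptors with optimized libraries.

```python
import numpy as np

def kmeans(descriptors, k, iters=50, tol=1e-4, seed=0):
    """Cluster feature descriptors into k words (minimal K-means sketch)."""
    rng = np.random.default_rng(seed)
    # 1. randomly pick k descriptors as the initial centers
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # 2. assign every descriptor to its nearest center
        dist = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 3. recompute each center as the mean of its members
        new_centers = np.array([
            descriptors[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)])
        # 4. stop when the centers barely move, otherwise iterate
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers
    return centers, labels

# e.g. 1000 SIFT-like 128-D descriptors clustered into 10 words
words, assignment = kmeans(np.random.rand(1000, 128), k=10)
```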

Other similar methods include hierarchical clustering, K-means++ and so on.

K-means++ is a modification of K-means. Its main improvement lies in the initialization of the center points: instead of generating them randomly as in the original algorithm, it uses a policy that pushes the k initial centers as far from one another as possible. The resulting centers are more representative, which benefits the subsequent classification [8].

The center-point initialization in K-means++ proceeds as follows:

1. Randomly select a point from n samples as the first center point;

2. Compute the distance between each sample point and its nearest already-chosen center, and select a new center according to the policy (the standard policy samples a point with probability proportional to that squared distance, favoring distant points);

3. Repeat 2 until k center points are obtained.
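A minimal sketch of this seeding, assuming the standard K-means++ policy of drawing the next center with probability proportional to the squared distance D(x)^2:

```python
import numpy as np

def kmeanspp_init(samples, k, seed=0):
    """K-means++ seeding: spread the k initial centers far apart (sketch)."""
    rng = np.random.default_rng(seed)
    # 1. pick the first center uniformly at random
    centers = [samples[rng.integers(len(samples))]]
    while len(centers) < k:
        # 2. squared distance from each sample to its nearest chosen center
        d2 = np.min([np.sum((samples - c) ** 2, axis=1) for c in centers], axis=0)
        # sample the next center with probability proportional to D(x)^2
        centers.append(samples[rng.choice(len(samples), p=d2 / d2.sum())])
    # 3. k centers obtained; hand them to the usual K-means iterations
    return np.array(centers)
```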

(2) Generating the bag-of-words model dynamically online:

The traditional BoW model uses a dictionary generated offline. A more flexible approach is to create the dictionary dynamically, so that features absent from the training set can still be identified effectively. Typical papers include [4], [5]. In [4], the bag-of-words model from image recognition is extended, and Bayesian filtering is used to estimate the loop-closure probability. Loop detection faces the difficulty of recognizing regions of the already-built map, while global localization faces the difficulty of retrieving the robot's position from an existing map. When a word is found in the current image, the TF-IDF scores of the images that have previously seen this word are updated. The method builds the dictionary dynamically from the features encountered while exploring, so environments whose features are not represented in any training set can still be recognized effectively.

The dictionary tree

Because the dictionary is large, looking up the matching word one by one would be very expensive, so a k-ary tree can be used to represent the dictionary. The dictionary tree is built as follows [6]:

  1. Offline, extract local descriptors from a large number of training images of the target application scenario (each image may yield many descriptors).

  2. Cluster these descriptors into k classes with K-means.

  3. For each node of the first layer, cluster its descriptors into k classes again with K-means to obtain the next layer.

  4. Repeat until the depth of the clustering reaches the threshold d. The leaf nodes are the words, and the intermediate nodes are the cluster centers used during lookup.

(Source: 14 Lectures on Visual SLAM [7])
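The build process can be sketched as recursive K-means. The sketch below reuses the `kmeans` function from the earlier sketch; the class name, branching factor, depth, and data are illustrative:

```python
import numpy as np

class Node:
    """One node of a k-ary dictionary tree built by recursive K-means."""
    def __init__(self, center=None):
        self.center = center   # cluster center this node represents
        self.children = []     # k children; an empty list marks a leaf (word)
        self.word_id = None    # assigned to leaves only

def build_tree(descriptors, k, depth, counter):
    """Recursively cluster descriptors k ways until the depth threshold."""
    node = Node()
    if depth == 0 or len(descriptors) <= k:   # stop: this node is a word
        node.word_id = counter[0]
        counter[0] += 1
        return node
    centers, labels = kmeans(descriptors, k)  # kmeans from the sketch above
    for i in range(k):
        child = build_tree(descriptors[labels == i], k, depth - 1, counter)
        child.center = centers[i]
        node.children.append(child)
    return node

def lookup(root, d):
    """Descend to the nearest child at each level until a leaf is reached."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: np.linalg.norm(d - c.center))
    return node.word_id

counter = [0]
tree = build_tree(np.random.rand(20000, 32), k=10, depth=3, counter=counter)
```

With branching factor k and depth d the tree can hold up to k^d words, while looking up one descriptor costs only about k·d distance computations instead of a scan over all words.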

A major drawback of the bag-of-words approach, however, is that a trained dictionary tree must be loaded in advance. To have good discriminative power this dictionary generally contains a very large number of words, otherwise retrieval quality suffers; but that makes the dictionary file large, which is a heavy burden for mobile applications. To address this, the preloaded dictionary can be avoided by building a k-d tree dynamically: a global k-d tree is maintained as keyframes are added, with each feature point inserted into the tree frame by frame. During image retrieval, each query descriptor finds its nearest node, and each keyframe is voted for according to the matching results; the number of votes serves as the frame's score, producing a candidate set of keyframes similar to the current frame [1].
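A minimal sketch of this vote-based retrieval. SciPy's k-d tree is static, so the sketch rebuilds it whenever a keyframe is added; a real implementation would insert into a dynamic k-d tree incrementally:

```python
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

class KDTreeRetriever:
    """Retrieve loop-closure candidates by per-feature voting (sketch)."""
    def __init__(self):
        self.desc, self.owner, self.tree = [], [], None

    def add_keyframe(self, frame_id, descriptors):
        # add every feature point of the keyframe to the global tree
        for d in descriptors:
            self.desc.append(d)
            self.owner.append(frame_id)
        self.tree = cKDTree(np.asarray(self.desc))   # rebuild (static tree)

    def query(self, descriptors, top_n=5):
        # each query descriptor votes for the keyframe that owns its
        # nearest node; the vote count is the keyframe's score
        _, idx = self.tree.query(np.asarray(descriptors), k=1)
        votes = Counter(self.owner[i] for i in idx)
        return votes.most_common(top_n)              # candidate keyframes
```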

Bag-of-words vector

The similarity between a keyframe and the query frame is measured by the distance between their bag-of-words vectors. Assume the set of local descriptors of an image $I$ is [6]

$$\{d_1, d_2, \dots, d_N\}$$

For each descriptor $d_i$, find the nearest word in the dictionary tree; denote its word id by $k_i$ and the corresponding weight by $\eta_{k_i}$. The search starts at the root of the dictionary tree, proceeds to the nearest node at each level, and continues down level by level until a leaf node is reached. Let the vocabulary size be $W$; the lookup in the dictionary tree then defines a mapping

$$v = \Phi(I) \in \mathbb{R}^{W}$$

If two descriptors map to the same word, their weights are added, yielding a vector of constant length:

$$v = (v_1, v_2, \dots, v_W)^{T}$$

where

$$v_k = \sum_{i\,:\,k_i = k} \eta_{k}$$

and $v_k = 0$ for words that no descriptor maps to. In this way, when searching for keyframes by word, one need not traverse all keyframes; it suffices to look up, in the inverted index, only the keyframes listed under the words that the query frame's descriptors map to.
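A minimal sketch of such an inverted index (word id → keyframes containing that word); the class and method names here are chosen for illustration:

```python
from collections import defaultdict, Counter

class InvertedIndex:
    """Only keyframes sharing at least one word with the query are touched."""
    def __init__(self):
        self.frames_with_word = defaultdict(set)   # word id -> keyframe ids

    def add_keyframe(self, frame_id, word_ids):
        for w in set(word_ids):
            self.frames_with_word[w].add(frame_id)

    def candidates(self, word_ids):
        # count shared words per keyframe instead of scanning all keyframes
        votes = Counter()
        for w in set(word_ids):
            votes.update(self.frames_with_word[w])
        return votes.most_common()   # (keyframe id, shared-word count)
```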

Similarity calculation

Some words are more useful than others for recognizing whether two images show the same place, while other words contribute little to recognition. To distinguish the importance of words, each word can be assigned a specific weight; a common scheme is TF-IDF. It combines TF (term frequency: how often a word occurs in an image) and IDF (inverse document frequency: how rarely a word occurs in the corpus gathered while building the dictionary) to evaluate how discriminative a word is for a document, or for a set of documents in a corpus.

For a single image, suppose the total number of word occurrences in it is $n$, and a leaf node (word) $w_i$ occurs $n_i$ times. Then the TF is [7]:

TF: the more often a feature appears in a single image, the more discriminative it is for that image.

$$\mathrm{TF}_i = \frac{n_i}{n}$$

When building the dictionary, the IDF is also considered. Suppose the total number of features is $n$, and leaf node $w_i$ contains $n_i$ of them. Then the IDF is:

IDF: the more rarely a word appears in the dictionary, the more distinctive it is when it does appear.

$$\mathrm{IDF}_i = \log\frac{n}{n_i}$$

The weight $\eta_i$ of word $w_i$ is the product of TF and IDF:

$$\eta_i = \mathrm{TF}_i \times \mathrm{IDF}_i$$
Taking the weights into account, the feature points of an image $A$ correspond to many words, which together make up its bag of words:

$$A = \{(w_1, \eta_1), (w_2, \eta_2), \dots, (w_N, \eta_N)\} \;\triangleq\; v_A$$

A single vector $v_A$ thus represents an image. The vector is sparse; its non-zero entries indicate which words the image contains, and their values are the TF-IDF weights.

Given the vectors $v_A$ and $v_B$ of two images $A$ and $B$, their difference can be computed with the $L_1$ norm:

$$\|v_A - v_B\|_1 = \sum_{w_i \in A \setminus B} |v_{A,i}| + \sum_{w_i \in B \setminus A} |v_{B,i}| + \sum_{w_i \in A \cap B} |v_{A,i} - v_{B,i}|$$

where the first sum runs over words that appear only in $A$, the second over words that appear only in $B$, and the third over words that appear in both $A$ and $B$. The smaller this distance, the greater the similarity; a common similarity score is

$$s(v_A, v_B) = 1 - \frac{1}{2}\,\Big\|\frac{v_A}{\|v_A\|_1} - \frac{v_B}{\|v_B\|_1}\Big\|_1$$

and when the score $s$ is large enough, the two frames can be judged to form a loop.

In addition, using only the absolute similarity of two images does not help much when the whole environment already looks alike. A prior similarity $s(v_t, v_{t-\Delta t})$ can therefore be used: the similarity between the keyframe at the current moment and the keyframe at the previous moment. Other scores are then normalized against this value:

$$s(v_t, v_{t_j})' = \frac{s(v_t, v_{t_j})}{s(v_t, v_{t-\Delta t})}$$

One can then stipulate that if the similarity between the current frame and some earlier keyframe exceeds 3 times the similarity between the current frame and the previous keyframe, a loop closure is considered possible.

Besides TF-IDF, other weighting schemes exist, both local (Squared TF, Frequency TF, Binary TF, BM25 TF, etc.) and global (Probabilistic IDF, Squared IDF).
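The TF-IDF weighting, the L1 score, and the 3x relative check described above can be sketched as follows, assuming an IDF table precomputed from the training corpus as IDF_i = log(n / n_i):

```python
import numpy as np

def tfidf_bow(word_ids, n_words, idf):
    """TF-IDF weighted, L1-normalized bag-of-words vector of one image."""
    v = np.zeros(n_words)
    ids, counts = np.unique(word_ids, return_counts=True)
    tf = counts / counts.sum()       # TF_i = n_i / n for this image
    v[ids] = tf * idf[ids]           # eta_i = TF_i * IDF_i
    s = v.sum()
    return v / s if s > 0 else v     # renormalize so scores lie in [0, 1]

def score(va, vb):
    """s = 1 - 0.5 * ||va - vb||_1 for L1-normalized vectors."""
    return 1.0 - 0.5 * np.abs(va - vb).sum()

def is_loop_candidate(v_cur, v_cand, v_prev, ratio=3.0):
    """Accept only if the candidate scores ratio times better than the
    prior similarity between the current and the previous keyframe."""
    return score(v_cur, v_cand) >= ratio * score(v_cur, v_prev)
```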

Verification

Another problem with the bag-of-words model is that it is not perfectly accurate and produces false positives, so the retrieval results of loop detection must be verified by other means. If tracking has been lost completely, relocalization is required to supply the pose of the current frame. During relocalization, spatial information is used for screening: PnP can be used for a posteriori verification, or conditional random fields can be applied; this verification removes images that are not geometrically consistent with the reference image [3]. Once accurate image matches are obtained, the camera pose can be solved from the matching results.

If tracking is normal and a previously visited scene is recognized, loop closure verification is needed before adding the new constraint. Bag-of-words loop detection only cares about whether words are present, not about their order, which easily causes perceptual aliasing; moreover, the bag of words relies entirely on appearance without using any geometric information, so images that merely look similar are easily accepted as loops. A verification step is therefore added, which mainly considers the following three points [1]:

1) A loop should not be closed with frames that are too close in time. If two keyframes are too close, their similarity is inevitably high, and the detected loop carries little meaning; frames used for loop detection should be sparse, mutually distinct, and cover the whole environment [7]. To avoid false loops, a loop is accepted only when candidates near the same historical pose are detected several times in a row (3 times in ORB-SLAM). The loop candidate frames must additionally be feature-matched, with enough matched points to support the loop.

2) The closure result should be consistent over consecutive frames of a certain length. If a loop is successfully detected between frame 1 and frame $n$, then frames $n+1$, $n+2$, and so on will most likely also loop back to frame 1. Confirming the loop between frame 1 and frame $n$ helps trajectory optimization, but the subsequent loops of frames $n+1$, $n+2$, ... with frame 1 help much less, since the accumulated error has already been eliminated using the earlier information; more loops of the same kind do not add more information. Therefore, loops that are "close" to each other are grouped together, so that the algorithm does not repeatedly detect loops of the same kind.

3) The closure result should be spatially consistent. That is, feature matching is performed on the two frames of the detected loop to estimate the camera motion, and this motion is then inserted into the previous pose graph to check whether it differs strongly from the earlier estimates.

Classic bag-of-words source code

DBow

The DBow library is an open-source C++ library for indexing images and converting them into a bag-of-words representation. It implements a hierarchical tree for approximate nearest-neighbor search in image feature space and for creating visual vocabularies. DBow also implements an image database based on an inverted file structure for indexing images and fast querying. DBow does not require OpenCV (except for the demo applications), but it is fully compatible with it.

Source code address: github.com/dorian3d/DB…

DBOW2

DBoW2 is an improved version of the DBow library. DBoW2 implements an image database with inverted and direct files for indexing images, enabling fast queries and feature comparisons. The main differences from the previous DBow library are:

  • The DBoW2 class is templated, so it can be used with any type of descriptor.
  • DBoW2 can use ORB or BRIEF descriptors directly.
  • DBoW2 adds a direct file to the image database to enable fast feature comparison; this is used by DLoopDetector.
  • DBoW2 no longer uses binary formats. Instead, it uses the OpenCV storage system to save vocabularies and databases. This means these files can be stored as plain text in YAML format for greater compatibility, or compressed with gunzip (.gz) to reduce disk usage.
  • Some code has been rewritten to optimize speed. The DBoW2 interface has been simplified.
  • For performance reasons, DBoW2 does not support stop words.

DBoW2 requires OpenCV and the Boost::dynamic_bitset class for the BRIEF version.

DBoW2 and DLoopDetector have been tested on several real datasets: converting an image's BRIEF features into a bag-of-words vector takes about 3 ms, and finding a match in a database of more than 19,000 images takes about 5 ms.

Source code address: github.com/dorian3d/DB…

DBoW3

DBoW3 is an improved version of the DBoW2 library. The main differences from the previous DBoW2 library are:

  • DBoW3 only requires OpenCV. The DBoW2 dependency on DLib has been removed.
  • DBoW3 can work with binary and floating point descriptors. There is no need to reimplement any classes for any descriptor.
  • DBoW3 compiles on Linux and Windows.
  • Some code has been rewritten to optimize speed. The DBoW3 interface has been simplified.
  • It uses binary files. Binary files load/save 4-5 times faster than YML, and they can also be compressed.
  • It is compatible with DBoW2's YML files.

Source code address: github.com/rmsalinas/D…

FBOW

FBOW (Fast Bag of Words) is an extremely optimized version of the DBow2/DBow3 libraries. The library is highly optimized, using AVX, SSE, and MMX instructions to speed up bag-of-words creation. FBOW is about 80 times faster than DBoW2 at loading a vocabulary (see the tests directory and try it), and about 6.4 times faster at converting an image into a bag of words on a machine with AVX instructions.

Source code address: github.com/rmsalinas/f…

FAB-MAP

FAB-MAP is a probabilistic approach to the appearance-based place recognition problem. The system is not limited to localization: it can decide that a new observation comes from a previously unseen place and augment its map accordingly, making it in effect a SLAM system in appearance space. The probabilistic approach allows perceptual aliasing in the environment to be handled explicitly: identical-looking but nondescript observations have only a small probability of coming from the same place. This is achieved by learning a generative model of place appearance. By splitting the learning problem into two parts, new place models can be learned online from a single observation of a place. The complexity of the algorithm is linear in the number of places in the map, which makes it particularly suitable for online loop closure detection on mobile robots.

Source code address: github.com/arrenglover…

Implementations of the bag-of-words model in V-SLAM

C++ version

Blog introduction: nicolovaligi.com/bag-of-word…

Source code address: github.com/nicolov/sim…

Python version

Loop Closure Detection using Bag of Words

Source code address: github.com/pranav9056/…

Matlab version

Blog: www.jaijuneja.com/blog/2014/1…

Source code address: github.com/jaijuneja/t…

ORB-SLAM

Source code address: github.com/raulmur/ORB…

ORB-SLAM2

Source code address: github.com/raulmur/ORB…

VINS-Mono

github.com/HKUST-Aeria…

Kintinuous

github.com/mp3guy/Kint…

References

[1] Bao Hujun, Zhang Guofeng, Qin Xueying. Augmented Reality: Principles, Algorithms and Applications [M]. Beijing: Science Press, 2019: 114-115.

[2] zhuanlan.zhihu.com/p/45573552

[3] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: a survey," Artif. Intell. Rev., vol. 43, no. 1, pp. 55–81, Jan. 2015.

[4] A. Angeli, S. Doncieux, J.-A. Meyer, and D. Filliat, "Real-time visual loop-closure detection," in 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, 2008, pp. 1842–1847.

[5] T. Botterill and S. Mills, "Bag-of-words-driven single camera simultaneous localisation and mapping," p. 28.

[6] www.zhihu.com/question/49…

[7] Gao Xiang, Zhang Tao. 14 Lectures on Visual SLAM [M]. Publishing House of Electronics Industry, 2017: 306-316.

[8] blog.csdn.net/lwx30902516…


Random Ferns method

Principle

This relocalization method compresses and encodes each camera frame and efficiently evaluates the similarity between different frames; random ferns are used for the compression coding. In this keyframe-based relocalization method, frames are encoded with random ferns: given an input RGB-D image, simple binary tests are evaluated at random positions in the image to encode the whole frame. Each fern produces a small block of code, and concatenating the blocks gives a compact encoding of the camera frame. Each code block points to a row of a code table stored as a hash table; the row associated with an identical code block stores the IDs of the keyframes that produced it.

As new images are acquired, if the dissimilarity exceeds a threshold, the ID of the new frame is added to the corresponding rows. When recovering tracking, the poses of the most similar keyframes are retrieved from the hash table. The dissimilarity between a new frame and all previously encoded frames is measured by the block-wise Hamming distance (BlockHD):


$$\mathrm{BlockHD}\big(b^I, b^J\big) = \frac{1}{m}\sum_{k=1}^{m}\big[\, b_k^I \neq b_k^J \,\big]$$

Each term is 0 when the two code blocks are identical and 1 when they differ in at least one bit, so BlockHD counts the proportion of differing code blocks. The block length directly affects the precision/recall behavior of BlockHD when searching for similar frames. To determine whether an image is sufficiently similar, the minimum BlockHD is needed; for each new frame $I$, compute

$$\kappa_I = \min_{J}\, \mathrm{BlockHD}\big(b^I, b^J\big)$$

$\kappa_I$ represents how much new information the frame provides: a low value means the frame is similar to previously stored frames, while a high value indicates the image was taken from a new viewpoint and should be saved as a keyframe. With this measure, tracked frames can be evaluated automatically to decide which ones to keep as keyframes: comparing $\kappa_I$ with an implementation-defined threshold $\tau$ determines whether a new frame is added to the hash table or discarded. This way of selecting keyframes and retrieving poses effectively reduces 3D reconstruction time and is applicable to current open-source SLAM algorithms.
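A minimal sketch of the idea (not the paper's exact implementation): each fern evaluates a few binary pixel tests, the per-fern code blocks are concatenated into the frame encoding, and BlockHD is the fraction of differing blocks. The fern count, tests per fern, and thresholds below are illustrative:

```python
import numpy as np

class RandomFerns:
    """Encode a frame with m ferns of s binary pixel tests each (sketch)."""
    def __init__(self, img_shape, m=32, s=4, seed=0):
        rng = np.random.default_rng(seed)
        h, w = img_shape
        # each test compares one random pixel against a random threshold
        self.ys = rng.integers(0, h, (m, s))
        self.xs = rng.integers(0, w, (m, s))
        self.th = rng.integers(0, 256, (m, s))

    def encode(self, img):
        """One s-bit code block per fern; together they encode the frame."""
        bits = img[self.ys, self.xs] > self.th                 # (m, s) bools
        return (bits * (1 << np.arange(bits.shape[1]))).sum(axis=1)

def block_hd(code_a, code_b):
    """Fraction of differing code blocks between two frame encodings."""
    return np.mean(code_a != code_b)

ferns = RandomFerns((480, 640))
c1 = ferns.encode(np.random.randint(0, 256, (480, 640)))
c2 = ferns.encode(np.random.randint(0, 256, (480, 640)))
print(block_hd(c1, c2))   # low -> similar frame; high -> new keyframe
```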

Code

Applications of random ferns in V-SLAM

KinectFusion

github.com/Nerei/kinfu…

ElasticFusion

github.com/mp3guy/Elas…

PTAM

The relocalization method in PTAM is similar to random ferns. When constructing a keyframe, PTAM shrinks the image and applies Gaussian blur to generate a thumbnail, which serves as a descriptor of the whole image. During image retrieval, this thumbnail is used to compute the similarity between the current frame and the keyframes. The main disadvantage of this method is that the results deviate strongly under viewpoint changes; it is not as robust as methods based on invariant features.

github.com/Oxford-PTAM…
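A minimal OpenCV sketch of this whole-image descriptor; the thumbnail size and blur sigma are illustrative, not PTAM's exact values:

```python
import cv2
import numpy as np

def thumbnail(gray_img, size=(40, 30), sigma=2.5):
    """PTAM-style descriptor: shrink the image, then Gaussian-blur it."""
    small = cv2.resize(gray_img, size, interpolation=cv2.INTER_AREA)
    return cv2.GaussianBlur(small, (0, 0), sigma).astype(np.float32)

def ssd(a, b):
    """Sum of squared differences: smaller means more similar."""
    return float(np.sum((a - b) ** 2))

# relocalization: pick the keyframe whose blurred thumbnail is closest
# best = min(keyframes, key=lambda kf: ssd(thumbnail(current), kf.thumb))
```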

References

[1] B. Glocker, S. Izadi, J. Shotton, and A. Criminisi, “Real-time RGB-D camera relocalization,” in 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Adelaide, Australia, 2013, pp. 173–179.

[2] B. Glocker, J. Shotton, A. Criminisi, and S. Izadi, “Real-Time RGB-D Camera Relocalization via Randomized Ferns for Keyframe Encoding,” IEEE Trans. Visual. Comput. Graphics, vol. 21, no. 5, pp. 571–583, May 2015.

[3] blog.csdn.net/fuxingyin/a…


Methods based on deep learning

Image retrieval based on deep learning is a global retrieval approach: it requires a large amount of data for pre-training, but it tolerates scene changes well. Some end-to-end camera pose estimation methods have achieved good results. PoseNet, the pioneering work combining deep learning with visual localization, uses a neural network to regress the 6-DoF camera pose directly from an image. Compared with traditional visual localization, it avoids the complex image-matching process and does not need to solve for the camera pose iteratively, but the input image must come from the training scene. Building on this, later work added the reprojection error to the loss function to further improve the accuracy of pose estimation. Similarly, MapNet uses a traditional method to compute the relative pose between two images, compares it with the relative pose predicted by the network to obtain a relative pose error, and adds this error to the network's loss function, which makes the estimated camera poses smoother. MapNet can also run pose-graph optimization over several consecutive frames, making the final camera pose estimates more accurate.

Supervised methods

These mostly use Bolei Zhou's Places365 dataset.

Places365 is the latest subset of the Places2 database. Places365 comes in two versions: Places365-Standard and Places365-Challenge. The train set of Places365-Standard contains approximately 1.8 million images from 365 scene categories, with at most 5,000 images per category. Various CNNs trained on Places365-Standard have been released (see below). The Places365-Challenge train set adds 6.2 million images to those of Places365-Standard (about 8 million images in total), with at most 40,000 images per category. Places365-Challenge was used for the 2016 Places2 Challenge, held in conjunction with the joint ILSVRC and COCO workshop at ECCV 2016.

The data for both Places365-Standard and Places365-Challenge are published on the Places2 website.

Pre-trained CNN models on Places365-Standard:

  • AlexNet-places365
  • GoogLeNet-places365
  • VGG16-places365
  • VGG16-hybrid1365
  • ResNet152-places365
  • ResNet152-hybrid1365

Source code address: github.com/CSAILVision…

Unsupervised methods

Principle of CALC

CALC is an unsupervised deep neural network approach for loop detection in large-scale real-time SLAM that improves detection performance. The method builds an autoencoder structure. Images of the same place taken at different times may lead to poor localization because of changes in viewpoint, illumination, weather, dynamic objects, and other factors. Convolutional neural networks perform vision-based classification tasks effectively, and embedding a CNN in the system can effectively identify similar images in scene recognition. However, traditional CNN-based methods sometimes suffer from weak feature extraction, slow queries, and large training-data requirements. CALC is a lightweight, real-time, fast deep-learning architecture that requires few parameters and can be used for SLAM loop detection or any other place recognition task, even on resource-limited systems.

The model maps high-dimensional raw data into a low-dimensional descriptor space with rotation invariance. Before training, every image of the sequence is warped with a random projective transformation and rescaled to 120x160 to generate image pairs, in order to capture the extreme viewpoint changes that occur during motion. Some images are then randomly selected to compute HOG descriptors; the fixed-length HOG descriptor helps the network learn the scene geometry better. The HOG descriptors of the blocks of each training image are stacked and stored; the stack's dimension is determined by the block size and the dimension of each HOG operator. The network has two convolutional layers with pooling layers, one pure convolutional layer, and three fully connected layers, with ReLU as the activation of the convolutional layers. In this architecture, the projective warping and HOG extraction are computed only once for the whole training set, and the results are written to a database for training. During training the batch size N is set to 1, and only the layers in the boxed region of the paper's figure are used.
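For illustration, here is a PyTorch sketch of a CALC-style convolutional autoencoder: the encoder compresses a 120x160 grayscale image into a compact descriptor, and the decoder regresses a fixed-length HOG vector rather than raw pixels. The layer sizes, descriptor length, and HOG dimension here are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CalcLikeAutoencoder(nn.Module):
    """Unsupervised loop-closure descriptor network (illustrative sketch)."""
    def __init__(self, code_dim=1064, hog_dim=3648):  # both dims assumed
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 5, stride=2, padding=2), nn.ReLU(),   # -> 60x80
            nn.MaxPool2d(2),                                      # -> 30x40
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),  # -> 15x20
            nn.Flatten(),
            nn.Linear(16 * 15 * 20, code_dim),   # compact scene descriptor
        )
        # trained to regress the HOG of the un-warped image from the code
        # of the randomly warped one, encouraging viewpoint robustness
        self.decoder = nn.Sequential(nn.ReLU(), nn.Linear(code_dim, hog_dim))

    def forward(self, x):
        code = self.encoder(x)          # used alone at query time
        return code, self.decoder(code)

model = CalcLikeAutoencoder()
code, hog_pred = model(torch.randn(1, 1, 120, 160))
# loop closure: compare `code` against the stored codes of keyframes
```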

References

[1] N. Merrill and G. Huang, “Lightweight Unsupervised Deep Loop Closure,” Robotics: Science and Systems XIV, 2018.

Code

CALC

Principle: a convolutional autoencoder for loop detection. The code is split into two modules: TrainAndTest, used to train and test models, and DeepLCD, a C++ library for online loop detection or image retrieval.

Source code address: github.com/rpng/calc
