This article summarizes the main ideas and representative work in Semantic SLAM from recent years. My knowledge is limited, so corrections are welcome.

Semantic SLAM

Introduction

So far, mainstream SLAM schemes [[1]](http://webdiis.unizar.es/~raulmur/orbslam/) have been based on pixel-level feature points; more specifically, they often use only corners or edges extracted from the image as landmarks. Humans, by contrast, infer camera motion from the motion of objects in the image, not from specific pixels.

Semantic SLAM is the family of schemes in which researchers attempt to utilize object-level information. Driven by deep learning, Semantic SLAM has made great progress and become a relatively independent branch. In terms of methods (rather than sensors), its position in the overall SLAM field is shown below:

At present, "semantic" usually refers to applying neural-network-based techniques such as semantic segmentation, object detection, and instance segmentation in SLAM, mainly for feature point selection and camera pose estimation. More broadly, end-to-end pose estimation from images, building labeled point clouds from segmentation results, scene recognition, feature extraction, and loop closure detection with neural networks can all be referred to as Semantic SLAM [[2]](https://zhuanlan.zhihu.com/p/58648284).

The combination of semantics and SLAM manifests in the following two ways [[3]](https://book.douban.com/subject/27028215/):

  • SLAM helps semantics.

Both detection and segmentation tasks require large amounts of training data. In SLAM, since we can estimate the motion of the camera, the position changes of each object in the image can also be predicted, generating large amounts of new data that provide more optimization constraints for semantic tasks and save the cost of manual annotation.

  • Semantics help SLAM.

On the one hand, semantic segmentation labels every image during motion, and traditional SLAM then maps the labeled pixels into 3D space to obtain a labeled map. This provides a high-level map that facilitates autonomous robot understanding and human-machine interaction.

On the other hand, semantic information can also bring more optimization constraints to loop closure detection and Bundle Adjustment, improving localization accuracy.

Work that implements only the former is often called Semantic Mapping, while work that also achieves the latter is considered true Semantic SLAM.


Development directions

This article introduces the main ideas of Semantic Mapping and of Real Semantic SLAM in turn.

Semantic Mapping

This kind of work requires dense or semi-dense feature points (otherwise mapping is meaningless), so RGB-D SLAM schemes are often used, or the semi-dense monocular LSD-SLAM scheme [[4]](https://vision.in.tum.de/research/vslam/lsdslam).

There are two Mapping modes:

  • The first mode maps the result of 2D semantic segmentation, i.e., labeled pixels, onto the 3D point cloud.

Researchers try to use the information generated by SLAM, especially the camera pose, to improve semantic segmentation performance. One example is the Recursive Bayes method of SemanticFusion [[5]](https://arxiv.org/abs/1609.05130): based on SLAM's estimate of pixel motion, the semantic classification probability of each pixel in the current frame is multiplied by the classification probability at its corresponding position in the previous frame to give the final probability. That is, pixel probabilities are multiplied along successive frames, enhancing the semantic segmentation result.
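As a minimal sketch of this update (assuming per-pixel class probability maps that have already been warped into alignment by the SLAM pose estimate; function names are mine, not SemanticFusion's API):

```python
import numpy as np

def recursive_bayes_update(prev_probs, curr_probs):
    """Fuse per-pixel class probabilities across frames by element-wise
    multiplication followed by renormalization, as in the Recursive Bayes
    scheme described above. Pixel correspondence between the two frames
    is assumed to come from SLAM's motion estimate.

    prev_probs, curr_probs: (H, W, C) arrays of per-pixel class probabilities.
    """
    fused = prev_probs * curr_probs
    fused /= fused.sum(axis=-1, keepdims=True) + 1e-12  # renormalize per pixel
    return fused
```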

This method is also used in the monocular-camera work [[6]](https://arxiv.org/abs/1611.04144); its overall framework is described below.

LSD-SLAM + DeepLab-v2 (semantic segmentation)

Pipeline: input -> select keyframes and refine them (non-keyframes are used to enhance depth estimation) -> 2D semantic segmentation -> semantic optimization + 3D reconstruction

① To ensure speed, semantic segmentation is performed only on keyframes.

② The other frames are compared against keyframes with small-baseline stereo [[7]](https://ieeexplore.ieee.org/document/6751290) to refine the keyframes' depth estimates.

③ Recursive Bayes is used to enhance the semantic segmentation.

④ A Conditional Random Field (CRF) is used to optimize the 3D reconstruction, as in SemanticFusion.

  • The second mode builds the map with Objects as the unit [[8]](https://arxiv.org/pdf/1609.07849.pdf) [[9]](https://arxiv.org/pdf/1804.09194.pdf). A semantic map containing individual objects is more valuable than a pile of voxels labeled with categories.

This part focuses on how to do Data Association, i.e., tracking already-identified Objects and finding new ones; [[8]](https://arxiv.org/pdf/1609.07849.pdf) is described below as an example.

A dense point cloud is constructed using RGB-D input and ORB-SLAM2.

For keyframes, SSD detects multiple Objects, and an unsupervised 3D segmentation method [[10]](https://ieeexplore.ieee.org/document/7759618) assigns a point cloud to each detected Object, which is then stored.

Data association: after obtaining a set of segmentation results (Objects with corresponding point clouds), stored Objects are looked up by the Euclidean distance between point-cloud centroids. If more than 50% of the point pairs are within a threshold distance (2 cm), it is considered a match to an existing Object; otherwise it is treated as a new Object and stored.
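A minimal sketch of this matching rule under assumed data structures (NumPy point arrays; thresholds taken from the text, names my own, not the paper's API):

```python
import numpy as np

def associate_object(new_cloud, stored_objects, dist_thresh=0.02, ratio=0.5):
    """Data-association sketch following the rule above: visit stored
    Objects in order of centroid distance, then accept a match if more
    than `ratio` of the point pairs are within `dist_thresh` (2 cm).

    new_cloud: (N, 3) array; stored_objects: list of (M, 3) arrays.
    Returns the index of the matched Object, or None if it is new.
    """
    centroid = new_cloud.mean(axis=0)
    order = np.argsort([np.linalg.norm(obj.mean(axis=0) - centroid)
                        for obj in stored_objects])
    for idx in order:
        obj = stored_objects[idx]
        # Distance from each new point to its nearest stored point
        # (brute force for brevity; a KD-tree would be used in practice).
        d = np.linalg.norm(new_cloud[:, None, :] - obj[None, :, :], axis=-1)
        nearest = d.min(axis=1)
        if (nearest < dist_thresh).mean() > ratio:
            return idx        # matched an existing Object
    return None               # no match: treat as a new Object and store it
```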

When two point clouds are matched to the same Object, their classification probabilities (confidences) are directly added. This is similar to the Recursive Bayes method above: the multi-view information provided by SLAM is used to enhance the segmentation result.

(Note: the Related Work section of this paper is well written.)

Real Semantic SLAM

This part is the focus of this article. By comparison, true Semantic SLAM (i.e., semantic mapping and SLAM localization reinforcing each other) appeared relatively late (basically after 2017).

In Bundle Adjustment (BA), we jointly optimize the camera poses and the 3D point coordinates to minimize the total reprojection error between the 3D points projected onto the 2D images and the actual observations (over multiple cameras and multiple feature points).
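For reference, a standard BA objective can be written as follows, where $T_k$ are camera poses, $X_i$ are 3D point coordinates, $\pi$ is the projection function, and $z_{k,i}$ is the observed pixel of point $i$ in camera $k$ (notation mine):

```latex
\min_{\{T_k\},\,\{X_i\}} \; \sum_{k}\sum_{i} \left\| \pi(T_k, X_i) - z_{k,i} \right\|^2
```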

So how do you integrate semantic information?

  • Idea 1: After the same 3D point is reprojected into different frames, its semantics should be consistent.

This is again a reprojection optimization problem, which can be added to the BA formula as an additional term to strengthen the optimization objective. The key is how to quantify the semantic reprojection error, just as the reprojection error of traditional BA is quantified by the distance to the actually observed pixels.

The famous ICRA 2017 work Probabilistic Data Association for Semantic SLAM [[11]](https://ieeexplore.ieee.org/document/7989203) uses this idea. Its way of quantifying the reprojection error is: the object center computed by the probabilistic model, when reprojected onto the image, should be close to the center of the detection bounding box. The data association (which detection-box center it should be close to) is determined by a set of weights. Finally, "BA" and "weight update" are optimized alternately via the EM algorithm; a hypothetical E-step sketch follows.
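To make the alternation concrete, here is an illustrative E-step in the spirit of [11]; the Gaussian-on-distance form, the value of sigma, and all names are my own simplification, not the paper's exact model:

```python
import numpy as np

def update_association_weights(proj_center, box_centers, sigma=20.0):
    """E-step sketch: soft weights over which detection-box center the
    reprojected object center should match, modeled here with a Gaussian
    on pixel distance (sigma is a placeholder, not the paper's value).

    proj_center: (2,) reprojected object center; box_centers: (K, 2).
    Returns a normalized (K,) weight vector.
    """
    d2 = np.sum((box_centers - proj_center) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()   # normalized soft data association
```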

The ECCV 2018 work VSO [[12]](http://cvg.ethz.ch/research/visual-semantic-odometry/) is similar: the reprojection error is quantified via the semantic region of the target's category. There are a few clever details, explained below.

As shown in the figure above, (a) is the semantic segmentation map and (b) is the region of the category "Car". In (c) and (d), the probability value decays from 1 (red) to 0 (blue) with distance from the Car region. Other categories, such as Tree, also produce such a probability distribution.

The transformation from distance to probability uses the Gaussian form below; the difference between (c) and (d) is the result of different variances $\sigma$. This prepares for quantifying the reprojection error. For a spatial point $P$ with coordinates $X$, the probability computed after reprojection is

$$p(S \mid X, Z = c) \propto \exp\left(-\frac{1}{2\sigma^2}\, DT_c\big(\pi(T, X)\big)^2\right)$$

where $\pi(T, X)$ is the pixel obtained by reprojecting $X$ under camera pose $T$, and $DT_c(\cdot)$ is the distance from that pixel to the nearest pixel of the category-$c$ region. This probability is what is finally used to calculate the semantic reprojection error

$$E_{sem} = \sum_{c} w^{(c)}\, DT_c\big(\pi(T, X)\big)^2$$

The weight $w^{(c)}$ exists to solve data association, namely, which category's region the spatial point $P$ should be matched to. It is obtained by multiplying the $p(S \mid X, Z = c)$ values from multiple cameras, i.e., observations from multiple angles vote on the decision.

$E_{sem}$ is added to the ordinary BA optimization formula and optimized with the EM algorithm: the E step updates the weights $w^{(c)}$, while the M step optimizes the 3D point coordinates and camera poses (the usual BA process).
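A minimal sketch of the E-step weight computation under the model above, assuming boolean per-class segmentation masks and integer pixel reprojections (all names are mine; SciPy's distance transform stands in for $DT_c$):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def semantic_weights(reprojections, seg_masks, sigma=10.0):
    """E-step sketch: class weights w^(c) for one 3D point, from the
    Gaussian-of-distance-transform likelihood above, multiplied over views.

    reprojections: list of (u, v) integer pixel coordinates of the point
    in several keyframes; seg_masks: dict class -> boolean (H, W) mask
    (one shared segmentation per class for brevity; in reality each
    keyframe has its own). sigma is a placeholder value.
    """
    classes = list(seg_masks)
    # Distance of every pixel to the nearest pixel of each class region.
    dts = {c: distance_transform_edt(~seg_masks[c]) for c in classes}
    w = np.ones(len(classes))
    for (u, v) in reprojections:        # multiply likelihoods over views
        for j, c in enumerate(classes):
            w[j] *= np.exp(-dts[c][v, u] ** 2 / (2 * sigma ** 2))
    return w / w.sum()                  # normalized class weights w^(c)
```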

Personally, I think the Gaussian is used because the function drops off sharply: the variance $\sigma^2$ acts like a threshold, so category regions whose distance exceeds it get a small weight $w^{(c)}$ more quickly. With multiple cameras, data association then stabilizes quickly and the optimization is accelerated.

(For simplicity, the formulas above omit the indices over multiple cameras and spatial points; see the original paper.)

  • Idea 2: Dynamic regions can be inferred from semantic information.

Almost all traditional SLAM methods assume the scene is static. In scenes with moving objects, feature points on those objects introduce large errors into camera pose estimation. The main remedy is to remove these dynamic feature points, and semantic segmentation is well suited to finding such dynamic regions.

Semantic segmentation has two useful characteristics: it groups many pixels into connected regions, and it attaches classification labels to those regions.

The former helps determine whether an object is actually moving: the offset of a single feature point cannot establish motion (it may just be the observation noise that is always present in a SLAM system), but if a group of related feature points all show large offsets, the region can be considered dynamic.

The latter helps predict whether an object can move: a region labeled "person" is almost always dynamic, while a wall can be assumed static (without even computing offsets).

IROS 2018 DS-SLAM [[13]](https://arxiv.org/abs/1809.08379) uses the first characteristic, deciding region by region whether it is dynamic, while some works such as [[14]](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w9/Kaneko_Mask-SLAM_Robust_Feature-Based_CVPR_2018_paper.pdf) use only the second characteristic and, more aggressively, simply exclude the feature points of certain regions (sky, car) outright.

The work that uses both characteristics is the ICRA 2019 paper [[15]](https://arxiv.org/abs/1812.10016), briefly described as follows.

Using semantic segmentation, regions of certain categories are defined as background (green), while regions of the remaining categories are defined as movable objects.

Next, a motion decision distinguishes whether a movable object is currently stationary (blue) or moving (red).

The motion decision rule is as follows: for a given semantic region, project the previously estimated 3D positions of its feature points into the current image. If the Euclidean distance between a reprojected position and its corresponding feature point exceeds a threshold, that point is marked as a moving point; if the proportion of moving points in the region exceeds a threshold, the region is judged to be moving. A minimal sketch follows.
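This sketch uses hypothetical names and placeholder thresholds (the paper's actual values may differ):

```python
import numpy as np

def region_is_moving(pts3d, feats2d, project, pt_thresh=3.0, region_ratio=0.3):
    """Motion decision for one semantic region, per the rule above:
    reproject past 3D estimates of the region's feature points into the
    current image; a point whose reprojection lies farther than pt_thresh
    pixels from its matched feature is a 'moving point'; the region is
    moving if the fraction of moving points exceeds region_ratio.

    pts3d: (N, 3) past 3D positions; feats2d: (N, 2) matched features in
    the current image; project: maps a 3D point to a pixel using the
    current camera pose estimate.
    """
    reproj = np.array([project(p) for p in pts3d])            # (N, 2)
    moving = np.linalg.norm(reproj - feats2d, axis=1) > pt_thresh
    return moving.mean() > region_ratio
```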

  • Idea 3: Semantic information provides object-level descriptions that are invariant to seasonal and lighting changes.

This idea can be applied to localization against an existing 3D map.

Traditional feature points (with descriptors) are not robust in changing environments and are easily lost, whereas semantic segmentation results are relatively stable. Besides, object-level localization (matching against a map with semantic labels) is more intuitive for humans.

The ICRA 2018 and 2019 works [[16]](https://arxiv.org/abs/1801.05269) [[17]](https://ieeexplore.ieee.org/abstract/document/8794475) follow this line of thinking.


  • Idea 4: To be continued
