
Author: Xie Hongwen

Introduction:

With the boom in robotics, drones, autonomous driving, and VR/AR in recent years, SLAM has become widely known and is regarded as one of the key technologies in these fields. This article briefly introduces SLAM and its development, analyses the key problems of visual SLAM systems and the difficulties encountered in practical applications, and looks ahead to the future of SLAM.

1. SLAM technology

SLAM (Simultaneous Localization and Mapping) was first proposed in the field of robotics. It refers to the following: starting from an unknown location in an unknown environment, a robot localizes itself from repeatedly observed environmental features as it moves, and at the same time builds an incremental map of the surroundings based on its estimated pose, thereby achieving simultaneous localization and mapping. Because of its important academic and application value, SLAM has long been considered a key technology for realizing fully autonomous mobile robots.

In layman’s terms, SLAM answers two questions: “Where am I?” and “What is around me?” Just like a person placed in an unfamiliar environment, what SLAM tries to solve is recovering the relative spatial relationship between the observer and the surrounding environment. “Where am I” corresponds to the localization problem, while “what is around me” corresponds to the mapping problem, which gives a description of the surrounding environment. Answering these two questions completes the spatial cognition of oneself and the environment. On this foundation, we can plan a path to the destination we want to reach, and along the way we must also detect and avoid obstacles in time to ensure safe operation.

2. The development of SLAM

Since the concept of SLAM was put forward in the 1980s, the technology has a history of more than 30 years. The sensors used in SLAM systems have kept expanding: from sonar in the early days, to 2D/3D lidar later, to monocular, binocular, RGB-D and ToF cameras, and on to fusion with inertial measurement units (IMUs) and other sensors. SLAM algorithms have likewise shifted from filter-based approaches (EKF, PF, etc.) to optimization-based approaches, and the technical framework has evolved from single-threaded to multi-threaded. Some representative SLAM technologies from this evolution are introduced below.

(1) Development of lidar SLAM

Lidar-based SLAM (lidar SLAM) uses a 2D or 3D lidar (also called a single-line or multi-line lidar), as shown in the figure below. 2D lidar is commonly used on indoor robots (such as sweeping robots), while 3D lidar is commonly used on autonomous vehicles.

The advantage of lidar is accurate measurement: it provides precise angle and distance information, reaching angular accuracy below 1° and centimetre-level ranging accuracy, with a wide scanning range (usually covering more than 270° in the plane). Moreover, solid-state lidars based on scanning galvanometers (Sick, Hokuyo, etc.) can achieve high data refresh rates (above 20 Hz), which basically meets the requirement of real-time operation. The disadvantages are that it is relatively expensive (even the cheaper mechanically rotating single-line lidars currently on the market cost several thousand yuan) and that it places requirements on installation and deployment (the scanning plane must not be occluded).

The map built by lidar SLAM is often represented as an occupancy grid: each cell stores the probability that it is occupied. The representation is very compact and especially suitable for path planning.
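To make the occupancy-grid idea concrete, here is a minimal Python sketch (not taken from any particular library): each cell keeps a log-odds value that is updated whenever a sensor reading marks the cell as hit or free, and the whole grid can be converted back to probabilities for path planning. The cell size, update constants and class name are illustrative assumptions.

```python
import numpy as np

class OccupancyGrid:
    """Minimal 2D occupancy grid: each cell stores the log-odds of being occupied."""

    def __init__(self, width, height, resolution=0.05):
        self.resolution = resolution                 # metres per cell (assumed value)
        self.log_odds = np.zeros((height, width))    # log-odds 0 corresponds to p = 0.5

    def update(self, ix, iy, hit, l_occ=0.85, l_free=-0.4):
        # Bayesian update of one cell: add the measurement model's log-odds.
        self.log_odds[iy, ix] += l_occ if hit else l_free

    def probabilities(self):
        # Convert log-odds back to occupancy probabilities for planning.
        return 1.0 - 1.0 / (1.0 + np.exp(self.log_odds))


grid = OccupancyGrid(200, 200)
grid.update(50, 80, hit=True)    # a lidar beam ended in this cell
grid.update(50, 79, hit=False)   # cells the beam passed through are likely free
```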

Sebastian Thrun (below), founder and CEO of Udacity, former Google VP and leader of Google's driverless-car effort, laid out in his 2005 classic Probabilistic Robotics the theoretical basis for building maps and localizing with a 2D lidar using probabilistic methods, and elaborated the FastSLAM method based on the RBPF particle filter, which became the foundation of GMapping [1][2], one of the standard methods for 2D lidar mapping. The algorithm is also integrated into the Robot Operating System (ROS).

In 2013, literature [3] compared and evaluated several 2D SLAM algorithms available in ROS: HectorSLAM, KartoSLAM, CoreSLAM, LagoSLAM and GMapping. Interested readers can take a closer look.

In 2016, Google open-sourced Cartographer [4], its lidar SLAM algorithm library, which addresses GMapping's drawbacks of heavy computation and the lack of effective loop-closure handling. It builds the map using the ideas of submaps and scan matching, handles loop closures effectively, and achieves good results.

(2) Development of visual SLAM

Compared with lidar, the cameras used as sensors for visual SLAM are cheaper, lighter and far more widely available (everyone's mobile phone has one). In addition, images carry richer information and more discriminative features. The disadvantage is that processing image information in real time requires a great deal of computing power. Fortunately, with increasingly powerful hardware, real-time visual SLAM can now run on small PCs and embedded devices, and even on mobile devices.

Currently there are three main sensor types for visual SLAM: monocular cameras, binocular (stereo) cameras and RGB-D cameras. The depth of an RGB-D camera may be computed by the structured-light principle (e.g., the first-generation Kinect), by projecting an infrared pattern and observing it with a binocular infrared camera (e.g., the Intel RealSense R200), or by a ToF camera (e.g., the second-generation Kinect). From the user's point of view, all of these RGB-D types output an RGB image together with a depth image.

Modern popular visual SLAM systems can be roughly divided into a front end and a back end, as shown below. The front end performs data association and corresponds to visual odometry (VO): it studies the transformation between frames, carries out real-time pose tracking, processes the input images and computes the pose changes; it also detects and handles loop closures, and when IMU data is available it can take part in the fusion (visual-inertial odometry, VIO). The back end mainly optimizes the output of the front end, applying filtering theory (EKF, PF, etc.) or optimization over trees or graphs to obtain the optimal pose estimates and map.
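As a rough illustration of this front-end/back-end split, the following Python sketch shows a conceptual main loop; frontend and backend are hypothetical objects standing in for the tracking (VO/VIO) and optimization components, and their method names are assumptions rather than the API of any real system.

```python
def slam_main_loop(frames, frontend, backend):
    """Conceptual front-end / back-end split of a modern visual SLAM system."""
    for frame in frames:
        # Front end: data association and real-time pose tracking (VO / VIO).
        pose, is_keyframe = frontend.track(frame)

        if is_keyframe:
            # Back end: refine keyframe poses and landmarks, e.g. local bundle adjustment.
            backend.add_keyframe(frame, pose)
            backend.optimize_local()

        if frontend.detect_loop_closure(frame):
            # A detected loop closure triggers a global correction (pose graph / global BA).
            backend.optimize_global()
```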

With filter-based SLAM, as in figure (a) below, estimating the camera pose Tn at time n requires all the landmark information in the map, and the state update at every frame must also update all of these landmarks. As new landmarks are added, the size of the state matrix grows rapidly, making the computation and solving increasingly time-consuming, so this approach is unsuitable for long-duration operation in large scenes. SLAM based on optimization, as in figure (b) below, is usually combined with keyframes: the camera pose Tn estimated at time n can use only a subset of the whole map, and the map data does not have to be updated for every image. Therefore, most modern, successful real-time SLAM systems adopt the optimization approach.

Several representative systems in the development of visual SLAM are introduced below:

MonoSLAM [5], developed by Davison et al. in 2007, was the first successful pure visual SLAM system based on a monocular camera. MonoSLAM uses an extended Kalman filter whose state contains the camera's motion parameters and the positions of all 3D points, so the camera's pose at every moment carries a probabilistic uncertainty. The location of each 3D point likewise has a probabilistic uncertainty, which can be represented by a 3D ellipsoid whose centre is the estimate and whose volume indicates the degree of uncertainty (as shown in the figure below). Under this probabilistic model, a scene point projected onto the image yields a probability ellipse. MonoSLAM extracts Shi-Tomasi corners [6] from each image frame and actively searches [7] for feature matches inside the projected ellipse. Since the locations of the 3D points are part of the estimated state, the computational complexity at each moment is O(N³), so only small scenes with a few hundred points can be handled.
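The following toy Python snippet (an illustration under assumed numbers, not MonoSLAM's code) shows why this EKF formulation scales poorly: the state stacks the camera's 13 motion parameters with every 3D point, and the dense covariance over that joint state must be updated at every frame.

```python
import numpy as np

n_landmarks = 300                         # a few hundred points is already near the practical limit
camera_dim = 13                           # position, orientation quaternion, linear and angular velocity
state_dim = camera_dim + 3 * n_landmarks  # every 3D point is part of the filter state

state_mean = np.zeros(state_dim)          # joint estimate of camera and landmarks
covariance = np.eye(state_dim)            # dense uncertainty over the whole state

# Each EKF update touches the full covariance (O(N^2) entries, and O(N^3) once the
# matrix products involved are counted), so the per-frame cost explodes as landmarks are added.
print(covariance.shape)                   # (913, 913)
```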

In the same year, Murray (Davison's doctoral advisor at Oxford) and Klein published the real-time SLAM system PTAM (Parallel Tracking and Mapping) [8] and open-sourced it (shown below). It was the first monocular visual SLAM system based on keyframe bundle adjustment (BA), and was later ported to mobile phones in 2009 [9]. PTAM's architecture, with tracking and mapping running in parallel, was an innovation at the time: for the first time, map optimization could be folded into the real-time pipeline while the whole system kept running. This design was followed by later real-time SLAM systems such as ORB-SLAM and has become standard in modern SLAM. Specifically, the pose-tracking thread does not modify the map; it only uses the known map for fast tracking, while the mapping thread focuses on creating, maintaining and updating the map. Even if the mapping thread takes a little longer, the tracking thread still has a map to track against (as long as the device remains within the already-built map). In addition, PTAM implements a relocalization strategy after tracking is lost: if the number of successfully matched points (inliers) is insufficient (for example because of image blur or fast motion), relocalization starts [10]; the current frame is compared against thumbnails of the existing keyframes, and the most similar keyframe is selected as the prediction of the current frame's pose.
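A bare-bones sketch of the parallel tracking-and-mapping idea is shown below; camera and shared_map are hypothetical placeholders, and the method names are assumptions used only to show how the two threads divide the work.

```python
import queue
import threading

keyframe_queue = queue.Queue()

def tracking_thread(camera, shared_map):
    # Tracks every incoming frame against the current map; never edits the map itself.
    while True:
        frame = camera.grab()
        pose = shared_map.track(frame)
        if shared_map.needs_new_keyframe(pose):
            keyframe_queue.put((frame, pose))    # hand the frame to mapping, keep tracking

def mapping_thread(shared_map):
    # Consumes keyframes and refines the map in the background (e.g. bundle adjustment).
    while True:
        frame, pose = keyframe_queue.get()
        shared_map.add_keyframe(frame, pose)
        shared_map.bundle_adjust()               # may take longer than one frame period

# threading.Thread(target=tracking_thread, args=(camera, shared_map), daemon=True).start()
# threading.Thread(target=mapping_thread, args=(shared_map,), daemon=True).start()
```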

In 2011, Newcombe et al. proposed the monocular DTAM system [11], whose most notable feature is that it recovers a dense 3D scene model in real time (see figure below). Thanks to the 3D model, DTAM not only allows virtual objects to physically collide with the scene in AR applications, but also maintains stable direct tracking when features are lost or the image is blurred. DTAM represents depth using inverse depth [12]. As shown in the figure below, DTAM discretizes the solution space into a 3D grid of size M×N×S, where M×N is the image resolution and S is the inverse-depth resolution, and uses the direct method to build an energy function that is then optimized. DTAM is robust to feature loss and image blur; however, because it recovers a dense depth value for every pixel and uses global optimization, the amount of computation is large, and even with GPU acceleration the efficiency of expanding the model is still low.
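The snippet below is a toy Python illustration (not DTAM's implementation) of the M×N×S discretization: a cost volume accumulates per-pixel photometric error over S inverse-depth hypotheses, and the per-pixel minimum gives a dense inverse-depth map; DTAM additionally regularizes this with a global energy term. The resolutions and depth range here are assumed values.

```python
import numpy as np

M, N, S = 480, 640, 64                       # image rows, columns, inverse-depth samples
inverse_depths = np.linspace(0.01, 2.0, S)   # hypothesised inverse depths (1 / metres)
cost_volume = np.zeros((M, N, S))            # accumulated photometric error per hypothesis

def accumulate(cost_volume, photometric_error):
    # Add one overlapping frame's per-pixel, per-hypothesis photometric error
    # (photometric_error is assumed to have shape (M, N, S)).
    cost_volume += photometric_error

# Winner-take-all inverse depth per pixel; DTAM instead minimises a regularised energy.
dense_inverse_depth = inverse_depths[np.argmin(cost_volume, axis=2)]
```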

In 2013, Engel et al. from the TUM computer vision group proposed a visual odometry (VO) system also based on the direct method, and extended it into the visual SLAM system LSD-SLAM in 2014 [13], open-sourcing the code. Compared with DTAM, LSD-SLAM recovers only semi-dense depth maps (shown below), and the depth of each pixel is computed independently, which achieves high computational efficiency. LSD-SLAM represents the scene with keyframes; each keyframe K contains an image Ik, an inverse depth map Dk, and the variance of the inverse depth Vk. The system assumes that the inverse depth of each pixel x follows a Gaussian distribution N(Dk(x), Vk(x)). The foreground thread of LSD-SLAM uses the direct method to compute the relative motion between the current frame t and the keyframe K. For each pixel x extracted in the semi-dense keyframe (regions with significant gradient), the background thread searches along the epipolar line in It for the point corresponding to Ik(x), obtains a new inverse-depth observation and its variance, and then uses an EKF update to refresh Dk and Vk. LSD-SLAM uses pose-graph optimization to close loops and handle large-scale scenes. In 2015, Engel et al. extended LSD-SLAM to support binocular cameras [14] and panoramic cameras [15].
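Since each pixel's inverse depth is modelled as a Gaussian N(Dk(x), Vk(x)), the per-pixel refresh amounts to fusing two scalar Gaussians, the stored estimate and the new epipolar-search observation, i.e. a scalar Kalman update. A minimal Python sketch with illustrative numbers:

```python
def fuse_inverse_depth(d_prior, v_prior, d_obs, v_obs):
    """Fuse the stored inverse-depth Gaussian N(d_prior, v_prior) with a new
    observation N(d_obs, v_obs): the product of two Gaussians, i.e. a scalar
    Kalman/EKF update."""
    v_post = (v_prior * v_obs) / (v_prior + v_obs)
    d_post = (v_obs * d_prior + v_prior * d_obs) / (v_prior + v_obs)
    return d_post, v_post

# Prior inverse depth 0.5 (1/m) with variance 0.04; observation 0.6 with variance 0.01.
# The posterior (0.58, 0.008) is pulled towards the more certain observation.
print(fuse_inverse_depth(0.5, 0.04, 0.6, 0.01))
```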

In 2014, Forster et al. from the Robotics and Perception Group at the University of Zurich proposed the open-source SVO system [16]. It uses sparse model-based image alignment on sparse feature patches to obtain the camera pose, then, under the assumption of photometric invariance, builds an optimization to refine the predicted feature alignment, and finally optimizes pose and structure (motion-only BA and structure-only BA). For depth estimation, a depth filter is built and a special Bayesian network [17] is used to update the depth. One outstanding advantage of SVO is its speed: by using sparse image patches and avoiding feature-descriptor computation, the authors report 55 fps on an embedded platform (quad-core ARM Cortex-A9 at 1.6 GHz). However, SVO also has obvious shortcomings: it does not handle relocalization or loop closure, so it is not a complete SLAM system and essentially cannot recover once tracking is lost; moreover, its depth filter converges slowly and the results rely heavily on accurate pose estimation. In 2016, Forster improved SVO into version 2.0 [18]. The new version adds edge tracking, takes the motion prior from the IMU into account, and supports cameras with a large field of view (such as fisheye and catadioptric panoramic cameras) as well as multi-camera systems; it has also been released as an open executable [19]. It is worth mentioning that Forster also derived the theory of VIO in detail, and the related literature [20] has become the theoretical guide for later SLAM systems that fuse an IMU, such as Visual-Inertial ORB-SLAM.

In 2015, Mur-Artal et al. proposed the open-source monocular ORB-SLAM [21], which was extended in 2016 to ORB-SLAM2 [22], supporting binocular and RGB-D sensors. It is one of the visual SLAM systems with the most complete sensor support and the best performance, and it is the top-ranked of all open-source systems that have submitted results on the KITTI dataset [23]. ORB-SLAM continues PTAM's algorithmic framework, adding a separate loop-closure detection thread and improving most components of the framework. The improvements can be summarized as follows: 1) ORB features are used throughout tracking, mapping, relocalization and loop detection [24], so that the map once built can be saved, loaded and reused; 2) thanks to the covisibility graph, tracking and mapping concentrate on a locally covisible area, enabling real-time operation in large-scale scenes independent of the overall map size; 3) a unified BoW (bag-of-words) model is used for both relocalization and loop detection, with indexes built to speed up detection; 4) to overcome PTAM's limitation of requiring manual initialization from a planar scene, a new automatic and robust initialization strategy based on model selection is proposed, allowing reliable automatic initialization from planar or non-planar scenes. Later, Mur-Artal extended the system into Visual-Inertial ORB-SLAM [25], which fuses IMU information and uses the pre-integration method proposed in Forster's paper [20] to describe the IMU initialization process and the joint optimization with visual information.
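To illustrate the BoW-based relocalization and loop-detection step, here is a small Python sketch of the L1 similarity commonly used to compare normalized bag-of-words vectors (DBoW2-style scoring); the helper names are illustrative and this is not ORB-SLAM's actual code.

```python
import numpy as np

def bow_similarity(v1, v2):
    # L1-based score between normalised BoW vectors: 1.0 means identical word
    # distributions, 0.0 means completely disjoint vocabularies.
    v1 = v1 / np.sum(np.abs(v1))
    v2 = v2 / np.sum(np.abs(v2))
    return 1.0 - 0.5 * np.sum(np.abs(v1 - v2))

def best_matching_keyframe(query_vec, keyframe_vecs):
    # Shortlisting step: pick the stored keyframe whose BoW vector is closest to the
    # query frame, as a candidate for relocalization or loop closure.
    scores = [bow_similarity(query_vec, kf) for kf in keyframe_vecs]
    return int(np.argmax(scores)), max(scores)
```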

In 2016, Engel et al. of the TUM computer vision group (the authors of LSD-SLAM) proposed the DSO system [26]. DSO is a new visual odometry based on the direct and sparse methods, combining a minimum photometric-error model with joint optimization of the model parameters. To meet real-time requirements, the image is not smoothed but sampled uniformly; instead of keypoint detection and feature-descriptor computation, DSO samples pixels with intensity gradients across the whole image, including edges on white walls and pixels whose intensity varies smoothly. Moreover, DSO presents a complete photometric calibration method that accounts for exposure time, lens vignetting and the nonlinear response function. The system was evaluated on the TUM monoVO, EuRoC MAV and ICL-NUIM datasets, achieving high tracking accuracy and robustness.
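DSO's photometric model is commonly written as I(x) = G(t * V(x) * B(x)), where G is the nonlinear response function, t the exposure time, V the vignette and B the irradiance. The Python sketch below (an illustration under that assumption, with made-up calibration data, not DSO's code) undoes the model to recover values proportional to irradiance:

```python
import numpy as np

def photometrically_correct(image, exposure_time, inverse_response, vignette):
    # Apply the inverse response G^-1 via a 256-entry look-up table, then divide out
    # the vignette V(x) and the exposure time t to get values proportional to B(x).
    irradiance = inverse_response[image]
    return irradiance / (vignette * exposure_time)

# Hypothetical calibration data: an identity response curve and a flat vignette.
inverse_response = np.linspace(0.0, 255.0, 256)
vignette = np.ones((480, 640))
frame = np.random.randint(0, 256, size=(480, 640))
corrected = photometrically_correct(frame, 0.02, inverse_response, vignette)
```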

In 2017, Shen Shaojie's research group at the Hong Kong University of Science and Technology proposed VINS, a system fusing IMU and visual information [27], and open-sourced both the mobile-phone and Linux versions. It is the first visual-inertial fusion SLAM system whose mobile-phone code was directly open-sourced. The system runs on iOS devices and provides accurate positioning for mobile augmented-reality applications; it has also been applied to UAV control with good results. VINS-Mobile uses sliding-window optimization to fuse vision and IMU data, representing poses with quaternions, and includes a BoW-based loop-closure detection module; accumulated errors can be corrected in real time through a global pose graph.
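As a rough picture of the sliding-window idea, the Python sketch below keeps only the most recent keyframe states in the optimization window; the window size and the simple dropping of old states are illustrative assumptions (a real VIO back end such as VINS marginalizes old states into a prior rather than discarding them).

```python
from collections import deque

WINDOW_SIZE = 10
window = deque(maxlen=WINDOW_SIZE)   # keyframe states currently inside the optimisation window

def add_keyframe_state(state):
    # The newest keyframe enters the window; once the window is full the oldest
    # state falls out (a real system would marginalise it instead).
    window.append(state)

for k in range(25):
    add_keyframe_state({"id": k})

print([s["id"] for s in window])     # only the 10 most recent keyframes remain
```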


Recommended reading

Rambling SLAM Technology (Part 2)



This article has been published on the Tencent Cloud Technology Community with the author's authorization.