1. Introduction
Simultaneous Localization and Mapping (SLAM) is a challenging core technology in robotics. The localization and mapping problems it solves are considered key to autonomous robot navigation. Its basic principle is to use the robot's onboard sensors to perceive the surrounding environment and compute the robot's own position within it. Since SLAM was proposed, it has gone through several research stages. Owing to the high accuracy and long range of LiDAR, early SLAM research often used LiDAR as the main sensor. Early SLAM also used the extended Kalman filter to estimate the robot's pose [1], but the results were poor: for strongly nonlinear systems, this method introduces large truncation errors, preventing accurate localization and mapping. Later, particle filter-based SLAM gradually became mainstream. Particle sampling effectively avoids the nonlinearity problem, but the computational cost grows with the number of particles. In 2007, Grisetti et al. [2] proposed Gmapping, a SLAM method based on an improved particle filter. Through an improved proposal distribution and adaptive resampling, it effectively improved localization accuracy while reducing computational load, and is regarded as a milestone of laser SLAM. Konolige et al. [3] proposed KartoSLAM in 2010; it replaces particle filtering with graph optimization and uses sparse pose adjustment to address the difficulty of directly solving the large matrices arising in nonlinear optimization. Kohlbrecher et al. [4] proposed HectorSLAM in 2011; it requires no odometry and solves the scan matching problem with the Gauss-Newton method, but it places high demands on the sensor and requires a high-precision LiDAR to operate.
Cartographer [5], proposed by Google in 2016, can be regarded as the state of the art in laser SLAM. It inserts each frame of laser data into a submap at the best estimated position found by scan matching, where scan matching involves only the current submap. When a submap is completed, a local loop closure is performed; when all submaps are completed, a global loop closure is performed using branch-and-bound search and precomputed grids. Compared with Gmapping and Hector, this scheme has lower cumulative error and does not require expensive equipment. Visual SLAM has gradually become a research hotspot thanks to low sensor cost and rich image information, although it is more complex than laser SLAM. MonoSLAM, first proposed by Davison et al. [6] in 2007, is considered the origin of many visual SLAM systems. It uses an extended Kalman filter as the back end to track sparse feature points in the front end; uncertainty is represented by a probability density function, and the mean and variance of the posterior distribution are obtained recursively from the observation model. Sim et al. [7] implemented visual SLAM using a particle filter; this avoids the linearization problem and achieves high accuracy, but requires a large number of particles, increasing the computational complexity. Subsequently, keyframe extraction became essential for reducing computational complexity. The most representative work is PTAM, proposed by Klein and Murray in 2007 [8], which introduced a simple and effective keyframe extraction technique. Its two most important contributions are: (1) the parallelization of tracking and mapping. Although the tracking part must respond to image data in real time, the back-end optimization need not be computed in real time.
That is, back-end optimization can run slowly in the background, and the two threads are synchronized when needed. This was also the first time the front-end/back-end distinction was proposed, and it has guided the architecture of many subsequent SLAM systems. (2) Nonlinear optimization was used for the first time in place of traditional filters. Since PTAM, visual SLAM research has gradually shifted to back ends dominated by nonlinear optimization. In 2016, Mur-Artal and Tardós [9] proposed the well-known second-generation visual SLAM system ORB-SLAM2, based on ORB features [10], which combine oriented FAST (Features from Accelerated Segment Test) keypoints with rotation-aware BRIEF (Binary Robust Independent Elementary Features) descriptors. It is one of the most complete and easy-to-use modern SLAM systems. It supports monocular, stereo, and RGB-D cameras, and divides tracking, mapping, and loop closing into three threads, all of which use ORB features. The loop closure detection of ORB-SLAM2 is a major highlight: it uses a bag-of-words model, which effectively prevents error accumulation and enables fast relocalization after tracking is lost. However, the disadvantages of ORB-SLAM2 are also obvious. Image feature extraction and matching, as well as back-end optimization, are computationally expensive, making real-time operation difficult on embedded platforms. Moreover, although the sparse feature point map it builds is sufficient for localization, it cannot support navigation, obstacle avoidance, and similar functions. In contrast to extracting feature points, the direct method computes camera motion from pixel intensity information, approaching localization and mapping from another direction. The camera localization method proposed by Stühmer et al.
[11] relies on every pixel of the image, that is, it uses dense image alignment for self-localization and constructs a dense three-dimensional map. Engel et al. [12] construct a semi-dense depth map for the current image and use dense image alignment to compute the camera pose. Constructing a semi-dense map means estimating depth values for all pixels with large gradients; these depths are modeled as Gaussian distributions and updated as new images arrive. Engel et al. [13] proposed the LSD-SLAM algorithm, which applies the direct method to semi-dense monocular SLAM, something rarely seen in earlier direct methods: previously, only sparse maps could be built from feature points, while dense maps required cameras, such as RGB-D sensors, that provide depth. Forster et al. [14] proposed Semi-direct Monocular Visual Odometry (SVO) in 2014, a so-called sparse direct method. It combines feature points with the direct method, tracking some keypoints (such as corners) and then estimating camera motion and position from the information around those keypoints according to the direct method. Compared with other schemes, SVO does not spend significant resources computing descriptors or processing large numbers of pixels, so it is well suited to platforms such as drones and handheld augmented reality (AR) devices. Newcombe et al. [15] proposed KinectFusion, which obtains the camera pose from Kinect depth images by minimizing per-pixel distance measurements in each frame, and fuses all depth images into a global map. Gokhool et al. [16] construct an error function from the photometric and geometric information of image pixels, obtain the camera pose by minimizing it, and treat the mapping problem as a pose graph. Kerl et al.
[17] proposed an improved direct RGB-D SLAM method that combines pixel intensity error and depth error into a single error function and obtains the optimal camera pose by minimizing this cost function; the optimization is implemented with g2o. They also proposed an entropy-based keyframe extraction and loop closure detection method, which greatly reduces path error. In terms of multi-sensor fusion, a variety of sensors are available, such as visual sensors, LiDAR, inertial measurement units (IMU), and ultrasonic sensors. The main fusion directions at present are LiDAR combined with vision, and IMU combined with vision. Chen et al. [18] used a visual sensor combined with an IMU for precise robot pose estimation and mounted a 2D LiDAR vertically on the robot to collect point clouds for three-dimensional mapping. Houben et al. [19] used three-dimensional laser data for micro-UAV localization; to address ambiguous laser localization in structurally similar environments, they proposed adding visual markers to the environment to improve accuracy, which can also be used for micro-UAV relocalization. Wang Xiaowei et al. [20] proposed a SLAM method that fuses binocular visual information and LiDAR data, based on an improved particle filter in which the observation data used to compute the proposal distribution include both visual information and laser data. Compared with using the odometry motion model as the proposal distribution, this method effectively improves the accuracy of localization and mapping.
Zhang Jie and Zhou Jun [21] proposed a SLAM method combining LiDAR and vision, in which the laser map is used for navigation and the visual map restores the target scene; they also proposed an improved Iterative Closest Point (ICP) method for faster point cloud stitching, while using graph optimization to reduce cumulative error and ensure map accuracy. Shi et al. [22] used visual odometry to provide initial values for 2D laser ICP on a small UAV, achieving good real-time performance and accuracy.
Qin et al. [23] proposed a tightly coupled visual-IMU scheme that combines the residual terms constructed from vision and from the IMU into a joint optimization problem. Li et al. [24] and Lynen et al. [25] fused vision and IMU through an extended Kalman filter to obtain state estimates in real time. Overall, laser SLAM still outperforms visual SLAM, but due to the nature of laser data, laser SLAM cannot effectively perform large-scale loop closure detection. For low-cost LiDAR, the laser points are not dense enough, so when the robot returns to a previously visited location, the constructed map deviates and localization becomes inaccurate, a consequence of accumulated error. Meanwhile, loop closure detection has always been a major challenge in laser SLAM: the acquired laser data are two-dimensional point clouds without distinctive features and are very similar to one another, so loop closure detection based on laser data is often ineffective. Because images contain rich information, visual SLAM has a natural advantage in loop closure detection. The bag-of-words model used in ORB-SLAM2, which combines ORB features with a bag-of-words approach, offers high accuracy and speed, making it the most widely used method for loop closure detection. Addressing the high noise, low accuracy, and loop closure difficulties of low-cost laser SLAM, this paper proposes a joint laser-visual optimization method to improve localization and mapping accuracy, and effectively solves the laser loop closure detection problem through a visual bag-of-words model.
2. A Graph-Optimized Simultaneous Localization and Mapping Framework
Laser SLAM primarily calculates the pose transformation between adjacent frames by matching their laser scans. However, laser data is not noise-free, especially for low-cost LiDAR whose emitted laser points are sparse, which introduces errors into the calculated pose transformations. Therefore, filtering or optimization methods are often needed to improve localization accuracy. Visual SLAM, on the other hand, calculates pose through feature point extraction and matching; however, an error in even a single frame propagates to subsequent frames, and the accumulated error continues to grow. Graph optimization is a popular method for SLAM back-end optimization. By constructing nodes and constraint edges, it explicitly represents the relationships among a series of poses and observations, and then uses nonlinear optimization to find the optimal variables, thereby obtaining accurate pose estimates.
This method was first applied to visual SLAM. Initially, because the number of image feature points was large, the matrix dimensions were too high to solve efficiently, so it did not become mainstream. It was not until around 2011, when the sparsity of the Hessian matrix was exploited, that computation speed improved greatly, allowing nonlinear optimization and graph optimization to be applied successfully to both visual and laser SLAM. According to the real-time requirements of pose estimation and optimization, SLAM is divided into front-end and back-end parts. The current mainstream SLAM framework is shown in Figure 1.
The front end primarily estimates the robot's pose from sensor data, but both image and laser data contain varying degrees of noise. High-precision LiDAR suffers far less from noise but is prohibitively expensive. When poses are computed from low-cost LiDAR and camera images, the estimated trajectory accumulates error relative to the ground truth, and this error grows over time. The main role of back-end optimization is to improve the accuracy of localization and mapping by eliminating accumulated error through filtering or optimization. This paper employs graph optimization as the back end, using nonlinear optimization to find the descent direction and iteratively minimize the error. Simply put, graph optimization describes the optimization problem in the form of a graph. In SLAM, the nodes of the graph represent poses, and the edges represent the constraints between poses and between poses and observations. During navigation and mapping, the robot's observations consist of the laser data and the three-dimensional spatial points corresponding to the observed ORB feature points, continuously captured by the LiDAR and camera. All of these data are placed within the graph framework, as shown in Figure 2.
Here X represents the keyframe poses and O represents the observations, which include the 3D spatial coordinates of the feature points and the 2D laser data. The visual error is represented by the reprojection error (Figure 3), whose calculation requires the camera poses of two adjacent frames, the 2D coordinates of the matched feature points in the two images, and the 3D coordinates of the corresponding points in space. Pure visual SLAM typically extracts and matches feature points and then uses methods such as EPnP to estimate the pose transformation between adjacent frames; however, compared with laser scan matching between adjacent frames, this estimate has a larger error. Therefore, this paper uses the pose estimate obtained from laser scan matching as the initial value for back-end optimization. For a matched feature point pair p1 and p2 in adjacent frames, the depth camera used in this paper can directly obtain the coordinates of the 3D point P corresponding to p1 in the previous frame, and P is reprojected onto the next frame to obtain the reprojected point p̂2. Due to errors in pose estimation and noise in the depth camera, p̂2 and p2 do not coincide exactly, and the distance between the two points is the error.
Figure 3 Reprojection error
Figure 4 Overall framework of joint optimization
The reprojection coordinates are calculated as follows: (1) Using the transformation relationship (R, t) from the world coordinate system to the camera coordinate system, compute the coordinates P′ of the world point P in the camera coordinate system.
P′ = RP + t = (X′, Y′, Z′)^T    (1)
Where R is the rotation matrix; t is the translation vector; the superscript T denotes the transpose. (2) Project P′ onto the normalized plane, where the depth is normalized to 1, to obtain the normalized coordinates Pc:

Pc = (uc, vc, 1)^T = (X′/Z′, Y′/Z′, 1)^T    (2)
Where (uc, vc) are the two-dimensional coordinates on the normalized plane. (3) According to the camera intrinsic model, the normalized plane is scaled by fx along the horizontal axis and fy along the vertical axis, and the origin is shifted by cx and cy pixels. The pixel coordinates (us, vs) are therefore:

us = fx·uc + cx,  vs = fy·vc + cy    (3)
The error function for this point is then e = p2 − p̂2, where p2 is the pixel coordinate obtained directly from the image and p̂2 = (us, vs)^T is the reprojection coordinate calculated according to formulas (1) to (3). Extending this error to all matched points between two adjacent frames, the cost function to be minimized is E = Σi ‖p2,i − p̂2,i‖². The pose transformation (R, t) and the three-dimensional point coordinates Pi can then be obtained by minimizing this error with a nonlinear optimization algorithm.
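The three reprojection steps above can be sketched in code. This is a minimal illustration of the pinhole projection in formulas (1) to (3), not the paper's implementation; the function names and the intrinsic values used below are hypothetical.

```python
import math

def reproject(P_w, R, t, fx, fy, cx, cy):
    """Project a 3D world point into pixel coordinates.

    Follows the text: (1) world -> camera frame, (2) normalize by
    depth, (3) apply the pinhole intrinsics (fx, fy, cx, cy).
    """
    # (1) P' = R * P_w + t  (R is a 3x3 rotation, t a 3-vector)
    Pc = [sum(R[i][j] * P_w[j] for j in range(3)) + t[i] for i in range(3)]
    # (2) normalized-plane coordinates (uc, vc)
    uc, vc = Pc[0] / Pc[2], Pc[1] / Pc[2]
    # (3) pixel coordinates (us, vs)
    return fx * uc + cx, fy * vc + cy

def reprojection_error(p2, P_w, R, t, fx, fy, cx, cy):
    """Euclidean distance between the observed feature p2 and the
    reprojection of its matched 3D point P_w."""
    us, vs = reproject(P_w, R, t, fx, fy, cx, cy)
    return math.hypot(p2[0] - us, p2[1] - vs)
```

In the joint optimization described later, this residual is evaluated for each matched feature pair and summed into the cost function.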
Compared to visual errors, laser errors are simpler to obtain. Laser SLAM often requires scan matching to estimate the pose transformation between adjacent frames, but this estimated value (R,t) cannot guarantee that all laser data from the previous frame will completely overlap with the laser data from the next frame after the pose transformation. Therefore, the laser error is defined as follows:
e_l = Σj ‖Rqj + t − q′j‖²    (4)

Where qj is the j-th laser point of the previous frame and q′j is its matched point in the next frame.
Then, the error function is minimized through nonlinear optimization, and the obtained pose is returned to the front end as the reference frame pose for the next frame.
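For the 2D case, the laser error of formula (4) can be sketched as follows. The point correspondences are assumed to be given (in practice they come from scan matching), and all names are illustrative.

```python
import math

def laser_error(scan_prev, scan_next, theta, t):
    """Sum of squared distances between the previous scan's points,
    transformed by the 2D pose (theta, t), and their matched points
    in the next scan -- the laser error of formula (4)."""
    c, s = math.cos(theta), math.sin(theta)
    err = 0.0
    for (x, y), (xn, yn) in zip(scan_prev, scan_next):
        # q' = R(theta) * q + t
        xt, yt = c * x - s * y + t[0], s * x + c * y + t[1]
        err += (xt - xn) ** 2 + (yt - yn) ** 2
    return err
```

A perfect scan-match pose drives this error to zero; residual error reflects noise and pose estimation error, which the back end then reduces.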
3. Back-End Optimization and Loop Closure Detection
Visual SLAM maps are composed of feature points, and the biggest problem with feature maps is that they can be used only for localization, not for navigation. Grid maps built with low-cost LiDAR are more suitable for navigation, but suffer from sparse laser points and high noise. Therefore, this paper proposes a visual-LiDAR fusion method, which improves localization accuracy while ensuring a more precise map, and also solves the loop closure problem of LiDAR-built grid maps. The construction of two-dimensional grid maps relies mainly on a series of poses and LiDAR measurements, building the map incrementally according to the probability of grid occupancy. Since the visual information and LiDAR data observed at the same time are not completely independent, this paper proposes visual-LiDAR joint optimization to fully exploit the constraints between the data. The overall framework of the proposed LiDAR-visual joint optimization SLAM is shown in Figure 4.
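The incremental grid-map update mentioned above is commonly implemented with a per-cell log-odds value. The sketch below assumes fixed hit/miss increments (`L_OCC`, `L_FREE`), which are hypothetical tuning values rather than parameters from the paper.

```python
import math

L_OCC, L_FREE = 0.85, -0.4   # assumed log-odds increments (tunable)

def update_cell(logodds, hit):
    """Incrementally update one grid cell's log-odds with a new
    laser observation: add L_OCC on a hit, L_FREE on a miss."""
    return logodds + (L_OCC if hit else L_FREE)

def occupancy_prob(logodds):
    """Convert a cell's log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(logodds))

l = 0.0                      # unknown cell: probability 0.5
for _ in range(3):           # three consecutive laser hits
    l = update_cell(l, True)
```

Working in log-odds makes the update a simple addition per observation, which is why grid mapping scales well with the number of scans.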
3.1 Error Function
The error function between adjacent visual frames was given in detail in Section 2, and the overall framework for joint optimization is shown in Figure 4. The relationship between the reprojection coordinates and the three-dimensional spatial point, the feature point of the previous frame, and the pose transformation is as follows:
p̂2 = (1/Z)·K(RP + t)    (5)
Where K is the camera intrinsic matrix and Z is the depth value of the 3D point. The error function is then:

e = p2 − (1/Z)·K(RP + t)    (6)
The pose transformation (R,t) can be written in the corresponding Lie algebra form, and the Lie algebra transformation formula is:
T = exp(ξ^)    (7)

Where T is the transformation matrix containing (R, t), ξ is the corresponding six-dimensional Lie algebra vector, and ^ maps ξ to a 4×4 matrix.
In this system, there are m matching feature points and n laser data points in adjacent frames. In practical calculations, m and n can be limited to a certain number to reduce computational complexity. Visual SLAM alone has its own algorithm for calculating the pose transformation between adjacent frames, but due to the characteristics of image data, the calculated (R,t) is often less accurate than the pose transformation obtained through laser scanning matching. Therefore, the pose transformation obtained through laser scanning matching is used to estimate the initial value of the pose transformation in the error function.
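A minimal sketch of the joint cost over the m visual residuals and n laser residuals described above; the weights `w_v` and `w_l` are assumed tuning parameters, not values taken from the paper.

```python
def joint_cost(visual_residuals, laser_residuals, w_v=1.0, w_l=1.0):
    """Joint laser-visual cost: a weighted sum of the m squared
    reprojection residuals and the n squared laser residuals.
    The optimizer seeks the pose that minimizes this total."""
    return (w_v * sum(r * r for r in visual_residuals)
            + w_l * sum(r * r for r in laser_residuals))
```

Seeding this cost with the scan-matching pose, as the text describes, gives the nonlinear optimizer a good starting point and reduces the chance of converging to a poor local minimum.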
3.2 Sparse Pose Adjustment
Based on the observation model, the error function is clearly nonlinear, so this paper uses nonlinear optimization to find the variables that minimize it. Because the number of 3D feature points and laser data points is large, and the number of terms in the overall cost function grows over time, this paper optimizes only the pose variables in the error function. At the same time, pose graph optimization slows down as the number of vertex constraints increases, so this paper adopts the sparse pose adjustment method, exploiting the sparsity of the matrix to speed up the optimization. Since the robot poses are obtained by chaining successive pose transformations, i.e., given the pose transformation between every two adjacent frames, the current pose of the robot can be computed, the robot poses are used as the only variables to be optimized, while the visual observations and laser point clouds serve as constraints between poses. The relationship between robot poses and pose transformations is:

x_k = x_{k−1}·T_{k−1,k}    (8)

Where x_{k−1} and x_k are the poses of two adjacent frames and T_{k−1,k} is the pose transformation between them.
The error function can then be rewritten as a function of the poses:

e_k(x) = e(x_{k−1}, x_k)    (9)

Here x is the set of poses, i.e., the variable to be optimized is:
x = (x_1, x_2, …, x_k)^T    (10)
Where k is the number of poses to be optimized. Correspondingly, Δx represents the increment of the overall variable x. After adding the increment, the objective function becomes:
F(x + Δx) ≈ Σk ‖e_k(x) + J_k Δx‖²    (11)
Where J is the Jacobian matrix, i.e., the partial derivative of the cost function with respect to the variables; k is the number of poses to be optimized, which is 2 when optimizing adjacent frames and equals the number of poses between the current frame and the loop closure frame when optimizing globally. Pose optimization can be regarded as a least squares problem. Common methods for solving least squares problems include gradient descent, the Gauss-Newton method, and the Levenberg-Marquardt (LM) method. The LM method combines gradient descent with Gauss-Newton and generally performs best, so this paper uses it to solve the above least squares problem. The normal equation is augmented with a damping term, the Lagrange multiplier λ:
(H + λI)Δx = −g,  with H = JᵀJ and g = Jᵀe    (12)
In the LM method, a ratio ρ measures how well the approximate model agrees with the actual cost function: when ρ is close to 1, the approximation is good, λ is reduced, and the step approaches the Gauss-Newton method; when ρ is small, the approximation is poor, λ is increased, and the step approaches gradient descent. In general, the dimension of the H matrix is very large, and the complexity of matrix inversion is O(n³). However, since H encodes the constraints between vertices and only adjacent vertices constrain each other directly, most elements of H are zero, i.e., H is sparse. This sparsity can be exploited to greatly improve the computation speed. Iteratively solving formula (12) along the descent direction finally yields the variable x that minimizes the objective function, namely the optimized robot poses.
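The LM step of formula (12) can be illustrated on a toy scalar least squares problem (fitting y = a·x). This is a sketch of the damped normal equation only, not the paper's sparse pose adjustment, and all names are illustrative.

```python
def lm_step(a, xs, ys, lam):
    """One Levenberg-Marquardt step for the scalar model y = a*x.
    Solves (J^T J + lam) * da = -J^T r, the damped normal equation
    of formula (12); lam -> 0 recovers Gauss-Newton, while a large
    lam approaches a small gradient-descent step."""
    r = [a * x - y for x, y in zip(xs, ys)]    # residuals
    JtJ = sum(x * x for x in xs)               # J^T J  (J_i = x_i)
    Jtr = sum(x * ri for x, ri in zip(xs, r))  # J^T r
    return a - Jtr / (JtJ + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]      # true slope a = 2
a = 0.0
for _ in range(20):                            # damped iterations
    a = lm_step(a, xs, ys, 0.1)
```

In the pose-graph case the scalar `JtJ + lam` becomes the sparse matrix H + λI, and the same iteration is carried out with a sparse linear solver.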
3.3 Loop Closure Detection
Loop closure detection is a core problem in SLAM: recognizing previously visited locations effectively reduces accumulated error and improves localization accuracy. LiDAR-based SLAM algorithms often struggle with loop closure detection because laser data carries limited information, while the rich texture features of images can compensate for this deficiency. This paper employs the bag-of-words (BoW) model, the most commonly used approach in visual SLAM, constructing a dictionary over the visual features of the keyframes. When a loop closure is detected, the current pose is computed by matching the loop closure frame with the current frame, and this constraint is added to the back end for global optimization between the loop closure frame and the current frame, improving localization accuracy and preventing the map non-closure problem common in LiDAR-built grid maps.
Table 1 Comparison of Pose Estimation
Because the number of images collected while building indoor maps is very large and adjacent images are highly redundant, keyframes must be extracted first. The keyframe selection mechanism in this paper is as follows: (1) at least 15 frames have passed since the last global relocalization; (2) at least 15 frames have passed since the last keyframe insertion; (3) the frame tracks at least 50 3D feature points. Conditions (1) and (2) ensure that keyframes are distinctive, because the features in the field of view change little over a short time; condition (3) ensures robustness, since tracking too few map points makes the pose estimate unreliable. By contrast, keyframe selection in laser SLAM is simpler and more stable, but using the laser as the selection criterion tends to produce keyframes with too few image feature points and discontinuous images. Therefore, this paper adopts a vision-based keyframe selection mechanism, ensuring that adjacent keyframes are not too close together and that there is enough information for matching between keyframes. Common image features include the Scale-Invariant Feature Transform (SIFT) [26], Speeded-Up Robust Features (SURF) [27], and ORB. SIFT features are rotation- and scale-invariant, highly stable, and insensitive to illumination and noise, making them the ideal choice in principle; however, SIFT extraction is slow and cannot guarantee real-time performance, so it is unsuitable for SLAM. SURF performs similarly to SIFT but likewise takes too long to compute. Therefore, this paper chooses ORB to construct the bag-of-words model: while maintaining rotation and scale invariance, ORB is significantly faster than SIFT and SURF.
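The three keyframe conditions above can be sketched as a simple predicate. It assumes all three conditions must hold simultaneously, and the function and parameter names are illustrative.

```python
def is_keyframe(frames_since_reloc, frames_since_kf, tracked_points):
    """Keyframe test from the text: at least 15 frames since the last
    global relocalization, at least 15 frames since the last keyframe
    insertion, and at least 50 tracked 3D feature points."""
    return (frames_since_reloc >= 15
            and frames_since_kf >= 15
            and tracked_points >= 50)
```

The two frame-count thresholds space keyframes apart in time, while the feature-count threshold rejects frames that would make matching unreliable.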
BoW measures the similarity between the current frame and each keyframe by comparing the features contained in the images. When the similarity indicates that the current frame is sufficiently close to some keyframe, the robot is considered to have returned to a position near that keyframe, i.e., a loop closure has occurred. After the loop closure, ICP matching of the laser point clouds is performed between the current frame and that keyframe, and the current robot pose is computed using the keyframe's pose as a reference. The current frame pose, the image feature points shared by the current frame and the loop closure frame, and the laser point clouds are added as constraints to the graph optimization framework, globally optimizing the series of poses between the loop closure keyframe and the current keyframe. Finally, the map is rebuilt from the optimized poses and the laser data attached to each pose, eliminating the non-closed portions of the map.
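A minimal sketch of BoW-based loop closure scoring. It uses cosine similarity over word histograms as a stand-in for the actual DBoW-style scoring used with ORB vocabularies, and the threshold value is an assumed parameter.

```python
import math

def bow_similarity(h1, h2):
    """Cosine similarity between two bag-of-words histograms
    (word id -> weight); 1.0 means identical word distributions."""
    words = set(h1) | set(h2)
    dot = sum(h1.get(w, 0.0) * h2.get(w, 0.0) for w in words)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def detect_loop(current, keyframes, threshold=0.8):
    """Return the index of the best-matching keyframe if its
    similarity exceeds the threshold, else None."""
    best_i, best_s = None, threshold
    for i, kf in enumerate(keyframes):
        s = bow_similarity(current, kf)
        if s > best_s:
            best_i, best_s = i, s
    return best_i
```

A detected index would then trigger the laser ICP match and the global pose-graph optimization described above.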
4. Experiments
The experiments in this paper are divided into two parts. The first is a comparative experiment on fixed-point localization accuracy in a small-scale scene, collecting localization data for both the traditional graph-optimized laser SLAM method (Karto) and the laser-visual joint method proposed in this paper. The second is a loop closure experiment that verifies whether the proposed method can effectively solve the map non-closure problem that may occur in laser SLAM.
4.1 Experimental Platform and Environment
The experiments in this paper were conducted on a TurtleBot2 equipped with a laptop, a LiDAR, and a depth camera. The laptop had an Intel Core i5 processor and 8 GB of RAM, running Ubuntu 14.04 with ROS Indigo. The 2D LiDAR was a single-line RPLIDAR2 with a manually set sampling frequency of 5–15 Hz and a measurement radius of 8 m. The depth camera was an Orbbec Astra, with an effective depth range of 0.6–8 m, an accuracy of 3 mm, and a field of view of 58° horizontally and 45.5° vertically.
The experiment was conducted in Building B of the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. A world coordinate system was established with the robot's starting position. Five landmark locations were selected between the elevators in sections B and C on the 5th floor of Building B. Pose measurements were performed using both the Karto method and the method proposed in this paper. The experimental process is shown in Figure 5. Starting from point 0, with point 0 as the origin of the world coordinate system, the direction from 0 to 1 is the x-axis, and the direction from 2 to 3 is the -y-axis. The robot moved sequentially along points 1, 2, 3, 4, and 5. The actual coordinates of each point are shown in Table 1. The robot's actual pose was controlled by time and velocity. To ensure the stability of visual feature extraction, the robot maintained a constant linear velocity of 0.2 m/s and an angular velocity of 30°/s. That is, the robot only had three states: moving forward at a constant speed of 0.2 m/s, turning right at a constant speed of 30°/s, and stopping. Moving forward for 15 seconds equates to moving 3 m; turning right for 3 seconds equates to turning 90°. Finally, the robot was controlled to accurately reach the five landmark points by issuing linear and angular velocity commands and timing the movement. The second part of the experiment takes place in the workstations on the 4th floor, Block B, and the 3rd floor of Block C at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. The workstations on the 4th floor form a small closed loop, while the workstations on the 3rd floor of Block C form a large closed loop.
4.2 Analysis of Experimental Results
The localization results and error comparisons are shown in Table 1. The data show that when relying solely on the laser, the initial error is small, but the error between the measured and actual poses grows as the traveled distance increases. The laser used in the experiment is low-cost and its beam is not dense enough, so there is not enough available data to filter out the measurement error, producing accumulated error. Although the laser-visual joint optimization method proposed in this paper also accumulates error, the added visual constraints effectively reduce it and yield higher localization accuracy. Figures 6(a) to (d) show the grid maps constructed by Karto [3] and by the proposed method, respectively; Figure 6(e) shows the sparse feature point map of the 3rd floor of Zone C constructed by ORB-SLAM2. The robot poses are represented by a series of red arrows. Figures 6(a) and (b) show the small-scale loop of the workstation environment. A non-overlapping region can be seen at the white circle in Figure 6(a), although the effect is not obvious because the scene is small.
Figures 6(c) and 6(d) show the maps constructed for the large-scale scene in Zone C. Because the scene is large, the cumulative error grows over time. In Figure 6(c), the map region circled in white clearly fails to overlap: the cumulative error degrades the robot's localization, and as the localization deviation increases, the map built from the robot's poses and laser data fails to close. In contrast, in Figure 6(d), when the robot returns to its starting position after a full circle, a loop closure is detected; the current frame pose is computed relative to the starting position and used as a constraint for global optimization, and the map is then updated using all optimized poses and their associated laser data, eliminating the non-closed region. The experimental results show that the proposed method can effectively detect loop closures and optimize all poses between the two frames. After pose optimization, the map is rebuilt from the laser data attached to each pose, successfully eliminating the non-closed portions of the map. However, the addition of visual information places a tighter limit on the robot's speed during mapping and reduces its tolerance of dynamic obstacles: mapping can only be carried out when no one is moving nearby, and the camera must not be pointed at large white walls or other featureless surfaces.
5. Summary and Outlook
This paper investigates the multi-sensor fusion problem in SLAM. Addressing the high noise of low-cost LiDAR and the unsuitability of visually constructed sparse feature maps for navigation, a localization and mapping scheme combining LiDAR and vision is proposed. Using a sparse pose adjustment-based method, LiDAR data and image information are jointly optimized, effectively improving the accuracy of localization and mapping. Loop closure detection is achieved through a visual bag-of-words model, solving the difficulty of LiDAR loop closure detection. However, the addition of visual information reduces system robustness. Although indoor lighting changes are usually small, numerous dynamic obstacles, such as moving pedestrians captured by the camera, can introduce errors into the visual constraints and degrade the overall localization and mapping performance. Furthermore, when facing objects such as white walls, not enough features can be extracted, causing the visual constraints to fail and mapping to fail. Future work will leverage the respective advantages of LiDAR and vision to improve overall robustness and ensure effective mapping.