The process includes instance segmentation, feature matching, and point set registration. First, single-view 3D semantic scene segmentation is performed using RGB images, encapsulating common object classes from the 2D dataset into point clouds of object instances. Then, 3D corresponding points between two consecutive segmented point clouds are extracted by matching keypoints between objects of interest in the RGB images. Furthermore, a distribution estimated by kernel density estimation (KDE) is used to weight each pair of 3D points, providing robustness even when few centrally located correspondences are available for estimating the rigid transformation between point clouds. Finally, the process was tested on a 7-DOF dual-arm Baxter robot, and the results show that the robot can successfully segment objects, register multiple views during movement, and grasp target objects.
Reader's understanding:
This paper introduces an end-to-end processing flow for RGB-D perception in mobile collaborative robots. The flow includes instance segmentation, feature matching, and registration, aiming to help the robot understand the scene and perform operations during movement. The proposed method first segments objects of interest in the scene and matches features in consecutive RGB images as the robot moves, then uses depth maps to obtain 3D correspondences. These 3D correspondences are statistically weighted using kernel density estimation (KDE), and rigid point cloud alignment is then performed on the weighted correspondences. Experimental results demonstrate that in tests conducted on a real robot, the robot successfully understands the scene and grasps target objects, validating the effectiveness of the proposed method. The main contribution of this paper is a comprehensive processing flow, providing an important reference for the perception and manipulation of mobile robots in complex environments.
1 Introduction
This paper introduces the importance of egocentric vision for both human and machine perception, particularly in dense environments. To improve the operational tasks of autonomous robots, 3D perception of the spatial information of objects of interest is necessary. Segmentation and registration are typically performed separately, and deploying both processes simultaneously leads to high computational costs. Therefore, this paper aims to realize a lightweight egocentric pipeline for 3D segmentation, feature matching, and scene reconstruction to improve the performance of vision-based indoor mobile collaborative robots. Existing work mainly focuses on learning matching features between images, but for indoor mobile collaborative robots, the spatial occupancy information of objects of interest also matters. To fill the gaps in previous work and improve 3D semantic scene perception in vision-based mobile collaborative robots, this paper makes three contributions:
(1) A robust method for extracting and statistically weighting 3D corresponding points for rigid point cloud alignment.
(2) An end-to-end segmentation, feature matching, and global registration process for an egocentric robot with binocular vision.
(3) A validation of the proposed method on a real robot system.
2 Egocentric 3D Object Segmentation
This section proposes an algorithm for egocentric object segmentation in RGB-D frames. The algorithm first obtains a depth image D and an RGB image I from the image stream, then segments the objects of interest in I to obtain an object mask MI. Next, holes in D are filled to ensure result quality, and D is aligned with I. Depth pixels falling outside MI are then discarded, and the remaining masked depth pixels are back-projected into a point cloud PM. Finally, PM is cleaned by removing outliers caused by residual holes in the depth image.
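The masking, hole-filling, back-projection, and outlier-removal steps above can be sketched as follows (a minimal numpy sketch; the intrinsics fx, fy, cx, cy, the median-based hole filling, and the radius/neighbor thresholds are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

def depth_to_masked_cloud(depth, mask, fx, fy, cx, cy,
                          nb_radius=0.05, min_neighbors=3):
    """Back-project masked depth pixels into a point cloud PM.

    Holes (zero depth) inside the mask are filled with the median of
    valid masked depth; pixels outside the mask are discarded; points
    with too few neighbors within nb_radius are removed as outliers.
    """
    d = depth.astype(np.float64).copy()
    valid = (d > 0) & mask
    if valid.any():
        d[mask & (d == 0)] = np.median(d[valid])  # simple hole filling
    v, u = np.nonzero(mask)                        # masked pixel coords
    z = d[v, u]
    x = (u - cx) * z / fx                          # pinhole back-projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    # radius-based outlier removal (O(n^2); fine for a sketch)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbors = (dist < nb_radius).sum(axis=1) - 1  # exclude self
    return pts[neighbors >= min_neighbors]
```

A KD-tree (e.g. `scipy.spatial.cKDTree`) would replace the quadratic neighbor search at scale.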
3 Feature Detection and Matching
This section introduces the feature detection and matching algorithm used on the segmented objects. First, 1D positional embeddings are extended to the 2D domain to improve the feature-extraction learning process, and a feature extraction network is designed. Then, the segmentation mask is used to produce a masked RGB image of each object for SuperPoint, ensuring that the feature scanning region lies within the masked region. Next, for each pair of corresponding objects in two consecutive frames, a masked RGB image is created, and the retrained SuperPoint is applied to each image pair to extract and match 2D keypoints within each object instance. Finally, the matched features are aggregated, and 3D correspondences between point clouds are computed. This avoids feature matching between unrelated objects and improves the accuracy and consistency of object instances.
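The per-object matching restriction can be illustrated with a simple mutual nearest-neighbor matcher over descriptors grouped by instance label (a stand-in sketch: `ids1`/`ids2` are instance labels from the segmentation masks, and L2 mutual-NN replaces SuperPoint's learned matcher; all names are illustrative):

```python
import numpy as np

def match_within_objects(desc1, ids1, desc2, ids2):
    """Mutual nearest-neighbor matching restricted to matching
    object instances, so keypoints of unrelated objects never pair up.

    desc*: (N, D) descriptor arrays; ids*: (N,) instance labels.
    Returns a list of (index_in_frame1, index_in_frame2) matches.
    """
    matches = []
    for obj in np.intersect1d(ids1, ids2):   # objects present in both frames
        i1 = np.nonzero(ids1 == obj)[0]
        i2 = np.nonzero(ids2 == obj)[0]
        # pairwise L2 distances between this object's descriptors only
        d = np.linalg.norm(desc1[i1, None, :] - desc2[None, i2, :], axis=-1)
        fwd = d.argmin(axis=1)               # best match frame 1 -> 2
        bwd = d.argmin(axis=0)               # best match frame 2 -> 1
        for a, b in enumerate(fwd):
            if bwd[b] == a:                  # keep mutually consistent pairs
                matches.append((int(i1[a]), int(i2[b])))
    return matches
```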
4 Point Cloud Alignment and Registration
This section details the point cloud alignment and registration process, which mainly includes two key steps: importance weighting of 3D correspondences and point cloud alignment for rigid motion.
Importance weighting of 3D correspondences:
Weight initialization: Initialize the weight of each point based on the number of neighboring points within a specific radius around it.
Density estimation: The density of the unknown correspondence distribution is estimated using KDE with the Improved Sheather-Jones (ISJ) bandwidth selector for robustness.
Weight update: Update the weight of each point according to the density function to better represent its importance.
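The three weighting steps above can be sketched as follows (assumptions: scipy's `gaussian_kde` with its default Scott's-rule bandwidth stands in for the ISJ selector, and the neighbor radius and the multiplicative weight update are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

def correspondence_weights(points, radius=0.1):
    """Weight 3D correspondence points by local density.

    Step 1: initialize weights as neighbor counts within `radius`.
    Step 2: estimate the point density with a Gaussian KDE.
    Step 3: update weights by the density and normalize to sum to 1,
    so isolated (likely spurious) correspondences contribute less.
    """
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    w0 = (dist < radius).sum(axis=1).astype(float)  # neighbor counts (incl. self)
    density = gaussian_kde(pts.T)(pts.T)            # KDE evaluated at each point
    w = w0 * density
    return w / w.sum()
```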
Point cloud alignment for rigid motion:
Calculate the translation vector: Calculate the weighted centroid for translating the point cloud.
Calculate the rotation matrix: The rotation matrix is obtained through singular value decomposition and used to rotate the point cloud.
Define a rigid transformation matrix: Combine the translation vector and rotation matrix into a rigid transformation matrix.
Point cloud alignment: Apply a rigid transformation matrix to align two multi-view point clouds.
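The four alignment steps above correspond to the standard weighted Kabsch/SVD solution, which can be sketched as (a minimal sketch under the rigid-motion assumption; no scale, reflections handled via the determinant sign):

```python
import numpy as np

def weighted_rigid_transform(P, Q, w):
    """Weighted rigid alignment of correspondences P -> Q.

    Translation from weighted centroids, rotation from the SVD of the
    weighted cross-covariance, combined into one 4x4 rigid transform.
    P, Q: (N, 3) corresponding points; w: (N,) importance weights.
    """
    w = np.asarray(w, dtype=float) / np.sum(w)
    cp = w @ P                                   # weighted centroid of P
    cq = w @ Q                                   # weighted centroid of Q
    H = (P - cp).T @ np.diag(w) @ (Q - cq)       # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                           # rotation mapping P onto Q
    t = cq - R @ cp                              # translation vector
    T = np.eye(4)                                # 4x4 rigid transformation
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

Applying `T` to the homogeneous coordinates of the first point cloud aligns the two views.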
5 Experiments
Performance of SuperPoint with positional embedding: SuperPoint was retrained with 2D positional embeddings on the MS COCO 2014 dataset and fine-tuned on interest points labeled by MagicPoint. The variant with 128-dimensional positional embeddings was trained with image augmentations such as random brightness and contrast, Gaussian noise, shadows, and motion blur, for 10 epochs (300,000 iterations) on an NVIDIA RTX 4090 GPU. Experimental results show that the retrained SuperPoint performs well on the HPatches dataset and exhibits strong robustness, especially under common illumination and viewpoint changes.
Point cloud alignment error at multiple angles: The camera was moved on a planar surface 2 m from the scene to offset angles of 0° (initial position), ±10°, ±20°, ±30°, and ±45°, and the root mean square error (RMSE) between the two corresponding point sets Kt−1 and Kt was calculated. The RMSE grows as the offset angle increases, and across these settings the KDE-based weighting reduces the alignment error.
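The RMSE metric used in this evaluation can be written down directly (a small sketch; `T` is the estimated 4x4 rigid transform taking frame t−1 into frame t, and all names are illustrative):

```python
import numpy as np

def alignment_rmse(K_prev, K_t, T):
    """RMSE between transformed keypoints from frame t-1 and the
    corresponding keypoints in frame t.

    K_prev, K_t: (N, 3) corresponding 3D point sets; T: (4, 4) transform.
    """
    P = np.asarray(K_prev, dtype=float)
    Ph = np.c_[P, np.ones(len(P))]        # homogeneous coordinates
    aligned = (T @ Ph.T).T[:, :3]         # apply the rigid transform
    err = aligned - np.asarray(K_t, dtype=float)
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```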
Deployment experiments on Baxter robots:
Experimental setup: An Intel RealSense D435i RGB-D camera was mounted on the Baxter robot, and a scene consisting of a table, a chair, a bag, and two plastic cups was set up.
Robot movement and multi-view capture: Baxter first stands in one position to capture one view, then moves to another angle to capture a second view. Its movement is supported by the Dataspeed mobile base and synchronized through ROS messages.
Multi-view point cloud segmentation and alignment: After capturing multi-view point clouds, Baxter first segments the objects in the scene, then matches the 3D correspondence between the two views, and finally solves the rigid alignment of the weighted 3D correspondence, thus gaining an understanding of the scene.
Approaching and Grasping Target Objects: Baxter demonstrates the feasibility of using 3D semantic scene awareness for robotic grasping, effectively grasping objects when they are within the robot's workspace.
Runtime on legacy hardware: YOLOv8n was deployed on Intel HD Graphics 4000 using the OpenVINO library to evaluate the runtime of each step: segmentation, keypoint extraction and matching, keypoint weighting, and point cloud alignment.
6 Conclusion
This study proposes an end-to-end process for RGB-D perception in mobile collaborative robots, comprising instance segmentation, feature matching, and registration. Experiments on a real robot validate the effectiveness of the proposed method, demonstrating the robot's ability to understand the scene and perform manipulation.