From the perspective of machine vision, this article analyzes how robots accurately locate the objects they grasp, progressing from the simplest topics to the most complex: camera calibration, planar object detection, textured objects, texture-less objects, deep learning, and integration with task/motion planning.
First, we need to understand that machine vision in robotics differs somewhat from computer vision: the purpose of machine vision is to provide robots with the information needed to manipulate objects. Research in machine vision therefore generally covers the following areas:
1. Object Recognition: Detecting object types in images, which largely overlaps with computer vision (CV) research.
2. Pose Estimation: Estimating the position and orientation of an object in the camera coordinate system. For a robot to grasp something, it needs to know not only what the object is, but also where it is.
3. Camera Calibration: The above steps only give the object's pose in the camera coordinate system, so we also need the relative pose between the camera and the robot in order to convert the object's pose into the robot's coordinate system.
Of course, I'm mainly referring to machine vision in the field of object grasping here; I won't talk about other fields such as SLAM for now.
Since vision is a crucial aspect of robot perception, there's been a great deal of research on it. I'll introduce some of what I know, starting with the simplest and progressing to the most complex:
I. Camera Calibration
This is actually a relatively mature field. Since object recognition only gives the object's pose in the camera coordinate system, while the robot needs the pose in its own coordinate system to manipulate the object, we first need to calibrate the camera's pose.
I won't go into detail about intrinsic calibration; refer to Zhang Zhengyou's paper or the various calibration toolboxes.
For extrinsic calibration, there are two setups depending on where the camera is mounted:
Eye-to-Hand: The camera is fixed relative to the robot's base (world coordinate system) and does not move with the robotic arm.
Eye-in-Hand: The camera is mounted on the robotic arm and moves with it.
The two setups share a similar solution approach; let's start with the Eye-to-Hand case.
Simply fix a chessboard to the end of the robotic arm and move it through several poses within the camera's field of view. The camera can compute the pose A_i of the chessboard relative to the camera coordinate system, the robot's forward kinematics gives the pose E_i of the end effector relative to the robot's base, and the pose of the chessboard relative to the end effector remains fixed. Writing the unknown transform as X, the relative motions between pairs of poses form a coordinate loop of the form AX = XB.
Similarly, in the Eye-in-Hand case, a chessboard can be placed on the ground (fixed relative to the robot's base), and the arm moves the camera through several poses, again forming a coordinate loop of the form AX = XB.
II. Detection of Planar Objects
This is the most common scenario on industrial production lines. The requirements for vision in this setting are speed, accuracy, and stability. Therefore, the simplest edge extraction + edge matching / shape matching methods are generally used; moreover, to improve stability, controlled lighting and a high-contrast background are typically employed to reduce system variables.
Currently, many smart cameras (such as Cognex) have these functions built in; moreover, since objects are generally placed on a plane, the camera only needs to compute the object's three-degree-of-freedom pose (x, y, θ)^T.
In addition, this type of application is generally used to process a specific workpiece, which is equivalent to only having pose estimation, but not object recognition.
Of course, the pursuit of stability in industry is understandable. However, with the increasing demands for production automation and the rise of service robots, estimating the complete pose (x, y, z, rx, ry, rz)^T of more complex objects has become a research hotspot in machine vision.
III. Textured objects
The field of robot vision was one of the first to study textured objects, such as beverage bottles and snack boxes, which have rich textures on their surfaces.
Of course, these objects can still be handled using methods similar to edge extraction and template matching. However, in actual robot operations, the environment is much more complex: lighting conditions are uncertain (lighting), the distance between the object and the camera is uncertain (scale), the angle at which the camera views the object is uncertain (rotation, affine), and the object may even be occluded by other objects (occlusion).
Fortunately, a genius named Lowe proposed a super-powerful local feature called SIFT (Scale-Invariant Feature Transform).
For a detailed explanation of the underlying principles, refer to the paper above (cited over 40,000 times) or any of the many blog posts about it. Simply put, each extracted feature depends only on the texture of a small patch of the object's surface: it is largely invariant to lighting, scale, and rotation (and robust to moderate affine change), and does not depend on the rest of the object.
Therefore, by using SIFT feature points, we can directly find the same feature points in the camera image as those in the database, thus determining what the object in the camera is (object recognition).
For objects that do not deform, the positions of feature points in the object's coordinate system are fixed. Therefore, after obtaining several point pairs, we can directly solve for the homography matrix between the objects in the camera and the objects in the database.
If the 3D positions of the feature points are known — for example, measured with a depth camera (such as the Kinect) or stereo vision when building the model — then we can solve the resulting PnP problem to compute the object's pose in the current camera coordinate system.
Of course, many details still need attention to make this truly usable in practice: using point-cloud segmentation with Euclidean clustering to remove the background, selecting objects whose features are relatively stable (SIFT features are not always perfectly repeatable), using approximate methods to accelerate matching, and so on.
Moreover, in addition to SIFT , a whole host of similar feature points have emerged since then, such as SURF and ORB .
IV. Objects without texture
Okay, objects with rich texture are easy to handle, but many objects in daily life or industry lack texture:
The most natural question is: is there a feature point that can describe an object's shape with SIFT-like invariance?
Unfortunately, to my knowledge, there is currently no such feature point.
Therefore, a large class of methods still relies on template matching, but with carefully chosen matching features (not just simple features such as edges).
Here, I'd like to introduce LineMod, an algorithm that our lab has used and reproduced:
In short, this paper uses both the image gradient of the color image and the surface normal of the depth image as features to match templates in the database.
Since the templates in the database are generated from multiple perspectives of an object, the object pose obtained through matching can only be considered a preliminary estimate and is not accurate.
However, once we have this preliminary estimate of the object's pose, we can directly use the ICP ( Iterative Closest Point ) algorithm to match the object model with the 3D point cloud, thereby obtaining the object's precise pose in the camera coordinate system.
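Production pipelines typically use a library implementation (e.g. Open3D or PCL), but the core of point-to-point ICP fits in a few lines of NumPy/SciPy: find nearest-neighbour correspondences, solve the best rigid transform with the Kabsch/SVD method, and iterate. The synthetic "scene" below is simply a transformed copy of the model, mimicking the "coarse estimate already close" regime that LineMod leaves us in:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(src, dst, T_init=np.eye(4), iters=50):
    """Point-to-point ICP: refine a rigid transform T so that T*src aligns with dst."""
    T = T_init.copy()
    tree = cKDTree(dst)                      # nearest-neighbour lookup in the scene cloud
    for _ in range(iters):
        p = src @ T[:3, :3].T + T[:3, 3]     # model under the current estimate
        _, idx = tree.query(p)               # correspondences: closest scene point
        q = dst[idx]
        # Optimal rigid transform p -> q via the Kabsch/SVD method.
        pc, qc = p - p.mean(0), q - q.mean(0)
        U, _, Vt = np.linalg.svd(pc.T @ qc)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:             # guard against a reflection solution
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = q.mean(0) - R @ p.mean(0)
        dT = np.eye(4)
        dT[:3, :3], dT[:3, 3] = R, t
        T = dT @ T                           # accumulate the incremental update
    return T

def rotvec_to_R(v):
    """Rodrigues' formula: rotation matrix from an axis-angle vector."""
    th = np.linalg.norm(v)
    k = v / th
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * (K @ K)

# Demo: recover a small ground-truth motion of a random point cloud.
rng = np.random.default_rng(3)
model = rng.uniform(-0.5, 0.5, (200, 3))
R_true = rotvec_to_R(np.array([0.05, -0.03, 0.06]))
t_true = np.array([0.03, 0.02, -0.01])
scene = model @ R_true.T + t_true
T_est = icp(model, scene)
```

ICP only converges to the nearest local minimum, which is exactly why the coarse LineMod estimate is needed as the starting point.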
Of course, there are still many details to consider in the actual implementation of this algorithm: how to build the templates, how to represent the color gradients, etc. Additionally, this method cannot handle occluded objects (lowering the matching threshold can tolerate partial occlusion, but it also produces false positives).
Regarding the issue of partial occlusion, Dr. Zhang from our lab made improvements to LineMod last year, but since the paper has not yet been published, we will not go into detail about it for now.
V. Deep Learning
Since deep learning has achieved very good results in the field of computer vision, we who work in robotics will naturally try to apply DL to object recognition in robots.
First, for object recognition, we can directly apply the research results of deep learning (DL), simply using the various CNNs. In the 2016 Amazon Picking Challenge, many teams used DL as their object recognition algorithm.
However, in this competition, although many teams used deep learning for object recognition, they still used relatively simple or traditional algorithms for pose estimation; DL had not yet taken over that step. As Zhou Bolei mentioned, the common practice was to use a semantic segmentation network to segment the object in the color image, and then run ICP between the segmented point cloud and the object's 3D model.
Of course, there are also works that directly use neural networks for pose estimation, such as this one:
Its method is roughly as follows: for each object, take many small RGB-D patches (each patch is processed independently, using local features to cope with occlusion); each patch is labeled with its coordinate in the object's frame; an autoencoder first reduces the dimensionality of the patch data, and the reduced features are then used to train a Hough forest.
VI. Integration with Task/Motion Planning
This part is also quite interesting research content. Since the purpose of machine vision is to provide information for robots to manipulate objects, it is not limited to object recognition and localization in the camera, but often needs to be combined with other modules of the robot.
Suppose we ask the robot to fetch a bottle of Sprite from the refrigerator, but the Sprite is blocked by a bottle of Mirinda.
A human would simply move the Mirinda aside first, then take the Sprite.
Therefore, the robot first needs to determine visually that the Sprite is behind the Mirinda, and also that the Mirinda is something that can be moved, unlike a fixed object such as the refrigerator door.
Of course, combining vision with robotics will lead to many other interesting new things. Since that's not my own research area, I won't presume to offer any further insights.