This article is compiled from a manuscript titled "Artificial Intelligence is Driving the Development of Robotics" by Professor Tao Dacheng, IEEE Fellow, of the Institute for Artificial Intelligence at the University of Sydney.
Previous generations of robots relied on computational intelligence. Robots of this era have achieved perceptual intelligence to a certain extent through various sensors. Future robots will further achieve high-performance perceptual intelligence and cognitive intelligence. The development of robots driven by artificial intelligence is mainly based on four elements of AI: Perceiving, Learning, Reasoning, and Behaving.
Robot perception and interaction capabilities
Perception means using various sensors to acquire environmental information, enabling robots to understand their surroundings. We currently focus on information acquired by cameras, because it supports many tasks, such as object detection, tracking, and scene analysis. This allows robots to perform tasks that humans would normally perform in our environment, thereby extending human intelligence into robotics.
Object detection is effortless for humans; we can easily spot a cup or a person in a scene. For robots the goal is the same, but achieving high-performance detection is challenging. Traditional object detection scans a window across the image: the window starts at the top-left pixel, moves pixel by pixel to the bottom-right, and the scan is repeated at multiple window sizes. This approach can only detect one specific type of object, and even then it is very inefficient: the windows generated during the scan are highly redundant, and many of them contain no object at all. This motivates using a small neural network to quickly find image regions likely to contain the objects we are interested in. This network, called a proposal network, is now widely used. Once we have these candidate regions, a high-precision classification network classifies each one, enabling rapid object detection in the scene.
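The two-stage idea can be illustrated with a toy sketch (these function names are ours, for illustration only, not the networks from the talk): redundancy between windows is measured with intersection-over-union (IoU), and a cheap scoring stage stands in for the proposal network, keeping only a handful of candidates for the expensive classifier.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def top_proposals(boxes, scores, k):
    """Stand-in for a proposal network: keep only the k highest-scoring
    candidate windows instead of classifying every scanned window."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:k]]
```

In a real system the scores come from a small learned network, and the surviving proposals are passed to the high-precision classifier.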
With an efficient detection framework, what can we do? For example, counting the people in a crowded photo manually would be extremely time-consuming, but a machine can do it: on one photo taken with about 1,000 people, a face detector found approximately 850 faces. The gap arises because some people are too far from the camera, leaving their faces at low resolution, and some people in the back are occluded by those in front; detecting these faces remains challenging. The same framework can also detect vehicles, both by day and at night.
The robot operates in a dynamic environment: people and objects move, and the robot itself is in motion. To understand the behavior of both objects and people, the robot must track them. Let's first discuss single-object tracking. The difficulty of this task stems from factors such as lighting changes and object deformation. To track a moving object stably over a long period, tracking alone is insufficient; typically, we combine tracking with detection.
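The combination of tracking and detection can be sketched very simply (a minimal illustration, not the actual tracker discussed): in each new frame, the previous target box is associated with the detection that overlaps it most, and a failed association signals that the target was lost and must be re-detected.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def track_step(prev_box, detections, min_iou=0.3):
    """One step of tracking-by-detection: keep the target by matching it
    to the best-overlapping detection in the new frame.  Returning None
    signals that the target was lost this frame."""
    best, best_iou = None, min_iou
    for d in detections:
        o = iou(prev_box, d)
        if o > best_iou:
            best, best_iou = d, o
    return best
```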
After tracking a single target, it's often necessary to track multiple targets. For example, in a surveillance scenario involving many people, this is clearly a multi-target tracking problem. Besides the challenges of single-target tracking, multi-target tracking also presents the challenge of occlusion between moving objects. Multi-target tracking has many applications, such as in autonomous driving, where we need to understand the behavior of everyone within the target area. Another question arises: why do we need home service robots or social robots? Beyond expecting these robots to help us with simple household chores, we also hope they can engage in emotional communication with us.
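Extending this to many targets turns association into an assignment problem. A minimal greedy sketch is shown below (production trackers often use the Hungarian algorithm instead; this is our simplification, not the system from the talk). An occluded target simply receives no match and must be re-acquired when it reappears.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def associate(tracks, detections, min_iou=0.3):
    """Greedy frame-to-frame association: repeatedly match the remaining
    track/detection pair with the highest IoU."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    used_t, used_d, matches = set(), set(), []
    for o, ti, di in pairs:
        if o < min_iou:
            break                      # remaining pairs overlap too little
        if ti in used_t or di in used_d:
            continue                   # track or detection already matched
        used_t.add(ti)
        used_d.add(di)
        matches.append((ti, di))
    return matches
```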
Let's take a look at this video. In this setup, we first need to solve the problem of multi-camera video stitching: the football match video uses four cameras, and the basketball match video uses two. Through camera calibration, we can stitch the videos accurately. With such a stitched video, we can understand the players' positioning on the field. Combined with person re-identification and facial recognition, we can even identify each player. Furthermore, with human pose estimation, we can understand each player's every movement with precision. With this information as input, the robot can understand the game state of the two teams, making this kind of human-computer interaction very interesting.
For robots to fully understand a scene, they rely heavily on scene segmentation. Scene segmentation helps a robot identify the objects in a scene and their locations, including details such as size, volume, and even fine-grained attribute labels. Currently, deep neural networks can annotate static scenes fairly accurately; in moving scenes, machines can also achieve fairly accurate segmentation, assisting tasks such as autonomous driving. High-performance scene segmentation requires efficient fusion of multi-feature, multi-scale information.
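The simplest possible form of multi-scale fusion can be sketched in a few lines (a toy illustration under the assumption of even feature-map dimensions, not the fusion scheme used in the actual segmentation networks): a feature map is averaged with a coarser view of itself, so each position sees both fine and contextual information.

```python
import numpy as np

def downsample2(f):
    """2x2 average pooling (assumes even height and width)."""
    h, w = f.shape
    return f.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(f):
    """Nearest-neighbour 2x upsampling."""
    return f.repeat(2, axis=0).repeat(2, axis=1)

def fuse_scales(f):
    """Average a feature map with a coarser view of itself -- the simplest
    possible multi-scale information fusion."""
    return 0.5 * (f + upsample2(downsample2(f)))
```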
When we look at a scene, we know which objects are near and which are far. Scene segmentation tells us what objects are present and where; we also need the scene's depth information to know how far away those objects are. Distance information is essential for a robot to navigate, localize, and grasp objects. Most robots today carry only a single camera, so we need to obtain depth from a single photo. This is a very difficult problem, but we can train a deep neural network on a large number of historical image-depth pairs: we feed a color image into a deep convolutional network and output a depth map. Even with abundant data pairs, however, good results are hard to achieve. While working on this problem, we made two discoveries: (1) directly regressing depth on a high-resolution color image is very inaccurate, but if we quantize the depth into several discrete bins and turn the task into a classification problem, we obtain very good results; (2) we still need continuous depth, and if we reduce the resolution of the color image and regress continuous depth on the low-resolution image, we also obtain very good results. The remaining question is how to combine the two findings effectively to achieve high-precision depth regression from a single image.
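The first finding, quantizing depth so it can be classified, is easy to make concrete. The sketch below (our own illustration, with hypothetical bin counts) discretizes continuous depth into class labels and maps predicted labels back to metric depth via bin centres; the reconstruction error is bounded by half the bin width.

```python
import numpy as np

def quantize_depth(depth, d_min, d_max, n_bins):
    """Turn continuous depth into discrete class labels, so a network can
    classify rather than regress."""
    edges = np.linspace(d_min, d_max, n_bins + 1)
    labels = np.clip(np.digitize(depth, edges) - 1, 0, n_bins - 1)
    return labels, edges

def reconstruct_depth(labels, edges):
    """Map predicted bin labels back to metric depth via bin centres."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[labels]
```

With 20 bins over a 0-10 m range, the bin width is 0.5 m, so quantization alone costs at most 0.25 m of accuracy, which is the trade-off motivating the second finding.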
We don't expect such a system to replace a 16-beam or 64-beam LiDAR. After all, there's still a significant gap between its regression accuracy and LiDAR's measurement accuracy. However, such a system is very effective for applications that don't require high-precision depth, and it can be fused with LiDAR to obtain high-resolution depth over the whole scene.
Modern robots can easily recognize, say, the five or six people in a family. By slightly increasing model complexity, they can identify not only family members but also their friends, handling larger scenarios to a certain extent. This is mainly thanks to deep neural networks.
Facial recognition is a very direct means of identity authentication. Of course, we can also use gait and even clothing information. Smart-city systems with multi-camera networks pose a problem: how do we reconstruct a person's movement trajectory as they pass through several cameras? This can be solved with person re-identification. We can even use clothing information to find specific individuals, for example, someone wearing a blue shirt and black pants, and then determine that person's trajectory within a certain area. On the Market1501 database, our rank-one recognition rate has exceeded 95%.
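Re-identification evaluation reduces to a ranking problem over embeddings. The sketch below (a generic illustration, not the actual model behind the 95% figure) ranks a gallery of appearance embeddings by cosine similarity to a query; position 0 of the result is the rank-one match counted in metrics like the one quoted above.

```python
import numpy as np

def rank_gallery(query, gallery):
    """Rank gallery embeddings by cosine similarity to a query embedding.
    Index 0 of the returned order is the rank-one match."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(g @ q)[::-1]    # most similar first
```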
Human-computer interaction is extremely complex because understanding human intentions is very difficult. To simplify, let's first consider detecting human joints and tracking posture. Generally speaking, under reasonably good lighting, effective tracking is achievable. For example, a recent framework from CMU can even identify hand joints, allowing us to analyze hand movements and even perform sign language recognition. With such a posture-tracking framework, we can control robots and analyze the movements of each athlete on a sports field, determining whether they are shooting or throwing. We can also perform fine-grained classification, such as identifying bird species worldwide: using keypoint detection, or our Pose-Net, we detect the bird's beak, head, and feet, then extract fine features from each region to accurately identify the species.
These are just some examples of machine vision perception. Besides visual perception, there are also natural language understanding, speech recognition, and so on. In these examples, we have quality requirements for the input data. If the quality of the input image or video data is poor, such as due to noise or haze, it will cause problems for subsequent recognition. Therefore, we need to perform image quality assessment.
Image resolution is also an issue. Modern cameras are quite advanced and can generally acquire very high-resolution images and videos, but this isn't always the case, especially when the camera is far from the object. When the data resolution is low, detection, tracking, and recognition become very difficult, making improvements in resolution crucial.
Robot self-learning ability
After a robot perceives its environment, the information it acquires can help improve system performance. To further enhance robot performance, it needs to learn on its own and effectively integrate different types of information, all of which are closely related to machine learning.
Humans are multi-task learners, and we hope robots can be too. Currently, most networks are single-task driven: face recognition is face recognition, and expression recognition is expression recognition. But given a single photo, you can extract a great deal of information, such as whether the person is male or female, whether they wear glasses, and other attributes. This motivates us to train neural networks that support multi-task learning.
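Structurally, a multi-task network is one shared representation feeding several lightweight heads. The forward pass below is a minimal sketch; the dimensions and the two tasks (gender and glasses) are illustrative assumptions, not the architecture from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(16)               # features of one face image
W_shared = rng.standard_normal((8, 16))   # shared trunk weights
W_gender = rng.standard_normal((2, 8))    # head 1: male / female
W_glasses = rng.standard_normal((2, 8))   # head 2: glasses / no glasses

h = np.maximum(0.0, W_shared @ x)         # one shared representation...
gender_logits = W_gender @ h              # ...consumed by every task head
glasses_logits = W_glasses @ h
```

Training sums the per-task losses, so the shared trunk must learn features useful for all tasks at once.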
Besides multi-task learning, there's also multi-label learning. Given an image or video, the label information it encompasses is very extensive. These labels are also related, and this relationship is an asymmetric causal relationship. Utilizing such asymmetric causal relationships, we can perform image recognition and understanding more effectively.
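The mechanical difference from ordinary classification is small but important: multi-label prediction uses one independent sigmoid per label instead of a softmax, so any number of labels can be active at once. A minimal sketch (our illustration; it does not model the asymmetric label relationships mentioned above):

```python
import numpy as np

def multilabel_predict(logits, threshold=0.5):
    """One independent sigmoid per label, so any number of labels may
    fire -- unlike softmax, which forces exactly one."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs >= threshold
```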
Then there's transfer learning. For example, if I hold something that's round, red, and crisp and ask everyone what it is, they probably won't know. Conversely, if I hold an apple and ask everyone to describe its features, that's very straightforward: you'll tell me it's round, red, crisp, and delicious. Traditional transfer learning rests on the assumption that the features carry the label information.
When labeled data is noise-free, we can train models effectively. But what if the labels are noisy? In today's big-data era, labels are often obtained through crowdsourcing, so label noise is common.
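One standard device from the label-noise literature (a generic illustration, not necessarily the method behind this work) is a noise-transition matrix T, where T[i, j] is the probability that a clean label i was recorded as noisy label j. If T is known or estimated, a posterior over noisy labels can be corrected back to a posterior over clean ones:

```python
import numpy as np

def corrected_posterior(noisy_probs, T):
    """Recover a posterior over clean labels from a posterior over noisy
    labels, given a transition matrix T with
    T[i, j] = P(noisy label j | clean label i)."""
    clean = np.linalg.solve(T.T, noisy_probs)  # invert p_noisy = T^T p_clean
    clean = np.clip(clean, 0.0, None)          # numerical safety
    return clean / clean.sum()
```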
Deep learning has improved system performance, but models have also grown increasingly large, posing challenges for storage and computation. How can we make deep models smaller? We need to compress them. Using the classical discrete cosine transform (DCT), we can compress a model effectively while even improving the original model's generalization ability to some extent.
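The core of DCT-based compression can be sketched in a few lines (a simplified illustration of the idea, not the actual compression pipeline): move a weight matrix into the DCT domain, keep only the largest coefficients, and transform back. Because the DCT concentrates energy in few coefficients, most can be dropped with little reconstruction error.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def compress_weights(W, keep):
    """Keep only the `keep` largest-magnitude DCT coefficients of a weight
    matrix and reconstruct.  Assumes 1 <= keep <= W.size."""
    Cr, Cc = dct_matrix(W.shape[0]), dct_matrix(W.shape[1])
    D = Cr @ W @ Cc.T                            # 2-D DCT of the weights
    thresh = np.sort(np.abs(D).ravel())[::-1][keep - 1]
    D[np.abs(D) < thresh] = 0.0                  # drop small coefficients
    return Cr.T @ D @ Cc                         # inverse 2-D DCT
```

Only the surviving coefficients need to be stored, which is where the compression comes from.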
Finally, there are reasoning and behaving. Consider analyzing human behavior: correctly understanding the content of a video means determining, for example, whether it shows someone boxing, washing their face, or playing a game.
Take human-computer interaction as another example. How do we teach robots to do things? The traditional way is programming. In the future, robots will learn through demonstration or imitation: let the robot watch a task being done, and it will know what to do.
Then there is image and video captioning. For a machine, recognizing the people in a short video is no longer difficult; accurately understanding and describing the content, however, remains a significant challenge.
Besides describing images, there's also answering questions about them. This isn't easy for a computer: it must understand the image, understand the question, and know how to connect the two.
By combining deep learning with video and LiDAR, we can detect all people and vehicles, estimate their speed, detect routes, segment scenes, analyze safe zones, and enable autonomous vehicles to fully perceive their environment. We can even analyze the future actions of people and vehicles. Autonomous vehicles and robots provide excellent platforms for showcasing artificial intelligence.
In short, artificial intelligence and robots have brought endless possibilities to the future.