
How do robots build 3D semantic maps?

2026-04-06 05:11:50 · #1

As robot intelligence continues to advance, merely perceiving the spatial geometry of an environment is no longer sufficient for complex behavioral decisions and human-robot interaction tasks. Like humans, robots need to understand the categories and locations of the objects around them – that is, the environment's semantic information. For example, a robot vacuum cleaner tasked with cleaning under a dining table must know the category and location of its target. However, while today's mainstream 2D grid maps and topological maps can describe the geometric features of obstacles and the structure of the environment, they lack the high-level semantic information robots need to understand their surroundings and interact with people and objects. In contrast, a 3D semantic map contains not only structural information about objects and the environment but also "common sense" such as object categories and functional attributes – in effect, a comprehensive, "nanny-level" map for robots.

From a technical perspective, a 3D semantic map is a reconstruction of the three-dimensional environment of a real scene. It includes regional scene information, the attributes of each independent object in the scene, and each object's 3D model and pose in space, enabling a robot to understand its environment at the semantic level – mimicking the way the human brain understands an environment – and providing the information needed for higher-level intelligent operation.
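The ingredients listed above (regions, object attributes, poses) can be pictured as a small data model. The class and field names below (SemanticObject, SemanticRegion, SemanticMap) are illustrative assumptions for the sketch, not INDEMIND's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticObject:
    label: str                                # object category, e.g. "dining_table"
    affordances: list[str]                    # functional attributes, e.g. ["clean_under"]
    centroid: tuple[float, float, float]      # position in the map frame (metres)
    pose: tuple[float, float, float, float]   # orientation quaternion (x, y, z, w)

@dataclass
class SemanticRegion:
    name: str                                 # regional scene semantics, e.g. "dining_room"
    objects: list[SemanticObject] = field(default_factory=list)

@dataclass
class SemanticMap:
    regions: dict[str, SemanticRegion] = field(default_factory=dict)

    def find(self, label: str) -> list[SemanticObject]:
        """Look up every instance of a category across all regions."""
        return [o for r in self.regions.values() for o in r.objects if o.label == label]

# Example: one region containing one object
m = SemanticMap()
m.regions["dining_room"] = SemanticRegion("dining_room", [
    SemanticObject("dining_table", ["clean_under"], (2.0, 1.5, 0.0), (0, 0, 0, 1))
])
print(m.find("dining_table")[0].centroid)   # (2.0, 1.5, 0.0)
```

A query like `find("dining_table")` is exactly what the robot-vacuum example needs: category plus 3D location in one lookup.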

How to construct a 3D semantic map?

The prerequisite for building a 3D semantic map is extracting the features of the objects of interest and performing semantic segmentation. INDEMIND takes a stereo-vision approach: it clusters the 3D point cloud acquired by its binocular vision sensors and, combining embedded deep learning with a VSLAM algorithm running at the edge, outputs both individual object semantics and regional scene semantics, from which the 3D semantic map is constructed.
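The clustering step can be illustrated with a generic Euclidean clustering routine: stereo-derived 3D points whose neighbours lie within a distance threshold are flood-filled into candidate objects, which a classifier would then label. This is a textbook sketch, not INDEMIND's edge-optimised implementation, and the `radius`/`min_size` values are made up:

```python
import math
from collections import deque

def euclidean_clusters(points, radius=0.3, min_size=2):
    """Group 3D points into clusters via breadth-first flood fill:
    two points belong to the same cluster if a chain of neighbours
    no more than `radius` metres apart connects them."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, cluster = deque([seed]), [seed]
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                cluster.append(j)
        if len(cluster) >= min_size:          # drop isolated noise points
            clusters.append(sorted(cluster))
    return sorted(clusters)

# Two well-separated blobs of points -> two object candidates
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
         (5.0, 5.0, 0.0), (5.1, 5.0, 0.0)]
print(euclidean_clusters(cloud))   # [[0, 1, 2], [3, 4]]
```

Each returned cluster is a set of point indices representing one candidate object, ready for semantic labelling.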

In real-world settings – a home, an office, or a supermarket – there are usually three or more sub-scenes, and these sub-scenes often look alike. When a robot is assigned a task in a designated room, it must quickly and accurately understand each room's functional attributes, locate the correct room, and carry out tasks tailored to that room's function. This demands very high scene-understanding accuracy.

INDEMIND therefore achieves scene understanding by fusing regional scene semantics with individual object semantics. First, it recognizes the overall features of a region from the regional scene semantics; second, it identifies the independent objects within the scene from the individual object semantics and treats them as scene feature markers; finally, superimposing the two levels of semantics yields accurate and stable scene understanding.
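The superposition of the two semantic levels can be sketched as a weighted score fusion: region-level scene scores are combined with evidence from detected objects acting as feature markers. The cue table, weights, and labels below are illustrative assumptions; in practice such a fusion would be learned, not hand-coded:

```python
# Which detected objects hint at which room type (illustrative values)
OBJECT_CUES = {
    "bed": {"bedroom": 0.9},
    "wardrobe": {"bedroom": 0.6, "hallway": 0.2},
    "stove": {"kitchen": 0.9},
    "sofa": {"living_room": 0.8},
}

def fuse_scene(scene_scores, detected_objects, w_scene=0.5, w_obj=0.5):
    """Superimpose regional scene scores and object 'feature marker' evidence,
    then return the highest-scoring room label."""
    fused = {room: w_scene * s for room, s in scene_scores.items()}
    for obj in detected_objects:
        for room, s in OBJECT_CUES.get(obj, {}).items():
            fused[room] = fused.get(room, 0.0) + w_obj * s
    return max(fused, key=fused.get)

# The scene model alone is ambiguous; the detected bed tips it to "bedroom"
scores = {"bedroom": 0.4, "living_room": 0.45}
print(fuse_scene(scores, ["bed", "wardrobe"]))   # bedroom
```

The point of the fusion is exactly what the paragraph describes: when visually similar sub-scenes make the region-level score ambiguous, the object markers disambiguate it.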

In practical applications, robots using 3D semantic maps, combined with INDEMIND's VSLAM algorithm and intelligent decision engine, have performed well in AI recognition, intelligent obstacle avoidance, command-based operation, and interaction with people and objects.

For AI-powered obstacle recognition and avoidance, the 3D semantic map allows image features in the environment to be extracted quickly. Combined with deep learning, the robot can recognize individual obstacles such as pedestrians, animals, and fixed or moving objects, as well as dangerous scenarios such as stairs and escalators, in three dimensions, preventing accidents before they occur. Because the avoidance logic incorporates each obstacle's 3D information, its stability and accuracy improve significantly. Moreover, with both the category and the 3D information of each recognized obstacle, the robot can perform refined maneuvers similar to human avoidance behavior, sidestepping obstacles proactively, with anticipation and strategy.
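Class-aware avoidance of the kind described above can be sketched as follows: the same 3D detection is handled differently depending on its semantic label. The labels, margins, and thresholds here are illustrative assumptions, not production values:

```python
DANGER_CLASSES = {"stairs", "escalator"}    # hazards: never approach
DYNAMIC_CLASSES = {"pedestrian", "animal"}  # moving agents: keep a wide berth

def avoidance_action(label, distance_m, robot_radius=0.2):
    """Choose an avoidance strategy from the obstacle's semantic class
    and its 3D distance to the robot."""
    if label in DANGER_CLASSES:
        return "stop_and_replan"
    margin = 0.8 if label in DYNAMIC_CLASSES else 0.3
    if distance_m < robot_radius + margin:
        return "detour"
    return "proceed"

print(avoidance_action("stairs", 2.0))      # stop_and_replan
print(avoidance_action("pedestrian", 0.7))  # detour
print(avoidance_action("chair", 0.7))       # proceed
```

A purely geometric planner would treat all three detections identically; the semantic label is what makes the response anticipatory rather than reactive.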

For interaction and intelligent operation, the 3D semantic map provides semantic recognition and segmentation of the individual objects and rooms within a scene. Once the robot grasps this human "common sense," higher-level interaction logic becomes possible. Combined with INDEMIND's self-developed natural-language interaction technology, the robot can be commanded by voice, gesture, or action to perform intelligent tasks such as safety monitoring, searching, following, autonomous pathfinding, and directional cleaning. Taking directional cleaning as an example: the voice command "Clean the bedroom" is interpreted as a plan to clean the bedroom region identified on the map, replacing the crude interaction experience of earlier systems.
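Grounding that voice command in the map can be sketched as matching the spoken room name against the mapped regions and emitting a cleaning goal. The command grammar and region store below are illustrative assumptions:

```python
# Region name -> 2D boundary polygon in the map frame (illustrative values)
REGIONS = {
    "bedroom": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "kitchen": [(4, 0), (7, 0), (7, 3), (4, 3)],
}

def plan_from_command(command):
    """Map a command like 'Clean the bedroom' to a coverage-cleaning
    task over that room's polygon, or None if no known room is named."""
    words = command.lower().rstrip(".!").split()
    for room, polygon in REGIONS.items():
        if room in words:
            return {"task": "coverage_clean", "region": room, "polygon": polygon}
    return None

plan = plan_from_command("Clean the bedroom")
print(plan["region"])   # bedroom
```

Without the semantic map there is no "bedroom" entry to resolve against, which is why this interaction style requires region-level semantics rather than a plain occupancy grid.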

Currently, 3D semantic map technology has been applied in INDEMIND's home robot navigation solution, "Home Robot AIKit," and its commercial robot navigation solution, "Commercial Robot AIKit," both of which have been widely recognized by customers in the market.

It is worth noting that both solutions, being vision-based, offer significant cost advantages over competing approaches. The "Home Robot AIKit" delivers comparable performance at roughly one-third the cost of a LiDAR-fusion solution. While LiDAR-vision fusion can also acquire semantic information from a scene, it is limited by its sensors: it can only recognize objects in two dimensions and cannot construct a 3D semantic map. Compared with LiDAR solutions, the "Commercial Robot AIKit" cuts costs by 60-80%, bringing robot development costs below 2,000 yuan and the cost of a complete chassis, including navigation and batteries, below 5,000 yuan, significantly reducing development cost and time.

