
Detailed Explanation of Basic Robotics Technology Models

2026-04-06 05:46:15

Building general-purpose robots that can operate seamlessly in any environment, use a variety of skills to handle different objects, and perform diverse tasks has long been a goal in the field of artificial intelligence. Most existing robotic systems, however, are limited: they are designed for specific tasks, trained on specific datasets, and deployed in specific environments. These systems typically require large amounts of labeled data, rely on task-specific models, suffer from numerous generalization problems when deployed in real-world scenarios, and struggle to remain robust under distribution shift.

Inspired by the impressive open-set performance and content generation capabilities of large-capacity pre-trained network models (i.e., foundation models) in research fields such as natural language processing (NLP) and computer vision (CV), this survey explores (i) how existing foundation models in NLP and CV can be applied to robotics, and (ii) what foundation models specific to robotics would look like. We begin by outlining the structure of traditional robotic systems and the fundamental obstacles to their general applicability.

Next, we establish a taxonomy and discuss current work on exploring robotics with existing foundation models and on developing foundation models specifically for robotics. Finally, we discuss the key challenges and promising future directions for enabling general-purpose robotic systems through foundation models.

We still face many challenges in developing autonomous robotic systems that can adapt to and operate in diverse environments. Previous robotic perception systems have relied on traditional deep learning methods, which typically require large amounts of labeled data to train supervised learning models [1-3]; meanwhile, crowdsourced labeling for these large datasets remains quite expensive. In addition, because traditional supervised learning methods generalize poorly, trained models usually require carefully designed domain adaptation techniques before they can be deployed to a specific scenario or task [4, 5], which often demands further data collection and labeling.

Similarly, traditional robot planning and control methods typically require accurate models of the world, of the robot's own dynamics, and/or of the behavior of other agents [6-8]. These models are built for each specific environment or task and often need to be rebuilt as conditions change, exposing their limited transferability [8]; in many cases, building an accurate model is simply too expensive or impractical. While motion planning [9, 10] and control [11-14] methods based on deep (reinforcement) learning can help alleviate these problems, they too suffer from distribution shift and limited generalization [15, 16].

While acknowledging the challenges of building robotic systems with generalization capabilities, we also note the significant advances in natural language processing (NLP) and computer vision (CV): the introduction of large language models (LLMs) [17] in NLP, the use of diffusion models for high-fidelity image generation [18, 19], and the use of large-capacity vision models and vision-language models (VLMs) for zero- or few-shot generalization on CV tasks [20-22].

These are called “foundation models” [23], or simply large pre-trained models (LPTMs). These large-capacity vision and language models have also been applied in robotics [24-26], with the potential to endow robotic systems with open-world perception, task planning, and even motion control capabilities. Beyond directly applying existing vision and/or language foundation models to robotic tasks, we also see considerable potential in developing more robot-specific models, such as action models for manipulation [27, 28] or motion planning models for navigation [29]. These robot foundation models have shown strong generalization across different tasks and even different embodiments.

Vision/language foundation models have also been applied directly to robotic tasks [30, 31], demonstrating the possibility of integrating different robotic modules into a single unified model. While we see promising applications of vision and language foundation models to robotic tasks, and for developing new robot foundation models, many robotics challenges remain. From a practical deployment perspective, models are often hard to reproduce, lack generalization across embodiments, or fail to accurately capture what is feasible (or acceptable) in an environment. Furthermore, most publications use Transformer-based architectures and focus on semantic perception of objects and scenes, task-level planning, or control [28]; other components of a robotic system that could benefit from cross-domain generalization remain largely unexplored, for example, foundation models for world dynamics or foundation models capable of symbolic reasoning. Finally, we emphasize the need for more large-scale real-world data and for high-fidelity simulators with diverse robotic tasks.

In this paper, we investigate the application of foundation models in robotics and aim to understand how they can help alleviate core robotics challenges. We use the term “foundation models for robotics” to cover two distinct aspects: (1) applying existing (primarily) vision and language models to robotics, mainly through zero-shot and in-context learning; and (2) developing and using robot-generated data to create foundation models specifically for robotic tasks. We summarize the methodologies of foundation-model papers in robotics and provide a meta-analysis of the experimental results of the papers we surveyed.

The main components of this paper are summarized in Figure 1, and its overall structure is shown in Figure 2. In Section 2, we briefly review robotics research before the foundation-model era and introduce the fundamentals of foundation models. In Section 3, we list the challenges in robotics research and discuss how foundation models might alleviate them. In Section 4, we summarize the current state of research on foundation models in robotics. Finally, in Section 6, we propose potential research directions that could have a significant impact on this interdisciplinary field.

Challenges in Robotics

In this section, we summarize five core challenges faced by the modules of a typical robotic system, each detailed in the following subsections. Although similar challenges have been discussed in previous literature (see Section 1.2), this section focuses primarily on those that can be addressed by appropriately using foundation models, as evidenced by current research findings. We also present a taxonomy in Figure 3 for easier reference.

Foundation Models for Robotics

In this section, we focus on applying foundation vision and language models to robotics in a zero-shot fashion. This primarily includes zero-shot deployment of VLMs for robot perception, in-context learning with LLMs for task-level and motion-level planning, and action generation. We present some representative works in Figure 6.
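As a rough sketch of this zero-shot perception recipe, the snippet below scores an image embedding against label-prompt embeddings, CLIP-style: normalize both, take cosine similarities, and softmax over candidate labels. The embeddings here are random stand-ins; a real deployment would obtain them from a VLM's image and text encoders, and the labels and temperature are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    """CLIP-style zero-shot classification: cosine similarity between an
    image embedding and per-label text-prompt embeddings, softmax over labels."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature          # scaled cosine similarities
    probs = np.exp(logits - logits.max())     # stable softmax
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Stand-in embeddings; in practice these come from a VLM encoder.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 512))
labels = ["mug", "drill", "sponge"]
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)  # "looks like" a drill
print(zero_shot_classify(image_emb, text_embs, labels)[0])  # prints drill
```

Because the label set is just a list of strings, the same perception module can be repointed at new objects without retraining, which is the open-vocabulary property the survey highlights.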

Robot Foundation Models (RFMs)

As the number of robot datasets containing state-action pairs from real robots grows, the category of robot foundation models (RFMs) becomes increasingly viable [28, 29, 176]. These models are characterized by being trained on robot data to solve robotic tasks. In this subsection, we summarize and discuss the different types of RFMs. We first introduce RFMs that perform a set of tasks within a single robot module as described in Section 2.1, which we call single-purpose robot foundation models: for example, an RFM that generates low-level actions to control a robot, or one that produces higher-level motion plans. We then introduce RFMs that operate across multiple robot modules, becoming general-purpose models capable of perception, control, and even non-robot tasks [30, 31].

Using Foundation Models to Address Robotics Challenges

In Section 3, we outlined five major challenges in robotics. In this section, we summarize in a more organized way how foundation models, whether vision and language models or robot foundation models, help address these challenges. All foundation models related to visual information, such as VFMs, VLMs, and VGMs, are used in the perception module of a robotic system. LLMs are more versatile and can be applied to planning and control. We also list RFMs here; these robot foundation models are commonly used in the planning and action-generation modules. Table 1 summarizes how foundation models address the aforementioned robotics challenges. From this table, we can see that all foundation models excel at generalization across the tasks of the various robot modules. Furthermore, LLMs are particularly adept at task specification. RFMs, for their part, excel at handling the challenges of dynamics models, because most RFMs are model-free methods.

For robot perception, the challenges of generalization and modeling are interconnected: if a perception model already generalizes well, there is no need to acquire more data for domain adaptation or additional fine-tuning. Calls for addressing safety challenges, however, are largely absent, and we discuss this particular issue in Section 6. Zero-shot generalization is one of the most prominent features of current foundation models, and robotics benefits from it in almost every aspect and module. First, VLMs and VFMs are good default choices for generalizable robot perception. Second is generalization in task-level planning, with LLMs [24] generating the details of task plans. Third is generalization in motion planning and control, by leveraging the power of RFMs.

Foundation Models for Data Scarcity

Foundation models are crucial for addressing the data scarcity problem in robotics. They provide a solid starting point for learning and adapting to new tasks from minimal task-specific data. For example, recent approaches have leveraged foundation models to generate data for training robots, such as robot trajectories [236] and simulations [237]. These models excel at learning from a small number of examples, enabling robots to adapt quickly to new tasks with limited data. From this perspective, addressing data scarcity is closely tied to addressing the generalization problem in robotics. In addition, foundation models, especially LLMs and VGMs, can generate robotics datasets for training perception modules [238] (see Section 4.1.5 above) and for task specification [239].

Foundation Models for Reducing Model Requirements

As discussed in Section 3.3, building or learning a model, whether an environment map, a world model, or an environment dynamics model, is crucial for solving robotics problems, especially in motion planning and control. However, the strong few-/zero-shot generalization exhibited by foundation models can reduce this requirement. Examples include using LLMs to generate task plans [24] and using RFMs to learn model-free, end-to-end control policies [27, 256].
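A minimal sketch of the in-context task-planning pattern mentioned above: format a prompt that lists the robot's skill library, then parse the LLM's numbered response into executable skill calls. The skill names and the canned response are hypothetical; a real system would invoke an actual LLM API where the stub response appears.

```python
import re

def build_plan_prompt(task, skills):
    """Format a prompt asking an LLM to decompose a task into a
    numbered plan drawn only from the robot's skill library."""
    return (f"Available skills: {', '.join(skills)}.\n"
            f"Task: {task}\n"
            "Plan (numbered steps, one skill call per line):")

def parse_plan(llm_output):
    """Extract 'skill(args)' calls from a numbered-list response."""
    steps = []
    for line in llm_output.splitlines():
        m = re.match(r"\s*\d+\.\s*(\w+\([^)]*\))", line)
        if m:
            steps.append(m.group(1))
    return steps

prompt = build_plan_prompt("wash the sponge",
                           ["pick", "move_to", "place"])
# Canned response standing in for a real LLM call on `prompt`.
response = "1. pick(sponge)\n2. move_to(sink)\n3. place(sponge)"
print(parse_plan(response))
# ['pick(sponge)', 'move_to(sink)', 'place(sponge)']
```

Constraining the plan to a fixed skill vocabulary is what keeps the LLM's output executable; anything the parser rejects can be retried or flagged rather than sent to the controller.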

Foundation Models for Task Specification

Task specification, using language prompts [24, 27, 28], goal images [181, 272], human videos demonstrating the task [273, 274], rewards [26, 182], rough trajectory sketches [239], policy sketches [275], and hand-drawn images [276], enables goals to be specified in a more natural, human-like format. Multimodal foundation models allow users not only to specify the goal but also to resolve ambiguity through dialogue. Recent work on trust and intent recognition in human-robot interaction has opened up new paradigms for understanding how humans use explicit and implicit cues to communicate task specifications. While significant progress has been made, recent work on LLM prompt engineering has shown that, even within a single modality, it can be difficult to elicit relevant outputs. Vision-language models have proven particularly adept at task specification, showing potential for solving robotics problems. Extending the idea of vision-language-based task specification, Cui et al. [181] explored methods for achieving multimodal task specification using more natural inputs, such as images obtained from the Internet. Brohan et al. [27] further explored the concept of zero-shot transfer from task-agnostic data and proposed a new model class with favorable scaling properties. The model encodes high-dimensional inputs and outputs, including camera images, instructions, and motor commands, into compact token representations to enable real-time control of a mobile manipulator.
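The compact action-token representation described above can be illustrated with a uniform-binning sketch: each continuous action dimension is clipped to its range and mapped to one of a fixed number of integer bins. The bin count of 256 and the action ranges below are illustrative assumptions, not the exact scheme of the cited model.

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Discretize each continuous action dimension into one of n_bins
    uniform bins, yielding integer tokens a Transformer can emit."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)               # map to [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=256):
    """Map tokens back to bin-center continuous values for the controller."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# Hypothetical 3-D action: end-effector dx, dy, and gripper openness.
low = np.array([-1.0, -1.0, 0.0])
high = np.array([1.0, 1.0, 1.0])
a = np.array([0.25, -0.5, 1.0])
t = tokenize_action(a, low, high)
print(t)                              # integer tokens, e.g. [160  64 255]
print(detokenize_action(t, low, high))
```

The round-trip error is at most one bin width, so with 256 bins per dimension the discretization is fine enough for control while keeping the action space a small vocabulary of tokens.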

Foundation Models for Uncertainty and Safety

While uncertainty and safety are key issues in robotics, addressing them with foundation models remains largely unexplored. Existing work, such as KNOWNO [187], proposes a framework for measuring and aligning the uncertainty of LLM-based task planners. Recent advances in chain-of-thought prompting [277], open-vocabulary learning [278], and hallucination detection in LLMs [279] may open new avenues for addressing these challenges.
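The idea behind this style of uncertainty alignment can be sketched as a conformal-style prediction set over candidate plans: keep every option whose model confidence clears a calibrated threshold, act only if exactly one survives, and otherwise ask a human for clarification. The threshold value and option probabilities below are stand-ins, not outputs of the cited method's calibration procedure.

```python
def prediction_set(option_probs, qhat):
    """Conformal-style prediction set: keep every candidate plan whose
    confidence clears the calibrated threshold 1 - qhat."""
    return [opt for opt, p in option_probs.items() if p >= 1 - qhat]

def act_or_ask(option_probs, qhat=0.7):
    """Act when the prediction set is a singleton; otherwise the planner
    is uncertain and should ask for help instead of guessing."""
    s = prediction_set(option_probs, qhat)
    if len(s) == 1:
        return ("act", s[0])
    return ("ask", s)

# Ambiguous instruction: "put the bowl in the microwave", two bowls visible.
ambiguous = {"pick(metal_bowl)": 0.48, "pick(plastic_bowl)": 0.45,
             "do_nothing": 0.07}
print(act_or_ask(ambiguous))   # both bowls survive -> ('ask', [...])

confident = {"pick(mug)": 0.90, "do_nothing": 0.10}
print(act_or_ask(confident))   # -> ('act', 'pick(mug)')
```

The key property is that the set size, not a single argmax score, drives the decision: an ambiguous instruction naturally yields a multi-element set and triggers a clarifying question rather than a confidently wrong action.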

