Embodied robot manipulation tasks naturally place high demands on a robot's ability to understand language commands, perceive the scene, and plan spatiotemporal action sequences. This raises a natural question: can we make full use of the capabilities of large models and transfer them to robotics to directly plan low-level action sequences?
To address this, ByteDance Research developed RoboFlamingo, an open-source, easy-to-use robot manipulation model built on the open-source vision-language model (VLM) OpenFlamingo, which can be trained on a single machine. With simple, minimal fine-tuning, the VLM is transformed into a robotics VLM suitable for language-interactive robot manipulation tasks. RoboFlamingo was validated on the CALVIN robot manipulation benchmark, and experimental results show that it achieved state-of-the-art performance on a range of robot manipulation tasks using only 1% of the language-annotated data. With the release of the RT-X dataset, pre-training RoboFlamingo on open-source data and fine-tuning it for different robot platforms holds promise as a simple and effective pipeline for large-scale robot models. The paper also tested the performance of fine-tuned VLMs with different policy heads, training paradigms, and Flamingo structures on robotics tasks, yielding some interesting conclusions.
Research Background
Language-based robot manipulation is an important application in embodied intelligence, involving the understanding and processing of multimodal data, including vision, language, and control. In recent years, Visual Language Models (VLMs) have made significant progress in several fields, including image captioning, visual question answering, and image generation. However, applying these models to robot manipulation still presents challenges, such as how to combine visual and linguistic information and how to handle the temporal nature of robot operations. To address these issues, the robotics research team at ByteDance Research designed a new visual language manipulation framework, RoboFlamingo, using the existing open-source VLM, OpenFlamingo. The VLM performs single-step visual language understanding, while an additional policy head module is used to process historical information. RoboFlamingo can be adapted for language-based robot manipulation tasks with simple fine-tuning. RoboFlamingo was validated on the language-based robot manipulation dataset CALVIN. Experimental results show that RoboFlamingo achieves state-of-the-art (SOTA) performance on a range of robot manipulation tasks using only 1% of the language-annotated data (66% success rate and 4.09 average tasks completed in multi-task learning task sequences, compared to 38% and 3.06 average tasks completed by the baseline method; 24% success rate and 2.48 average tasks completed in zero-shot tasks, compared to 1% and 0.67 average tasks completed by the baseline method). Furthermore, it enables real-time response through open-loop control and can be flexibly deployed on lower-performance platforms. These results demonstrate that RoboFlamingo is an effective robot manipulation method that can provide a useful reference for future robot applications.
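The division of labor described above, a VLM extracting a single-step vision-language feature and a separate policy head handling history and action output, can be sketched in a few lines of NumPy. Everything here (the feature dimension, the chunk-mean resampler, the single attention step, the random weights) is a simplified stand-in for illustration, not RoboFlamingo's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # illustrative feature dimension (real models are much larger)

def resample(vit_tokens, num_latents=8):
    # Downsample ViT patch tokens to a fixed number of latents. A chunked
    # mean stands in for the learned Perceiver-style resampler.
    chunks = np.array_split(vit_tokens, num_latents, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

def fuse(text_tokens, vis_latents):
    # Stand-in for the feature fusion decoder: text tokens (queries)
    # cross-attend to the visual latents, and the result is max-pooled.
    scores = text_tokens @ vis_latents.T                       # (T, L)
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    fused = text_tokens + attn @ vis_latents                   # (T, D)
    return fused.max(axis=0)                                   # (D,)

def policy_head(feature, W, b):
    # Map the pooled feature to a 7-DoF relative action:
    # a 6-dim end-effector pose delta plus a 1-dim gripper command.
    out = feature @ W + b
    pose = np.tanh(out[:6])                 # bounded relative pose
    gripper = 1.0 if out[6] > 0 else -1.0   # binary open/close
    return pose, gripper

vit_tokens = rng.normal(size=(197, D))  # e.g. one ViT patch-token grid
text_tokens = rng.normal(size=(5, D))   # a tokenized instruction
W, b = rng.normal(size=(D, 7)), np.zeros(7)

latents = resample(vit_tokens)          # (8, D)
feature = fuse(text_tokens, latents)    # (D,)
pose, gripper = policy_head(feature, W, b)
```

In the real model, the ViT, resampler, and fusion decoder are initialized from OpenFlamingo and the policy head is learned; the random weights here only exercise the shapes of the data flow.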
This work leverages an existing vision-language model pre-trained on image-text pairs and trains it end-to-end to generate the robot's relative action at each step. The model's main modules are a vision encoder, a feature fusion decoder, and a policy head. The vision encoder feeds the current visual observation into a ViT and then downsamples the ViT's output tokens with a resampler. The feature fusion decoder takes the text tokens as input; in each layer, the text tokens act as queries that cross-attend to the vision encoder's output, followed by self-attention, to fuse visual and linguistic features. The decoder's output is then max-pooled and fed into the policy head. Based on the current and historical token sequences produced by the feature fusion decoder, the policy head directly outputs the current 7-DoF relative action, consisting of a 6-dim robot arm end-effector pose and a 1-dim gripper open/close action. During training, RoboFlamingo initializes from the pre-trained ViT, LLM, and cross-attention parameters, and fine-tunes only the resampler, the cross-attention layers, and the policy head.

Experimental Dataset
CALVIN (Composing Actions from Language and Vision) is an open-source simulation benchmark for learning language-based, long-horizon manipulation tasks. Compared to existing vision-language task datasets, CALVIN's tasks are more complex in terms of sequence length, action space, and language, and it supports flexible specification of sensor inputs. CALVIN is divided into four splits (A, B, C, and D), each corresponding to a different context and layout.

Quantitative Analysis
RoboFlamingo achieved top performance across all settings and metrics, demonstrating strong imitation, visual generalization, and language generalization capabilities. "Full" and "Lang" indicate whether the model was also trained on unpaired visual data (i.e., visual data without language annotations); "Freeze-emb" refers to freezing the embedding layer of the fusion decoder; "Enriched" indicates instructions enhanced using GPT-4.

Ablation Experiments
Different Policy Heads: Experiments examined four policy heads: MLP w/o hist, MLP w/ hist, GPT, and LSTM. MLP w/o hist, which predicts the action from the current observation alone, performed the worst. MLP w/ hist, which fuses historical observations at the vision encoder, improved performance. GPT and LSTM, which maintain historical information explicitly and implicitly at the policy head, respectively, performed best, demonstrating the effectiveness of fusing history through the policy head. Impact of Vision-Language Pre-training: Pre-training played a crucial role in RoboFlamingo's performance; experiments showed that the model performs better on robotic tasks when pre-trained on a large vision-language dataset. Model Size and Performance: While larger models generally perform better, experimental results show that even smaller models can rival larger ones on certain tasks. Impact of Instruction Fine-tuning: Instruction fine-tuning is a powerful technique, and experimental results demonstrate that it can further improve model performance.
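The gap between a memoryless head and a recurrent one can be made concrete with a toy sketch (random weights and a simplified tanh recurrence rather than an actual LSTM, all hypothetical): a memoryless MLP must emit the same action whenever it sees the same observation, while a recurrent head's hidden state lets identical observations at different timesteps map to different actions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, A = 16, 7  # feature dim, action dim (7-DoF)

# Hypothetical weights for illustration.
W_h = rng.normal(size=(D, D)) * 0.1   # hidden -> hidden
W_x = rng.normal(size=(D, D)) * 0.1   # input  -> hidden
W_a = rng.normal(size=(D, A)) * 0.1   # hidden -> action

def mlp_head(feat):
    # "MLP w/o hist": the action depends only on the current feature.
    return np.tanh(feat) @ W_a

def recurrent_head(feats):
    # Recurrent head (LSTM-like, simplified): the hidden state carries
    # information from all previous steps.
    h = np.zeros(D)
    actions = []
    for x in feats:
        h = np.tanh(h @ W_h + x @ W_x)
        actions.append(h @ W_a)
    return np.stack(actions)

feats = rng.normal(size=(4, D))
feats[3] = feats[0]                  # repeat an earlier observation
a_mlp = np.stack([mlp_head(f) for f in feats])
a_rec = recurrent_head(feats)

assert np.allclose(a_mlp[0], a_mlp[3])      # memoryless: same obs, same action
assert not np.allclose(a_rec[0], a_rec[3])  # history changes the output
```

This is exactly why the memoryless MLP struggles on long-horizon manipulation, where the correct action at a given scene depends on which subtasks have already been completed.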
In terms of qualitative results, RoboFlamingo not only completed all five consecutive subtasks, but also needed significantly fewer steps than the baseline on the first two subtasks, which were the only ones the baseline executed successfully.
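For reference, the two headline metrics quoted throughout (success rate and average tasks completed over five-subtask instruction chains) can be computed from evaluation records like this. The numbers below are made up for illustration, and this is a simplified version of CALVIN-style scoring, which also reports success rates at each chain length:

```python
# Hypothetical evaluation log: for each 5-subtask instruction chain,
# the number of consecutive subtasks completed before the first failure.
completed = [5, 3, 5, 0, 4, 5, 2, 5]

# Success rate: fraction of chains where all 5 subtasks succeeded.
success_rate = sum(c == 5 for c in completed) / len(completed)
# Average tasks completed: mean achieved chain length.
avg_completed = sum(completed) / len(completed)

print(success_rate, avg_completed)  # → 0.5 3.625
```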
In summary, this work provides a novel framework for language-interactive robot manipulation policies built on existing open-source vision-language models (VLMs), achieving excellent results with simple fine-tuning. RoboFlamingo offers robotics researchers a powerful open-source framework that makes it easier to unleash the potential of open-source VLMs. The extensive experimental results in this work may provide valuable experience and data for practical robot applications, contributing to future research and technological development.