RT-2: New model translates vision and language into action

Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control.
High-capacity vision-language models (VLMs) are trained on web-scale datasets, making these systems remarkably good at recognising visual or language patterns and operating across different languages. But for robots to achieve a similar level of competency, they would need to collect robot data first-hand, across every object, environment, task, and situation.
In our paper, we introduce Robotic Transformer 2 (RT-2), a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control, while retaining web-scale capabilities.
A visual-language model (VLM) pre-trained on web-scale data learns from robotics data to become RT-2, a vision-language-action (VLA) model that can control a robot.
This work builds upon Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations, which can learn combinations of tasks and objects seen in the robotic data. More specifically, our work used RT-1 demonstration data collected with 13 robots over 17 months in an office kitchen environment.
RT-2 shows improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.
We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, like deciding which object could be used as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
Adapting VLMs for robotic control
RT-2 builds upon VLMs that take one or more images as input, and produce a sequence of tokens that conventionally represent natural language text. Such VLMs have been successfully trained on web-scale data to perform tasks like visual question answering, image captioning, or object recognition. In our work, we adapt the PaLI-X and PaLM-E models to act as the backbones of RT-2.
To control a robot, it must be trained to output actions. We address this challenge by representing actions as tokens in the model's output, similar to language tokens, and describe actions as strings that can be processed by standard natural language tokenisers, shown here:
Representation of an action string used in RT-2 training. An example of such a string could be a sequence of robot action token numbers, e.g. "1 128 91 241 5 101 127 217".
The string starts with a flag that indicates whether to continue or terminate the current episode, without executing the subsequent commands, and follows with the commands to change the position and rotation of the end-effector, as well as the desired level of extension of the robot gripper.
We use the same discretised version of robot actions as in RT-1, and show that converting it to a string representation makes it possible to train VLM models on robotic data, as the input and output spaces of such models do not need to be changed.
RT-2 architecture and training: we co-fine-tune a pre-trained VLM model on robotics and web data. The resulting model takes in robot camera images and directly predicts actions for the robot to perform.
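Co-fine-tuning, as described above, means each training batch mixes web-scale vision-language examples with robot demonstrations whose targets are action-token strings. A minimal sketch of such a data-mixing sampler follows; the 50/50 mixing ratio and the toy examples are assumptions for illustration only.

```python
import random

# Toy stand-ins for the two data sources used in co-fine-tuning:
# web vision-language pairs, and robot demonstrations whose targets
# are discretised action-token strings (contents are illustrative).
web_data = [("what is in this image?", "a banana"),
            ("caption this photo", "a kitchen counter")]
robot_data = [("pick up the banana", "0 152 114 127 153 127 76 204"),
              ("place the can upright", "0 130 120 127 127 140 90 10")]

def sample_batch(batch_size=4, robot_fraction=0.5, rng=random):
    """Draw a mixed batch of (prompt, target-string) pairs from both sources."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if rng.random() < robot_fraction else web_data
        batch.append(rng.choice(source))
    return batch

batch = sample_batch()
```

Because both sources share the same (text in, text out) interface, a single language-modelling loss can be applied to every example in the mixed batch.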
Generalisation and emergent skills
We performed a series of qualitative and quantitative experiments on our RT-2 models, on over 6,000 robotic trials. Exploring RT-2's emergent capabilities, we first searched for tasks that would require combining knowledge from web-scale data and the robot's experience, and then defined three categories of skills: symbol understanding, reasoning, and human recognition.
Each task required understanding visual-semantic concepts and the ability to perform robotic control to operate on these concepts. Commands such as "pick up the bag about to fall off the table" or "move the banana to the sum of one plus one", where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robotic data, required knowledge translated from web-based data to operate.
Examples of emergent robotic skills that are not present in the robotics data and require knowledge transfer from web pre-training.
Across all categories, we observed increased generalisation performance (more than 3x improvement) compared to previous baselines, such as previous RT-1 models and models like Visual Cortex (VC-1), which were pre-trained on large visual datasets.
Success rates of emergent skill evaluations: our RT-2 models outperform both the previous robotics transformer (RT-1) and visual pre-training (VC-1) baselines.
We also performed a series of quantitative evaluations, beginning with the original RT-1 tasks, for which we have examples in the robot data, and continuing with varying degrees of previously unseen objects, backgrounds, and environments.
Examples of environments previously unseen by the robot, where RT-2 generalises to novel situations.
RT-2 retained the performance on the original tasks seen in the robot data and improved performance on previously unseen scenarios, from RT-1's 32% to 62%, showing the considerable benefit of large-scale pre-training.
Additionally, we observed significant improvements over baselines pre-trained on visual-only tasks, such as VC-1 and Reusable Representations for Robotic Manipulation (R3M), and over algorithms that use VLMs for object identification, such as Manipulation of Open-World Objects (MOO).
RT-2 achieves high performance on seen, in-distribution tasks and outperforms multiple baselines on unseen, out-of-distribution tasks.
Evaluating our model on the open-source Language Table suite of robotic tasks, we achieved a 90% success rate in simulation, substantially improving over previous baselines including BC-Z (72%), RT-1 (74%), and LAVA (77%).
We then evaluated the same model in the real world (since it was trained on simulation and real data), and demonstrated its ability to generalise to novel objects, as shown below, where none of the objects except the blue cube were present in the training dataset.
RT-2 performs well on real robot Language Table tasks. None of the objects except the blue cube were present in the training data.
Inspired by chain-of-thought prompting methods used in LLMs, we probed our models to combine robotic control with chain-of-thought reasoning, to enable learning long-horizon planning and low-level skills within a single model.
In particular, we fine-tuned a variant of RT-2 for just a few hundred gradient steps to increase its ability to use language and actions jointly. We then augmented the data to include an additional "Plan" step, first describing the purpose of the action that the robot is about to take in natural language, followed by "Action" and the action tokens. Here we show an example of such reasoning and the robot's resulting behaviour:
Chain-of-thought reasoning enables learning a self-contained model that can both plan long-horizon skill sequences and predict robot actions.
With this process, RT-2 can perform more involved commands that require reasoning about the intermediate steps needed to accomplish a user's instruction. Thanks to its VLM backbone, RT-2 can also plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
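The augmented training target described above can be sketched as a plain string that pairs a natural-language plan with the discretised action tokens. The exact field labels and formatting below are assumptions based on the description, not the paper's literal format; the action string reuses the example token sequence shown earlier.

```python
def make_cot_target(plan, action_tokens):
    """Build a chain-of-thought training target: a natural-language
    'Plan' step followed by the 'Action' token string.
    Field labels are illustrative, not the exact RT-2 format."""
    return f"Plan: {plan}. Action: {action_tokens}"

target = make_cot_target(
    "pick up the rock to use as an improvised hammer",
    "1 128 91 241 5 101 127 217",  # example action string from the text above
)
print(target)
```

Because the plan and the action live in one output string, a single autoregressive decoder learns to produce both, which is what lets the model reason about intermediate steps before acting.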
Advancing robotic control
RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly improved robotic policies and, more importantly, leads to significantly better generalisation performance and emergent capabilities, inherited from web-scale vision-language pre-training.
RT-2 is not only a simple and effective modification over existing VLM models, but also shows the promise of building a general-purpose physical robot that can reason, problem-solve, and interpret information for performing a diverse range of tasks in the real world.
Acknowledgements
We would like to thank the co-authors of this work: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich for their contributions to the project, and Fred Alcober, Jodi Lynn Andres, Carolina Parada, Joseph Dabis, Rochelle Dela Cruz, Jessica Gomez, Gavin Gonzalez, Tomas Jackson, Jie Tan, Scott Lehrer, Dee M, Utsav Malla, Sarah Nguyen, Jane Park, Emily Perez, Elio Prado, Jornell Quiambao, Clayton Tan, Jodexty Therlonge, Eleanor Tomlinson, and Yuxiang Zhou, and the greater Google DeepMind team for their help and support.
28 July 2023