From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

*Equal contribution
University of California Berkeley

Abstract

Hierarchical control for robotics has long been plagued by the need for a well-defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, this has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g. performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method, Learnable Latent Codes as Bridges (LCB), as an alternative architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and CALVIN, two common language-based benchmarks for embodied agents, we find that LCB vastly outperforms baselines (including those with GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.


Latent Codes as Bridges

A high-level architectural comparison of LLM-based policies. Predefined skills (left) uses an LLM to call predefined primitives. Language as an interface (middle) uses an LLM to output a simple language command, which is then passed into a language-conditioned policy. LCB (right) utilizes a latent code as a bridge between the LLM and the low-level policy, facilitating hierarchical control and end-to-end learning.

We wish to develop a hierarchical policy architecture that can enable robots to perform a variety of manipulation tasks when provided with free-form language descriptions. Specifically, we seek an architecture that can handle low-level actions for fine-grained or contact-rich tasks (e.g. pushing, 6D object manipulation) while also having the capability to reason and plan without any external step-by-step instructions.



Our key insight is that we can introduce an additional latent code to act as a bridge between the high-level LLM and low-level language conditioned policy. We augment the LLM’s tokenizer by adding a specialized <ACT> token, prompting the model to predict this token in response to actionable questions. The last layer embedding of the <ACT> token is then utilized as a latent goal for the downstream policy network. This learnable <ACT> token’s embedding facilitates the transmission of abstract goals and nuances to the low-level policy — details that are not easily conveyed through language alone.
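The data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the "LLM" and "policy" below are small random stand-ins written in pure Python, and every name (`add_act_token`, `toy_llm_last_hidden`, `extract_act_latent`, `toy_policy`), every dimension, and the running-mean context are hypothetical choices made only to show the three steps: extend the vocabulary with <ACT>, read the last-layer vector at the <ACT> position, and condition the low-level policy on that vector.

```python
import math
import random

random.seed(0)

HIDDEN_DIM = 8   # hypothetical LLM hidden size
ACTION_DIM = 2   # hypothetical action size, e.g. a 2D end-effector delta

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def add_act_token(vocab, embeddings):
    """Return copies of vocab and embeddings extended with one new <ACT> row."""
    vocab = dict(vocab)
    vocab["<ACT>"] = len(vocab)
    return vocab, embeddings + random_matrix(1, HIDDEN_DIM)

def toy_llm_last_hidden(token_ids, embeddings, w):
    """Stand-in for an LLM: one 'last-layer' vector per token (causal running mean)."""
    hidden, running = [], [0.0] * HIDDEN_DIM
    for n, tok in enumerate(token_ids, start=1):
        running = [r + e for r, e in zip(running, embeddings[tok])]
        mean = [r / n for r in running]
        hidden.append([math.tanh(h) for h in matvec(mean, w)])
    return hidden

def extract_act_latent(token_ids, hidden, act_id):
    """Select the last-layer vector at the final <ACT> position as the latent goal."""
    positions = [i for i, tok in enumerate(token_ids) if tok == act_id]
    if not positions:
        raise ValueError("sequence contains no <ACT> token")
    return hidden[positions[-1]]

def toy_policy(observation, latent_goal, w_pi):
    """Stand-in low-level policy conditioned on an observation and the latent goal."""
    return [math.tanh(a) for a in matvec(observation + latent_goal, w_pi)]

# Toy vocabulary and embedding table (assumptions, not the real tokenizer).
vocab = {"<pad>": 0, "push": 1, "the": 2, "red": 3, "block": 4}
embeddings = random_matrix(len(vocab), HIDDEN_DIM)
vocab, embeddings = add_act_token(vocab, embeddings)

w_llm = random_matrix(HIDDEN_DIM, HIDDEN_DIM)
observation = [0.0, 0.0, 0.0, 0.0]  # placeholder robot observation
w_pi = random_matrix(len(observation) + HIDDEN_DIM, ACTION_DIM)

token_ids = [vocab[t] for t in ["push", "the", "red", "block", "<ACT>"]]
hidden = toy_llm_last_hidden(token_ids, embeddings, w_llm)
latent_goal = extract_act_latent(token_ids, hidden, vocab["<ACT>"])
action = toy_policy(observation, latent_goal, w_pi)
print(len(latent_goal), len(action))
```

In this sketch only the new <ACT> row is appended to the embedding table and the existing word rows are left as they were, which mirrors the stated motivation of finetuning end to end without disturbing the pretrained word-token embeddings. How the real system trains that row, and whether the latent is projected before reaching the policy, is not specified here.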

A visualization of the two environments along with exemplar tasks that we train and evaluate on. The top depicts the Language Table environment, where we study reasoning tasks (first trajectory) and long-horizon tasks (second trajectory). The bottom depicts the CALVIN long-horizon benchmark, in which the agent must sequentially accomplish tasks.

Results

We systematically evaluate LCB across a diverse set of environments and tasks to demonstrate the efficacy of integrating a pretrained Large Language Model (LLM) with a domain-specific, pretrained low-level policy. Our primary objective is to show that the resulting policy combines high-level language comprehension with domain-specific physical awareness. Through our experiments, we aim to answer the following questions:

  1. Does LCB enable learning a bridge between the LLM and the policy that is more effective than pure language?
  2. Can LCB leverage the pretrained capabilities of LLMs to solve long-horizon tasks by decomposing high-level goals into step-by-step latent commands?
  3. Can LCB outperform baseline methods that leverage closed-source, state-of-the-art LLMs such as GPT-4V?
Task success rates on Language Table. The tasks are drawn from the higher-level Language Table tasks from PaLM-E [31]. LangTable refers to the original Language Table policy [9]. +LLaVA (frozen) refers to composing the original Language Table policy with a frozen LLaVA model and few-shot prompting. +GPT-4V similarly refers to composing the original policy with GPT-4V. +LLaVA (finetuned) refers to finetuning the LLaVA model on our mixture dataset using only the language, then composing it with the policy. Our results show that leveraging LCB is effective on tasks that require additional reasoning and planning.

BibTeX

@misc{shentu2024lcb,
      title={From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control},
      author={Yide Shentu and Philipp Wu and Aravind Rajeswaran and Pieter Abbeel},
      year={2024},
}