LCB

Latent Codes as Bridges

A high-level architectural comparison of LLM-based policies. Predefined skills (left) uses an LLM to call predefined primitives. Language as an interface (middle) uses an LLM to output a simple language command, which is then passed to a language-conditioned policy. LCB (right) uses a latent code as a bridge between the LLM and the low-level policy, facilitating hierarchical control and end-to-end learning.

We wish to develop a hierarchical policy architecture that can enable robots to perform a variety of manipulation tasks when provided with free-form language descriptions. Specifically, we seek an architecture that can handle low-level actions for fine-grained or contact-rich tasks (e.g. pushing, 6D object manipulation) while also having the capability to reason and plan without any external step-by-step instructions.

Our key insight is that we can introduce an additional latent code to act as a bridge between the high-level LLM and the low-level language-conditioned policy. We augment the LLM’s tokenizer by adding a specialized <ACT> token, prompting the model to predict this token in response to actionable questions. The last-layer embedding of the <ACT> token is then used as a latent goal for the downstream policy network. This learnable <ACT> token embedding carries abstract goals and nuances to the low-level policy, details that are not easily conveyed through language alone.
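The data flow described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the authors' implementation: the vocabulary, the hidden-state values, and the helper names add_act_token, extract_latent_goal, and toy_policy are all hypothetical, and a real system would use an actual LLM and a learned policy network in their place.

```python
ACT_TOKEN = "<ACT>"


def add_act_token(vocab):
    """Return a copy of the vocabulary with the <ACT> token appended."""
    extended = dict(vocab)
    if ACT_TOKEN not in extended:
        extended[ACT_TOKEN] = len(extended)
    return extended


def extract_latent_goal(token_ids, last_hidden_states, act_id):
    """Return the last-layer embedding at the first <ACT> position.

    Returns None when the model did not emit <ACT>, i.e. the reply is
    ordinary text and nothing is sent to the low-level policy.
    """
    for position, token_id in enumerate(token_ids):
        if token_id == act_id:
            return last_hidden_states[position]
    return None


def toy_policy(observation, latent_goal):
    """Hypothetical stand-in policy: adds the goal to the observation."""
    return [o + g for o, g in zip(observation, latent_goal)]


# Toy data, invented for illustration only.
vocab = add_act_token({"push": 0, "the": 1, "block": 2})
generated_ids = [0, 1, 2, vocab[ACT_TOKEN]]
# One made-up last-layer vector per generated token.
hidden_states = [[0.0, 1.0], [1.0, 0.0], [0.25, 0.25], [0.5, -0.25]]

latent_goal = extract_latent_goal(generated_ids, hidden_states, vocab[ACT_TOKEN])
action = toy_policy([1.0, 2.0], latent_goal)
```

The point of the sketch is that the policy receives a continuous vector rather than a decoded string, so in a differentiable implementation gradients from the policy loss could flow back into the LLM through that vector, which is what makes end-to-end learning possible.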

A visualization of the two environments along with exemplar tasks that we train and evaluate on. The top depicts the Language Table environment, where we study reasoning tasks (first trajectory) and long-horizon tasks (second trajectory). The bottom depicts the CALVIN long-horizon benchmark, in which the agent must accomplish tasks sequentially.

Results

We systematically evaluated LCB across a diverse set of environments and tasks to demonstrate the efficacy of integrating a pretrained Large Language Model (LLM) with a domain-specific, pretrained low-level policy. Our primary objective was to show the enhanced capabilities of the resulting policy, specifically its high-level language comprehension coupled with nuanced, domain-specific physical awareness. Through our experiments, we aim to answer the following questions:

1. Does LCB enable learning a bridge between the LLM and the policy that is more effective than pure language?
2. Can LCB leverage the pretrained capabilities of LLMs to solve long-horizon tasks by decomposing high-level goals into step-by-step latent commands?
3. Can LCB outperform baseline methods that leverage closed-source, state-of-the-art LLMs such as GPT-4V?

Task success rates on Language Table. The tasks are drawn from the higher-level Language Table tasks from PaLM-E. LangTable refers to the original Language Table policy. +LLaVA (frozen) refers to composing the original Language Table policy with a frozen LLaVA model and few-shot prompting. +GPT-4V similarly refers to composing the original policy with GPT-4V. +LLaVA (finetuned) refers to finetuning the LLaVA model on the language portion only of our mixture dataset, then composing it with the policy. Our results show that LCB is effective on tasks that require additional reasoning and planning. Note that the same model is evaluated on both the long-horizon and reasoning tasks.

From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control
