QUART-Online:

Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

1MiLAB, Westlake University

2Zhejiang University

3Beijing University of Posts and Telecommunications


QUART-Online

We introduce QUART-Online, a novel latency-free multimodal large language model (MLLM) for quadruped robots, designed to improve inference efficiency without degrading the performance of the underlying MLLM. Using action chunk discretization, QUART-Online upgrades the existing MLLM system, which previously operated at a low frequency, so that more precise actions can be executed in real time at 50Hz.

Overview


QUART-Online speeds up inference with two key strategies:
1) it accelerates MLLM inference by generating fewer tokens in the latent space than would be needed in the raw space (2.5x);
2) it introduces an action chunk mechanism in the action decoding phase, enabling higher-frequency inference through multi-step predictions (10x).
Together, these two strategies raise the inference rate of the original large quadruped robot model, QUART, from 2Hz to 50Hz, improving the model's accuracy in rapidly changing scenarios.
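The two speedups compose multiplicatively, which is how 2Hz becomes 50Hz. The short sketch below only checks that arithmetic. The constant and function names are illustrative, and it assumes the 2.5x and 10x factors are independent and apply to the 2Hz baseline stated above.

```python
# Illustrative arithmetic only; names are hypothetical, values come from the text above.
BASE_RATE_HZ = 2.0           # inference rate of the original QUART model
LATENT_TOKEN_SPEEDUP = 2.5   # fewer tokens generated in the latent space
ACTION_CHUNK_SPEEDUP = 10.0  # multi-step prediction via action chunks


def effective_rate_hz(base_hz, *speedups):
    """Multiply a base rate by a sequence of independent speedup factors."""
    rate = base_hz
    for factor in speedups:
        rate *= factor
    return rate


# Rate after the latent-space token reduction alone.
after_latent = effective_rate_hz(BASE_RATE_HZ, LATENT_TOKEN_SPEEDUP)

# Rate after both strategies are combined.
final_rate = effective_rate_hz(
    BASE_RATE_HZ, LATENT_TOKEN_SPEEDUP, ACTION_CHUNK_SPEEDUP
)

# Time budget per action at the final rate, in milliseconds.
control_period_ms = 1000.0 / final_rate

print(after_latent, final_rate, control_period_ms)
```

At 50Hz each action has a budget of 20 ms, compared with 500 ms at the original 2Hz.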

QUART-Online Framework


We introduce QUART-Online, a novel latency-free quadruped MLLM designed to improve inference efficiency without degrading the performance of the language foundation model. Action Chunk Discretization (ACD) compresses the original action representation space by mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information.
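One way to picture mapping a chunk of continuous actions onto a discrete representative vector is nearest-neighbor vector quantization against a codebook. The sketch below is a minimal illustration of that general idea, not the authors' implementation. The chunk length, action dimension, codebook size, the random codebook, and all function names are assumptions made for the example; this page does not say how ACD builds or learns its representative vectors.

```python
import random

# All sizes below are assumptions for illustration, not values from the paper.
CHUNK_LEN = 10      # number of consecutive actions in one chunk
ACTION_DIM = 12     # dimensions of a single action
CODEBOOK_SIZE = 64  # number of discrete representative vectors

rng = random.Random(0)
VEC_LEN = CHUNK_LEN * ACTION_DIM

# Hypothetical codebook: random here, whereas a real system would learn it from data.
codebook = [
    [rng.gauss(0.0, 1.0) for _ in range(VEC_LEN)] for _ in range(CODEBOOK_SIZE)
]


def flatten(chunk):
    """Turn a CHUNK_LEN x ACTION_DIM chunk into one flat vector."""
    return [value for step in chunk for value in step]


def unflatten(vector, action_dim):
    """Split a flat vector back into per-step actions."""
    return [vector[i:i + action_dim] for i in range(0, len(vector), action_dim)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_chunk(chunk, codebook):
    """Return the index of the codebook vector closest to the action chunk."""
    flat = flatten(chunk)
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(flat, codebook[i]),
    )


def decode_index(index, codebook, action_dim):
    """Recover a multi-step action chunk from a single discrete index."""
    return unflatten(codebook[index], action_dim)


# A continuous chunk close to codebook entry 42, perturbed by small noise.
noisy = [value + rng.gauss(0.0, 0.01) for value in codebook[42]]
chunk = unflatten(noisy, ACTION_DIM)

index = encode_chunk(chunk, codebook)                 # one discrete token
recovered = decode_index(index, codebook, ACTION_DIM)  # CHUNK_LEN actions
```

Under these assumptions, a single discrete index stands in for a whole chunk of actions, so the model emits far fewer tokens per executed action than it would by generating every action dimension separately.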

Experiment

Our experiments use the Quadruped Robot Dataset (QUARD), a comprehensive, large-scale multi-task dataset.

Demonstration videos: QUART (Crawl Bar, Go Avoid, Go Tunnel); QUART-Online (Unload, Crawl Bar, Go Avoid, Go Tunnel).

Comparison with QUART

Compared to QUART, QUART-Online achieves significantly higher success rates across tasks. The results show stronger multi-task and generalization capabilities, particularly when handling unseen objects and instructions.

Simulation Environment

⚠️QUART
(slow turn)

✅QUART-Online
(swift turn)

⚠️QUART
(clumsy adjustment)

✅QUART-Online
(fast adjustment)

❌QUART
(clumsy adjustment)

✅QUART-Online
(fast adjustment)

Real-World Environment

QUART

⚠️go & avoid
(reaction latency)

⚠️tunnel
(inflexible turns)

❌unload
(incoherent actions)

QUART-Online (ours)

✅go & avoid
(fast reaction)

✅tunnel
(flexible turns)

✅unload
(steady movement)

BibTeX

@article{tong2024quartonline,
  author    = {Tong, Xinyang and Ding, Pengxiang and Wang, Donglin and Zhang, Wenjie and Cui, Can and Sun, Mingyang and Fan, Yiguo and Han, Zhao and Zhang, Hongyin and Dang, Yonghao and Huang, Siteng and Lyu, Shangke},
  title     = {QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning},
  journal   = {arXiv preprint arXiv:2412.15576},
  year      = {2024},
}