QUART-Online:

Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

Xinyang Tong*¹, Pengxiang Ding*¹,², Yiguo Fan*¹, Donglin Wang*¹, Wenjie Zhang¹, Can Cui¹, Mingyang Sun¹,², Han Zhao¹,², Hongyin Zhang¹,², Yonghao Dang³, Siteng Huang¹,², Shangke Lyu¹

¹MiLAB, Westlake University

²Zhejiang University

³Beijing University of Posts and Telecommunications

ICRA 2025


QUART-Online

We introduce QUART-Online, a novel latency-free multimodal large language model (MLLM) for quadruped robots, designed to improve inference efficiency without degrading the performance of the underlying MLLM. Using action chunk discretization, QUART-Online lets the existing MLLM system, which previously ran at a low frequency, execute more precise actions in real time at 50 Hz.

Overview

QUART-Online enhances the inference process by employing two key strategies:
1) it accelerates MLLM inference by generating a reduced number of tokens in the latent space as opposed to the raw space (2.5x);
2) it introduces an action chunk mechanism during the action decoding phase, facilitating higher-frequency inference via multi-step predictions (10x).
By integrating these two approaches, QUART-Online raises the inference rate of the original large quadruped robot model, QUART, from 2 Hz to 50 Hz, which improves the model's accuracy in rapidly changing scenarios.
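The two factors above multiply. The short Python sketch below only checks that arithmetic. The function name `combined_rate` is hypothetical, and it assumes that the 10x factor corresponds to each model call yielding ten executable actions, which the text implies but does not state outright.

```python
def combined_rate(base_hz, token_speedup, actions_per_call):
    """Effective action rate when each (faster) model call yields several actions."""
    return base_hz * token_speedup * actions_per_call


base_hz = 2.0          # QUART inference rate stated above
token_speedup = 2.5    # speedup from generating fewer tokens in the latent space
actions_per_call = 10  # assumption: the 10x factor equals the actions per chunk

effective_hz = combined_rate(base_hz, token_speedup, actions_per_call)
print(effective_hz)  # 50.0
```

Under these assumptions the faster decoding alone gives 5 Hz, and the action chunk accounts for the remaining factor of ten.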

QUART-Online Framework

QUART-Online is a latency-free quadruped MLLM designed to improve inference efficiency without degrading the performance of the language foundation model. Action Chunk Discretization (ACD) compresses the original action representation space: it maps continuous action values onto a smaller set of discrete representative vectors while preserving critical information.
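Mapping continuous values onto a small set of representative vectors can be illustrated with a nearest-neighbor codebook lookup. The sketch below is a generic vector-quantization example, not the authors' implementation. The codebook entries, the two-dimensional action (forward velocity, yaw rate), the chunk length of two steps, and the helper names `nearest_code` and `decode` are all assumptions made for illustration. How QUART-Online actually learns its representative vectors is not described here.

```python
import math


def flatten(chunk):
    """Concatenate the per-step action vectors of a chunk into one flat list."""
    return [value for step in chunk for value in step]


def nearest_code(chunk, codebook):
    """Return the index of the codebook entry closest (squared Euclidean) to the chunk."""
    flat = flatten(chunk)
    best_index, best_dist = -1, math.inf
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(flat, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def decode(index, codebook, action_dim):
    """Expand a discrete index back into a chunk of per-step action vectors."""
    code = codebook[index]
    return [code[i:i + action_dim] for i in range(0, len(code), action_dim)]


# Hypothetical codebook: each entry is a flattened chunk of 2 steps x 2 dimensions.
codebook = [
    [0.0, 0.0, 0.0, 0.0],  # stand still
    [0.5, 0.0, 0.5, 0.0],  # walk forward
    [0.2, 0.6, 0.2, 0.6],  # walk while turning
]

chunk = [[0.45, 0.05], [0.55, -0.02]]  # continuous actions for two consecutive steps
token = nearest_code(chunk, codebook)
restored = decode(token, codebook, action_dim=2)
print(token, restored)  # 1 [[0.5, 0.0], [0.5, 0.0]]
```

In this toy setup a single discrete index stands in for a whole chunk of actions, so a model that emits one index can hand several control steps to the robot at once.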

Experiment

Our experiments use the Quadruped Robot Dataset (QUARD), a comprehensive, large-scale multi-task dataset.

Task videos: crawl bar, go and avoid, go tunnel

Comparison with QUART

Compared to QUART, QUART-Online achieves higher success rates across various tasks. These results show that QUART-Online substantially improves multi-task and generalization capabilities, particularly when handling unseen objects and instructions.

Simulation Environment

⚠️ QUART (slow turn) vs. ✅ QUART-Online (swift turn)

⚠️ QUART (clumsy adjustment) vs. ✅ QUART-Online (fast adjustment)

❌ QUART (clumsy adjustment) vs. ✅ QUART-Online (fast adjustment)

Real-World Environment

QUART

⚠️ go & avoid (reaction latency)

⚠️ tunnel (inflexible turns)

❌ unload (incoherent actions)

QUART-Online (ours)

✅ go & avoid (fast reaction)

✅ tunnel (flexible turns)

✅ unload (steady movement)

BibTeX

@article{tong2024quartonline,
  author    = {Tong, Xinyang and Ding, Pengxiang and Wang, Donglin and Zhang, Wenjie and Cui, Can and Sun, Mingyang and Fan, Yiguo and Zhao, Han and Zhang, Hongyin and Dang, Yonghao and Huang, Siteng and Lyu, Shangke},
  title     = {QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning},
  journal   = {arXiv preprint arXiv:2412.15576},
  year      = {2024},
}