QUART-Online speeds up inference with two key strategies:
1) it accelerates MLLM inference by generating fewer tokens in the latent space rather than the raw space (2.5x);
2) it introduces an action chunk mechanism in the action decoding phase, enabling higher-frequency inference through multi-step predictions (10x).
Together, these two approaches raise the inference rate of the original large quadruped robot model, QUART,
from 2 Hz to 50 Hz, improving the model's accuracy in rapidly changing scenarios.
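The arithmetic behind the 2 Hz to 50 Hz figure, and the idea of reusing one model call for several control steps, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the two speedups compose multiplicatively and that each model call yields a chunk of 10 actions. The names `effective_rate_hz`, `rollout`, and `dummy_policy` are hypothetical.

```python
from typing import Callable, List, Tuple

Action = List[float]


def effective_rate_hz(base_hz: float, token_speedup: float, chunk_size: int) -> float:
    """Control rate if both speedups compose multiplicatively (an assumption)."""
    return base_hz * token_speedup * chunk_size


def rollout(policy: Callable[[int], List[Action]], num_steps: int) -> Tuple[List[Action], int]:
    """Run num_steps control steps, calling the policy only when the current chunk is used up.

    Assumes the policy always returns a non-empty chunk.
    """
    executed: List[Action] = []
    pending: List[Action] = []
    model_calls = 0
    for step in range(num_steps):
        if not pending:
            pending = list(policy(step))  # one model call yields several actions
            model_calls += 1
        executed.append(pending.pop(0))  # execute the next action in the chunk
    return executed, model_calls


def dummy_policy(step: int, chunk_size: int = 10) -> List[Action]:
    """Hypothetical stand-in for the model: returns chunk_size placeholder actions."""
    return [[float(step + offset)] for offset in range(chunk_size)]


actions, calls = rollout(dummy_policy, num_steps=50)
print(effective_rate_hz(2.0, 2.5, 10))  # 50.0
print(len(actions), calls)              # 50 5
```

With a chunk of 10 actions, 50 control steps need only 5 model calls, which is where the 10x factor in this sketch comes from.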
We introduce a novel latency-free quadruped MLLM, dubbed QUART-Online, designed to improve inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information.
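One way to picture mapping continuous action values onto a small set of discrete representative vectors is nearest-neighbour vector quantization. The sketch below is an assumption made for illustration only: the page does not say how ACD builds or learns its representative vectors, and the fixed `codebook`, the two-dimensional actions, and the `encode`/`decode` helpers are all hypothetical.

```python
from typing import List

Chunk = List[List[float]]  # a chunk is a short sequence of action vectors


def flatten(chunk: Chunk) -> List[float]:
    """Concatenate the actions of a chunk into one flat vector."""
    return [value for action in chunk for value in action]


def squared_distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(chunk: Chunk, codebook: List[List[float]]) -> int:
    """Return the index of the codebook vector closest to the flattened chunk."""
    flat = flatten(chunk)
    return min(range(len(codebook)), key=lambda i: squared_distance(flat, codebook[i]))


def decode(index: int, codebook: List[List[float]], action_dim: int) -> Chunk:
    """Recover an approximate chunk by slicing the chosen codebook vector."""
    flat = codebook[index]
    return [list(flat[i:i + action_dim]) for i in range(0, len(flat), action_dim)]


# Hypothetical codebook: three representative chunks of two 2-D actions each.
codebook = [
    [0.0, 0.0, 0.0, 0.0],  # stand still
    [1.0, 0.0, 1.0, 0.0],  # move along the first axis
    [0.0, 1.0, 0.0, 1.0],  # move along the second axis
]

chunk = [[0.9, 0.1], [1.1, -0.1]]
token = encode(chunk, codebook)
print(token)                       # 1
print(decode(token, codebook, 2))  # [[1.0, 0.0], [1.0, 0.0]]
```

In this picture a whole chunk of continuous actions is represented by a single discrete index, so the model has fewer tokens to emit per chunk; the reconstruction is approximate and its quality depends on how well the representative vectors cover the action space.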
Our experimental analysis uses the Quadruped Robot Dataset (QUARD), a comprehensive, large-scale multi-task dataset.
Compared to QUART, QUART-Online achieves significantly higher success rates across various tasks. These results show stronger multi-task and generalization capabilities, particularly when handling unseen objects and instructions.
⚠️ QUART (slow turn) vs. ✅ QUART-Online (swift turn)
⚠️ QUART (clumsy adjustment) vs. ✅ QUART-Online (fast adjustment)
❌ QUART (clumsy adjustment) vs. ✅ QUART-Online (fast adjustment)
QUART: ⚠️ go & avoid (reaction latency), ⚠️ tunnel (inflexible turns), ❌ unload (incoherent actions)
QUART-Online (ours): ✅ go & avoid (fast reaction), ✅ tunnel (flexible turns), ✅ unload (steady movement)
@article{tong2024quartonline,
author = {Tong, Xinyang and Ding, Pengxiang and Wang, Donglin and Zhang, Wenjie and Cui, Can and Sun, Mingyang and Fan, Yiguo and Han, Zhao and Zhang, Hongyin and Dang, Yonghao and Huang, Siteng and Lyu, Shangke},
title = {QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning},
journal = {arXiv preprint arXiv:2412.15576},
year = {2024},
}