Thoughts Pipeline

NVIDIA ACE

NVIDIA ACE offers a cloud solution that does exactly what we are trying to achieve:

Real-time avatars that understand user input and respond with gestures, facial expressions, and text-to-speech (TTS).

NVIDIA ACE Overview

Animations are pre-made, and the LLM chooses the appropriate one.

An animation mixer is used to blend between them.
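
The select-then-blend idea described above can be sketched roughly as follows. This is a minimal illustration and not NVIDIA ACE's API: the clip names, the single-pose clip format, the `pick_animation` helper, and the linear crossfade are all assumptions made for the example.

```python
# Hypothetical library of pre-made clips. Real clips would be keyframe
# tracks; one pose per clip (joint name -> value) keeps the sketch short.
ANIMATIONS = {
    "idle": {"head_pitch": 0.0, "arm_raise": 0.0},
    "nod": {"head_pitch": 0.4, "arm_raise": 0.0},
    "wave": {"head_pitch": 0.0, "arm_raise": 1.0},
}


def pick_animation(llm_output: str, default: str = "idle") -> str:
    """Map the LLM's reply to a known clip name, falling back to a default."""
    name = llm_output.strip().lower()
    return name if name in ANIMATIONS else default


def crossfade_weight(elapsed: float, duration: float) -> float:
    """Linear blend weight in [0, 1] for a crossfade lasting `duration` seconds."""
    if duration <= 0:
        return 1.0
    return min(max(elapsed / duration, 0.0), 1.0)


def blend(source: str, target: str, weight: float) -> dict[str, float]:
    """Linearly interpolate every joint from the source clip to the target clip."""
    a, b = ANIMATIONS[source], ANIMATIONS[target]
    return {joint: (1.0 - weight) * a[joint] + weight * b[joint] for joint in a}


# Usage: the LLM answered "Wave"; we are halfway through a 0.5 s crossfade.
target = pick_animation(" Wave ")
pose = blend("idle", target, crossfade_weight(0.25, 0.5))
```

Validating the LLM's reply against the set of known clips means an unexpected answer degrades to the default clip instead of failing, which is one plausible reason to let the model choose among pre-made animations rather than generate motion directly.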

Interesting Paper about Omni-Modal modeling

Among other contributions, the MiniCPM-o 4.5 paper introduces the omniflow framework for incorporating multi-modal streams into real-time LLM interactions. Code and weights are available.

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, and Yuan Yao. 2026. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction. https://doi.org/10.48550/arXiv.2604.27393
