Thoughts Pipeline
NVIDIA ACE
NVIDIA ACE offers a cloud solution that does exactly what we are trying to achieve:
real-time avatars that understand user input and respond with gestures, facial expressions, and TTS.
NVIDIA ACE Product Page
NVIDIA ACE Overview
Animations are pre-made and then chosen as appropriate by the LLM.
An animation mixer is used to blend between them.
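The select-then-blend idea can be sketched roughly as follows. This is a minimal illustration and not NVIDIA ACE's API: the clip names, the `CLIPS` table, and the functions `choose_clip`, `blend`, and `crossfade` are all hypothetical, and each clip is reduced to a single static pose (a list of joint values) to keep the example short. Real clips vary over time.

```python
# Hypothetical clip library: names and pose values are illustrative only.
CLIPS = {
    "idle": [0.0, 0.0],
    "wave": [1.0, 0.2],
    "nod": [0.1, 0.8],
}


def choose_clip(llm_tag: str, default: str = "idle") -> str:
    """Map a tag emitted by the LLM to a known clip, falling back to a default."""
    tag = llm_tag.strip().lower()
    return tag if tag in CLIPS else default


def blend(pose_a, pose_b, weight):
    """Linearly interpolate between two poses; weight is clamped to [0, 1]."""
    w = min(max(weight, 0.0), 1.0)
    return [(1.0 - w) * a + w * b for a, b in zip(pose_a, pose_b)]


def crossfade(from_clip: str, to_clip: str, steps: int):
    """Return `steps` poses that fade from one clip's pose to the other's."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    a, b = CLIPS[from_clip], CLIPS[to_clip]
    return [blend(a, b, i / (steps - 1)) for i in range(steps)]


# Usage: the LLM answers with a tag, and the mixer fades from the current clip.
current = "idle"
target = choose_clip(" Wave ")
frames = crossfade(current, target, steps=3)
```

Falling back to a default clip keeps the avatar in a valid state when the LLM emits a tag that has no matching animation.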
Interesting Paper on Omni-Modal Modeling
Among other contributions, the paper introduces the omniflow framework for incorporating multi-modal streams into real-time LLM interactions. Code and weights are available.
Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, and Yuan Yao. 2026. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction. https://doi.org/10.48550/arXiv.2604.27393