Avatar Engine

Unreal Engine’s MetaHumans are used to quickly build, display, animate and distort virtual clones of the actors.

The deep integration into UE makes it feasible to render high-quality avatars in real time, using UE’s systems that are optimized for real-time rendering in virtual production contexts.

File Management

Authors: Malte Hillebrand

File Change History:

Date | Change | Author
2026-05-04 | First Version | Malte

Building an Avatar

MetaHuman Avatar Example

A variety of tools with varying degrees of complexity can be used to produce a MetaHuman Identity which Unreal Engine can animate. Likeness to an actor can be achieved via manual labor and comparison or a more advanced technical pipeline.

Key Concepts & Definitions

  • MetaHuman Identity: A specialized asset in UE that acts as a “bridge”: it maps a unique 3D mesh (from a scan or photo) to the standardized MetaHuman topology
  • MetaHuman DNA: The underlying data file that contains the specific vertex positions, rig constraints, and animation logic for a character. It ensures the avatar behaves realistically regardless of its shape
  • Mesh to MetaHuman: The automated process within Unreal Engine that “wraps” the standard MetaHuman topology onto a custom 3D scan

“Standard” Method: MetaHuman Creator (MHC)

  • Preset Start: Choose from a library of diverse humans
  • Sculpting: Manually adjust features using a sculpt tool or by blending between different preset faces

Advanced Methods of Cloning a Human

iOS Link (MetaHuman Live Link app)

Uses the TrueDepth front camera of an iPhone to capture facial geometry

The actor performs a neutral pose and a profile turn in the MetaHuman Live Link app. The data is sent to UE to solve the MetaHuman Identity

👍 Extremely fast; captures the general mesh

👎 Cannot transfer a texture

Photogrammetry (RealityScan / Meshroom)

The highest-fidelity method, involving 50–200 high-resolution photos of the actor.

Photos are processed into a dense point cloud and then a high-poly mesh. This mesh is imported into UE as the source for a MetaHuman Identity

👍 Captures a detailed mesh; captures a detailed texture

👎 Hard to automate; slow and tedious process; the texture has to be transferred onto the complex MetaHuman UV map (unclear, still to be tested)

FaceBuilder by KeenTools

A plugin for Blender that allows for “manual photogrammetry” from a limited set of photos

Align a base head model to 2D photos by placing “pins” on features (nose, eyes, lips)

👍 Great for when you cannot perform a live scan and only have reference photography; produces a very clean mesh that is easy for the MetaHuman solver to process

👎 Manual labor; time-intensive

OpenCV Pose Estimation (Skeletal Scaling)

OpenCV can be used to match the body proportions

A webcam feed of the actor is processed using OpenCV to extract specific metrics: total height, arm length, and shoulder width

Ratios are used to drive the “Body” selection in MHC or procedurally scale the skeletal bones in the UE Blueprint Construction Script
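The ratio step described above can be sketched as follows. This is a minimal sketch, assuming that some pose estimator (not shown here) already returns 2D keypoints per frame; the keypoint names and pixel values are hypothetical and only serve as an illustration.

```python
from math import dist

# Hypothetical 2D keypoints in pixel coordinates (x, y), as a pose estimator
# might return them for one webcam frame. Names and values are illustrative.
KEYPOINTS = {
    "head_top": (320, 40),
    "left_shoulder": (260, 140),
    "right_shoulder": (380, 140),
    "left_elbow": (240, 240),
    "left_wrist": (230, 330),
    "left_ankle": (300, 460),
    "right_ankle": (340, 460),
}


def body_metrics(kp):
    """Return total height, arm length and shoulder width in pixels."""
    ankle_mid = (
        (kp["left_ankle"][0] + kp["right_ankle"][0]) / 2,
        (kp["left_ankle"][1] + kp["right_ankle"][1]) / 2,
    )
    return {
        "height": dist(kp["head_top"], ankle_mid),
        # Arm length is measured along the arm: shoulder -> elbow -> wrist.
        "arm_length": dist(kp["left_shoulder"], kp["left_elbow"])
        + dist(kp["left_elbow"], kp["left_wrist"]),
        "shoulder_width": dist(kp["left_shoulder"], kp["right_shoulder"]),
    }


def body_ratios(kp):
    """Normalize by height so the ratios do not depend on camera distance."""
    m = body_metrics(kp)
    return {
        "arm_to_height": m["arm_length"] / m["height"],
        "shoulder_to_height": m["shoulder_width"] / m["height"],
    }


if __name__ == "__main__":
    print(body_ratios(KEYPOINTS))
```

Ratios rather than absolute pixel values are used here because the pixel measurements change with the distance between actor and webcam, while the proportions do not.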

Comparison & Combination

Method | Accuracy | Setup Time | Complexity | Best For
MHC Only | Low | Low | Low | Generic background characters
iOS Link | Medium | Medium | Low | Rapid prototyping / live rehearsal
FaceBuilder | Medium-High | High | Medium | When the actor is not physically present
Photogrammetry | Ultra-High | Very High | High | Hero “clones” for close-ups

Combining Tools: In practice, FaceBuilder or iOS Link is used to get the facial structure right, photogrammetry provides the skin textures, and OpenCV ensures that the avatar’s height matches the physical actor on stage

Animating an Avatar

LiveLink is UE’s common interface for streaming and consuming animation data from external sources.

A variety of sources can be used to drive a variety of subjects: some input can move a camera within the engine, a webcam feed can drive a MetaHuman’s facial mesh, and a MoCap suit can move a MetaHuman’s skeleton

Pre-Built Animations

By pre-building animations and interpolating between them using UE’s Blend Spaces, complex and unique movement can be achieved.
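The idea of interpolating between pre-built animations can be illustrated with a one-dimensional blend. This is a minimal sketch, not UE’s implementation: the sample values (movement speeds), the `knee` joint and the plain linear blending of angles are assumptions for illustration; an engine blends full bone transforms.

```python
from bisect import bisect_right


def blend_weights(samples, value):
    """Return {sample_index: weight} for a 1D blend.

    samples: sorted parameter values at which an animation was authored.
    value:   current input parameter, clamped to the sample range.
    """
    if value <= samples[0]:
        return {0: 1.0}
    if value >= samples[-1]:
        return {len(samples) - 1: 1.0}
    hi = bisect_right(samples, value)  # first sample greater than value
    lo = hi - 1
    t = (value - samples[lo]) / (samples[hi] - samples[lo])
    return {lo: 1.0 - t, hi: t}


def blend_pose(poses, weights):
    """Linearly blend per-joint angles (degrees) of the weighted poses."""
    joints = poses[0].keys()
    return {j: sum(w * poses[i][j] for i, w in weights.items()) for j in joints}


if __name__ == "__main__":
    speeds = [0.0, 150.0, 400.0]  # hypothetical idle, walk, run samples
    poses = [{"knee": 5.0}, {"knee": 30.0}, {"knee": 60.0}]
    weights = blend_weights(speeds, 275.0)
    print(weights, blend_pose(poses, weights))
```

Only the two neighbouring samples contribute to the result, which is why a small set of pre-built clips can cover a continuous range of inputs.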

Problem: A large number of animations has to be pre-built

Procedurally Generating Animations

Generating rig animations requires balancing generation time against quality.

Complex full-rig animations don’t seem to be feasible in real-time yet, while face mesh animation from an audio stream seems to be doable.

MetaHuman Animator: Audio to Animation

Native MetaHuman plugin that processes audio and builds facial animations to drive a MetaHuman face mesh.

Works in real-time!

Lacks a bit of “depth”: mostly just mouth movement, with hardly any emotional changes to the face (frowning etc.)

NVIDIA Audio2Face

Is able to generate expressive facial animation from an audio source in real-time

NVIDIA Kimodo - Full-Rig Diffusion Model

Generates a full skeleton animation from a text prompt

  • Reasonably fast, but not real-time (~2-5 seconds)
  • Does not natively work with UE MetaHuman’s rig (uses SOMA rig, has to be retargeted)

First Test with Kimodo -> Blender for conversion into FBX animation -> Retargeting in Unreal Engine to MetaHuman

NVIDIA Kimodo can export as .BVH file, a motion capture file format.

  • Generation time for 10 animations in batch: 25.99s
  • Average per animation: 2.6s
  • Time to Load Model: 23.76s
  • Device: NVIDIA GeForce RTX 4090

Unreal Engine can NOT natively read .BVH files, but Blender can.

Blender is used to convert the .BVH files into .FBX files with the animation baked in.

To import properly into Unreal Engine, a dummy mesh (a cube) is attached to the rig in Blender. This way, Unreal Engine recognizes the animation as a skeletal animation.
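The Blender conversion step could be scripted roughly as follows. This is a sketch under assumptions, not the exact script used for the tests above: it is meant to run inside Blender’s bundled Python, it assumes that the BVH importer leaves the new armature as the active object, and skinning the dummy cube to the first bone is only one possible way of attaching it. The helper names are hypothetical.

```python
from pathlib import Path


def fbx_path_for(bvh_path: str) -> str:
    """Place the .fbx next to the source .bvh, keeping the file name."""
    return str(Path(bvh_path).with_suffix(".fbx"))


def convert_bvh_to_fbx(bvh_path: str) -> str:
    """Import one .BVH, attach a skinned dummy cube, export a baked .FBX."""
    import bpy  # only available inside Blender's bundled Python

    bpy.ops.wm.read_factory_settings(use_empty=True)  # start from an empty scene
    bpy.ops.import_anim.bvh(filepath=bvh_path, update_scene_duration=True)
    armature = bpy.context.object  # assumption: importer makes the armature active

    # Dummy mesh so that Unreal Engine sees a skeletal mesh with an animation.
    bpy.ops.mesh.primitive_cube_add(size=0.1)
    cube = bpy.context.object
    cube.parent = armature
    modifier = cube.modifiers.new(name="Armature", type="ARMATURE")
    modifier.object = armature
    # Weight every cube vertex fully to the first bone of the rig.
    group = cube.vertex_groups.new(name=armature.data.bones[0].name)
    group.add([v.index for v in cube.data.vertices], 1.0, "REPLACE")

    out_path = fbx_path_for(bvh_path)
    bpy.ops.export_scene.fbx(
        filepath=out_path,
        object_types={"ARMATURE", "MESH"},
        bake_anim=True,        # bake the animation into the exported file
        add_leaf_bones=False,  # avoid extra end bones in the exported skeleton
    )
    return out_path
```

Such a function could be called in a loop over all exported .BVH files from a script started with `blender --background --python <script>`, which would fit the batch generation described above.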

The Kimodo skeleton differs from the UE MetaHuman skeleton. For the re-targeting, each chain of bones in the Kimodo rig has to be named after the chains in the UE MetaHuman rig. Some auto-chaining can be used, but it failed more often than it simplified the process.

BUT: After doing the process ONCE for the Kimodo skeleton, it can be used when importing all other .FBX converted animations WITHOUT MANUAL RE-TARGETING!

The animations are then automatically re-targeted using the existing skeleton and exported as rig animation sequences.

These animation sequences then have to be assigned to the MetaHuman skeleton to be able to drive the MetaHuman mesh.

2026-05-13 - Results with few re-targeted bones and therefore buggy behaviour, first tests of the pipeline: Kimodo 1st Test

2026-05-18 - Results with properly re-targeted bones (hands and feet need further improvement): Kimodo 1st Test

NVIDIA ACE (Avatar Cloud Engine)

Bit hard for me to fully grasp, but seems to be a complete workflow that does:

  1. NLP from voice input
  2. Generate LLM output
  3. Output TTS and drive facial animations using Audio2Face in real-time
  4. Choose and blend between pre-defined animations appropriate to the generated answer

State of the Art: DIDiffGes (2025)

Can generate gestures from speech with just 10 sampling steps. Decouples gesture data into body and hand distributions

Claims to be real-time!

Demo: https://cyk990422.github.io/DIDiffGes/

! Note: has since been merged with another paper (“HoloGest”) into “Efficient-Audio-Gesture”

State of the Art: AsynFusion (2025)

Enables parallel generation of facial and body animation, running a synchronization between the two to account for their dependencies.

Claims to generate more cohesive / “natural” full body animations from a single audio stream.

Does not seem to work in real-time.

Sending Data

Getting data in and out of Unreal Engine

Transferring Framebuffers

A variety of open-source solutions leverage GPU architecture to send and receive framebuffers with low latency.

Spout and Syphon are the fastest thanks to their OS-specific integration, sharing raw textures directly on the local GPU, whereas NDI offers a cross-platform solution by encoding video to send over an IP network.

Raw pixel data is important for img2img generation using diffusion models: compression artifacts can alter the diffusion output significantly and hinder a visually consistent output.

Spout (Windows)

✅ Transfers raw pixel data

Syphon (macOS)

✅ Transfers raw pixel data

NDI: Network Device Interface (Cross-Platform)

Encodes the video and sends it over an IP network and can also be easily routed locally (using 127.0.0.1 or localhost) to share frames between apps on the same machine.

Integrated into Unreal Engine: NewTek provides an official Unreal Engine plugin.

❌ Sends compressed video frames

Applies a lightweight compression codec. Latency is low (usually around 1 frame), but NDI is technically slightly heavier on the CPU/GPU than Spout’s or Syphon’s raw memory sharing.

Native Unreal Alternative: Pixel Streaming

❌ Sends compressed video frames

It uses a one-way-out architecture to compress and send video frames out of Unreal Engine.

It does not have the capability to receive an image / framebuffer as an input for a shader etc.

Animation Data

The LiveLink protocol can be used to send data structures to and from Unreal Engine, but it is not designed to parse arbitrary JSON data. Third-party plugins enable this, but parsing the JSON is computationally heavy! Other, more efficient data types can be used to improve performance, but they still limit actions to animating a skeletal mesh.

OSC (Open Sound Control) is a protocol that can be used natively within Unreal Engine to send and receive any data; it does not have to be sound (!). It organizes data using a URL-like address system. Instead of sending a confusing string of numbers, you send a specific value to a specific address.

Protocol | JSON via LiveLink | OSC (Open Sound Control)
Performance | Moderate (CPU parsing strings) | Better (parsing binary data)
Ease of Use (Animation) | High. LiveLink maps data directly to skeletal rigs automatically | Low. Actions have to be routed manually in a(n Animation) Blueprint
Flexibility | Limited mostly to transforms and blendshapes | Virtually endless. Can trigger events, change material colors, spawn particles, etc.
Support | Requires custom third-party plugins | Native UE plugin, supported by almost all creative software
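
To make the OSC address system and its binary encoding concrete, here is a minimal sketch of an OSC 1.0 message built with the Python standard library only. The address `/avatar/jaw_open`, the host and the port 8000 are hypothetical; the receiving side in Unreal Engine (an OSC server listening on that port and routing the address) is assumed and not shown.

```python
import socket
import struct


def _pad(data: bytes) -> bytes:
    """Null-terminate and pad to a multiple of 4 bytes, as OSC 1.0 requires."""
    data += b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def encode_osc_floats(address: str, *values: float) -> bytes:
    """Encode one OSC message that carries only float32 arguments."""
    type_tags = "," + "f" * len(values)  # e.g. ",ff" for two floats
    payload = b"".join(struct.pack(">f", v) for v in values)  # big-endian
    return _pad(address.encode("ascii")) + _pad(type_tags.encode("ascii")) + payload


def send_osc(address: str, *values: float, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Send the message as a single UDP datagram (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_osc_floats(address, *values), (host, port))


if __name__ == "__main__":
    # Hypothetical address: one value sent to one named target.
    send_osc("/avatar/jaw_open", 0.5)
```

The whole message is 28 bytes of fixed-layout binary data, so the receiver only has to read an address and a float instead of parsing a JSON string.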

Distorting an Avatar

When distorting the avatar, it is important to keep the animation of the TTS and its body language intact and not interfere with this animation layer.

Hence, all distortions are applied afterwards and only transform later stages of the pipeline.

Mesh Distortion

Bone Scaling Proportional

Bone Scaling Non-Proportional

The bones of a MetaHuman can be transformed.
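
The difference between proportional and non-proportional bone scaling can be shown on a simplified straight bone chain. This is only a sketch: the bone lengths and scale factors are hypothetical, and a real skeleton applies scale hierarchically on full 3D transforms.

```python
def chain_positions(bone_lengths, scales):
    """Joint positions along a straight bone chain after scaling each bone."""
    positions = [0.0]
    for length, scale in zip(bone_lengths, scales):
        positions.append(positions[-1] + length * scale)
    return positions


# Hypothetical upper arm, forearm and hand lengths in cm.
ARM = [30.0, 25.0, 10.0]

# Proportional: every bone gets the same factor, the silhouette is preserved.
proportional = chain_positions(ARM, [1.5, 1.5, 1.5])

# Non-proportional: only the forearm is stretched, which distorts the limb.
non_proportional = chain_positions(ARM, [1.0, 2.0, 1.0])

if __name__ == "__main__":
    print(proportional, non_proportional)
```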

(Vertex) Shader Distortion

Vertex Shader Displacement

The vertices of a MetaHuman can be displaced in the vertex shader to transform the mesh.
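
What such a displacement computes per vertex can be sketched on the CPU as follows. This is an illustration only: the sine wave along the normal, the parameter names and the default values are assumptions, and in Unreal Engine the equivalent logic would typically be authored in the material (World Position Offset) rather than in Python.

```python
import math


def displace(vertices, normals, time, amplitude=0.5, frequency=2.0):
    """Offset each vertex along its normal by a sine wave over its height (y).

    vertices, normals: lists of (x, y, z) tuples of equal length.
    time: animation time in seconds, shifts the wave along the mesh.
    """
    displaced = []
    for (x, y, z), (nx, ny, nz) in zip(vertices, normals):
        offset = amplitude * math.sin(frequency * y + time)
        displaced.append((x + nx * offset, y + ny * offset, z + nz * offset))
    return displaced


if __name__ == "__main__":
    verts = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    norms = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(displace(verts, norms, time=0.0))
```

Because only the final vertex positions are changed, the skeletal animation underneath stays untouched, which matches the requirement not to interfere with the animation layer.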

Neural Style Transfer

Using Unreal Engine’s Neural Post Processing, ONNX style transfer of the framebuffer is possible within the engine.

This is significantly faster than applying a style transfer on the framebuffer: the render pipeline does not need to render a high-quality MetaHuman first, just for it to be processed again!

e.g.: left eye controls right arm

Framebuffer to Image2Image Diffusion

By transferring a framebuffer to ComfyUI, the output of the render can be used as the basis to generate an image.