MotuBrain World Action Model
01 · Unified World–Action Modeling
A single model jointly learns video and action, enabling VLA, world modeling, video generation, inverse dynamics, and video-action prediction within one unified framework.
02 · From Action Prediction to World Understanding
Motus learns not only how to act, but also how tasks, environments, and action outcomes relate to one another.
03 · Learning from Diverse Data
By leveraging video data, unlabeled interactions, and robot trajectories, Motus learns from a much broader range of data than traditional VLA systems.
04 · Three-Stream MoT Architecture
Integrates video, language, and action modeling to combine world understanding, semantic reasoning, and action generation in a single system.
05 · Positive Multi-Task Scaling
As the number of tasks increases, performance improves, indicating that the model learns shared world knowledge rather than task-specific behaviors.
MotuBrain: Advancing the World Action Model
Building on Motus, MotuBrain introduces several key advances:
- Unified modeling across camera configurations enables adaptation to diverse viewpoints and visual setups.
- A dedicated language pathway directly connects semantic understanding with action generation, improving instruction following for complex tasks.
- A unified action representation supports rapid transfer across different robot embodiments and platforms.
- Teacher-forcing autoregression and diffusion-based generation enable long-horizon task execution without external memory systems.
- Optimized Video-to-Action reasoning and real-time closed-loop control enable smooth deployment of large embodied foundation models.
- With as few as 50–100 demonstrations, MotuBrain can quickly adapt to new embodiments and perform long-horizon, multi-step, and bimanual tasks with high success rates.
Architecture & Methodology
Pre-training
- Using relative end-effector (EEF) representations as a unified action space, enabling efficient learning from heterogeneous robot data and rapid adaptation across embodiments.
- Introducing a dedicated language stream that models instructions as an independent modality, improving semantic understanding, task reasoning, and instruction following.
- Combining multi-view visual inputs at the token level with view-dependent RoPE offsets, enabling unified modeling across different camera configurations and numbers of viewpoints.
- Applying noisy conditioning during training by injecting noise into conditioning frames, improving robustness to visual noise, observation errors, and real-world deployment variations.
- Employing H-Bridge Attention, enabling efficient video-action interaction while reducing computational overhead and minimizing modality-specific noise.
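The idea behind a relative EEF action space can be illustrated with a minimal sketch. It is not MotuBrain's implementation: it assumes "relative" means offsets from the end-effector pose at the start of an action chunk, handles position only (orientation is omitted), and uses hypothetical names (`to_relative`, `to_absolute`) and made-up trajectories.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def to_relative(reference: Vec3, absolute_positions: Sequence[Vec3]) -> List[Vec3]:
    """Express each absolute EEF position as an offset from the reference position."""
    return [tuple(p - r for p, r in zip(pos, reference)) for pos in absolute_positions]


def to_absolute(reference: Vec3, relative_positions: Sequence[Vec3]) -> List[Vec3]:
    """Recover absolute EEF positions by adding the offsets back onto the reference."""
    return [tuple(d + r for d, r in zip(delta, reference)) for delta in relative_positions]


# Two hypothetical robots perform the same motion (lift, then move forward)
# from different starting positions in their own base frames.
robot_a_start = (0.5, 0.0, 0.25)
robot_a_traj = [(0.5, 0.0, 0.5), (0.75, 0.0, 0.5)]

robot_b_start = (1.0, 1.0, 1.0)
robot_b_traj = [(1.0, 1.0, 1.25), (1.25, 1.0, 1.25)]

rel_a = to_relative(robot_a_start, robot_a_traj)
rel_b = to_relative(robot_b_start, robot_b_traj)

# In the relative space, both trajectories become the same action sequence.
print(rel_a == rel_b)
```

Because the two trajectories coincide once expressed relative to their starting poses, data from both robots can supervise the same action targets, which is one plausible reading of why a shared relative representation helps transfer across embodiments.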
Post-Training & Inference
- Combining teacher-forcing autoregression with diffusion-based action generation, enabling the model to learn long-horizon dependencies and action continuity while supporting real-time closed-loop control and multi-step task execution.
- Leveraging a suite of inference optimizations, including DiT Cache, FP8 quantization, and CUDA Graphs, to achieve approximately 5 Hz inference frequency despite the model's large scale, delivering around 10× speedup over Motus.
- Introducing IDM / Video-to-Action inference, which updates only the action branch instead of generating full video outputs during inference. Combined with additional system-level optimizations, this increases inference frequency to 11 Hz, exceeding typical human reaction speed.
- Employing Real-Time Chunking to break long action sequences into continuously executable segments. Together with action smoothing strategies, this enables stable closed-loop execution and smooth real-world robot control powered by the World Action Model.
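How overlapping action chunks can be stitched into a smooth command stream is sketched below. This is a simplified illustration, not the Real-Time Chunking algorithm and not MotuBrain's smoothing strategy, which the text does not specify: it assumes a plain linear cross-fade between the unexecuted tail of the previous chunk and the start of the newly predicted chunk, for a single scalar action dimension. The function name `blend_chunks` and the example values are hypothetical.

```python
from typing import List


def blend_chunks(previous_tail: List[float], new_chunk: List[float]) -> List[float]:
    """Cross-fade the not-yet-executed tail of the old chunk into the new chunk.

    Over the overlapping steps, the weight shifts linearly from the old plan
    towards the new plan; steps beyond the overlap come from the new chunk only.
    """
    overlap = min(len(previous_tail), len(new_chunk))
    blended = list(new_chunk)
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight on the new chunk at overlap step i
        blended[i] = (1.0 - w) * previous_tail[i] + w * new_chunk[i]
    return blended


# The old plan still holds the command at 1.0 for three more steps,
# while the freshly predicted chunk wants 2.0 from now on.
previous_tail = [1.0, 1.0, 1.0]
new_chunk = [2.0, 2.0, 2.0, 2.0, 2.0]

commands = blend_chunks(previous_tail, new_chunk)
print(commands)  # [1.25, 1.5, 1.75, 2.0, 2.0]
```

The cross-fade avoids a step change in the command when a new chunk arrives. For scale, an inference frequency of about 5 Hz corresponds to a new prediction roughly every 200 ms, and 11 Hz to roughly every 91 ms; the controller keeps executing the remaining tail of the current chunk in between.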
Results
On RoboTwin 2.0, MotuBrain ranked No. 1 in both the Clean (95.8) and Randomized (96.1) settings. It was the only model to surpass an average score of 95 in randomized environments.
Positive Multi-Task Scaling: As the number of tasks increased, average success rates also increased, demonstrating MotuBrain's ability to learn and transfer shared world knowledge across tasks.
- On WorldArena, MotuBrain also ranked No. 1, demonstrating strong capabilities in understanding physical dynamics, predicting future states, and reasoning about changes in the environment.
- At the CVPR 2026 RoboChallenge Table30v2, MotuBrain placed third across four real-world robot platforms, matching the success rate of the second-place team despite not using its optimal model configuration on two evaluation robots.
- MotuBrain adapted to new robot platforms with as few as 50–100 demonstrations and validated its deployment capabilities across multiple embodiments. Without relying on VLM-based planning, dual-system architectures, external memory, reinforcement data, or retry-specific data, it successfully completed complex real-world tasks using a native World Action Model alone.
Long-Horizon Task Execution
MotuBrain successfully completed tasks involving more than 10 atomic actions. In a flower-arranging task, for example, it generalized beyond the vase positions seen during training and adapted to different placements, interruptions, and failed insertions. Rather than repeating a fixed trajectory, it adjusted its actions based on the current visual state.
Bimanual Coordination
MotuBrain understood different objectives for the left and right arms while maintaining coordinated execution. In tasks such as pouring water while picking up bread, it recognized these as independent but compatible goals, enabling flexible and effective dual-arm collaboration.
Online Error Recovery
Even without retry-specific or reinforcement learning data, MotuBrain demonstrated the ability to recover from failures and self-correct. For example, in a food retrieval task, if the first attempt missed the target, the model often re-attempted the action based on the updated scene, showing an understanding of the task objective rather than simply replaying learned motions.