15 Sep 2025
UnifoLM-WMA-0 is Unitree's open-source world-model–action architecture spanning multiple types of robotic embodiments, designed specifically for general-purpose robot learning. Its core component is a world model capable of understanding the physical interactions between robots and their environments. This world model provides two key functions: (a) Simulation Engine: operates as an interactive simulator to generate synthetic data for robot learning; (b) Policy Enhancement: connects with an action head and, by predicting future interaction processes with the physical world, further optimizes decision-making performance. Deployment on real robots is shown below:
(Note: The top-right window shows the world model’s predicted video of future actions.)
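As a rough illustration of function (a), the sketch below rolls a toy dynamics model over sampled action sequences to collect synthetic transitions for downstream robot learning. `ToyWorldModel`, the latent dimensions, and the random action sampling are hypothetical stand-ins, not the actual UnifoLM-WMA-0 interfaces.

```python
import torch
import torch.nn as nn


class ToyWorldModel(nn.Module):
    """Stand-in dynamics model: (current obs latent, action) -> next obs latent."""

    def __init__(self, obs_dim=64, act_dim=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128),
            nn.ReLU(),
            nn.Linear(128, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


@torch.no_grad()
def generate_synthetic_rollout(world_model, start_obs, horizon=16, act_dim=14):
    """Collect (obs, action, next_obs) tuples without touching a real robot."""
    obs, transitions = start_obs, []
    for _ in range(horizon):
        act = torch.randn(obs.shape[0], act_dim)  # e.g. sampled or candidate actions
        next_obs = world_model(obs, act)
        transitions.append((obs, act, next_obs))
        obs = next_obs
    return transitions


wm = ToyWorldModel()
rollout = generate_synthetic_rollout(wm, start_obs=torch.randn(8, 64))
print(len(rollout), rollout[0][0].shape)  # 16 synthetic transitions, batch of 8
```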
Fine-tuning the Video Generation Model: First, we fine-tune the video generation model on the Open-X dataset to adapt its generative capability to robotic operation scenarios. The model takes an image and a text instruction as input and generates a video of the corresponding future interactions. Generation results of the fine-tuned model on the test set are shown below:
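To make the conditioning interface concrete, here is a minimal fine-tuning sketch under the assumption of a generic image-and-text-conditioned video generator. `CondVideoGenerator`, the latent dimensions, and the MSE objective are placeholders; the actual model, Open-X data pipeline, and training objective follow the UnifoLM-WMA-0 codebase, not this toy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CondVideoGenerator(nn.Module):
    """Toy stand-in: maps an image latent and a text embedding to a short clip
    of future frames (kept as flat latents here, not pixels)."""

    def __init__(self, img_dim=512, txt_dim=512, frames=8, frame_dim=512):
        super().__init__()
        self.frames, self.frame_dim = frames, frame_dim
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, frames * frame_dim),
        )

    def forward(self, img_latent, txt_emb):
        x = torch.cat([img_latent, txt_emb], dim=-1)
        return self.net(x).view(-1, self.frames, self.frame_dim)


model = CondVideoGenerator()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One illustrative optimization step on a fake Open-X-style batch:
# encoded current frame + encoded instruction -> encoded future frames.
img = torch.randn(4, 512)        # current camera frame (encoded)
txt = torch.randn(4, 512)        # text instruction (encoded)
target = torch.randn(4, 8, 512)  # ground-truth future frames (encoded)

loss = F.mse_loss(model(img, txt), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```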
UnifoLM-WMA-0 Architecture: We propose a world-model–embedded policy architecture. This framework enables the world model to operate in two modes: (1) Decision-Making Mode: predicts information about future physical interactions to assist the policy in generating actions; (2) Simulation Mode: generates high-fidelity environmental feedback conditioned on robot actions. The complete system architecture and its workflow are shown below:
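The sketch below shows how the two modes can sit behind a single interface: simulation mode maps robot actions to environment feedback, while decision-making mode first imagines a future and then lets an action head decide on top of it. `DualModeWorldModel`, the `mode` argument, and all shapes are hypothetical; the real world model is a video generation model, not a small MLP.

```python
import torch
import torch.nn as nn


class DualModeWorldModel(nn.Module):
    """Toy world model exposing both operating modes through one forward()."""

    def __init__(self, obs_dim=64, act_dim=14, horizon=8):
        super().__init__()
        self.horizon, self.obs_dim, self.act_dim = horizon, obs_dim, act_dim
        # Predicts future observation latents from the current latent + actions.
        self.dynamics = nn.Linear(obs_dim + act_dim * horizon, obs_dim * horizon)
        # Small action head used only in decision-making mode.
        self.action_head = nn.Linear(obs_dim * (horizon + 1), act_dim * horizon)

    def forward(self, obs, actions=None, mode="decision"):
        if mode == "simulation":
            # Simulation mode: given robot actions, return environment feedback
            # (future observation latents standing in for generated video).
            x = torch.cat([obs, actions.flatten(1)], dim=-1)
            return self.dynamics(x).view(-1, self.horizon, self.obs_dim)
        # Decision-making mode: imagine a future (placeholder actions here),
        # then let the action head decide conditioned on that prediction.
        placeholder = torch.zeros(obs.shape[0], self.horizon, self.act_dim)
        imagined = self.forward(obs, placeholder, mode="simulation")
        feats = torch.cat([obs, imagined.flatten(1)], dim=-1)
        return self.action_head(feats).view(-1, self.horizon, self.act_dim)


wm = DualModeWorldModel()
obs = torch.randn(2, 64)
actions = wm(obs, mode="decision")                             # (2, 8, 14)
feedback = wm(obs, torch.randn(2, 8, 14), mode="simulation")   # (2, 8, 64)
print(actions.shape, feedback.shape)
```

The point the sketch tries to convey is that both modes are served by the same world model, so the futures the policy reasons over and the feedback produced as a simulator come from one set of learned dynamics; the specific layers are not meaningful.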
UnifoLM-WMA-0 Action Controllable Generation: We trained the model on five open-source datasets from Unitree Robotics. Test results show that, as a simulation engine, the model achieves interactive, controllable generation conditioned on the current image and a given number of future robot actions. A comparison between the generated results and the original videos is shown below:
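A minimal sketch of what action-controllable generation looks like as an interface: the encoded current image plus a chunk of future robot actions yields future frames, which can then be compared with the encoded original clip frame by frame. `ToyActionConditionedGen` and the per-frame MSE comparison are illustrative assumptions, not the released model or evaluation code.

```python
import torch
import torch.nn as nn


class ToyActionConditionedGen(nn.Module):
    """Maps (current frame latent, K future actions) -> K future frame latents."""

    def __init__(self, frame_dim=128, act_dim=14, k=8):
        super().__init__()
        self.k, self.frame_dim = k, frame_dim
        self.net = nn.Linear(frame_dim + act_dim * k, frame_dim * k)

    def forward(self, frame, actions):
        x = torch.cat([frame, actions.flatten(1)], dim=-1)
        return self.net(x).view(-1, self.k, self.frame_dim)


gen = ToyActionConditionedGen()
current = torch.randn(1, 128)           # encoded current image
future_actions = torch.randn(1, 8, 14)  # the robot actions to be "played"
generated = gen(current, future_actions)

# Compare with the encoded original clip, frame by frame (lower = closer).
original = torch.randn(1, 8, 128)
per_frame_err = ((generated - original) ** 2).mean(dim=-1)
print(per_frame_err.squeeze(0))
```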
UnifoLM-WMA-0 Long-Term Interactive Generation: The model can also sustain interactive generation over long-horizon tasks. A comparison between the generated results and the original videos is shown below:
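A minimal sketch of how long-horizon generation can be chained from short action-conditioned segments, assuming the same kind of toy generator as above: the last generated frame of each segment seeds the next one. `ToySegmentGen`, `long_horizon_rollout`, and the segment length are hypothetical and only illustrate the autoregressive hand-off.

```python
import torch
import torch.nn as nn


class ToySegmentGen(nn.Module):
    """Same toy interface as above: (frame latent, K actions) -> K frame latents."""

    def __init__(self, frame_dim=128, act_dim=14, k=8):
        super().__init__()
        self.k, self.frame_dim = k, frame_dim
        self.net = nn.Linear(frame_dim + act_dim * k, frame_dim * k)

    def forward(self, frame, actions):
        x = torch.cat([frame, actions.flatten(1)], dim=-1)
        return self.net(x).view(-1, self.k, self.frame_dim)


@torch.no_grad()
def long_horizon_rollout(gen, first_frame, action_chunks):
    """Chain segments: the last generated frame of each segment seeds the next."""
    frame, segments = first_frame, []
    for chunk in action_chunks:        # each chunk: (1, K, act_dim)
        segment = gen(frame, chunk)    # (1, K, frame_dim)
        segments.append(segment)
        frame = segment[:, -1]         # autoregressive hand-off to the next segment
    return torch.cat(segments, dim=1)  # the full long-horizon rollout (latents)


gen = ToySegmentGen()
video = long_horizon_rollout(gen, torch.randn(1, 128),
                             [torch.randn(1, 8, 14) for _ in range(4)])
print(video.shape)  # (1, 32, 128): four chained segments of eight frames
```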