DreamZero-SO101: An Open World-Action Model for the Low-Cost SO-101 Teleoperator Arm

Vizuara Research Team
Vizuara AI · vizuara.ai · 2026

Abstract

We present DreamZero-SO101, the first open-source world-action model for the low-cost SO-101 teleoperator arm. Built by fine-tuning DreamZero (Wan2.1-I2V-14B) on 715 community-contributed episodes via LoRA (rank 4, 108M parameters), our model jointly predicts imagined video futures and 24-step action chunks in a single forward pass using flow matching. Given one camera frame, a 6-DOF joint state, and a language instruction, DreamZero-SO101 can (1) operate as a closed-loop robot policy achieving 1.6–2.3° RMSE on held-out training episodes and 11.9° mean RMSE on a zero-shot dataset, and (2) run in DreamGen mode, autoregressively imagining 60 consecutive video+action chunks (18 seconds) from a single initial observation without any real feedback. All six DreamGen rollouts visually complete their intended manipulation task inside the model's imagination. We release the 217 MB LoRA adapter under Apache 2.0, along with the data pipeline, training recipe, and evaluation code.

1. Introduction

World models — learned simulators of environment dynamics — have become increasingly attractive as a substrate for robot policy learning. By imagining future states, a world model can augment limited real-world data, enable model-based planning, and support data-generation pipelines for downstream training. Critically, when the world model also predicts actions (a world-action model), it can serve directly as a robot policy without any separate policy distillation step.

Recent work such as DreamGen [1] has demonstrated that large video generation models (LVMs) fine-tuned on robot data can jointly predict video futures and action trajectories, exhibiting compelling generalisation and in-context learning. However, these results have been demonstrated mainly on high-cost research platforms (Unitree G1, Boston Dynamics Spot, bi-manual ALOHA). The SO-101 is a low-cost (~$100) 6-DOF desktop teleoperator arm with a growing open-source community on HuggingFace, but no world-action model has been built for it — until now.

We make the following contributions:

Figure 1 — DreamZero-SO101 architecture
Figure 1. Architecture of DreamZero-SO101. A single camera frame is encoded by WanVAE into video latent tokens; the language instruction is processed by UMT5-XXL; the 6-DOF joint state is embedded by the ActionEncoder MLP. All three streams enter the 14B Wan2.1 DiT, where blockwise causal attention allows video tokens to attend to action tokens. The DiT predicts the flow-matching velocity field for both video and action simultaneously. At inference, 4 Euler steps yield a predicted video chunk + action chunk in ~600 ms on a single H100.

2. Related Work

World models for robot learning

World models have a long history in model-based RL [2]. Recent approaches leverage large-scale video pretraining to produce rich visual predictions: UniSim [3] learns universal simulators of real-world interactions; NVIDIA Cosmos [4] pre-trains on 20M hours of video to provide a foundation for world model fine-tuning; 1X WMLab [5] demonstrates that a world model trained solely on robot data can serve as a data engine for policy improvement.

Joint video and action prediction

DreamGen [1] is the closest prior work: it fine-tunes LVMs (Wan2.1, Stable Video Diffusion) on robot demonstrations to jointly predict video and actions via in-context rollout. GR00T N1 [6] unifies visual foundation models with robot policies but uses a separate policy head. Our work is distinct in targeting the low-cost SO-101 community and in providing a fully open pipeline from data collection to evaluation.

Video generation backbones

Wan2.1 [7] is an open-source 14B video diffusion transformer that achieves state-of-the-art on text-to-video and image-to-video benchmarks. Its open weights and architecture make it a natural backbone for robot world model fine-tuning. DreamZero [8] extends Wan2.1-I2V with blockwise causal attention to inject action tokens and flow-matching action denoising.

Open-source robot learning datasets

LeRobot [9] provides standardised dataset formats, simulation environments, and training recipes for low-cost robot arms. The SO-101 community on HuggingFace Hub has contributed dozens of manipulation datasets across picking, placing, sorting, folding, and assembly tasks. We aggregate the six largest into a single 715-episode training set.

3. Method

3.1 Architecture

DreamZero-SO101 is built directly on the DreamZero codebase [8] with minimal modifications. The core architecture consists of:

The key coupling mechanism is blockwise causal attention: within each DiT layer, the attention mask is structured so that video tokens can attend to action tokens (acquiring motion intent), but action tokens cannot attend back to video tokens. This keeps the action prediction from being contaminated by the noisy video tokens while allowing the video stream to condition on the action intent signal.
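The asymmetric masking rule described above can be sketched in a few lines. This is a minimal illustration, not the DreamZero implementation: the token layout and the function name `modality_attention_mask` are hypothetical, same-modality attention is assumed to be fully open, and the temporal (blockwise causal) ordering across chunks is left out.

```python
from typing import List


def modality_attention_mask(kinds: List[str]) -> List[List[bool]]:
    """Return mask[q][k] = True when query token q may attend to key token k.

    Rule taken from the text: video queries may see action keys, while
    action queries never see video keys. Leaving same-modality attention
    fully open is an assumption of this sketch.
    """
    mask = []
    for q_kind in kinds:
        row = []
        for k_kind in kinds:
            # Only the action -> video direction is blocked.
            blocked = q_kind == "action" and k_kind == "video"
            row.append(not blocked)
        mask.append(row)
    return mask


# Hypothetical layout: 3 video tokens followed by 2 action tokens.
kinds = ["video"] * 3 + ["action"] * 2
mask = modality_attention_mask(kinds)
for kind, row in zip(kinds, mask):
    print(f"{kind:>6}", "".join("1" if allowed else "." for allowed in row))
```

In a real attention layer the boolean mask would be converted to additive `-inf` entries (or passed to the framework's masked-attention call) before the softmax.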

3.2 LoRA Adaptation

We fine-tune only a small fraction of the backbone parameters using Low-Rank Adaptation (LoRA) [10] with rank r=4 applied to the Q, K, V, O, and both FFN matrices of all 40 DiT layers. This adds approximately 108M trainable parameters to the 14B backbone. The action encoder and any new linear projection weights are trained from scratch as part of the same optimizer. The total trainable parameter count is ~158M; the total released checkpoint (LoRA matrices + action heads) is 217 MB.
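As a reminder of what a rank-r adapter does to a single weight matrix, here is a minimal pure-Python sketch of the LoRA update from [10]. The matrix sizes are toy values rather than the backbone's real dimensions, and the scaling factor `alpha` is an assumption, since it is not stated here.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one d_out x d_in weight.

    A has shape rank x d_in and B has shape d_out x rank.
    """
    return rank * (d_in + d_out)


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float],
                 alpha: float, rank: int) -> List[float]:
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


random.seed(0)
d_in, d_out, rank = 8, 6, 4  # toy sizes, not the backbone's real dimensions
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # B starts at zero: adapter is a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(w, a, b, x, alpha=4.0, rank=rank)  # alpha is hypothetical
print(lora_param_count(d_in, d_out, rank))
```

Because B is initialised to zero, the adapted layer reproduces the frozen layer exactly at the start of fine-tuning.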

3.3 Data Curation

We aggregated 715 episodes from six publicly available SO-101 datasets on HuggingFace Hub:

Dataset                                   Episodes  Tasks  Cameras
whosricky/so101-megamix-v1                400       8      3
lipsop/so101-block-in-bin-100ep           100       1      2
youliangtan/so101-table-cleanup           80        4      2
G3ND3K/so101_picking_up_green_lego_big    60        1      2
lerobot/svla_so101_pickplace              50        1      2
observabot/so101_cloth_folding            125       1      3

All datasets share a 6-DOF action space (shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper), 30 FPS video, and LeRobot v2+ format. Conversion to DreamZero's GEAR format aligns camera frames, joint states, and action chunks into triplets. Datasets with fewer than 3 camera streams are padded with black frames; the model conditions on whichever cameras are available via camera-aware masking in the cross-attention.
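The padding step can be pictured as follows. This is a sketch under stated assumptions: the slot names (`front`, `wrist`, `side`), the helper names, and the nested-list frame representation are all hypothetical, and the actual converter may organise cameras differently.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]  # H x W grid of RGB pixels


def solid_frame(height: int, width: int, value: int = 0) -> Frame:
    """Build a uniformly coloured frame; value=0 gives a black frame."""
    return [[(value, value, value) for _ in range(width)] for _ in range(height)]


def pad_cameras(frames: Dict[str, Frame], slots: List[str],
                height: int, width: int) -> Tuple[List[Frame], List[bool]]:
    """Return one frame per camera slot plus a mask of slots holding real data."""
    padded, available = [], []
    for name in slots:
        if name in frames:
            padded.append(frames[name])
            available.append(True)
        else:
            padded.append(solid_frame(height, width))  # black placeholder
            available.append(False)
    return padded, available


# Hypothetical slot names; a two-camera dataset leaves the third slot empty.
slots = ["front", "wrist", "side"]
sample = {
    "front": solid_frame(176, 320, value=128),
    "wrist": solid_frame(176, 320, value=64),
}
padded, available = pad_cameras(sample, slots, height=176, width=320)
print(available)
```

The `available` list plays the role of the camera-aware mask: downstream attention can ignore any slot marked `False`.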

Figure 2 — Training pipeline
Figure 2. End-to-end training pipeline. HuggingFace SO-101 datasets are aggregated and converted to GEAR format via enumerate_so101.py + download_and_convert.py. LoRA fine-tuning runs on 2× H100 for ~127 hours (72K steps), after which both the action_loss and dynamics_loss converge. The resulting 217 MB LoRA adapter is released to Vizuara/dreamzero-so101-lora.

3.4 Training Objective

We use DreamZero's flow-matching training objective unchanged. Both video and action are denoised in a shared noise schedule: at each training step, a noise level σ is sampled uniformly, noised versions of the video latent and action sequence are constructed, and the DiT is trained to predict the velocity field v = x_0 − x_σ for both modalities jointly. The total loss is:

L = λ_video · MSE(v̂_video, v_video) + λ_action · MSE(v̂_action, v_action)

with λ_video = 1.0 and λ_action = 1.0. Over 72K steps, the action loss falls from 0.42 to 0.0015 and the dynamics (video) loss from 0.176 to 0.0298.
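A minimal sketch of one evaluation of this joint objective is given below. It assumes a linear noising path x_σ = (1 − σ)·x_0 + σ·ε with ε drawn from a standard normal, which is not stated in the text; the target velocity is computed exactly as written above (v = x_0 − x_σ). The `model` callable and the flattened toy inputs are hypothetical stand-ins for the DiT and its latents.

```python
import random
from typing import Callable, List, Tuple

Vec = List[float]


def mse(pred: Vec, target: Vec) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def noise_sample(x0: Vec, sigma: float, rng: random.Random) -> Vec:
    """Assumed linear path: x_sigma = (1 - sigma) * x0 + sigma * eps."""
    return [(1.0 - sigma) * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


def joint_flow_loss(model: Callable[[Vec, Vec, float], Tuple[Vec, Vec]],
                    video_x0: Vec, action_x0: Vec, rng: random.Random,
                    lam_video: float = 1.0, lam_action: float = 1.0) -> float:
    sigma = rng.random()  # one noise level shared by both modalities
    video_xs = noise_sample(video_x0, sigma, rng)
    action_xs = noise_sample(action_x0, sigma, rng)
    # Target as written in the text: v = x_0 - x_sigma.
    v_video = [a - b for a, b in zip(video_x0, video_xs)]
    v_action = [a - b for a, b in zip(action_x0, action_xs)]
    pred_video, pred_action = model(video_xs, action_xs, sigma)
    return (lam_video * mse(pred_video, v_video)
            + lam_action * mse(pred_action, v_action))


def zero_model(video_xs: Vec, action_xs: Vec, sigma: float) -> Tuple[Vec, Vec]:
    """Untrained stand-in that always predicts zero velocity."""
    return [0.0] * len(video_xs), [0.0] * len(action_xs)


rng = random.Random(0)
video_latent = [0.5, -0.2, 0.1, 0.9]  # toy flattened video latent
action_chunk = [0.0] * 24             # toy 24-step chunk for a single joint
loss = joint_flow_loss(zero_model, video_latent, action_chunk, rng)
print(loss)
```

Sharing a single σ across both modalities is what ties the video and action denoising trajectories together during training.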

4. Experimental Setup

4.1 Hardware and Training Duration

All training was conducted on 2× NVIDIA H100 80GB SXM5 GPUs provided by RunPod. Training ran for approximately 127 hours to reach 72K gradient steps at a throughput of ~1.4 it/s. Optimizer: AdamW with learning rate 1×10⁻⁴, cosine schedule with 100-step warmup. Batch size 1 per GPU, gradient accumulation 4, DeepSpeed ZeRO-2. Resolution: 320×176 pixels, 33 video frames per sample.
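The optimiser settings above translate into a schedule along these lines. This sketch assumes linear warmup from zero, cosine decay to zero at step 72K, and plain data parallelism across the two GPUs; none of those three details is confirmed by the text.

```python
import math

BASE_LR = 1e-4
WARMUP_STEPS = 100
TOTAL_STEPS = 72_000  # assumed to be the schedule horizon
NUM_GPUS, PER_GPU_BATCH, GRAD_ACCUM = 2, 1, 4


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to zero (assumed end value)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Samples contributing to each optimizer update, assuming data parallelism.
effective_batch = NUM_GPUS * PER_GPU_BATCH * GRAD_ACCUM
print("effective batch:", effective_batch)
for step in (0, 50, 100, 36_050, 72_000):
    print(step, f"{learning_rate(step):.2e}")
```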

4.2 Evaluation Protocols

We evaluate DreamZero-SO101 under three protocols of increasing difficulty (see Figure 3):

Figure 3 — Evaluation framework
Figure 3. Three evaluation protocols in increasing difficulty. Top: single-chunk policy mode (one real frame → predicted actions → compare to GT). Middle: zero-shot generalisation on an unseen dataset. Bottom: DreamGen autoregressive rollout — a 60-chunk feedback loop where each chunk's imagined video is fed back as input to the next.
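The DreamGen feedback loop in the bottom panel can be sketched as below. The `predict_chunk` interface is hypothetical, the toy model is a stand-in for the real network, and feeding the last predicted action back as the next joint state assumes actions are absolute joint targets, which the text does not confirm.

```python
from typing import Callable, List, Tuple

Frame = List[float]       # stand-in for an image; a real frame is an H x W x 3 array
JointState = List[float]  # 6 joint angles in degrees
Chunk = Tuple[List[Frame], List[JointState]]  # (imagined frames, actions)

NUM_CHUNKS = 60
ACTIONS_PER_CHUNK = 24


def dreamgen_rollout(predict_chunk: Callable[[Frame, JointState, str], Chunk],
                     frame: Frame, state: JointState, instruction: str,
                     num_chunks: int = NUM_CHUNKS) -> Chunk:
    """Roll the model forward on its own predictions, with no real feedback."""
    frames: List[Frame] = []
    actions: List[JointState] = []
    for _ in range(num_chunks):
        chunk_frames, chunk_actions = predict_chunk(frame, state, instruction)
        frames.extend(chunk_frames)
        actions.extend(chunk_actions)
        frame = chunk_frames[-1]   # imagined frame becomes the next observation
        state = chunk_actions[-1]  # assumption: actions are absolute joint targets
    return frames, actions


def dummy_predict_chunk(frame: Frame, state: JointState, instruction: str) -> Chunk:
    """Toy stand-in: nudges every joint by 0.1 degrees per action step."""
    chunk_actions = [[q + 0.1 * (k + 1) for q in state]
                     for k in range(ACTIONS_PER_CHUNK)]
    chunk_frames = [[frame[0] + 1.0]]  # one fake 'frame' per chunk
    return chunk_frames, chunk_actions


frames, actions = dreamgen_rollout(
    dummy_predict_chunk, [0.0], [0.0] * 6, "pick up the cube")
print(len(frames), len(actions))
```

Because every chunk is conditioned only on the previous chunk's output, any prediction error is carried forward, which is why drift is tracked separately for this mode.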

4.3 Drift Metrics

For DreamGen rollouts we report two complementary drift measures:

Best-match drift is the fairer measure for autoregressive rollouts because (a) imagined rollouts are not guaranteed to run at real-time speed, and (b) ground-truth episodes typically end at 10–19 s while imagined rollouts always run for 18 s, so any episode shorter than 18 s incurs an unavoidable time-alignment penalty after it ends.
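To make the contrast between the two measures concrete, the sketch below uses one plausible pair of definitions. These are assumptions made for illustration only: the Euclidean joint-space distance, the averaging over steps, and the choice to hold the final real pose after the episode ends are not taken from the evaluation code.

```python
import math
from typing import List

Pose = List[float]  # joint angles in degrees


def pose_distance(a: Pose, b: Pose) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def time_aligned_drift(imagined: List[Pose], real: List[Pose]) -> float:
    """Compare step t with step t; past the real episode's end, hold its last pose."""
    total = 0.0
    for t, pose in enumerate(imagined):
        ref = real[min(t, len(real) - 1)]
        total += pose_distance(pose, ref)
    return total / len(imagined)


def best_match_drift(imagined: List[Pose], real: List[Pose]) -> float:
    """Compare each imagined pose with the closest real pose, ignoring timing."""
    total = sum(min(pose_distance(p, r) for r in real) for p in imagined)
    return total / len(imagined)


# Toy single-joint example: the imagined rollout follows the same path as the
# real episode but at half speed, so it takes twice as many steps.
real = [[float(i)] for i in range(11)]
imagined = [[0.5 * i] for i in range(21)]

aligned = time_aligned_drift(imagined, real)
best = best_match_drift(imagined, real)
print(f"time-aligned: {aligned:.3f}  best-match: {best:.3f}")
```

Under these definitions the best-match value can never exceed the time-aligned value, since the time-aligned reference pose is always one of the candidates considered by the best-match search.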

5. Results

5.1 Single-Chunk Policy Mode

Table 1 summarises the single-chunk RMSE across all evaluation frames. On the training distribution the model is near-exact (0.57° RMSE at the canonical starting frame). On held-out training episodes it achieves 1.6–2.3° at frame 0, well within the regime that enables closed-loop physical control on the SO-101, although error rises to 8.4–12.0° at mid-episode frames (grasp and descend). On the zero-shot dataset, mean RMSE rises to 11.9° with high variance (best 1.57°, worst 26.7°), indicating that the model captures the overall arm structure but loses precise velocity planning on unseen camera rigs and object layouts.

Evaluation frame             Protocol                 RMSE (°)  Status
ep0 · frame 0                Training distribution    0.57      deployable
ep100 · frame 0              Held-out training        1.60      deployable
ep50 · frame 0               Held-out training        2.29      deployable
ep100 · frame 150 (grasp)    Mid-episode, held-out    8.40      marginal
ep50 · frame 80 (descend)    Mid-episode, held-out    11.97     marginal
zeroshot · f254 (best)       Zero-shot                1.57      deployable
zeroshot · mean              Zero-shot                11.9      marginal
zeroshot · f509 (worst)      Zero-shot                26.7      too large

Table 1. Single-chunk action RMSE across evaluation protocols. "Deployable" = under ~5°, "marginal" = 5–20°, "too large" = over 20°.
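The status labels follow directly from the thresholds in the caption. A small sketch, assuming the RMSE is pooled over all 24 steps and all 6 joints of a chunk (the pooling is an assumption) and using hypothetical helper names:

```python
import math
from typing import List


def chunk_rmse(pred: List[List[float]], target: List[List[float]]) -> float:
    """RMSE in degrees, pooled over every step and joint of an action chunk."""
    squared = [(p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)]
    return math.sqrt(sum(squared) / len(squared))


def status(rmse_deg: float) -> str:
    """Thresholds from the Table 1 caption: under ~5, 5 to 20, over 20 degrees."""
    if rmse_deg < 5.0:
        return "deployable"
    if rmse_deg <= 20.0:
        return "marginal"
    return "too large"


# Toy chunk: 24 steps x 6 joints, every prediction off by 1.5 degrees.
target = [[0.0] * 6 for _ in range(24)]
pred = [[1.5] * 6 for _ in range(24)]
rmse = chunk_rmse(pred, target)
print(f"{rmse:.2f}", status(rmse))

for value in (0.57, 11.9, 26.7):
    print(value, status(value))
```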

5.2 DreamGen Autoregressive Rollouts

Table 2 shows the drift metrics for all six DreamGen rollouts. The best-match drift ranges from 71° (ep206_train) to 94° (ep000_train); the time-aligned drift is 3–35° higher in every rollout, which we attribute to speed mismatch and post-episode continuation. Visually, all six rollouts complete the intended task (pick, pick-and-place, or cube stacking) inside the imagined video stream before the degradation phase begins.

Rollout  Prompt kind  Best-match (°)  Time-aligned (°)  Real ep. end (s)
ep000    training     93.7            96.8              10.9
ep000    novel        90.1            97.3              10.9
ep245    training     74.0            109.0             17.9
ep245    novel        85.3            111.1             17.9
ep206    training     71.0            90.1              19.1
ep206    novel        83.4            102.5             19.1

Table 2. DreamGen autoregressive rollout drift metrics. Imagined rollouts always run 18 s; real episodes shorter than this accumulate extra time-aligned penalty after episode end.

Figure 4 — Autoregressive loop and failure mode
Figure 4. The DreamGen autoregressive loop (left) and the post-task degradation failure mode (right). After the real episode ends (marked in red), the model continues imagining but without ground-truth supervision, producing increasingly incoherent motion. Adding a learned task-completion token is a direct fix.

5.3 Drift Analysis

The 3–35° gap between best-match and time-aligned drift is not model failure; it is a measurement artefact. When the real episode is 10.9 s and the imagined rollout continues to 18 s, the time-aligned metric compares the later imagined chunks (roughly chunks 37–60, at 0.3 s per chunk) against GT frames that do not exist. The model is penalised for imagining anything at all. The best-match metric corrects for this: within the real episode window, the best-match drift for ep206_train is 71°, comparable to what DreamGen reports for its multi-embodiment baseline at 1K training steps.

Figure 5 — Results summary
Figure 5. Results summary. Left: single-chunk RMSE for all evaluation frames. Right: best-match drift (teal) vs time-aligned drift (orange) for the six DreamGen rollouts. The gap between the two measures reflects speed mismatch and post-episode continuation, not in-episode prediction error.

6. Discussion & Limitations

What works

The model has clearly learned SO-101 arm kinematics from 715 demonstrations. Single-chunk policy mode is in the deployable range at episode-start frames (0.57–2.3°). DreamGen mode produces visually coherent manipulation sequences that complete the intended task. The 108M LoRA adapter is plug-and-play on top of an unmodified Wan2.1-I2V-14B-480P checkpoint.

Current failure modes

Path to physical deployment

DreamZero-SO101 is not yet suitable for physical closed-loop control with the autoregressive rollout mode. However, single-chunk policy mode at 2.3° RMSE on held-out episodes is close to deployable. The 7.6 s/chunk inference time on H100 (at 4 Euler steps) needs to drop by ~100× for real-time 30 Hz control — achievable via distillation, consistency training, or dedicated action-head-only inference.

7. Open-Source Release & Community Contribution

All artefacts are released under Apache 2.0:

We invite SO-101 owners to contribute episodes. Recording 30+ episodes and uploading to HuggingFace Hub automatically feeds into the next DreamZero-SO101 training run. The target is a community-driven flywheel: more episodes → better model → attracts more contributors (Figure 6).

Figure 6 — Community contribution flow
Figure 6. Community contribution flow. SO-101 owners record episodes, format as LeRobot, upload to HuggingFace, and submit a PR to the manifest. Vizuara retrains from the latest checkpoint and releases the new model. Each release attracts more contributors, forming a data flywheel.

Our roadmap includes: (1) a 150K-step full fine-tune of all 14B backbone parameters once the training dataset exceeds 2,000 episodes; (2) support for additional low-cost arms (Koch v1.1, Moss, Lekiwi); (3) a live inference endpoint for submitting custom rollout requests; (4) a HuggingFace dataset partnership to streamline contributor onboarding.

8. Acknowledgments

We thank the DreamZero team at GEAR Lab for open-sourcing the codebase and pre-trained action heads under Apache 2.0. We thank the Wan2.1 team at Alibaba for releasing the video backbone. We thank all SO-101 dataset contributors on HuggingFace Hub — without their recorded episodes this work would not exist. Compute was provided by RunPod.

9. References

  1. GEAR Lab. DreamGen: Unleashing the Generative Power of Large Video Models for Robot Manipulation. arXiv 2025. github.com/dreamzero0/dreamzero
  2. Ha, D. & Schmidhuber, J. World Models. NeurIPS 2018. arXiv:1803.10122.
  3. Yang, M. et al. Learning Interactive Real-World Simulators. arXiv:2310.06114.
  4. NVIDIA. Cosmos World Foundation Model. arXiv:2501.03575, 2025.
  5. 1X Technologies. 1X World Model Challenge. 2024. github.com/1x-technologies/1xgpt
  6. NVIDIA. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv:2503.14734, 2025.
  7. Wan-AI Team. Wan: Open and Advanced Large-Scale Video Generative Models. arXiv:2503.20314, 2025.
  8. GEAR Lab. DreamZero codebase. Apache 2.0. github.com/dreamzero0/dreamzero
  9. HuggingFace. LeRobot: State-of-the-Art Machine Learning for Real-World Robotics. github.com/huggingface/lerobot
  10. Hu, E.J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.
  11. Ho, J. et al. Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239.
  12. Lipman, Y. et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747.
  13. Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
  14. Chi, C. et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137.
  15. Zhao, T.Z. et al. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS 2023.