DreamZero-SO101 is a world-action model trained on 715 community episodes of the low-cost SO-101 teleoperator arm. Give it a single camera frame and a natural-language instruction, and it will imagine the robot completing the task — predicting both the future video and the joint commands in one pass.
Each video below was predicted autoregressively from one initial image plus the language instruction underneath it. No real robot was involved after frame 0. Every task is from the training distribution — the model executes the requested action against the actual scene in front of it.
Given one camera frame and a language instruction, the model outputs the robot's next 24 joint commands — everything needed to drive the arm for the next 0.8 seconds. This is the mode you would use on a real SO-101 arm.
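The chunk size above implies a control rate, and the arithmetic is easy to check. The sketch below assumes the 24 commands are evenly spaced across the 0.8-second window, which the text does not state explicitly; the function names are illustrative only.

```python
CHUNK_LEN = 24        # joint commands per prediction (from the text)
CHUNK_SECONDS = 0.8   # time covered by one chunk (from the text)


def control_rate_hz(chunk_len: int = CHUNK_LEN,
                    chunk_seconds: float = CHUNK_SECONDS) -> float:
    """Commands per second implied by one chunk (assumes even spacing)."""
    return chunk_len / chunk_seconds


def command_timestamps(chunk_len: int = CHUNK_LEN,
                       chunk_seconds: float = CHUNK_SECONDS) -> list:
    """Offset in seconds, from the start of the chunk, of each command."""
    dt = chunk_seconds / chunk_len
    return [i * dt for i in range(chunk_len)]


if __name__ == "__main__":
    print(f"control rate: about {control_rate_hz():.0f} Hz")
    print(f"command period: about {1000 * CHUNK_SECONDS / CHUNK_LEN:.1f} ms")
```

Under that assumption, one prediction covers roughly 30 commands per second, so a controller would need a fresh chunk before the current 0.8-second window runs out.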
Given only the very first frame, the model can autoregressively imagine the entire rest of the episode — up to 18 seconds of video + action, chaining its own predictions. Think of it as a learned simulator of the SO-101 arm.
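The chaining described above can be sketched as a simple loop. This is a minimal illustration, not the project's code: `predict_chunk` is a hypothetical stand-in for the model, and feeding back only the last predicted frame is an assumption about how the chaining works.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical signature: (frame, instruction) -> (predicted frames, predicted actions)
PredictChunk = Callable[[int, str], Tuple[List[int], List[float]]]


def rollout(first_frame: int, instruction: str, predict_chunk: PredictChunk,
            horizon_s: float = 18.0, chunk_s: float = 0.8):
    """Chain chunk predictions until the requested horizon is covered."""
    n_chunks = math.ceil(horizon_s / chunk_s)
    frame = first_frame
    frames, actions = [first_frame], []
    for _ in range(n_chunks):
        new_frames, new_actions = predict_chunk(frame, instruction)
        frames.extend(new_frames)
        actions.extend(new_actions)
        # Assumption: the next step is conditioned on the model's own last frame.
        frame = new_frames[-1]
    return frames, actions


def toy_predict_chunk(frame: int, instruction: str):
    """Toy stand-in: a 'frame' is a counter, and each chunk has 24 zero commands."""
    return [frame + 1], [0.0] * 24


if __name__ == "__main__":
    frames, actions = rollout(0, "pick up the red cube", toy_predict_chunk)
    print(len(frames) - 1, "chunks,", len(actions), "joint commands")
```

With 0.8-second chunks, covering 18 seconds takes 23 chained predictions, each one starting from output the model generated itself, so errors can accumulate over the rollout.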
The same scene with a different instruction produces a different rollout. Swap "red cube" for "blue cube" and the arm reaches for a different object — evidence the model is actually conditioning on language, not replaying training trajectories.
Every capability below comes from the same 217 MB adapter — no scene-specific fine-tuning. Click through to the interactive gallery for all 25 rollouts.
Reach, close the gripper, lift. Learned from 400+ training episodes of the same primitive.
Pick up a cube and place it inside a bowl — a two-phase trajectory with a precise release point.
Place one cube on top of another. Requires approach, grasp, lift, translate, align, release.
Start the model from a frame deep inside the task — descend, grasp, or lift phase — and it picks up where the trajectory left off.
Same scene, a language instruction the model never saw during training. The rollout changes accordingly.
Predict actions on a dataset the model was never trained on — raw generalization from the SO-101 pretraining mix.
The backbone is Wan2.1-I2V-14B-480P, a 14-billion-parameter open video diffusion model. We freeze the backbone and add a small low-rank adapter that learns to predict joint commands alongside video frames.
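To make the idea of a low-rank adapter on a frozen backbone concrete, here is a minimal pure-Python sketch of a single LoRA-style linear layer. The rank, scaling, initialization, and choice of adapted layers are assumptions for illustration; the text does not specify them, and this is not the project's implementation.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen: never updated during adapter training
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the adapted layer initially matches the frozen one.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    d = 8
    W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    layer = LoRALinear(W, rank=1)
    print("frozen params:", d * d, "trainable params:", layer.trainable_params())
```

Only `A` and `B` would receive gradients, which is why an adapter can be orders of magnitude smaller than the model it modifies.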
Aggregate six community datasets from the HuggingFace Hub — simple picks, pick-and-place, stacking, cloth folding, tabletop cleanup, Lego assembly. Convert everything to a shared multiview format.
Joint video and action prediction is trained via flow matching. Training took 127 hours on two H100 GPUs. The 217 MB adapter plugs into an unmodified Wan2.1 backbone, so anyone can download it and run inference.
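For readers unfamiliar with flow matching, the sketch below builds one training example under the common linear-path (rectified flow) formulation. Which variant the project uses is not stated, so treat the interpolation convention as an assumption. The single vector stands in for the concatenated video latents and action chunk, which is likewise an assumption.

```python
import random


def flow_matching_pair(x_data, rng):
    """Sample noise and a time t, then return the noisy point and target velocity.

    Linear path: x_t = (1 - t) * noise + t * x_data, with target velocity
    v = x_data - noise (one common convention, assumed here).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x_data]
    t = rng.random()
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, x_data)]
    v_target = [x - n for n, x in zip(noise, x_data)]
    return x_t, t, v_target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sample standing in for [video latents..., action values...].
    x = [0.5, -1.0, 2.0, 0.0]
    x_t, t, v = flow_matching_pair(x, rng)
    # A real model would predict v from (x_t, t, conditioning); zeros are a placeholder.
    print("loss for a zero prediction:", mse([0.0] * len(v), v))
```

In this formulation the network regresses the velocity, and stepping from `x_t` along the true velocity for the remaining time `1 - t` lands exactly on the data sample.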
Weights on HuggingFace, code on GitHub, reproducible training recipe, and an invitation for the SO-101 community to contribute data and help scale the next version.
Pick a scene, swap between training and novel prompts, and watch the model's imagined rollout from three camera viewpoints. 25 rollouts ready to browse.
demo.html →

Research paper
Conference-style writeup covering the architecture, data curation, training recipe, evaluation protocols, and a discussion of current limitations and future work.
paper.html →

Detailed experiments
Full experimental log: sanity tests, single-chunk prediction accuracy, zero-shot generalization, and autoregressive rollout analysis across all six capability categories.
results.html →

Reproduce
Install instructions, the SO-101 patch to the DreamZero codebase, the data conversion pipeline, the LoRA training recipe, and all evaluation scripts.
code.html →

Community
Record your own SO-101 scenes, format them as LeRobot datasets, and submit via the HuggingFace Hub. Vizuara retrains and re-releases as new DreamZero-SO101 checkpoints.
contribute.html →

Weights
Download the 217 MB adapter. Apply it on top of an unmodified Wan2.1-I2V-14B-480P checkpoint. Ready to run offline inference on a single H100 or RTX 4090.
huggingface.co/Vizuara →