Vizuara Robotics · Apache 2.0 · Released April 2026

The first open world model
for the SO-101 robot arm.

DreamZero-SO101 is a world-action model trained on 715 community-recorded episodes of the low-cost, teleoperated SO-101 arm. Give it a single camera frame and a natural-language instruction, and it will imagine the robot completing the task, predicting both the future video and the joint commands in one pass.

License: Apache 2.0
Training episodes: 715
Training tasks: 7
Community datasets: 6
Camera views: 3
Adapter size: 217 MB

What it does

Three things DreamZero-SO101 can do

01 — Predict

Plan the next 0.8 seconds

Given one camera frame and a language instruction, the model outputs the robot's next 24 joint commands — everything needed to drive the arm for the next 0.8 seconds. This is the mode you would use on a real SO-101 arm.
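As a quick check on those numbers: 24 commands spread over 0.8 seconds implies a control rate of about 30 commands per second. The sketch below works through that arithmetic and shows one way a predicted chunk might be paired with send times. The six-value action width (`NUM_JOINTS`) and the `send_to_arm` callback are assumptions made for illustration; they are not taken from the released code.

```python
CHUNK_LEN = 24        # joint commands per prediction (from the text)
CHUNK_SECONDS = 0.8   # time one chunk covers (from the text)
NUM_JOINTS = 6        # assumption: five arm joints plus the gripper


def control_rate_hz(chunk_len=CHUNK_LEN, chunk_seconds=CHUNK_SECONDS):
    """Commands per second implied by the chunk length and duration."""
    return chunk_len / chunk_seconds


def command_offsets(chunk_len=CHUNK_LEN, chunk_seconds=CHUNK_SECONDS):
    """Time offset (seconds) at which each command in a chunk would be sent."""
    dt = chunk_seconds / chunk_len
    return [i * dt for i in range(chunk_len)]


def play_chunk(chunk, send_to_arm):
    """Pair every command with its offset and hand it to a user-supplied sender.

    `send_to_arm(offset_s, command)` is a hypothetical callback; a real driver
    would also need to wait until each offset before sending.
    """
    for offset, command in zip(command_offsets(len(chunk)), chunk):
        send_to_arm(offset, command)


if __name__ == "__main__":
    chunk = [[0.0] * NUM_JOINTS for _ in range(CHUNK_LEN)]  # placeholder actions
    sent = []
    play_chunk(chunk, lambda t, cmd: sent.append((round(t, 4), cmd)))
    print(round(control_rate_hz(), 3), "Hz;", len(sent), "commands")
```

On real hardware the model would be queried again before the chunk runs out, so the arm never waits on inference.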

02 — Imagine

Simulate a full task

Given only the very first frame, the model can autoregressively imagine the rest of the episode, up to 18 seconds of video and actions, by chaining its own predictions. Think of it as a learned simulator of the SO-101 arm.
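The chaining described here can be sketched as a simple loop: predict one chunk, keep its frames and actions, then feed the last predicted frame back in as the next starting point. This is a minimal illustration, not the project's rollout code. `predict_chunk` is a hypothetical stand-in for the model, and conditioning on only the last predicted frame is an assumption. At 0.8 seconds per chunk, covering 18 seconds takes 23 chunks.

```python
import math

CHUNK_SECONDS = 0.8   # time one chunk covers (from the text)
MAX_SECONDS = 18.0    # longest rollout mentioned in the text


def chunks_needed(total_seconds=MAX_SECONDS, chunk_seconds=CHUNK_SECONDS):
    """Number of chunks required to cover `total_seconds`."""
    # Round before ceil so floating-point residue cannot add a spurious chunk.
    return math.ceil(round(total_seconds / chunk_seconds, 9))


def rollout(first_frame, instruction, predict_chunk, num_chunks):
    """Autoregressive rollout: each chunk starts from the previous chunk's last frame."""
    frames, actions = [first_frame], []
    context = first_frame
    for _ in range(num_chunks):
        new_frames, new_actions = predict_chunk(context, instruction)
        frames.extend(new_frames)
        actions.extend(new_actions)
        context = new_frames[-1]  # the model's own prediction becomes the next input
    return frames, actions


def dummy_predict_chunk(frame, instruction):
    """Hypothetical placeholder model: frames are integers, actions are zeros."""
    return [frame + 1, frame + 2], [[0.0] * 6 for _ in range(24)]


if __name__ == "__main__":
    n = chunks_needed()
    frames, actions = rollout(0, "pick up the red cube", dummy_predict_chunk, n)
    print(n, "chunks,", len(actions), "actions")
```

Because every chunk builds on predicted rather than observed frames, errors can compound over a long rollout, which is why the length of a stable rollout is worth measuring.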

03 — Follow instructions

Respond to language

The same scene with a different instruction produces a different rollout. Swap "red cube" for "blue cube" and the arm reaches for a different object — evidence the model is actually conditioning on language, not replaying training trajectories.

Capabilities on display

Six behaviors, one model

Every capability below comes from the same 217 MB adapter — no scene-specific fine-tuning. Click through to the interactive gallery for all 25 rollouts.

01

Pick an object

Reach, close the gripper, lift. Learned from 400+ training episodes of the same primitive.

02

Pick and place

Pick up a cube and place it inside a bowl — a two-phase trajectory with a precise release point.

03

Stack objects

Place one cube on top of another. Requires approach, grasp, lift, translate, align, release.

04

Resume mid-episode

Start the model from a frame deep inside the task — descend, grasp, or lift phase — and it picks up where the trajectory left off.

05

Follow novel instructions

Same scene, a language instruction the model never saw during training. The rollout changes accordingly.

06

Generalize zero-shot

Predict actions on a dataset the model was never trained on — raw generalization from the SO-101 pretraining mix.

How it was trained

From pretrained video model to SO-101 policy

1

Start from Wan2.1-I2V-14B

A 14-billion-parameter open video diffusion model. We freeze the backbone and add a small low-rank adapter that learns to predict joint commands alongside video frames.
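To make the idea of a low-rank adapter on a frozen backbone concrete, here is a minimal sketch in plain Python. It shows the standard LoRA formulation, where a frozen weight matrix W is augmented with a trainable product B·A of much lower rank. The class name, rank, scaling, and initialization are illustrative assumptions and do not describe the actual DreamZero-SO101 configuration.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen backbone weight, never updated
        # A starts as small random values, B as zeros, so the adapter is a no-op at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Only A and B are trained: rank * (d_in + d_out) values."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    layer = LoRALinear(W, rank=2)
    print(layer.forward([1.0, 2.0, 3.0, 4.0]), layer.trainable_params())
```

The trainable parameter count grows with rank × (d_in + d_out) rather than d_in × d_out, which is how an adapter can stay in the hundreds of megabytes while the backbone has billions of parameters.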

2

Curate 715 SO-101 episodes

Aggregate six community datasets from the HuggingFace Hub — simple picks, pick-and-place, stacking, cloth folding, tabletop cleanup, Lego assembly. Convert everything to a shared multiview format.
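One way to picture the conversion step is as a mapping from each dataset's own camera keys onto a fixed set of shared view names. The sketch below is purely illustrative: the field names, the view names in `CAMERA_VIEWS`, and the choice to leave missing views as `None` are all assumptions, not the project's actual schema or the LeRobot API.

```python
from dataclasses import dataclass, field

# Hypothetical shared view names; the source only states that there are three views.
CAMERA_VIEWS = ("view_0", "view_1", "view_2")


@dataclass
class Step:
    images: dict   # shared view name -> image (any array-like, or None if missing)
    action: list   # joint command recorded at this step


@dataclass
class Episode:
    instruction: str
    source_dataset: str
    steps: list = field(default_factory=list)


def to_shared_format(raw_steps, instruction, source_dataset, view_map):
    """Convert raw steps to the shared layout.

    `view_map` maps a dataset-specific camera key to one of CAMERA_VIEWS.
    """
    episode = Episode(instruction, source_dataset)
    for raw in raw_steps:
        images = {shared: raw["images"].get(src) for src, shared in view_map.items()}
        for view in CAMERA_VIEWS:
            images.setdefault(view, None)  # assumption: absent views are kept as None
        episode.steps.append(Step(images=images, action=list(raw["action"])))
    return episode


if __name__ == "__main__":
    raw = [{"images": {"wrist": "img_w", "top": "img_t"}, "action": [0.0] * 6}]
    ep = to_shared_format(raw, "pick up the cube", "example_dataset",
                          {"wrist": "view_0", "top": "view_1"})
    print(ep.steps[0].images)
```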

3

Train the adapter for 72,000 steps

Joint video and action prediction via flow matching. 127 hours on two H100 GPUs. The 217 MB adapter plugs into an unmodified Wan2.1 backbone — anyone can download it and run inference.
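For readers new to flow matching, the sketch below shows the training target in its common linear-path form: a sample is interpolated between noise and data at a random time t, and the network is trained to predict the constant velocity pointing from noise to data. This is a generic illustration under those assumptions. The noise/data convention, the loss weighting, and the way video latents and actions are combined in DreamZero-SO101 are not specified here.

```python
import random


def flow_matching_pair(x_data, t, rng):
    """Build one training pair for linear-path flow matching.

    x_t      = (1 - t) * noise + t * data   (point on the straight path)
    target_v = data - noise                 (velocity the network should predict)
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x_data]
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, x_data)]
    target_v = [d - n for n, d in zip(noise, x_data)]
    return x_t, target_v


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [0.5, -1.0, 2.0]  # stands in for a flattened video-plus-action sample
    t = rng.random()
    x_t, target_v = flow_matching_pair(data, t, rng)
    zero_prediction = [0.0] * len(data)  # what an untrained network might output
    print("loss for a zero prediction:", mse(zero_prediction, target_v))
```

A useful property of the linear path is that x_t + (1 - t) · target_v recovers the data exactly, so a well-trained velocity predictor can be integrated from noise to a sample in a small number of steps.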

4

Release under Apache 2.0

Weights on HuggingFace, code on GitHub, reproducible training recipe, and an invitation for the SO-101 community to contribute data and help scale the next version.

Where to go next

Explore the project

Interactive gallery

Try the demo

Pick a scene, swap between training and novel prompts, and watch the model's imagined rollout from three camera viewpoints. 25 rollouts ready to browse.

demo.html →

Research paper

Read the paper

Conference-style writeup covering the architecture, data curation, training recipe, evaluation protocols, and a discussion of current limitations and future work.

paper.html →

Detailed experiments

See the evaluation

Full experimental log — sanity tests, single-chunk prediction accuracy, zero-shot generalization, and autoregressive rollout analysis across all six capability categories.

results.html →

Reproduce

Read the code

Install instructions, the SO-101 patch to the DreamZero codebase, the data conversion pipeline, the LoRA training recipe, and all evaluation scripts.

code.html →

Community

Contribute data

Record your own SO-101 scenes, format them as LeRobot datasets, and submit them via the HuggingFace Hub. Vizuara retrains on the contributed data and releases new DreamZero-SO101 checkpoints.

contribute.html →

Weights

HuggingFace model

Download the 217 MB adapter. Apply it on top of an unmodified Wan2.1-I2V-14B-480P checkpoint. Ready to run offline inference on a single H100 or RTX 4090.

huggingface.co/Vizuara →