Vizuara Robotics · Apache 2.0 · Released April 2026

The first open world model
for the SO-101 robot arm.

DreamZero-SO101 is a world-action model trained on 715 community episodes of the low-cost SO-101 teleoperator arm. Give it a single camera frame and a natural-language instruction, and it will imagine the robot completing the task — predicting both the future video and the joint commands in one pass.

License: Apache 2.0
Training episodes: 715
Training tasks: 7
Community datasets: 6
Camera views: 3
Adapter size: 217 MB

Imagined rollouts

Seven training tasks the model imagines from a single frame.

Each video below was predicted autoregressively from one initial image plus the language instruction underneath it. No real robot was involved after frame 0. Every task is from the training distribution — the model executes the requested action against the actual scene in front of it.

Pick red cube · "Pick the red cube" · megamix
Pick blue cube · "Pick the blue cube" · megamix
Pick yellow cube · "Pick the yellow cube" · megamix
Pick and place · "Pick the red cube and place it in the bowl" · megamix
Stack objects · "Put the red cube on top of the blue cube" · megamix
All red cubes to bowl · "Pick all the red cubes and place them in the bowl" · megamix
Red cubes on bowl · "Put red cubes on the bowl" · megamix

What it does

Three things DreamZero-SO101 can do

01 — Predict

Plan the next 0.8 seconds

Given one camera frame and a language instruction, the model outputs the robot's next 24 joint commands — everything needed to drive the arm for the next 0.8 seconds. This is the mode you would use on a real SO-101 arm.
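The numbers above imply a command rate: 24 joint commands spread over 0.8 seconds works out to 30 commands per second. The sketch below shows one way a predicted chunk could be laid out on a timeline before being sent to the arm. It is an illustration only: the function names are hypothetical, and the joint count of 6 is an assumption about the SO-101 rather than something stated on this page.

```python
from typing import List, Sequence, Tuple

CHUNK_LEN = 24        # joint commands per prediction (from the text above)
CHUNK_SECONDS = 0.8   # time covered by one prediction (from the text above)
NUM_JOINTS = 6        # ASSUMPTION: number of actuated joints, gripper included


def control_rate_hz(chunk_len: int = CHUNK_LEN,
                    seconds: float = CHUNK_SECONDS) -> float:
    """Command rate implied by one predicted chunk."""
    return chunk_len / seconds


def schedule_chunk(chunk: Sequence[Sequence[float]],
                   start_time: float = 0.0) -> List[Tuple[float, List[float]]]:
    """Pair each joint command with the time (in seconds) it should be sent."""
    dt = CHUNK_SECONDS / len(chunk)
    return [(start_time + i * dt, list(cmd)) for i, cmd in enumerate(chunk)]


# Placeholder chunk standing in for real model output.
chunk = [[0.0] * NUM_JOINTS for _ in range(CHUNK_LEN)]
timeline = schedule_chunk(chunk)
print(f"{control_rate_hz():.1f} commands per second")
print(f"{len(timeline)} commands, spaced {CHUNK_SECONDS / CHUNK_LEN:.4f} s apart")
```

On real hardware the loop that consumes this timeline would also need to request the next chunk before the current one runs out; that scheduling is not described here and is left out of the sketch.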

02 — Imagine

Simulate a full task

Given only the very first frame, the model can autoregressively imagine the entire rest of the episode — up to 18 seconds of video + action, chaining its own predictions. Think of it as a learned simulator of the SO-101 arm.
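Chaining predictions can be pictured as a simple loop: predict a chunk, append it to the rollout, then feed the model's own last predicted frame back in as the next input. The sketch below assumes 0.8-second chunks of 24 steps (the figures given for Predict mode) and uses a stub in place of the real model. The `predict_chunk` interface, the list-based frame type, and conditioning on only the most recent frame are all simplifying assumptions, not the project's actual API.

```python
from typing import Callable, List, Tuple

Frame = List[float]    # stand-in for an image
Action = List[float]   # one joint command

PredictFn = Callable[[Frame, str], Tuple[List[Frame], List[Action]]]


def rollout(first_frame: Frame, instruction: str, predict_chunk: PredictFn,
            horizon_s: float = 18.0, chunk_s: float = 0.8
            ) -> Tuple[List[Frame], List[Action]]:
    """Imagine an episode by repeatedly feeding predictions back as input."""
    n_chunks = int(horizon_s // chunk_s)   # whole chunks that fit in the horizon
    frames: List[Frame] = [first_frame]
    actions: List[Action] = []
    frame = first_frame
    for _ in range(n_chunks):
        new_frames, new_actions = predict_chunk(frame, instruction)
        frames.extend(new_frames)
        actions.extend(new_actions)
        frame = new_frames[-1]             # no real camera after frame 0
    return frames, actions


def dummy_predict_chunk(frame: Frame, instruction: str):
    """Hypothetical stub: returns 24 shifted frames and 24 zero actions."""
    new_frames = [[v + 0.01 * (i + 1) for v in frame] for i in range(24)]
    new_actions = [[0.0] * 6 for _ in range(24)]
    return new_frames, new_actions


frames, actions = rollout([0.0, 0.0], "Pick the red cube", dummy_predict_chunk)
print(len(frames), "frames,", len(actions), "actions")
```

Because every chunk starts from a predicted frame rather than a real one, small errors can compound over the rollout, which is one reason a horizon limit is useful.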

03 — Follow instructions

Respond to language

The same scene with a different instruction produces a different rollout. Swap "red cube" for "blue cube" and the arm reaches for a different object — evidence the model is actually conditioning on language, not replaying training trajectories.

Capabilities on display

Six behaviors, one model

Every capability below comes from the same 217 MB adapter — no scene-specific fine-tuning. Click through to the interactive gallery for all 25 rollouts.

Pick an object

Reach, close the gripper, lift. Learned from 400+ training episodes of the same primitive.

Pick and place

Pick up a cube and place it inside a bowl — a two-phase trajectory with a precise release point.

Stack objects

Place one cube on top of another. Requires approach, grasp, lift, translate, align, release.

Resume mid-episode

Start the model from a frame deep inside the task — descend, grasp, or lift phase — and it picks up where the trajectory left off.

Follow novel instructions

Same scene, a language instruction the model never saw during training. The rollout changes accordingly.

Generalize zero-shot

Predict actions on a dataset the model was never trained on — raw generalization from the SO-101 pretraining mix.

How it was trained

From pretrained video model to SO-101 policy

Start from Wan2.1-I2V-14B

A 14-billion-parameter open video diffusion model. We freeze the backbone and add a small low-rank adapter that learns to predict joint commands alongside video frames.
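To make "frozen backbone plus low-rank adapter" concrete, here is a minimal sketch of the general low-rank adaptation idea in plain Python. A frozen weight matrix W is left untouched, and only two small matrices A and B are trained; their product is added to the layer's output. The class name, rank, scaling factor, and initialization are illustrative assumptions, and this page does not say which layers of Wan2.1 the adapter attaches to.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * A @ B."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_in, d_out = len(weight), len(weight[0])
        self.weight = weight                                   # frozen, never updated
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]
        self.B = [[0.0] * d_out for _ in range(rank)]          # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: Matrix) -> Matrix:
        base = matmul(x, self.weight)
        update = matmul(matmul(x, self.A), self.B)
        return [[b + self.scale * u for b, u in zip(rb, ru)]
                for rb, ru in zip(base, update)]

    def trainable_params(self) -> int:
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[0.1 * (i + j) for j in range(3)] for i in range(4)]      # toy 4x3 frozen weight
layer = LoRALinear(W, rank=2)
x = [[1.0, 2.0, 3.0, 4.0]]
print(layer.forward(x) == matmul(x, W))                        # adapter starts as a no-op
print(layer.trainable_params(), "trainable vs", 4 * 3, "frozen")
```

In this toy case the adapter is not smaller than the frozen weight, but the trainable count grows as rank × (d_in + d_out) instead of d_in × d_out, so for large layers and a small rank the adapter can be a tiny fraction of the backbone.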

Curate 715 SO-101 episodes

Aggregate six community datasets from the HuggingFace Hub — simple picks, pick-and-place, stacking, cloth folding, tabletop cleanup, Lego assembly. Convert everything to a shared multiview format.
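A shared format mostly means that every episode, whatever its source dataset, exposes the same fields with the same alignment. The sketch below is one hypothetical way to express that: an instruction, three camera streams, and one action per step. The field names, the view names, and the use of file paths for frames are invented for illustration; the actual schema, and how datasets with fewer cameras are handled, are not described on this page.

```python
from dataclasses import dataclass
from typing import Dict, List

NUM_VIEWS = 3  # "Camera views: 3" from the stats above


@dataclass
class Episode:
    """Hypothetical unified record for one teleoperated episode."""
    instruction: str
    views: Dict[str, List[str]]   # view name -> one frame path per step
    actions: List[List[float]]    # one joint command per step

    def validate(self) -> None:
        """Check that all views exist and are aligned with the actions."""
        if len(self.views) != NUM_VIEWS:
            raise ValueError(f"expected {NUM_VIEWS} views, got {len(self.views)}")
        steps = len(self.actions)
        for name, frame_paths in self.views.items():
            if len(frame_paths) != steps:
                raise ValueError(f"view {name!r} has {len(frame_paths)} frames, "
                                 f"expected {steps}")


episode = Episode(
    instruction="Pick the red cube",
    views={f"view_{i}": ["frame_000.png", "frame_001.png"] for i in range(NUM_VIEWS)},
    actions=[[0.0] * 6, [0.0] * 6],   # ASSUMPTION: 6 joint values per command
)
episode.validate()
print("episode is consistent:", len(episode.actions), "steps")
```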

Train the adapter for 72,000 steps

Joint video and action prediction via flow matching. 127 hours on two H100 GPUs. The 217 MB adapter plugs into an unmodified Wan2.1 backbone — anyone can download it and run inference.
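For readers new to flow matching, the sketch below shows one common linear-path formulation of the training target: sample noise x0, take a clean target x1, pick a time t in [0, 1], interpolate between them, and train the network to predict the constant velocity x1 - x0. This is a generic illustration under that assumption. The project's exact parameterization, loss weighting, and how video latents and actions are combined are not stated here, and the zero "prediction" is a placeholder for the network.

```python
import random
from typing import List, Tuple


def flow_matching_pair(x0: List[float], x1: List[float],
                       t: float) -> Tuple[List[float], List[float]]:
    """Point on the straight path from noise x0 to data x1, plus velocity target."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.2, -0.5, 1.0]                     # stand-in for a clean latent or action vector
x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample
t = rng.random()                          # random time in [0, 1)

x_t, v_target = flow_matching_pair(x0, x1, t)
v_pred = [0.0] * len(x1)                  # PLACEHOLDER for network(x_t, t, conditioning)
loss = mse(v_pred, v_target)
print(f"t = {t:.3f}, loss = {loss:.4f}")
```

As a rough throughput check from the figures above, 127 hours over 72,000 steps is about 6.35 seconds per step, or 254 GPU-hours in total across the two H100s.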

Release under Apache 2.0

Weights on HuggingFace, code on GitHub, reproducible training recipe, and an invitation for the SO-101 community to contribute data and help scale the next version.

Where to go next

Explore the project

Interactive gallery

Try the demo

Read the paper

See the evaluation

Read the code

Contribute data

HuggingFace model
