Experimental Report · Checkpoint 72K · LoRA rank 4

Experimental
evaluation report.

The complete experimental log for DreamZero-SO101 — every artifact and measurement behind the summary on the Home and Paper pages. Every video below is a real model output from a 14B Wan2.1-I2V backbone with a 108M LoRA adapter, evaluated on the whosricky/so101-megamix-v1 training set and the unseen RajatDandekar/so101_box_to_bowl_v2 test episode.

Sanity tests 3/5

Single-chunk RMSE (train) 0.57°

Single-chunk RMSE (held-out) 2.3°

Zero-shot mean RMSE 11.9°

DreamGen best-match drift 71–94°

Front-view PSNR (raw) 18.5 dB

Methodology

How we stress-tested the world-action model

DreamZero-SO101 is a joint video + action predictor. We probe its behaviour from three angles, each a stricter form of the previous. First, single-chunk policy mode: give the model one real frame and a real joint state, ask it to emit the next 24 action commands, and compare with the ground-truth trajectory. Second, zero-shot generalization: repeat the single-chunk protocol on a completely unseen dataset with a different camera rig, objects, and table layout. Third, autoregressive DreamGen rollouts: drop the live observation entirely and let the model chain its own imagined video back as input for 60 consecutive chunks — the hardest setting, and the one that reveals all failure modes.

01 · Sanity test suite

Does the model remember what it was trained on?

Five diagnostic tests run on a held-out subset of the training distribution. Tests 1–3 confirm the action head, latent space, and joint coherence are all learned; tests 4–5 probe the harder generalization and autoregressive behaviours where the cracks start to show.

test · 01

Action chunk fidelity

Predicted 24-step action chunks overlaid on ground-truth for 3 random frames. First 8 steps track tightly; drift begins past step 12 on fast-motion samples.

✓ pass

test · 02

Latent stability

PCA of the joint action + visual latent, coloured by episode. Clean cluster separation confirms the LoRA has learned an SO-101-aware embedding space.

✓ pass

test · 03

Joint coherence

Per-joint phase offsets across a full episode. Shoulder, elbow, and wrist all remain within 1° of the ground-truth coupling.

✓ pass

test · 04

Multi-task disambiguation

Same initial frame, four different prompts. Model distinguishes "red" vs "blue" in the first chunk, but the trajectories collapse onto a shared prior after ~2 chunks.

◐ partial

test · 05

Autoregressive rollout

60-chunk closed-loop rollout on a training episode. Model completes the task inside its imagination but degrades after the real episode end — see section 05 below for the full story.

◐ partial

02 · Single-chunk policy mode

Per-episode evaluation on the training distribution

For each episode below, we feed the model the first frame, the real joint state at that frame, and the training language prompt. The model emits 9 predicted frames + 24 joint commands. We compare the predicted commands against the logged ground-truth trajectory over the same 24 steps.

Episode 0 — training sample

"Pick the red cube"

whosricky/so101-megamix-v1 · ep 0 · frame 0 · episode length 328

action RMSE

0.57°

video PSNR

4.8 dB

chunk horizon

24 steps

predicted · front0.3×

predicted · top0.3×

predicted · gripper0.3×

ground-truth · front0.3×

ground-truth · future1×

predicted · 2×2 mosaic1×

Predicted vs ground-truth joints over the 24-step chunk. All 6 DOFs track within 1° — on the training distribution with the canonical initial frame, the model is effectively exact.

Episode 50 — held-out sample

"Pick the red cube"

whosricky/so101-megamix-v1 · ep 50 · frame 0 · episode length 374 · never seen in training

action RMSE

2.29°

video PSNR

4.9 dB

chunk horizon

24 steps

predicted · front0.3×

predicted · top0.3×

predicted · gripper0.3×

ground-truth · front0.3×

ground-truth · future1×

predicted · 2×2 mosaic1×

Held-out episode: shoulder and elbow are still within ~4° but the wrist starts drifting past step 16. Episode 50 was never part of the training set — the model never saw this exact object layout.

Episode 100 — held-out sample

"Pick the red cube"

whosricky/so101-megamix-v1 · ep 100 · frame 0 · episode length 374 · never seen in training

action RMSE

1.6°

video PSNR

4.9 dB

chunk horizon

24 steps

predicted · front0.3×

predicted · top0.3×

predicted · gripper0.3×

ground-truth · front0.3×

ground-truth · future1×

predicted · 2×2 mosaic1×

Cleanest held-out sample. All 6 joints track the ground-truth curves within 2° across the full 24-step chunk — this is the behaviour the paper calls "cruise RMSE".

03 · Mid-episode dynamics

Does it handle high-acceleration phases?

Single-chunk evaluation at mid-episode frames where the arm is already in motion. These are the hardest frames for an I2V-based model: the instantaneous velocity has to be inferred from a still image, and fast descend / grasp / lift phases amplify any phase error.

ep50 · frame 80 — descend phase

"Pick the red cube"

high-speed descent · GT motion magnitude 69°

action RMSE

11.97°

dominant drift

shoulder

worst joint

23.3°

predicted · front0.3×

predicted · top0.3×

ground-truth · front0.3×

Descending from frame 80: the model correctly predicts the direction of motion but underestimates the velocity, bleeding ~12° on the shoulder joint over the 24-step horizon.

ep100 · frame 150 — grasp phase

"Pick the red cube"

fine-motor grasp closure

action RMSE

8.4°

dominant drift

wrist

gripper

predicted · front0.3×

predicted · top0.3×

ground-truth · front0.3×

Grasp closure at frame 150: wrist rotation is the biggest residual, but the gripper open/close signal tracks ground-truth cleanly.

ep50 · f150 grasp0.3×

ep50 · f220 lift0.3×

ep100 · f80 descend0.3×

ep100 · f220 lift0.3×

Per-step squared action error across all 24 chunk steps, averaged over the 6 mid-episode samples above. Error stays below 1°² for the first 8 steps and grows roughly linearly after step 12.

04 · Zero-shot generalization

A dataset the model has never seen

We replay the single-chunk policy-mode protocol on RajatDandekar/so101_box_to_bowl_v2, episode 0 — a completely different SO-101 scene with a different camera rig, different object set, and a different prompt ("Pick up the box and place it in the red bowl"). 8 sample frames, evenly spaced across the 604-frame episode.

Mean action RMSE11.9°

Best action RMSE1.57°

Worst action RMSE26.7°

Mean video PSNR18.5 dB

Samples8

Per-step RMSE across all 8 zero-shot samples. Early steps are within 2° but the tail explodes past step 18 on worst-case frames.

Mean RMSE as a function of position in the episode. Best results come from mid-episode cruise frames; approach and drop-off phases both cost the model 5–15° of extra error.

Per-joint overlay of predicted vs ground-truth across the 8 samples. Shoulder and elbow carry most of the residual; wrist and gripper are essentially correct on the mid-episode cruise frames.

f00015.7°

f0847.2°

f1693.1°

f2541.57°

f33913.4°

f42418.2°

f50926.7°

f5949.4°

05 · Autoregressive rollouts — DreamGen mode

Imagining 18 seconds of video from a single frame

The hardest evaluation mode. We give the model a single front-camera frame and the starting joint state, then run it in closed loop for 60 consecutive chunks (540 imagined frames ≈ 18 seconds) without ever feeding it a real observation again. After each chunk, the model's own output becomes the input for the next. Six rollouts: three episodes (0 / 245 / 206) × two prompts (training + novel).

Raw per-chunk action drift (time-aligned, in degrees) across all six rollouts. The apparent blow-up past chunk 40 is largely an artefact of comparing imagined actions with already-ended ground-truth episodes — see the drift analysis section below.

Episode 0 — training

"Pick the red cube"

real episode: 10.9 s · 328 frames

best-match 93.7°

time-aligned 96.8°

Episode 0 — novel prompt

"Pick the blue cube"

real episode: 10.9 s · 328 frames

best-match 90.1°

time-aligned 97.3°

Episode 245 — training

"Pick the red cube and place it in the bowl"

real episode: 17.9 s · 537 frames

best-match 74.0°

time-aligned 109.0°

Episode 245 — novel prompt

"Pick the yellow cube and place it in the bowl"

real episode: 17.9 s · 537 frames

best-match 85.3°

time-aligned 111.1°

Episode 206 — training

"Put the red cube on top of the blue cube"

real episode: 19.1 s · 573 frames

best-match 71.0°

time-aligned 90.1°

Episode 206 — novel prompt

"Put the blue cube on top of the red cube"

real episode: 19.1 s · 573 frames

best-match 83.4°

time-aligned 102.5°

06 · Drift analysis

Three ways to measure drift — and why it matters

The time-aligned metric (predicted chunk vs GT chunk at the same wall-clock time) is brutal because it penalises any speed mismatch — and because the imagined rollout continues for 18 seconds while several real episodes end at 10.9 or 17.9 seconds. The best-match drift instead compares each predicted joint configuration against the closest matching GT frame anywhere in the real episode — a speed-invariant measure of "did the arm actually pass through this pose at some point?".

Drift metrics for all 6 rollouts. Bars: mean best-match drift. Dots: time-aligned drift. The 20–40° gap between the two is the speed-mismatch penalty — the arm is doing the right thing, just not at the right tempo.

ep000_train94° / 97°

ep000_novel90° / 97°

ep245_train74° / 109°

ep245_novel85° / 111°

ep206_train71° / 90°

ep206_novel83° / 102°

Key takeaways

The model executes the task in imagination. All six rollouts visually complete the intended pick, place, or stack inside the imagined video stream.
Single-chunk policy mode is solid. 0.57° on the training distribution, 1.6–2.3° on held-out training episodes, and a mean 11.9° on a completely unseen zero-shot dataset — well within the usable range for closed-loop control.
Autoregressive rollouts carry 70–95° of best-match drift. Much of that is speed mismatch and post-episode degradation; some of it is genuine prediction-chaining error. Both are expected limitations of a 72K-step LoRA checkpoint.
Language conditioning decays with horizon. "Pick the red cube" vs "Pick the blue cube" on the same initial frame diverge for the first two chunks and then converge onto the same trajectory — the video stream dominates after that.
No termination signal. The model was never trained to emit "task complete", so frames past the real episode end degrade into an increasingly noisy rollout. A stop-token head is the natural next step.

Research paper

Experimental
evaluation report.

How we stress-tested the world-action model

Does the model remember what it was trained on?

Action chunk fidelity

Latent stability

Joint coherence

Multi-task disambiguation

Autoregressive rollout

Per-episode evaluation on the training distribution

Episode 0 — training sample

Episode 50 — held-out sample

Episode 100 — held-out sample

Does it handle high-acceleration phases?

ep50 · frame 80 — descend phase

ep100 · frame 150 — grasp phase

A dataset the model has never seen

Imagining 18 seconds of video from a single frame

Episode 0 — training

Episode 0 — novel prompt

Episode 245 — training

Episode 245 — novel prompt

Episode 206 — training

Episode 206 — novel prompt

Three ways to measure drift — and why it matters

Key takeaways

Read the writeup

Play with the gallery

Read the code

Experimentalevaluation report.

How we stress-tested the world-action model

Does the model remember what it was trained on?

Action chunk fidelity

Latent stability

Joint coherence

Multi-task disambiguation

Autoregressive rollout

Per-episode evaluation on the training distribution

Episode 0 — training sample

Episode 50 — held-out sample

Episode 100 — held-out sample

Does it handle high-acceleration phases?

ep50 · frame 80 — descend phase

ep100 · frame 150 — grasp phase

A dataset the model has never seen

Imagining 18 seconds of video from a single frame

Episode 0 — training

Episode 0 — novel prompt

Episode 245 — training

Episode 245 — novel prompt

Episode 206 — training

Episode 206 — novel prompt

Three ways to measure drift — and why it matters

Key takeaways

Read the writeup

Play with the gallery

Read the code

Experimental
evaluation report.