Reproduce · Code Reference

From data to deployed rollout.

Everything you need to replicate DreamZero-SO101 from scratch — or to run inference on your own SO-101 footage. The codebase is at github.com/Vizuara-AI-Lab/dreamzero-so101.

Step 1

Install

Clone DreamZero-SO101, apply the SO-101 patch to the upstream DreamZero codebase, and download the pre-trained weights.

# 1. Clone DreamZero-SO101
git clone https://github.com/Vizuara-AI-Lab/dreamzero-so101.git
cd dreamzero-so101

# 2. Clone upstream DreamZero (Apache 2.0) and install it in editable mode
git clone https://github.com/dreamzero0/dreamzero.git
(cd dreamzero && pip install -e .)

# 3. Apply SO-101 embodiment patch
#    (the subshell leaves your working directory unchanged even if a step fails)
(cd dreamzero && git apply ../patches/so101_embodiment.patch)

# 4. Download Wan2.1-I2V-14B backbone (~120 GB)
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P \
    --local-dir ./checkpoints/Wan2.1-I2V-14B-480P

# 5. Download DreamZero-SO101 LoRA adapter (217 MB)
huggingface-cli download Vizuara/dreamzero-so101-lora \
    --local-dir ./checkpoints/dreamzero-so101-lora

Note: The backbone requires ~120 GB of disk space. A single H100 80GB is sufficient for inference (4 Euler steps, ~600 ms/chunk). LoRA training requires 2× H100 with DeepSpeed ZeRO-2; full fine-tuning requires 4× H100 with ZeRO-2 and CPU offload.
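
Because the backbone download is large, it can be worth checking free disk space before starting. The following is a minimal pre-flight sketch, not part of the repository: the function name and the 10 GB safety margin are assumptions, and only the ~120 GB and 217 MB figures come from the text above.

```python
import shutil

# Sizes quoted in the install step above.
BACKBONE_GB = 120.0
LORA_GB = 0.217
# Assumption: extra headroom for temporary download files.
MARGIN_GB = 10.0


def has_room_for_checkpoints(
    path: str = ".",
    required_gb: float = BACKBONE_GB + LORA_GB + MARGIN_GB,
) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    if has_room_for_checkpoints("."):
        print("enough free disk space for the checkpoints")
    else:
        print("not enough free disk space for the checkpoints")
```

Run it from the directory that will hold ./checkpoints, since free space is measured on the filesystem containing the given path.
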
Step 2

Data pipeline

Discover SO-101 datasets on HuggingFace, download them, and convert to GEAR format (the aligned triplet format DreamZero trains on).

# Discover all SO-101 datasets on HuggingFace Hub
python scripts/enumerate_so101.py --output so101_manifest.json

# Download and convert to GEAR format
# Produces: ./data/so101_gear/{split}/XXXXX.{mp4,json}
python scripts/download_and_convert.py \
    --manifest so101_manifest.json \
    --output-dir ./data/so101_gear \
    --dreamzero-dir ./dreamzero

Each GEAR sample is a triplet of: a camera-frame chunk (9 latent frames), a 6-DOF joint state vector, and a 24-step action chunk. The conversion script handles padding for episodes with fewer than 3 cameras and resamples to 30 FPS if the source dataset is at a different frame rate.
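
To make the padding and resampling steps concrete, here is a small illustrative sketch. It is not the code in download_and_convert.py. The function names, the sample-and-hold resampling rule, and the use of None as the placeholder for a missing camera are all assumptions, since the text does not say how the real script does either.

```python
from typing import List, Optional, Sequence

TARGET_FPS = 30.0  # frame rate the conversion resamples to
MAX_CAMERAS = 3    # episodes with fewer cameras are padded up to this


def resample_indices(n_src_frames: int, src_fps: float,
                     dst_fps: float = TARGET_FPS) -> List[int]:
    """Source-frame index to show at each output frame (sample-and-hold).

    Assumption: the most recent source frame is repeated or frames are dropped;
    the real script may interpolate instead.
    """
    if n_src_frames <= 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame count and frame rates must be positive")
    n_dst = max(1, round(n_src_frames * dst_fps / src_fps))
    return [min(n_src_frames - 1, int(i * src_fps / dst_fps)) for i in range(n_dst)]


def pad_cameras(views: Sequence[str],
                target: int = MAX_CAMERAS) -> List[Optional[str]]:
    """Fill missing camera slots with a placeholder (None is a hypothetical choice)."""
    if not views:
        raise ValueError("at least one camera view is required")
    if len(views) > target:
        raise ValueError(f"expected at most {target} views, got {len(views)}")
    return list(views) + [None] * (target - len(views))
```

For example, a 15 FPS clip has each frame held for two output frames, a 60 FPS clip keeps every second frame, and a two-camera episode gains one placeholder slot.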

SO-101 datasets used for checkpoint 72K

| Dataset | Episodes | Tasks | Cameras |
|---|---|---|---|
| whosricky/so101-megamix-v1 | 400 | 8 | 3 |
| lipsop/so101-block-in-bin-100ep | 100 | 1 | 2 |
| youliangtan/so101-table-cleanup | 80 | 4 | 2 |
| G3ND3K/so101_picking_up_green_lego_big | 60 | 1 | 2 |
| lerobot/svla_so101_pickplace | 50 | 1 | 2 |
| observabot/so101_cloth_folding | 125 | 1 | 3 |

Step 3

Training

Two training modes: LoRA (recommended — cheap, fast, good quality) and Full Fine-Tune (best quality, needs 4× H100).

LoRA training (recommended)

# Set environment variables
export WAN_CKPT_DIR=./checkpoints/Wan2.1-I2V-14B-480P
export SO101_DATA_DIR=./data/so101_gear
export TOKENIZER_DIR=./checkpoints/umt5-xxl

# Launch LoRA training on 2× H100 (converges around 72K steps)
bash scripts/train_lora.sh 2 100000

Full fine-tune (best quality)

# 4× H100, ~28 hours (~$336 on RunPod)
bash scripts/train_full.sh 4 100000

Training hyperparameters

| Parameter | LoRA | Full FT |
|---|---|---|
| Trainable params | ~108M LoRA + action heads | 14B (all DiT) + action heads |
| Learning rate | 1e-4 | 1e-5 |
| Batch size | 1 per GPU | 1 per GPU |
| Gradient accumulation | 4 | 4 |
| GPUs | 2× H100 | 4× H100 |
| DeepSpeed | ZeRO-2 | ZeRO-2 + CPU offload |
| Resolution | 320×176 | 320×176 |
| Video frames | 33 (→ 9 latent) | 33 (→ 9 latent) |
| Action horizon | 24 steps | 24 steps |
| LoRA rank | 4 | N/A |
| LoRA targets | q, k, v, o, ffn.0, ffn.2 | N/A |
| Steps to convergence | ~72K (~127h on 2×H100) | ~50K (~28h on 4×H100) |
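
Two of the numbers in the hyperparameter table can be derived from the others. The sketch below works through them under stated assumptions: that the video VAE keeps the first frame and compresses the remaining frames 4× in time (which reproduces the 33 → 9 mapping), and that ZeRO-2 runs as plain data parallelism, so the effective batch is per-GPU batch × gradient accumulation × GPU count. The function names are illustrative only.

```python
def latent_frames(video_frames: int, temporal_stride: int = 4) -> int:
    """Latent frame count, assuming the first frame is kept and the rest are strided."""
    if video_frames < 1 or (video_frames - 1) % temporal_stride != 0:
        raise ValueError("video_frames must be 1 + a multiple of temporal_stride")
    return 1 + (video_frames - 1) // temporal_stride


def effective_batch(per_gpu: int, grad_accum: int, n_gpus: int) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    return per_gpu * grad_accum * n_gpus


if __name__ == "__main__":
    print("latent frames:", latent_frames(33))
    print("LoRA effective batch:", effective_batch(1, 4, 2))
    print("Full FT effective batch:", effective_batch(1, 4, 4))
```

Under these assumptions the LoRA run sees 8 samples per optimizer step and the full fine-tune sees 16, which is worth keeping in mind when comparing the two learning rates.
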
Step 4

Evaluation

Evaluation has three parts: five sanity tests, a single-chunk policy-mode evaluation, and an autoregressive DreamGen rollout.

# Run offline evaluation (single-chunk policy mode)
python scripts/evaluate.py \
    --model-path ./checkpoints/dreamzero-so101-lora \
    --data-dir ./data/so101_gear \
    --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
    --output-dir ./eval_output

# Run DreamGen autoregressive rollout for one episode
python scripts/evaluate.py \
    --model-path ./checkpoints/dreamzero-so101-lora \
    --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
    --episode 0 \
    --prompt "Pick the red cube" \
    --mode dreamgen \
    --n-chunks 60 \
    --output-dir ./rollout_ep0

The evaluation script writes per-sample meta.json, action_plot.png, pred_front.mp4, and drift_plot.png to the output directory. These are the same artefacts shown on the Results page.
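
After a run, a quick check that the expected artefacts exist can catch a silently failed sample. This is a hypothetical helper, not part of evaluate.py. It assumes each sample's files sit together in one directory, and the text does not say whether every mode writes all four files, so treat a missing file as a prompt to investigate rather than a definite error.

```python
from pathlib import Path
from typing import List, Union

# File names listed in the paragraph above.
EXPECTED_ARTIFACTS = ("meta.json", "action_plot.png", "pred_front.mp4", "drift_plot.png")


def missing_artifacts(sample_dir: Union[str, Path]) -> List[str]:
    """Return the expected artefact names that are absent from `sample_dir`."""
    directory = Path(sample_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (directory / name).is_file()]
```

Calling missing_artifacts("./rollout_ep0") returns an empty list when all four files are present.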

Step 5

Inference demo

Run a single-shot inference on any SO-101 camera frame using the released LoRA adapter.

# Single-shot inference demo
python scripts/infer_demo.py \
    --model-path ./checkpoints/dreamzero-so101-lora \
    --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
    --image ./sample_obs.jpg \
    --joint-state "-0.47 -99.23 95.37 67.74 -1.64 1.99" \
    --prompt "Pick the red cube" \
    --output ./infer_output

# Output: infer_output/pred_front.mp4 (predicted video) +
#         infer_output/actions.json (24 predicted joint commands)

Inference time: ~600 ms/chunk on a single H100 80GB at 4 Euler steps; on a 40GB A100 expect ~900 ms/chunk. CPU inference is not practical (hours per chunk). The LoRA adapter is ~217 MB; the backbone needs ~120 GB of disk but only ~80 GB of GPU memory at fp16.
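
The sketch below shows how the --joint-state string splits into six values, and how the quoted latencies compare with the time one action chunk covers. The helper names are hypothetical. The 30 Hz control rate is an assumption borrowed from the 30 FPS data resampling; the text does not state the rate at which actions are executed, nor whether inference overlaps with execution.

```python
from typing import List


def parse_joint_state(text: str, dof: int = 6) -> List[float]:
    """Split a whitespace-separated joint-state string into `dof` floats."""
    values = [float(token) for token in text.split()]
    if len(values) != dof:
        raise ValueError(f"expected {dof} joint values, got {len(values)}")
    return values


def chunk_duration_s(horizon: int = 24, control_hz: float = 30.0) -> float:
    """Seconds of motion covered by one action chunk (control_hz is an assumption)."""
    return horizon / control_hz


def inference_keeps_up(latency_ms: float, horizon: int = 24,
                       control_hz: float = 30.0) -> bool:
    """True if one chunk is predicted faster than the previous chunk plays out."""
    return latency_ms / 1000.0 <= chunk_duration_s(horizon, control_hz)


if __name__ == "__main__":
    print(parse_joint_state("-0.47 -99.23 95.37 67.74 -1.64 1.99"))
    print("H100 keeps up:", inference_keeps_up(600))
    print("A100 40GB keeps up:", inference_keeps_up(900))
```

If actions really are executed at 30 Hz, a 24-step chunk spans 0.8 s, so ~600 ms per chunk leaves headroom while ~900 ms does not.
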
Reference

Repository structure

dreamzero-so101/
├── README.md
├── patches/
│   └── so101_embodiment.patch   # Adds SO-101 to DreamZero's embodiment registry
├── configs/
│   ├── so101_lora.yaml          # LoRA training config
│   ├── so101_full_ft.yaml       # Full fine-tune config
│   ├── so101_inference.yaml     # Inference config
│   └── so101_relative.yaml      # GEAR data config (installed via patch)
├── scripts/
│   ├── enumerate_so101.py       # Find SO-101 datasets on HuggingFace Hub
│   ├── download_and_convert.py  # Download + GEAR format conversion
│   ├── train_lora.sh            # LoRA training launcher (2× H100)
│   ├── train_full.sh            # Full FT launcher (4× H100)
│   ├── evaluate.py              # Offline eval (policy-mode + DreamGen)
│   └── infer_demo.py            # Single-shot inference demo
└── notebooks/
    └── quickstart.ipynb         # Colab quickstart notebook
Reference

Known issues & fixes

Bug 01

cuDNN SDPA error

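Flash attention fails on some cuDNN builds during training. Fix: run the DiT forward pass inside a with torch.backends.cuda.sdp_kernel(enable_flash=False): block, or set TORCH_CUDNN_SDPA_ENABLED=0 in the environment before launching training. Newer PyTorch releases deprecate torch.backends.cuda.sdp_kernel in favour of torch.nn.attention.sdpa_kernel, so expect a deprecation warning there.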
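
One way to apply the environment-variable fix without editing any training code is to set it on the child process that runs the launcher. The wrapper below is a hypothetical sketch: the helper names are not part of the repository, and only the variable name comes from the fix above.

```python
import os
import subprocess
from typing import Dict, List, Optional


def training_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and disable the cuDNN SDPA backend in the copy."""
    env = dict(os.environ if base is None else base)
    env["TORCH_CUDNN_SDPA_ENABLED"] = "0"
    return env


def launch(cmd: List[str]) -> int:
    """Run a launcher command with the patched environment; return its exit code."""
    return subprocess.run(cmd, env=training_env(), check=False).returncode
```

For example, launch(["bash", "scripts/train_lora.sh", "2", "100000"]) starts the LoRA run with the variable already set when PyTorch is first imported.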

Bug 02

PyTorch 2.1 LR scheduler strict mode

Resuming from checkpoint with strict=True in the LR scheduler state_dict causes a key mismatch. Fix: pass strict=False to scheduler.load_state_dict(), or use PyTorch ≥ 2.2.

Bug 03

modality.json missing annotation.task

The GEAR format requires an annotation.task entry in modality.json for the language conditioning to be correctly routed. The SO-101 patch adds this mapping; without it, the UMT5 text encoder receives zero embeddings.
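
A quick check for this entry can save a training run that would otherwise proceed with empty language conditioning. The helper below is a sketch under assumptions: the exact layout of modality.json is not given here, so it accepts either a nested {"annotation": {"task": ...}} form or a flat "annotation.task" key, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def has_task_annotation(modality: Dict[str, Any]) -> bool:
    """True if the parsed modality config exposes an annotation.task entry."""
    if "annotation.task" in modality:  # flat dotted key (assumed possible)
        return True
    annotation = modality.get("annotation")  # nested form (assumed possible)
    return isinstance(annotation, dict) and "task" in annotation


def check_modality_file(path: Union[str, Path]) -> bool:
    """Load a modality.json file and report whether the task annotation is present."""
    with open(path, encoding="utf-8") as handle:
        return has_task_annotation(json.load(handle))
```

Running check_modality_file on each converted dataset before training makes the missing-mapping case visible immediately.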

Bug 04

dataloader_num_workers

Setting dataloader_num_workers above 6 causes workers to be silently killed on RunPod H100 nodes. Recommended: dataloader_num_workers=4.
