Everything you need to replicate DreamZero-SO101 from scratch — or to run inference on your own SO-101 footage. The codebase is at github.com/Vizuara-AI-Lab/dreamzero-so101.
Clone DreamZero-SO101, apply the SO-101 patch to the upstream DreamZero codebase, and download the pre-trained weights.
    # 1. Clone DreamZero-SO101
    git clone https://github.com/Vizuara-AI-Lab/dreamzero-so101.git
    cd dreamzero-so101

    # 2. Clone upstream DreamZero (Apache 2.0) and install
    git clone https://github.com/dreamzero0/dreamzero.git
    cd dreamzero && pip install -e . && cd ..

    # 3. Apply SO-101 embodiment patch
    cd dreamzero && git apply ../patches/so101_embodiment.patch && cd ..

    # 4. Download Wan2.1-I2V-14B backbone (~120 GB)
    huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P \
        --local-dir ./checkpoints/Wan2.1-I2V-14B-480P

    # 5. Download DreamZero-SO101 LoRA adapter (217 MB)
    huggingface-cli download Vizuara/dreamzero-so101-lora \
        --local-dir ./checkpoints/dreamzero-so101-lora
Discover SO-101 datasets on HuggingFace, download them, and convert to GEAR format (the aligned triplet format DreamZero trains on).
    # Discover all SO-101 datasets on HuggingFace Hub
    python scripts/enumerate_so101.py --output so101_manifest.json

    # Download and convert to GEAR format
    # Produces: ./data/so101_gear/{split}/XXXXX.{mp4,json}
    python scripts/download_and_convert.py \
        --manifest so101_manifest.json \
        --output-dir ./data/so101_gear \
        --dreamzero-dir ./dreamzero
Each GEAR sample is a triplet of: a camera-frame chunk (9 latent frames), a 6-DOF joint state vector, and a 24-step action chunk. The conversion script handles padding for episodes with fewer than 3 cameras and resamples to 30 FPS if the source dataset is at a different frame rate.
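The padding and resampling described above could look roughly like the sketch below. It is not taken from download_and_convert.py: the helper names, the floor-based frame selection, and the use of a None placeholder for a missing camera are all assumptions made for illustration.

```python
from typing import List, Optional

TARGET_FPS = 30   # frame rate stated in the text
NUM_CAMERAS = 3   # camera count that shorter episodes are padded up to


def resample_indices(n_frames: int, src_fps: float,
                     dst_fps: float = TARGET_FPS) -> List[int]:
    """For each output frame, return the index of the source frame to reuse.

    Hypothetical helper: it picks the most recent source frame (floor), so
    upsampling repeats frames and downsampling drops them.
    """
    if n_frames <= 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame count and frame rates must be positive")
    n_out = max(1, round(n_frames * dst_fps / src_fps))
    return [min(n_frames - 1, int(i * src_fps / dst_fps)) for i in range(n_out)]


def pad_cameras(views: List[str], n: int = NUM_CAMERAS,
                filler: Optional[str] = None) -> List[Optional[str]]:
    """Pad a list of camera views up to n entries with a placeholder."""
    if not views:
        raise ValueError("at least one camera view is required")
    if len(views) > n:
        raise ValueError(f"expected at most {n} views, got {len(views)}")
    return list(views) + [filler] * (n - len(views))
```

With these definitions, a 4-frame clip recorded at 15 FPS maps to source indices `[0, 0, 1, 1, 2, 2, 3, 3]`, and a two-camera episode `["front", "wrist"]` becomes `["front", "wrist", None]`. The real converter may fill the missing view differently, for example with black frames.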
| Dataset | Episodes | Tasks | Cameras |
|---|---|---|---|
| whosricky/so101-megamix-v1 | 400 | 8 | 3 |
| lipsop/so101-block-in-bin-100ep | 100 | 1 | 2 |
| youliangtan/so101-table-cleanup | 80 | 4 | 2 |
| G3ND3K/so101_picking_up_green_lego_big | 60 | 1 | 2 |
| lerobot/svla_so101_pickplace | 50 | 1 | 2 |
| observabot/so101_cloth_folding1 | 25 | 1 | 3 |
Two training modes: LoRA (recommended — cheap, fast, good quality) and Full Fine-Tune (best quality, needs 4× H100).
    # Set environment variables
    export WAN_CKPT_DIR=./checkpoints/Wan2.1-I2V-14B-480P
    export SO101_DATA_DIR=./data/so101_gear
    export TOKENIZER_DIR=./checkpoints/umt5-xxl

    # Launch LoRA training on 2× H100, converges around 70K steps
    bash scripts/train_lora.sh 2 100000

    # 4× H100, ~28 hours (~$336 on RunPod)
    bash scripts/train_full.sh 4 100000
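Before launching either script, it can help to confirm that the three environment variables exported above are set and point at real directories. The check below is a minimal sketch and not part of the repository; the function name and the message wording are hypothetical.

```python
import os
from typing import List, Mapping

# Variable names taken from the export lines in the training commands.
REQUIRED_VARS = ("WAN_CKPT_DIR", "SO101_DATA_DIR", "TOKENIZER_DIR")


def check_training_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems; an empty list means the env looks usable."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_training_env():
        print(problem)
```

Running it in the shell where the variables were exported prints one line per problem and nothing when all three directories exist.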
| Parameter | LoRA | Full FT |
|---|---|---|
| Trainable params | ~108M LoRA + action heads | 14B (all DiT) + action heads |
| Learning rate | 1e-4 | 1e-5 |
| Batch size | 1 per GPU | 1 per GPU |
| Gradient accumulation | 4 | 4 |
| GPUs | 2× H100 | 4× H100 |
| DeepSpeed | ZeRO-2 | ZeRO-2 + CPU offload |
| Resolution | 320×176 | 320×176 |
| Video frames | 33 (→ 9 latent) | 33 (→ 9 latent) |
| Action horizon | 24 steps | 24 steps |
| LoRA rank | 4 | N/A |
| LoRA targets | q, k, v, o, ffn.0, ffn.2 | N/A |
| Steps to convergence | ~72K (~127h on 2×H100) | ~50K (~28h on 4×H100) |
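A few of the numbers in the table can be cross-checked with simple arithmetic. The sketch below assumes a temporal compression stride of 4 with the first frame kept on its own (chosen because it reproduces 33 → 9) and assumes plain data parallelism, so that samples per optimizer step are per-GPU batch × accumulation × GPU count. The function names are illustrative.

```python
def latent_frames(video_frames: int, temporal_stride: int = 4) -> int:
    """Latent frame count when the first frame is kept and the remaining
    frames are compressed in groups of `temporal_stride` (assumed behaviour)."""
    if video_frames < 1 or (video_frames - 1) % temporal_stride != 0:
        raise ValueError("video_frames must equal 1 + k * temporal_stride")
    return 1 + (video_frames - 1) // temporal_stride


def effective_batch_size(per_gpu: int, grad_accum: int, gpus: int) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    return per_gpu * grad_accum * gpus


def gpu_hours(wall_clock_hours: float, gpus: int) -> float:
    """Total GPU time for a run."""
    return wall_clock_hours * gpus


print(latent_frames(33))               # 9
print(effective_batch_size(1, 4, 2))   # 8   (LoRA, 2 GPUs)
print(effective_batch_size(1, 4, 4))   # 16  (full FT, 4 GPUs)
print(gpu_hours(127, 2))               # 254 (LoRA)
print(gpu_hours(28, 4))                # 112 (full FT)
print(336 / gpu_hours(28, 4))          # 3.0 dollars per GPU-hour implied by ~$336
```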
Five sanity tests + single-chunk policy-mode evaluation + autoregressive DreamGen rollout.
    # Run offline evaluation (single-chunk policy mode)
    python scripts/evaluate.py \
        --model-path ./checkpoints/dreamzero-so101-lora \
        --data-dir ./data/so101_gear \
        --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
        --output-dir ./eval_output

    # Run DreamGen autoregressive rollout for one episode
    python scripts/evaluate.py \
        --model-path ./checkpoints/dreamzero-so101-lora \
        --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
        --episode 0 \
        --prompt "Pick the red cube" \
        --mode dreamgen \
        --n-chunks 60 \
        --output-dir ./rollout_ep0
The evaluation script writes per-sample meta.json, action_plot.png,
pred_front.mp4, and drift_plot.png to the output directory. These are the
same artefacts shown on the Results page.
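A quick way to confirm that an evaluation run finished is to look for the four files named above. The helper below is a hypothetical sketch: it assumes the files for one sample sit directly inside a single directory, which may not match the layout evaluate.py actually produces.

```python
from pathlib import Path
from typing import List

# File names as listed in the text; the flat per-sample layout is an assumption.
EXPECTED_ARTIFACTS = (
    "meta.json",
    "action_plot.png",
    "pred_front.mp4",
    "drift_plot.png",
)


def missing_artifacts(sample_dir: str) -> List[str]:
    """Return the expected artefact names that are absent from sample_dir."""
    root = Path(sample_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list means all four artefacts are present; anything else names the files to investigate.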
Run a single-shot inference on any SO-101 camera frame using the released LoRA adapter.
    # Single-shot inference demo
    python scripts/infer_demo.py \
        --model-path ./checkpoints/dreamzero-so101-lora \
        --base-model-path ./checkpoints/Wan2.1-I2V-14B-480P \
        --image ./sample_obs.jpg \
        --joint-state "-0.47 -99.23 95.37 67.74 -1.64 1.99" \
        --prompt "Pick the red cube" \
        --output ./infer_output

    # Output: infer_output/pred_front.mp4 (predicted video) +
    #         infer_output/actions.json (24 predicted joint commands)
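The `--joint-state` argument in the command above is a single quoted string of six space-separated values, matching the 6-DOF joint state. When preparing that string from your own logs, a small validator can catch a wrong joint count early. This is a sketch with a hypothetical function name; how infer_demo.py parses the flag internally is not shown in this guide.

```python
from typing import List

NUM_JOINTS = 6  # the SO-101 joint state is described as 6-DOF


def parse_joint_state(text: str, n_joints: int = NUM_JOINTS) -> List[float]:
    """Parse a space-separated joint-state string into a list of floats."""
    values = [float(token) for token in text.split()]
    if len(values) != n_joints:
        raise ValueError(f"expected {n_joints} joint values, got {len(values)}")
    return values


state = parse_joint_state("-0.47 -99.23 95.37 67.74 -1.64 1.99")
print(state)  # [-0.47, -99.23, 95.37, 67.74, -1.64, 1.99]
```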
    dreamzero-so101/
    ├── README.md
    ├── patches/
    │   └── so101_embodiment.patch    # Adds SO-101 to DreamZero's embodiment registry
    ├── configs/
    │   ├── so101_lora.yaml           # LoRA training config
    │   ├── so101_full_ft.yaml        # Full fine-tune config
    │   ├── so101_inference.yaml      # Inference config
    │   └── so101_relative.yaml       # GEAR data config (installed via patch)
    ├── scripts/
    │   ├── enumerate_so101.py        # Find SO-101 datasets on HuggingFace Hub
    │   ├── download_and_convert.py   # Download + GEAR format conversion
    │   ├── train_lora.sh             # LoRA training launcher (2× H100)
    │   ├── train_full.sh             # Full FT launcher (4× H100)
    │   ├── evaluate.py               # Offline eval (policy-mode + DreamGen)
    │   └── infer_demo.py             # Single-shot inference demo
    └── notebooks/
        └── quickstart.ipynb          # Colab quickstart notebook
Flash attention fails on some cuDNN builds during training. Fix: wrap the DiT forward pass with torch.backends.cuda.sdp_kernel(enable_flash=False) or set TORCH_CUDNN_SDPA_ENABLED=0.
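Both workarounds can be wrapped in small helpers, sketched below. The helper names are hypothetical, and setting the variable before PyTorch is imported is an assumption about when it is read. Note also that `torch.backends.cuda.sdp_kernel` is deprecated in recent PyTorch releases in favour of `torch.nn.attention.sdpa_kernel`, so the context manager here targets the older API named in the text.

```python
import os
from contextlib import contextmanager
from typing import MutableMapping


def disable_cudnn_sdpa(env: MutableMapping[str, str] = os.environ) -> None:
    """Set TORCH_CUDNN_SDPA_ENABLED=0; call this before importing torch."""
    env["TORCH_CUDNN_SDPA_ENABLED"] = "0"


@contextmanager
def flash_attention_disabled():
    """Run the enclosed code with the flash SDPA kernel turned off."""
    import torch  # imported lazily so this module loads without PyTorch

    with torch.backends.cuda.sdp_kernel(enable_flash=False):
        yield


# Usage sketch (`dit` and `latents` are placeholders, not names from the repo):
#
#     with flash_attention_disabled():
#         output = dit(latents)
```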
Resuming from a checkpoint fails with a key mismatch when the LR scheduler state_dict is loaded with strict=True. Fix: pass strict=False to scheduler.load_state_dict(), or use PyTorch ≥ 2.2.
The GEAR format requires an annotation.task entry in modality.json for the language conditioning to be correctly routed. The SO-101 patch adds this mapping; without it, the UMT5 text encoder receives zero embeddings.
Setting dataloader_num_workers above 6 causes workers to be silently killed on RunPod H100 nodes. Recommended: num_workers=4.
Full architecture, training objective, and results writeup with six PaperBanana figures. paper.html →
Experimental results. Every video, plot, and metric from sanity tests through DreamGen rollouts. results.html →
Community. Record your own SO-101 scenes and help push the model's zero-shot generalisation. contribute.html →