How to Fine-Tune OpenVLA on Your Own Robot Dataset
OpenVLA-7B is the most popular open vision-language-action model. This tutorial walks through LoRA-based fine-tuning on your own RLDS dataset: hardware sizing, PEFT config, dataloader setup, training, and evaluation on a held-out episode or the real robot.
What you will accomplish
By the end of this tutorial you will have a LoRA-fine-tuned OpenVLA-7B checkpoint that is specialized to your task and your embodiment. LoRA (Low-Rank Adaptation) is the right choice for most teams: it reduces VRAM requirements from ~100 GB for full fine-tuning down to roughly 24 GB, converges quickly on modest datasets (a few hundred episodes), and produces a small adapter that is easy to distribute.
OpenVLA is maintained by the Stanford IRIS Lab and collaborators. The base 7B model was pretrained on Open X-Embodiment, a dataset of nearly a million trajectories across 22 embodiments. Fine-tuning concentrates that prior knowledge onto your specific robot, objects, and instructions.
Prerequisites
- A 24 GB+ VRAM GPU. A100 40/80 GB, H100, RTX 4090, or L40S all work. RTX 3090 with 24 GB works but is tight.
- An RLDS dataset of at least 50 episodes (200+ recommended) with image observations, proprioception, and natural-language task instructions.
- Familiarity with PyTorch, HuggingFace transformers, and imitation learning basics.
- Ubuntu 22.04 or 20.04 with CUDA 12.1+ drivers.
- 200 GB free disk for model weights, checkpoints, and your dataset.
If you do not yet have a dataset, go collect one first — see our LeRobot recording tutorial or browse open datasets on the SVRC datasets hub. You can also mix public datasets with a smaller set of your own episodes to give the model a targeted prior.
The steps
1. Check hardware and environment
Verify your GPU and CUDA:
```shell
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0), torch.version.cuda)"
```
You want `True`, your GPU name, and a CUDA version matching your PyTorch build. If you need a clean environment, create one:
```shell
conda create -n openvla python=3.10 -y
conda activate openvla
```
2. Install OpenVLA and dependencies
Install the OpenVLA codebase and PEFT stack. Refer to the upstream repo at github.com/openvla/openvla for the most current install target — the general pattern is:
```shell
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .
pip install peft bitsandbytes accelerate wandb
```
PEFT provides the LoRA implementation. bitsandbytes gives you 8-bit quantization so the base model fits in 24 GB. accelerate handles distributed training. wandb is optional but highly recommended for tracking runs.
3. Prepare your dataset in RLDS format
OpenVLA consumes RLDS (Reinforcement Learning Datasets), a TFDS-backed format from Google Research. Each episode contains a sequence of steps, where each step has `observation` (image, state), `action`, `is_terminal`, and a `language_instruction`. If your data is already in LeRobot or HDF5 format, you need to convert it. The Open X-Embodiment repository includes reference converters you can adapt. A minimal RLDS builder looks like:
```python
import tensorflow as tf
import tensorflow_datasets as tfds

class MyRobotDataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version('1.0.0')

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                'steps': tfds.features.Dataset({
                    'observation': tfds.features.FeaturesDict({
                        'image': tfds.features.Image(shape=(224, 224, 3)),
                        'state': tfds.features.Tensor(shape=(7,), dtype=tf.float32),
                    }),
                    'action': tfds.features.Tensor(shape=(7,), dtype=tf.float32),
                    'language_instruction': tfds.features.Text(),
                    'is_terminal': tfds.features.Scalar(dtype=tf.bool),
                }),
            }),
        )

    # You will also need _split_generators and _generate_examples
    # that yield steps from your episode files.
```
Point this builder at your episode files and register the dataset name in the OpenVLA dataset config. Action normalization stats (mean / std per dimension) must be computed and stored; the training loop uses them to normalize action targets.
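The normalization stats can be computed with a short script. This is a sketch under the assumption that each episode exposes an (N, 7) array under an `actions` key; adapt the loading code to your converter.

```python
import numpy as np

def compute_action_stats(episodes):
    """Per-dimension action stats over all steps in all episodes.

    `episodes` is assumed to be an iterable of dicts with an
    'actions' key holding an (N, 7) array (illustrative layout).
    """
    actions = np.concatenate([np.asarray(ep["actions"]) for ep in episodes], axis=0)
    return {
        "mean": actions.mean(axis=0),
        "std": actions.std(axis=0) + 1e-8,  # epsilon guards constant dimensions
    }

def normalize_action(action, stats):
    # The training loop applies exactly this transform to action targets.
    return (np.asarray(action) - stats["mean"]) / stats["std"]
```

Compute the stats on the training split only; evaluation episodes must be normalized with the training stats, not their own.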
4. Download OpenVLA-7B base weights
Pull the base model from HuggingFace Hub:
```shell
huggingface-cli download openvla/openvla-7b --local-dir ./openvla-7b
```
The download is about 15 GB. Optionally, cache it under `~/.cache/huggingface` for reuse across projects.
5. Configure LoRA
The default starting point for OpenVLA is LoRA rank 32 targeting the attention modules. In PEFT this looks like:
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
# Wrap the frozen base model with the adapters:
# model = get_peft_model(base_model, lora_config)
```
Rank 32 is a sensible default for 50 to 500 episodes. Drop to 16 if VRAM is tight; bump to 64 for larger datasets. Targeting `all-linear` matches the OpenVLA reference recipe; you can restrict to attention-only projections if memory is a constraint.
6. Launch fine-tuning
Use the reference finetune script from the OpenVLA repo. The exact entrypoint moves across releases — check the repo's README — the general invocation pattern is:
```shell
torchrun --standalone --nnodes 1 --nproc-per-node 1 \
  vla-scripts/finetune.py \
  --vla_path ./openvla-7b \
  --data_root_dir /path/to/rlds \
  --dataset_name my_robot_dataset \
  --run_root_dir ./runs \
  --adapter_tmp_dir ./adapter-tmp \
  --lora_rank 32 \
  --batch_size 16 \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug True \
  --wandb_project openvla-finetune
```
On a single A100 80 GB, expect about 1 hour per 10k steps. A typical 200-episode dataset converges in 20 to 40k steps.
Tip: enable `--image_aug True` for small datasets. The random crop and color jitter make a real difference on 50-episode datasets.
7. Monitor training
Watch three signals in WandB or TensorBoard: action MSE (your primary loss), gradient norm (should be stable, not spiking), and per-dimension action accuracy (token accuracy for the tokenized action head). If action MSE plateaus above 0.1 after 5k steps, your dataset is too small or your language instructions do not match the policy head's training distribution.
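To make the "plateau above 0.1" check concrete, here is a small helper for a logged loss history. It is illustrative, not part of the OpenVLA repo; the window size and thresholds are assumptions you should tune to your logging frequency.

```python
import numpy as np

def diagnose_plateau(loss_history, window=500, floor=0.1, rel_change=0.02):
    """Flag a run whose smoothed action MSE has stopped improving above `floor`.

    loss_history: list of per-step action MSE values (needs >= 2 * window entries).
    """
    recent = float(np.mean(loss_history[-window:]))
    earlier = float(np.mean(loss_history[-2 * window:-window]))
    plateaued = abs(earlier - recent) / max(earlier, 1e-8) < rel_change
    return {
        "recent_mse": recent,
        "plateaued": plateaued,
        "stuck_high": plateaued and recent > floor,  # the bad case from the text
    }
```

A run that is `plateaued` but below the floor may simply have converged; only `stuck_high` suggests a data or instruction-distribution problem.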
8. Merge LoRA and evaluate
At the end of training, merge the LoRA adapter into the base model for faster inference:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForVision2Seq

base = AutoModelForVision2Seq.from_pretrained(
    './openvla-7b', torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(base, './runs/my_robot_dataset/adapter')
merged = model.merge_and_unload()
merged.save_pretrained('./openvla-7b-my-robot')
```
Now evaluate: run the merged model on a held-out episode and compare predicted actions to ground truth, or deploy to the real robot and measure task success rate over 20 trials. For real-robot rollouts, use the OpenVLA inference wrapper, which handles image preprocessing and action denormalization for you.
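For the offline comparison, a simple metric script is enough. This sketch assumes you have already rolled the merged model over a held-out episode and collected predicted and ground-truth action arrays; the 0.05 tolerance is an illustrative default in normalized action units.

```python
import numpy as np

def compare_actions(predicted, ground_truth, tolerance=0.05):
    """Per-dimension MAE plus the fraction of action elements within tolerance.

    predicted, ground_truth: (N, action_dim) arrays of normalized actions.
    """
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    abs_err = np.abs(predicted - ground_truth)
    return {
        "per_dim_mae": abs_err.mean(axis=0),
        "frac_within_tol": float((abs_err < tolerance).mean()),
    }
```

A per-dimension breakdown is worth the extra line: a policy that is accurate on position but poor on gripper open/close shows up immediately in `per_dim_mae`.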
What to do next
Once a LoRA adapter works, three natural follow-ups: (1) scale to more data — add episodes from Open X-Embodiment or other public datasets to give the model a broader prior, (2) try full fine-tuning on an 80 GB GPU if you have one, which generally adds 2 to 5 percentage points of task success, and (3) compare to other VLA models — Pi-Zero, Octo, and RT-2 variants all have different strengths.
If your dataset was collected with LeRobot, our LeRobot recording tutorial covers the capture side. For bimanual tasks, see the ALOHA teleop rig tutorial.
Common failure modes
OOM on startup: enable 8-bit quantization with `--use_quantization True` or drop the batch size.
Action MSE does not decrease: check that your action normalization stats are correct. A frequent bug is computing stats on the wrong split.
Policy drives robot to joint limits: language instruction distribution in your dataset is too narrow; the model memorized instead of learned. Mix in more diverse tasks.
Inference is slow: always run with `torch.compile` and bfloat16 after merging LoRA.
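As a sketch of that inference setup (the checkpoint path is the merged model from step 8; the exact model and processor classes should be checked against the OpenVLA docs):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the merged checkpoint in bfloat16 and compile the forward pass.
processor = AutoProcessor.from_pretrained("./openvla-7b-my-robot", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "./openvla-7b-my-robot",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()
model = torch.compile(model)  # first call pays compilation cost; subsequent calls are faster
```

Because `torch.compile` recompiles on shape changes, keep image resolution and prompt formatting fixed across rollout steps.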
Deep dive: LoRA vs full fine-tune vs from scratch
Three options sit on the spectrum. Full fine-tune updates every parameter of the 7B model. It typically produces the best task success rate when you have enough data (500+ episodes) and an 80 GB GPU, because the model can adapt every representation to your embodiment. LoRA adds low-rank adapters to a subset of layers, freezes the base, and converges fast on smaller datasets. In our experiments on mid-size manipulation datasets, LoRA reaches 90 to 95 percent of full fine-tune performance at a quarter of the compute. From scratch almost never makes sense for VLA-scale models — you lose the Open X-Embodiment prior and need orders of magnitude more data.
A fourth path that has been gaining traction is DoRA (Weight-Decomposed Low-Rank Adaptation), which decomposes weights into magnitude and direction components. DoRA adds a small amount of parameter overhead over LoRA and often closes the gap to full fine-tune further. PEFT supports DoRA out of the box — if you have tried LoRA and hit a ceiling, swap it in with a one-line config change.
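The one-line change in PEFT looks like this; the sketch below carries over the earlier LoRA settings, which are a starting point rather than a tuned DoRA recipe.

```python
from peft import LoraConfig

dora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",
    use_dora=True,  # decompose adapted weights into magnitude and direction
    task_type="CAUSAL_LM",
)
```

Everything else in the training invocation stays the same; only the adapter math changes.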
Deep dive: dataset quality beats dataset size
Teams consistently over-invest in episode count and under-invest in episode diversity. 200 episodes across 20 lighting and object-pose variations beat 1000 episodes of the same scene. The reason lies in what fine-tuning actually learns: it is not primarily learning what to do (the base model already carries rough manipulation priors); it is learning to ground the policy in your camera intrinsics, your workspace, and your language instructions. Diversity in those variables gives the policy more signal to disentangle embodiment from task.
A concrete recipe that works: within your 200 episodes, vary initial object position across a 20 cm x 20 cm grid (10 positions minimum), record under 3 lighting conditions, and rotate through 5 natural-language phrasings of the same instruction. This is about 50% more work during collection, but dramatically improves the fine-tune.
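One way to pre-enumerate that collection plan so coverage is systematic rather than ad hoc. The specific positions, lighting names, and phrasings below are illustrative placeholders, not part of the recipe above:

```python
import itertools
import random

# 10 object positions on a 20 cm x 20 cm grid (cm coordinates), 3 lightings, 5 phrasings.
positions = [(x, y) for x in (0, 5, 10, 15, 20) for y in (0, 20)]
lighting = ["overhead", "dim", "side-lit"]
phrasings = [
    "pick up the red block",
    "grab the red cube",
    "lift the red block off the table",
    "pick the red block up and hold it",
    "take the red cube from the table",
]

plan = list(itertools.product(positions, lighting, phrasings))  # 10 * 3 * 5 = 150 combos
rng = random.Random(0)
rng.shuffle(plan)  # shuffle so variation is not correlated with collection time
# 200-episode budget: every combination once, plus a random subset repeated.
episodes = plan + rng.sample(plan, 200 - len(plan))
```

Printing `episodes` in order gives the operator a checklist, which also makes it easy to resume collection after an interruption.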
Deep dive: evaluation is where teams go wrong
The most common mistake is evaluating only on held-out episodes from the same recording session. That measures interpolation, not generalization. Real evaluation has three tiers: (1) held-out from same session — lower bound on capability, (2) held-out from a different day / lighting — real-world robustness check, and (3) zero-shot on a novel task phrasing or slightly novel object — what separates a usable VLA from a demo-only one. Budget at least 30 minutes of real-robot time per candidate checkpoint for proper eval.
Frequently asked questions
How many episodes do I need? 50 is a minimum for non-trivial results. 200 is a good target. 500+ saturates LoRA capacity for most single tasks.
Can I fine-tune on multi-task data? Yes, and it usually helps. The 7B OpenVLA backbone has headroom to absorb multiple tasks per adapter.
What about Pi-Zero or Octo? Pi-Zero is Physical Intelligence's flow-matching VLA — different training recipe, often higher success rate per episode but requires proprietary weights. Octo is a smaller open model useful for research. See our VLA models comparison.
Inference latency matters for my application. OpenVLA-7B inference is roughly 200 ms on a single A100. If that is too slow, look at smaller VLAs, quantization, or action-chunking strategies.