How to Fine-Tune OpenVLA on Your Own Robot Dataset

OpenVLA-7B is the most popular open vision-language-action model. This tutorial walks through LoRA-based fine-tuning on your own RLDS dataset: hardware sizing, PEFT config, dataloader setup, training, and evaluation on a held-out episode or the real robot.

VLA / Fine-tuning · Total time: about 3 hours (plus training time) · Difficulty: Advanced · Updated April 2026

What you will accomplish

By the end of this tutorial you will have a LoRA-fine-tuned OpenVLA-7B checkpoint that is specialized to your task and your embodiment. LoRA (Low-Rank Adaptation) is the right choice for most teams: it reduces VRAM requirements from ~100 GB for full fine-tuning down to roughly 24 GB, converges quickly on modest datasets (a few hundred episodes), and produces a small adapter that is easy to distribute.

OpenVLA is maintained by the Stanford IRIS Lab and collaborators. The base 7B model was pretrained on Open X-Embodiment, a dataset of nearly a million trajectories across 22 embodiments. Fine-tuning concentrates that prior knowledge onto your specific robot, objects, and instructions.

Prerequisites

If you do not yet have a dataset, go collect one first — see our LeRobot recording tutorial or browse open datasets on the SVRC datasets hub. You can also mix public datasets with a smaller set of your own episodes to give the model a targeted prior.

The steps

  1. Check hardware and environment

    Verify your GPU and CUDA:

    nvidia-smi
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0), torch.version.cuda)"

    You want True, your GPU name, and a CUDA version matching your PyTorch build. If you need a clean env, create one:

    conda create -n openvla python=3.10 -y
    conda activate openvla
  2. Install OpenVLA and dependencies

    Install the OpenVLA codebase and PEFT stack. Refer to the upstream repo at github.com/openvla/openvla for the most current install target — the general pattern is:

    git clone https://github.com/openvla/openvla.git
    cd openvla
    pip install -e .
    pip install peft bitsandbytes accelerate wandb

    PEFT provides the LoRA implementation. bitsandbytes gives you 8-bit quantization for the base model so it fits in 24 GB. accelerate handles distributed training. wandb is optional but highly recommended for tracking runs.

  3. Prepare your dataset in RLDS format

    OpenVLA consumes RLDS (Reinforcement Learning Datasets) — a TFDS-backed format from Google Research. Each episode contains a sequence of steps, where each step has observation (image, state), action, is_terminal, and a language_instruction.

    If your data is already in LeRobot or HDF5, you need to convert. The Open X-Embodiment repository includes reference converters you can adapt. A minimal RLDS builder looks like:

    import tensorflow as tf
    import tensorflow_datasets as tfds

    class MyRobotDataset(tfds.core.GeneratorBasedBuilder):
        VERSION = tfds.core.Version('1.0.0')

        def _info(self):
            return tfds.core.DatasetInfo(
                builder=self,
                features=tfds.features.FeaturesDict({
                    'steps': tfds.features.Dataset({
                        'observation': tfds.features.FeaturesDict({
                            'image': tfds.features.Image(shape=(224, 224, 3)),
                            'state': tfds.features.Tensor(shape=(7,), dtype=tf.float32),
                        }),
                        'action': tfds.features.Tensor(shape=(7,), dtype=tf.float32),
                        'language_instruction': tfds.features.Text(),
                        'is_terminal': tfds.features.Scalar(dtype=tf.bool),
                    }),
                })
            )

        # A complete builder also implements _split_generators and
        # _generate_examples to yield your episode files; see the TFDS docs.

    Point this builder at your episode files and register the dataset name in the OpenVLA dataset config. Action normalization stats (mean / std per dimension) must be computed and stored — the training loop uses them to normalize action targets.
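    Computing those stats takes only a few lines of NumPy — a minimal sketch, assuming each episode is held as a dict with an `actions` array (the key name is illustrative, not OpenVLA's own schema):

```python
import numpy as np

def compute_action_stats(episodes):
    """Stack every action across all episodes and compute per-dimension
    mean/std -- the stats the training loop uses to normalize targets."""
    actions = np.concatenate([ep["actions"] for ep in episodes], axis=0)
    return {
        "mean": actions.mean(axis=0),
        "std": actions.std(axis=0) + 1e-8,  # guard against constant dims
    }

def normalize(action, stats):
    # Apply the stored stats to a raw action vector
    return (action - stats["mean"]) / stats["std"]
```

    Make sure the stats are computed over the training split only, and store them alongside the dataset so inference can denormalize predictions later.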

  4. Download OpenVLA-7B base weights

    Pull the base model from HuggingFace Hub:

    huggingface-cli download openvla/openvla-7b --local-dir ./openvla-7b

    This is about 15 GB. Optionally, cache it under ~/.cache/huggingface for reuse across projects.

  5. Configure LoRA

    The default starting point for OpenVLA is LoRA rank 32 targeting the attention modules. In PEFT this looks like:

    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        target_modules="all-linear",
        lora_dropout=0.0,
        bias="none",
        task_type="CAUSAL_LM",
    )
    # Applied to the loaded base model with:
    # vla = get_peft_model(vla, lora_config)

    Rank 32 is a sensible default for 50 to 500 episodes. Drop to 16 if VRAM is tight; bump to 64 for larger datasets. Targeting all-linear matches the OpenVLA reference recipe; you can restrict to attention-only projections if memory is a constraint.
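    To build intuition for what rank buys you: a LoRA adapter on one linear layer of shape d_in x d_out adds r * (d_in + d_out) parameters. A quick back-of-envelope sketch (the 4096-wide projection is illustrative of a 7B transformer, not a measured OpenVLA layer size):

```python
def lora_params(d_in, d_out, r):
    # LoRA adds A (r x d_in) and B (d_out x r) beside the frozen weight
    return r * (d_in + d_out)

# A 4096 x 4096 projection at rank 32:
print(lora_params(4096, 4096, 32))  # 262144 params vs ~16.8M frozen
```

    Doubling the rank doubles the adapter size but leaves it tiny relative to the base model, which is why the merged checkpoint stays easy to distribute.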

  6. Launch fine-tuning

    Use the reference finetune script from the OpenVLA repo. The exact entrypoint moves across releases — check the repo's README — the general invocation pattern is:

    torchrun --standalone --nnodes 1 --nproc-per-node 1 \
      vla-scripts/finetune.py \
      --vla_path ./openvla-7b \
      --data_root_dir /path/to/rlds \
      --dataset_name my_robot_dataset \
      --run_root_dir ./runs \
      --adapter_tmp_dir ./adapter-tmp \
      --lora_rank 32 \
      --batch_size 16 \
      --grad_accumulation_steps 1 \
      --learning_rate 5e-4 \
      --image_aug True \
      --wandb_project openvla-finetune

    On a single A100 80 GB, expect about 1 hour per 10k steps. A typical 200-episode dataset converges in 20 to 40k steps.

    Tip: pass --image_aug True for small datasets. The random crop and color jitter augmentations make a real difference on 50-episode datasets.
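    As a sanity check on the step budget, translate steps into epochs over your data — a rough sketch with illustrative numbers (average episode length varies widely by task):

```python
def epochs_for(steps, episodes, avg_steps_per_episode, batch_size):
    # One optimizer step consumes batch_size transitions
    transitions = episodes * avg_steps_per_episode
    steps_per_epoch = transitions / batch_size
    return steps / steps_per_epoch

# 200 episodes of ~150 transitions each, batch 16, 30k optimizer steps
print(round(epochs_for(30_000, 200, 150, 16), 1))  # → 16.0 epochs
```

    Tens of epochs over a small dataset is normal for LoRA; if the number comes out in the hundreds, expect overfitting and shorten the run.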
  7. Monitor training

    Watch three signals in WandB or TensorBoard: action MSE (your primary loss), gradient norm (should be stable, not spiking), and per-dimension action accuracy (token accuracy for the tokenized action head). If action MSE plateaus above 0.1 after 5k steps, your dataset is likely too small or your language instructions do not match the policy head's training distribution.
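    If your logger does not already report the gradient norm, it can be computed directly from any nn.Module — a minimal sketch in plain PyTorch:

```python
import torch

def global_grad_norm(model):
    # L2 norm over all parameter gradients -- spikes here usually mean
    # a bad batch or a learning rate that is too high
    grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
    return torch.linalg.vector_norm(torch.cat(grads)).item()
```

    Call it right after loss.backward() and before optimizer.step() so the gradients are still populated.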

  8. Merge LoRA and evaluate

    At the end of training, merge the LoRA adapter into the base model for faster inference:

    import torch
    from peft import PeftModel
    from transformers import AutoModelForVision2Seq

    base = AutoModelForVision2Seq.from_pretrained(
        './openvla-7b', torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    model = PeftModel.from_pretrained(base, './runs/my_robot_dataset/adapter')
    merged = model.merge_and_unload()
    merged.save_pretrained('./openvla-7b-my-robot')

    Now evaluate: run the merged model on a held-out episode and compare predicted actions to ground truth, or deploy to the real robot and measure task success rate over 20 trials. For real-robot rollouts, use the OpenVLA inference wrapper that handles image preprocessing and action denormalization for you.
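    For the held-out comparison, per-dimension MSE is more informative than a single scalar, since translation, rotation, and gripper dims fail in different ways — a minimal sketch on toy arrays (the shapes are illustrative):

```python
import numpy as np

def per_dim_action_mse(pred, gt):
    """Per-dimension MSE between predicted and ground-truth actions.
    pred, gt: arrays of shape (num_steps, action_dim)."""
    return ((pred - gt) ** 2).mean(axis=0)

# Toy 2-step, 7-dim episode with a small error on dim 0 only
gt = np.zeros((2, 7))
pred = np.zeros((2, 7))
pred[:, 0] = 0.1
print(per_dim_action_mse(pred, gt))  # dim 0 ≈ 0.01, rest 0
```

    A checkpoint that looks fine on aggregate MSE but is bad on the gripper dimension alone will still fail on the robot, so inspect the vector, not just its mean.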

What to do next

Once a LoRA adapter works, three natural follow-ups: (1) scale to more data — add episodes from Open X-Embodiment or other public datasets to give the model a broader prior, (2) try full fine-tuning on an 80 GB GPU if you have one, which generally adds 2 to 5 percentage points of task success, and (3) compare to other VLA models — Pi-Zero, Octo, and RT-2 variants all have different strengths.

If your dataset was collected with LeRobot, our LeRobot recording tutorial covers the capture side. For bimanual tasks, see the ALOHA teleop rig tutorial.

Common failure modes

OOM on startup: enable 8-bit quantization with --use_quantization True or drop batch size.

Action MSE does not decrease: check that your action normalization stats are correct. A frequent bug is computing stats on the wrong split.

Policy drives robot to joint limits: language instruction distribution in your dataset is too narrow; the model memorized instead of learned. Mix in more diverse tasks.

Inference is slow: always run with torch.compile and bfloat16 after merging LoRA.
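The post-merge inference prep amounts to two calls — a sketch, assuming a standard nn.Module (torch.compile traces the graph lazily on the first forward pass, not at wrap time):

```python
import torch

def prepare_for_inference(model):
    # Cast to bfloat16, switch to eval mode, and wrap in torch.compile;
    # compilation actually happens on the first call
    model = model.to(torch.bfloat16).eval()
    return torch.compile(model)
```

Expect the first rollout step to be slow while the graph compiles; subsequent steps run at the optimized speed.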

Deep dive: LoRA vs full fine-tune vs from scratch

Three options sit on the spectrum. Full fine-tune updates every parameter of the 7B model. It typically produces the best task success rate when you have enough data (500+ episodes) and an 80 GB GPU, because the model can adapt every representation to your embodiment. LoRA adds low-rank adapters to a subset of layers, freezes the base, and converges fast on smaller datasets. In our experiments on mid-size manipulation datasets, LoRA reaches 90 to 95 percent of full fine-tune performance at a quarter of the compute. From scratch almost never makes sense for VLA-scale models — you lose the Open X-Embodiment prior and need orders of magnitude more data.

A fourth path that has been gaining traction is DoRA (Weight-Decomposed Low-Rank Adaptation), which decomposes weights into magnitude and direction components. DoRA adds a small amount of parameter overhead over LoRA and often closes the gap to full fine-tune further. PEFT supports DoRA out of the box — if you have tried LoRA and hit a ceiling, swap it in with a one-line config change.
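The one-line change in PEFT is `use_dora=True` on the same LoraConfig — a config sketch mirroring the LoRA setup from step 5 (other fields unchanged):

```python
from peft import LoraConfig

dora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    use_dora=True,  # the one-line change: weight-decomposed adaptation
    task_type="CAUSAL_LM",
)
```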

Deep dive: dataset quality beats dataset size

Teams consistently over-invest in episode count and under-invest in episode diversity. 200 episodes across 20 lighting and object-pose variations beats 1000 episodes of the same scene. The reason is that VLA fine-tuning is not primarily learning what to do (the base model already carries rough manipulation priors); it is learning to ground the policy to your camera intrinsics, your workspace, and your language instructions. Diversity in those variables gives the policy more degrees of freedom to disentangle embodiment from task.

A concrete recipe that works: within your 200 episodes, vary initial object position across a 20 cm x 20 cm grid (10 positions minimum), record under 3 lighting conditions, and rotate through 5 natural-language phrasings of the same instruction. This is about 50% more work during collection, but dramatically improves the fine-tune.
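The recipe enumerates cleanly as a condition grid — a small sketch (the lighting labels and instruction phrasings are illustrative placeholders):

```python
import itertools

# 10 cells of a 20 cm x 20 cm grid, 3 lighting setups, 5 phrasings
positions = [(x, y) for x in range(5) for y in range(2)]
lighting = ["bright", "dim", "side-lit"]
phrasings = [
    "pick up the red block",
    "grab the red cube",
    "lift the red block",
    "pick the red cube up",
    "get the red block",
]
plan = list(itertools.product(positions, lighting, phrasings))
print(len(plan))  # 150 distinct conditions
```

With ~200 episodes you cover each condition at least once, which is exactly the diversity the fine-tune needs.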

Deep dive: evaluation is where teams go wrong

The most common mistake is evaluating only on held-out episodes from the same recording session. That measures interpolation, not generalization. Real evaluation has three tiers: (1) held-out from same session — lower bound on capability, (2) held-out from a different day / lighting — real-world robustness check, and (3) zero-shot on a novel task phrasing or slightly novel object — what separates a usable VLA from a demo-only one. Budget at least 30 minutes of real-robot time per candidate checkpoint for proper eval.

Frequently asked questions

How many episodes do I need? 50 is a minimum for non-trivial results. 200 is a good target. 500+ saturates LoRA capacity for most single tasks.

Can I fine-tune on multi-task data? Yes, and it usually helps. The 7B OpenVLA backbone has headroom to absorb multiple tasks per adapter.

What about Pi-Zero or Octo? Pi-Zero is Physical Intelligence's flow-matching VLA — a different training recipe that often reaches higher success rates per episode; its weights were initially closed but have since been released through the openpi repository. Octo is a smaller open model useful for research. See our VLA models comparison.

Inference latency matters for my application. OpenVLA-7B inference is roughly 200 ms on a single A100. If that is too slow, look at smaller VLAs, quantization, or action-chunking strategies.
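The latency arithmetic is worth making explicit: action chunking amortizes one forward pass over several actions — a quick sketch (200 ms matches the A100 figure above; the chunk size of 8 is illustrative):

```python
def control_rate_hz(latency_ms, chunk_size=1):
    # One forward pass every latency_ms yields chunk_size actions,
    # so the effective control rate scales linearly with the chunk
    return 1000 * chunk_size / latency_ms

print(control_rate_hz(200))     # 5.0 Hz, one action per pass
print(control_rate_hz(200, 8))  # 40.0 Hz effective with 8-action chunks
```

The trade-off is that chunked actions are open-loop within a chunk, so longer chunks react more slowly to disturbances.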

Related tutorials and resources