Scaling VLA Model Training on a Budget: Lessons From 100+ Fine-Tunes

Fine-tuning a 7B-parameter vision-language-action model sounds like a frontier-lab-only activity. It is not, if you pick the right base model, the right parameter-efficient method, and the right failure modes to optimize for. Here is what we have learned running this loop hundreds of times.

Published 2026-04-03 by the Silicon Valley Robotics Center research team.

TL;DR. You do not need an H100 cluster to fine-tune a useful VLA. A single H100 with LoRA (or QLoRA on a smaller budget) plus FlashAttention-2 and gradient checkpointing will fine-tune OpenVLA on a few hundred episodes overnight. Octo fine-tunes on a consumer 24GB card in an afternoon. The scarce resource is almost never FLOPs — it is clean, well-labeled, well-bucketed demonstration data. Buy compute on demand; invest structurally in data. Choose Octo for small datasets and single embodiments; choose OpenVLA when you have >500 episodes or need language conditioning.

1. Frame the budget correctly

Before picking a GPU or a model, frame your budget as a single number: cost per "useful evaluation." Not cost per epoch, not cost per gradient step, not cost per TFLOP. A useful evaluation is one successful rollout on a real robot, measured against a fixed evaluation protocol. Every dollar spent on compute, engineers, or data should be traceable to improving that number.

This sounds obvious and is routinely violated. Teams spend six weeks optimizing a training recipe to shave 15% off wall-clock time, then discover their evaluation protocol cannot distinguish two policies that differ by 20 percentage points in real-world success rate. A well-instrumented evaluation suite is worth more than a second GPU.

2. Choose the base model by dataset size, not by prestige

Octo

Octo is the right default when your demonstration dataset is modest (roughly <500 episodes per task), your embodiment is a single known arm, and you do not need open-vocabulary language conditioning. It fine-tunes on a single 24GB consumer GPU in hours, has permissive licensing, and is more forgiving of noisy data than larger VLAs. Its action decoder is simple, which means the failure modes are easy to diagnose.

OpenVLA

OpenVLA is the right default when you have >500 demonstrations, need language-conditioned behavior, or want to benefit from cross-embodiment pre-training. The 7B-parameter base means you will want at least a 40GB A100 or, ideally, an 80GB H100 for comfortable batch sizes with LoRA. The reference implementation lives at github.com/openvla/openvla. Our VLA models explained post covers the architecture side; our curated list lives at /vla-models/.

pi0 and others

Several frontier VLAs (pi0 and descendants) outperform OpenVLA on the hardest tasks but carry heavier compute and licensing costs. For most teams this is premature optimization — start with OpenVLA or Octo, establish a working pipeline, and revisit.

Rule of thumb. Double your dataset before you double your model. Going from 300 to 600 well-curated demonstrations reliably outperforms going from a 1B to a 7B model, at a fraction of the compute cost.

3. Parameter-efficient fine-tuning: LoRA, QLoRA, and when each wins

LoRA

LoRA attaches small trainable rank-r adapter matrices to each attention and MLP weight, freezing the base. For a 7B VLA, LoRA with rank 16-32 reduces trainable parameters by ~100x, dramatically lowers memory, and in our experience matches full fine-tune performance on most downstream manipulation tasks. The rank-32 setting is our default for new projects.
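The rank-32 default above can be sketched with the Hugging Face PEFT library. This is a minimal configuration sketch, not the document's exact recipe; the `target_modules` names are typical for LLaMA-style backbones and are an assumption — check your base model's actual layer names.

```python
# Minimal LoRA setup via Hugging Face PEFT (sketch; target_modules are
# an assumption for LLaMA-style backbones -- verify against your model).
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                      # our default rank for new projects
    lora_alpha=32,             # common convention: alpha == r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()  # expect on the order of 1% trainable
```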

QLoRA

QLoRA loads the frozen base model in 4-bit NF4 quantization and trains LoRA adapters on top. It roughly halves the memory of LoRA at a small throughput cost. QLoRA is what makes OpenVLA fine-tuning feasible on a single A100 40GB or a consumer card with 24-32GB, and it is our recommendation for teams without H100 access. The Hugging Face PEFT library makes both methods a few lines of code.
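Loading the base in 4-bit NF4 is a few lines with transformers + bitsandbytes. A sketch under assumptions: the checkpoint name `openvla/openvla-7b` and the `trust_remote_code` requirement should be verified against the OpenVLA repo before use.

```python
# QLoRA sketch: frozen 4-bit NF4 base, LoRA adapters trained on top.
# Checkpoint name is an assumption -- verify against the OpenVLA repo.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,     # small extra memory saving
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
# ...then attach LoRA adapters with peft.get_peft_model as usual.
```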

When to full fine-tune

Full fine-tune beats LoRA only when (a) you have a very large dataset (>10k episodes, embodiment-diverse) or (b) the downstream task is sufficiently different from pre-training that adapters cannot bridge the gap. For most practitioners, those conditions do not hold.

4. Memory and throughput tricks that actually matter

FlashAttention-2

FlashAttention-2 is not optional. Use the version bundled with your transformer library of choice. On a 7B model with 1024-token context, it typically cuts attention memory by 3-4x and speeds up training end-to-end by 20-40%. The reference is at github.com/Dao-AILab/flash-attention.
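With recent transformers versions, enabling FlashAttention-2 is a load-time flag rather than a code change. A sketch, assuming the flash-attn package is installed, an Ampere-or-newer GPU, and the same illustrative checkpoint name as above:

```python
# Enabling FlashAttention-2 at load time (transformers >= 4.36).
# Requires the flash-attn package and an Ampere+ GPU; checkpoint name
# is illustrative.
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
```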

Gradient checkpointing

Trade compute for memory. With gradient checkpointing on, peak activation memory for a 7B model drops by roughly a factor of 2 at the cost of ~20-30% throughput. On memory-constrained cards this is an unambiguous win.

Mixed-precision training

Use bfloat16 on Ampere and Hopper; prefer it over fp16 for numerical stability on VLA fine-tunes. Keep the optimizer state in fp32 unless you are actively memory-starved; the stability dividend is worth the extra memory.
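Gradient checkpointing and bf16 combine naturally in a Hugging Face `TrainingArguments` config. A sketch, not a tuned recipe — the batch size and learning rate are placeholder values; with `bf16=True` the AdamW optimizer state stays in fp32 by default, matching the advice above.

```python
# Trainer config sketch combining the memory tricks above. Values are
# placeholders, not a tuned recipe.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="runs/vla-lora",
    bf16=True,                       # bfloat16 autocast on Ampere/Hopper
    gradient_checkpointing=True,     # ~2x activation memory for ~20-30% speed
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # effective batch size 64
    learning_rate=2e-4,
    logging_steps=10,
)
```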

Effective batch size via accumulation

If your hardware cannot fit the batch size your recipe wants, use gradient accumulation to build it up. For LoRA fine-tunes, effective batch sizes of 64-128 are usually more than enough; pushing higher rarely helps.
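Why accumulation is a faithful substitute: when the loss is a mean over examples, the full-batch gradient equals the weighted average of micro-batch gradients. A framework-free sketch on a toy scalar model:

```python
# Framework-free demo: accumulating scaled micro-batch gradients
# reproduces the full-batch gradient exactly for a mean-over-examples
# loss. Toy model: f(w) = mean_i (w*x_i - y_i)^2.

def grad(w, xs, ys):
    # d/dw of the mean squared error over this batch
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1, 6.8, 8.3]
w = 0.3

full = grad(w, xs, ys)        # one "big batch" of 8

accum, micro = 0.0, 4         # two micro-batches of 4
for i in range(0, len(xs), micro):
    g = grad(w, xs[i:i + micro], ys[i:i + micro])
    accum += g * micro / len(xs)   # scale by micro / effective batch

assert abs(full - accum) < 1e-9    # identical up to float rounding
```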

Compile

PyTorch 2.x torch.compile gives us 10-25% throughput in most cases with zero code changes, as long as the model has no shape-varying control flow. Always worth trying once per recipe.

5. Compute choices: what runs where

| GPU | VRAM | OpenVLA (QLoRA) | OpenVLA (LoRA bf16) | Octo fine-tune | Typical use |
|---|---|---|---|---|---|
| RTX 3090 / 4090 | 24GB | Works, small batch | Tight, small batch | Comfortable | Solo researchers, prototypes |
| A6000 / L40S | 48GB | Comfortable | Works | Very comfortable | Small labs, shared workstation |
| A100 40GB | 40GB | Comfortable | Works, modest batch | Very comfortable | Cloud default, good value |
| A100 80GB / H100 80GB | 80GB | Large batch | Comfortable | Overkill | Production, multi-run sweeps |

Cloud pricing varies wildly; spot instances on major clouds will reliably beat on-demand by 40-70% if your training loop tolerates preemption. Checkpoint every 15-30 minutes so a preempted run loses little work and remains reproducible.
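A time-based checkpoint trigger is a few lines. A sketch: the pure `should_checkpoint` predicate is easy to test, and `save_state` is a hypothetical helper standing in for whatever serializes your model, optimizer, scheduler, and step counter.

```python
# Sketch: time-based checkpointing for preemptible (spot) training.
# save_state is a hypothetical helper -- save optimizer, scheduler, and
# step alongside the weights, not the weights alone.
import time

CHECKPOINT_EVERY_S = 20 * 60   # anywhere in the 15-30 minute range

def should_checkpoint(last_save: float, now: float,
                      interval: float = CHECKPOINT_EVERY_S) -> bool:
    return now - last_save >= interval

last_save = time.monotonic()
# inside the training loop:
#   if should_checkpoint(last_save, time.monotonic()):
#       save_state(model, optimizer, scheduler, step)  # hypothetical
#       last_save = time.monotonic()
```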

6. Dataset size vs model size: the real tradeoff

In our experience across 100+ fine-tunes, the single strongest predictor of downstream success rate is demonstration data quality, not model size. The second strongest is task-appropriate camera setup. Model size is third.

Concretely: a 7B OpenVLA fine-tuned on 200 noisy, inconsistently-labeled demonstrations will routinely underperform a smaller Octo trained on the same 200 demos after a day of curation. The curation work is described in full in the teleoperation data quality checklist. Our curated and validated datasets sit in the SVRC dataset catalog and our data services team runs curation engagements.

There is a practical corollary: if your fine-tune is not working, your first three hypotheses should all be about data. Bad episode boundaries, inconsistent action normalization, camera drift between recording sessions, and mislabeled success flags are together responsible for the majority of fine-tune failures we have diagnosed.

7. Evaluation: the other half of the budget

A fine-tuned VLA is a liability without a trusted evaluation loop. We recommend three evaluation modes, run in this order:

  • Offline action-MSE. Quick sanity check. Cheap. Does not predict real-world success.
  • In-sim rollout. A sim environment that mirrors your real task. Useful as a pre-filter before burning robot time. Our sim-to-real tips cover the gotchas.
  • Real-robot rollout. The only source of truth. Budget at least 50 rollouts per evaluation. Our deployment checklist walks through the safety prep.

For teams without a dedicated eval robot, the SVRC leasing program can provide evaluation time on calibrated hardware — useful for reviewer requests and reproducibility guarantees.

8. Budget archetypes

The solo researcher ($0-$500/month)

One 3090 or 4090. QLoRA fine-tunes of Octo and OpenVLA. Cloud spot for the occasional larger run. Focus: one task, one embodiment, rigorous evaluation. Our LeRobot getting started guide is the fastest on-ramp.

The small lab ($2K-$5K/month)

One A6000 or shared A100 via cloud. LoRA OpenVLA as default; Octo for rapid iteration. Invest heavily in a small (200-1000 episode) curated dataset per task. Compare robot platforms before committing hardware.

The enterprise pilot ($20-50K/month)

Reserve H100 access, build multi-seed sweeps into every experiment, invest in a dedicated data-curation pipeline. At this budget the bottleneck is rarely compute; it is data throughput and evaluation rigor. Our data services team handles dataset-building for teams at this scale.

9. Common pitfalls

  • Over-training. LoRA adapters overfit quickly on small datasets. Early-stop on a held-out success rate, not on training loss.
  • Action normalization mismatches. If your dataset was recorded with one normalization and you deploy with another, you will get a policy that looks sane on paper and fails on hardware. Freeze your normalization and version it.
  • Camera drift. Extrinsic calibration wanders over months. Re-calibrate before every major data collection run. See robot camera setup.
  • Episode boundary errors. If your dataset has mis-segmented episodes, LoRA will dutifully learn the wrong thing. Spend real time on boundary review.
  • Claiming a gap that does not exist. If your held-out evaluation has 50 trials and you see a 5% difference, you have measurement noise, not a result. Budget for statistical power.
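The arithmetic behind that last bullet, assuming each rollout is an independent Bernoulli trial: the 95% normal-approximation half-width of an estimated success rate is 1.96·sqrt(p(1-p)/n), which at n = 50 is roughly 14 percentage points.

```python
# Back-of-envelope confidence half-width for a success-rate estimate,
# assuming independent Bernoulli rollouts and a normal approximation.
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

print(round(ci_half_width(0.5, 50), 3))    # ~0.139: a 5-point gap is noise
print(round(ci_half_width(0.5, 500), 3))   # ~0.044: 10x trials, ~3x tighter
```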

10. Closing note

The VLA fine-tune playbook is no longer research-lab-only. Octo and OpenVLA plus LoRA plus a single well-used GPU is enough to produce publication-quality results for most single-task, single-embodiment problems. What separates teams that succeed from teams that struggle is not compute — it is the discipline of treating data, evaluation, and hardware calibration as first-class engineering. Spend accordingly.

For hardware, our store, compare tool, and buyer guides are the fastest way to size a program. For bench-level details, our tutorials walk through specific training recipes. When you are ready to evaluate at scale, get in touch.

11. Frequently asked questions

Does LoRA rank matter much?

Rank 16-32 works for most VLA fine-tunes. Rank 8 occasionally underfits on bimanual tasks; rank 64 occasionally overfits on small datasets. If you are unsure, start at 32 and only tune if you have a specific symptom.

Should we pre-train from scratch?

Almost never. Pre-training a 7B VLA from scratch requires budgets that make fine-tuning irrelevant. Start from OpenVLA or Octo checkpoints. The only case for from-scratch training is a genuinely novel action space not covered by any open checkpoint.

How do we decide when training is done?

Not by loss curves. By held-out evaluation success rate, ideally on a real robot. Fix an evaluation budget (e.g. 50 rollouts per checkpoint) and early-stop on that metric, with a patience of 1-3 eval runs.
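The stopping rule above can be sketched in a few lines of plain Python. The success rates would come from your real 50-rollout evals; the function name and structure are illustrative, not a library API.

```python
# Early stopping on evaluation success rate (not loss), with patience.
# Pure-Python sketch; success_rates come from per-checkpoint rollouts.

def best_checkpoint(success_rates, patience=2):
    """Index of the checkpoint to keep, stopping after `patience`
    consecutive evals without improvement."""
    best_i, best = 0, success_rates[0]
    since_improve = 0
    for i, r in enumerate(success_rates[1:], start=1):
        if r > best:
            best_i, best, since_improve = i, r, 0
        else:
            since_improve += 1
            if since_improve >= patience:
                break
    return best_i

# e.g. rates from 50-rollout evals per checkpoint:
print(best_checkpoint([0.42, 0.55, 0.61, 0.58, 0.57, 0.60]))  # -> 2
```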

What about Diffusion Policy memory usage?

Diffusion Policy is often lighter than VLAs because the visual backbone is smaller. A single 24GB card comfortably trains it on most tabletop tasks. The bottleneck is usually data loading, not GPU memory.