ALOHA Robot: What It Is, How It Works, and How to Get Started

ALOHA is a bimanual teleoperation platform from Stanford University that demonstrated a robot learning dexterous two-handed manipulation tasks, such as manipulating cables, inserting batteries, and preparing food, from a small number of human demonstrations. It has since become one of the most widely referenced bimanual research platforms in robotics. This guide explains what ALOHA is, how it works, and how to start using it.

The Stanford Origin Story

ALOHA, short for A Low-cost Open-source Hardware System for Bimanual Teleoperation, was developed at Stanford University and introduced in the paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" by Tony Z. Zhao et al. in 2023. The central thesis was provocative: you do not need expensive, proprietary robot hardware to perform impressive dexterous manipulation. ALOHA used four off-the-shelf arms, two ViperX 300 followers and two WidowX 250 leaders (one leader-follower pair per side), costing under $20,000 total, combined with the ACT algorithm, to perform tasks that had previously required custom-engineered systems costing many times more.

The paper demonstrated a set of fine-grained bimanual tasks, including inserting a battery into a slot, opening a small condiment cup, and threading a cable tie, most with high success rates from roughly 50 demonstrations per task. These results impressed the robotics community not because the tasks were novel, but because of the cost and data efficiency. ALOHA and ACT together established a new benchmark for accessible dexterous manipulation research and triggered a wave of follow-on work that continues today.

The ALOHA hardware design and all software are fully open-source. The bill of materials, assembly instructions, and ACT training code are publicly available on GitHub. This openness has made ALOHA the de facto standard bimanual research platform, with dozens of research groups worldwide running variants of the original design. SVRC supports ALOHA-class platforms through our data services and hardware leasing program.

Full Hardware Specification

Understanding the exact hardware configuration matters because small deviations — a different arm model, a misaligned camera, a lower-torque servo — can produce datasets that are incompatible with the reference ACT implementation.

Arms: ViperX 300 6DOF (Follower) and WidowX 250 6DOF (Leader)

| Spec | ViperX 300 (Follower) | WidowX 250 (Leader) |
| --- | --- | --- |
| DOF | 6 + gripper | 6 + gripper |
| Reach | 750mm | 625mm |
| Payload | 750g (at full extension) | 250g (at full extension) |
| Repeatability | ~1mm | ~2mm |
| Servos | DYNAMIXEL XM540 + XM430 | DYNAMIXEL XM430 + XL430 |
| Communication | USB via U2D2 adapter, 1Mbps | USB via U2D2 adapter, 1Mbps |
| Control rate | 50Hz (20ms loop) | 50Hz (reads only) |
| Weight | 3.6kg | 2.1kg |
| Price (per arm) | ~$5,500 | ~$3,200 |

Camera System

The original paper used 3 Logitech C922 webcams (2 wrist-mounted + 1 overhead) recording at 640x480 resolution, 50fps. ALOHA 2 upgraded to Intel RealSense D405 cameras for improved image quality and optional depth. SVRC's ALOHA stations use RealSense D435i cameras with hardware sync for consistent multi-camera timing — see our camera setup guide for the rationale behind this upgrade.

Full Assembly Cost Breakdown

| Component | Qty | Unit Cost | Subtotal |
| --- | --- | --- | --- |
| ViperX 300 6DOF arms (follower) | 2 | $5,500 | $11,000 |
| WidowX 250 6DOF arms (leader) | 2 | $3,200 | $6,400 |
| Cameras (RealSense D435i or D405) | 3-4 | $300 | $900-1,200 |
| U2D2 USB adapters | 4 | $30 | $120 |
| Power supplies (12V) | 4 | $35 | $140 |
| Aluminum extrusion frame + brackets | 1 set | $500 | $500 |
| Camera mounts, cable management, misc | 1 set | $200 | $200 |
| Workstation PC (RTX 4090, 32GB RAM) | 1 | $2,500 | $2,500 |
| Total (DIY build) | | | ~$21,760-22,060 |
| Total (with Mobile ALOHA base) | + AgileX Tracer ($10,000) | | ~$32,000 |

Assembly time for an experienced builder is approximately 20-30 hours. For a first build, budget 40-60 hours including debugging Dynamixel communication issues, camera calibration, and software setup. The most common assembly failure points are:

  • Dynamixel daisy-chain wiring: incorrect bus ID assignment causes arm segments to respond out of order
  • Camera cable routing through the wrist: cables that are too tight restrict wrist rotation
  • Frame rigidity: undertightened extrusion bolts create vibration that affects camera image quality

Software Setup: From Clone to First Demo

The ALOHA software stack has two components: the teleoperation control loop (reading leader, commanding follower, recording data) and the ACT training pipeline.

```shell
# Clone and install the ACT repository
git clone https://github.com/tonyzhaozh/act.git
cd act
conda create -n aloha python=3.8.16 -y
conda activate aloha
pip install -r requirements.txt

# Install Interbotix SDK for arm control
# Follow: https://docs.trossenrobotics.com/interbotix_xsarms_docs/
sudo apt install ros-humble-interbotix-xsarm-control

# Configure arm USB devices
# Each arm needs a unique USB port mapping in /etc/udev/rules.d/
#   Leader left:    /dev/ttyDXL_leader_left
#   Leader right:   /dev/ttyDXL_leader_right
#   Follower left:  /dev/ttyDXL_follower_left
#   Follower right: /dev/ttyDXL_follower_right

# Test arm communication
python scripted_test.py   # moves each arm through safe test positions

# Record your first episode
python record_episodes.py \
    --task_name cup_transfer \
    --episode_idx 0 \
    --num_episodes 50
```

The recording script produces HDF5 files with the following structure per episode: /observations/images/cam_high (overhead), /observations/images/cam_left_wrist, /observations/images/cam_right_wrist, /observations/qpos (14-dim joint positions: 7 per arm), /action (14-dim joint position targets), and /timestamps. Each episode is a self-contained HDF5 file, typically 20-50MB depending on camera resolution and episode length.
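As a sketch of working with this schema, the snippet below writes and then reads a dummy episode with h5py. The group layout follows the structure described above, but the image dimensions are downscaled and the /timestamps dataset is omitted to keep the example small.

```python
import h5py
import numpy as np

T = 100        # episode length in timesteps (2 seconds at 50Hz)
H, W = 48, 64  # downscaled from the real 480x640 to keep this sketch small

# Write a dummy episode in the ALOHA-style layout described above.
with h5py.File("episode_0.hdf5", "w") as f:
    obs = f.create_group("observations")
    img = obs.create_group("images")
    for cam in ("cam_high", "cam_left_wrist", "cam_right_wrist"):
        img.create_dataset(cam, data=np.zeros((T, H, W, 3), dtype=np.uint8))
    obs.create_dataset("qpos", data=np.zeros((T, 14), dtype=np.float32))
    f.create_dataset("action", data=np.zeros((T, 14), dtype=np.float32))

# Read it back the way a training dataloader would.
with h5py.File("episode_0.hdf5", "r") as f:
    qpos = f["/observations/qpos"][:]
    action = f["/action"][:]
    cameras = sorted(f["/observations/images"].keys())

print(qpos.shape, action.shape, cameras)
```

Because each episode is self-contained, a dataloader can open files lazily and shuffle at the episode level without touching the images until batch time.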

ROS2 Integration

While the original ALOHA code uses the Interbotix Python SDK directly (not ROS), many teams integrate with ROS2 for compatibility with MoveIt2, camera drivers, and visualization. The interbotix_ros_manipulators package provides ROS2 nodes for ViperX and WidowX arms. Key considerations for ROS2 ALOHA setups:

  • Set the Dynamixel baud rate to 1Mbps for both leader and follower arms to maintain 50Hz control rate
  • Use a dedicated USB hub (not a daisy-chained hub) to avoid bus contention between 4 simultaneous USB-serial connections
  • Configure camera drivers to publish compressed images to reduce bandwidth — raw 640x480 RGB at 30fps per camera is 27.6 MB/s per camera, 110 MB/s for 4 cameras
  • Use ROS2 message filters for time synchronization across camera topics
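The bandwidth figures above come from straightforward arithmetic, worth re-running whenever you change resolution or frame rate:

```python
# Back-of-envelope uncompressed camera bandwidth (1 MB = 1e6 bytes).
def raw_rgb_bandwidth_mb_s(width, height, fps, bytes_per_pixel=3):
    """Raw RGB bandwidth for a single camera, in MB/s."""
    return width * height * bytes_per_pixel * fps / 1e6

per_cam = raw_rgb_bandwidth_mb_s(640, 480, 30)
print(f"{per_cam:.1f} MB/s per camera, {4 * per_cam:.1f} MB/s for 4 cameras")
# → 27.6 MB/s per camera, 110.6 MB/s for 4 cameras
```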

Hardware Architecture: Bimanual Leader-Follower Setup

The ALOHA system consists of two kinematic pairs, one for each arm. Each pair has a "leader" arm, a lightweight, back-drivable arm that the operator holds and moves by hand, and a "follower" arm that mirrors the leader's joint positions in real time. The follower arm carries the actual manipulator (gripper, tool, or end-effector) and interacts with the physical world. The leader arm has no end-effector payload requirements; it only needs to be back-drivable enough for the operator to move it freely. Note that base ALOHA provides no force or torque feedback through the leader: the operator judges contact visually.
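The mirroring loop at the heart of teleoperation is simple: read the leader's joint positions and command them to the follower at 50Hz. Below is a runnable pure-Python sketch; the read/command functions are hypothetical stand-ins for the Interbotix SDK calls, and a real bimanual setup runs one such loop per arm pair.

```python
CONTROL_HZ = 50
DT = 1.0 / CONTROL_HZ        # 20ms loop, matching the spec table above

follower_qpos = [0.0] * 7    # 6 joints + gripper for one arm

def read_leader_qpos(step):
    # Stand-in for an SDK read: pretend the operator slowly raises joint 0.
    return [0.01 * step] + [0.0] * 6

def command_follower_qpos(q):
    # Stand-in for an SDK joint-position command.
    global follower_qpos
    follower_qpos = list(q)

for step in range(10):       # a real loop runs until the episode ends
    command_follower_qpos(read_leader_qpos(step))
    # time.sleep(DT) would pace the loop at 50Hz on real hardware
```

During data collection, each iteration also snapshots the camera images and the commanded positions, which become the /observations and /action entries of the recorded episode.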

The bimanual configuration — two complete leader-follower pairs — is what makes ALOHA uniquely capable for dexterous tasks. Human hands are bimanual by nature: one hand holds the object while the other manipulates it, or both hands cooperate to complete a task that requires two simultaneous contact points. Single-arm robots can only approximate these tasks with complex fixtures or sequencing; bimanual robots can handle them directly. The ALOHA form factor, with both arms mounted on a shared table fixture, is optimized for tabletop manipulation tasks where the operator sits in front of the system.

The camera setup in the original ALOHA paper used three cameras: one overhead (bird's-eye view of the full workspace), one on the left wrist, and one on the right wrist. All three cameras are used as visual observations for the ACT policy. This multi-view setup is critical: the wrist cameras provide close-up views of grasping and contact events, while the overhead camera provides global context for two-handed coordination. Single-camera ALOHA variants show measurably lower policy performance on coordination-heavy tasks.

Alternative Bimanual Setups: SVRC DK1 Kit

ALOHA is not the only option for bimanual teleoperation research. SVRC's DK1 bimanual kit offers several advantages for teams that prioritize data collection throughput over exact ALOHA compatibility:

| Feature | ALOHA (DIY Build) | SVRC DK1 Kit |
| --- | --- | --- |
| Arms | ViperX 300 + WidowX 250 | OpenArm 101 leader-follower pairs |
| Total cost | ~$22,000 + assembly | $9,000 pre-assembled |
| Assembly time | 20-60 hours | 2 hours (unbox + calibrate) |
| Payload (per arm) | 750g | 500g |
| Data format | HDF5 (custom schema) | HDF5 + LeRobot export |
| ACT compatibility | Native | Via format conversion |
| Community size | Largest (100+ labs) | Growing (SVRC ecosystem) |

The DK1 is the better choice for teams that need to start collecting data immediately and plan to train with Diffusion Policy or VLAs (which are architecture-agnostic regarding the specific robot). ALOHA is the better choice for teams that need exact compatibility with published ACT results or want to contribute to the ALOHA open-source ecosystem. See our hardware catalog for DK1 details.

ALOHA Use Cases: What Works and What Does Not

Tasks Where ALOHA Excels

  • Tabletop bimanual manipulation: Opening containers, folding cloth, transferring between hands, two-handed assembly. The leader-follower interface gives operators natural bimanual control.
  • Food preparation research: Stirring, scooping, pouring, cutting soft materials. The 750g payload handles most kitchen items.
  • Cable and deformable object manipulation: Threading, wrapping, tying. The wrist cameras provide the close-up view needed for deformable contact tasks.
  • Data collection at scale: The open-source design means you can build multiple identical stations. SVRC runs 4 ALOHA-class stations simultaneously for high-volume campaigns.

Tasks Where ALOHA Struggles

  • Heavy object manipulation: The 750g payload limit (at full extension) excludes many industrial objects. A water bottle, book, or power tool exceeds the safe handling envelope.
  • Precision insertion below 1mm: The ~1mm repeatability of ViperX servos limits precision. For sub-millimeter tasks, you need either force feedback (not included in base ALOHA) or a higher-precision arm platform.
  • Large workspace tasks: The 750mm reach constrains the workspace to a roughly 1m x 0.5m tabletop area. Tasks requiring movement across a kitchen counter or between shelves need Mobile ALOHA or a different platform.
  • High-speed manipulation: The 50Hz control loop limits reactive speed. Tasks requiring fast reflexive responses to unexpected perturbations (catching a thrown object, dynamic handover) are beyond current ALOHA capability.

ACT: The Algorithm Behind ALOHA

ACT (Action Chunking with Transformers) was developed alongside ALOHA and is the primary learning algorithm for the platform. ACT is a transformer-based imitation learning policy that predicts a chunk of future joint positions — typically 100 timesteps at 50Hz, covering 2 seconds of motion — rather than a single next action. This action chunking architecture substantially reduces the compounding error problem of naive behavioral cloning, where small prediction mistakes at each timestep accumulate into large trajectory deviations over the course of a task.
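A toy sketch of the chunking idea: the policy is queried once per chunk and the controller executes all k actions open-loop before the next query. The stand-in "policy" below just interpolates joint positions toward a fixed target and is not ACT itself.

```python
import numpy as np

CHUNK = 100    # timesteps per chunk: 2 seconds at 50Hz, as above
DIM = 14       # bimanual joint-position action space (7 per arm)

def chunked_policy(qpos, target, k=CHUNK):
    """Hypothetical chunked policy: k interpolated steps from qpos to target."""
    return np.linspace(qpos, target, num=k)   # shape (k, DIM)

qpos = np.zeros(DIM)
target = np.ones(DIM)

for action in chunked_policy(qpos, target):   # one policy call, k actions
    qpos = action                             # stand-in for a 50Hz joint command

print(qpos[:3])
```

With single-step prediction the policy would be queried 100 times over the same horizon, and each small error would shift the next observation; with chunking, errors cannot compound within the chunk because the whole trajectory segment is committed up front.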

The ACT policy architecture uses a CVAE (Conditional Variational Autoencoder) encoder during training to capture the latent style of each demonstration — essentially, a compressed representation of "how" the human completed the task, distinct from "what" the task outcome was. This enables the policy to model the natural variation in human demonstrations without mode-averaging artifacts. At inference time, only the CVAE decoder runs, conditioned on the current observation and a sampled latent vector, to generate the action chunk.

Training ACT on an ALOHA dataset with 50 demonstrations per task takes 2–4 hours on a single RTX 3090 GPU. The training code, released with the original paper, is straightforward to run with documented hyperparameters for standard ALOHA tasks. For custom tasks, the most impactful hyperparameter to tune is the chunk size (chunk_size in the config): larger chunks improve temporal consistency at the cost of reactivity to unexpected perturbations. The CVAE regularization strength (kl_weight) is a secondary knob. SVRC's platform includes pre-configured ACT training pipelines for ALOHA-format datasets.
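ACT also describes temporal ensembling at inference: since consecutive chunks overlap, each timestep can average the predictions from every chunk that covers it, with exponentially decaying weights w_i = exp(-m*i) that favor the oldest prediction. A minimal numeric sketch (the value of m and the predictions are illustrative, one action dimension shown):

```python
import numpy as np

m = 0.01
# Three overlapping chunks predicted an action for the current timestep,
# ordered oldest to newest (illustrative values).
preds = np.array([0.10, 0.12, 0.11])
weights = np.exp(-m * np.arange(len(preds)))   # w_i = exp(-m*i), oldest first
action = (weights * preds).sum() / weights.sum()
print(action)
```

This smooths out discontinuities at chunk boundaries without re-querying the policy more often, at the cost of a small amount of extra bookkeeping.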

Mobile ALOHA: Taking ALOHA Off the Table

Mobile ALOHA, published by the same Stanford group in 2024, extended the ALOHA concept to a mobile base. The bimanual arm setup was mounted on an AgileX Tracer mobile base, enabling the system to navigate to different locations within a space — approaching a kitchen counter, moving to a dining table, navigating a hallway — while retaining the ALOHA arms for manipulation. Mobile ALOHA demonstrated tasks like cooking shrimp on a stove, loading a dishwasher, and delivering a package — tasks that require both locomotion and dexterous manipulation.

Mobile ALOHA introduced the concept of whole-body teleoperation: the operator controls both the mobile base and the two arms simultaneously, either through separate control interfaces or through a unified interface that maps the operator's body movements to the robot's whole-body configuration. Data collection for Mobile ALOHA is significantly more complex than tabletop ALOHA because the policy must learn to coordinate navigation and manipulation, requiring demonstrations that cover spatial variation in the environment as well as object variation.

Mobile ALOHA also introduced co-training: training the Mobile ALOHA policy jointly on mobile manipulation demonstrations and static ALOHA manipulation demonstrations. The co-training improved manipulation performance on the mobile platform, suggesting that the bimanual manipulation knowledge from tabletop data transfers usefully to the mobile context. SVRC offers Mobile ALOHA-compatible datasets and can collect mobile manipulation demonstrations at our Mountain View facility. Contact us to discuss your Mobile ALOHA data requirements.

Differences Between ALOHA, ALOHA 2, and Commercial Derivatives

ALOHA 2, published in early 2024, improved on the original in several dimensions: higher-quality arms with better repeatability, an improved camera mounting system, and a revised wrist design that reduces cable routing complexity. The electrical system was also updated to use a dedicated power distribution board rather than daisy-chained power cables, improving reliability during long data collection sessions. ALOHA 2 maintains full software compatibility with the original — datasets collected on one can train policies evaluated on the other, subject to the usual caveats about hardware variation.

Several commercial vendors now sell ALOHA-compatible platforms — pre-assembled, tested systems that follow the ALOHA mechanical and software specification without requiring the builder to source components and assemble the arms themselves. These commercial ALOHA systems cost more than the DIY bill of materials but substantially reduce setup time and the risk of assembly errors. SVRC's hardware catalog includes ALOHA-compatible configurations; see the store for current options and pricing.

Getting Started with ALOHA Through SVRC

SVRC supports ALOHA-based research at every stage. For teams just getting started, we offer ALOHA platform leasing through our robot leasing program — access a complete bimanual setup for a fixed monthly fee without the capital commitment of purchasing hardware. Leased systems arrive pre-calibrated and ready to collect demonstrations on day one.

For data collection, our managed service provides trained ALOHA operators who can collect demonstrations at our Mountain View facility, with datasets delivered in RLDS/LeRobot format compatible with ACT, Diffusion Policy, and OpenVLA training pipelines. Our operators are experienced with bimanual coordination tasks and follow structured quality protocols that produce cleaner datasets than first-time researchers typically achieve. Pilot programs start at $2,500 for 200 demonstrations. We can also visit your site for on-location data collection campaigns if your task requires it.

For policy training and evaluation, the SVRC platform provides pre-configured ACT training pipelines, experiment tracking, and evaluation tooling for ALOHA policies. Our benchmarks include ALOHA-specific task evaluations that let you compare your policy performance against reference implementations. Whether you are building a bimanual manipulation research program from scratch or trying to push the performance of an existing system, SVRC's team can help you plan the right approach.

Related Reading