How to Set Up an ALOHA-Style Bimanual Teleop Rig
ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) from Stanford has become the reference bimanual teleop platform for imitation learning. This tutorial walks through building your own: four WidowX 250 arms (two leader, two follower), three cameras, ROS 2 Humble, and leader-follower sync. Budget one full day to commission a pre-integrated kit; a from-scratch build takes considerably longer (see the kit vs DIY deep dive).
What you will accomplish
At the end of this tutorial you will have a working ALOHA-style bimanual teleoperation rig: operator holds two leader arms, two follower arms mirror the motion in real time, and three cameras record the scene from ceiling, left wrist, and right wrist. You can record an episode to disk and feed it into LeRobot or the ACT / Diffusion Policy training pipeline.
Why ALOHA? Bimanual manipulation unlocks tasks that are genuinely hard to do with a single arm — untying a twist tie, pouring from one vessel to another, routing cable. The ALOHA form factor of cheap arms on a shared frame with good camera coverage was the breakthrough that made bimanual imitation learning practical on a lab budget.
Prerequisites
- Budget: expect roughly $20,000 to $32,000 for a full 4-arm build with cameras, frame, and workstation. Trossen's pre-integrated ALOHA kit ships the same hardware pre-assembled.
- Space: a 1.5 m by 1.0 m table plus room to stand behind it.
- Skills: comfortable with mechanical assembly, USB / power wiring, Ubuntu, and ROS 2 basics.
- Workstation: Ubuntu 22.04, RTX-class GPU, 16+ CPU cores. Intel NUC 13 Pro or equivalent.
The steps
1. Order parts and plan the workspace
Core parts list for a full rig:
- 4x Trossen WidowX 250 6-DOF arms (or the full Trossen ALOHA 2 kit if you want everything pre-integrated).
- 3x cameras — Intel RealSense D405 is the ALOHA reference; Logitech C922 also works for RGB-only.
- 80/20 aluminum extrusion and brackets for the frame.
- Workstation with RTX 3080+ GPU (data collection is CPU/IO bound; training is GPU bound).
- Powered USB 3.0 hub with at least 7 ports (4 arms + 3 cameras).
Plan the workspace so leader arms sit in front of the operator and follower arms face the task area, with a shared table surface. If you are new to this, start with a Trossen kit — you can browse comparable bimanual kits on our store.
2. Build the frame and mount the arms
Assemble the aluminum extrusion frame. The reference ALOHA frame separates the operator zone from the task zone by about 50 cm so the operator's hands do not collide with the followers' workspace. Mount each arm firmly to the frame — use the official Trossen base plates and torque all M5 bolts to spec (around 5 N·m).
Label every arm clearly: leader_left, leader_right, follower_left, follower_right. You will thank yourself an hour from now.
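One way to make those labels machine-checkable is a small registry that maps each label to a serial device and derives role and side from the name. This is only a sketch: the `/dev/ttyUSB*` paths are hypothetical placeholders, and `Arm` and `validate_rig` are illustrative names, not part of the Interbotix stack.

```python
from dataclasses import dataclass

EXPECTED_LABELS = ("leader_left", "leader_right", "follower_left", "follower_right")


@dataclass(frozen=True)
class Arm:
    label: str
    port: str  # hypothetical device path; substitute whatever your system assigns

    @property
    def role(self) -> str:
        return self.label.split("_")[0]  # "leader" or "follower"

    @property
    def side(self) -> str:
        return self.label.split("_")[1]  # "left" or "right"


def validate_rig(arms):
    """Return a list of problems; an empty list means the labelling is consistent."""
    problems = []
    labels = [arm.label for arm in arms]
    for expected in EXPECTED_LABELS:
        if labels.count(expected) != 1:
            problems.append(f"expected exactly one arm labelled {expected}")
    ports = [arm.port for arm in arms]
    if len(set(ports)) != len(ports):
        problems.append("two arms share the same port")
    return problems


# Placeholder ports for illustration only.
RIG = [Arm(label, f"/dev/ttyUSB{i}") for i, label in enumerate(EXPECTED_LABELS)]
print(validate_rig(RIG))
```

Running a check like this at startup catches a swapped or duplicated label before any arm is torqued on.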
3. Wire power and USB
Each WidowX 250 needs its own 12 V / 5 A power supply. Plug each arm's U2D2 USB-serial adapter into the powered hub. Use a powered hub with per-port current isolation; the Dynamixel motors draw enough current that a cheap unpowered hub will brown out.
Safety: emergency stop. Wire an external kill switch that cuts the 12 V supply to all four arms with one button. You will need it.
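The supply figures above give a quick worst-case power budget, which helps when sizing the kill switch and the mains strip. The sketch below is plain arithmetic on the numbers in this step; the 25% headroom factor is an assumption for illustration, not a Trossen specification.

```python
ARMS = 4
SUPPLY_VOLTS = 12.0
SUPPLY_AMPS = 5.0


def peak_power_watts(arms=ARMS, volts=SUPPLY_VOLTS, amps=SUPPLY_AMPS):
    """Worst-case draw if every supply runs at its rated current."""
    return arms * volts * amps


def kill_switch_min_amps(arms=ARMS, amps=SUPPLY_AMPS, headroom=1.25):
    """Current rating for a switch on the 12 V side; headroom is an assumed margin."""
    return arms * amps * headroom


print(f"peak power: {peak_power_watts():.0f} W")
print(f"kill switch rating: at least {kill_switch_min_amps():.0f} A")
```

With four 12 V / 5 A supplies this comes to 240 W peak, and a switch on the 12 V side would need to carry 20 A before any margin is added.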
4. Install ROS 2 Humble on the workstation
ALOHA is a ROS 2 Humble stack on Ubuntu 22.04. Install via the official apt path:

    sudo apt update && sudo apt install locales
    sudo locale-gen en_US en_US.UTF-8
    sudo update-locale LC_ALL=en_US.UTF-8 LANG=en_US.UTF-8
    sudo apt install software-properties-common
    sudo add-apt-repository universe
    sudo apt update && sudo apt install curl -y
    sudo curl -sSL https://raw.githubusercontent.com/ros/rosdistro/master/ros.key -o /usr/share/keyrings/ros-archive-keyring.gpg
    # Follow the ROS 2 Humble install docs to add the apt repo, then:
    sudo apt install ros-humble-desktop

Source the environment:

    source /opt/ros/humble/setup.bash

Add it to ~/.bashrc so it always loads.
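A quick way to confirm that the environment is sourced in the shell you are about to use is to look at the `ROS_DISTRO` variable, which the ROS setup script exports. The helper below is a minimal sketch; `ros_env_problems` is an illustrative name, not a ROS tool.

```python
import os


def ros_env_problems(env=None, expected_distro="humble"):
    """Return human-readable problems with the ROS environment, if any."""
    env = os.environ if env is None else env
    problems = []
    distro = env.get("ROS_DISTRO")
    if distro is None:
        problems.append(
            "ROS_DISTRO is not set; did you source /opt/ros/humble/setup.bash?"
        )
    elif distro != expected_distro:
        problems.append(f"ROS_DISTRO is {distro!r}, expected {expected_distro!r}")
    return problems


for problem in ros_env_problems():
    print(problem)
```

An empty result means the current process inherited a Humble environment; it does not prove that every package installed correctly.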
5. Install the Interbotix ROS 2 stack
Trossen maintains the Interbotix ROS 2 stack for WidowX. Follow the official Trossen / Interbotix install instructions — the general pattern is:

    mkdir -p ~/interbotix_ws/src
    cd ~/interbotix_ws/src
    git clone https://github.com/Interbotix/interbotix_ros_manipulators.git -b humble
    cd ..
    rosdep install --from-paths src --ignore-src -r -y
    colcon build
    source install/setup.bash

Verify you can talk to one arm:

    ros2 launch interbotix_xsarm_control xsarm_control.launch.py robot_model:=wx250s

The arm should torque on. Do not place your hand in the workspace during this test.
6. Mount and enumerate cameras
Mount the top-down camera on the frame looking down at the workspace, and wrist cameras on each follower arm just above the gripper. Connect all three to the powered USB hub.
Critical: create udev rules so each camera always gets the same device symlink instead of a /dev/video* index that can change between boots. Otherwise you will discover halfway through a recording session that left and right wrist cameras have swapped. Example rule:

    # /etc/udev/rules.d/99-aloha-cameras.rules
    SUBSYSTEM=="video4linux", ATTRS{serial}=="", SYMLINK+="aloha_wrist_left"
    SUBSYSTEM=="video4linux", ATTRS{serial}=="", SYMLINK+="aloha_wrist_right"
    SUBSYSTEM=="video4linux", ATTRS{serial}=="", SYMLINK+="aloha_top"

Put each camera's own serial number between the empty quotes. Reload with:

    sudo udevadm control --reload && sudo udevadm trigger
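If you rebuild the rig or swap cameras often, it can help to generate the rules file from a single mapping instead of editing it by hand. The sketch below renders rules in the same shape as the example; the serial strings are placeholders for illustration, and you would read the real ones with something like `udevadm info --attribute-walk --name=/dev/video0`.

```python
def udev_rules(serial_to_symlink):
    """Render one video4linux rule per camera, keyed by USB serial number."""
    lines = ["# /etc/udev/rules.d/99-aloha-cameras.rules"]
    for serial, symlink in serial_to_symlink.items():
        if not serial:
            raise ValueError(f"no serial number given for {symlink}")
        lines.append(
            f'SUBSYSTEM=="video4linux", ATTRS{{serial}}=="{serial}", '
            f'SYMLINK+="{symlink}"'
        )
    return "\n".join(lines) + "\n"


# Placeholder serials; replace with the values reported for your cameras.
CAMERAS = {
    "PLACEHOLDER_LEFT": "aloha_wrist_left",
    "PLACEHOLDER_RIGHT": "aloha_wrist_right",
    "PLACEHOLDER_TOP": "aloha_top",
}

print(udev_rules(CAMERAS), end="")
```

Cameras that expose more than one video node per device may need an additional match key so that the symlink points at a single node; check what your cameras report before relying on the rule.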
7. Run leader-follower teleop
Launch the bimanual teleop node. The Interbotix ALOHA repo provides one; the exact launch file moves across releases, but the typical pattern is:

    ros2 launch interbotix_xsarm_dual aloha_bringup.launch.py \
        leader_left:=wx250s leader_right:=wx250s \
        follower_left:=wx250s follower_right:=wx250s

Gravity-compensate the leaders, torque-enable the followers, and set the control loop to 100 Hz minimum. Grab a leader by the wrist and move it; the follower should mirror it in real time with less than 30 ms of latency.
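To make the loop-rate requirement concrete, here is a hardware-free sketch of a fixed-rate leader-to-follower copy loop. It is not the Interbotix implementation: `read_leader` and `write_follower` are hypothetical callables standing in for the real joint-state read and command write, and the loop measures cycle time only, not end-to-end latency through the motors.

```python
import time


def teleop_loop(read_leader, write_follower, rate_hz=100.0, steps=100):
    """Copy leader joint positions to the follower at a fixed rate.

    Returns the measured duration of each cycle in seconds.
    """
    period = 1.0 / rate_hz
    durations = []
    next_tick = time.perf_counter()
    for _ in range(steps):
        start = time.perf_counter()
        write_follower(read_leader())
        # Schedule against an absolute tick so timing errors do not accumulate.
        next_tick += period
        sleep_for = next_tick - time.perf_counter()
        if sleep_for > 0:
            time.sleep(sleep_for)
        durations.append(time.perf_counter() - start)
    return durations


# Stand-ins for real hardware: a fixed made-up pose and a list that records commands.
leader_pose = [0.0, -0.5, 0.5, 0.0, 0.3, 0.0]
sent = []
durations = teleop_loop(lambda: list(leader_pose), sent.append, rate_hz=100.0, steps=20)
slow = [d for d in durations if d > 0.030]
print(f"{len(sent)} commands sent, {len(slow)} cycles longer than 30 ms")
```

Logging cycle durations like this during a real session is a cheap way to spot a loop that has fallen below the target rate.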
8. Record your first episode
With teleop running, launch a recorder node that subscribes to joint states, follower commands, and all three camera topics, then writes a timestamped episode to disk:

    ros2 run aloha_data_collection record_episode \
        --task pick_and_place \
        --duration 20 \
        --output ~/aloha_episodes/

Review the recording with the visualizer. If the images, joint states, and actions look aligned, you are ready for a full recording session, typically 50 to 200 episodes per task. Next stop is our LeRobot recording tutorial or VLA fine-tuning.
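As a complement to eyeballing the visualizer, a scripted sanity check can flag episodes whose timestamps go backwards or whose camera frames are missing. The layout below is a hypothetical illustration, not the on-disk format written by the recorder; it stores image file names rather than pixels to keep the sketch small.

```python
import json
import tempfile
from pathlib import Path

CAMERAS = ("top", "wrist_left", "wrist_right")


def new_episode(task):
    return {"task": task, "steps": []}


def add_step(episode, stamp, qpos, action, frames):
    """Append one timestep; `frames` maps camera name to an image file name."""
    episode["steps"].append(
        {"stamp": stamp, "qpos": list(qpos), "action": list(action), "frames": dict(frames)}
    )


def check_episode(episode):
    """Return a list of alignment problems; empty means the episode looks sane."""
    problems = []
    previous = None
    for i, step in enumerate(episode["steps"]):
        if previous is not None and step["stamp"] <= previous:
            problems.append(f"step {i}: timestamp does not increase")
        previous = step["stamp"]
        missing = [cam for cam in CAMERAS if cam not in step["frames"]]
        if missing:
            problems.append(f"step {i}: missing frames for {missing}")
        if len(step["qpos"]) != len(step["action"]):
            problems.append(f"step {i}: qpos and action lengths differ")
    return problems


def save_episode(episode, directory):
    """Write the episode as JSON, numbering files by how many already exist."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    index = len(list(directory.glob("*.json")))
    path = directory / f"{episode['task']}_{index:04d}.json"
    path.write_text(json.dumps(episode))
    return path


episode = new_episode("pick_and_place")
for i in range(3):
    frames = {cam: f"{cam}_{i:06d}.png" for cam in CAMERAS}
    add_step(episode, stamp=i * 0.02, qpos=[0.0] * 14, action=[0.0] * 14, frames=frames)
print(check_episode(episode))
with tempfile.TemporaryDirectory() as tmp:
    print(save_episode(episode, tmp).name)
```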
What to do next
Once your rig is recording cleanly, the next investments pay off quickly: (1) add a fourth camera at a side-front angle for better depth cues, (2) add force-torque sensors at each wrist for contact-rich tasks, and (3) iterate your task taxonomy. Great bimanual datasets have breadth — 20 tasks, 50 episodes each beats 1 task with 1000 episodes for generalization.
Common failure modes
Follower lags leader: loop rate too low, or you are running the visualizer on the same ROS node. Separate the teleop control loop onto its own executor.
One arm drops out mid-recording: USB brownout. Move to a better-powered hub.
Wrist cameras swap identities after reboot: udev rules missing or not reloaded. See step 6.
Operator fatigue: real, and data quality degrades with it. Keep sessions to 30 minutes and make breaks mandatory.
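The camera-swap failure in particular is easy to catch before a session starts. The sketch below checks that the three symlinks from the udev example exist and resolve to distinct devices; `camera_link_problems` is an illustrative helper, and the symlink names assume you kept the ones used in step 6.

```python
from pathlib import Path

EXPECTED = ("aloha_wrist_left", "aloha_wrist_right", "aloha_top")


def camera_link_problems(dev_dir="/dev", expected=EXPECTED):
    """Return problems with the camera symlinks; an empty list means all are present."""
    dev = Path(dev_dir)
    problems = []
    seen = {}  # resolved device path -> symlink name that claimed it first
    for name in expected:
        link = dev / name
        if not link.exists():
            problems.append(f"{link} is missing")
            continue
        target = link.resolve()
        if target in seen:
            problems.append(f"{name} and {seen[target]} point at the same device {target}")
        seen[target] = name
    return problems


for problem in camera_link_problems():
    print(problem)
```

Running this as the first action of a recording script turns a silent dataset corruption into an immediate, readable error.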
Deep dive: kit vs DIY
The question every lab asks: buy the Trossen ALOHA 2 kit, or source the parts and build from scratch? Purely on hardware cost, DIY saves 10 to 15 percent. On time-to-first-episode, the pre-integrated kit is almost always cheaper by the time you include engineer hours — budget 40 to 60 engineer-hours for a clean DIY build, versus roughly 8 hours to unpack and commission a kit. If this is your first rig, buy the kit and learn the integration as you go. If this is your third rig and you have specific modifications in mind (different gripper, different camera layout), DIY gives you that freedom.
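The trade-off can be turned into a break-even hourly rate: the engineer cost above which the kit is cheaper overall. The sketch below uses only figures from this tutorial, and it assumes that the $20,000 to $32,000 budget range from the prerequisites is a fair proxy for the kit price, which your quote may not match.

```python
def break_even_rate(kit_price, diy_saving_fraction, diy_hours, kit_hours=8):
    """Hourly engineer cost above which buying the kit is cheaper overall."""
    hardware_saving = kit_price * diy_saving_fraction
    extra_hours = diy_hours - kit_hours
    return hardware_saving / extra_hours


# Most favourable case for DIY: expensive kit, large saving, fast build.
best_for_diy = break_even_rate(32_000, 0.15, diy_hours=40)
# Least favourable case for DIY: cheap kit, small saving, slow build.
worst_for_diy = break_even_rate(20_000, 0.10, diy_hours=60)

print(f"DIY only pays off below ${worst_for_diy:.0f} to ${best_for_diy:.0f} per engineer-hour")
```

Under these assumptions the break-even sits between roughly $38 and $150 per engineer-hour, so the answer depends heavily on what your lab's time actually costs.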
A middle path that often makes sense: buy the pre-built arms and leader-follower cabling but fabricate your own frame. The frame is the easiest part to customize and the part you most often want to modify for your specific task (bench height, camera angles, dual-station layout).
Deep dive: the subtle stuff that wrecks bimanual data
Things that look fine individually but ruin datasets in aggregate:
- Unsynchronized clocks. Camera frames, follower joint states, and leader commands should all share a single reference clock. If the cameras stamp frames with their own clock, even a 20 ms offset pairs each image with the wrong joint state and degrades everything trained on the data. Use the ROS 2 message_filters ApproximateTime synchronizer, or timestamp everything in a single process.
- Follower compliance tuning. Too stiff and the operator feels disconnected; too compliant and the follower lags on fast motions. 80% of nominal gains is a good starting point.
- Camera exposure drift. Auto-exposure reacts to moving arms and makes every episode look slightly different. Lock exposure and white balance after warm-up.
- Gripper hysteresis. The WidowX gripper has noticeable hysteresis. Calibrate both open and close positions per-arm, not just one value.
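For the clock issue, an offline check on recorded timestamps is often the quickest way to see whether streams line up. The sketch below pairs each camera stamp with the nearest joint-state stamp and rejects pairs beyond a tolerance. It is a plain-Python analogue for post-hoc checks, not the message_filters algorithm, and the stamps in the example are made up.

```python
from bisect import bisect_left


def match_nearest(ref_stamps, other_stamps, tolerance):
    """For each reference stamp, return the index of the closest stamp in
    other_stamps, or None if the closest one is further away than tolerance.

    Both lists must be sorted ascending and expressed in seconds.
    """
    matches = []
    for t in ref_stamps:
        i = bisect_left(other_stamps, t)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_stamps)]
        best = min(candidates, key=lambda j: abs(other_stamps[j] - t), default=None)
        if best is not None and abs(other_stamps[best] - t) <= tolerance:
            matches.append(best)
        else:
            matches.append(None)
    return matches


camera_stamps = [0.0, 0.1, 0.2]      # made-up camera frame times
joint_stamps = [0.004, 0.096, 0.25]  # made-up joint-state times
print(match_nearest(camera_stamps, joint_stamps, tolerance=0.01))
```

A high fraction of `None` entries, or a residual that is consistently one-sided, suggests a clock offset rather than random jitter.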
Deep dive: dataset formats for bimanual
Bimanual episodes typically pack dual 7-DOF actions (6 joints + gripper per arm) into a 14-dimensional action vector. Most downstream stacks — LeRobot, OpenVLA, HuggingFace datasets — accept arbitrary action dimensionality, but you need to be explicit about action ordering in your metadata. The convention we recommend: [left_j0..j5, left_gripper, right_j0..j5, right_gripper]. Document it in a dataset card so anyone training on your data knows what the channels mean.
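The recommended ordering can be pinned down in code so that packing and unpacking never drift apart. This is a minimal sketch of that convention; the function and constant names are illustrative and not part of LeRobot or any other library.

```python
JOINTS_PER_ARM = 6
ARM_DIM = JOINTS_PER_ARM + 1  # six joints plus one gripper value

# Channel names in the recommended order; suitable for a dataset card.
ACTION_NAMES = [
    f"{side}_{name}"
    for side in ("left", "right")
    for name in [f"j{i}" for i in range(JOINTS_PER_ARM)] + ["gripper"]
]


def pack_action(left_joints, left_gripper, right_joints, right_gripper):
    """Build the 14-dimensional action vector in the documented order."""
    if len(left_joints) != JOINTS_PER_ARM or len(right_joints) != JOINTS_PER_ARM:
        raise ValueError("each arm needs exactly six joint values")
    return [*left_joints, left_gripper, *right_joints, right_gripper]


def unpack_action(action):
    """Split a 14-dimensional action vector back into per-arm parts."""
    if len(action) != 2 * ARM_DIM:
        raise ValueError(f"expected {2 * ARM_DIM} values, got {len(action)}")
    left, right = action[:ARM_DIM], action[ARM_DIM:]
    return {
        "left": {"joints": left[:-1], "gripper": left[-1]},
        "right": {"joints": right[:-1], "gripper": right[-1]},
    }


action = pack_action([0.1] * 6, 1.0, [0.2] * 6, 0.0)
print(dict(zip(ACTION_NAMES, action)))
```

Writing `ACTION_NAMES` into the dataset metadata alongside the data gives downstream users the channel meaning without having to read your recorder code.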
Frequently asked questions
Do I need ROS 2 for ALOHA? The official stack uses ROS 2 Humble. You can run bimanual teleop without ROS by writing your own serial bus coordinator, but it is a lot of work for marginal benefit.
Can I substitute different arms? Yes — as long as the leader-follower geometry matches. Teams have built ALOHA-style rigs with Koch v1.1, Moss, and SO-ARM bimanual variants for lower budgets.
How many episodes to train a policy? For ACT on a bimanual task, 50 episodes is the minimum and 100 to 200 is the sweet spot.
Force feedback on the leader? The standard ALOHA rig is non-haptic. Haptic variants exist but add cost and complexity.