BimanualShift transfers skills from arbitrary pretrained unimanual policies to complex bimanual manipulation tasks by adapting the unimanual policy priors while keeping them fully frozen.
The framework is built upon frozen pretrained unimanual policies and consists of three core learnable modules. Module 1 (Visual Tracker) uses semantic masks to decouple the shared workspace into arm-specific visual inputs, eliminating attention interference. Module 2 (Action Generator) acts as an adapter that transforms high-level instructions into dynamic skill weights and compensation vectors, guiding the low-level policies to achieve coordinated behaviors. Module 3 (Skill Memory) retrieves relevant past experience and incorporates closed-loop reflection, enabling continual learning and rapid adaptation to new tasks.
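As a minimal sketch of how the Action Generator (Module 2) could modulate frozen unimanual priors, the toy code below combines each arm's base action with instruction-conditioned skill weights and compensation vectors. All names, shapes, and the linear adapter head are illustrative assumptions, not the authors' implementation; the only property taken from the text is that the unimanual policies stay frozen and are steered purely through weights and compensations.

```python
import numpy as np

ACTION_DIM = 8  # e.g. 7-DoF end-effector pose + 1 gripper channel (assumed)

def frozen_unimanual_policy(obs: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen pretrained unimanual policy (never updated)."""
    rng = np.random.default_rng(int(obs.sum()) % 2**32)
    return rng.standard_normal(ACTION_DIM)

def action_generator(instruction_emb: np.ndarray):
    """Map a high-level instruction embedding to dynamic skill weights and
    compensation vectors for each arm (toy linear head, fixed for the demo)."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((2 * 2 * ACTION_DIM, instruction_emb.size)) * 0.01
    out = (W @ instruction_emb).reshape(2, 2, ACTION_DIM)  # (arm, [w|c], dim)
    weights = 1.0 + np.tanh(out[:, 0])  # multiplicative gates around identity
    comps = 0.1 * np.tanh(out[:, 1])    # small additive corrections
    return weights, comps

def bimanual_step(obs_left, obs_right, instruction_emb):
    """Frozen priors produce base actions; the adapter only modulates them."""
    base = np.stack([frozen_unimanual_policy(obs_left),
                     frozen_unimanual_policy(obs_right)])
    weights, comps = action_generator(instruction_emb)
    return weights * base + comps       # coordinated per-arm actions

actions = bimanual_step(np.ones(16), np.zeros(16), np.ones(32))
print(actions.shape)  # (2, 8): one action per arm
```

Keeping the modulation multiplicative-plus-additive means that with unit weights and zero compensation the adapter reduces to the original unimanual behavior, which is one plausible way to preserve the frozen priors as a fallback.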
We evaluate BimanualShift on six representative bimanual manipulation tasks from RLBench2. RLBench2 extends the widely used unimanual benchmark RLBench to bimanual manipulation scenarios. In simulation, we use two Franka Panda robotic arms equipped with parallel grippers. To fully cover the workspace, we deploy six noise-free RGB-D cameras (256 × 256 resolution) at the front, left shoulder, right shoulder, left wrist, right wrist, and top-down viewpoints.
The evaluation involves eight real-world tasks covering diverse challenges: Flower Arrangement and Pouring Water require coordinated stability and smooth motion control; Toasting Completion and Vegetable Sorting test synchronous grasping and timing; Quilt Folding and Cable Routing demand robust control under deformable object dynamics; while Toaster Activation and Block Threading involve heterogeneous actions and high-precision spatial alignment in constrained spaces.
Vegetable Sorting
Pouring Water
Cable Routing
Block Threading
Toaster Activation
Toasting Completion
Flower Arrangement
Quilt Folding
To evaluate the generalization capability of BimanualShift under unseen conditions, we conduct a generalization study on the Block Threading task using the best-performing configuration, BimanualShift-PerAct. Five types of perturbations are considered: unseen object color, unseen object shape, lighting change, left–right task exchange, and unseen background.
Unseen Object Color
Unseen Object Shape
Lighting Change
Left–Right Task Exchange
Unseen Background
To validate the reliability of BimanualShift in industrial deployment, we introduce two representative sources of real-world disturbances on the physical robot platform: 1) Extreme Glare, induced by a high-intensity point light to create severe reflections and shadows, and 2) Camera Perturbations, simulated by injecting a 1 cm translational error and a 2° rotational error into the camera extrinsic parameters.
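The camera-perturbation protocol above (a 1 cm translational and a 2° rotational error injected into the extrinsics) can be sketched as follows. The random choice of translation direction and rotation axis, and the left-composition order, are assumptions; only the error magnitudes come from the text.

```python
import numpy as np

def perturb_extrinsics(T: np.ndarray,
                       trans_err_m: float = 0.01,
                       rot_err_deg: float = 2.0,
                       seed: int = 0) -> np.ndarray:
    """Apply a fixed-magnitude, random-direction translation error and a
    fixed-angle, random-axis rotation error to a 4x4 extrinsic matrix T."""
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(3); d /= np.linalg.norm(d)   # translation direction
    a = rng.standard_normal(3); a /= np.linalg.norm(a)   # rotation axis
    th = np.deg2rad(rot_err_deg)
    # Rodrigues formula: rotation of th radians about unit axis a
    K = np.array([[0, -a[2], a[1]],
                  [a[2], 0, -a[0]],
                  [-a[1], a[0], 0]])
    R = np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * (K @ K)
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = trans_err_m * d
    return E @ T   # perturbation composed on the left of the true extrinsics

T0 = np.eye(4)
T1 = perturb_extrinsics(T0)
print(np.linalg.norm(T1[:3, 3]))  # translation error magnitude (~0.01 m)
```

Starting from an identity base pose makes the injected errors directly readable off the result: the translation column has norm 1 cm and the rotation block has angle 2°.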
To evaluate the lifelong learning capability of the skill memory module in BimanualShift, we construct a long-horizon task that requires sequentially composing multiple actions after the model has learned the Toasting Completion and Toaster Activation skills. The task consists of four actions: Action 1 (inserting the toast into the toaster), Action 2 (pressing the activation button), Action 3 (grasping the toasted bread), and Action 4 (placing the bread onto a plate).
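One way the Skill Memory (Module 3) could sequence previously learned skills for this four-action task is nearest-neighbor retrieval over stored skill embeddings. The embeddings, similarity metric, and skill names below are illustrative assumptions; the sketch only shows the retrieve-then-compose pattern, not the paper's actual memory design.

```python
import numpy as np

# Hypothetical memory: skill name -> (key embedding, callable for the skill)
skill_memory = {
    "insert_toast":   (np.array([1.0, 0.0, 0.0]), lambda: "Action 1"),
    "press_button":   (np.array([0.0, 1.0, 0.0]), lambda: "Action 2"),
    "grasp_bread":    (np.array([0.9, 0.1, 0.5]), lambda: "Action 3"),
    "place_on_plate": (np.array([0.1, 0.2, 1.0]), lambda: "Action 4"),
}

def retrieve(query: np.ndarray) -> str:
    """Return the stored skill whose key is most cosine-similar to the query."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(skill_memory, key=lambda k: cos(skill_memory[k][0], query))

# Sub-goal embeddings for the four steps (assumed to come from an upstream
# instruction encoder; hand-picked here so retrieval is unambiguous).
subgoals = [np.array([1.0, 0.1, 0.0]),
            np.array([0.0, 1.0, 0.1]),
            np.array([0.8, 0.0, 0.4]),
            np.array([0.0, 0.1, 1.0])]

plan = [retrieve(g) for g in subgoals]
print(plan)  # → ['insert_toast', 'press_button', 'grasp_bread', 'place_on_plate']
```

Closed-loop reflection could then be layered on top by re-querying the memory with an updated embedding whenever a retrieved skill fails, but that feedback path is omitted here for brevity.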