MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data

Zifan Wang*1,2,5, Ziqing Chen*1,2, Junyu Chen*1, Jilong Wang2,3, Yuxin Yang2, Yunze Liu1,5, Xueyi Liu1,5, He Wang2,3, Li Yi†4,5
1Tsinghua University, 2Galbot, 3Peking University, 4Shanghai Artificial Intelligence Laboratory, 5Shanghai Qi Zhi Institute

Abstract

This paper introduces MobileH2R, a framework for learning generalizable vision-based human-to-mobile-robot (H2MR) handover skills.

Unlike traditional fixed-base handovers, this task requires a mobile robot to reliably receive objects in a large workspace enabled by its mobility. Our key insight is that generalizable handover skills can be developed in simulators using high-quality synthetic data, without the need for real-world demonstrations. To achieve this, we propose a scalable pipeline for generating diverse synthetic full-body human motion data, an automated method for creating safe and imitation-friendly demonstrations, and an efficient 4D imitation learning method for distilling large-scale demonstrations into closed-loop policies with base-arm coordination.

Experimental evaluations in both simulators and the real world show significant improvements (at least +15% success rate) over baseline methods in all cases. Experiments also validate that large-scale, diverse synthetic data greatly enhances robot learning, underscoring the value of our scalable framework.

Video

Method

MobileH2R-Sim

To generate diverse synthetic full-body human motion data at scale, we introduce an automatic pipeline that emphasizes both diversity and fidelity.

Existing datasets such as AMASS and MotionX lack task-specific specialization and the interactive behaviors required for complex HRI tasks like handover. We therefore propose a two-stage pipeline for full-body handover motion synthesis: generic motion synthesis algorithms generate a wide range of realistic full-body motions, while a task-specific synthesis method creates diverse hand and arm movements tailored for handovers.
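
As a rough illustration of how the two stages could be composed, the sketch below fuses a generic full-body clip with a task-specific arm trajectory. The helper functions, joint indices, and blending scheme are our own placeholders, not the paper's actual pipeline.

import numpy as np

# Placeholder stand-ins for the two synthesis stages; the actual generators
# used in MobileH2R-Sim are not specified here.
def sample_generic_body_motion(num_frames, num_joints):
    # A generic motion synthesis model would return a realistic full-body clip;
    # zeros of the right shape are used purely as a placeholder.
    return np.zeros((num_frames, num_joints, 3))

def synthesize_handover_arm_motion(num_frames, num_arm_joints):
    # A task-specific trajectory that gradually presents the object to the robot.
    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    return t * np.full((1, num_arm_joints, 3), 0.5)

def compose_handover_motion(num_frames=120, num_joints=52, arm_joint_ids=range(16, 24)):
    """Fuse a generic full-body clip with a handover-specific arm trajectory."""
    arm_ids = np.asarray(list(arm_joint_ids))
    body = sample_generic_body_motion(num_frames, num_joints)       # (T, J, 3)
    arm = synthesize_handover_arm_motion(num_frames, len(arm_ids))  # (T, A, 3)

    # Override the arm joints of the generic motion, blending in over time
    # to avoid a visible discontinuity at the start of the clip.
    fused = body.copy()
    w = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    fused[:, arm_ids, :] = (1.0 - w) * body[:, arm_ids, :] + w * arm
    return fused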

Furthermore, we design an interactive human agent that responds to the robot's proximity. The resulting synthetic dataset includes over 100K interactive handover scenes and scales readily for task-specific HRI training.
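
Below is a minimal sketch of the kind of proximity-triggered behavior such an agent might implement: it replays a synthesized clip, slows down when the robot's gripper gets close, and holds still for the grasp. The distance thresholds, playback speeds, and joint index are illustrative assumptions, not the actual agent logic.

import numpy as np

class InteractiveHumanAgent:
    """Replays a synthesized full-body clip and reacts to the robot's gripper.
    Thresholds and behaviors below are assumptions for illustration only."""

    def __init__(self, motion_clip, react_dist=0.6, stop_dist=0.25):
        self.motion_clip = motion_clip   # (T, J, 3) full-body joint positions
        self.react_dist = react_dist     # distance at which the human slows down (m)
        self.stop_dist = stop_dist       # distance at which the human holds still (m)
        self.frame = 0.0

    def step(self, gripper_pos, hand_joint_id=23):
        hand_pos = self.motion_clip[int(self.frame), hand_joint_id]
        dist = float(np.linalg.norm(np.asarray(gripper_pos) - hand_pos))

        if dist < self.stop_dist:
            speed = 0.0                  # freeze and keep presenting the object
        elif dist < self.react_dist:
            speed = 0.5                  # slow down as the robot approaches
        else:
            speed = 1.0                  # normal playback of the synthesized clip

        self.frame = min(self.frame + speed, len(self.motion_clip) - 1)
        return self.motion_clip[int(self.frame)]

# Example usage with a dummy clip:
# agent = InteractiveHumanAgent(np.zeros((300, 52, 3)))
# pose = agent.step(gripper_pos=np.array([0.8, 0.0, 1.0]))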


Safe and Imitation-friendly Demonstration

To ensure safe interaction, we learn interaction policies through imitation rather than reinforcement learning. Inspired by GenH2R, we explore motion planning methods for demonstration generation. For safety, we optimize the planner so that the planned trajectories avoid collisions with the human body and stay out of the human's blind side. To further facilitate imitation from the demonstrations, we also ensure that the vision sequence paired with each planned trajectory provides clear and informative object state estimates. Since object states are strongly correlated with the robot's actions, this strengthens the connection between the vision signal and the robot's actions, making imitation learning more effective. To implement these requirements, we define several losses in the motion planning process that ensure the generation of high-quality demonstrations.
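
The concrete loss terms live in the planner; as a rough sketch of how the three requirements above might enter a per-waypoint planning cost, consider the following (all weights, margins, function names, and input summaries are hypothetical):

import numpy as np

def demo_planning_cost(waypoint_ee_pos, human_points, human_facing_dir, human_head_pos,
                       object_visibility, w_collision=1.0, w_blind=0.5, w_vision=0.3):
    """Hypothetical per-waypoint cost combining the three requirements above."""
    # 1) Safety: penalize end-effector waypoints that come too close to any
    # sampled point on the human body (hinge loss on a 30 cm margin).
    dists = np.linalg.norm(human_points - waypoint_ee_pos, axis=1)
    collision_cost = max(0.3 - float(dists.min()), 0.0)

    # 2) Safety: penalize entering the human's blind side, i.e. the half-space
    # behind the (unit) direction the human is facing.
    to_ee = waypoint_ee_pos - human_head_pos
    behind = -float(np.dot(to_ee / (np.linalg.norm(to_ee) + 1e-8), human_facing_dir))
    blind_cost = max(behind, 0.0)

    # 3) Imitation-friendliness: reward waypoints from which the robot's camera
    # observes the object clearly (summarized here as a visibility score in [0, 1]).
    vision_cost = 1.0 - float(object_visibility)

    return w_collision * collision_cost + w_blind * blind_cost + w_vision * vision_cost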

Imitation for Coordinated Base-Arm Actions

Finally, to distill these demonstrations into a visuomotor policy, we employ a 4D imitation learning approach that incorporates both human and object vision inputs, along with coordinated base-arm action outputs.

Unlike previous works that focus solely on hand-object point clouds, our approach also includes the human body as input. To address scale differences between body and object point clouds, we apply set abstraction layers with varied sampling radii.

The resulting features are merged into a global representation and decoded by an MLP to generate coordinated base-arm movements, which are essential for controlling a mobile robot.
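
For concreteness, here is a minimal PyTorch-style sketch of this kind of two-branch architecture. For brevity each branch is a simple PointNet-style encoder (shared MLP plus max pooling); the actual model uses set abstraction layers with different sampling radii, and the action sizes (a 2-DoF base command plus a 6-DoF end-effector delta) are our assumptions.

import torch
import torch.nn as nn

class BaseArmPolicy(nn.Module):
    """Illustrative sketch: encode body and hand-object point clouds separately,
    merge into a global feature, and decode coordinated base-arm actions."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.body_enc = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.obj_enc = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 + 6),       # [v, w] for the base, 6-DoF delta for the arm
        )

    def forward(self, body_pts, obj_pts):
        # body_pts: (B, N_body, 3) human-body points; obj_pts: (B, N_obj, 3) hand-object points
        body_feat = self.body_enc(body_pts).max(dim=1).values   # global body feature
        obj_feat = self.obj_enc(obj_pts).max(dim=1).values      # global hand-object feature
        fused = torch.cat([body_feat, obj_feat], dim=-1)        # merged global representation
        action = self.head(fused)
        base_cmd, arm_delta = action[:, :2], action[:, 2:]
        return base_cmd, arm_delta

# Example usage on random point clouds:
# policy = BaseArmPolicy()
# base_cmd, arm_delta = policy(torch.randn(1, 1024, 3), torch.randn(1, 512, 3))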

Experiments

Simulation Experiments

In the s0 (sequential) benchmark, our model finds better grasps, especially for challenging objects.

    Baseline (HandoverSim2real): 75.23% success rate
    Ours: 86.57% success rate

In the s0 (simultaneous) benchmark, our model adjusts both its distance to the object and its pose in advance, avoiding frequent pose adjustments once the gripper is close to the object.

    Baseline (HandoverSim2real): 68.75% success rate
    Ours: 85.65% success rate

In the t0 benchmark, our model predicts the future pose of the object, enabling more reasonable approach trajectories.

    Baseline (HandoverSim2real): 29.17% success rate
    Ours: 41.43% success rate

In the t1 benchmark, our model generalizes to unseen real-world objects with diverse geometries.

    Baseline (HandoverSim2real): 52.40% success rate
    Ours: 68.33% success rate

Real-world Experiments

For more complex trajectories involving rotations, our model demonstrates greater robustness than baseline methods.

Baseline (HandoverSim2real) vs. Ours

For novel objects with complex trajectories, our model exhibits greater generalizability.

Baseline (HandoverSim2real) vs. Ours


If you have any questions, please contact:

Zifan Wang (wzf22@mails.tsinghua.edu.cn)

Ziqing Chen (zq-chen22@mails.tsinghua.edu.cn)

Junyu Chen (junyu-ch21@mails.tsinghua.edu.cn)

BibTeX