MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data

Zifan Wang*1,2,5, Ziqing Chen*1,2, Junyu Chen*1, Jilong Wang2,3, Yuxin Yang2, Yunze Liu1,5, Xueyi Liu1,5, He Wang2,3, Li Yi†4,5
1Tsinghua University, 2Galbot, 3Peking University, 4Shanghai Artificial Intelligence Laboratory, 5Shanghai Qi Zhi Institute

Abstract

This paper introduces MobileH2R, a framework for learning generalizable vision-based human-to-mobile-robot (H2MR) handover skills.

Unlike traditional fixed-base handovers, this task requires a mobile robot to reliably receive objects in a large workspace enabled by its mobility. Our key insight is that generalizable handover skills can be developed in simulators using high-quality synthetic data, without the need for real-world demonstrations. To achieve this, we propose a scalable pipeline for generating diverse synthetic full-body human motion data, an automated method for creating safe and imitation-friendly demonstrations, and an efficient 4D imitation learning method for distilling large-scale demonstrations into closed-loop policies with base-arm coordination.

Experimental evaluations in both simulators and the real world show significant improvements (at least +15% success rate) over baseline methods in all cases. Experiments also validate that large-scale, diverse synthetic data greatly enhances robot learning, underscoring the value of our scalable framework.

Video

Method

MobileH2R-Sim

To generate diverse synthetic full-body human motion data at scale, we introduce an automatic pipeline that emphasizes both diversity and fidelity.

Existing datasets such as AMASS and MotionX lack task-specific specialization and the interactive behaviors required for complex HRI tasks like handover. We therefore propose a two-stage pipeline for full-body handover motion synthesis: generic motion synthesis algorithms generate a wide range of realistic full-body motions, while a task-specific synthesis method creates diverse hand and arm movements tailored for handovers.
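
As a rough illustration of how the two stages could be composed, the sketch below fuses a generic full-body clip with a task-specific arm trajectory. The helper functions, joint indices, and blending scheme are our own placeholders, not the paper's actual pipeline.

import numpy as np

# Placeholder stand-ins for the two synthesis stages; the actual generators
# used in MobileH2R-Sim are not specified here.
def sample_generic_body_motion(num_frames, num_joints):
    # A generic motion synthesis model would return a realistic full-body clip;
    # zeros of the right shape are used purely as a placeholder.
    return np.zeros((num_frames, num_joints, 3))

def synthesize_handover_arm_motion(num_frames, num_arm_joints):
    # A task-specific trajectory that gradually presents the object to the robot.
    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    return t * np.full((1, num_arm_joints, 3), 0.5)

def compose_handover_motion(num_frames=120, num_joints=52, arm_joint_ids=range(16, 24)):
    """Fuse a generic full-body clip with a handover-specific arm trajectory."""
    arm_ids = np.asarray(list(arm_joint_ids))
    body = sample_generic_body_motion(num_frames, num_joints)       # (T, J, 3)
    arm = synthesize_handover_arm_motion(num_frames, len(arm_ids))  # (T, A, 3)

    # Override the arm joints of the generic motion, blending in over time
    # to avoid a visible discontinuity at the start of the clip.
    fused = body.copy()
    w = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    fused[:, arm_ids, :] = (1.0 - w) * body[:, arm_ids, :] + w * arm
    return fused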

Furthermore, we design an interactive human agent that responds to the robot's proximity. The resulting synthetic dataset includes over 100K interactive handover scenes and scales readily for task-specific HRI training.
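
Below is a minimal sketch of the kind of proximity-triggered behavior such an agent might implement: it replays a synthesized clip, slows down when the robot's gripper gets close, and holds still for the grasp. The distance thresholds, playback speeds, and joint index are illustrative assumptions, not the actual agent logic.

import numpy as np

class InteractiveHumanAgent:
    """Replays a synthesized full-body clip and reacts to the robot's gripper.
    Thresholds and behaviors below are assumptions for illustration only."""

    def __init__(self, motion_clip, react_dist=0.6, stop_dist=0.25):
        self.motion_clip = motion_clip   # (T, J, 3) full-body joint positions
        self.react_dist = react_dist     # distance at which the human slows down (m)
        self.stop_dist = stop_dist       # distance at which the human holds still (m)
        self.frame = 0.0

    def step(self, gripper_pos, hand_joint_id=23):
        hand_pos = self.motion_clip[int(self.frame), hand_joint_id]
        dist = float(np.linalg.norm(np.asarray(gripper_pos) - hand_pos))

        if dist < self.stop_dist:
            speed = 0.0                  # freeze and keep presenting the object
        elif dist < self.react_dist:
            speed = 0.5                  # slow down as the robot approaches
        else:
            speed = 1.0                  # normal playback of the synthesized clip

        self.frame = min(self.frame + speed, len(self.motion_clip) - 1)
        return self.motion_clip[int(self.frame)]

# Example usage with a dummy clip:
# agent = InteractiveHumanAgent(np.zeros((300, 52, 3)))
# pose = agent.step(gripper_pos=np.array([0.8, 0.0, 1.0]))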


Safe and Imitation-friendly Demonstration

To ensure safe interaction, we learn interaction policies through imitation rather than reinforcement learning. Inspired by GenH2R, we explore motion planning methods for demonstration generation. For safety, we optimize the planner so that the planned trajectories avoid collisions with the human body and stay out of the human's blind side. To further facilitate imitation from the demonstrations, we also ensure that the vision sequence paired with each planned trajectory provides clear and informative object state estimates. Since object states are strongly correlated with the robot's actions, this strengthens the connection between the vision signal and the robot's actions, making imitation learning more effective. To implement these requirements, we define several losses in the motion planning process that ensure the generation of high-quality demonstrations.
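
The concrete loss terms live in the planner; as a rough sketch of how the three requirements above might enter a per-waypoint planning cost, consider the following (all weights, margins, function names, and input summaries are hypothetical):

import numpy as np

def demo_planning_cost(waypoint_ee_pos, human_points, human_facing_dir, human_head_pos,
                       object_visibility, w_collision=1.0, w_blind=0.5, w_vision=0.3):
    """Hypothetical per-waypoint cost combining the three requirements above."""
    # 1) Safety: penalize end-effector waypoints that come too close to any
    # sampled point on the human body (hinge loss on a 30 cm margin).
    dists = np.linalg.norm(human_points - waypoint_ee_pos, axis=1)
    collision_cost = max(0.3 - float(dists.min()), 0.0)

    # 2) Safety: penalize entering the human's blind side, i.e. the half-space
    # behind the (unit) direction the human is facing.
    to_ee = waypoint_ee_pos - human_head_pos
    behind = -float(np.dot(to_ee / (np.linalg.norm(to_ee) + 1e-8), human_facing_dir))
    blind_cost = max(behind, 0.0)

    # 3) Imitation-friendliness: reward waypoints from which the robot's camera
    # observes the object clearly (summarized here as a visibility score in [0, 1]).
    vision_cost = 1.0 - float(object_visibility)

    return w_collision * collision_cost + w_blind * blind_cost + w_vision * vision_cost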

Imitation for Coordinated Base-Arm Actions

Finally, to distill these demonstrations into a visuomotor policy, we employ a 4D imitation learning approach that incorporates both human and object vision inputs, along with coordinated base-arm action outputs.

Unlike previous works that focus solely on hand-object point clouds, our approach also includes the human body as input. To address scale differences between body and object point clouds, we apply set abstraction layers with varied sampling radii.

The resulting features are merged into a global representation and decoded by an MLP to generate coordinated base-arm movements, which are essential for controlling a mobile robot.
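
For concreteness, here is a minimal PyTorch-style sketch of this kind of two-branch architecture. For brevity each branch is a simple PointNet-style encoder (shared MLP plus max pooling); the actual model uses set abstraction layers with different sampling radii, and the action sizes (a 2-DoF base command plus a 6-DoF end-effector delta) are our assumptions.

import torch
import torch.nn as nn

class BaseArmPolicy(nn.Module):
    """Illustrative sketch: encode body and hand-object point clouds separately,
    merge into a global feature, and decode coordinated base-arm actions."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.body_enc = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.obj_enc = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 + 6),       # [v, w] for the base, 6-DoF delta for the arm
        )

    def forward(self, body_pts, obj_pts):
        # body_pts: (B, N_body, 3) human-body points; obj_pts: (B, N_obj, 3) hand-object points
        body_feat = self.body_enc(body_pts).max(dim=1).values   # global body feature
        obj_feat = self.obj_enc(obj_pts).max(dim=1).values      # global hand-object feature
        fused = torch.cat([body_feat, obj_feat], dim=-1)        # merged global representation
        action = self.head(fused)
        base_cmd, arm_delta = action[:, :2], action[:, 2:]
        return base_cmd, arm_delta

# Example usage on random point clouds:
# policy = BaseArmPolicy()
# base_cmd, arm_delta = policy(torch.randn(1, 1024, 3), torch.randn(1, 512, 3))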

Experiments

Simulation Experiments

In the s0 (sequential) benchmark, our model finds better grasps, especially for challenging objects.

    Baseline (HandoverSim2real): 75.23% success rate
    Ours: 86.57% success rate

In the s0 (simultaneous) benchmark, our model adjusts both its distance to the object and its pose in advance, avoiding frequent pose adjustments once the gripper is close to the object.

    Baseline (HandoverSim2real): 68.75% success rate
    Ours: 85.65% success rate

In the t0 benchmark, our model predicts the future pose of the object, enabling more reasonable approach trajectories.

    Baseline (HandoverSim2real): 29.17% success rate
    Ours: 41.43% success rate

In the t1 benchmark, our model generalizes to unseen real-world objects with diverse geometries.

    Baseline (HandoverSim2real): 52.40% success rate
    Ours: 68.33% success rate

Real-world Experiments

For more complex trajectories involving rotations, our model demonstrates greater robustness than baseline methods.

Baseline (HandoverSim2real) vs. Ours

For novel objects with complex trajectories, our model exhibits greater generalizability.

Baseline (HandoverSim2real) vs. Ours


If you have any questions, please contact:

Zifan Wang (wzf22@mails.tsinghua.edu.cn)

Ziqing Chen (zq-chen22@mails.tsinghua.edu.cn)

Junyu Chen (junyu-ch21@mails.tsinghua.edu.cn)

BibTeX