Text2HOI project page

Abstract

This paper introduces the first text-guided method for generating sequences of hand-object interaction in 3D. The main challenge arises from the lack of labeled data: existing ground-truth datasets are nowhere near generalizable in interaction type and object category, which inhibits the modeling of diverse 3D hand-object interactions with correct physical implications (e.g., contacts and semantics) from text prompts. To address this challenge, we propose to decompose the interaction generation task into two subtasks: hand-object contact generation and hand-object motion generation. For contact generation, a VAE-based network takes as input a text prompt and an object mesh, and generates the probability of contact between the surfaces of the hands and the object during the interaction. The network learns a variety of local geometric structures of diverse objects that are independent of the objects' category, and thus it is applicable to general objects. For motion generation, a Transformer-based diffusion model uses this 3D contact map as a strong prior for generating physically plausible hand-object motion as a function of text prompts, learning from an augmented labeled dataset in which we annotate text labels for many existing 3D hand and object motion data. Finally, we introduce a hand refiner module that minimizes the distance between the object surface and hand joints to improve the temporal stability of the hand-object contacts and to suppress penetration artifacts. In the experiments, we demonstrate that our method can generate more realistic and diverse interactions than other baseline methods. We also show that our method is applicable to unseen objects. We will release our model and newly labeled data as a strong foundation for future research. Code and data are available at: https://github.com/JunukCha/Text2HOI.

Framework

Given a text prompt and a canonical object mesh, our aim is to generate the 3D motion for hand-object interaction. We first generate a contact map from the canonical object mesh, conditioned on the text prompt and the object's scale. The hand-object motion generation module then denoises its inputs so that the denoised outputs align with the predicted contact map and the text prompt. The denoised outputs can still exhibit artifacts, including penetration. To address these artifacts, the hand refinement module adjusts the generated (denoised) hand pose parameters to restrain penetration and to improve contact interactions.
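To make the refinement idea concrete, the following is a minimal sketch of an energy that a refinement step of this kind could try to reduce. It is not the authors' implementation: the function name `refinement_energy`, the weight `w_pen`, the use of sampled object points with outward normals, and the nearest-point penetration test are all assumptions made for illustration. The page only states that the refiner minimizes the distance between the object surface and hand joints and suppresses penetration.

```python
import numpy as np


def refinement_energy(joints, obj_points, obj_normals, contact_prob, w_pen=1.0):
    """Toy energy: contact attraction plus penetration penalty (illustrative only).

    joints:       (J, 3) hand joint positions
    obj_points:   (N, 3) points sampled on the object surface
    obj_normals:  (N, 3) outward unit normals at those points
    contact_prob: (N,)   predicted contact probability per object point
    """
    # Pairwise distances between object points and hand joints: shape (N, J).
    diff = obj_points[:, None, :] - joints[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Contact term: distance from each object point to its nearest joint,
    # averaged with the contact probabilities as weights.
    nearest_joint_dist = dist.min(axis=1)
    weight_sum = max(float(contact_prob.sum()), 1e-8)
    contact_term = float((contact_prob * nearest_joint_dist).sum() / weight_sum)

    # Penetration term (assumed heuristic): for each joint, take the nearest
    # object point and measure the signed offset along its outward normal.
    # A negative offset means the joint lies behind the surface.
    nearest_pt = dist.argmin(axis=0)  # (J,)
    offset = joints - obj_points[nearest_pt]
    signed = np.einsum("jk,jk->j", offset, obj_normals[nearest_pt])
    penetration_term = float(np.clip(-signed, 0.0, None).sum())

    total = contact_term + w_pen * penetration_term
    return total, contact_term, penetration_term


if __name__ == "__main__":
    # Six points on a unit sphere; for a unit sphere the normals equal the points.
    obj = np.array(
        [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]],
        dtype=float,
    )
    prob = np.array([1, 0, 0, 0, 0, 0], dtype=float)  # contact expected at +x
    outside = np.array([[1.2, 0.0, 0.0]])
    inside = np.array([[0.9, 0.0, 0.0]])
    print(refinement_energy(outside, obj, obj, prob))
    print(refinement_energy(inside, obj, obj, prob))
```

In this toy setup, a joint just outside the contact region pays only the contact term, while a joint inside the sphere also pays the penetration term. An actual refiner would presumably operate on hand pose parameters across a whole sequence and be trained or optimized end to end, which this sketch does not attempt.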

Results

Close a laptop with the left hand.

Close a microwave with both hands.

Cook using a frying pan with the left hand.

Fly an airplane with the right hand.

Hand over an apple with both hands.

Open a box with the right hand.

Open a waffle iron with the left hand.

Place a book with both hands.

Place a waffle iron with the left hand.

Play the flute with both hands.

Play a train with the right hand.

Pour milk with the right hand.

Type a laptop with both hands.

Unseen

Grab a teddy bear with the left hand.

Pour milk in round bottle with the right hand.

Contact maps depending on the text prompt

Pour milk with the right hand.

Close a milk carton with both hands.

BibTeX (CVPR2024)

@inproceedings{cha2024text2hoi,
  title={Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction},
  author={Cha, Junuk and Kim, Jihyeon and Yoon, Jae Shin and Baek, Seungryul},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={1577--1585},
  year={2024}
}

Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

Hand type

Grasp a cappuccino with the left hand.

Grasp a cappuccino with the right hand.

Hand motion in the canonical coordinate frame

Grab a box with the right hand.

Diversity

Type a laptop with both hands.

Contact maps depending on object sizes

Pass a pyramid with the right hand.

Small

Large
