Given as input a set of 3D objects and a text prompt describing the desired interaction, our approach generates both human and object motion sequences performing the interaction in a zero-shot fashion. We distill part-level affordances from large language model reasoning as guidance, enabling flexible modeling of diverse interaction scenarios involving multiple people or objects.
We present HOI-PAGE, a new approach to synthesizing 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion, driven by part-level affordance reasoning. In contrast to prior works that focus on global, whole-body and object motion for 4D HOI synthesis, we observe that generating realistic and diverse HOIs requires a finer-grained understanding of how specific human body parts engage with object parts.
We thus introduce Part Affordance Graphs (PAGs), a structured HOI representation distilled from large language models (LLMs) that encodes fine-grained part information and contact relations. We then use these PAGs to guide a three-stage synthesis approach: first, decomposing input 3D objects into geometric parts; then, generating reference HOI videos from text prompts, from which we extract part-based motion constraints; finally, optimizing for 4D HOI motion sequences that not only mimic the reference dynamics but also satisfy part-level contact constraints.
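To make the PAG representation concrete, the following is a minimal Python sketch of how such a graph might be structured, assuming nodes for human body parts and object parts and edges for part-level contact relations; all class and field names are illustrative assumptions, not the paper's actual implementation or API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Part Affordance Graph (PAG): nodes are human body
# parts and object parts, edges encode part-level contact relations distilled
# from an LLM. Names and fields are illustrative, not the paper's code.

@dataclass
class PAGNode:
    name: str          # e.g. "pelvis" or "chair_seat"
    kind: str          # "body_part" or "object_part"

@dataclass
class PAGEdge:
    body_part: str     # human body part in contact
    object_part: str   # object part it engages
    relation: str      # e.g. "grasp", "sit_on", "lean_against"

@dataclass
class PartAffordanceGraph:
    nodes: list[PAGNode] = field(default_factory=list)
    edges: list[PAGEdge] = field(default_factory=list)

    def contacts_for(self, body_part: str) -> list[PAGEdge]:
        """Return all contact relations involving a given body part."""
        return [e for e in self.edges if e.body_part == body_part]

# Example: a PAG that an LLM might produce for "sit on the chair".
pag = PartAffordanceGraph(
    nodes=[
        PAGNode("pelvis", "body_part"),
        PAGNode("back", "body_part"),
        PAGNode("chair_seat", "object_part"),
        PAGNode("chair_backrest", "object_part"),
    ],
    edges=[
        PAGEdge("pelvis", "chair_seat", "sit_on"),
        PAGEdge("back", "chair_backrest", "lean_against"),
    ],
)
print(pag.contacts_for("pelvis"))
```

Such part-level contact edges are what the synthesis stages can consume as constraints, in contrast to a single whole-body contact signal.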
Extensive experiments show that our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation.
HOI-PAGE generates realistic 4D human-object interaction motions from a given set of 3D objects and a text prompt. We introduce Part Affordance Graphs (PAGs) to capture how specific object parts relate to human body parts (top-middle). The PAG is distilled from a large language model based on the text prompt and is used to guide a three-stage generation pipeline: object part decomposition, reference HOI video generation, and part-guided 4D HOI optimization.
Our part affordance-guided generations are more realistic and better aligned with the text prompt than baselines that only model overall full-body and object motions and require extensive captured interaction data for supervision.