Generating realistic two-person interaction motions from text holds immense potential for computer vision and animation. While existing latent motion diffusion models offer compact and efficient representations, they are typically limited to a single canonical body shape and often fail to produce physically plausible contacts; the generated motion sequences therefore exhibit substantial mesh penetration and lack interaction realism. To address these limitations, we propose a contact- and shape-aware latent motion representation and diffusion model (CoShMDM) for generating realistic two-person interactions from text. Our framework first constructs contact-compatible motion using SMPL-based meshes and a normal-alignment-based mesh contact matrix that captures fine-grained mesh-level contacts. To account for shape diversity, we incorporate SMPL shape parameters and iteratively learn contact dynamics across different body shapes. Additionally, a reinforcement-learning-based mesh penetration avoidance policy network, guided by signed distance fields, minimizes mesh penetration while preserving contact fidelity and shape-aware motion. We further employ a dual-encoder VQ-VAE to learn disentangled latent representations of motion and contact, which are then used in a text- and body-shape-conditioned diffusion model. To ensure spatial, temporal, and semantic coherence, we integrate a novel contact and motion consistency module into the diffusion transformer. Extensive evaluations on the InterHuman and InterX datasets show that our method outperforms state-of-the-art approaches, achieving 19% and 17.3% lower mesh penetration and 17.8% and 33.2% higher contact similarity, respectively.
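The SDF-guided penetration avoidance mentioned above can be illustrated with a minimal penalty term. This is a hedged sketch, not the paper's actual reward: the function name `penetration_penalty`, the `margin` parameter, and the linear shaping are all assumptions; it only shows how signed distances (negative inside the other body's mesh) can be turned into a scalar penetration cost that a policy network could be trained to minimize.

```python
import numpy as np

def penetration_penalty(sdf_values, margin=0.0):
    """Illustrative SDF-based penetration term (assumed, not from the paper).

    `sdf_values` are signed distances of one body's query points evaluated
    in the other body's signed distance field (negative = inside the mesh).
    Points deeper than `margin` contribute their penetration depth linearly;
    the sum can serve as a negative reward for a penetration-avoidance policy.
    """
    # Penetration depth: how far each point sits inside the margin surface.
    depth = np.clip(margin - sdf_values, 0.0, None)
    return float(depth.sum())
```

In an RL setting, the per-frame negative of this sum would be one component of the reward, balanced against terms that preserve the intended contacts.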
Overview of CoShMDM: (a) Contact and shape-aware latent motion representation, (b) Contact and shape-aware interaction motion diffusion model (CoShMDM).
Visualization of the Contact and Motion Consistency (CMC) Module. (Left) The mesh contact matrix encodes contact regions between the interacting bodies. (Middle) A bipartite graph captures inter- and intra-skeletal relations. (Right) The resulting features are injected into the self-attention mechanism to enhance spatial and temporal coherence.
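One simple way to "inject" bipartite contact relations into self-attention is as an additive bias on the attention logits. The sketch below assumes this scheme; the exact injection mechanism of the CMC module is not specified here, and the function name, the block layout of the bias, and the use of raw contact weights as logits offsets are all illustrative choices.

```python
import numpy as np

def attention_with_contact_bias(Q, K, V, contact_adj, scale=None):
    """Toy self-attention over 2N joint tokens (persons a and b stacked).

    A bipartite contact adjacency (N x N, joints of a vs. joints of b) is
    injected as an additive bias, encouraging attention across contacting
    joints. This is an assumed formulation, not the paper's exact design.
    """
    n2, d = Q.shape
    n = n2 // 2
    if scale is None:
        scale = 1.0 / np.sqrt(d)
    # Intra-person blocks carry no bias; inter-person blocks carry the
    # bipartite contact adjacency (and its transpose for b -> a).
    bias = np.zeros((n2, n2))
    bias[:n, n:] = contact_adj
    bias[n:, :n] = contact_adj.T
    logits = (Q @ K.T) * scale + bias
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V
```

With zero queries and keys, the bias alone steers attention: a strong contact entry makes a joint attend almost entirely to its contacting counterpart on the other body.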
Mesh contact is computed from the normal alignment and centroid distance between face pairs, forming a contact matrix C_i.
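The face-pair contact test described in this caption can be sketched as follows. This is a minimal assumed implementation: the threshold values, the cosine-alignment criterion (outward normals roughly facing each other give a negative dot product), and the function name are illustrative, not the paper's exact definition of C_i.

```python
import numpy as np

def mesh_contact_matrix(centroids_a, normals_a, centroids_b, normals_b,
                        dist_thresh=0.02, align_thresh=-0.7):
    """Sketch of a face-pair contact matrix (assumed thresholds).

    A face pair is marked in contact when the face centroids are close
    (distance below `dist_thresh`) and the unit face normals point toward
    each other (cosine alignment below `align_thresh`). Inputs have shapes
    (Fa, 3) and (Fb, 3); the result is a binary (Fa, Fb) matrix.
    """
    # Pairwise centroid distances, shape (Fa, Fb).
    diff = centroids_a[:, None, :] - centroids_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Pairwise normal alignment (cosine), shape (Fa, Fb).
    align = normals_a @ normals_b.T
    # Contact: near faces that roughly face each other.
    return ((dist < dist_thresh) & (align < align_thresh)).astype(np.float32)
```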
Interaction features include orientation vectors (head, chest, mid-hip), proximity, and trajectories of the interacting meshes M_a and M_b.
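The three feature families in this caption can be sketched from per-frame joint positions. Everything here is an assumption for illustration: the SMPL-like joint indices, the choice of mid-hip displacement as the trajectory signal, and the minimum joint-to-joint distance as the proximity measure are stand-ins, not the paper's definitions.

```python
import numpy as np

def interaction_features(joints_a, joints_b, head=15, chest=9, hip_l=1, hip_r=2):
    """Illustrative interaction features for two skeletons over T frames.

    `joints_a`, `joints_b` have shape (T, J, 3); joint indices are
    hypothetical SMPL-like defaults. Returns orientation vectors from
    person a toward person b at head/chest/mid-hip, a per-frame proximity
    scalar, and per-person trajectory displacements.
    """
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)

    mid_hip_a = 0.5 * (joints_a[:, hip_l] + joints_a[:, hip_r])
    mid_hip_b = 0.5 * (joints_b[:, hip_l] + joints_b[:, hip_r])
    # Orientation vectors at three landmarks, shape (T, 3, 3).
    orient = np.stack([unit(joints_b[:, head] - joints_a[:, head]),
                       unit(joints_b[:, chest] - joints_a[:, chest]),
                       unit(mid_hip_b - mid_hip_a)], axis=1)
    # Proximity: per-frame minimum joint-to-joint distance, shape (T,).
    d = np.linalg.norm(joints_a[:, :, None] - joints_b[:, None, :], axis=-1)
    proximity = d.min(axis=(1, 2))
    # Trajectories: frame-to-frame mid-hip displacement, shape (T, 3).
    traj_a = np.diff(mid_hip_a, axis=0, prepend=mid_hip_a[:1])
    traj_b = np.diff(mid_hip_b, axis=0, prepend=mid_hip_b[:1])
    return orient, proximity, traj_a, traj_b
```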