Gesture2Prop: Conditional VR Prop Generation from Hand Gestures and Speech

Zhihao Yao, Xiwen Yao, Qirui Sun, Haowei Xiong, Kin Wang Lau, and Haipeng Mi. 2025. Gesture2Prop: Conditional VR Prop Generation from Hand Gestures and Speech. In Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST Adjunct '25). ACM, New York, NY, USA, Article 38, 1–4. (CCF-A)

Objective:
Address the complexity and unintuitive workflow of traditional 3D modeling in VR by enabling natural and efficient prop generation through multimodal interaction.

Methods:

· Proposed Gesture2Prop, a multimodal interaction method combining bimanual hand gestures and voice input to infer user intent.

· Designed a semi-constrained 3D generation pipeline integrating gesture recognition, LLM-based intent parsing, image generation (Stable Diffusion + ControlNet), and image-to-3D reconstruction; a sketch of the image-generation step follows this list.

· Implemented real-time interaction in VR (Quest 3, Unity), including gesture-based size estimation, speech-driven refinement, and model-hand alignment using inverse kinematics.
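The constrained image-generation step named above can be illustrated with the Hugging Face diffusers library. This is a minimal sketch under stated assumptions, not the paper's implementation: the checkpoints, the depth-map conditioning, and the prompt are all placeholder choices.

```python
# Minimal sketch of the constrained image-generation stage using the
# Hugging Face diffusers library. The checkpoints, depth conditioning,
# and prompt below are placeholder assumptions, not the paper's setup.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# A ControlNet branch keeps the output consistent with a structural
# constraint (here, a depth map derived from the gesture volume), while
# Stable Diffusion fills in appearance from the text prompt.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def generate_prop_image(prompt: str, constraint: Image.Image) -> Image.Image:
    """Generate a prop image whose structure follows the gesture-derived
    constraint image (e.g., a coarse depth or edge map)."""
    return pipe(
        prompt,
        image=constraint,
        num_inference_steps=20,
        guidance_scale=7.5,
    ).images[0]

# Hypothetical usage: prompt assembled from parsed user intent.
constraint = Image.open("gesture_depth_map.png")  # placeholder input file
prop_image = generate_prop_image(
    "a medieval sword, single object, centered, plain background",
    constraint,
)
prop_image.save("prop_image.png")  # handed to the image-to-3D stage next
```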

Results:

· Developed Gesture2Prop, a VR system enabling rapid generation of graspable 3D props through natural gesture and speech interaction.

· Demonstrated that the system can infer object attributes (e.g., size, grip type, interaction pose) from user input and generate aligned, interactive 3D models within seconds.

· The work was accepted to UIST Adjunct 2025 (CCF-A), recognizing its contribution to multimodal interaction and 3D content creation.

Contribution:
Contributed to system implementation and integration, including gesture recognition, AI-driven generation pipeline, and interaction design for multimodal VR content creation.

Abstract:

Virtual reality is increasingly adopted in education, gaming, and creative applications; however, traditional 3D modeling workflows remain complex and unintuitive, limiting accessibility and real-time interaction. While generative AI has advanced text-to-3D techniques, natural interaction modalities such as hand gestures remain underexplored in immersive environments.

This study presents Gesture2Prop, a multimodal VR system that enables rapid generation of graspable 3D props by combining bimanual hand gestures and voice input. By analyzing the spatial features of hand gestures and integrating large language models for intent understanding, the system infers key object attributes such as size, grasping style, and interaction pose.
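As one concrete reading of "analyzing spatial features of hand gestures", the span between the two palms can bound the object's size, and the relative palm orientation can hint at the grasping style. The sketch below assumes tracked palm positions and normals are already available from the headset's hand tracker; the categories and thresholds are illustrative, not taken from the paper.

```python
# Illustrative sketch of inferring coarse object attributes from bimanual
# hand poses. Assumes the VR runtime already supplies tracked palm
# positions and normals; thresholds and labels are invented for the example.
import numpy as np

def estimate_prop_attributes(left_palm: np.ndarray,
                             right_palm: np.ndarray,
                             left_normal: np.ndarray,
                             right_normal: np.ndarray) -> dict:
    """Estimate object size and a rough grip category from two palm poses."""
    span = float(np.linalg.norm(right_palm - left_palm))  # meters between palms

    # Palms facing each other suggest enclosing an object between the hands;
    # roughly aligned palms suggest a two-handed grip along a handle.
    facing = float(np.dot(left_normal, right_normal))
    grip = "enclosing" if facing < -0.5 else "handle"

    # Coarse size class from the hand span (illustrative cut-offs).
    if span < 0.15:
        size_class = "hand-held"
    elif span < 0.5:
        size_class = "medium"
    else:
        size_class = "large"

    return {"size_m": span, "grip": grip, "size_class": size_class}

# Example with made-up tracking data: palms 30 cm apart, facing each other.
attrs = estimate_prop_attributes(
    left_palm=np.array([-0.15, 1.2, 0.4]),
    right_palm=np.array([0.15, 1.2, 0.4]),
    left_normal=np.array([1.0, 0.0, 0.0]),
    right_normal=np.array([-1.0, 0.0, 0.0]),
)
print(attrs)  # {'size_m': 0.3, 'grip': 'enclosing', 'size_class': 'medium'}
```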

The pipeline combines gesture-based geometric initialization, AI-driven image generation with structural constraints, and image-to-3D reconstruction to produce interactive models aligned with the user’s hands. The system supports real-time iterative refinement through voice commands, enabling an intuitive and expressive creation workflow.
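One plausible realization of the voice-driven refinement loop is to have an LLM rewrite a structured attribute set after each spoken command and then re-run the generation stages. The sketch below uses the OpenAI chat API purely as a stand-in; the paper does not name a specific model, and the refine_attributes helper, prompt, and JSON schema are hypothetical.

```python
# Sketch of LLM-based intent parsing for iterative voice refinement.
# The OpenAI client and model name are stand-ins (the paper does not
# specify which LLM is used); the JSON schema is an assumption.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You update a JSON description of a VR prop. Given the current "
    "attributes and a spoken command, return the full updated JSON with "
    'keys: "object", "style", "size_class", "grip". Return JSON only.'
)

def refine_attributes(current: dict, transcript: str) -> dict:
    """Apply one spoken refinement (e.g., 'make it golden') to the prop spec."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Current: {json.dumps(current)}\nCommand: {transcript}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

spec = {"object": "sword", "style": "medieval", "size_class": "medium",
        "grip": "handle"}
spec = refine_attributes(spec, "make the blade golden and a bit longer")
# The updated spec is re-rendered through the image and image-to-3D stages.
```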

The results demonstrate the potential of multimodal interaction and generative AI to lower the barrier to 3D content creation and to enhance natural interaction in VR environments.

Keywords:
Virtual Reality (VR), Multimodal Interaction, 3D Content Generation, Hand Gesture, Generative AI