May 31, 2024
“What’s over there?” “How do I solve this math problem?”
If you try asking a voice assistant (VA) like Siri or Alexa such questions, you won’t get much information. While VAs are transforming human-computer interaction, they can’t see what you’re looking at or where you’re pointing. CREATE Ph.D. student Jaewook Lee has led an evaluation of GazePointAR, a fully functional, context-aware VA for wearable augmented reality (AR) that uses eye gaze, pointing gestures, and conversation history to make sense of spoken questions.

GazePointAR uses real-time gaze tracking, pointing gesture recognition, and computer vision to replace “this” with “packaged item with text that says orion pocachip original,” which is then sent to a large language model for processing; the response is read aloud by a text-to-speech engine.
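
That caption describes a pipeline that can be sketched in a few lines: the spoken question is captured, ambiguous pronouns are swapped for a vision-derived description of what the user is looking at or pointing to, the grounded question goes to a language model, and the answer is spoken back. Below is a rough, hypothetical illustration of that flow, not the authors’ implementation; the function names and the pronoun list are placeholders and do not come from the paper.

```python
import re

# Minimal sketch of the pronoun-replacement idea described above.
# Every function is a hypothetical stand-in for the headset's gaze-tracking,
# gesture-recognition, vision, LLM, and text-to-speech components.

def describe_gazed_object() -> str | None:
    """Placeholder: caption the object the user's eye gaze rests on."""
    return "packaged item with text that says orion pocachip original"

def describe_pointed_object() -> str | None:
    """Placeholder: caption the object the user is pointing at, if any."""
    return None

def ask_llm(query: str) -> str:
    """Placeholder: send the grounded question to a large language model."""
    return f"(LLM answer to: {query})"

def speak(text: str) -> None:
    """Placeholder: read the answer aloud with a text-to-speech engine."""
    print(text)

def disambiguate(query: str, gazed: str | None, pointed: str | None) -> str:
    """Replace ambiguous pronouns with a description of what the user is
    pointing at (preferred) or looking at."""
    target = pointed or gazed
    if target is None:
        return query  # nothing to ground the pronoun in
    return re.sub(r"\b(this|that|it)\b", target, query, flags=re.IGNORECASE)

def answer(spoken_query: str) -> None:
    grounded = disambiguate(spoken_query,
                            describe_gazed_object(),
                            describe_pointed_object())
    speak(ask_llm(grounded))

answer("How many calories are in this?")
# -> (LLM answer to: How many calories are in packaged item with text
#    that says orion pocachip original?)
```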
Lee, along with advisor Jon E. Froehlich and fellow researchers, evaluated GazePointAR by comparing it to two commercial systems. The team also studied GazePointAR’s pronoun handling across three assigned tasks and its responses to participants’ own questions. In short, participants appreciated the naturalness and human-like feel of pronoun-driven queries, although pronoun use was sometimes counterintuitive.
Lee presented the team’s paper, GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality, at CHI ‘24, sharing a first-person diary study illustrating how GazePointAR performs in the wild. The paper, whose authors also include Jun Wang, Elizabeth Brown, Liam Chu, and Sebastian S. Rodriguez, enumerates limitations and design considerations for future context-aware VAs.