Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions

Ritsumeikan University, Soka University, Kyoto University
Under Review

*Corresponding Author

Abstract

Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as ``Bring me that cup,'' even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), a multimodal exophora resolution framework that leverages sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and integrates linguistic queries with skeletal detection to estimate candidate target objects. SSL is used to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed success rates approximately 1.3 times higher when the user was visible to the robot and 2.0 times higher when the user was not visible, compared to methods without SSL and interactive questioning.

Overview

[Figure: problem setting and MIEL's approach to ambiguous out-of-view instructions]

This study addresses situations where a user gives a robot an ambiguous command containing a demonstrative (e.g., “Take that for me”), especially when the user is outside the robot’s view and non-verbal cues such as pointing are unavailable, making it difficult to identify what “that” refers to (left). To address this issue, MIEL (a) uses sound source localization to estimate the user’s direction and obtain pointing information, (b) narrows down candidate objects through exophora resolution, and (c) supplements missing information by asking questions generated with GPT-4o.
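As a rough illustration of step (a), the sketch below (written in Python and not taken from the authors' implementation) converts an SSL-estimated sound source azimuth into the yaw correction that turns the robot toward the speaker, so that skeletal detection and pointing recognition become possible. How the resulting command is sent to the robot base is platform-specific and omitted here.

import math

def yaw_correction_toward_speaker(current_yaw_rad: float, source_azimuth_rad: float) -> float:
    """Yaw correction (radians) that turns the robot toward the SSL-estimated
    sound source so the user enters the camera's field of view.
    Both angles are assumed to be expressed in the same world frame."""
    error = source_azimuth_rad - current_yaw_rad
    # Wrap the error to (-pi, pi] so the robot always takes the shorter turn.
    return math.atan2(math.sin(error), math.cos(error))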

Multimodal Interactive Exophora resolution with user Localization (MIEL)

[Figure: overview of the MIEL framework]

An overview of MIEL. Linguistic queries from the user, including demonstratives, are first processed into semantic and visual representations using SentenceBERT and CLIP encoders. A semantic map provides object locations and visual data, while skeletal detection and SSL determine the user's direction and pointing gestures. Three estimators generate probabilities over candidate target objects. If the initial identification is ambiguous, GPT-4o engages the user interactively through targeted questions, refining object identification and ensuring robust exophora resolution.
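This page does not state how the three estimators' outputs are combined, so the following is only an illustrative fusion rule, assuming each estimator returns a probability distribution over the same list of candidate objects: multiply the distributions element-wise (treating the estimators as independent evidence) and renormalize.

import numpy as np

def fuse_estimators(distributions: list[np.ndarray]) -> np.ndarray:
    """Fuse per-estimator probabilities over the same candidate objects.
    Product-and-renormalize is one plausible choice, not necessarily MIEL's."""
    fused = np.ones_like(distributions[0], dtype=float)
    for dist in distributions:
        fused *= dist
    total = fused.sum()
    # Fall back to a uniform distribution if every candidate was zeroed out.
    return fused / total if total > 0 else np.full(len(fused), 1.0 / len(fused))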

Linguistic Query-based Estimator

[Figure: linguistic query-based estimator]

The linguistic query-based estimator computes the probability of each candidate object being the target by multiplying and normalizing two cosine similarities: one between the SentenceBERT-encoded linguistic query and the object label features stored in the semantic map, and the other between the CLIP-encoded query and the objects' visual features from the same map.
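A minimal sketch of this scoring rule, assuming the SentenceBERT and CLIP embeddings of the query and of each mapped object have already been computed (the function and variable names here are illustrative, not from the authors' code). Because the two similarities are multiplied, an object scores highly only when both its label and its appearance agree with the query.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def linguistic_query_probabilities(query_sbert, query_clip, label_embeds, visual_embeds):
    """Probability of each mapped object being the target: the product of the
    label/SentenceBERT and visual/CLIP cosine similarities, normalized over
    all candidates. Embedding extraction is assumed to happen elsewhere."""
    scores = np.array([
        cosine(query_sbert, label) * cosine(query_clip, visual)
        for label, visual in zip(label_embeds, visual_embeds)
    ])
    scores = np.clip(scores, 0.0, None)  # guard against negative products
    total = scores.sum()
    return scores / total if total > 0 else np.full(len(scores), 1.0 / len(scores))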

Interactive Questioning Module

[Figure: interactive questioning module]

An overview of the module for interactive questioning. The red arrows represent the process flow when the target object is successfully identified, while the blue arrows indicate the process flow when the target object remains unidentified.
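A hedged sketch of how such a red/blue decision loop could look, with generate_question and interpret_answer standing in for the GPT-4o calls; the confidence threshold and question budget are illustrative values, not ones reported by the authors.

def resolve_with_questions(candidates, probs, generate_question, interpret_answer, ask_user,
                           threshold=0.6, max_questions=3):
    """Ask clarifying questions until one candidate object is confident enough.

    candidates        -- object labels from the semantic map
    probs             -- fused probabilities from the estimators (same order)
    generate_question -- e.g., a GPT-4o prompt that produces a clarifying question
    interpret_answer  -- e.g., a GPT-4o call mapping the user's answer to per-candidate weights
    ask_user          -- the robot's speech interface
    """
    probs = list(probs)
    for _ in range(max_questions):
        best = max(range(len(candidates)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            return candidates[best]                      # identified: red-arrow path
        question = generate_question(candidates, probs)  # still ambiguous: blue-arrow path
        answer = ask_user(question)
        weights = interpret_answer(question, answer, candidates)
        probs = [p * w for p, w in zip(probs, weights)]
        total = sum(probs) or 1.0
        probs = [p / total for p in probs]
    # Question budget exhausted: fall back to the most probable candidate.
    return candidates[max(range(len(candidates)), key=lambda i: probs[i])]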

Level of Ambiguous Instructions for Experiments

Linguistic queries were divided into the following three levels. Object features are attributes expressed through referential terms, such as color, shape, and size. Level 1 queries were created first; Level 2 and Level 3 queries were then generated by successively removing information from the Level 1 queries.

  • Level 1: query containing the object class, a demonstrative, and object features (e.g., ``Bring me that stuffed pig.'')
  • Level 2: query containing the object class and a demonstrative (e.g., ``Bring me that doll.'')
  • Level 3: query containing only a demonstrative (e.g., ``Bring me that.'')

Result 1: The Success Rate (SR) of Exophora Resolution When the User is Visible from the Robot’s Initial Position.

[Figure: success rates when the user is visible from the robot's initial position]

We evaluated MIEL under conditions where the user was visible from the robot's initial position. MIEL outperformed ECRAP with a 1.3× higher SR overall and over 2× higher SR for level-3 queries lacking object class information, highlighting the benefit of interactive clarification. While MIEL achieved lower SRs than humans (30–60% lower), the gap is attributed to challenges in generating effective questions and the absence of gaze cues, which humans used alongside pointing.

Result 2: The SR of Exophora Resolution When the User is not Visible from the Robot’s Initial Position.

[Figure: success rates when the user is not visible from the robot's initial position]

When the user is not initially visible, MIEL achieves a 3× higher SR than ECRAP by using SSL to locate the user and obtain skeletal data. This allows MIEL to maintain an SR comparable to the visible-user condition, whereas ECRAP's performance drops significantly due to its reliance on limited query information.

Result 3: The Ablation Study for Exophora Resolution. L1, L2, and L3 Indicate Query Levels.

[Figure: ablation study results]

The ablation study shows that Q&A significantly improves SR for higher-level queries, with level-3 SR more than doubling when Q&A is used. SSL also boosts SR by 1.7×, confirming its role in acquiring user skeletal data when initially unavailable. Together, these components are crucial for handling ambiguous queries in MIEL.

Prompts

[Figure: prompt]

Prompts for Identifying Target Objects based on the Results of Exophora Resolution

[Figure: prompt]

Prompts for Generating a Question

[Figure: prompt]

Prompts for Identifying Target Objects based on the User's Answers

BibTeX


      @inproceedings{oyama2025miel,
        author={Oyama, Akira and Hasegawa, Shoichi and Taniguchi, Akira and Hagiwara, Yoshinobu and Taniguchi, Tadahiro},
        title={Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions},
        year={2025},
        note={Under review}
      }
    

Related Research

Laboratory Information

Funding

This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975, JP22K12212) and the JST Moonshot Research & Development Program (Grant Number JPMJMS2011).