Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions

Ritsumeikan University, Soka University, Kyoto University
Under Review

*Corresponding Author

Abstract

Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as ``Bring me that cup,'' even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), a multimodal exophora resolution framework that leverages sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and integrates linguistic queries with skeletal detection to estimate candidate target objects. SSL is used to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed success rates approximately 1.3 times higher when the user was visible to the robot and 2.0 times higher when the user was not visible, compared to methods without SSL and interactive questioning.

Overview

[Figure: problem setting and MIEL's approach to ambiguous out-of-view instructions]

This study addresses situations where a user gives a robot an ambiguous command containing a demonstrative (e.g., “Take that for me”), especially when the user is outside the robot’s view and non-verbal cues such as pointing are unavailable, making it difficult to identify what “that” refers to (left). To address this issue, MIEL (a) uses sound source localization to estimate the user’s direction and obtain pointing information, (b) narrows down candidate objects through exophora resolution, and (c) supplements missing information by asking questions generated with GPT-4o.
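As a rough illustration of step (a), the sketch below (written in Python and not taken from the authors' implementation) converts an SSL-estimated sound source azimuth into the yaw correction that turns the robot toward the speaker, so that skeletal detection and pointing recognition become possible. How the resulting command is sent to the robot base is platform-specific and omitted here.

import math

def yaw_correction_toward_speaker(current_yaw_rad: float, source_azimuth_rad: float) -> float:
    """Yaw correction (radians) that turns the robot toward the SSL-estimated
    sound source so the user enters the camera's field of view.
    Both angles are assumed to be expressed in the same world frame."""
    error = source_azimuth_rad - current_yaw_rad
    # Wrap the error to (-pi, pi] so the robot always takes the shorter turn.
    return math.atan2(math.sin(error), math.cos(error))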

Multimodal Interactive Exophora resolution with user Localization (MIEL)

[Figure: overview of the MIEL framework]

An overview of MIEL. Linguistic queries from the user, including demonstratives, are first processed into semantic and visual representations using SentenceBERT and CLIP encoders. A semantic map provides object locations and visual data, while skeletal detection and SSL determine the user's direction and pointing gestures. Three estimators generate probabilities over candidate target objects. If the initial identification is ambiguous, GPT-4o engages the user interactively through targeted questions, refining object identification and ensuring robust exophora resolution.
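This page does not state how the three estimators' outputs are combined, so the following is only an illustrative fusion rule, assuming each estimator returns a probability distribution over the same list of candidate objects: multiply the distributions element-wise (treating the estimators as independent evidence) and renormalize.

import numpy as np

def fuse_estimators(distributions: list[np.ndarray]) -> np.ndarray:
    """Fuse per-estimator probabilities over the same candidate objects.
    Product-and-renormalize is one plausible choice, not necessarily MIEL's."""
    fused = np.ones_like(distributions[0], dtype=float)
    for dist in distributions:
        fused *= dist
    total = fused.sum()
    # Fall back to a uniform distribution if every candidate was zeroed out.
    return fused / total if total > 0 else np.full(len(fused), 1.0 / len(fused))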

Linguistic Query-based Estimator

[Figure: linguistic query-based estimator]

The linguistic query-based estimator computes the probability of each candidate object being the target by multiplying and normalizing two cosine similarities: one between the SentenceBERT-encoded linguistic query and the object label features stored in the semantic map, and the other between the CLIP-encoded query and the objects' visual features from the same map.
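A minimal sketch of this scoring rule, assuming the SentenceBERT and CLIP embeddings of the query and of each mapped object have already been computed (the function and variable names here are illustrative, not from the authors' code). Because the two similarities are multiplied, an object scores highly only when both its label and its appearance agree with the query.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def linguistic_query_probabilities(query_sbert, query_clip, label_embeds, visual_embeds):
    """Probability of each mapped object being the target: the product of the
    label/SentenceBERT and visual/CLIP cosine similarities, normalized over
    all candidates. Embedding extraction is assumed to happen elsewhere."""
    scores = np.array([
        cosine(query_sbert, label) * cosine(query_clip, visual)
        for label, visual in zip(label_embeds, visual_embeds)
    ])
    scores = np.clip(scores, 0.0, None)  # guard against negative products
    total = scores.sum()
    return scores / total if total > 0 else np.full(len(scores), 1.0 / len(scores))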

Interactive Questioning Module

[Figure: interactive questioning module]

An overview of the module for interactive questioning. The red arrows represent the process flow when the target object is successfully identified, while the blue arrows indicate the process flow when the target object remains unidentified.
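A hedged sketch of how such a red/blue decision loop could look, with generate_question and interpret_answer standing in for the GPT-4o calls; the confidence threshold and question budget are illustrative values, not ones reported by the authors.

def resolve_with_questions(candidates, probs, generate_question, interpret_answer, ask_user,
                           threshold=0.6, max_questions=3):
    """Ask clarifying questions until one candidate object is confident enough.

    candidates        -- object labels from the semantic map
    probs             -- fused probabilities from the estimators (same order)
    generate_question -- e.g., a GPT-4o prompt that produces a clarifying question
    interpret_answer  -- e.g., a GPT-4o call mapping the user's answer to per-candidate weights
    ask_user          -- the robot's speech interface
    """
    probs = list(probs)
    for _ in range(max_questions):
        best = max(range(len(candidates)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            return candidates[best]                      # identified: red-arrow path
        question = generate_question(candidates, probs)  # still ambiguous: blue-arrow path
        answer = ask_user(question)
        weights = interpret_answer(question, answer, candidates)
        probs = [p * w for p, w in zip(probs, weights)]
        total = sum(probs) or 1.0
        probs = [p / total for p in probs]
    # Question budget exhausted: fall back to the most probable candidate.
    return candidates[max(range(len(candidates)), key=lambda i: probs[i])]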

Level of Ambiguous Instructions for Experiments

Linguistic queries were divided into the following three levels. Object features are attributes expressed through referential terms, such as color, shape, and size. Level 1 queries were created first; Level 2 and Level 3 queries were then generated by successively removing information from the Level 1 queries.

  • Level 1: query containing the object class, a demonstrative, and object features (e.g., ``Bring me that stuffed pig.'')
  • Level 2: query containing the object class and a demonstrative (e.g., ``Bring me that doll.'')
  • Level 3: query containing only a demonstrative (e.g., ``Bring me that.'')

Result 1: The Success Rate (SR) of Exophora Resolution When the User is Visible from the Robot’s Initial Position.

[Figure: success rates when the user is visible from the robot's initial position]

We evaluated MIEL under conditions where the user was visible from the robot's initial position. MIEL outperformed ECRAP with a 1.3× higher SR overall and over 2× higher SR for level-3 queries lacking object class information, highlighting the benefit of interactive clarification. While MIEL achieved lower SRs than humans (30–60% lower), the gap is attributed to challenges in generating effective questions and the absence of gaze cues, which humans used alongside pointing.

Result 2: The SR of Exophora Resolution When the User is not Visible from the Robot’s Initial Position.

[Figure: success rates when the user is not visible from the robot's initial position]

When the user is not initially visible, MIEL achieves a 3× higher SR than ECRAP by using SSL to locate the user and obtain skeletal data. This allows MIEL to maintain an SR comparable to the visible-user condition, whereas ECRAP's performance drops significantly due to its reliance on limited query information.

Result 3: The Ablation Study for Exophora Resolution. L1, L2, and L3 Indicate Query Levels.

[Figure: ablation study results]

The ablation study shows that Q&A significantly improves SR for higher-level queries, with level-3 SR more than doubling when Q&A is used. SSL also boosts SR by 1.7×, confirming its role in acquiring user skeletal data when initially unavailable. Together, these components are crucial for handling ambiguous queries in MIEL.

Prompts

[Figure: prompt]

Prompts for Identifying Target Objects based on the Results of Exophora Resolution

[Figure: prompt]

Prompts for Generating a Question

[Figure: prompt]

Prompts for Identifying Target Objects based on the User's Answers

BibTeX


      @inproceedings{oyama2025miel,
        author={Oyama, Akira and Hasegawa, Shoichi and Taniguchi, Akira and Hagiwara, Yoshinobu and Taniguchi, Tadahiro},
        title={Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions},
        year={2025},
        note={Under review}
      }
    

Related Research

Laboratory Information

Funding

This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975, JP22K12212) and the JST Moonshot Research & Development Program (Grant Number JPMJMS2011).