Pointing Frame Estimation with Audio-Visual Time Series Data for Daily Life Service Robots

1Ritsumeikan University, 2Soka University, 3Kyoto University
Winner of the Best Paper Award at IEEE SMC 2024 (1st of 755)!

*Corresponding Author

Abstract

Daily life support robots in home environments can accomplish more instructions when they interpret the user's pointing gestures together with the accompanying verbal instructions. This study aims to improve the performance of pointing frame estimation by using speech information when a person gives pointing or verbal instructions to a robot. The pointing frame, i.e., the moment at which the user points, helps the robot understand the instruction. We therefore perform pointing frame estimation with a time-series model that uses the user's speech, images, and speech-recognized text observed by the robot. In our experiments, we set up realistic communication conditions, such as speech containing everyday conversation, non-upright posture, actions other than pointing, and reference objects outside the robot's field of view. The results showed that adding speech information improved estimation performance, especially for the Transformer model with Mel-spectrogram features. In the future, this work could be applied to object localization and action planning by robots in 3D environments.

Overview

image of model

An overview of our study. The robot estimates the moment at which a person gives a pointing instruction, using the person's speech, the video showing the pointing gesture, and the speech-recognized sentence.

Pointing Frame Estimator with Audio-Visual Time Series Data

image of model

An overview of the proposed pointing frame estimator. The model takes the user's speech, the video of the pointing gesture, and the speech-recognized text as time-series inputs and estimates the frame at which the pointing instruction is given.
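
The paper itself is the reference for the model details. As a rough, non-authoritative illustration of the idea, the sketch below shows one way a per-frame pointing classifier over time-aligned audio, visual, and text features could look in PyTorch; all layer sizes, feature dimensions, and the simple additive fusion are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PointingFrameEstimator(nn.Module):
        """Hypothetical sketch: classify each frame as pointing / not pointing
        from fused audio, visual, and text features (dimensions are illustrative)."""

        def __init__(self, audio_dim=64, visual_dim=512, text_dim=768, d_model=256):
            super().__init__()
            # Project each modality to a shared dimension before fusion.
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.visual_proj = nn.Linear(visual_dim, d_model)
            self.text_proj = nn.Linear(text_dim, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
            self.classifier = nn.Linear(d_model, 1)  # one logit per frame

        def forward(self, audio_feats, visual_feats, text_feats):
            # Each input: (batch, num_frames, feature_dim), time-aligned per video frame.
            fused = (self.audio_proj(audio_feats)
                     + self.visual_proj(visual_feats)
                     + self.text_proj(text_feats))
            encoded = self.encoder(fused)                 # (batch, num_frames, d_model)
            return self.classifier(encoded).squeeze(-1)   # (batch, num_frames) logits

    # Example on random, time-aligned features for a 100-frame clip.
    model = PointingFrameEstimator()
    logits = model(torch.randn(1, 100, 64), torch.randn(1, 100, 512), torch.randn(1, 100, 768))
    pointing_frames = torch.sigmoid(logits) > 0.5         # per-frame pointing decision

Training such a classifier against per-frame pointing labels with a binary cross-entropy loss would yield the frame-wise estimates evaluated in the results below.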

Scenarios of Instructions

We assume four scenarios in which the user gives pointing instructions to the robot, reflecting realistic communication conditions such as everyday conversational speech, non-upright posture, actions other than pointing, and reference objects outside the robot's field of view.

Pointing frame estimation results

We used the F1 score as the evaluation metric.
The proposed method achieved a high score when combining Mel-spectrogram features with a Transformer architecture.

image of representations
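
As a hedged illustration of how Mel-spectrogram audio features might be computed and roughly aligned with video frames, the sketch below uses torchaudio; the file name, sample rate, FFT size, and hop length are assumptions for illustration rather than the settings used in the paper.

    import torch
    import torchaudio

    SAMPLE_RATE = 16_000                    # Hz (assumed)
    VIDEO_FPS = 30                          # video frame rate to align with (assumed)
    hop_length = SAMPLE_RATE // VIDEO_FPS   # one spectrogram column per video frame

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=hop_length, n_mels=64)

    # Hypothetical recording of a pointing instruction; shape (channels, samples).
    waveform, sr = torchaudio.load("instruction.wav")
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

    # (channels, n_mels, time) -> log scale, mono, time-major: (time, n_mels)
    features = torch.log(mel(waveform) + 1e-6).mean(dim=0).transpose(0, 1)
    print(features.shape)                   # roughly one 64-dim vector per video frame

Each row of features can then serve as the audio input to the time-series model for the corresponding video frame.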

Poster

BibTeX


      @inproceedings{nakagawa2024pointing,
        author    = {Nakagawa, Hikaru and Hasegawa, Shoichi and Hagiwara, Yoshinobu and Taniguchi, Akira and Taniguchi, Tadahiro},
        title     = {Pointing Frame Estimation with Audio-Visual Time Series Data for Daily Life Service Robots},
        booktitle = {IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
        year      = {2024},
        note      = {In press}
      }
    

Laboratory Information

Funding

This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975 and 22K12212), the JST Moonshot Research & Development Program (Grant Number JPMJMS2011), and JST SPRING (Grant Number JPMJSP2101).