Daily-life support robots operating in home environments can accomplish more of the user's instructions by interpreting pointing gestures and understanding the accompanying speech. This study aims to improve the performance of pointing frame estimation by using speech information when a person gives the robot instructions by pointing and speaking. Estimating the pointing frame, the moment at which the user points, helps the robot understand the user's instructions. We therefore perform pointing frame estimation with a time-series model that uses the user's speech, images, and speech-recognized text observed by the robot. In our experiments, we set up realistic communication conditions, such as speech containing everyday conversation, non-upright posture, actions other than pointing, and reference objects outside the robot's field of view. The results showed that adding speech information improved estimation performance, especially for a Transformer model with Mel-Spectrogram features. In the future, this work can be applied to object localization and action planning by robots in 3D environments.
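As an illustration of the audio side of this multimodal input, the sketch below shows one way to turn the robot's recorded speech into per-video-frame Mel-Spectrogram features with librosa. The sample rate, number of mel bands, and camera frame rate are illustrative assumptions, not the settings used in the paper.

```python
import librosa
import numpy as np

# Assumed parameters (the paper does not specify the audio front-end settings).
SAMPLE_RATE = 16000
N_MELS = 64
VIDEO_FPS = 30  # assumed camera frame rate

def mel_features_per_frame(wav_path: str) -> np.ndarray:
    """Return one log-Mel-Spectrogram feature vector per video frame."""
    audio, sr = librosa.load(wav_path, sr=SAMPLE_RATE)
    # Choose the hop length so spectrogram columns align with video frames.
    hop = sr // VIDEO_FPS
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_mels=N_MELS, hop_length=hop
    )
    log_mel = librosa.power_to_db(mel)  # shape: (n_mels, n_frames)
    return log_mel.T                    # shape: (n_frames, n_mels)
```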
We assumed four scenarios in which the user gives pointing instructions to the robot.
We used the F1 score as the evaluation metric.
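The F1 score is the harmonic mean of precision and recall: F1 = 2 · Precision · Recall / (Precision + Recall), computed here over the frames labeled as pointing.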
The proposed method achieved a high score when combining Mel-Spectrogram features with a Transformer architecture.
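To make the architecture concrete, the following PyTorch snippet is a minimal sketch of a Transformer-based per-frame pointing classifier. The input is assumed to be a sequence of per-frame feature vectors (e.g. concatenated Mel-Spectrogram, image, and text embeddings); the feature dimension, layer sizes, and simple concatenation-based fusion are assumptions for illustration, not the authors' exact model.

```python
import torch
import torch.nn as nn

class PointingFrameTransformer(nn.Module):
    """Sketch of a per-frame pointing classifier over multimodal features."""

    def __init__(self, feat_dim: int = 192, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # pointing / non-pointing logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) -> per-frame logits of shape (batch, time)
        h = self.encoder(self.proj(x))
        return self.head(h).squeeze(-1)

# Example usage with random features standing in for real observations.
model = PointingFrameTransformer()
logits = model(torch.randn(2, 300, 192))  # 2 clips, 300 frames each
pointing_prob = torch.sigmoid(logits)     # per-frame pointing probability
```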
@inproceedings{nakagawa2024pointing,
  author={Nakagawa, Hikaru and Hasegawa, Shoichi and Hagiwara, Yoshinobu and Taniguchi, Akira and Taniguchi, Tadahiro},
  title={Pointing Frame Estimation with Audio-Visual Time Series Data for Daily Life Service Robots},
  booktitle={IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
  year={2024},
  note={in press}
}
This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975 and 22K12212), the JST Moonshot Research & Development Program (Grant Number JPMJMS2011), and JST SPRING (Grant Number JPMJSP2101).