audio and visual information or the short-term and long-term segments, and they have not explored combining important conversation cues: the facial and audio features. In addition, no audiovisual spatiotemporally annotated dataset captured in mixed human-to-human and human-to-robot settings is available to support exploring the area using new approaches.
This work presents a novel
AD dataset recorded in mixed
human-to-human and human-to-robot settings called
Extended MuMMER (E-MuMMER), extending the existing
MuMMER dataset. E-MuMMER consists of spatiotemporal
annotations of spoken activity. We also
propose a two-stream-based deep learning framework
called ADNet, which uses long-term and short-term
audiovisual features to predict the addressee. It consists
of audio and video stream encoders, an audiovisual
cross-attention (CA) approach for intermodality interaction,
bilinear fusion (BLF) to combine the two modalities,
and a self-attention (SA) approach to capture
long-term speech activity. This work is the first to introduce a deep learning framework that leverages both facial and audio features for AD. As a result, this study presents a new AD research paradigm, and E-MuMMER is the first dataset designed to promote AD research in such mixed settings.
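As a rough illustration of how these components fit together, the PyTorch-style sketch below wires up two short-term stream encoders, audiovisual cross-attention, bilinear fusion, and self-attention over the resulting sequence of fused segments. The encoder types, feature dimensions, pooling, and layer counts are assumptions chosen for illustration, not the actual ADNet configuration.

# Minimal sketch of a two-stream AD model in the spirit of ADNet.
# All sizes and encoder choices here are illustrative assumptions.
import torch
import torch.nn as nn

class ADNetSketch(nn.Module):
    def __init__(self, dim=128, n_heads=4, n_classes=2):
        super().__init__()
        # Short-term stream encoders (assumed GRUs over per-frame features:
        # e.g., 40-d audio MFCCs and 512-d face-crop embeddings).
        self.audio_enc = nn.GRU(40, dim, batch_first=True)
        self.video_enc = nn.GRU(512, dim, batch_first=True)
        # Audiovisual cross-attention (CA) for intermodality interaction.
        self.av_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.va_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Bilinear fusion (BLF) of the two attended modalities.
        self.blf = nn.Bilinear(dim, dim, dim)
        # Self-attention (SA) over the long-term sequence of fused segments.
        sa_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.long_term_sa = nn.TransformerEncoder(sa_layer, num_layers=2)
        self.classifier = nn.Linear(dim, n_classes)  # addressee label per segment

    def forward(self, audio, video):
        # audio: (B, S, Ta, 40), video: (B, S, Tv, 512); S short-term segments.
        B, S = audio.shape[:2]
        a, _ = self.audio_enc(audio.flatten(0, 1))   # (B*S, Ta, dim)
        v, _ = self.video_enc(video.flatten(0, 1))   # (B*S, Tv, dim)
        a2v, _ = self.av_attn(a, v, v)               # audio attends to video
        v2a, _ = self.va_attn(v, a, a)               # video attends to audio
        fused = self.blf(a2v.mean(dim=1), v2a.mean(dim=1))  # (B*S, dim)
        context = self.long_term_sa(fused.view(B, S, -1))   # (B, S, dim)
        return self.classifier(context)              # (B, S, n_classes)

# Dummy forward pass: 2 clips, 4 segments, 20 audio / 5 video frames each.
model = ADNetSketch()
print(model(torch.randn(2, 4, 20, 40), torch.randn(2, 4, 5, 512)).shape)
# torch.Size([2, 4, 2])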
The raw MuMMER dataset is available from https://
www.idiap.ch/en/dataset/mummer, and the new annotation
and source code will be available at https://github.com/
falmi/addressee-detection.
Introduction
The fundamental challenge in humanoid robotics is endowing a robot with smart audiovisual perception capabilities that help it interact and cooperate naturally with humans [1]. One way of enriching these capabilities is to let the robot identify whether it is being addressed, which helps it decide whether to react to a person's utterances. This capability is mainly required for a guide robot, a companion acting as an assistant, a robot butler, a robot lifeguard, and a mobile nursing care robot [2]. However, despite a few prior works, the area has not been widely explored with state-of-the-art approaches in realistic settings using useful communication cues.
Naturally, we humans have several senses that enable
us to gather the necessary multimodal information to
decide whether we are being addressed. Visual and audio information are the most evident clues that help us identify whether someone is addressing us. We mostly use the
visual information originating from head position and gaze
since we tend to look at the person (or object) we are talking
to [3]. On the other hand, during conversation, humans
show different facial motions to
control the flow of a dialogue,
address someone, help change
what is said, or convey complex
intentions without saying a word.
Most of these facial expressions
rely primarily on the movement of
a single area or sequence of facial
areas to convey meaning. In particular,
the combination of motions of
the mouth region, the rigid head,
eyes, and eyebrows plays a significant
role in changing the course of
a conversation [4]. Furthermore, human communication
studies have also proven that facial movements play an
important role in regulating conversation [4].
The next cue is based on the observation that humans tend to change their manner of speech depending on whom they are addressing. This can easily be observed in conversations between adults and children who lack communication experience [5], between healthy people and hearing-impaired people [6], or
between close friends and strangers. Similarly, humans
do not yet perceive robots as full conversational agents.
As a result, humans change their normal manner of
speech by making it rhythmic and generally easier to
understand as soon as they start talking to the agent [7],
or they talk to the agent as if they were talking to a child
[5]. Furthermore, studies in human-robot interaction (HRI) indicate that humans tend to speak to a computer more loudly and slowly than when speaking to other humans [8], [9]. A recent HRI study also found that acoustic
information is the most useful clue for AD compared to
other modalities [10].
Following these findings, previous AD studies have tried to address this problem by combining gaze and utterance information [11], [12], [13]. However, these approaches fail to detect the addressee when the speaker addresses someone without looking at them. In addition, existing works [2] do not benefit much from the available audio and visual information or from the long-term and short-term segments.
Most of these studies focus on segment-level (single-image) information that is only 0.2-0.6 s long. However, it is difficult to predict speaking activity from a single image or a 0.2-s video segment. Humans consider the whole sentence, which spans hundreds of video frames, to decide whether a person is addressing someone else. For instance, a 5-s video contains an average of 15 words, whereas a short-term segment of 0.2 s does not even cover a complete word. Combining audio and visual modalities, especially facial regions, is advancing in related areas such as lip synchronization [14], lipreading [15], voice activity detection, and active speaker detection [16]. However, to the best of our knowledge, no prior work has explored AD by combining
facial regions with audio features. This study explores precisely this combination.
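To put these numbers in perspective, the short calculation below converts segment durations into approximate frame and word counts. The 25-fps frame rate is an assumed value, and the speaking rate is derived from the article's figure of roughly 15 words per 5-s video; both are used only for illustration.

# Rough scale of a short-term segment vs. a full utterance.
FPS = 25                     # assumed camera frame rate
WORDS_PER_SECOND = 15 / 5    # from ~15 words in a 5-s video

for label, seconds in [("0.2-s short-term segment", 0.2), ("5-s utterance", 5.0)]:
    frames = seconds * FPS
    words = seconds * WORDS_PER_SECOND
    print(f"{label}: about {frames:.0f} video frames and {words:.1f} words")

# 0.2-s short-term segment: about 5 video frames and 0.6 words
# 5-s utterance: about 125 video frames and 15.0 words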