speaker detection [40]. It begins with a spatiotemporal (3D) convolutional layer and then progressively reduces the spatial dimension with an 18-layer residual network (ResNet-18) [39]. This front end learns each video frame's spatial information and encodes the video frame stream into a sequence of frame-based embeddings. The visual temporal convolutional module is proposed to represent the temporal content of the long-term visual spatiotemporal stream. The network contains five residual-connected linear units, batch normalization, and depthwise separable convolutional (Conv1D) layers. Finally, a Conv1D layer is attached to reduce the feature dimension to 128. In this way, the VSE represents temporal content over a long-term visual spatiotemporal window. For instance, if the VSE has a receptive field of nine video frames, it takes a segment of up to 600 ms to encode a video embedding when the video frame rate is 15 fps.
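As a rough illustration of this front end, the following PyTorch-style sketch wires together a 3D convolutional stem, a frame-wise ResNet-18 trunk, and five residual depthwise-separable Conv1D blocks that project to a 128-dimensional embedding. The specific channel widths, kernel sizes, single-channel (grayscale) input, and the internal composition of each temporal block are assumptions made for illustration; only the overall structure follows the text.

```python
# Hypothetical sketch of the VSE described above (PyTorch).
# The 3D-conv stem, ResNet-18 trunk, and five residual depthwise-separable
# Conv1D blocks follow the text; channel widths and kernel sizes are assumed.
import torch
import torch.nn as nn
import torchvision


class DSConv1dBlock(nn.Module):
    """Residual block: depthwise-separable Conv1D + batch norm + ReLU (assumed layout)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.act(self.bn(self.pointwise(self.depthwise(x))))


class VSE(nn.Module):
    """Visual speaker encoder: 3D-conv stem -> per-frame ResNet-18 -> temporal Conv1D."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Spatiotemporal stem: a single 3D convolution over (T, H, W).
        self.stem = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # ResNet-18 trunk applied frame by frame (input conv and classifier replaced).
        trunk = torchvision.models.resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, 3, padding=1, bias=False)
        trunk.fc = nn.Identity()                       # 512-D feature per frame
        self.trunk = trunk
        # Visual temporal convolutional module: five residual DS-Conv1D blocks.
        self.temporal = nn.Sequential(
            nn.Conv1d(512, 512, 1),
            *[DSConv1dBlock(512) for _ in range(5)],
            nn.Conv1d(512, embed_dim, 1),              # final Conv1D reduces to 128-D
        )

    def forward(self, frames):                         # frames: (B, 1, T, H, W)
        x = self.stem(frames)                          # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch
        x = self.trunk(x).reshape(b, t, -1)            # (B, T, 512)
        ev = self.temporal(x.transpose(1, 2))          # (B, 128, T)
        return ev.transpose(1, 2)                      # Ev: (B, T, 128)
```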
The ASE
The ASE is proposed to learn an audio feature representation from the temporal dynamics. Following its success in active speaker detection [38], [40], the ResNet-34 network with a squeeze-and-excitation module [41], presented in [42], was adopted for the ASE. The ASE takes mel-frequency cepstral coefficients (MFCCs) as input, with 13 mel-frequency bands at each timestep. The network
inputs a sequence of audio frames to generate the
sequence of audio embeddings Ea. The ASE output feature dimension is set to (1, 128). ResNet-34 is designed with dilated convolutions to match the time resolution of the audio embeddings Ea with that of the visual embeddings Ev, which eases the subsequent attention module. For
instance, if the ASE has a receptive field of 90 audio
frames, we take a segment of 900 ms to encode an
audio embedding. With this, we capture the long-term
temporal context. The MFCC features are
extracted using a 25-ms analysis window
with a stride of 10 ms, yielding 100 audio
frames every second.
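The MFCC front end is fully specified by these numbers (13 coefficients, a 25-ms window, a 10-ms stride, 100 frames per second); a minimal sketch using torchaudio follows, where the 16-kHz sampling rate and the mel-filterbank size are assumptions.

```python
# Minimal sketch of the MFCC front end described above, using torchaudio.
# A 16-kHz sampling rate is assumed (not stated in the text); 25-ms windows
# with a 10-ms stride then yield the 100 audio frames per second mentioned.
import torch
import torchaudio

SAMPLE_RATE = 16_000                      # assumed
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=13,                            # 13 mel-frequency coefficients per timestep
    melkwargs={
        "n_fft": 512,
        "win_length": int(0.025 * SAMPLE_RATE),   # 25-ms analysis window
        "hop_length": int(0.010 * SAMPLE_RATE),   # 10-ms stride -> 100 frames/s
        "n_mels": 40,                             # assumed mel-filterbank size
    },
)

waveform = torch.randn(1, SAMPLE_RATE)    # 1 s of dummy audio
features = mfcc(waveform)                 # shape: (1, 13, ~101 frames)
print(features.shape)
```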
The Back-End Network
Audiovisual CA
The embedded features Ev and Ea are intended to distinguish the events associated with speaking activity in the visual and audio modalities, respectively. It has been shown
that audiovisual synchronization is an
informative cue in related problems, such
as active speaker detection [40], [43] and
lipreading [43]. Motivated by this fact, we
employed audiovisual synchronization for
AD. Since the audio and visual modalities
are from different inputs, they are not
synchronized. The actual audiovisual
alignment may depend on the speakers'
instantaneous phonetic content and speaking behaviors.
Recently, attention mechanisms have become an integral
part of models that must capture global dependencies
[44]. Hence, we adopted two CA networks [38] for
AD to handle dynamic audiovisual interaction along the
temporal dimension. As shown in Figure 7, the essential
part of the CA network is the attention layer [44]. The
inputs are the vectors of a query (Qa, Qv), key (Ka, Kv), and value (Va, Vv) from the audio and visual embeddings, respectively, each projected by a linear layer. As illustrated in (6) and (7), the outputs are the audio attention feature [audio CA (ACA)] and the visual attention feature [visual CA (VCA)]:

$$\mathrm{ACA}(Q_v, K_a, V_a) = \mathrm{softmax}\left(\frac{Q_v K_a^{T}}{\sqrt{d}}\right) V_a \quad (6)$$

$$\mathrm{VCA}(Q_a, K_v, V_v) = \mathrm{softmax}\left(\frac{Q_a K_v^{T}}{\sqrt{d}}\right) V_v \quad (7)$$

where d denotes the dimension of Q, K, and V. As illustrated
in (6) and (7), the ACA generates a new, interaction-aware audio feature by taking its query from the video stream (the target sequence) and its key and value from the audio stream (the source sequence); the VCA is learned in the same way with the roles of the two streams exchanged. Finally, before the outputs of the two CA modules are fed to the fusion layer, a feedforward layer, a residual connection, and layer normalization are added to form the final cross-modal attention network.
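A minimal sketch of one CA branch, read as standard scaled dot-product attention per (6) and (7), is given below. The single-head form, the 128-dimensional projections, and the placement of the residual connection are assumptions; PyTorch is used only for illustration.

```python
# Hypothetical single-head cross-attention sketch matching (6) and (7).
# Ea and Ev are assumed to be (batch, time, 128) sequences; the single-head
# form and the residual placement are simplifying assumptions.
import math
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """softmax(Q K^T / sqrt(d)) V, with Q from the target and K, V from the source."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, target, source):
        # Q comes from the target stream, K and V from the source stream, as in (6)/(7).
        q, k, v = self.q(target), self.k(source), self.v(source)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        # Residual placement is an assumption: the attended feature is added
        # back to the source embedding, normalized, then passed through a
        # feedforward block with a second residual connection.
        x = self.norm1(source + attn @ v)
        return self.norm2(x + self.ffn(x))


# Usage: ACA queries with video and attends over audio; VCA does the reverse.
e_a, e_v = torch.randn(2, 50, 128), torch.randn(2, 50, 128)
aca_layer, vca_layer = CrossAttention(), CrossAttention()
aca = aca_layer(target=e_v, source=e_a)   # (6): audio attention feature
vca = vca_layer(target=e_a, source=e_v)   # (7): visual attention feature
```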
BLF
BLF has been widely used in multimodal fusion because it joins, as a product, every channel of one network with every channel of the other. We adopt BLF for audiovisual fusion as follows.
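The fusion expression itself is not part of this excerpt, so the sketch below only illustrates the channel-by-channel product idea with a generic bilinear layer; the learned projection and its output dimension are assumptions, not the article's exact formulation.

```python
# Illustrative bilinear fusion of the two attention features (not the article's
# exact formulation): every (audio channel, visual channel) product pair is
# weighted by a learned tensor; the output dimension is an assumption.
import torch
import torch.nn as nn


class BilinearFusion(nn.Module):
    def __init__(self, dim: int = 128, out_dim: int = 128):
        super().__init__()
        # nn.Bilinear computes x1^T W x2 per output unit, i.e. a weighted
        # sum over all channel-by-channel products of the two inputs.
        self.bilinear = nn.Bilinear(dim, dim, out_dim)

    def forward(self, aca, vca):                    # (B, T, dim) each
        return self.bilinear(aca, vca)              # fused feature: (B, T, out_dim)


fused = BilinearFusion()(torch.randn(2, 50, 128), torch.randn(2, 50, 128))
print(fused.shape)  # torch.Size([2, 50, 128])
```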
Figure 7. (a) The CA network layers. The audio attention feature ACA is generated with the audio features Ea as the source and the visual features Ev as the target. The visual attention feature VCA is generated similarly. (b) The SA network layers.