Ingegneria Sismica

A Stable Transformer-based Multimodal Framework for Dialogue Emotion Recognition with Speaker-Aware Context Modeling

Author(s): Mingmin Gao¹
¹Future Technology Institute, South China University of Technology, Guangzhou 510000, Guangdong, China
Gao, Mingmin. “A Stable Transformer-based Multimodal Framework for Dialogue Emotion Recognition with Speaker-Aware Context Modeling.” Ingegneria Sismica Volume 43 Issue 2: 1-16, doi:10.65102/is2026859.

Abstract

Emotion recognition in dialogue is difficult because conversations are free-flowing, often involve more than two speakers, and express emotion ambiguously. Although recent studies have shown that multimodal information can improve recognition accuracy, how to combine text, audio and visual signals efficiently when training is unstable and the emotion classes are imbalanced remains an open problem. This paper proposes M2FNet, a multimodal fusion network for dialogue emotion recognition that learns jointly from text, audio and visual inputs. Pretrained BERT embeddings provide the linguistic representations, and transformer-based layers model cross-modal interactions. A Gated Recurrent Unit (GRU) processes the dialogue context so that the model can capture how emotions evolve over time and differ between speakers. To handle class imbalance and stabilise training, the framework uses a weighted loss function together with AdamW optimisation, gradient clipping, learning rate scheduling and early stopping. Experiments on the publicly available MELD dataset show that the proposed strategy achieves balanced performance across the emotion classes, while dominant conversational emotions are still recognised reasonably well despite the class imbalance. Rather than pursuing state-of-the-art performance, this work presents a reproducible and easily understandable multimodal baseline for dialogue emotion recognition, intended to clarify the strengths and weaknesses of fusion-based architectures in real-life conversation.

Keywords
Dialogue Emotion Recognition; Multimodal Learning; Emotion Recognition; Transformer Fusion; Speaker-Aware Modeling; MELD Dataset; Deep Learning
