A Stable Transformer-based Multimodal Framework for Dialogue Emotion Recognition with Speaker-Aware Context Modeling

Gao, Mingmin

doi:10.65102/is2026859

Research article

Ingegneria Sismica

Volume 43 Issue 2
Pages: 1-16

Author(s): Gao, Mingmin¹

¹Future Technology Institute, South China University of Technology, Guangzhou 510000, Guangdong, China

Published: 30/04/2026

Cite

Gao, Mingmin. “A Stable Transformer-based Multimodal Framework for Dialogue Emotion Recognition with Speaker-Aware Context Modeling.” Ingegneria Sismica Volume 43 Issue 2: 1-16, doi:10.65102/is2026859.

https://doi.org/10.65102/is2026859

Abstract

Emotion recognition in dialogue is difficult because conversations are loosely structured, often involve more than two speakers, and express emotion ambiguously. Although recent studies have shown that multimodal information can improve emotion recognition accuracy, how to combine text, audio, and visual signals effectively when training is unstable and the emotion classes are imbalanced remains an open problem. This paper proposes M2FNet, a multimodal fusion network for dialogue emotion recognition that learns jointly from text, audio, and visual inputs. Pretrained BERT embeddings provide the linguistic representations, and transformer-based layers model cross-modal interactions. A Gated Recurrent Unit (GRU) processes the dialogue context to capture how emotions evolve over time and how they differ by speaker. To handle class imbalance and stabilise training, the framework uses a weighted loss function together with AdamW optimisation, gradient clipping, learning rate scheduling, and early stopping. Experiments on the publicly available MELD dataset show that the proposed approach achieves balanced performance across emotion classes, and that the dominant conversational emotions are still recognised reasonably well despite class imbalance. Rather than pursuing state-of-the-art performance, this work presents a reproducible and interpretable multimodal baseline for dialogue emotion recognition that helps clarify the strengths and weaknesses of fusion-based architectures in real-life conversation.
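
The weighted loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the class weights are chosen, so the inverse-frequency scheme below is an assumption, and the function names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical. The sketch uses only the Python standard library and shows how rarer emotion classes can be made to contribute more to the loss.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / class frequency.

    Assumed scheme (not taken from the paper). Weights are scaled so that a
    perfectly balanced label set gives 1.0 for every class. Classes that do
    not occur in `labels` get weight 0.0.
    """
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of the per-utterance cross-entropy.

    logits:  list of per-class score lists, one list per utterance.
    labels:  list of integer class ids, one per utterance.
    weights: per-class weights, e.g. from inverse_frequency_weights.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for scores, y in zip(logits, labels):
        # Numerically stable log-sum-exp: subtract the max score first.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[y]  # negative log-likelihood of the true class
        weighted_sum += weights[y] * nll
        weight_total += weights[y]
    # Normalise by the total weight so the loss scale stays comparable.
    return weighted_sum / weight_total


# Toy example: class 0 is three times as frequent as class 1.
toy_labels = [0, 0, 0, 1]
toy_weights = inverse_frequency_weights(toy_labels, num_classes=2)
toy_logits = [[0.0, 0.0]] * 4  # an uninformative model
toy_loss = weighted_cross_entropy(toy_logits, toy_labels, toy_weights)
```

In the toy example the minority class receives a weight of 2.0 against roughly 0.67 for the majority class, so a mistake on a rare emotion costs about three times as much as a mistake on a common one. Dividing by the total weight keeps the loss on the same scale as the unweighted cross-entropy.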

Keywords
Dialogue Emotion Recognition; Multimodal Learning; Emotion Recognition; Transformer Fusion; Speaker-Aware Modeling; MELD Dataset; Deep Learning