Emotion recognition in dialogue is difficult because conversation is flexible, often involves more than two speakers, and conveys emotion ambiguously. Although recent studies have shown that multimodal information can improve the accuracy of emotion recognition, how to combine text, audio and visual information efficiently when training is unstable and the data are imbalanced remains an open problem. This paper proposes M2FNet, a multimodal fusion network for dialogue emotion recognition that learns jointly from text, audio and visual inputs. Pretrained BERT embeddings provide the linguistic representations, and transformer-based layers model cross-modal interactions. A Gated Recurrent Unit (GRU) processes the dialogue sequence to capture how emotions evolve over time and how they differ between speakers. To handle class imbalance and stabilise training, a weighted loss function is combined with AdamW optimisation, gradient clipping, learning rate scheduling and early stopping. Experiments on the publicly available MELD dataset show that the proposed strategy achieves balanced performance across the emotion classes, and that the dominant conversational emotions are still recognised reasonably well despite class imbalance. Rather than pursuing state-of-the-art performance, this work presents a reproducible and easily understood multimodal baseline for dialogue emotion recognition, intended to clarify the strengths and weaknesses of fusion-based architectures in real-life conversation.
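
As an illustration of the weighted loss mentioned above, the following is a minimal sketch in plain Python. The abstract does not say how the class weights are derived, so the inverse-frequency scheme used here is an assumption, and the function names and toy labels are hypothetical rather than taken from M2FNet.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = N / (K * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Made-up, imbalanced toy labels (not MELD statistics).
labels = ["neutral"] * 6 + ["joy"] * 3 + ["fear"] * 1
weights = inverse_frequency_weights(labels)

# A hypothetical model that predicts a uniform distribution for every utterance.
uniform = {"neutral": 1 / 3, "joy": 1 / 3, "fear": 1 / 3}
loss = weighted_cross_entropy([uniform] * len(labels), labels, weights)
```

With these weights, an error on a rare class such as "fear" contributes more to the loss than an error on "neutral", which is one common way to keep a dominant class from swamping training.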