In this paper, we design a multimodal sequence feature extraction model based on the self-attention mechanism and propose an improved RoBERTa-MEN model for emotion classification. Taking into account the influence of cognition on emotion, we optimize the emotion modeling and behavior modeling methods for animated characters. We compare the emotion recognition performance of the proposed method against mainstream models, use correlation visualization analysis to verify the effectiveness of the improvements, and conduct an animated character emotion transfer experiment to assess the feasibility of the method. In the performance test, the proposed model improves on the TBJE model by 1.5% to 2.5% on all four evaluation metrics. The study of facial emotion effectiveness shows that the classification accuracy of the model is above 80% on all seven emotion labels, and the average emotion index evaluation score reaches 82 points. These results indicate that the proposed scheme can integrate multimodal data and effectively characterize and transfer the emotions of animated characters.