Opera is a treasure of Chinese art. Zhuhu Performance in opera is an indirect expression of human language and emotion, and emotion analysis plays an important role in understanding opera and in conveying its musical language. After introducing the basic theory and relevant features of musical signals, the article preprocesses the multimodal data of operatic Zhuhu Performance through pre-emphasis, framing, and windowing, and then classifies and recognizes the emotions of the performance with an emotion recognition model based on an improved multimodal RCNN. A confusion matrix is used to show how accurately the model classifies each emotion in the experiments. Using the constructed audio signal features, the model identifies the emotions of operatic Zhuhu Performance with relatively high accuracy, and the classification accuracy is highest for the emotion “happy”.
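The three preprocessing steps named in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: the function name `preprocess_audio`, the pre-emphasis coefficient of 0.97, the 25 ms frame length, the 10 ms hop, and the Hamming window are all common defaults assumed here for illustration, and the input is assumed to be a non-empty mono signal.

```python
import numpy as np


def preprocess_audio(signal, sample_rate, alpha=0.97, frame_ms=25.0, hop_ms=10.0):
    """Hypothetical helper: pre-emphasis, framing, and windowing of a mono signal.

    All parameter defaults are assumptions, not values taken from the article.
    Returns an array of shape (n_frames, frame_len).
    """
    signal = np.asarray(signal, dtype=np.float64)

    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], which boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Convert frame and hop durations from milliseconds to samples.
    frame_len = int(round(sample_rate * frame_ms / 1000))
    hop_len = int(round(sample_rate * hop_ms / 1000))

    # Zero-pad the tail so the final partial frame is kept.
    n_frames = 1 + max(0, int(np.ceil((len(emphasized) - frame_len) / hop_len)))
    pad_len = (n_frames - 1) * hop_len + frame_len
    padded = np.pad(emphasized, (0, pad_len - len(emphasized)))

    # Framing: build an index matrix of overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = padded[idx]

    # Windowing: taper each frame with a Hamming window to reduce spectral leakage.
    return frames * np.hamming(frame_len)


if __name__ == "__main__":
    # One second of a synthetic 440 Hz tone at 16 kHz stands in for real audio.
    sr = 16000
    t = np.arange(sr) / sr
    frames = preprocess_audio(np.sin(2 * np.pi * 440 * t), sr)
    print(frames.shape)
```

Each row of the returned array is one windowed frame, which is the usual starting point for extracting spectral features before they are passed to a classifier.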