Choral conducting gesture is the core medium for conveying the emotion of a musical work, but how it actually works has long been described only in subjective terms. This paper examines the role of choral conducting gesture language in the emotional communication of musical works through a data-driven approach. We first adapt the 3D CNN architecture to the characteristics of choral conducting gesture signals: we design a spatio-temporal separation convolutional neural network and introduce a spatio-temporal channel adjustment factor that allows the model size to be changed quickly, yielding a classifier with low computational cost and high recognition accuracy. Empirical results show that the model reaches an accuracy of more than 85% in choral conducting gesture recognition and emotion recognition on different datasets. A correlation analysis of the gesture and emotion recognition results on multimodal data of “Sunset xiao drums” performed by three conductors reveals how gesture and emotion are linked: the conductor shapes the performers' emotional expression through the interaction of spatial dynamic features such as gesture amplitude and gesture speed. The results provide a theoretical and technical basis for the teaching, evaluation, and intelligent assistance of the art of music conducting.
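To illustrate why separating spatial and temporal convolution and scaling the channel width can lower the cost relative to a full 3D convolution, the following minimal sketch compares weight counts. It is an assumption-laden illustration, not the network described in this paper: it assumes a (2+1)D-style factorization (a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution) and treats the channel adjustment factor as a simple width multiplier `alpha`. The function names, the layer sizes, and `alpha` are all hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a full 3D convolution (bias omitted)."""
    return c_in * c_out * kt * kh * kw


def separated_params(c_in, c_out, kt, kh, kw, c_mid=None):
    """Weight count of a spatial (1 x kh x kw) convolution followed by a
    temporal (kt x 1 x 1) convolution (bias omitted).

    c_mid is the number of channels between the two convolutions;
    assumed equal to c_out when not given.
    """
    if c_mid is None:
        c_mid = c_out
    spatial = c_in * c_mid * kh * kw   # 2D convolution within each frame
    temporal = c_mid * c_out * kt      # 1D convolution across frames
    return spatial + temporal


def scale_channels(channels, alpha):
    """Apply a hypothetical width multiplier alpha to a channel count."""
    return max(1, int(round(channels * alpha)))


if __name__ == "__main__":
    c, k = 64, 3  # illustrative layer: 64 -> 64 channels, 3x3x3 kernel
    full = conv3d_params(c, c, k, k, k)
    separated = separated_params(c, c, k, k, k)
    c_small = scale_channels(c, 0.5)
    separated_small = separated_params(c_small, c_small, k, k, k)
    print(full, separated, separated_small)
```

For this illustrative layer, the full 3D convolution has 110592 weights and the separated version has 49152. Halving the channel width reduces the separated version to 12288, because the weight count grows with the product of input and output channels.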