Targeting digital music dissemination and intelligent audio analysis, this paper proposes a deep neural network for predicting audience reaction from vocal emotion. After time-frequency feature extraction and preprocessing of the vocal audio, the model uses a fully convolutional network to extract spatial information in the spectral domain, a bidirectional long short-term memory network to capture the temporal dependence of emotion as phrases progress, and a context attention fusion mechanism to adaptively weight key frequency bands, key frames, and cross-segment associations. Together these components establish a computational mapping between vocal expression and listener feedback. Experimental results show that the model reaches an accuracy of 91.8% and a macro-average F1 score of 91.1% on the vocal emotion recognition task. On the listener response prediction task, the mean absolute errors of preference prediction and arousal prediction are reduced to 0.356 and 0.339, respectively. These results indicate that the proposed model consistently improves both emotion analysis accuracy and listener feedback prediction on complex vocal clips.
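For readers less familiar with the reported metrics, the following is a minimal sketch of how accuracy, macro-average F1, and mean absolute error are conventionally computed. It assumes the standard textbook definitions; the evaluation protocol, emotion label set, and rating scale used in the experiments are not specified here, so the function names, labels, and toy values below are purely illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of clips whose predicted emotion label matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between predicted and reference listener ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical labels and ratings, not taken from the experiments)
emo_true = ["happy", "sad", "happy", "calm"]
emo_pred = ["happy", "sad", "calm", "calm"]
rating_true = [0.5, 0.2]
rating_pred = [0.1, 0.4]

acc = accuracy(emo_true, emo_pred)
f1 = macro_f1(emo_true, emo_pred)
mae = mean_absolute_error(rating_true, rating_pred)
```

Because macro-averaging weights every emotion class equally, a macro F1 close to the accuracy (91.1% versus 91.8%) suggests that performance is fairly even across classes rather than dominated by the most frequent one. The mean absolute error is expressed in the units of the rating scale, so the values 0.356 and 0.339 should be read relative to that scale.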