Changes in students’ learning states in English class are reflected in physiological data that take many forms. Collecting, analyzing, and using these data to identify emotional changes during the learning process can give teachers a reference for designing classroom activities sensibly. In this paper, we use Haar feature-based face detection and OpenPose-based feature recognition to extract students’ facial expression and behavioral posture features in the English classroom. We build a learning emotion recognition model that integrates the multimodal data through a multi-head attention mechanism and then merges the result with temporal features obtained from a long short-term memory (LSTM) network to recognize and classify students’ emotions. Repeated experiments show that the model’s emotion recognition precision exceeds 85%. Over four examinations, the average score of students in the experimental class, which was taught under the multimodal teaching approach, rose above 85 points. Among the classroom activities considered, “acting English dramas and chanting English songs” proved to have the greatest influence.
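As a rough illustration of the multimodal integration step, the sketch below applies multi-head self-attention to two modality tokens, one for facial expression features and one for posture features, and then mean-pools them into a single fused vector. It is a minimal sketch under stated assumptions, not the model used in this paper: the function names, the toy feature values, the number of heads, the identity query/key/value projections (a trained model would learn these), and the mean pooling are all hypothetical, and the LSTM and the classifier are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vector, num_heads):
    """Split one feature vector into num_heads equal-sized chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]


def multi_head_attention(tokens, num_heads):
    """Self-attention over modality tokens.

    Assumption: Q, K and V are the raw head slices (identity projections).
    Returns (outputs, weights), where outputs[t] is the attended vector for
    token t and weights[h][t] is token t's attention distribution in head h.
    """
    per_token_heads = [split_heads(tok, num_heads) for tok in tokens]
    outputs = [[] for _ in tokens]
    weights = []
    for h in range(num_heads):
        head_vectors = [heads[h] for heads in per_token_heads]
        scale = math.sqrt(len(head_vectors[0]))
        head_weights = []
        for t, query in enumerate(head_vectors):
            # Scaled dot-product score of this query against every key.
            scores = [
                sum(q * k for q, k in zip(query, key)) / scale
                for key in head_vectors
            ]
            attn = softmax(scores)
            head_weights.append(attn)
            # Weighted sum of the value vectors, one component at a time.
            attended = [
                sum(w * value[i] for w, value in zip(attn, head_vectors))
                for i in range(len(query))
            ]
            # Concatenate this head's output onto the token's output.
            outputs[t].extend(attended)
        weights.append(head_weights)
    return outputs, weights


def fuse_modalities(face_features, pose_features, num_heads=2):
    """Fuse two equal-length modality vectors into one vector."""
    if len(face_features) != len(pose_features):
        raise ValueError("both modalities must have the same feature size")
    outputs, _ = multi_head_attention([face_features, pose_features], num_heads)
    # Mean-pool the attended tokens (a hypothetical pooling choice).
    return [sum(column) / len(outputs) for column in zip(*outputs)]


if __name__ == "__main__":
    face = [0.9, 0.1, 0.4, 0.7]  # toy facial expression features
    pose = [0.2, 0.8, 0.5, 0.3]  # toy posture features
    print(fuse_modalities(face, pose))
```

In a full pipeline of the kind described above, one fused vector per video frame would form the input sequence of the LSTM, whose temporal features would then feed the emotion classifier. As a sanity check on the sketch, two identical modality vectors are fused back into that same vector.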