To address the strong subjectivity and insufficient process evidence in evaluating interaction quality in English classrooms, this paper constructs a multimodal intelligent evaluation model that integrates video, audio, and classroom transcripts. Based on 48 real English classes and 1,920 interaction segments, we establish a five-dimensional annotation system covering questioning quality, feedback effectiveness, participation breadth, emotional atmosphere, and target-language interaction density. The results show an intraclass correlation coefficient (ICC) of 0.86. On the test set, the proposed model achieves a macro-averaged F1 score (Macro-F1) of 0.803 and a mean absolute error (MAE) of 0.298, and its classroom-level predictions correlate with expert ratings at 0.861. Ablation and case analyses show that the interaction dimensions differ in how strongly they depend on the text, audio, and visual modalities. The model can also identify cues such as open-ended questioning, wait time, and participation coverage, providing an interpretable basis for teachers to improve classroom interaction.
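
For readers less familiar with the three reported metrics, the following is a minimal sketch of how Macro-F1, MAE, and a correlation coefficient are conventionally computed. It is illustrative only and is not the paper's evaluation code: the function names and the toy inputs are hypothetical, and Pearson correlation is an assumption, since the type of correlation coefficient is not stated above.

```python
import math


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predicted and reference scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation (assumed here; the correlation type is not specified)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from the study.
    segment_labels = [0, 1, 2, 2]
    segment_preds = [0, 1, 2, 1]
    expert_scores = [1.0, 2.0, 3.0]
    model_scores = [1.5, 2.0, 2.0]

    print("Macro-F1:", macro_f1(segment_labels, segment_preds))
    print("MAE:", mean_absolute_error(expert_scores, model_scores))
    print("Pearson r:", pearson_r(expert_scores, model_scores))
```

Macro averaging weights every class equally regardless of how many segments fall into it, which is why it is often preferred when label frequencies are imbalanced.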