With the expansion of digital education infrastructure, computational analysis of interaction efficacy in university mental health classrooms has become a foundation for adaptive teaching. This paper proposes an architecture that integrates cross-modal attention representation, behavior-emotion response coding, and a feedback-trigger mechanism to enhance classroom interaction. The framework organizes video streams, speech signals, dialogue texts, click logs, and attendance events from 12 university classes into a multimodal sequence dataset containing 624 minutes of audio recordings, 1.12 million video frames, and 18,460 labeled interaction segments. A ConvNeXt visual branch extracts posture and gesture cues, and a Transformer encoder models the temporal dependence of utterance turns, emotional fluctuations, and response delays. Training was performed in PyTorch with the AdamW optimizer on an RTX 4090, using a batch size of 16 for 90 epochs. Experimental results show an interaction-state accuracy of 91.6%, an interaction-efficacy F1 score of 89.8%, a feedback-trigger accuracy of 88.7%, and an AUC of 0.812, providing support for classroom interaction analysis and intervention.
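The visual-plus-temporal pipeline described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's implementation: a tiny convolutional stack stands in for the ConvNeXt branch, and all layer sizes, the number of interaction-state classes, and the clip-level mean pooling are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class CrossModalInteractionSketch(nn.Module):
    """Hypothetical sketch: per-frame visual features (stand-in for the
    ConvNeXt branch) are fed to a Transformer encoder over the temporal
    axis, then a linear head predicts the interaction state per clip.
    All dimensions are illustrative assumptions, not the paper's values."""

    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_states=4):
        super().__init__()
        # Stand-in visual branch: a small conv stack instead of ConvNeXt.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Transformer encoder models temporal dependence across frames.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_states)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W) -> per-frame feature vectors
        b, t = frames.shape[:2]
        feats = self.visual(frames.flatten(0, 1)).view(b, t, -1)
        feats = self.temporal(feats)            # temporal context
        return self.head(feats.mean(dim=1))     # clip-level logits

model = CrossModalInteractionSketch()
logits = model(torch.randn(2, 8, 3, 64, 64))    # 2 clips of 8 frames
print(logits.shape)                             # (2, n_states)
```

In a full version, the stand-in conv stack would be replaced by a pretrained ConvNeXt backbone, the audio and text branches would contribute additional token streams to the encoder, and training would use AdamW as described in the abstract.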