To improve effect evaluation, state recognition, and optimization generation in traditional dulcimer teaching, this paper constructs a multimodal analysis framework driven by artificial intelligence. Teaching audio, performance movements, classroom videos, and practice logs were collected from 96 learners across 16 teaching units, forming a teaching dataset of 18,240 synchronized samples. The model consists of an acoustic branch, an action branch, and a classroom behavior branch. Temporal convolution, graph modeling, and a cross-modal attention mechanism are combined to jointly encode timespan stability, rhythm deviation, string-striking posture, and classroom interaction behavior, and to output teaching scores and state labels. Training uses population search, AdamW, and dynamic learning rate scheduling for optimization; the dataset is split 7:2:1, and training and testing are carried out on an NVIDIA RTX 4090 platform. Experimental results show that the proposed framework achieves 93.6% accuracy, a Macro-F1 of 0.918, and an inference latency of 24.6 ms, providing a computable basis for fine-grained analysis and digital adaptation of the traditional dulcimer classroom.
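
As a rough illustration of the 7:2:1 partition applied to the 18,240 samples, the following minimal sketch shuffles sample indices and cuts them into three subsets. It rests on assumptions not stated in the text: that the split is made at the sample level (rather than by learner or teaching unit), that the three parts correspond to training, validation, and test sets, and that a fixed random seed is used. The function name `split_indices` is hypothetical.

```python
import random


def split_indices(n_samples, ratios=(7, 2, 1), seed=0):
    """Shuffle sample indices and cut them into three parts by integer ratio.

    Assumption: sample-level random split; the paper does not say whether
    the split is stratified or grouped by learner.
    """
    total = sum(ratios)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_first = n_samples * ratios[0] // total
    n_second = n_samples * ratios[1] // total

    first = indices[:n_first]
    second = indices[n_first:n_first + n_second]
    third = indices[n_first + n_second:]  # remainder goes to the last part
    return first, second, third


# Assumed roles of the three parts: training, validation, test.
train_idx, val_idx, test_idx = split_indices(18240)
print(len(train_idx), len(val_idx), len(test_idx))  # 12768 3648 1824
```

Under these assumptions the three subsets contain 12,768, 3,648, and 1,824 samples. If samples from the same learner must not appear in more than one subset, the split would instead need to be made over the 96 learners.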