To quantify immersive experience in virtual reality (VR) music aesthetic education, this paper proposes an evaluation model based on synchronization analysis of physiological signals. EEG, ECG, EMG, respiration, and head-pose data were collected synchronously from 36 participants over 180 VR music interactions, yielding 1440 samples across four immersion levels. The model uses music events as time anchors and constructs features from temporal alignment, phase consistency, coupling strength, and a synchronization representation. It then combines context reweighting, relation aggregation, gated feature selection, and a dual-branch output to perform both immersion-level classification and continuous immersion scoring. Accuracy and macro-F1 were used to evaluate classification, and mean absolute error (MAE) and the Pearson correlation coefficient to evaluate continuous scoring. The model was compared with a single-modality EEG classifier, an early-fusion (feature concatenation) model, a convolutional-recurrent fusion model, and a time-domain statistical synchronization model. The proposed model achieves a classification accuracy of 92.8%, a macro-F1 of 90.6%, an MAE of 0.214, and a Pearson correlation of 0.873, indicating cross-subject stability and scene adaptability for immersion evaluation in VR music aesthetic education.
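As an illustration of the four evaluation indicators named above, the following is a minimal NumPy sketch, not the authors' evaluation code. It assumes immersion levels are encoded as integers 0 to 3 and that the continuous score is a real number per sample; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted immersion level is correct."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, n_classes=4):
    """Unweighted mean of per-class F1 (assumes labels 0..n_classes-1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class that never occurs and is never predicted scores 0 here.
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))

def mean_absolute_error(s_true, s_pred):
    """Mean absolute difference between true and predicted scores."""
    s_true = np.asarray(s_true, dtype=float)
    s_pred = np.asarray(s_pred, dtype=float)
    return float(np.mean(np.abs(s_true - s_pred)))

def pearson_r(s_true, s_pred):
    """Pearson correlation coefficient between true and predicted scores."""
    s_true = np.asarray(s_true, dtype=float)
    s_pred = np.asarray(s_pred, dtype=float)
    return float(np.corrcoef(s_true, s_pred)[0, 1])

if __name__ == "__main__":
    # Toy labels and scores for illustration only (not data from the study).
    levels_true = [0, 1, 2, 3, 0, 1, 2, 3]
    levels_pred = [0, 1, 2, 3, 0, 1, 2, 0]
    scores_true = [1.0, 2.0, 3.0, 4.0]
    scores_pred = [1.5, 2.0, 2.5, 4.0]
    print("accuracy:", accuracy(levels_true, levels_pred))
    print("macro-F1:", macro_f1(levels_true, levels_pred))
    print("MAE:", mean_absolute_error(scores_true, scores_pred))
    print("Pearson r:", pearson_r(scores_true, scores_pred))
```

Macro-F1 averages the per-class F1 values without weighting by class size, so a rarely occurring immersion level counts as much as a frequent one. This is why it is commonly reported alongside accuracy for multi-class problems.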