Simulation-based legal training requires accurate scene reconstruction, tracking of character interactions, and a computable evaluation of immersion. This paper designs an immersive teaching mode for legal training scenarios that integrates VR scene modeling, multimodal interaction capture, and intelligent feedback. A virtual court, a mediation room, and a legal consultation space are constructed from procedural nodes, evidence objects, role tasks, and case materials. Speech, eye movement, head movement, operation logs, task duration, and interaction frequency were collected from 96 learners over 12 rounds of training, yielding 2840 labeled samples. A multimodal fusion network identifies character behavior and evaluates the immersion state, and an edge assistance module reduces feedback delay. In the experiments, the behavior recognition accuracy is 92.4%, the weighted F1 score is 0.887, the task adaptation accuracy is 88.6%, and the average response delay is 46 ms. These results indicate that the system can support data-driven legal training in an immersive practice environment, with more reliable process evaluation, task feedback, and real-time interactive feedback.