To meet the demand for computable evaluation in university labor education, this paper proposes a multi-modal, data-driven intelligent evaluation system for learning scenarios. The system integrates video action streams, operation logs, task submission records, collaboration trajectories and reflective texts into a common feature space. A labor behavior encoder extracts posture changes, tool-use sequences, completion time, process consistency and reflective semantic features. An attention fusion network then evaluates participation quality, task completion, safety compliance, collaboration status and depth of reflection, generating evidence-based evaluation scores for teachers and students. A dataset of 4280 valid records was constructed from 126 undergraduates, covering tasks such as campus cleaning, green plant maintenance, equipment maintenance, community service and handcrafting. Experimental results show that the proposed model achieves 93.2% accuracy, an F1-score of 0.914 and an MAE of 0.176, outperforming the CNN-LSTM, Text-BERT and Late-FusionNet baselines by 5.8, 6.4 and 3.9 percentage points, respectively. The model also shows stable evaluation performance.
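The attention-based fusion step can be illustrated with a minimal sketch. It assumes each modality has already been encoded into a fixed-length feature vector and uses plain scaled dot-product attention with a fixed query. The function names, the query vector and the toy feature values are hypothetical and are not taken from the paper; the actual network architecture may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse per-modality feature vectors with scaled dot-product attention.

    features: dict mapping modality name -> feature vector (equal lengths).
    query: vector of the same length; learned in a real model, fixed here.
    Returns (fused_vector, weights_by_modality).
    """
    dim = len(query)
    names = list(features)
    # One relevance score per modality: dot(query, feature) / sqrt(dim).
    scores = [
        sum(q * x for q, x in zip(query, features[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality vectors.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy, made-up feature vectors for the five data sources (not paper data).
example = {
    "video": [0.8, 0.1, 0.3],
    "logs": [0.4, 0.6, 0.2],
    "submissions": [0.5, 0.5, 0.5],
    "collaboration": [0.2, 0.7, 0.1],
    "reflection": [0.3, 0.2, 0.9],
}
query = [1.0, 0.5, 0.2]  # hypothetical query for one evaluation dimension

fused, weights = attention_fuse(example, query)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Because the attention weights sum to one, they can be read as each modality's relative contribution to a given score, which is one possible way to expose supporting evidence to teachers and students.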