To meet the need for recognizing children's experience in natural aesthetic education activities, this paper proposes a deep-learning-driven multimodal analysis system. Video sequences, speech segments, action trajectories, gaze regions, and environmental semantics are combined into a unified representation that describes emotional cognition, behavioral influence, and online feedback during the natural aesthetic experience. The system consists of a cross-modal semantic encoding module, a shared temporal backbone, and a relational inference module, which together capture the co-variation among expression, verbal response, action rhythm, and scene attributes. The system was trained and evaluated on 8,640 labeled interaction segments collected from 216 young children across 36 activity sessions, using an 80:20 split and 5-fold cross-validation. Experimental results show an accuracy of 92.1% for experience participation prediction, an accuracy of 89.7% for emotional state recognition, a scene semantic matching score of 87.9%, and an average inference latency of 84 ms. Compared with the control method, the response gain of the aesthetic education experience reaches 18.6%, and the F1 score for behavioral influence trajectory recognition reaches 90.2%, indicating that the system analyzes and adjusts the experience effectively.
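
The evaluation protocol can be illustrated with a minimal sketch. The abstract does not state how the 80:20 split and the 5-fold cross-validation are combined, so the code assumes one plausible reading: 20% of the 8,640 segments are held out for testing, and the remaining 80% are divided into five folds. The function name `split_indices`, the random seed, and the segment-level (rather than child-level) shuffling are all hypothetical choices made for illustration, not details taken from the system.

```python
import random

N_SEGMENTS = 8640      # labeled interaction segments (from the abstract)
TEST_FRACTION = 0.20   # 80:20 split (from the abstract)
N_FOLDS = 5            # 5-fold cross-validation (from the abstract)


def split_indices(n, test_fraction, n_folds, seed=0):
    """Hold out a test set, then build k (train, val) folds on the rest.

    Assumption: cross-validation runs on the 80% portion only.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)

    n_test = round(n * test_fraction)
    test_idx, dev_idx = idx[:n_test], idx[n_test:]

    folds = []
    for k in range(n_folds):
        # Every n_folds-th element starting at offset k forms the validation fold.
        val_idx = dev_idx[k::n_folds]
        val_set = set(val_idx)
        train_idx = [i for i in dev_idx if i not in val_set]
        folds.append((train_idx, val_idx))
    return test_idx, folds


test_idx, folds = split_indices(N_SEGMENTS, TEST_FRACTION, N_FOLDS)
print("test segments:", len(test_idx))
for k, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {k}: train={len(train_idx)} val={len(val_idx)}")
```

Because each child contributes multiple segments, a segment-level shuffle like this one can place the same child in both training and test data. Grouping the split by child or by session would avoid that overlap; whether the reported figures were obtained that way is not specified.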