To address three problems in football training (action recognition that relies on a single video feature, lagging injury risk prediction, and insufficient interpretation of training load), a multimodal deep learning model for action pattern recognition and sports injury prediction is constructed. The system fuses video skeleton keypoints, IMU-GPS motion sequences, heart rate, rating of perceived exertion (RPE), training load, and previous injury records into a dataset covering 126 football players, 18,420 action clips, and 1,260 weekly risk samples. The model extracts joint coordination features with spatio-temporal graph convolution, models changes in exercise load with a sequence network, and uses an attention mechanism to fuse the modalities. Experimental results show that the full model achieves an action recognition accuracy of 94.6%, a Macro-F1 of 93.8%, an injury prediction AUC of 0.921, and a recall of 88.7% for high-risk injuries. Compared with ablated variants that remove key modalities, the proposed method is more stable in scenarios such as complex physical confrontation, sudden stops, and landing cushioning, and can provide data support for football training monitoring, action correction, and injury warning.
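
As an illustration of the attention-based multimodal fusion step, the following minimal sketch weights per-modality feature vectors by their scaled dot-product similarity to a query vector and sums them. It is not the model described here: the function names, the modality names, the feature values, and the fixed query are all hypothetical, and in a trained network the query and the features would be learned rather than hand-set.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Fuse per-modality vectors with scaled dot-product attention.

    features: dict mapping a modality name to a vector (all of equal length).
    query:    vector of the same length (hypothetical; learned in practice).
    Returns the fused vector and the attention weight of each modality.
    """
    names = list(features)
    dim = len(query)
    # One relevance score per modality: dot(feature, query) / sqrt(dim).
    scores = [
        sum(f * q for f, q in zip(features[n], query)) / math.sqrt(dim)
        for n in names
    ]
    weights = softmax(scores)
    # Fused representation is the attention-weighted sum of modality vectors.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))

# Toy feature vectors (made-up values, for illustration only).
features = {
    "skeleton": [0.9, 0.1, 0.4, 0.2],
    "imu_gps": [0.2, 0.8, 0.1, 0.5],
    "physiology": [0.3, 0.3, 0.7, 0.6],
}
query = [1.0, 0.5, 0.0, 0.2]

fused, weights = attention_fuse(features, query)
print(weights)
```

The attention weights sum to one, so they can be read as the relative contribution of each modality to the fused representation for a given sample.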