In practice, the quantification and optimization of film narrative rhythm have relied on the editor's subjective, intuitive judgment rather than on a firm mathematical basis. Such traditional approaches cannot capture fine-grained interactions between spatial visual composition and long-term temporal narrative pacing. To overcome these deficiencies, this paper presents a deep learning framework based on spatiotemporal feature fusion (STFF) to quantify and optimize cinematic rhythm automatically. The architecture uses two parallel extraction streams: a ResNet-50-based module for spatial features and a three-dimensional convolutional network (C3D) for time-dependent motion and behaviour analysis. A custom Spatiotemporal Attention Module (STAM) adaptively recalibrates feature weights across the spatial and temporal dimensions. On a curated, annotated dataset of 12,500 films, the proposed STFF model achieves a Rhythm Concordance Index (RCI) of 94.6% and a mean absolute error (MAE) of 0.082, outperforming the baseline methods on both metrics. Ablation studies confirm that the two streams are complementary: the combined model performs better than either stream alone. The result is a scalable, quantifiable, and practically relevant model for automatic film editing and rhythm adjustment.
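The abstract does not give the internals of STAM, so the following is only an illustrative sketch of what attention-weighted fusion of two feature streams could look like, together with the standard MAE definition. Everything in it is an assumption made for illustration: the function names `fuse_streams` and `mean_absolute_error`, the scoring vectors `w_spatial` and `w_temporal`, and the choice of a single softmax weight per stream are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(spatial, temporal, w_spatial, w_temporal):
    """Hypothetical attention-style fusion of two equal-length feature vectors.

    Each stream gets one scalar score (dot product with an assumed scoring
    vector); a softmax over the two scores gives the stream weights, and the
    fused feature is the weighted sum of the two streams.
    """
    if not (len(spatial) == len(temporal) == len(w_spatial) == len(w_temporal)):
        raise ValueError("all vectors must have the same length")
    score_s = sum(w * x for w, x in zip(w_spatial, spatial))
    score_t = sum(w * x for w, x in zip(w_temporal, temporal))
    a_s, a_t = softmax([score_s, score_t])
    fused = [a_s * s + a_t * t for s, t in zip(spatial, temporal)]
    return fused, (a_s, a_t)


def mean_absolute_error(predicted, target):
    """MAE: the average absolute difference between predictions and labels."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


if __name__ == "__main__":
    # Toy inputs; in the paper the features would come from ResNet-50 and C3D.
    fused, weights = fuse_streams([1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, 0.2])
    print("stream weights:", weights)
    print("fused feature:", fused)
    print("MAE:", mean_absolute_error([0.5, 0.7], [0.4, 0.9]))
```

With zero scoring vectors both streams receive equal weight, so the fused feature is the plain average of the two inputs. A learned module would instead shift weight toward whichever stream is more informative for a given segment.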