To improve counting accuracy in dense soybean pod scenes affected by small-object occlusion, overlap, and repeated responses, this study proposes an improved YOLOv8 model with spatiotemporal feature decoupling, in contrast to detection-then-tracking counting methods. Here, “spatiotemporal decoupling” means that soybean pod boundaries, contour gradients, and neighborhood occlusion relationships are encoded within the detection network as spatial structural representations, while cross-frame center displacement, scale fluctuation, and short-term visibility variation are encoded as temporal association representations. The two representations are combined by gated fusion before the detection head to calibrate candidate-box confidence and constrain counting bias. Unlike post-processing methods such as DeepSORT and ByteTrack, which rely on detection results for trajectory association, the temporal branch of the proposed method participates directly in candidate generation, candidate filtering, and quantity regression, so that dense target responses are corrected before NMS over-suppression and short-term missed detections occur. Conventional YOLOv8 is susceptible to single-frame texture interference, weakened boundaries of slender pods, and candidate drift in highly overlapping regions. To address these weaknesses, the model constructs a spatial structural branch and a temporal association branch, and further introduces a P2 fine-grained fidelity branch, multi-scale semantic fusion, candidate filtering constraints, and bias correction for repeated and missed counts. On this basis, the model is trained with a joint optimization strategy that combines localization loss, quantity regression loss, and temporal consistency loss.
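The gated fusion and joint optimization described above can be written schematically as follows. The exact formulation is not stated here, so this is only a sketch under assumptions: the symbols $F_s$ and $F_t$ (spatial and temporal features), the gate parameters $W_g$ and $b_g$, and the loss weights $\lambda_{\mathrm{loc}}$, $\lambda_{\mathrm{cnt}}$, $\lambda_{\mathrm{tmp}}$ are illustrative names, and the convex-combination gate is one common form of gated fusion rather than necessarily the one used in the model.

```latex
% Assumed gate: sigmoid over the concatenated spatial and temporal features
g = \sigma\!\left(W_g\,[F_s ; F_t] + b_g\right), \qquad
F_{\mathrm{fused}} = g \odot F_s + (1 - g) \odot F_t

% Assumed joint objective: weighted sum of the three losses named in the text
\mathcal{L} = \lambda_{\mathrm{loc}}\,\mathcal{L}_{\mathrm{loc}}
            + \lambda_{\mathrm{cnt}}\,\mathcal{L}_{\mathrm{cnt}}
            + \lambda_{\mathrm{tmp}}\,\mathcal{L}_{\mathrm{tmp}}
```

In this form the gate $g$ decides, per feature element, how much the fused representation relies on single-frame structure versus cross-frame association, and the weights $\lambda$ trade off localization, counting, and temporal consistency.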
Experimental results show that the improved model achieves MAE/RMSE/F1 values of 4.2/6.8/0.91, 3.1/5.0/0.94, and 6.4/8.9/0.88 on the self-built soybean field dataset, the PlantCrop subset, and the occlusion-enhanced synthetic sequence, respectively, significantly reducing counting errors compared with the YOLOv8n baseline. The model runs at 51.7 FPS, with a single-frame inference time of 19.3 ms, on an NVIDIA RTX 4090 platform, meeting the real-time requirements of field counting.
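For readers who want to reproduce the counting metrics, the following minimal sketch shows how MAE and RMSE are conventionally computed from per-image counts, and how throughput relates to single-frame latency. The function names and the sample counts are hypothetical and are not taken from the experiments; only the 51.7 FPS figure comes from the text.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and ground-truth counts."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


def latency_ms(fps):
    """Single-frame latency in milliseconds implied by a frame rate."""
    return 1000.0 / fps


# Hypothetical per-image pod counts, for illustration only.
predicted = [48, 52, 61, 39]
actual = [50, 50, 65, 40]

print(mae(predicted, actual))       # absolute errors 2, 2, 4, 1 -> 2.25
print(rmse(predicted, actual))      # sqrt((4 + 4 + 16 + 1) / 4) -> 2.5
print(round(latency_ms(51.7), 1))   # 19.3, consistent with the reported latency
```

RMSE penalizes large per-image errors more heavily than MAE, which is why the reported RMSE values are consistently higher than the corresponding MAE values.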