Cross-modal alignment between visual and temporal data is a fundamental problem in multimodal learning, underpinning applications from video-text retrieval to temporal activity localization. Existing methods rely primarily on dot-product attention or kernel-based similarity measures, which treat feature vectors as independent points and fail to capture the geometric structure of heterogeneous feature distributions. This paper proposes a unified framework for visual-temporal modality alignment grounded in optimal transport (OT) theory, introducing three progressive contributions: (i) Wasserstein Cross-Modal Attention (WCA) replaces conventional attention with weights derived from a transport plan, ensuring globally consistent cross-modal correspondences; (ii) Hierarchical Optimal Transport (HOT) extends alignment to a multi-scale formulation with coarse-to-fine transport-plan refinement across token, segment, and global semantic levels; and (iii) Adaptive Sinkhorn Divergence with a Learnable Cost Matrix (ASLC) jointly learns a task-specific cost function and dynamically adjusts the entropic regularization according to the cross-modal discrepancy. Extensive experiments on three standard benchmarks (MSR-VTT, ActivityNet Captions, UCF101) show that ASLC achieves consistent improvements of 5.0-7.4% in Recall@1 and 30-60% lower MMD than the baselines. Comprehensive ablation studies, hyperparameter sensitivity analyses, noise-robustness evaluations, and cross-dataset generalization experiments confirm the effectiveness, robustness, and transferability of the proposed framework.
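To make the WCA idea concrete, the sketch below illustrates one way attention weights can be taken from an entropic-OT transport plan computed with log-domain Sinkhorn iterations; it is a minimal illustration, not the paper's implementation, and the cosine ground cost, uniform marginals, epsilon value, iteration count, and function names are all assumptions for this example.

```python
# Minimal sketch of OT-plan-based cross-modal attention (illustrative only).
# Assumptions: uniform marginals over tokens, cosine distance as the ground cost,
# fixed epsilon and iteration count; not the authors' released code.
import math
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Entropic-OT transport plan between two uniform marginals.

    cost: (n, m) pairwise cost matrix.
    Returns an (n, m) plan whose rows/columns approximately sum to 1/n and 1/m.
    """
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n))   # log of uniform source marginal
    log_nu = torch.full((m,), -math.log(m))   # log of uniform target marginal
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(n_iters):
        # Log-domain Sinkhorn updates of the dual potentials for numerical stability.
        f = eps * (log_mu - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    return torch.exp((f[:, None] + g[None, :] - cost) / eps)

def wasserstein_cross_attention(video_tokens, text_tokens):
    """Aggregate text features for each video token, weighting by the OT plan."""
    v = F.normalize(video_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    cost = 1.0 - v @ t.T                              # cosine distance as ground cost
    plan = sinkhorn_plan(cost)                        # globally consistent correspondence
    weights = plan / plan.sum(dim=1, keepdim=True)    # row-normalize plan into attention
    return weights @ text_tokens                      # attended text features per video token

# Toy usage: 8 video tokens and 5 text tokens with 64-d features.
video = torch.randn(8, 64)
text = torch.randn(5, 64)
out = wasserstein_cross_attention(video, text)
print(out.shape)  # torch.Size([8, 64])
```

Unlike softmax attention, which normalizes each query row independently, the Sinkhorn plan is constrained on both marginals, which is what gives the correspondences their global consistency; the learnable cost and adaptive epsilon of ASLC would replace the fixed cosine cost and fixed `eps` assumed here.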