Cross-modal alignment between visual and temporal data is a fundamental problem in multimodal learning, underpinning applications from video-text retrieval to temporal activity localization. Existing methods rely primarily on dot-product attention or kernel-based similarity measures, which treat feature vectors as independent points and fail to capture the geometric structure of heterogeneous feature distributions. This paper proposes a unified framework for visual-temporal modality alignment grounded in optimal transport (OT) theory, introducing three progressive contributions: (i) Wasserstein Cross-Modal Attention (WCA) replaces conventional attention with weights derived from a transport plan, ensuring globally consistent cross-modal correspondences; (ii) Hierarchical Optimal Transport (HOT) extends alignment to a multi-scale formulation with coarse-to-fine transport-plan refinement across token, segment, and global semantic levels; and (iii) Adaptive Sinkhorn Divergence with a Learnable Cost Matrix (ASLC) jointly learns a task-specific cost function and dynamically adjusts the entropic regularization according to the cross-modal discrepancy. Extensive experiments on three standard benchmarks (MSR-VTT, ActivityNet Captions, UCF101) show that ASLC achieves consistent improvements of 5.0-7.4% in Recall@1 and 30-60% lower MMD than the baselines. Comprehensive ablation studies, hyperparameter sensitivity analyses, noise-robustness evaluations, and cross-dataset generalization experiments confirm the effectiveness, robustness, and transferability of the proposed framework.
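To make the WCA idea concrete, the sketch below illustrates one way attention weights can be taken from an entropic-OT transport plan computed with log-domain Sinkhorn iterations; it is a minimal illustration, not the paper's implementation, and the cosine ground cost, uniform marginals, epsilon value, iteration count, and function names are all assumptions for this example.

```python
# Minimal sketch of OT-plan-based cross-modal attention (illustrative only).
# Assumptions: uniform marginals over tokens, cosine distance as the ground cost,
# fixed epsilon and iteration count; not the authors' released code.
import math
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Entropic-OT transport plan between two uniform marginals.

    cost: (n, m) pairwise cost matrix.
    Returns an (n, m) plan whose rows/columns approximately sum to 1/n and 1/m.
    """
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n))   # log of uniform source marginal
    log_nu = torch.full((m,), -math.log(m))   # log of uniform target marginal
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(n_iters):
        # Log-domain Sinkhorn updates of the dual potentials for numerical stability.
        f = eps * (log_mu - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    return torch.exp((f[:, None] + g[None, :] - cost) / eps)

def wasserstein_cross_attention(video_tokens, text_tokens):
    """Aggregate text features for each video token, weighting by the OT plan."""
    v = F.normalize(video_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    cost = 1.0 - v @ t.T                              # cosine distance as ground cost
    plan = sinkhorn_plan(cost)                        # globally consistent correspondence
    weights = plan / plan.sum(dim=1, keepdim=True)    # row-normalize plan into attention
    return weights @ text_tokens                      # attended text features per video token

# Toy usage: 8 video tokens and 5 text tokens with 64-d features.
video = torch.randn(8, 64)
text = torch.randn(5, 64)
out = wasserstein_cross_attention(video, text)
print(out.shape)  # torch.Size([8, 64])
```

Unlike softmax attention, which normalizes each query row independently, the Sinkhorn plan is constrained on both marginals, which is what gives the correspondences their global consistency; the learnable cost and adaptive epsilon of ASLC would replace the fixed cosine cost and fixed `eps` assumed here.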