Ingegneria Sismica

Optimal Transport Theory for Aligning Visual and Temporal Modalities: A Wasserstein Distance Approach

Author(s): Yapeng Xu1
1School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Xu, Yapeng. “Optimal Transport Theory for Aligning Visual and Temporal Modalities: A Wasserstein Distance Approach.” Ingegneria Sismica Volume 43 Issue 2: 1-22, doi:10.65102/is2026878.

Abstract

Cross-modal alignment between visual and temporal data is a fundamental problem in multimodal learning, underpinning applications from video-text retrieval to temporal activity localization. Current methods rely mainly on dot-product attention or kernel-based similarity measures, which treat feature vectors as isolated points and fail to capture the geometric structure of heterogeneous feature distributions. This article proposes a comprehensive framework based on optimal transport (OT) theory for visual-temporal modality alignment, introducing three progressive contributions: (i) Wasserstein Cross-Modal Attention (WCA) replaces conventional attention with weights derived from a transport plan, which enforces globally consistent cross-modal correspondences; (ii) Hierarchical Optimal Transport (HOT) extends alignment to a multi-scale framework with coarse-to-fine refinement of the transport plan across token, segment, and global semantic levels; and (iii) Adaptive Sinkhorn Divergence with Learnable Cost Matrix (ASLC) jointly learns a task-specific cost function and dynamically adjusts the entropic regularization according to the cross-modal discrepancy. Extensive experiments on three standard benchmarks (MSR-VTT, ActivityNet Captions, UCF101) demonstrate that ASLC achieves consistent improvements of 5.0-7.4% in Recall@1 and 30-60% lower MMD over baselines. Comprehensive ablation studies, hyperparameter sensitivity analyses, noise robustness evaluations, and cross-dataset generalization experiments confirm the effectiveness, robustness, and transferability of the proposed framework.
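The idea of using a transport plan as attention weights can be illustrated with a minimal sketch. This is not the article's implementation: it assumes uniform marginals over tokens, a fixed cosine cost instead of a learned cost matrix, and a fixed regularization strength `eps`. The function names `cosine_cost`, `sinkhorn_plan`, and `transport_attention` are hypothetical and chosen for illustration only.

```python
import math


def cosine_cost(xs, ys):
    """Pairwise cost 1 - cosine similarity (assumes nonzero vectors)."""
    def norm(v):
        return math.sqrt(sum(t * t for t in v))
    return [
        [1.0 - sum(p * q for p, q in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn_plan(cost, eps=0.5, n_iters=500):
    """Entropic OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # assumed uniform mass on visual tokens
    b = [1.0 / m] * m  # assumed uniform mass on temporal tokens
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_attention(plan, values):
    """Row-normalize the plan into attention weights and mix the value vectors."""
    dim = len(values[0])
    out = []
    for row in plan:
        total = sum(row)
        weights = [p / total for p in row]
        out.append(
            [sum(w * val[k] for w, val in zip(weights, values)) for k in range(dim)]
        )
    return out
```

Unlike a row-wise softmax, the plan returned here must also satisfy the column marginals, so no single temporal token can absorb all of the attention mass. That coupling between rows is one way to read the "globally consistent" correspondences the abstract attributes to WCA.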

Keywords
optimal transport, cross-modal alignment, Wasserstein distance, Sinkhorn algorithm, visual-temporal alignment, multimodal learning
