Ingegneria Sismica

Optimal Transport Theory for Aligning Visual and Temporal Modalities: A Wasserstein Distance Approach

Author(s): Yapeng Xu1
1School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Xu, Yapeng. “Optimal Transport Theory for Aligning Visual and Temporal Modalities: A Wasserstein Distance Approach.” Ingegneria Sismica Volume 43 Issue 2: 1-22, doi:10.65102/is2026878.

Abstract

Cross-modal alignment between visual and temporal data is a fundamental problem in multimodal learning, underpinning applications from video-text retrieval to temporal activity localization. Existing methods rely mainly on dot-product attention or kernel-based similarity measures, which treat feature vectors as isolated points and fail to capture the geometric structure of heterogeneous feature distributions. This article proposes a comprehensive framework based on optimal transport (OT) theory for visual-temporal modality alignment, introducing three progressive contributions: (i) Wasserstein Cross-Modal Attention (WCA), which replaces conventional attention with weights derived from a transport plan, ensuring globally consistent cross-modal correspondences; (ii) Hierarchical Optimal Transport (HOT), which extends alignment to a multi-scale framework with coarse-to-fine refinement of the transport plan across token, segment, and global semantic levels; and (iii) Adaptive Sinkhorn Divergence with Learnable Cost Matrix (ASLC), which jointly learns a task-specific cost function and dynamically adjusts the entropic regularization according to the cross-modal discrepancy. Extensive experiments on three standard benchmarks (MSR-VTT, ActivityNet Captions, UCF101) demonstrate that ASLC achieves consistent improvements of 5.0-7.4% in Recall@1 and 30-60% lower MMD over baselines. Comprehensive ablation studies, hyperparameter sensitivity analyses, noise robustness evaluations, and cross-dataset generalization experiments confirm the effectiveness, robustness, and transferability of the proposed framework.
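To make the idea of attention weights taken from a transport plan more concrete, the following is a minimal sketch of entropic optimal transport solved with Sinkhorn iterations. It is not the article's WCA, HOT, or ASLC implementation. The cosine cost, the uniform marginals, the fixed regularization `eps`, and all function names (`cosine_cost`, `sinkhorn_plan`, `transport_attention`) are assumptions chosen for illustration only.

```python
import math

def cosine_cost(xs, ys):
    """Cost C[i][j] = 1 - cosine similarity (an assumed, fixed cost)."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v)) or 1.0
    return [[1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))
             for y in ys] for x in xs]

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n          # assumed uniform mass over visual tokens
    b = [1.0 / m] * m          # assumed uniform mass over temporal tokens
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_attention(plan, values):
    """Use each row of the plan, normalised to sum to 1, as attention weights."""
    dim = len(values[0])
    out = []
    for row in plan:
        total = sum(row)
        out.append([sum(w / total * val[d] for w, val in zip(row, values))
                    for d in range(dim)])
    return out

# Toy features: three "visual" tokens and two "temporal" tokens (hypothetical data).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
steps = [[1.0, 0.1], [0.1, 1.0]]
plan = sinkhorn_plan(cosine_cost(frames, steps))
attended = transport_attention(plan, steps)
```

Unlike a row-wise softmax, the plan is constrained on both sides: each row and each column must carry its prescribed mass, so no single temporal token can absorb all of the attention. A learnable cost or an adaptive `eps`, as the abstract describes for ASLC, would replace the fixed choices made here.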

Keywords
optimal transport, cross-modal alignment, Wasserstein distance, Sinkhorn algorithm, visual-temporal alignment, multimodal learning
