This paper proposes a cross-cultural classification method for images of musical instruments from Dunhuang murals and medieval Europe. We construct a dataset of 9,216 images covering 16 instrument categories, comprising 4,768 Dunhuang samples and 4,448 European samples. The framework applies region cropping, label normalization and block-level encoding, introduces hierarchical visual-semantic alignment and adaptive gated fusion, and maps heterogeneous cultural cues into a shared discriminative space. A two-branch classifier is designed to extract shared instrument-form features while preserving culture-specific details. During training, a focal classification loss, a contrastive alignment loss and a domain-constraint regularization term are jointly optimized to stabilize class boundaries. Experimental results show that the proposed method achieves 94.8% accuracy, 93.6% Macro-F1 and 92.4% CDR, outperforming the ResNet50, Swin-T and CLIP-Linear baselines. On the test set, the misjudgment proportion between harp and harp is below 6.3%, and the confusion rate between harp and harp is reduced to 8.1%. These results indicate that the model can provide computational support for cross-cultural instrument analysis, cataloguing and knowledge organization.
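Adaptive gated fusion is named but not specified in the abstract. The following is a minimal sketch of one common form, assuming an element-wise sigmoid gate computed from the concatenation of a visual feature vector and a semantic feature vector of equal length. The function names and the parameters `gate_weights` and `gate_bias` are hypothetical, and the authors' actual fusion module may differ.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    visual: Sequence[float],
    semantic: Sequence[float],
    gate_weights: Sequence[Sequence[float]],  # hypothetical: one weight row per output dim
    gate_bias: Sequence[float],               # hypothetical: one bias per output dim
) -> List[float]:
    """Fuse two equal-length feature vectors with an element-wise learned gate.

    g_i     = sigmoid(W_i . [visual; semantic] + b_i)
    fused_i = g_i * visual_i + (1 - g_i) * semantic_i
    """
    concat = list(visual) + list(semantic)
    fused = []
    for i, (v, s) in enumerate(zip(visual, semantic)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)  # g near 1 favours the visual cue, near 0 the semantic cue
        fused.append(g * v + (1.0 - g) * s)
    return fused
```

With all gate parameters at zero, every gate equals 0.5 and the output is the plain average of the two inputs; training the gate lets each dimension decide how much to trust each cue.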
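The joint training objective combines three terms, but the abstract gives neither their exact forms nor their weights. Below is a minimal per-sample sketch under stated assumptions: the focal loss uses the standard form -(1 - p_t)^gamma log p_t; the contrastive alignment loss is assumed to be InfoNCE over cosine similarity, pulling an anchor toward a same-class sample from the other culture; and the domain-constraint term is assumed to be the squared distance between the mean Dunhuang and mean European features (a linear-kernel MMD). The values of `gamma`, `tau`, `lambda_con` and `lambda_dom` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one sample; gamma = 0 reduces it to cross-entropy."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def contrastive_alignment_loss(anchor, positive, negatives, tau: float = 0.1) -> float:
    """Assumed InfoNCE form: anchor vs. one positive and a list of negatives."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def domain_regularizer(dunhuang_feats, european_feats) -> float:
    """Assumed form: squared Euclidean distance between domain feature means."""
    mu_d = mean_vector(dunhuang_feats)
    mu_e = mean_vector(european_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_d, mu_e))


def joint_loss(l_focal: float, l_con: float, l_dom: float,
               lambda_con: float = 0.5, lambda_dom: float = 0.1) -> float:
    """Weighted sum of the three terms; the weights here are placeholders."""
    return l_focal + lambda_con * l_con + lambda_dom * l_dom
```

The focal term down-weights easy examples so that training concentrates on confusable categories, while the two auxiliary terms act on the embedding space rather than on the logits.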
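Accuracy and Macro-F1 are the standard metrics reported above (CDR is not defined in the abstract and is not sketched here). The snippet below is a minimal illustration of how the two standard metrics differ: Macro-F1 averages the per-class F1 scores with equal weight, so a rarely recognized category lowers it even when overall accuracy stays high. The label names in the example are hypothetical toy labels, not data from the paper.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p was wrong
            fn[t] += 1  # true class t was missed
    f1s = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]  # F1 = 2TP / (2TP + FP + FN)
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

For example, with `y_true = ["A", "A", "B", "C"]` and `y_pred = ["A", "B", "B", "C"]`, accuracy is 0.75, whereas Macro-F1 is (2/3 + 2/3 + 1) / 3 = 7/9, about 0.778.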