This paper proposes a knowledge graph construction method based on multimodal information fusion. By introducing feature guidance and a multimodal cross-attention mechanism, the method enriches the semantic information of text and improves the accuracy of entity recognition and relation extraction. The proposed model adopts a multi-level visual cue mechanism and aligns the feature distributions of the different modalities, which narrows the semantic gap between text and images and allows entities to be matched accurately with their associated objects in images. For training, the Adam optimizer is used with a linear learning-rate scheduler, and separate learning rates are assigned to the language, common-sense, visual, and EICF encoders; an extensive hyperparameter search is carried out to ensure a fair comparison. Experiments on public datasets such as Amazon and YouTube, as well as on a self-built dataset, show that the proposed model significantly outperforms baseline models such as Seq2Seq, NFM, CKE, KGCN, and MMGCN on the AUC, AP, and F1 metrics. These results support the effectiveness of the proposed model for multimodal information fusion and knowledge graph construction.
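The training setup described above (separate learning rates per encoder, decayed by a linear scheduler) can be illustrated with a minimal, framework-free sketch. The base learning rates, the optional warmup phase, and the names `BASE_LR`, `linear_schedule`, and `group_lrs` are assumptions made for illustration; the source does not report the actual values or implementation.

```python
# Hypothetical base learning rates per encoder; the source does not give values.
BASE_LR = {
    "language": 3e-5,
    "common_sense": 1e-4,
    "visual": 1e-4,
    "eicf": 1e-3,
}


def linear_schedule(step, total_steps, warmup_steps=0):
    """Return a multiplier in [0, 1] for the given training step.

    The multiplier rises linearly from 0 to 1 during an optional warmup
    (an assumption; the source only mentions a linear scheduler), then
    decays linearly to 0 at total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps
    remaining = max(0, total_steps - step)
    return remaining / max(1, total_steps - warmup_steps)


def group_lrs(step, total_steps, warmup_steps=0, base_lr=BASE_LR):
    """Scale every encoder's base learning rate by the shared linear schedule."""
    scale = linear_schedule(step, total_steps, warmup_steps)
    return {name: lr * scale for name, lr in base_lr.items()}


if __name__ == "__main__":
    # Print the per-encoder learning rates at a few points in training.
    for step in (0, 250, 500, 1000):
        print(step, group_lrs(step, total_steps=1000))
```

In a PyTorch implementation, the same idea would typically be expressed by passing one parameter group per encoder to `torch.optim.Adam` and wrapping the optimizer in `torch.optim.lr_scheduler.LambdaLR`; whether the authors did so is not stated in the source.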