Traditional English translation models suffer from insufficient scene constraints, limited ability to resolve ambiguity, and unstable coordination between image and text semantics. To address these problems, this paper proposes a multimodal English translation generation model that combines cross-modal alignment with an attention mechanism. Taking text and image as joint input, the model forms an integrated “alignment-attention-generation” pipeline with four stages: multimodal input representation, mapping into a shared semantic space, bidirectional cross-modal alignment, and attention-driven decoding. On the test set, the proposed model reaches BLEU, METEOR, and ROUGE-L scores of 37.4, 32.5, and 41.3, respectively, which are 5.6, 4.4, and 5.9 points higher than those of the baseline Transformer model. Accuracy on image-text consistency, ambiguity resolution, and entity alignment reaches 85.9%, 84.2%, and 85.1%, respectively. These results indicate that cross-modal alignment reduces the representation gap between textual and visual semantics, and that the attention mechanism strengthens the dynamic selection of key context during translation generation, thereby improving the accuracy, stability, and applicability of multimodal English translation.
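
As a purely illustrative aid, the bidirectional cross-modal alignment step can be sketched as scaled dot-product attention computed in both directions (text to image and image to text) over vectors that are assumed to already lie in a shared semantic space. This is a minimal sketch under stated assumptions, not the model evaluated in this paper: the function names (`cross_attend`, `bidirectional_align`), the two-dimensional toy vectors, and the choice of L2 normalization followed by scaled dot-product scoring are all hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """For each query, attend over keys_values with scaled dot-product scores.

    Returns (weights, contexts): one weight row and one context vector per query.
    """
    scale = math.sqrt(len(keys_values[0]))
    weights, contexts = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys_values])
        ctx = [sum(wi * k[d] for wi, k in zip(w, keys_values))
               for d in range(len(q))]
        weights.append(w)
        contexts.append(ctx)
    return weights, contexts

def bidirectional_align(text_vecs, image_vecs):
    """Align text tokens to image regions and image regions to text tokens.

    Assumption: both inputs were already projected into one shared space.
    """
    text = [l2_normalize(v) for v in text_vecs]
    image = [l2_normalize(v) for v in image_vecs]
    t2i_weights, text_ctx = cross_attend(text, image)    # text queries image
    i2t_weights, image_ctx = cross_attend(image, text)   # image queries text
    return t2i_weights, i2t_weights, text_ctx, image_ctx

# Hypothetical toy inputs: two text tokens and two image regions in a 2-D shared space.
text_vecs = [[1.0, 0.2], [0.1, 1.0]]
image_vecs = [[0.9, 0.1], [0.0, 1.0]]

t2i, i2t, text_ctx, image_ctx = bidirectional_align(text_vecs, image_vecs)
for token_index, row in enumerate(t2i):
    best_region = max(range(len(row)), key=row.__getitem__)
    print(f"text token {token_index} attends most to image region {best_region}")
```

In a sketch of this kind, the visually grounded context vectors (`text_ctx`) would be passed to the decoder alongside the text representation, so that an ambiguous source word can be disambiguated by the image region it attends to most strongly.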