To address the low efficiency, unstable cultural expression, and difficult style control in visual image design for digital cultural tourism, this paper proposes a design method based on generative artificial intelligence models and constructs a technical pipeline of demand analysis, data construction, multimodal semantic fusion, generation control, and feedback optimization. The method integrates deep learning, the Transformer architecture, multimodal representation learning, and diffusion-based generation, and is built on a dataset of 12,000 cultural tourism images, 8,500 regional cultural texts, 3,200 promotional texts, and 2,100 groups of brand visual cases. On this basis, posters, logos, IP images, and digital guide interfaces are generated. Experimental results show that the model achieves an average FID of 18.7, an SSIM of 0.842, a CLIP semantic similarity of 0.781, a cultural element fidelity of 88.4%, and a style matching degree of 86.9%. In the application case, the average generation time across the four task types is 6.33 s, and the average user satisfaction is 86.95%. These results indicate that the method can effectively improve the degree of automation, cultural recognizability, and communication adaptability of visual design for digital cultural tourism, and that it has strong theoretical value and application potential.