Multimodal generation technology is entering digital art creation and real-time interaction. However, existing diffusion-based generation methods mostly rely on static conditional inputs and, when continuous speech, gesture, and image prompts act jointly, are prone to semantic drift, insufficient absorption of feedback, and unstable interaction states. To address these problems, this paper proposes a dual-loop feedback control method that combines a diffusion model with policy gradient optimization. Text, image, speech, and gesture signals are encoded uniformly into control states; outer-loop semantic constraints maintain the main direction of generation, while inner-loop local correction responds to disturbances from user feedback. Experiments on 5,240 groups of multimodal interaction samples show that the Dual-Loop model achieves 89.6% ± 0.7, 91.2% ± 0.6, and 88.4% ± 0.5 in Controllability Score, Response Consistency, and Interaction Stability, respectively. Response Consistency remains at 90.4% after 10 consecutive rounds of interaction, and inference throughput is 17.9 frames/s under high load with complex interaction. The results indicate that the dual-loop feedback mechanism improves the stability of continuous interaction while preserving the controllability of generation, providing technical support for feedback-driven generation and real-time interaction optimization in digital art creation with artificial intelligence.
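
To make the dual-loop idea concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes the control state is a simple concatenation of per-modality feature vectors, that the outer loop is a proportional pull toward a fixed semantic anchor, and that the inner loop adds a scaled user-feedback signal. The diffusion model and the policy gradient optimization are omitted entirely. The names `encode_control_state`, `dual_loop_step`, `drift`, `outer_gain`, and `inner_gain` are hypothetical and introduced only for illustration.

```python
import math
import random


def encode_control_state(text, image, speech, gesture):
    """Fuse per-modality feature vectors into one control state.

    Assumption: fusion is plain concatenation; the paper's encoder is not specified here.
    """
    return [*text, *image, *speech, *gesture]


def dual_loop_step(state, anchor, feedback, outer_gain=0.2, inner_gain=0.5):
    """Apply one interaction round of the toy dual-loop update."""
    # Inner loop: local correction that absorbs the user's feedback signal.
    corrected = [s + inner_gain * f for s, f in zip(state, feedback)]
    # Outer loop: semantic constraint that pulls the state back toward the anchor.
    return [c + outer_gain * (a - c) for c, a in zip(corrected, anchor)]


def drift(state, anchor):
    """Euclidean distance between the control state and the semantic anchor."""
    return math.sqrt(sum((s - a) ** 2 for s, a in zip(state, anchor)))


random.seed(0)
# Four hypothetical modality embeddings of dimension 4 each.
anchor = encode_control_state(
    *([random.gauss(0, 1) for _ in range(4)] for _ in range(4))
)
# Start from a state that has already drifted away from the anchor.
state = [a + random.gauss(0, 1) for a in anchor]
initial_drift = drift(state, anchor)

# Ten consecutive rounds with small random feedback disturbances.
for _ in range(10):
    feedback = [random.gauss(0, 0.05) for _ in anchor]
    state = dual_loop_step(state, anchor, feedback)
final_drift = drift(state, anchor)

print(f"drift before: {initial_drift:.3f}, after 10 rounds: {final_drift:.3f}")
```

In this sketch the outer gain bounds how far accumulated feedback can move the state from the anchor, while the inner gain sets how strongly each round of feedback is absorbed; a real system would presumably learn or tune both rather than fix them.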