Generative artificial intelligence has shown great potential in the creative industries, but text-driven diffusion models struggle with complex professional design tasks: they lose control over spatial topology and drift away from the creator's intent. To address these problems, this paper proposes a visually guided adaptive assisted generative network (VGAGN). We construct a high-precision paired dataset of more than 52,000 sets covering two major scenarios, architecture and industrial design. We introduce a parameterized spatial degradation mask that addresses the domain shift arising in real human-computer interaction. At the algorithmic level, the framework couples multimodal visual priors in the feature space with structural consistency penalty terms in the pixel domain, reshaping the non-convex optimization trajectory. Quantitative tests and controlled user studies show that VGAGN reduces FID to 15.61 and achieves a structural fidelity of 0.89, while significantly reducing the repetitive cognitive load on designers and keeping inference latency below four seconds. These results demonstrate the engineering value of strong computer vision guidance for precise human-computer co-creation.