Generative artificial intelligence (AI) is rapidly entering digital content production, driving intelligent composition to evolve from single-sequence prediction toward multimodal collaborative modeling. Addressing composition driven by multimodal music features, this paper constructs a generative model that integrates audio, MIDI, lyric text, style labels, and emotion labels. The model combines unified feature representation, cross-modal attention fusion, hierarchical sequence generation, synergy constraints on melody, rhythm, and harmony, dual-condition modulation by style and emotion, and structural verification of the score, forming a closed loop from feature input to symbolic output. Experimental results show that the model scores 91, 90, and 87 points on melody coherence, style matching, and emotional accuracy, respectively, with a comprehensive quality score of 88.4 that remains at 84.7 under 20% disturbance. In the lyric-assisted composition scenario, the comprehensive score is 89.1 and the style preservation rate is 91.4%. These results indicate that the method improves the structural stability and expressive consistency of generated music and can serve as a reference for the engineering implementation of intelligent composition systems.