The wide application of generative AI programs in text generation, question answering, platform services, and organizational decision-making has shifted the evaluation problem from simple model performance measurement to the joint analysis of data, algorithms, platforms, and objectives. This paper draws on Aristotle's four causes (material, formal, efficient, and final), integrates critical political economy with realist evaluation, and constructs a comprehensive evaluation framework covering training data quality, compute support, model architecture, generation mechanisms, MLOps processes, API orchestration, log auditing, and value alignment. Amazon's hiring algorithm, Google Bard, and Microsoft Tay serve as case studies. The results show that the program bias score is 0.18 under the "low data bias, low model fluctuation" condition and rises to 0.81 under the "high data bias, high model fluctuation" condition. In the Amazon hiring case, the material and formal causes together contributed 61%; in Google Bard, the formal cause contributed 38%; in Microsoft Tay, the efficient and final causes together contributed 58%. The findings indicate that failures of generative AI programs exhibit a significant four-cause coupling pattern. The framework provides an evaluation basis that combines technical depth with social interpretive power for algorithm governance, accountability tracing, and platform regulation, and therefore carries methodological significance.