To address static identification lag, poorly matched regulatory actions, and underused feedback in the regulation of supply chain greenwashing, this paper constructs a dynamic regulation strategy optimization model based on reinforcement learning. The model takes green-claim consistency, performance deviation, certification changes, text anomalies, and historical feedback as its state inputs; defines a set of supervisory actions comprising prompts for explanation, data review, targeted spot checks, credit constraints, and continuous tracking; and uses a reward function that jointly accounts for risk reduction, resource consumption, and misjudgment loss. Experiments were conducted on 1,260 supply chain entities, 85,420 structured records, and 18,670 text disclosure samples. The model was trained for 500 rounds and compared with logistic regression, SVM, random forest, XGBoost, and a static DQN. The proposed model achieves an accuracy of 93.6%, a Macro-F1 of 91.8%, and a high-risk recall of 92.4%, while reducing the invalid resource consumption rate to 13.8% and shortening the average response cycle to 2.4 working days. These results indicate that the proposed model improves both the accuracy of greenwashing risk identification and the adaptability of dynamic supervisory actions, and that it provides a computable optimization path for the intelligent supervision of supply chain greenwashing.
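The state, action, and reward design summarized above can be illustrated with a minimal sketch. The abstract does not give the functional form of the reward, so the linear weighting below, the weight values, and all field and function names (`State`, `Action`, `reward`, `w_risk`, `w_cost`, `w_miss`) are illustrative assumptions rather than the paper's actual formulation.

```python
from dataclasses import dataclass, astuple
from enum import Enum


class Action(Enum):
    """The five supervisory actions named in the abstract (identifiers are hypothetical)."""
    PROMPT_EXPLANATION = 0
    DATA_REVIEW = 1
    TARGETED_SPOT_CHECK = 2
    CREDIT_CONSTRAINT = 3
    CONTINUOUS_TRACKING = 4


@dataclass(frozen=True)
class State:
    """The five state inputs named in the abstract; scaling to [0, 1] is an assumption."""
    claim_consistency: float
    performance_deviation: float
    certification_change: float
    text_anomaly: float
    historical_feedback: float

    def as_vector(self) -> list:
        # Flatten the fields, in declaration order, into a feature vector for a Q-network.
        return list(astuple(self))


def reward(risk_reduction: float,
           resource_cost: float,
           misjudgment_loss: float,
           w_risk: float = 1.0,
           w_cost: float = 0.3,
           w_miss: float = 0.5) -> float:
    """Assumed linear trade-off: reward risk reduction, penalize
    resource consumption and misjudgment loss. Weights are placeholders."""
    return w_risk * risk_reduction - w_cost * resource_cost - w_miss * misjudgment_loss


if __name__ == "__main__":
    s = State(0.4, 0.7, 0.2, 0.6, 0.5)
    print(s.as_vector())
    print(Action.TARGETED_SPOT_CHECK, reward(0.8, 0.2, 0.1))
```

Under these assumptions, an action that sharply lowers risk at low cost scores higher than one that achieves the same reduction with heavy resource use or frequent misjudgments, which is the trade-off the reward function is described as balancing.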