To address lagging state updates, coarse matching of candidate resources, and unstable policy convergence in personalized path generation, an adaptive dynamic path generation model driven by reinforcement learning is constructed. The model encodes user behavior sequences, resource features, knowledge states, and feedback records into a 128-dimensional state vector. A GRU extracts historical interaction features, and a knowledge dependency matrix is used to reduce the candidate action space. The reward function integrates path gain, resource matching degree, completion feedback, and a load penalty, and the magnitude of each policy update is constrained by the PPO clipped objective function. The experimental dataset was built from 12,864 interaction records, 420 resource nodes, and 96 knowledge units. The proposed PPO model achieves an Accuracy of 93.5%, an NDCG@10 of 0.881, a Completion Rate of 89.7%, a Mastery Gain of 22.4%, and an Average Reward of 0.842. Compared with the DQN model, the relative improvements in Accuracy, NDCG@10, and Completion Rate are 6.74%, 8.50%, and 8.86%, respectively, and the average reward stabilizes after round 82. These results indicate that the proposed model improves the accuracy of dynamic path generation, the quality of resource ranking, and the stability of policy iteration.
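
The constraint on the policy update described above corresponds to the standard PPO clipped surrogate objective, which can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ppo_clip_objective`, its argument names, and the clipping threshold `eps=0.2` are assumptions chosen for illustration, since the abstract does not report the value used.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective over a batch of sampled actions.

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current and the previous policy. advantages: advantage estimates.
    eps is an assumed clipping threshold, not a value taken from the paper.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For example, with old action probabilities 0.3 and 0.2, new probabilities 0.6 and 0.1, and advantages 1.0 and -1.0, the ratios 2.0 and 0.5 are clipped to 1.2 and 0.8, so the objective is approximately 0.2. Because the objective stops rewarding ratios outside the clipping interval, a single update cannot move the policy arbitrarily far from the previous one, which is the mechanism usually credited for smoother convergence.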