To reconcile the conflict between optimizing immediate click-through rate and long-term user retention in new media recommendation, this paper proposes DRL-MOREC, a deep reinforcement learning framework. A hybrid state encoder fuses users’ short-term behavior sequences with long-term interest graphs to capture interest at both time scales. A two-stage reward function allocates session-level 7-day retention prediction signals to each step via a discount factor, alleviating the sparsity of delayed rewards. Conservative Q-learning and inverse propensity score weighting are introduced to mitigate distribution shift and popularity bias, respectively. Offline experiments on a short-video platform dataset show that the proposed method achieves a 7-day retention rate 2.0, 2.7, and 1.3 percentage points higher than DeepFM, DDPG-TD3, and SAC-Rec, respectively, while improving catalog coverage (ECC) by 0.09. Online A/B testing shows an 11.3% lift in daily active user retention over DeepFM. Ablation studies indicate that the delayed reward contributes 4.0 percentage points to the retention improvement. These results validate the effectiveness of reinforcement learning for optimizing long-term user value in dynamic recommendation scenarios.
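One way a session-level retention signal could be allocated to individual steps via a discount factor is sketched below. This is a minimal illustration and not necessarily the formulation used in DRL-MOREC. The function name `allocate_session_reward`, the normalized weighting scheme, and the default `gamma` are all assumptions introduced here.

```python
from typing import List


def allocate_session_reward(
    step_rewards: List[float],
    session_signal: float,
    gamma: float = 0.9,
) -> List[float]:
    """Add a discounted share of a session-level signal to each step reward.

    Assumed scheme (hypothetical, for illustration only): step t of a
    T-step session gets the weight gamma ** (T - 1 - t), so later steps
    receive more credit. Weights are normalized to sum to 1, which means
    the shares handed out add up to exactly `session_signal`.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    num_steps = len(step_rewards)
    if num_steps == 0:
        return []
    weights = [gamma ** (num_steps - 1 - t) for t in range(num_steps)]
    total = sum(weights)  # always >= 1, since the last step has weight 1
    return [
        reward + (weight / total) * session_signal
        for reward, weight in zip(step_rewards, weights)
    ]


if __name__ == "__main__":
    # Three steps with no immediate reward and a predicted retention signal of 1.0.
    print(allocate_session_reward([0.0, 0.0, 0.0], 1.0, gamma=0.5))
```

Under this assumed scheme every step receives a nonzero learning signal even when the only feedback arrives at the end of the session, which is the sense in which such an allocation eases delayed-reward sparsity.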