As animation production grows more complex, behavior generation has become a key step in maintaining the temporal consistency, physical plausibility, and coordinated expression of virtual characters. This paper proposes a deep reinforcement learning based behavior generation algorithm for animated characters that jointly encodes multi-modal motion cues, scene states, and semantic action labels to generate controllable behavior sequences. A policy learning framework built on an actor-critic learning mechanism and a hierarchical state representation is used to improve the cohesion between successive actions and the accuracy of scene responses. Experiments are conducted on 12,480 motion segments drawn from CMU Mocap, Mixamo, and self-built interactive segments. Action accuracy, macro-average F1, trajectory offset error, and structural similarity serve as evaluation metrics, and GRU, Transformer, and PPO-RNN serve as comparison methods. The proposed method reaches an action accuracy of 95.84%, a macro-average F1 of 93.12%, a structural similarity of 0.927, and a trajectory offset error of 3.84 cm. The scoring results further show that the proposed method performs better in smoothness, controllability, and visual naturalness.
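
For readers unfamiliar with two of the metrics named above, the following is a minimal sketch of how macro-average F1 and a trajectory offset error might be computed. The abstract does not give the exact definitions used in the experiments, so this is an illustration under stated assumptions: macro-average F1 is taken as the unweighted mean of per-class F1 scores, and trajectory offset error is assumed to be the mean per-frame Euclidean distance between predicted and reference positions. The function names `macro_f1` and `trajectory_offset_error` are hypothetical.

```python
import math
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); fall back to 0.0 if the class has no support
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def trajectory_offset_error(
    pred: Sequence[Sequence[float]], ref: Sequence[Sequence[float]]
) -> float:
    """Assumed definition: mean per-frame Euclidean distance, in the input units."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and equally long")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


# Toy data, not taken from the paper's experiments.
labels_true = ["walk", "walk", "run", "jump"]
labels_pred = ["walk", "run", "run", "jump"]
score = macro_f1(labels_true, labels_pred)

predicted_path = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
reference_path = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
offset = trajectory_offset_error(predicted_path, reference_path)
```

Macro averaging weights every action class equally, so rare actions influence the score as much as frequent ones. That is a plausible reason to report it alongside overall accuracy when action classes are imbalanced.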