As immersive digital media art extends into real-time interactive 3D space, scene generation demands higher-precision reconstruction, low-latency rendering, and behavior-driven feedback. In the proposed approach, multi-source visual data, depth maps, camera poses, and semantic annotations are encoded into a unified representation, which is combined with a neural radiance field to achieve continuous 3D reconstruction. In parallel, head-mounted display pose, gaze, gesture, and control commands are fused as multi-modal input; attention weight allocation, intention recognition, and state scheduling then drive a closed-loop update of the scene state. Experiments show that the average reconstruction error across four types of art scenes is 1.61 cm; for the digital exhibition hall, the PSNR reaches 32.4 dB and the SSIM is 0.936; the frame rate for 1080p high-complexity scenes is 59.7 FPS; and the overall recognition accuracy on interactive tasks reaches 94.5%. This study provides technical support for intelligent scene construction and real-time interaction design in immersive digital media art.
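
As an illustration of the attention weight allocation step, the following is a minimal sketch that assumes each modality has already been reduced to a feature vector of equal length and a scalar relevance score. The function names (`softmax`, `fuse_modalities`), the modality keys, and all numeric values are hypothetical and chosen only for demonstration; the abstract does not specify the actual fusion network.

```python
import math


def softmax(scores):
    """Convert raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, scores):
    """Attention-weighted sum of per-modality feature vectors.

    features: dict mapping modality name -> list of floats (equal lengths)
    scores:   dict mapping modality name -> scalar relevance score
    Returns (weights, fused): a dict of attention weights and the fused vector.
    """
    names = sorted(features)
    weights = softmax([scores[name] for name in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return dict(zip(names, weights)), fused


# Hypothetical inputs: four modalities with 2-D features (illustrative only).
features = {
    "hmd_pose": [1.0, 0.0],
    "gaze": [0.0, 1.0],
    "gesture": [1.0, 1.0],
    "command": [0.0, 0.0],
}
scores = {"hmd_pose": 0.0, "gaze": 2.0, "gesture": 0.0, "command": 0.0}

weights, fused = fuse_modalities(features, scores)
```

In this sketch the modality with the highest score (here, gaze) receives the largest weight, so the fused vector leans toward its features; a downstream intention classifier would consume `fused`. In a trained system the scores would come from a learned model rather than being fixed by hand.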