To address the low precision and limited spatial information of single-sensor systems, this paper presents a 3D object detection method based on 4D millimeter-wave (MMW) radar. The method combines multi-modal feature fusion with spatiotemporal scaling and adopts a middle-level fusion strategy to model the multi-element features of the radar point cloud. The camera branch uses a ResNet-FPN architecture to extract multiscale semantic features, and the radar branch uses a VoxelNet-based compression structure to improve the algorithm's performance. Experiments on the 4D-Drive and nuScenes datasets show that the proposed algorithm achieves 86.4% detection accuracy at an IoU threshold of 0.7, an improvement of 12.8% over the camera-only baseline and 9.5% over the radar-only baseline. Compared with the monocular vision approach, the absolute displacement error is reduced to 0.184 m, and detection accuracy remains above 85% under challenging conditions, indicating strong robustness and generalization ability. The system runs in real time at 31 FPS on the RTX A6000 and Jetson AGX Orin platforms. The study addresses the shortcomings of existing detection algorithms in precision, latency, and environmental adaptability, and provides an efficient approach to multi-source collaboration.
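
The accuracy figure above is reported at an IoU threshold of 0.7. As a minimal sketch of what that criterion means, the following Python snippet computes the intersection-over-union of two 3D boxes and checks it against the threshold. It rests on assumptions not stated in the abstract: the boxes are treated as axis-aligned (detection benchmarks typically use rotated boxes), the `(cx, cy, cz, dx, dy, dz)` box layout is hypothetical, and the function names `iou_3d_axis_aligned` and `is_true_positive` are illustrative rather than part of the proposed method.

```python
from typing import Tuple

# Hypothetical box layout (an assumption): center (cx, cy, cz) and
# full extents (dx, dy, dz) along the x, y, z axes, all in meters.
Box = Tuple[float, float, float, float, float, float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes.

    Simplification: box yaw is ignored, so this is not the rotated-box
    IoU that detection benchmarks usually evaluate.
    """
    intersection = 1.0
    for axis in range(3):
        a_min = a[axis] - a[axis + 3] / 2.0
        a_max = a[axis] + a[axis + 3] / 2.0
        b_min = b[axis] - b[axis + 3] / 2.0
        b_max = b[axis] + b[axis + 3] / 2.0
        overlap = min(a_max, b_max) - max(a_min, b_min)
        if overlap <= 0.0:
            return 0.0  # boxes do not overlap along this axis
        intersection *= overlap

    volume_a = a[3] * a[4] * a[5]
    volume_b = b[3] * b[4] * b[5]
    union = volume_a + volume_b - intersection
    return intersection / union if union > 0.0 else 0.0


def is_true_positive(pred: Box, gt: Box, threshold: float = 0.7) -> bool:
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou_3d_axis_aligned(pred, gt) >= threshold
```

Under this criterion a predicted box must overlap its ground-truth box substantially to count as a detection, which is why 0.7 is regarded as a strict setting for 3D detection.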