In cluttered manufacturing cells, industrial robot grasping must cope with occlusion, reflection, scale change, and pose deviation simultaneously, and single-layer features are insufficient for stable target recognition and grasp parameter representation. This paper proposes a deep learning detection model that combines a multi-scale attention mechanism with hierarchical feature interaction, and builds an integrated framework covering scene modeling, feature enhancement, localization regression, and task mapping. The model jointly exploits shallow texture and high-level semantic information, and its detection head simultaneously outputs the category, center coordinates, width, height, and rotation direction. Experiments are conducted on 18,240 RGB-D images. The results show that the mAP@0.5 of the proposed method exceeds 95.84% across four types of industrial scenarios, the test-set accuracy reaches 96.74%, and the total system latency is 24.6 ms under TensorRT acceleration. Compared with YOLOv5s, HTC-Grasp, FAGD-Net, and ODGNet, the proposed model has a lower missed-detection rate and more stable online invocation, providing reliable visual support for real-time grasping by industrial robots.
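To make the detection head's output concrete, the following is a minimal sketch of how a single prediction (category, center coordinates, width, height, rotation) could be converted into the four corners of an oriented grasp rectangle. The names `GraspDetection` and `grasp_corners`, the use of radians, and the rotation convention are illustrative assumptions, not the paper's actual implementation or parameterization.

```python
import math
from dataclasses import dataclass


@dataclass
class GraspDetection:
    """Hypothetical container for one detection-head output."""
    category: str
    cx: float      # center x, in pixels
    cy: float      # center y, in pixels
    width: float   # rectangle extent along its own x axis
    height: float  # rectangle extent along its own y axis
    theta: float   # rotation in radians (assumed convention)


def grasp_corners(det: GraspDetection) -> list[tuple[float, float]]:
    """Return the four corners of the oriented rectangle.

    Each corner offset is rotated with the standard 2D rotation matrix
    and then translated to the rectangle center.
    """
    c, s = math.cos(det.theta), math.sin(det.theta)
    hw, hh = det.width / 2.0, det.height / 2.0
    offsets = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
    return [
        (det.cx + dx * c - dy * s, det.cy + dx * s + dy * c)
        for dx, dy in offsets
    ]


if __name__ == "__main__":
    det = GraspDetection("bracket", cx=10.0, cy=20.0,
                         width=4.0, height=2.0, theta=0.0)
    print(grasp_corners(det))
```

With `theta = 0` the result reduces to an ordinary axis-aligned box, which gives a quick sanity check before the corners are passed on to a task-mapping stage.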