This paper first designs an object recognition algorithm based on YOLOv3. An improved temporal-spatial context similarity feature fusion structure enhances the accuracy of object location prediction, and combining the residual modules of the Darknet53 backbone with a multi-scale detection architecture improves object classification performance. An improved SiamFC++ object tracking algorithm is then proposed to strengthen tracking robustness and consistency. Experiments on the VisDrone2023 dataset and multiple aerial video sequences show that our model outperforms several alternative models in both overlap rate and center error. Notably, in the person1 scenario, it achieves an average coverage rate of 82.1% with a center position error of only 11.4 pixels, indicating that it effectively handles target occlusion and rapid motion.
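
For readers unfamiliar with the two tracking metrics named above, the following is a minimal sketch of how overlap rate and center error are commonly computed per frame and averaged over a sequence. It assumes axis-aligned boxes given as `(x, y, w, h)` in pixels and takes overlap rate to mean intersection over union; the function names `overlap_rate`, `center_error`, and `evaluate_sequence` are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
import math


def overlap_rate(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes (assumed format)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle, clamped at zero.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance in pixels between the centers of two boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def evaluate_sequence(predictions, ground_truth):
    """Return (mean overlap rate, mean center error) over a sequence."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty box lists of equal length")
    pairs = list(zip(predictions, ground_truth))
    mean_overlap = sum(overlap_rate(p, g) for p, g in pairs) / len(pairs)
    mean_error = sum(center_error(p, g) for p, g in pairs) / len(pairs)
    return mean_overlap, mean_error
```

Under these assumptions, a higher mean overlap and a lower mean center error both indicate that the predicted boxes follow the annotated target more closely.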