In unstructured environments, autonomous robot manipulation suffers from high uncertainty in visual perception, large control delays, and shallow vision-control fusion, which lead to low success rates and poor trajectory accuracy under disturbances. Existing approaches based on visual servoing, Diffusion Policy, and vision-language-action (VLA) models mostly rely on one-way or static fusion and lack real-time bidirectional interaction. This study proposes a Bidirectional Vision-Control Fusion Framework (BVCFF) with three components. An Uncertainty-Aware Adaptive Fusion (UAAF) mechanism dynamically balances vision and control weights using visual entropy and Lyapunov gradients. A Graph Attention Temporal Fusion network (GAT-TF) captures long-term multimodal dependencies. An end-to-end differentiable joint optimization embeds Lyapunov stability into the composite loss, enabling bidirectional error back-propagation. In Gazebo simulation experiments, BVCFF achieves 94.8% grasping success, 87.6% insertion success, and 80.3% dynamic success; in preliminary real-robot validation on a UR5e platform, the corresponding rates are 87.2%, 76.4%, and 68.7%. The framework also achieves a trajectory error of 5.3 mm, an inference speed of 43 FPS, and a robustness score of 0.92, outperforming seven baseline methods, including Diffusion Policy and an OpenVLA-inspired VLA model. These results indicate that deep bidirectional fusion provides an efficient and robust solution for deploying embodied intelligence.
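To make the idea of uncertainty-aware weighting concrete, the following minimal Python sketch shows one way vision and control weights could be balanced from visual entropy and a Lyapunov gradient. It is an illustration under stated assumptions, not the UAAF formulation itself: the function names, the normalization of the entropy, the use of the gradient norm as a control-side uncertainty proxy, the softmax rule, and the `temperature` parameter are all hypothetical choices made for this example.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fusion_weights(visual_probs, lyapunov_grad, temperature=1.0):
    """Return (w_vision, w_control), two non-negative weights that sum to 1.

    Hypothetical rule (an assumption of this sketch, not the paper's method):
    each branch is scored by the negative of an uncertainty proxy, and the
    two scores are passed through a softmax.
    """
    # Vision uncertainty: entropy normalized to [0, 1] by its maximum log(K).
    h_max = math.log(len(visual_probs))
    vision_uncertainty = shannon_entropy(visual_probs) / h_max if h_max > 0 else 0.0

    # Control uncertainty (assumed proxy): norm of the Lyapunov gradient,
    # squashed into [0, 1). A larger norm means the state is farther from
    # the equilibrium of the Lyapunov function.
    grad_norm = math.sqrt(sum(g * g for g in lyapunov_grad))
    control_uncertainty = grad_norm / (1.0 + grad_norm)

    # Numerically stable two-way softmax over the negative uncertainties.
    s_v = -vision_uncertainty / temperature
    s_c = -control_uncertainty / temperature
    m = max(s_v, s_c)
    e_v, e_c = math.exp(s_v - m), math.exp(s_c - m)
    return e_v / (e_v + e_c), e_c / (e_v + e_c)


if __name__ == "__main__":
    # Confident visual estimate, moderate Lyapunov gradient: vision dominates.
    print(fusion_weights([0.97, 0.01, 0.01, 0.01], [0.5, 0.5]))
    # Maximally uncertain visual estimate, zero gradient: control dominates.
    print(fusion_weights([0.25, 0.25, 0.25, 0.25], [0.0, 0.0]))
```

In this sketch, a sharply peaked visual distribution shifts weight toward the vision branch, while a uniform distribution shifts it toward the control branch. A lower `temperature` makes the switch between the two more abrupt.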