To address the high false alarm rate and delayed response caused by insufficient semantic alignment of multimodal features in power metering messages, a CLIP-based cross-modal fusion method for anomaly blocking is proposed. The CLIP model jointly encodes text, time-series signals, and device images into a unified 512-dimensional semantic space, which is combined with Gaussian mixture model (GMM) modeling to determine dynamic thresholds and match anomalies accurately. A deep Q-network (DQN) is introduced to optimize the blocking strategy, integrating grid topology and historical data to generate real-time response decisions. Experiments show that the method achieves a false alarm rate of at most 2.3%, a missed-detection rate of at most 1.8%, and an average response time of 0.78 seconds, and improves blocking efficiency by 8.8%, providing intelligent support for power system security.
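As an illustration of the GMM-based dynamic threshold step only, the following is a minimal sketch and not the authors' implementation. It assumes each message has already been reduced to a scalar anomaly score (for example, one minus the cosine similarity between its fused embedding and a reference embedding), fits a two-component one-dimensional Gaussian mixture with EM, and places the threshold where the posterior probability of the higher-mean ("anomalous") component reaches 0.5. The function names, the choice of two components, and the synthetic scores are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(scores, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to the scores with EM."""
    lo, hi = min(scores), max(scores)
    means = [lo, hi]                              # initialise at the extremes
    spread = max((hi - lo) ** 2 / 16.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
            variances[k] = max(var, 1e-6)         # floor to avoid collapse
            weights[k] = nk / len(scores)
    return weights, means, variances


def dynamic_threshold(weights, means, variances, n_steps=60):
    """Bisect between the component means for posterior(anomalous) = 0.5.

    Assumes the two components are reasonably well separated, so the
    posterior is below 0.5 at the lower mean and above 0.5 at the higher one.
    """
    normal, anom = sorted(range(2), key=lambda k: means[k])

    def posterior_anomalous(x):
        p_n = weights[normal] * normal_pdf(x, means[normal], variances[normal])
        p_a = weights[anom] * normal_pdf(x, means[anom], variances[anom])
        return p_a / ((p_n + p_a) or 1e-300)

    low, high = means[normal], means[anom]
    for _ in range(n_steps):
        mid = (low + high) / 2.0
        if posterior_anomalous(mid) < 0.5:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Synthetic scores (assumed data): many normal messages, a few anomalous ones.
random.seed(0)
scores = [random.gauss(0.2, 0.03) for _ in range(300)]
scores += [random.gauss(0.7, 0.05) for _ in range(30)]

threshold = dynamic_threshold(*fit_gmm_1d(scores))
print(f"dynamic threshold: {threshold:.3f}")
print("score 0.80 flagged:", 0.80 > threshold)
print("score 0.20 flagged:", 0.20 > threshold)
```

Refitting the mixture on a sliding window of recent scores is one way such a threshold could track changing operating conditions; the abstract does not state how the threshold is actually updated.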