This study builds a practical research model for evaluating, selecting, and adapting general-purpose large models in two energy application scenarios: intelligent substation inspection and transmission-corridor visualisation. The benchmark contains 18,642 retained multimodal evidence records covering visible images, thermal frames, OCR strings, equipment metadata, corridor attributes, rule clauses, and historical ticket text. Six anonymised models were evaluated under fixed data splits, prompt templates, inference limits, and scoring scripts. The evaluation targeted power-service judgement: object localisation, risk inference, rule-based evidence, unsupported alarm control, and robustness to field perturbations. After a weighted-score screening of the candidates, the selected model was adapted with retrieval evidence, LoRA tuning, visual-grounding calibration, and a safety verifier. The adapted Power-GM achieved the best overall scores: 89.0% for visual grounding, 86.0% for risk judgement, 87.0% for rule compliance, 88.0% for hallucination suppression, and 84.0% for robustness. On eight selected tasks, it exceeded the strongest open multimodal baseline by 9.9 to 20.3 percentage points and the closed multimodal baseline by 3.3 to 8.5 percentage points.
The best response-surface region was LoRA rank 48 with retrieval top-k 6, where the Power-Biz score remained near 89.0% within the latency target. Ablation showed that retrieval improved rule compliance, LoRA strengthened task reasoning, grounding calibration reduced object-region mismatch, and the safety verifier reduced hallucinated risk statements. The conclusion is limited to the two tested scenes, fixed model labels, fixed task definitions, fixed scoring scripts, retained test records, and the evidence types in the benchmark corpus.
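The weighted-score screening mentioned above can be illustrated with a minimal sketch. The source does not give the weights or the per-model candidate scores, so the weights, the candidate names (`model_a`, `model_b`), and all numbers below are hypothetical placeholders chosen only to show the mechanics of ranking candidates by a weighted average over the five capabilities.

```python
from typing import Dict, List

# Hypothetical weights over the five capabilities named in the abstract.
# The actual weights used in the study are not stated in the source.
WEIGHTS: Dict[str, float] = {
    "visual_grounding": 0.25,
    "risk_judgement": 0.25,
    "rule_compliance": 0.20,
    "hallucination_suppression": 0.20,
    "robustness": 0.10,
}


def weighted_score(scores: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of per-capability scores (in percent)."""
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total_weight


def screen(candidates: Dict[str, Dict[str, float]],
           weights: Dict[str, float] = WEIGHTS) -> List[str]:
    """Rank candidate names from highest to lowest weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(candidates[name], weights),
                  reverse=True)


# Placeholder candidate scores, NOT results from the study.
candidates = {
    "model_a": {"visual_grounding": 80.0, "risk_judgement": 70.0,
                "rule_compliance": 75.0, "hallucination_suppression": 72.0,
                "robustness": 68.0},
    "model_b": {"visual_grounding": 74.0, "risk_judgement": 78.0,
                "rule_compliance": 70.0, "hallucination_suppression": 80.0,
                "robustness": 71.0},
}

ranking = screen(candidates)
for name in ranking:
    print(f"{name}: {weighted_score(candidates[name]):.1f}")
```

Under these placeholder values the top-ranked candidate would be the one carried forward to adaptation; a different choice of weights can change the ranking, which is why the weights need to be fixed before scoring.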