This work proposes an end-to-end translation paradigm to address the error accumulation and low robustness of cascaded methods for translating music scores into braille. To support it, a large-scale parallel corpus of staff notation and braille music was constructed, containing 300,000 sample pairs aligned at the level of individual musical elements. An encoder-decoder model was then designed to convert score images directly into braille symbol sequences by jointly optimizing visual feature extraction, musical semantic understanding, and sequence generation. A data augmentation strategy tailored to the characteristics of music scores was also introduced to improve the model’s generalization to diverse layouts and image noise. Experimental results show that the model significantly outperforms cascaded baseline methods. The generated braille is highly accurate in key musical semantics such as pitch and duration, and the model remains robust across different layouts and noisy conditions. The approach thus provides a new technical solution for efficient and accurate automated production of braille music.
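
The abstract does not say which augmentations are used, so the following is only an illustrative sketch of what layout and noise augmentation for score images might look like. It assumes a grayscale image stored as nested lists (0 for ink, 255 for paper) and uses two hypothetical transforms, a horizontal shift and salt-and-pepper noise. The function names and default parameters are assumptions made for this example and are not taken from the paper.

```python
import random
from typing import List

# Assumed representation: rows of grayscale pixels, 0 = black ink, 255 = white paper.
Image = List[List[int]]


def shift_horizontally(image: Image, offset: int, fill: int = 255) -> Image:
    """Shift every row right (offset > 0) or left (offset < 0), padding with `fill`."""
    width = len(image[0])
    offset = max(-width, min(width, offset))  # clamp so the row width is preserved
    shifted = []
    for row in image:
        if offset >= 0:
            shifted.append([fill] * offset + row[: width - offset])
        else:
            shifted.append(row[-offset:] + [fill] * (-offset))
    return shifted


def add_salt_and_pepper(image: Image, prob: float, rng: random.Random) -> Image:
    """Replace each pixel with pure black or pure white with probability `prob`."""
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < prob:
                new_row.append(rng.choice((0, 255)))
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


def augment(
    image: Image,
    rng: random.Random,
    max_shift: int = 2,       # hypothetical default, not from the paper
    noise_prob: float = 0.05,  # hypothetical default, not from the paper
) -> Image:
    """Apply a random layout shift followed by pixel noise (illustrative only)."""
    offset = rng.randint(-max_shift, max_shift)
    return add_salt_and_pepper(shift_horizontally(image, offset), noise_prob, rng)
```

Passing a seeded `random.Random` instance keeps each augmented sample reproducible, which can help when comparing robustness across runs. In a training pipeline, the braille target sequence would stay unchanged while only the input image is perturbed.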