End-to-End Music Braille Transcription from Sheet Music Images

Chai, Lihua; Fu, Yue; Yu, Zhi; Huang, Tianyuan; Zhu, Zepeng; He, Jiaxian

doi:10.65102/is20261078

Research article

Ingegneria Sismica

Volume 43 Issue 3
Pages: 1-21

Author(s): Lihua Chai¹, Yue Fu¹, Zhi Yu², Tianyuan Huang³, Zepeng Zhu², Jiaxian He²

¹Department of Special Education, Zhejiang College of Special Education, Hangzhou, China

²School of Software Technology, Zhejiang University, Hangzhou, China

³College of Computer Science and Technology, Zhejiang University, Hangzhou, China

Published: 10/06/2026

Chai, Lihua, et al. “End-to-End Music Braille Transcription from Sheet Music Images.” Ingegneria Sismica, Volume 43, Issue 3: 1-21, doi:10.65102/is20261078.

https://doi.org/10.65102/is20261078

Abstract

This paper proposes an end-to-end translation paradigm to address the error accumulation and low robustness of cascaded score-to-braille translation methods. To this end, a large-scale parallel corpus of staff notation and music braille is constructed, containing 300,000 sample pairs aligned at the level of individual musical elements. An encoder-decoder model is then designed to convert score images directly into braille symbol sequences by jointly optimizing visual feature extraction, musical semantic understanding, and sequence generation. A data augmentation strategy tailored to the characteristics of music scores is also introduced to improve the model’s generalization to diverse layouts and image noise. Experimental results show that the model significantly outperforms cascaded baseline methods. The generated braille is highly accurate in key musical semantics such as pitch and duration, and the model remains robust across different layouts and noisy conditions. The approach thus offers a new technical solution for efficient and accurate automated production of braille music.
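The error accumulation that motivates the end-to-end approach can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that a cascaded pipeline consists of independent stages and that a symbol is transcribed correctly only if every stage handles it correctly. The stage names and accuracy values are hypothetical and are not taken from the paper.

```python
from math import prod


def cascaded_accuracy(stage_accuracies):
    """Per-symbol accuracy of a cascade, assuming independent stage errors.

    A symbol survives the pipeline only if every stage is correct,
    so the overall accuracy is the product of the stage accuracies.
    """
    accuracies = list(stage_accuracies)
    for acc in accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError("each accuracy must lie in [0, 1]")
    return prod(accuracies)


# Hypothetical three-stage cascade (values are illustrative only):
# optical music recognition -> symbolic score parsing -> braille encoding.
stages = {"omr": 0.95, "parsing": 0.97, "braille_encoding": 0.98}
overall = cascaded_accuracy(stages.values())
print(f"cascade accuracy: {overall:.4f}")
```

Under these assumed values the cascade is correct on roughly 90% of symbols, lower than any single stage, which is the effect a jointly optimized model is meant to avoid.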

Keywords
End-to-End Music Braille Transcription; Optical Music Recognition (OMR); Staff-to-Braille Parallel Corpus; Encoder-Decoder Model; Hybrid Vision Transformer (Hybrid ViT); Data Augmentation; Error Propagation; Music Accessibility