Deep Performer is a novel three-stage system for score-to-audio music performance synthesis. It is based on a transformer encoder-decoder architecture commonly used in text-to-speech synthesis. To handle polyphonic music inputs, we propose a new polyphonic mixer for aligning the encoder and decoder. Moreover, we propose a new note-wise positional encoding that provides fine-grained conditioning, so that the model can learn to behave differently at the beginning, middle, and end of a note.
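To illustrate the idea behind a note-wise positional encoding, here is a minimal sketch. It is not the formulation used in the paper, which this README does not spell out. It assumes each note spans a known number of output frames and assigns every frame a relative position between 0 (note onset) and 1 (note end). The function name `note_wise_positions` and the linear 0-to-1 scheme are hypothetical choices made for illustration.

```python
def note_wise_positions(durations):
    """Return, for every frame, its relative position within its note.

    `durations` lists the number of frames per note (an assumed input
    format). The first frame of a note maps to 0.0 and the last to 1.0;
    a single-frame note maps to 0.0.
    """
    positions = []
    for n_frames in durations:
        if n_frames <= 0:
            continue  # skip empty notes
        denom = max(n_frames - 1, 1)  # avoid division by zero
        positions.extend(i / denom for i in range(n_frames))
    return positions


# Two notes lasting 3 and 2 frames.
print(note_wise_positions([3, 2]))  # prints [0.0, 0.5, 1.0, 0.0, 1.0]
```

A per-frame signal of this kind restarts at every note, which is what would let a decoder tell the onset of a note from its sustain or release.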
We used two datasets to train our proposed system: the Bach Violin Dataset for the violin and the MAESTRO Dataset for the piano.
The Bach Violin Dataset is a collection of high-quality public recordings of Bach’s sonatas and partitas for solo violin (BWV 1001–1006). The dataset consists of 6.5 hours of professional recordings from 17 violinists recorded in various recording setups. It also provides the reference scores and estimated alignments between the recordings and scores. The dataset and the source code for the alignment process can be found here.
Audio samples synthesized by our proposed model can be found in the samples directory and on our project homepage.
Deep Performer: Score-to-Audio Music Performance Synthesis
Hao-Wen Dong, Cong Zhou, Taylor Berg-Kirkpatrick, and Julian McAuley
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
[homepage] [paper] [reviews]