A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
/ Authors
/ Abstract
With the rapid development of artificial intelligence, music generation has evolved from single-modal to cross-modal approaches and is gradually moving toward multi-modal fusion. This survey systematically reviews that trajectory. It begins with representation methods for the key modalities: audio, symbolic, text, and visual data. It then organizes music generation techniques across single-modal, cross-modal, and multi-modal settings and compiles the key datasets and evaluation methodologies relevant to these tasks. Finally, it discusses core challenges in the field, including modality fusion, data scarcity, and evaluation frameworks, and outlines potential directions for future research.
Journal: ACM Computing Surveys
DOI: 10.1145/3800682