Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising — arXiv2