Learning Representations from Audio-Visual Spatial Alignment — arXiv2