Probing Contextual Language Models for Common Ground with Visual Representations — arXiv2