What Makes Multimodal In-Context Learning Work? — arXiv2