Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding — arXiv2