Enhancing Video Transformers for Action Understanding with VLM-aided Training — arXiv2