Towards Event-oriented Long Video Understanding — arXiv2