Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification — arXiv2