OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning — arXiv2