Optimizing Distributed Tensor Contractions using Node-Aware Processor Grids — arXiv2