Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation — arXiv2