Liang He, Shizhuo Zhang, Lijun Wu, Huanhuan Xia, Fusong Ju, He Zhang, Siyuan Liu, Yingce Xia, Jianwei Zhu, Pan Deng, Bin Shao, Tao Qin, Tie-Yan Liu
Understanding protein sequences is vital and urgent for biology, healthcare, and medicine. Labeling approaches are expensive and time-consuming, while the amount of unlabeled data is growing much faster than that of labeled data thanks to low-cost, high-throughput sequencing methods. To extract knowledge from these unlabeled data, representation learning is of significant value for protein-related tasks and has great potential for helping us learn more about protein functions and structures. The key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences. Instead of leveraging multiple sequence alignment (MSA), as is usually done, we propose a novel method to capture this information directly by pre-training a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM). In a conventional masked language model (MLM), the masked tokens are modeled by conditioning on the unmasked tokens only, but are processed independently of each other. In contrast, our proposed PMLM takes the dependency among masked tokens into consideration, i.e., the probability of a token pair is not equal to the product of the probabilities of the two tokens. By applying this model, the pre-trained encoder is able to generate a better representation for protein sequences. Our results show that the proposed method can effectively capture inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the MLM baseline under the same setting. The proposed model also significantly outperforms the MSA baseline by more than 7% on the TAPE contact prediction benchmark when pre-trained on a subset of the sequence database from which the MSA is generated, revealing the potential of the sequence pre-training method to surpass MSA-based methods in general.
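The statement that a pair probability differs from the product of the two single-token probabilities can be illustrated with a toy calculation. The sketch below is not the authors' model; the two-position joint distribution and its values are made up purely to show what an independence assumption discards when two residues co-vary.

```python
from collections import defaultdict

# Hypothetical joint distribution over two masked positions (i, j).
# The residues co-vary perfectly: "A" pairs with "D", "K" pairs with "E".
joint = {("A", "D"): 0.5, ("K", "E"): 0.5}


def marginals(joint):
    """Per-position distributions obtained by summing out the other position."""
    p_i, p_j = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_i[a] += p
        p_j[b] += p
    return dict(p_i), dict(p_j)


def independent_approximation(joint):
    """Pair distribution implied by treating the two positions independently."""
    p_i, p_j = marginals(joint)
    return {(a, b): pa * pb for a, pa in p_i.items() for b, pb in p_j.items()}


product = independent_approximation(joint)
print(joint.get(("A", "E"), 0.0))  # 0.0: this pair never occurs in the joint
print(product[("A", "E")])         # 0.25: the product of marginals allows it
```

The marginals are identical in both cases, so only a model of the pair itself can tell that ("A", "E") is implausible.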
Guoao Yang, Jianhui Zhou, Tao Qin
The multipole moments are fundamental properties of insulators, and have attracted much attention with the emergence of higher-order topological insulators. Several approaches, including generalizations of the formula for the polarization and of the Wilson loop, have been proposed to calculate them in real materials. However, a practical method to explore them in correlated insulators is still lacking. Here, we propose a systematic way, which combines the general Green's function formula for multipoles with the real-space dynamical mean-field theory, to calculate the multipole moments in correlated materials. Our demonstration calculations are consistent with symmetry analysis, and the calculations of the spectral functions further confirm our results. This method opens a new avenue to study the topological phase transitions in correlated multipole insulators and other crucial physical quantities closely related to multipole moments.
Qin Tao, Caijun Zhong, Hai Lin, Zhaoyang Zhang
Ambient backscatter communication is a newly emerged paradigm, which utilizes the ambient radio frequency (RF) signal as the carrier to reduce the system battery requirement, and is regarded as a promising solution for enabling large-scale deployment of future Internet of Things (IoT) networks. The key issue in ambient backscatter communication systems is how to perform reliable detection. In this paper, we propose novel encoding methods at the information tag, and devise the corresponding symbol detection methods at the reader. In particular, Manchester coding and differential Manchester coding are adopted at the information tag, and the corresponding semi-coherent Manchester (SeCoMC) and non-coherent Manchester (NoCoMC) detectors are developed. In addition, analytical bit error rate (BER) expressions are characterized for both detectors assuming either a complex Gaussian or an unknown deterministic ambient signal. Simulation results show that the BER performance under an unknown deterministic ambient signal is better, and that the SeCoMC detector outperforms the NoCoMC detector. Finally, compared with prior detectors for ambient backscatter communications, the proposed detectors achieve superior BER performance with lower communication delay.
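As a reminder of what the two line codes used at the tag look like, here is a minimal sketch of Manchester and differential Manchester encoding and decoding on noiseless binary levels. The bit-to-level conventions are assumptions (several conventions exist, and the abstract does not state which one is used), and the sketch does not model the SeCoMC or NoCoMC detectors, the channel, or the ambient signal.

```python
def manchester_encode(bits):
    """Assumed convention: 1 -> (1, 0), 0 -> (0, 1), so every bit has a mid-bit transition."""
    chips = []
    for b in bits:
        chips.extend((1, 0) if b else (0, 1))
    return chips


def manchester_decode(chips):
    """Compare the two half-bit levels of each symbol."""
    return [1 if chips[i] > chips[i + 1] else 0 for i in range(0, len(chips), 2)]


def diff_manchester_encode(bits, last_level=0):
    """Assumed convention: a 0 adds a transition at the start of the bit, a 1 does not.
    A mid-bit transition is always present."""
    chips, level = [], last_level
    for b in bits:
        if b == 0:
            level ^= 1          # transition at the bit boundary encodes a 0
        chips.append(level)
        level ^= 1              # mandatory mid-bit transition
        chips.append(level)
    return chips


def diff_manchester_decode(chips, last_level=0):
    """A bit is 0 if the level changed across the bit boundary, else 1."""
    bits, prev = [], last_level
    for i in range(0, len(chips), 2):
        bits.append(0 if chips[i] != prev else 1)
        prev = chips[i + 1]
    return bits


bits = [1, 0, 1, 1, 0, 0]
print(manchester_encode(bits))       # [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
print(diff_manchester_encode(bits))  # [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1]
```

In the differential code the information sits in level changes, so inverting every level (including the reference level) leaves the decoded bits unchanged.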
Yu-Liang Tao, Tao Qin, Yong Xu
Topological heavy-fermion systems in three dimensions are usually classified as topological insulators or semimetals. Here, we theoretically predict a different type of heavy-fermion system (dubbed exceptional heavy-fermion semimetal) by studying a three-dimensional periodic Anderson model consisting of strongly correlated localized $f$ electrons and itinerant conduction $c$ electrons in a zincblende lattice. Due to the breaking of inversion symmetry, the quasiparticle lifetimes at different sublattices are distinct, leading to the emergence of Weyl exceptional rings in the complex pole of the Green's function at finite temperatures; such rings lead to the appearance of bounded Fermi surfaces (bulk Fermi disks). As temperatures rise, two pairs of Weyl exceptional rings merge into two exceptional rings with one bounded bulk Fermi surface (bulk Fermi tube), which are experimentally measurable by angle-resolved photoemission spectroscopy. Finally, we use the dynamical mean field theory to calculate the spectral functions which illustrate the emergence of bulk Fermi tubes. Our work thus opens the door for studying exceptional heavy-fermion semimetal phases in three dimensions.
Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, Tie-Yan Liu
Thompson sampling is one of the earliest randomized algorithms for multi-armed bandits (MAB). In this paper, we extend Thompson sampling to Budgeted MAB, where there is a random cost for pulling an arm and the total cost is constrained by a budget. We start with the case of Bernoulli bandits, in which the random rewards (costs) of an arm are independently sampled from a Bernoulli distribution. To implement the Thompson sampling algorithm in this case, at each round, we sample two numbers from the posterior distributions of the reward and cost for each arm, obtain their ratio, select the arm with the maximum ratio, and then update the posterior distributions. We prove that the distribution-dependent regret bound of this algorithm is $O(\ln B)$, where $B$ denotes the budget. By introducing a Bernoulli trial, we further extend this algorithm to the setting in which the rewards (costs) are drawn from general distributions, and prove that its regret bound remains almost the same. Our simulation results demonstrate the effectiveness of the proposed algorithm.
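The per-round procedure for the Bernoulli case can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Beta(1, 1) priors, the function name, the simulated environment, and the `max_rounds` safety cap are all hypothetical choices made for the example.

```python
import random


def budgeted_thompson_sampling(reward_probs, cost_probs, budget, seed=0, max_rounds=100_000):
    """Simulate budgeted Thompson sampling on Bernoulli arms.

    reward_probs / cost_probs are the true (hidden) Bernoulli means used only
    to simulate the environment. Returns (pulls per arm, total reward, total cost).
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    # Assumed Beta(1, 1) priors, stored as [alpha, beta] per arm.
    reward_post = [[1, 1] for _ in range(n_arms)]
    cost_post = [[1, 1] for _ in range(n_arms)]
    pulls = [0] * n_arms
    total_reward, spent = 0, 0

    for _ in range(max_rounds):          # safety cap, not part of the algorithm
        if spent >= budget:
            break
        # Sample a reward mean and a cost mean for each arm and form their ratio.
        ratios = []
        for i in range(n_arms):
            theta_r = rng.betavariate(*reward_post[i])
            theta_c = rng.betavariate(*cost_post[i])
            ratios.append(theta_r / max(theta_c, 1e-12))
        arm = max(range(n_arms), key=lambda i: ratios[i])

        # Pull the arm: observe a Bernoulli reward and a Bernoulli cost.
        reward = int(rng.random() < reward_probs[arm])
        cost = int(rng.random() < cost_probs[arm])

        # Conjugate Beta-Bernoulli posterior updates.
        reward_post[arm][0] += reward
        reward_post[arm][1] += 1 - reward
        cost_post[arm][0] += cost
        cost_post[arm][1] += 1 - cost

        pulls[arm] += 1
        total_reward += reward
        spent += cost
    return pulls, total_reward, spent


pulls, total_reward, spent = budgeted_thompson_sampling([0.9, 0.2], [0.5, 0.5], budget=200)
print(pulls, total_reward, spent)
```

Because costs are 0 or 1 here, the spent budget never overshoots; with general cost distributions the stopping rule would need more care.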
Xiang Li, Tao Qin, Jian Yang, Tie-Yan Liu
Recurrent neural networks (RNNs) have achieved state-of-the-art performance in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model becomes very big (e.g., possibly beyond the memory capacity of a GPU device) and its training becomes very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column of which is associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, far fewer than the $|V|$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrificing accuracy (it achieves similar, if not better, perplexity compared to state-of-the-art language models). Remarkably, on the One-Billion-Word benchmark dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. We name our proposed algorithm \emph{LightRNN} to reflect its very small model size and very high training speed.
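The 2-Component shared embedding can be sketched as a simple table lookup. The code below is an illustration only: the function names are hypothetical, words are placed in the table in vocabulary order (an arbitrary allocation chosen for the example), and the vectors are randomly initialized rather than trained.

```python
import math
import random


def allocate_words(vocab):
    """Place each word in a square table with ceil(sqrt(|V|)) rows and columns.

    Returns a dict word -> (row, col) and the table side length.
    Assumes a non-empty vocabulary.
    """
    side = math.isqrt(len(vocab) - 1) + 1   # ceil(sqrt(|V|)) for |V| >= 1
    positions = {word: divmod(idx, side) for idx, word in enumerate(vocab)}
    return positions, side


def init_vectors(side, dim, seed=0):
    """One vector per row and one per column: 2 * side vectors in total."""
    rng = random.Random(seed)

    def make():
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(side)]

    return make(), make()


def embed(word, positions, row_vecs, col_vecs):
    """A word is represented jointly by its row vector and its column vector."""
    row, col = positions[word]
    return row_vecs[row], col_vecs[col]


vocab = [f"w{i}" for i in range(10_000)]
positions, side = allocate_words(vocab)
row_vecs, col_vecs = init_vectors(side, dim=8)
print(side, len(row_vecs) + len(col_vecs))  # 100 200
```

For 10,000 words this uses 200 vectors instead of 10,000, which matches the $2 \sqrt{|V|}$ count in the abstract.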
Tao Qin, Alexander Schnell, Klaus Sengstock, Christof Weitenberg, André Eckardt, Walter Hofstetter
We analyze strong correlation effects and topological properties of interacting fermions with a Falicov-Kimball type interaction in circularly shaken hexagonal optical lattices, which can be effectively described by the Haldane-Falicov-Kimball model, using the real-space Floquet dynamical mean-field theory (DMFT). The Haldane model, a paradigmatic model of the Chern insulator, is experimentally relevant, because it has been realized using circularly shaken hexagonal optical lattices. We show that in the presence of staggering a charge density wave emerges, which is affected by interactions and resonant tunneling. We demonstrate that interactions smear out the edge states by introducing a finite lifetime of quasiparticles. Even though a general method for calculating the topological invariant of a nonequilibrium steady state is lacking, we extract the topological invariant using a Laughlin charge pump set-up. We find that the pumped charge is not an integer even for the non-interacting case at very low reservoir temperatures, and we attribute this to the dissipation into the bath connected to every lattice site, which is intrinsic to real-space Floquet DMFT methods. Furthermore, using the rate equation based on the Floquet-Born-Markov approximation, we calculate the charge pump for the non-interacting case to identify the role of the spectral properties of the bath. Starting from this approach we propose an experimental protocol for measuring quantized charge pumping.
Tao Qin, Walter Hofstetter
We present a systematic study of spectral functions of a time-periodically driven Falicov-Kimball Hamiltonian. In the high-frequency limit, this system can be effectively described as a Harper-Hofstadter-Falicov-Kimball model. Using real-space Floquet dynamical mean-field theory (DMFT), we take into account interaction effects and contributions from higher Floquet bands in a non-perturbative way. Our calculations show a high degree of similarity between the interacting driven system and its effective static counterpart with respect to spectral properties. However, as also illustrated by our results, one should bear in mind that Floquet DMFT describes a non-equilibrium steady state (NESS), while an effective static Hamiltonian describes an equilibrium state. We further demonstrate the possibility of using real-space Floquet DMFT to study edge states on a cylinder geometry.
Tao Qin
In this paper, we consider the subdivision map between two KLRW algebras of type $A^{(1)}_e$ and $A^{(1)}_{e+1}$. We show that the image of an idempotent indexed by a partition under this map is still an idempotent indexed by a partition, and give the form of this new partition. Moreover, we give an equality of some graded decomposition numbers.
Yuxuan Ren, Dihan Zheng, Chang Liu, Peiran Jin, Yu Shi, Lin Huang, Jiyan He, Shengjie Luo, Tao Qin, Tie-Yan Liu
In recent years, machine learning has demonstrated impressive capability in handling molecular science tasks. To support various molecular properties at scale, machine learning models are trained in the multi-task learning paradigm. Nevertheless, data of different molecular properties are often not aligned: some quantities, e.g. equilibrium structure, demand more cost to compute than others, e.g. energy, so their data are often generated by cheaper computational methods at the cost of lower accuracy, which cannot be directly overcome through multi-task learning. Moreover, it is not straightforward to leverage abundant data of other tasks to benefit a particular task. To handle such data heterogeneity challenges, we exploit the specialty of molecular tasks that there are physical laws connecting them, and design consistency training approaches that allow different tasks to exchange information directly so as to improve one another. Particularly, we demonstrate that the more accurate energy data can improve the accuracy of structure prediction. We also find that consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction, demonstrating a broad capability for integrating heterogeneous data.
Yingce Xia, Xu Tan, Fei Tian, Fei Gao, Weicong Chen, Yang Fan, Linyuan Gong, Yichong Leng, Renqian Luo, Yiren Wang, Lijun Wu, Jinhua Zhu, Tao Qin, Tie-Yan Liu
We, Microsoft Research Asia, made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, Tie-Yan Liu
Neural architecture search (NAS) relies on a good controller to generate better architectures or predict the accuracy of given architectures. However, training the controller requires abundant, high-quality pairs of architectures and their accuracy, while it is costly to evaluate an architecture and obtain its accuracy. In this paper, we propose SemiNAS, a semi-supervised NAS approach that leverages numerous unlabeled architectures (without evaluation and thus at nearly no cost). Specifically, SemiNAS 1) trains an initial accuracy predictor with a small set of architecture-accuracy data pairs; 2) uses the trained accuracy predictor to predict the accuracy of a large number of architectures (without evaluation); and 3) adds the generated data pairs to the original data to further improve the predictor. The trained accuracy predictor can be applied to various NAS algorithms by predicting the accuracy of candidate architectures for them. SemiNAS has two advantages: 1) It reduces the computational cost under the same accuracy guarantee. On the NASBench-101 benchmark dataset, it achieves accuracy comparable to a gradient-based method while using only 1/7 of the architecture-accuracy pairs. 2) It achieves higher accuracy under the same computational cost. It achieves 94.02% test accuracy on NASBench-101, outperforming all the baselines when using the same number of architectures. On ImageNet, it achieves a 23.5% top-1 error rate (under a 600M FLOPS constraint) using 4 GPU-days for search. We further apply it to the LJSpeech text-to-speech task, where it achieves a 97% intelligibility rate in the low-resource setting and a 15% test error rate in the robustness setting, with 9% and 7% improvements over the baseline, respectively.
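The three steps listed above follow a pseudo-labeling pattern, which can be sketched with a pluggable training routine. Everything concrete here is a hypothetical stand-in: architectures are encoded as single numbers and the accuracy predictor is a least-squares line, neither of which reflects the actual SemiNAS predictor.

```python
def fit_line(pairs):
    """Toy accuracy predictor: least-squares line through (encoding, accuracy) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def semi_supervised_predictor(fit, labeled, unlabeled):
    """Data flow of the three steps; `fit` maps a list of pairs to a predictor."""
    predictor = fit(labeled)                                 # 1) train on evaluated pairs
    pseudo = [(arch, predictor(arch)) for arch in unlabeled]  # 2) predict unevaluated ones
    return fit(labeled + pseudo)                             # 3) retrain on the union


labeled = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]   # hypothetical (encoding, accuracy) pairs
unlabeled = [3.0, 4.0]                           # architectures with no evaluation
predictor = semi_supervised_predictor(fit_line, labeled, unlabeled)
print(predictor(5.0))
```

With a linear toy predictor the pseudo-labels add no new information; the sketch only shows how the generated pairs are folded back into training.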
Jinglin Liu, Yi Ren, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu
Non-autoregressive translation (NAT) achieves faster inference speed but at the cost of worse accuracy compared with autoregressive translation (AT). Since AT and NAT can share model structure and AT is an easier task than NAT due to the explicit dependency on previous target-side tokens, a natural idea is to gradually shift the model training from the easier AT task to the harder NAT task. To smooth the shift from AT training to NAT training, in this paper, we introduce semi-autoregressive translation (SAT) as intermediate tasks. SAT contains a hyperparameter k, and each k value defines a SAT task with a different degree of parallelism. Specifically, SAT covers AT and NAT as its special cases: it reduces to AT when k = 1 and to NAT when k = N (N is the length of the target sentence). We design curriculum schedules to gradually shift k from 1 to N, with different pacing functions and numbers of tasks trained at the same time. We call our method task-level curriculum learning for NAT (TCL-NAT). Experiments on IWSLT14 De-En, IWSLT16 En-De, WMT14 En-De and De-En datasets show that TCL-NAT achieves significant accuracy improvements over previous NAT baselines and reduces the performance gap between NAT and AT models to 1-2 BLEU points, demonstrating the effectiveness of our proposed method.
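The role of k can be made concrete by listing which target positions are produced together at each decoding step, assuming the usual reading of SAT in which k consecutive tokens are generated in parallel. The pacing function below is a hypothetical linear schedule over a fixed list of k values; the abstract does not specify the pacing functions actually used.

```python
def sat_decoding_groups(n, k):
    """Target positions generated in parallel at each step for a length-n target.

    k = 1 gives one token per step (AT); k = n gives a single step (NAT).
    """
    return [list(range(start, min(start + k, n))) for start in range(0, n, k)]


def linear_pacing(step, total_steps, ks):
    """Hypothetical schedule: move through the k values in equal-length phases."""
    index = min(step * len(ks) // total_steps, len(ks) - 1)
    return ks[index]


print(sat_decoding_groups(6, 1))  # [[0], [1], [2], [3], [4], [5]]
print(sat_decoding_groups(6, 2))  # [[0, 1], [2, 3], [4, 5]]
print(sat_decoding_groups(6, 6))  # [[0, 1, 2, 3, 4, 5]]
print([linear_pacing(s, 100, [1, 2, 4, 8]) for s in (0, 50, 99)])  # [1, 4, 8]
```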
Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, Tie-Yan Liu
Image-to-image translation tasks have been widely investigated with Generative Adversarial Networks (GANs) and dual learning. However, existing models lack the ability to control the translated results in the target domain, and their results usually lack diversity in the sense that a fixed image usually leads to an (almost) deterministic translation result. In this paper, we study a new problem, conditional image-to-image translation, which is to translate an image from the source domain to the target domain conditioned on a given image in the target domain. It requires that the generated image inherit some domain-specific features of the conditional image from the target domain. Therefore, changing the conditional image in the target domain will lead to diverse translation results for a fixed input image from the source domain, and the conditional input image thus helps to control the translation results. We tackle this problem with unpaired data based on GANs and dual learning. We twist two conditional translation models (one translating from domain A to domain B, and the other from domain B to domain A) together for input combination and reconstruction while preserving domain-independent features. We carry out experiments on translation between men's faces and women's faces and on edges-to-shoes&bags translation. The results demonstrate the effectiveness of our proposed method.
Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, Tie-Yan Liu
Teaching plays a very important role in our society, by spreading human knowledge and educating our next generations. A good teacher will select appropriate teaching materials, impart suitable methodologies, and set up targeted examinations, according to the learning behaviors of the students. In the field of artificial intelligence, however, the role of teaching has not been fully explored, and most attention is paid to machine \emph{learning}. In this paper, we argue that equal attention, if not more, should be paid to teaching, and furthermore, an optimization framework (instead of heuristics) should be used to obtain good teaching strategies. We call this approach `learning to teach'. In the approach, two intelligent agents interact with each other: a student model (which corresponds to the learner in traditional machine learning algorithms), and a teacher model (which determines the appropriate data, loss function, and hypothesis space to facilitate the training of the student model). The teacher model leverages the feedback from the student model to optimize its own teaching strategies by means of reinforcement learning, so as to achieve teacher-student co-evolution. To demonstrate the practical value of our proposed approach, we take the training of deep neural networks (DNN) as an example, and show that by using the learning to teach techniques, we are able to use much less training data and fewer iterations to achieve almost the same accuracy for different kinds of DNN models (e.g., multi-layer perceptron, convolutional neural networks and recurrent neural networks) under various machine learning tasks (e.g., image classification and text understanding).
Jaromir Panas, Michael Pasek, Arya Dhar, Tao Qin, Andreas Geißler, Mohsen Hafez-Torbati, Max E. Sorantin, Irakli Titvinidze, Walter Hofstetter
In this work we investigate the effect of local dissipation on the presence of density-wave ordering in spinful fermions with both local and nearest-neighbor interactions as described by the extended Hubbard model. We find density-wave order to be robust against decoherence effects up to a critical point where the system becomes homogeneous with no spatial ordering. Our results will be relevant for future cold-atom experiments using fermions with non-local interactions arising from the dressing by highly-excited Rydberg states, which have finite lifetimes due to spontaneous emission processes.
Chang Liu, Xinwei Sun, Jindong Wang, Haoyue Tang, Tao Li, Tao Qin, Wei Chen, Tie-Yan Liu
Conventional supervised learning methods, especially deep ones, are found to be sensitive to out-of-distribution (OOD) examples, largely because the learned representation mixes the semantic factor with the variation factor due to their domain-specific correlation, while only the semantic factor causes the output. To address the problem, we propose a Causal Semantic Generative model (CSG) based on causal reasoning so that the two factors are modeled separately, and develop methods for OOD prediction from a single training domain, which is common and challenging. The methods are based on the causal invariance principle, with a novel design in variational Bayes for both efficient learning and easy prediction. Theoretically, we prove that under certain conditions, CSG can identify the semantic factor by fitting training data, and this semantic identification guarantees the boundedness of OOD generalization error and the success of adaptation. Empirical study shows improved OOD performance over prevailing baselines.
Jinhua Zhu, Yingce Xia, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu
Jun 17, 2021 · q-bio.QM
Inspired by its success in natural language processing and computer vision, pre-training has attracted substantial attention in cheminformatics and bioinformatics, especially for molecule-based tasks. A molecule can be represented by either a graph (where atoms are connected by bonds) or a SMILES sequence (where depth-first search is applied to the molecular graph with specific rules). Existing works on molecule pre-training use either graph representations only or SMILES representations only. In this work, we propose to leverage both representations and design a new pre-training algorithm, dual-view molecule pre-training (briefly, DMP), that can effectively combine the strengths of both types of molecule representations. The model of DMP consists of two branches: a Transformer branch that takes the SMILES sequence of a molecule as input, and a GNN branch that takes a molecular graph as input. The training of DMP contains three tasks: (1) predicting masked tokens in a SMILES sequence by the Transformer branch, (2) predicting masked atoms in a molecular graph by the GNN branch, and (3) maximizing the consistency between the two high-level representations output by the Transformer and GNN branches separately. After pre-training, we can use either the Transformer branch (this one is recommended according to empirical results), the GNN branch, or both for downstream tasks. DMP is tested on nine molecular property prediction tasks and achieves state-of-the-art performance on seven of them. Furthermore, we test DMP on three retrosynthesis tasks and achieve state-of-the-art results on them.
Chen Zhang, Jiaxing Yu, LuChin Chang, Xu Tan, Jiawei Chen, Tao Qin, Kejun Zhang
Automatic lyrics transcription (ALT), which can be regarded as automatic speech recognition (ASR) on singing voice, is an interesting and practical topic in academia and industry. ALT has not been well developed mainly due to the dearth of paired singing voice and lyrics datasets for model training. Considering that there is a large amount of ASR training data, a straightforward method is to leverage ASR data to enhance ALT training. However, the improvement is marginal when training the ALT system directly with ASR data, because of the gap between the singing voice and standard speech data which is rooted in music-specific acoustic characteristics in singing voice. In this paper, we propose PDAugment, a data augmentation method that adjusts pitch and duration of speech at syllable level under the guidance of music scores to help ALT training. Specifically, we adjust the pitch and duration of each syllable in natural speech to those of the corresponding note extracted from music scores, so as to narrow the gap between natural speech and singing voice. Experiments on DSing30 and Dali corpus show that the ALT system equipped with our PDAugment outperforms previous state-of-the-art systems by 5.9% and 18.1% WERs respectively, demonstrating the effectiveness of PDAugment for ALT.
Linghui Meng, Jin Xu, Xu Tan, Jindong Wang, Tao Qin, Bo Xu
In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of two different speech features (e.g., mel-spectrograms or MFCC) as the input, and recognizing both text sequences, where the two recognition losses use the same combination weight. We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on several low-resource datasets including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms a strong data augmentation method SpecAugment on these recognition tasks. Specifically, MixSpeech outperforms SpecAugment with a relative PER improvement of 10.6$\%$ on TIMIT dataset, and achieves a strong WER of 4.7$\%$ on WSJ dataset.
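The training objective described above (mix two inputs, recognize both transcripts, weight the two losses by the same coefficient) can be sketched generically. This is not the authors' code: `model` and `loss_fn` are placeholders supplied by the caller, the mixing weight `lam` is passed in explicitly because the abstract does not say how it is chosen, and the two feature sequences are assumed to be already padded to the same shape.

```python
def mix_features(x1, x2, lam):
    """Frame-wise weighted combination of two equally shaped feature sequences."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(frame1, frame2)]
        for frame1, frame2 in zip(x1, x2)
    ]


def mixspeech_loss(model, loss_fn, x1, y1, x2, y2, lam):
    """Recognition loss on the mixed input against both transcripts,
    combined with the same weight that was used to mix the inputs."""
    output = model(mix_features(x1, x2, lam))
    return lam * loss_fn(output, y1) + (1.0 - lam) * loss_fn(output, y2)


# Toy stand-ins so the sketch runs end to end (hypothetical, not an ASR model).
toy_model = lambda feats: sum(sum(frame) for frame in feats)
toy_loss = lambda output, target: (output - target) ** 2

x1, x2 = [[1.0, 2.0]], [[3.0, 4.0]]
print(mix_features(x1, x2, 0.25))                                 # [[2.5, 3.5]]
print(mixspeech_loss(toy_model, toy_loss, x1, 5.0, x2, 7.0, 0.25))  # 1.0
```

In a real system `model` would be an ASR network such as LAS or Transformer and `loss_fn` its sequence-level recognition loss.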