Proceedings of the First Workshop on Weakly Supervised Learning (WeaSuL)

/ Authors

Michael A. Hedderich, Benjamin Roth, Katharina Kann, Barbara Plank, Alexander J. Ratner, D. Klakow

/ Abstract

The paradigm of pre-training followed by finetuning has become a standard procedure for NLP tasks, but it suffers from a known problem: domain shift between the pre-training and downstream corpora. Previous work has tried to mitigate this problem with additional pre-training, either on the downstream corpus itself when it is large enough, or on a manually curated unlabeled corpus of a similar domain. In this paper, we address the case in which the downstream corpus is too small for additional pre-training. We propose TADPOLE, a task-adapted pre-training framework based on data selection techniques adapted from Domain Adaptation. We formulate data selection as an anomaly detection problem that, unlike existing methods, works well when the downstream corpus is limited in size. The result is a scalable and efficient unsupervised technique that eliminates the need for any manual data curation. We evaluate our framework on eight tasks across four different domains (biomedical, computer science, news, and movie reviews) and compare its performance against competitive baseline techniques from the area of Domain Adaptation. Our framework outperforms all the baseline methods. On large datasets we get an average gain of 0.3% in performance, but on small datasets with fewer than 5K training examples we get a much higher gain of 1.8%. This shows the efficacy of domain-adapted finetuning when the task dataset is small.
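
The idea of treating data selection as anomaly detection can be illustrated with a minimal sketch. The abstract does not say which detector or features the framework uses, so everything below is an assumption made for illustration: a smoothed unigram surprisal score stands in for the anomaly detector, and the names `fit_task_profile`, `anomaly_score`, and `select_in_domain` are hypothetical. The small task corpus defines what counts as "normal", and sentences from a larger unlabeled pool are kept only if they do not look anomalous relative to it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_task_profile(task_sentences):
    """Count in how many task sentences each token appears.

    Assumption: this simple profile stands in for a real anomaly detector.
    """
    doc_freq = Counter()
    for sentence in task_sentences:
        doc_freq.update(set(tokenize(sentence)))
    return doc_freq, len(task_sentences)


def anomaly_score(sentence, doc_freq, n_docs):
    """Mean token surprisal under the task profile; higher means more anomalous."""
    tokens = tokenize(sentence)
    if not tokens:
        return float("inf")  # nothing to judge, so treat as anomalous
    total = 0.0
    for tok in tokens:
        # Add-one smoothing so unseen tokens get a finite, high surprisal.
        prob = (doc_freq[tok] + 1) / (n_docs + 2)
        total += -math.log(prob)
    return total / len(tokens)


def select_in_domain(candidates, task_sentences, keep):
    """Return the `keep` candidates that look least anomalous to the task corpus."""
    doc_freq, n_docs = fit_task_profile(task_sentences)
    ranked = sorted(candidates, key=lambda s: anomaly_score(s, doc_freq, n_docs))
    return ranked[:keep]


# Toy data (made up for this sketch): a tiny biomedical task corpus
# and a mixed-domain pool of unlabeled sentences.
task = [
    "the protein binds the receptor in the cell",
    "gene expression in the tumor cell was measured",
    "the receptor gene regulates protein expression",
]
pool = [
    "the stock market fell sharply on monday",
    "protein expression in the cell",
    "the film was a delightful comedy",
    "the gene regulates the receptor",
]
selected = select_in_domain(pool, task, keep=2)
print(selected)
```

In this toy setting the two biomedical sentences are selected and the finance and film sentences are discarded. The selected subset would then serve as the corpus for additional pre-training before finetuning on the task itself; a realistic setup would likely replace the unigram score with a stronger detector over learned sentence representations.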

Journal: arXiv