Self-Compositional Data Augmentation for Scientific Keyphrase Generation

/ Authors

Maël Houbre, Florian Boudin, B. Daille, Akiko Aizawa

/ Abstract

The performance of state-of-the-art keyphrase generation models improves with the size of the training dataset, but obtaining large amounts of keyphrase-labeled documents can be challenging and costly. Data augmentation methods make it possible to increase the training set size without additional cost. However, these techniques most often rely on external data or additional resources that can be as difficult to obtain as new annotated data. To tackle this issue, we present a self-compositional data augmentation method that creates additional training samples while preserving domain coherence, without relying on any external data or resources. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. Our results on multiple datasets spanning three different domains demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the generated keyphrases for the Computer Science domain confirms this improvement with respect to their representativeness.
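
The idea of pairing training documents by shared keyphrases and merging them can be illustrated with a minimal sketch. The abstract does not specify the relatedness measure or the combination rule, so everything below is an assumption made for illustration: relatedness is taken to be the Jaccard overlap of the two keyphrase sets, a pair is merged when the overlap reaches a hypothetical `threshold`, and merging simply concatenates the texts and takes the ordered union of the keyphrases. The names `keyphrase_overlap` and `augment` are hypothetical as well.

```python
from itertools import combinations


def keyphrase_overlap(a, b):
    """Jaccard overlap of two keyphrase lists (an assumed relatedness measure)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def augment(docs, threshold=0.3):
    """Build synthetic samples from pairs of related training documents.

    Assumption: a pair is "related" when its keyphrase overlap reaches
    `threshold`; the synthetic sample concatenates both texts and keeps
    the union of keyphrases in first-seen order.
    """
    synthetic = []
    for d1, d2 in combinations(docs, 2):
        if keyphrase_overlap(d1["keyphrases"], d2["keyphrases"]) >= threshold:
            # dict.fromkeys removes duplicates while preserving order
            merged = list(dict.fromkeys(d1["keyphrases"] + d2["keyphrases"]))
            synthetic.append(
                {"text": d1["text"] + " " + d2["text"], "keyphrases": merged}
            )
    return synthetic


# Toy training set (invented for illustration, not taken from the paper)
docs = [
    {"text": "Neural keyphrase generation with seq2seq.",
     "keyphrases": ["keyphrase generation", "seq2seq"]},
    {"text": "Data augmentation for keyphrase generation.",
     "keyphrases": ["keyphrase generation", "data augmentation"]},
    {"text": "Protein folding with molecular dynamics.",
     "keyphrases": ["protein folding", "molecular dynamics"]},
]

samples = augment(docs)
for sample in samples:
    print(sample["keyphrases"])
```

In this toy set only the first two documents share a keyphrase, so only that pair is merged. The third document, from an unrelated domain, is left out, which is the sense in which such pairing keeps synthetic samples within a single domain without any external resource.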

Journal: Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries

DOI: 10.1145/3677389.3702504