2412.15210 — arXiv2

Tokenisation is NP-Complete

cs.DS

/ Published

Dec 19, 2024

/ Categories

cs.DS cs.CL cs.FL

/ Authors

Philip Whittington, Gregor Bachmann, Tiago Pimentel

/ Abstract

In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
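
To make the bottom-up variant concrete, the following is a minimal Python sketch of how a proposed merge sequence could be checked against the symbol budget. It is an illustration under stated assumptions, not the paper's formal definition: the names `apply_merge`, `compressed_length`, and `is_valid_certificate` are hypothetical, the left-to-right non-overlapping merge rule is one common convention, and the merge budget `max_merges` is assumed here because the abstract does not state how the number of merges is bounded.

```python
from typing import List, Sequence, Tuple

Merge = Tuple[str, str]  # a pair of adjacent tokens to fuse into one


def apply_merge(tokens: List[str], merge: Merge) -> List[str]:
    """Fuse each left-to-right, non-overlapping occurrence of the pair."""
    left, right = merge
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def compressed_length(dataset: Sequence[str], merges: Sequence[Merge]) -> int:
    """Total number of symbols after applying the merges, in order, to every string."""
    total = 0
    for text in dataset:
        tokens = list(text)  # start from individual characters
        for merge in merges:
            tokens = apply_merge(tokens, merge)
        total += len(tokens)
    return total


def is_valid_certificate(
    dataset: Sequence[str],
    merges: Sequence[Merge],
    max_merges: int,  # assumed budget on the number of merges (hypothetical)
    delta: int,       # target: at most delta symbols after compression
) -> bool:
    """Check a candidate merge sequence against both budgets."""
    return len(merges) <= max_merges and compressed_length(dataset, merges) <= delta


dataset = ["aab", "ab"]
print(compressed_length(dataset, [("a", "b")]))               # 3
print(compressed_length(dataset, [("a", "b"), ("a", "ab")]))  # 2
```

Checking a given merge sequence in this way takes time polynomial in the input size. The difficulty the abstract refers to lies in finding a sequence that meets the budget, not in checking one.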