Jonas Ellert, Johannes Fischer, Nodari Sitchinava
When lexicographically sorting strings, it is not always necessary to inspect all symbols. For example, the lexicographical rank of "europar" amongst the strings "eureka", "eurasia", and "excells" only depends on its so called relevant prefix "euro". The distinguishing prefix size $D$ of a set of strings is the number of symbols that actually need to be inspected to establish the lexicographical ordering of all strings. Efficient string sorters should be $D$-aware, i.e. their complexity should depend on $D$ rather than on the total number $N$ of all symbols in all strings. While there are many $D$-aware sorters in the sequential setting, there appear to be no such results in the PRAM model. We propose a framework yielding a $D$-aware modification of any existing PRAM string sorter. The derived algorithms are work-optimal with respect to their original counterpart: If the original algorithm requires $O(w(N))$ work, the derived one requires $O(w(D))$ work. The execution time increases only by a small factor that is logarithmic in the length of the longest relevant prefix. Our framework universally works for deterministic and randomized algorithms in all variations of the PRAM model, such that future improvements in ($D$-unaware) parallel string sorting will directly result in improvements in $D$-aware parallel string sorting.
Jonas Ellert, Paweł Gawrychowski, Garance Gourdel
Squares (fragments of the form $xx$, for some string $x$) are arguably the most natural type of repetition in strings. The basic algorithmic question concerning squares is to check if a given string of length $n$ is square-free, that is, does not contain a fragment of such form. Main and Lorentz [J. Algorithms 1984] designed an $\mathcal{O}(n\log n)$ time algorithm for this problem, and proved a matching lower bound assuming the so-called general alphabet, meaning that the algorithm is only allowed to check if two characters are equal. However, their lower bound also assumes that there are $Ω(n)$ distinct symbols in the string. As an open question, they asked if there is a faster algorithm if one restricts the size of the alphabet. Crochemore [Theor. Comput. Sci. 1986] designed a linear-time algorithm for constant-size alphabets, and combined with more recent results his approach in fact implies such an algorithm for linearly-sortable alphabets. Very recently, Ellert and Fischer [ICALP 2021] significantly relaxed this assumption by designing a linear-time algorithm for general ordered alphabets, that is, assuming a linear order on the characters that permits constant time order comparisons. However, the open question of Main and Lorentz from 1984 remained unresolved for general (unordered) alphabets. In this paper, we show that testing square-freeness of a length-$n$ string over general alphabet of size $σ$ can be done with $\mathcal{O}(n\log σ)$ comparisons, and cannot be done with $o(n\log σ)$ comparisons. We complement this result with an $\mathcal{O}(n\log σ)$ time algorithm in the Word RAM model. Finally, we extend the algorithm to reporting all the runs (maximal repetitions) in the same complexity.
Patrick Dinklage, Jonas Ellert, Johannes Fischer, Dominik Köppl, Manuel Penschuck
Bidirectional compression algorithms work by substituting repeated substrings by references that, unlike in the famous LZ77-scheme, can point to either direction. We present such an algorithm that is particularly suited for an external memory implementation. We evaluate it experimentally on large data sets of size up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external memory decompressor for texts compressed with any uni- or bidirectional compression scheme.
Jonas Ellert, Paweł Gawrychowski, Adam Górkiewicz, Tatiana Starikovskaya
The classical pattern matching asks for locating all occurrences of one string, called the pattern, in another, called the text, where a string is simply a sequence of characters. Due to the potential practical applications, it is desirable to seek approximate occurrences, for example by bounding the number of mismatches. This problem has been extensively studied, and by now we have a good understanding of the best possible time complexity as a function of $n$ (length of the text), $m$ (length of the pattern), and $k$ (number of mismatches). In particular, we know that for $k=\mathcal{O}(\sqrt{m})$, we can achieve quasi-linear time complexity [Gawrychowski and Uznański, ICALP 2018]. We consider a natural generalisation of the approximate pattern matching problem to two-dimensional strings, which are simply square arrays of characters. The exact version of this problem has been extensively studied in the early 90s. While periodicity, which is the basic tool for one-dimensional pattern matching, admits a natural extension to two dimensions, it turns out to become significantly more challenging to work with, and it took some time until an alphabet-independent linear-time algorithm has been obtained by Galil and Park [SICOMP 1996]. In the approximate two-dimensional pattern matching, we are given a pattern of size $m\times m$ and a text of size $n\times n$, and ask for all locations in the text where the pattern matches with at most $k$ mismatches. The asymptotically fastest algorithm for this algorithm works in $\mathcal{O}(kn^{2})$ time [Amir and Landau, TCS 1991]. We provide a new insight into two-dimensional periodicity to improve on these 30-years old bounds. Our algorithm works in $\tilde{\mathcal{O}}((m^{2}+mk^{5/4})n^{2}/m^{2})$ time, which is $\tilde{\mathcal{O}}(n^{2})$ for $k=\mathcal{O}(m^{4/5})$.
Jonas Ellert, Johannes Fischer
A run in a string is a maximal periodic substring. For example, the string $\texttt{bananatree}$ contains the runs $\texttt{anana} = (\texttt{an})^{3/2}$ and $\texttt{ee} = \texttt{e}^2$. There are less than $n$ runs in any length-$n$ string, and computing all runs for a string over a linearly-sortable alphabet takes $\mathcal{O}(n)$ time (Bannai et al., SODA 2015). Kosolobov conjectured that there also exists a linear time runs algorithm for general ordered alphabets (Inf. Process. Lett. 2016). The conjecture was almost proven by Crochemore et al., who presented an $\mathcal{O}(nα(n))$ time algorithm (where $α(n)$ is the extremely slowly growing inverse Ackermann function). We show how to achieve $\mathcal{O}(n)$ time by exploiting combinatorial properties of the Lyndon array, thus proving Kosolobov's conjecture.
Gabriel Bathie, Jonas Ellert, Tatiana Starikovskaya
Palindromes are non-empty strings that read the same forward and backward. The problem of recognizing strings that can be represented as the concatenation of even-length palindromes, the concatenation of palindromes of length at least two, and the concatenation of exactly $k$ palindromes was introduced in the seminal paper of Knuth, Morris, and Pratt [SIAM J. Comput., 1977]. In this work, we study the problem of recognizing so-called $k$-palindromic strings, which can be represented as the concatenation of exactly $k$ palindromes. We show the following results: 1. First, we show a structural characterization of the set of all $k$-palindromic prefixes of a string by representing it as a union of a small number of highly structured string sets, called affine prefix sets. Representing the lengths of the $k$-palindromic prefixes in this way requires $O(6^{k^2} \cdot \log^k n)$ space. By constructing a lower bound, we show that the space complexity is optimal up to polylogarithmic factors for reasonably small values of $k$. 2. Secondly, we derive a read-only algorithm that, given a string $T$ of length $n$ and an integer $k$, computes a compact representation of $i$-palindromic prefixes of $T$, for all $1 \le i \le k$. The algorithm uses $O(n \cdot 6^{k^2} \cdot \log^k n)$ time and $O(6^{k^2} \cdot \log^k n)$ space. 3. Finally, we also give a read-only algorithm for computing the palindromic length of $T$, which is the smallest $\ell$ such that $T$ is $\ell$-palindromic. Here, we achieve $O(n \cdot 6^{\ell^2} \cdot \log^{\lceil{\ell/2 \rceil}} n)$ time and $O(6^{\ell^2} \cdot \log^{\lceil{\ell/2\rceil}} n)$ space. For some values of $\ell$, this is the first algorithm for palindromic length that uses $o(n)$ additional working space on top of the input.
Jonas Ellert, Tomasz Kociumaka
A key principle in string processing is local consistency: using short contexts to handle matching fragments of a string consistently. String synchronizing sets [Kempa, Kociumaka; STOC 2019] are an influential instantiation of this principle. A $τ$-synchronizing set of a length-$n$ string is a set of $O(n/τ)$ positions, chosen via their length-$2τ$ contexts, such that (outside highly periodic regions) at least one position in every length-$τ$ window is selected. Among their applications are faster algorithms for data compression, text indexing, and string similarity in the word RAM model. We show how to preprocess any string $T \in [0..σ)^n$ in $O(n\logσ/\log n)$ time so that, for any $τ\in[1..n]$, a $τ$-synchronizing set of $T$ can be constructed in $O((n\logτ)/(τ\log n))$ time. Both bounds are optimal in the word RAM model with word size $w=Θ(\log n)$. Previously, the construction time was $O(n/τ)$, either after an $O(n)$-time preprocessing [Kociumaka, Radoszewski, Rytter, Waleń; SICOMP 2024], or without preprocessing if $τ<0.2\log_σn$ [Kempa, Kociumaka; STOC 2019]. A simple version of our method outputs the set as a sorted list in $O(n/τ)$ time, or as a bitmask in $O(n/\log n)$ time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in $O(1)$ time and rank queries in $O(\log(\tfrac{\logτ}{\log\log n}))$ time, matching unconditional cell-probe lower bounds for $τ\le n^{1-Ω(1)}$. We achieve this via a new framework for processing sparse integer sequences in a custom variable-length encoding. For rank and select queries, we augment the optimal variant of van Emde Boas trees [Pătraşcu, Thorup; STOC 2006] with a deterministic linear-time construction. The above query-time guarantees hold after preprocessing time proportional to the encoding size (in words).
Philip Bille, Jonas Ellert, Johannes Fischer, Inge Li Gørtz, Florian Kurpicz, Ian Munro, Eva Rotenberg
We present the first linear time algorithm to construct the $2n$-bit version of the Lyndon array for a string of length $n$ using only $o(n)$ bits of working space. A simpler variant of this algorithm computes the plain ($n\lg n$-bit) version of the Lyndon array using only $\mathcal{O}(1)$ words of additional working space. All previous algorithms are either not linear, or use at least $n\lg n$ bits of additional working space. Also in practice, our new algorithms outperform the previous best ones by an order of magnitude, both in terms of time and space.
Gabriel Bathie, Itai Boneh, Panagiotis Charalampopoulos, Jonas Ellert, Tatiana Starikovskaya
We study the Longest Common Extension (LCE) problem in a string containing wildcards. Wildcards (also called "don't cares" or "holes") are special characters that match any other character in the alphabet, similar to the character "?" in Unix commands or "." in regular expression engines. We consider the problem parametrized by $G$, the number of maximal contiguous groups of wildcards in the input string. Our main contribution is a simple data structure for this problem that can be built in $O(n (G/t) \log n)$ time, occupies $O(nG/t)$ space, and answers queries in $O(t)$ time, for any $t \in [1, G]$. Up to the $O(\log n)$ factor, this interpolates smoothly between the data structure of Crochemore et al. [JDA 2015], which has $O(nG)$ preprocessing time and space, and $O(1)$ query time, and a simple solution based on the "kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has $O(n)$ preprocessing time and space, and $O(G)$ query time. By establishing a connection between this problem and Boolean matrix multiplication, we show that our solution is optimal, up to subpolynomial factors, among combinatorial data structures when $G = Ω(n^ε)$ under a widely believed hypothesis. In addition, we develop a simple deterministic combinatorial algorithm for sparse Boolean matrix multiplication. We further establish a conditional lower bound for non-combinatorial data structures, stating that $O(nG/t^4)$ preprocessing time (resp. space) is optimal, up to subpolynomial factors, for any data structure with query time $t$ for a wide range of $t$ and $G$, assuming the well-established $\textsf{3SUM}$ (resp. $\textsf{Set-Disjointness}$) conjecture. Finally, we show that our data structure can be used to obtain efficient algorithms for approximate pattern matching and structural analysis of strings with wildcards.
Panagiotis Charalampopoulos, Jonas Ellert, Manal Mohamed
The Cartesian tree of a sequence captures the relative order of the sequence's elements. In recent years, Cartesian tree matching has attracted considerable attention, particularly due to its applications in time series analysis. Consider a text $T$ of length $n$ and a pattern $P$ of length $m$. In the exact Cartesian tree matching problem, the task is to find all length-$m$ fragments of $T$ whose Cartesian tree coincides with the Cartesian tree $CT(P)$ of the pattern. Although the exact version of the problem can be solved in linear time [Park et al., TCS 2020], it remains rather restrictive; for example, it is not robust to outliers in the pattern. To overcome this limitation, we consider the approximate setting, where the goal is to identify all fragments of $T$ that are close to some string whose Cartesian tree matches $CT(P)$. In this work, we quantify closeness via the widely used Hamming distance metric. For a given integer parameter $k>0$, we present an algorithm that computes all fragments of $T$ that are at Hamming distance at most $k$ from a string whose Cartesian tree matches $CT(P)$. Our algorithm runs in time $\mathcal O(n \sqrt{m} \cdot k^{2.5})$ for $k \leq m^{1/5}$ and in time $\mathcal O(nk^5)$ for $k \geq m^{1/5}$, thereby improving upon the state-of-the-art $\mathcal O(nmk)$-time algorithm of Kim and Han [TCS 2025] in the regime $k = o(m^{1/4})$. On the way to our solution, we develop a toolbox of independent interest. First, we introduce a new notion of periodicity in Cartesian trees. Then, we lift multiple well-known combinatorial and algorithmic results for string matching and periodicity in strings to Cartesian tree matching and periodicity in Cartesian trees.
Panagiotis Charalampopoulos, Taha El Ghazi, Jonas Ellert, Paweł Gawrychowski, Tatiana Starikovskaya
Many string processing problems can be phrased in the streaming setting, where the input arrives symbol by symbol and we have sublinear working space. The area of streaming algorithms for string processing has flourished since the seminal work of Porat and Porat [FOCS 2009]. Unfortunately, problems with efficient solutions in the classical setting often do not admit efficient solutions in the streaming setting. As a bridge between these two settings, Saks and Seshadhri [SODA 2013] introduced the asymmetric streaming model. Here, one is given read-only access to a (typically short) reference string $R$ of length $m$, while a text $T$ arrives as a stream. We provide a generic technique to reduce fundamental string problems in the asymmetric streaming model to the online read-only model, lifting several existing algorithms and generally improving upon the state of the art. Most notably, we obtain asymmetric streaming algorithms for exact and approximate pattern matching (under both the Hamming and edit distances), and for relative Lempel-Ziv compression. At the heart of our approach lies a novel tool that facilitates efficient computation in the asymmetric streaming model: the suffix random access data structure. In its simplest variant, it maintains constant-time random access to the longest suffix of (the seen prefix of) $T$ that occurs in $R$. We show a bidirectional reduction between suffix random access and function inversion, a central problem in cryptography. On the way to our upper bound, we propose a variant of the string synchronizing sets ([Kempa and Kociumaka; STOC 2019]) with a local sparsity condition that, as we show, admits an efficient streaming construction algorithm. We believe that our framework and techniques will find broad applications in the development of small-space string algorithms.