Rare
 0/6

Suffix Array

Authors: Benjamin Qi, Kevin Sheng

Quickly sorting suffixes of a string and its applications

Edit This Page

Prerequisites

The suffix array of a string is the sorted array of all possible suffixes of the string. Each suffix is usually represented by its starting index.

For example, if our string were abcbcba, the suffix array would be as follows:

SuffixString
6a
0abcbcba
5ba
3bcba
1bcbcba
4cba
2cbcba

Construction

Focus Problem – try your best to solve this problem before continuing!

The general philosophy behind the construction of a suffix array is to sort incrementally. We first start by comparing the first character of each suffix, and then double the amount we compare until we're comparing the entire length of the string. If this sounds really abstract, don't worry! All the implementation details will be hammered out below.

Initial Transformation

For convenience, let's start by appending a "null character" of sorts to the end of the string. This acts as a tiebreaker and will ensure we never hit the end of a suffix while comparing two suffixes. $ or the space character are possible options, as the ASCII code of either are less than a.

To avoid out of bounds issues, we'll also append the string to itself. Notice that since any comparisons would have stopped at the null character, this has no effect on the order of the suffix array.

After implementing these two transformations, the strings we compare would look like this (still using abcbcba as an example):

SuffixString
7$abcbcba
6a$abcbcba
0abcbcba$abcbcba
5ba$abcbcba
3bcba$abcbcba
1bcbcba$abcbcba
4cba$abcbcba
2cbcba$abcbcba

The last suffix that starts with $ will always come first; it doesn't particularly matter.

Sorting

As stated earlier, let's first just consider the first character in each substring and sort it:

SuffixString
7$abcbcba
0abcbcba$abcbcba
6a$abcbcba
1bcbcba$abcbcba
3bcba$abcbcba
5ba$abcbcba
2cbcba$abcbcba
4cba$abcbcba

From here we start doubling the length of string we compare. To get to the point where we efficiently compare suffixes, let's first do some analysis.

Say we just sorted all the suffixes by the first 22 characters, and we're now comparing the first four characters of suffixes 22 and 44:

SuffixString
2cbcb
4cba$

If the first 22 characters of suffix 22 were already less than or greater than those of suffix 44, nothing changes. However, in this case they aren't; we're going to have to compare the right half.

How might we do this? Well, the cool thing is that since we just sorted all suffixes by the first 22, we actually have information about the right halves already! The right half of the first four characters of suffix 22 is just the first two characters of suffix 44. Similarly, the right half of suffix 44 is the first two characters of suffix 66.

To compare these half-suffixes, after each iteration of sorting we can assign each half-baked suffix a number representing its position in the array. If two half-baked suffixes are equal, then they would get the same number. These numbers are usually called equivalence classes.

Applying this augmentation to our array sorted by the first character, we would now have:

Eq. ClassSuffixString
07$abcbcba
10abcbcba$abcbcba
16a$abcbcba
21bcbcba$abcbcba
23bcba$abcbcba
25ba$abcbcba
32cbcba$abcbcba
34cba$abcbcba

Optimization

Since we're doing O(NlogN)\mathcal{O}(N \log N) sorts logN\log N times, our current algorithm as it stands is O(Nlog2N)\mathcal{O}(N \log^2 N).

Although this is adequate, it can sometimes be a bit too slow especially when creating a suffix array is often just the first step in many problems. Fortunately, we can fix this! Since everything we're comparing is in the range [0,N)[0, N), we can use radix sort to remove a log factor from our complexity.

Implementation

Time Complexity: O(NlogN)\mathcal{O}(N \log N)

C++

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using std::cout;
using std::endl;
using std::pair;
using std::vector;

A more compact implementation is also below.

Resources
Benq

O(N log N)

Java

Warning!

The below implementation will TLE on some test cases due to Java's constant factor.

import java.io.*;
import java.util.*;
import java.util.function.Function;
public class SuffixArray {
static class Suffix implements Comparable<Suffix> {
public int start;
public int leftEq;
public int rightEq;

It's recommended that you also test your implementation against a brute force solution for many small strings.

Here's a small script that outputs a test case and its answer:

C++

#include <algorithm>
#include <iostream>
#include <vector>
using std::cout;
using std::endl;
using std::vector;
int main() {
int n = 50; // adjust n as you please

Java

import java.io.*;
import java.util.*;
public class Generator {
public static void main(String[] args) throws IOException {
Random rng = new Random();
int n = 50; // adjust n as you please
StringBuilder sb = new StringBuilder();
for (int i = 0; i < n; i++) { sb.append((char)('a' + rng.nextInt(26))); }

Python

from random import choice
from string import ascii_lowercase
n = 50 # adjust n as you please
str_ = [choice(ascii_lowercase) for _ in range(n)]
print("".join(str_))
suffs = list(range(n))
suffs.sort(key=lambda s: str_[s:])
print(" ".join(str(i) for i in suffs))

LCP Array

The LCP array is a common auxiliary array based on the suffix array that stores the longest common prefix (LCP) between adjacent elements of the suffix array.

Taking our example string from above, such an array would look like this:

LCPSuffixString
NA7$abcbcba
00abcbcba$abcbcba
16a$abcbcba
01bcbcba$abcbcba
33bcba$abcbcba
15ba$abcbcba
02cbcba$abcbcba
24cba$abcbcba

Each row contains the LCP between the row's string and that of the previous row.

Construction

An algorithm often used to construct this LCP array starts by calculating the LCP for the suffix starting at 00 and its previous suffix. It then calculates the LCP for the suffix starting at 11 and its previous suffix, and so on and so forth. The reason for this calculation order is so that we can apply a clever optimization that is best described through our example.

Say we just finished calculating the LCP for the row of suffix 33. Our next step, then, would be to calculate the LCP for suffix 44.

One might think that we would have to start comparing the prefixes all over again for this suffix, but we actually don't have to! Remember that we just calculated the LCP of suffixes 33 and 11 to be 33 characters, which means that the suffixes 44 and 22, and all the suffixes between them in the array (despite there not being any in our example) must have at least 22 characters in common.

What this means is that we can just start comparing from the third character to compute the LCP for the row of suffix 44 instead of starting all over.

In general, we keep a variable start_at\texttt{start\_at} which keeps the number of common characters we know the current suffix has with its predecessor in the suffix array. This variable tells us how many characters we can skip, as we know they're equal. After each computation of an LCP, we set this variable equal to its value minus one, except when the LCP is already 00.

You'll see an implementation of this algorithm in the next section.

Application

Now that we've constructed this LCP array, let's see an application of it.

Focus Problem – try your best to solve this problem before continuing!

A crucial observation to be made for this problem is that every substring can be represented as a prefix of some suffix. We know that suffix ii has NiN-i (recall that NN is the length of the string) possible prefixes.

Notice that the number of prefixes of suffix ii that are duplicates is just the longest common prefix between ii and its predecessor. With this, we can iterate through the LCP array and calculate our answer in one fell swoop.

Implementation

Time Complexity: O(NlogN)\mathcal{O}(N \log N)

C++

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using std::cout;
using std::endl;
using std::pair;
using std::string;
using std::vector;

Java

Warning!

This solution, like the previous one, will TLE for the same reason stated above.

import java.io.*;
import java.util.*;
import java.util.function.Function;
public class SuffixArray {
Code Snippet: Suffix Class (Click to expand)
public static void main(String[] args) throws IOException {
BufferedReader read = new BufferedReader(new InputStreamReader(System.in));
String str = read.readLine() + '$';

Burrows-Wheeler Transform

Resources
GFG

could be simpler?

Focus Problem – try your best to solve this problem before continuing!

View Internal Solution

Run Enumerate

Resources
cp-algo

could be simpler?

Focus Problem – try your best to solve this problem before continuing!

This section is not complete.

Any help would be appreciated! Just submit a Pull Request on GitHub.

you can do this easily with a suffix array

Problems

StatusSourceProblem NameDifficultyTags
PlatinumHard
Show TagsSuffix Array
CFHard
Show TagsSuffix Array

Module Progress:

Join the USACO Forum!

Stuck on a problem, or don't understand a module? Join the USACO Forum and get help from other competitive programmers!