PROBLEM LINKS

DIFFICULTY

MEDIUM

PREREQUISITES

Suffix Array, Longest Common Prefix Array

PROBLEM

Given two strings a and b, let A be the set of all substrings of a, and B be the set of all substrings of b.

Find the number of unique strings that belong to exactly one of A and B, that is, the unique strings of A plus the unique strings of B, excluding those common to both.

QUICK EXPLANATION

Efficient solutions to a large set of string problems become approachable with Suffix Arrays (or Suffix Trees). If you have not yet added Suffix Sorting (the construction of Suffix Arrays) to your skill-set, this is the perfect problem to do so.

Suffix Sorting of a string of length L can be done using

the Manber-Myers algorithm in O(L log L) time
the Kärkkäinen-Sanders algorithm in O(L) time

Finding the Longest Common Prefix (LCP) of adjacent items in the list of sorted suffixes is a classical problem in strings. Most texts discuss the construction of Suffix Arrays immediately followed by the calculation of LCP arrays.

The LCP Array can be built by augmenting either of the Suffix Sorting algorithms above, without increasing their time complexity. It can also be computed as a separate step after the Suffix Sorting, for example in O(L) time with Kasai's algorithm.
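As an illustration of the "separate step" route, here is a minimal Python sketch. It is not the setter's or tester's code: the suffix sort uses simple prefix doubling with a comparison sort (O(L log^2 L), slower than either algorithm named above), followed by Kasai's algorithm for the LCP array. The function names and the 0-based indexing are assumptions made for this sketch.

```python
def build_suffix_array(s):
    """Prefix doubling: sort suffixes by their first 1, 2, 4, ... characters."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(ch) for ch in s]
    k = 1
    while n > 1:
        # Sort key: rank of the first k characters, then of the next k.
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all suffixes already distinguished
            break
        k *= 2
    return sa


def build_lcp(s, sa):
    """Kasai's algorithm: lcp[j] is the LCP length of suffixes sa[j-1] and sa[j]."""
    n = len(s)
    pos = [0] * n  # pos[i] = position of suffix i in the sorted order
    for j, start in enumerate(sa):
        pos[start] = j
    lcp = [0] * n  # lcp[0] is unused and left at 0
    h = 0
    for i in range(n):
        if pos[i] > 0:
            prev = sa[pos[i] - 1]
            while i + h < n and prev + h < n and s[i + h] == s[prev + h]:
                h += 1
            lcp[pos[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


sa = build_suffix_array("abaabba")
print(sa)                        # 0-based start positions of the sorted suffixes
print(build_lcp("abaabba", sa))
```

Kasai's algorithm runs in linear time because h decreases by at most one per suffix, so the total number of character comparisons stays bounded by a constant multiple of L.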

Both O(L) and O(L log L) approaches will fit within the time limit for this problem. Hence, choose as you please :)

Let us assume that we can find the number of unique sub-strings of a string by use of the LCP array for the suffixes of that string. (We will see the algorithm to do so in the EXPLANATION section)

U(X) = number of unique strings in the set X

We are given two strings a and b and their sets of substrings A and B respectively. We wish to count the strings that lie in exactly one of A and B, which is

U(A) + U(B) - 2*U(A intersection B)

Since U(A intersection B) = U(A) + U(B) - U(A union B), this is equal to

U(A) + U(B) - 2*(U(A) + U(B) - U(A union B))

= 2*U(A union B) - U(A) - U(B)
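The count of strings that belong to exactly one of A and B can be sanity-checked on small inputs with a brute-force Python sketch that builds both substring sets explicitly. The helper names are hypothetical, and this is far too slow for the real constraints. It compares the size of the symmetric difference with the closed form 2*U(A union B) - U(A) - U(B), which follows from U(A intersection B) = U(A) + U(B) - U(A union B).

```python
def substrings(s):
    """Set of all non-empty substrings of s (quadratically many, small inputs only)."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def exactly_one_brute(a, b):
    """Strings that are substrings of exactly one of a and b."""
    return len(substrings(a) ^ substrings(b))


def exactly_one_formula(a, b):
    """Same count, expressed through U(A), U(B) and U(A union B)."""
    A, B = substrings(a), substrings(b)
    return 2 * len(A | B) - len(A) - len(B)


for a, b in [("ab", "bc"), ("aba", "ab"), ("abaabba", "bab")]:
    print(a, b, exactly_one_brute(a, b), exactly_one_formula(a, b))
```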

EXPLANATION

Let us see how to find the number of unique substrings of a string by using the LCP array.

If we consider all the prefixes of all the suffixes, we would be considering the entire set of sub-strings.
The LCP array helps us determine how many prefixes to ignore for each suffix.
We of course do not ignore any prefix of the first suffix. These are all valid and unique substrings.

Let s be given string
    indexes in s are 1-based
Let S be the suffix array
    S stores the list of 1-based indexes
    that represent the start position of the suffix of s
Let L be the LCP array
    L(i) is the length of the longest common prefix between
    the suffixes starting from S(i) and S(i-1)
    Thus, L is only defined from 2 to |s|

uniq_sub_strings = |s| - S[1] + 1
// thus we count all prefixes of the first suffix

for i = 2 to |s|
    uniq_sub_strings += |s| - S[i] + 1 - L[i]

Let us try this with an example to see why this is correct.

s = abaabba
S = [
    7,    // a
    3,    // aabba
    1,    // abaabba
    4,    // abba
    6,    // ba
    2,    // baabba
    5     // bba
]
L = [
    0,    // not defined for L[1]
    1,    // a is the common prefix
    1,    // a
    2,    // ab
    0,    // nothing
    2,    // ba
    1     // b
]

uniq_sub_strings =
    1 + 4 + 6 + 2 + 2 + 4 + 2 = 21

Thus, there are 21 substrings (out of 28) that are unique. You can verify this by listing the 7 occurrences that repeat an already counted sub-string (and hence weren't counted):

{
    a,  // prefix of aabba
        // since a was counted as prefix of "a"
    a,  // prefix of abaabba
    a,  // prefix of abba
    ab, // prefix of abba
        // since ab was counted as prefix of "abaabba"
    b,  // prefix of baabba
        // since b was counted as prefix of "ba"
    ba, // prefix of baabba
        // since ba was counted as prefix of "ba"
    b   // prefix of bba
}

Now, you can find U(A) and U(B) by using the above algorithm. The only part that remains is to calculate U(A union B).

This can be done by considering a string

c = a + "$" + b

We can find all the unique substrings of c and subtract from the result all the strings that contain the character '$'. Every substring of c that contains '$' is unique, since '$' is not part of the alphabet and therefore occurs exactly once in c.

There are (|a|+1) * (|b|+1) substrings of c that contain '$'.

Thus, we can find U(A union B), the number of unique substrings that exist in either A or B.
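The separator trick can be sketched in Python as follows. The distinct-substring counter is a compact naive version (sort the suffixes, subtract adjacent LCPs), and all function names are assumptions made for this sketch. It also assumes that '$' occurs in neither a nor b. The last function combines the three counts as 2*U(A union B) - U(A) - U(B), the number of strings in exactly one of A and B.

```python
def count_distinct(s):
    """Distinct non-empty substrings via sorted suffixes and adjacent LCPs (naive)."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    total = 0
    prev = ""
    for cur in suffixes:
        k = 0
        while k < len(prev) and k < len(cur) and prev[k] == cur[k]:
            k += 1
        total += len(cur) - k  # prefixes longer than the LCP are new
        prev = cur
    return total


def count_union(a, b, sep="$"):
    """U(A union B): distinct substrings of a + sep + b that avoid sep."""
    c = a + sep + b
    # A substring containing sep starts at one of len(a) + 1 positions and
    # ends at one of len(b) + 1 positions; all of these strings are distinct.
    return count_distinct(c) - (len(a) + 1) * (len(b) + 1)


def exactly_one(a, b):
    """Strings that are substrings of exactly one of a and b."""
    return 2 * count_union(a, b) - count_distinct(a) - count_distinct(b)


print(count_union("ab", "bc"))   # 5: a, b, ab, c, bc
print(exactly_one("ab", "bc"))   # 4: a, ab, c, bc
```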

We can find the number of unique sub-strings of either a or b, but not both, in time O(F(|a|) + F(|b|) + F(|a| + |b| + 1)), where F(n) is the complexity of building the Suffix Array or the LCP Array (whichever is larger). In the best implementation, F(n) = O(n), but the problem can be solved even with F(n) = O(n log n).

SETTER'S SOLUTION

Can be found here.

TESTER'S SOLUTION

Can be found here.

TASTR - Editorial
