Making as many unique strings as possible by removing two characters

Problem

I’m attempting a programming challenge type task in C#, where the goal is to determine how many unique strings can be obtained by removing two characters. The prompt for the task implied that I should create the set of all possible strings with 2 chars removed, then return the number of items in the set. I was initially suspicious of this, as I’ve often found that the only way to complete these kinds of tasks is to abstract away from actually storing results or enumerating possibilities wherever possible – but due to the requirement that only unique strings should be counted, I don’t know how to avoid storing information about every result so far. Supposedly I have to be able to handle strings of up to a million characters in length – and I can’t think of any way right now to avoid the hideous iteration count and massive result set that a million character string would require.

Here is my code so far. It works, but it’s way too slow, and I suspect large inputs might be generating incorrect results, though I’m not actually sure:

private static void Main(string[] args)
{
    var input = Console.ReadLine();
    Console.WriteLine(FindBeautifulStrings(input).Count);
}

// B will always be larger than A because of the way we're iterating so we have to remove it first.
private static string RemoveTwo(string input, int indexA, int indexB)
{
    return input.Remove(indexB, 1).Remove(indexA, 1);
}

private static HashSet<int> FindBeautifulStrings(string input)
{
    // Iterate over every character in the string, then for each character, iterate over every
    // other character, removing the two selected characters; return set of all possible results.
    int inputLength = input.Length;
    HashSet<int> results = new HashSet<int>();
    for (int i = 0; i < inputLength; ++i)
    {
        for (int j = i + 1; j < inputLength; ++j)
        {
            results.Add(RemoveTwo(input, i, j).GetHashCode());
        }
    }
    return results;
}

Storing hashes of strings instead of strings themselves is the only idea I’ve come up with in terms of more efficiently detecting a string identical to one I’ve already seen. Since the non-duplicate requirement means that the resulting combinations themselves are significant, I haven’t been able to avoid working with strings entirely and solve it mathematically instead (if duplicates were permissible, I feel like this could be solved using the formula n! / ((n - (n - 2))! * (n - 2)!), which simplifies to n(n - 1)/2). Is there any way to determine the non-duplicate possibilities mathematically without iterating over or storing the strings themselves? If not, is there any way to optimise what I have so far?

Edit:

I thought I should clarify – although the question intuitively feels like it’s about permutations, a mistake made both by myself initially and by a few others so far, the only operation performed on the original string is removal of characters.

It works, I think, like this:

input: apple

i=0, j=1: ple
i=0, j=2: ple **Doesn't count, duplicate**
i=0, j=3: ppe
i=0, j=4: ppl
i=1, j=2: ale
i=1, j=3: ape
i=1, j=4: apl
i=2, j=3: ape **Doesn't count, duplicate**
i=2, j=4: apl **Doesn't count, duplicate**
i=3, j=4: app

Unique strings: 7
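
The walkthrough above can be reproduced with a small brute-force checker. This is only a sketch, written in C++ (the language the solution below uses) rather than C#, and the function name countByEnumeration is made up for illustration. It stores the actual strings instead of their hash codes, so two different results can never be merged by a hash collision, which may be the source of the suspected wrong answers on large inputs. It is only practical for short strings.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Brute-force reference (hypothetical helper): build every string obtained by
// deleting the characters at positions i < j and let std::set drop duplicates.
// This creates O(n^2) strings of length n - 2, so use it on short inputs only.
std::size_t countByEnumeration(const std::string& input) {
    std::set<std::string> results;
    for (std::size_t i = 0; i < input.size(); ++i) {
        for (std::size_t j = i + 1; j < input.size(); ++j) {
            std::string candidate = input;
            candidate.erase(j, 1); // erase the later index first so i stays valid
            candidate.erase(i, 1);
            results.insert(candidate);
        }
    }
    return results.size();
}
```

Calling countByEnumeration("apple") yields 7, matching the table above, which makes the function handy for checking a faster approach on small inputs.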

Solution

I think your intuition is correct that we should be able to compute this mathematically, rather than by enumerating the actual results.

First of all, run-length encode the given string so that no two adjacent entries hold the same character: collapse every run of consecutive identical characters (a ‘series’) into a single character and store the length of the run (its frequency) along with it.

e.g. aabaa -> a,2 ; b,1 ; a,2

This can be done (using C++ to demonstrate) as given below:

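  string str;
  cin >> str;
  int n = str.length();
  vector<char> v;       // one character per series
  vector<long long> f;  // length (frequency) of the corresponding series
  for (int i = 0; i < n; i++)
  {
    int j = i + 1;
    long long cnt = 1;
    // extend the series while the next character is the same
    while (j < n && str[j] == str[i])
    {
      cnt++;
      j++;
    }
    v.push_back(str[i]);
    f.push_back(cnt);
    i = j - 1; // the loop's i++ then moves to the start of the next series
  }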

Now that the string is encoded, we move to the next part. Here v contains the characters and f contains the corresponding frequencies. We need to remove 2 characters, and there are 3 cases to consider:

  1. Removing 2 characters which don’t belong to the same series
    e.g. with 0-based indexing for str = “aabaa”, the a at position 0 and the a at position 3 or 4 belong to different series.
    There are kC2 = (k*(k-1))/2 ways to pick two different series, where k is the number of series (the size of v and of f).
  2. Removing 2 characters which belong to the same series
    e.g. the a at position 0 and the a at position 1 belong to the same series.
    This adds x results, where x is the number of series with more than 1 element.
    The above two cases are added to the answer, but this over-counts. Why?
  3. Removing a series with frequency 1 can merge two series of the same character.
    e.g. in aabaa, removing b merges both series of a. Removing b together with an a from the left series and removing b together with an a from the right series give the same string (aaa), yet kC2 counted them as two results. So we decrement the answer by 1 for every such case.
    This gives the final answer. The code below computes it:
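  long long k = f.size(), two = 0;
  // case 2: count the series with more than one element
  for (size_t i = 0; i < f.size(); i++)
    if (f[i] > 1)
      two++;
  // cases 1 and 2 combined (two must be counted before it is used here)
  long long ans = k * (k - 1) / 2 + two;
  // case 3: i + 1 < f.size() avoids unsigned underflow when f is empty
  for (size_t i = 1; i + 1 < f.size(); i++)
  {
    if (f[i] != 1) // only applicable to a series of frequency 1
      continue;
    if (v[i - 1] == v[i + 1])
    {
      ans--;
    }
  }
  cout << ans << endl;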
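
The two fragments above can be combined into one function. This is a minimal sketch under the assumption that the three cases are exactly as described. The name countAfterTwoDeletions is hypothetical, and the input is passed as a parameter instead of being read from cin.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical wrapper around the two steps above:
// run-length encode the string, then count with the three cases.
long long countAfterTwoDeletions(const std::string& s) {
    std::vector<char> v;       // character of each series
    std::vector<long long> f;  // frequency of each series
    for (std::size_t i = 0; i < s.size();) {
        std::size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;
        v.push_back(s[i]);
        f.push_back(static_cast<long long>(j - i));
        i = j;
    }

    const long long k = static_cast<long long>(f.size());
    long long ans = k * (k - 1) / 2;        // case 1: two different series
    for (long long len : f)
        if (len > 1) ++ans;                 // case 2: both from the same series
    for (std::size_t i = 1; i + 1 < f.size(); ++i)
        if (f[i] == 1 && v[i - 1] == v[i + 1])
            --ans;                          // case 3: merge counted twice
    return ans;
}
```

For “apple” this returns 7 and for “aabaa” it returns 4. The work is linear in the length of the string, so a million characters is not a problem.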

For the same problem but with a single deletion, it’s easy: for every run of k consecutive identical characters, the k possible deletions give identical strings, so you discount k - 1 of them. There are N possible deletions, so the number of distinct strings is N - Σ(k - 1), summed over all runs, which is simply the number of runs.
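
As a rough illustration of the single-deletion count, here is a short C++ sketch. The function name countAfterOneDeletion is an assumption made for this example. It starts from the N possible deletions and subtracts k - 1 for every run.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: distinct strings obtainable by deleting one character.
long long countAfterOneDeletion(const std::string& s) {
    long long distinct = static_cast<long long>(s.size()); // N possible deletions
    for (std::size_t i = 0; i < s.size();) {
        std::size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;   // [i, j) is one run
        distinct -= static_cast<long long>(j - i) - 1; // discount k - 1 per run
        i = j;
    }
    return distinct;
}
```

For “apple” the runs are a, pp, l, e, so the result is 5 - 1 = 4 (pple, aple, appe, appl).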

With two deletions, there are various cases to consider.

  • The first easy case is that for every run of k > 2 identical characters, deleting any two of those characters gives the same result, so of the k(k-1)/2 pairs you can discount k(k-1)/2 - 1 of them.
  • The second easy case is that for every two separate runs of j and k identical characters respectively, there are jk identical strings, so you discount jk - 1 of them.
  • The complicated case is where deleting a character merges two runs. In exponential (run-length encoded) notation we have a substring x^j y x^k, and deleting the y gives j + k possibilities for the second deletion which all give the same string, so j + k - 1 of them are duplicates. The second easy case has already discounted (j - 1) + (k - 1) of these, which leaves one extra discount per merge.
    • Consider substrings of the form xyxy, which have 6 pairs of indices and 4 distinct results, because there are three ways of getting xy. We’ve discounted one for the xyx and one for the yxy, so we don’t need to do any extra processing.

If you first transform the string into a run-length encoded representation then these cases are all pretty simple to check.
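
One way to read these cases as code is sketched below. The function name countByDiscounting is made up for this example, and the sketch assumes that each merge x^j y x^k costs exactly one discount beyond those already taken by the two easy cases, which is consistent with the xyxy remark above. The sum of jk over all pairs of runs is computed from the identity Σjk = (N² - Σk²)/2 to avoid a quadratic loop.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: start from all N(N-1)/2 index pairs and subtract duplicates.
long long countByDiscounting(const std::string& s) {
    std::vector<char> ch;        // character of each run
    std::vector<long long> len;  // length of each run
    for (std::size_t i = 0; i < s.size();) {
        std::size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;
        ch.push_back(s[i]);
        len.push_back(static_cast<long long>(j - i));
        i = j;
    }
    const long long n = static_cast<long long>(s.size());
    const long long runs = static_cast<long long>(len.size());

    long long distinct = n * (n - 1) / 2;  // every pair of indices
    long long sumSquares = 0;
    for (long long k : len) {
        if (k >= 2) distinct -= k * (k - 1) / 2 - 1;  // both deletions in one run
        sumSquares += k * k;
    }
    // Deletions from two different runs: sum of (j*k - 1) over all pairs of runs.
    const long long crossPairs = (n * n - sumSquares) / 2;
    distinct -= crossPairs - runs * (runs - 1) / 2;
    // Merges: a single character between two runs of the same character.
    for (std::size_t i = 1; i + 1 < len.size(); ++i)
        if (len[i] == 1 && ch[i - 1] == ch[i + 1]) --distinct;
    return distinct;
}
```

For “aabaa” this gives 10 - 5 - 1 = 4, and for “apple” it gives 10 - 3 = 7.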

  • Instead of using string.Remove (creating 2 strings for each index combination), you could create an array of chars / ints and work on that array.
  • You don’t have to iterate from i == 0 to i == str.Length and j == 0 to j == str.Length because that results in duplicated indexes (e.g. 1,2 and 2,1).
  • The idea with the hashes is good, but as mentioned in a comment, you have to check for all “uncertain duplicates” whether they are actually duplicates.

The following code shows a simple implementation that considers the points above:

public class Variation
{
    public Variation(int hash, int index1, int index2)
    {
        Hash = hash;
        Index1 = index1;
        Index2 = index2;
    }
    public int Index1 { get; }
    public int Index2 { get; }
    public int Hash { get; }
}

public int CountVariations(string input)
{
    int[] inputArray = input.ToCharArray().Select(c => (int)c).ToArray();
    var variations = new List<Variation>();
    for (int i = 0; i < input.Length; i++)
        for (int j = i + 1; j < input.Length; j++)
            variations.Add(GetVariation(inputArray, i, j));
    
    var groups = variations.GroupBy(v => v.Hash).ToArray();
    
    var uncertainDuplicates = groups.Where(g => g.Skip(1).Any()).ToArray();
    
    var duplicatesRealCount = GetRealCount(inputArray, uncertainDuplicates);
    
    return groups.Length - uncertainDuplicates.Length + duplicatesRealCount;
}

private int GetRealCount(int[] inputArray, IEnumerable<IGrouping<int, Variation>> duplicates)
{
    // todo: check if the duplicates are actually identical
    return duplicates.Count();
}

private static Variation GetVariation(int[] inputArray, int index1, int index2)
{
    var hashValue = Enumerable
        .Range(0, inputArray.Length)
        .Where(i => i != index1 && i != index2)
        .Select(i => inputArray[i])
        .Aggregate((hash, val) => hash ^ val);
        
    return new Variation(hashValue, index1, index2);
}
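
The todo in GetRealCount is the step that makes the hash approach correct. Below is a rough C++ sketch of one way to do that check. The names IndexPair, sameResult and countWithVerifiedHashes are hypothetical, and the sketch verifies each candidate as it is produced rather than in a separate pass. It uses the same XOR hash as the snippet above and compares two deletions character by character without building the shortened strings. It is still quadratic or worse, so it only helps with correctness, not with million-character inputs.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using IndexPair = std::pair<std::size_t, std::size_t>;

// True when deleting the indices in a and deleting the indices in b
// leave the same characters in the same order.
bool sameResult(const std::string& s, IndexPair a, IndexPair b) {
    const std::size_t n = s.size();
    std::size_t p = 0, q = 0;
    while (true) {
        while (p < n && (p == a.first || p == a.second)) ++p; // skip deleted
        while (q < n && (q == b.first || q == b.second)) ++q;
        if (p == n || q == n) return p == n && q == n;
        if (s[p] != s[q]) return false;
        ++p;
        ++q;
    }
}

std::size_t countWithVerifiedHashes(const std::string& s) {
    unsigned all = 0;
    for (unsigned char c : s) all ^= c;  // XOR of every character

    // hash -> one representative index pair per distinct result seen so far
    std::unordered_map<unsigned, std::vector<IndexPair>> buckets;
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        for (std::size_t j = i + 1; j < s.size(); ++j) {
            // XOR-ing a character a second time removes it from the hash.
            const unsigned hash = all ^ static_cast<unsigned char>(s[i])
                                      ^ static_cast<unsigned char>(s[j]);
            std::vector<IndexPair>& reps = buckets[hash];
            bool seen = false;
            for (const IndexPair& r : reps) {
                if (sameResult(s, r, IndexPair(i, j))) { seen = true; break; }
            }
            if (!seen) {  // same hash but a genuinely new string
                reps.push_back(IndexPair(i, j));
                ++distinct;
            }
        }
    }
    return distinct;
}
```

An equal hash only marks a candidate as an “uncertain duplicate”; sameResult decides whether it really is one.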

I’m not seeing the need to keep the string characters in order in the requirements you list (though you refer to this in a comment?). If the string is “toenail”, but “toena” is as legitimate as “aneot”, then what you’re after is indeed a permutation, especially if you don’t care what the strings are, only the counts.

The number of unique permutations for a given set of characters is:

$$\frac{n!}{(n-r)!\; n_1!\; n_2! \cdots n_x!}$$

… where n is the number of characters in the string and r is the number of characters you want to use. To eliminate duplicates, you also need to divide by the factorials of the counts n_1, n_2, …, n_x of any repeated characters.

So for the word bookkeeper, you have 10 characters, but there are two O’s, two K’s, and three E’s. For unique permutations of 8 characters, the formula is:

$$\frac{10!}{(10-8)!\; 2!\; 2!\; 3!} = 75{,}600$$

It’s kind of annoying to have to track the counts of each letter, but I don’t see any way around that. You could get fancy with linq to query the letters and their counts, or many other options. But that’s the heart of it.
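
Tracking the letter counts takes only a few lines. The sketch below, in C++ with the made-up names factorial and formulaValue, simply evaluates the formula exactly as written in this answer. It is an illustration of the arithmetic only and overflows for words longer than 20 characters.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Plain factorial; the result fits in 64 bits only for n <= 20.
unsigned long long factorial(unsigned n) {
    unsigned long long result = 1;
    for (unsigned i = 2; i <= n; ++i) result *= i;
    return result;
}

// Hypothetical helper: n! / ((n - r)! * n1! * n2! * ... * nx!) as stated above.
// Assumes r <= word.size().
unsigned long long formulaValue(const std::string& word, unsigned r) {
    std::map<char, unsigned> counts;  // how often each letter occurs
    for (char c : word) ++counts[c];

    const unsigned n = static_cast<unsigned>(word.size());
    unsigned long long denominator = factorial(n - r);
    for (const auto& entry : counts) denominator *= factorial(entry.second);
    return factorial(n) / denominator;
}
```

For “bookkeeper” with r = 8 the denominator is 2! · 2! · 2! · 3! = 48, and 10!/48 = 75,600, matching the figure above.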

The closest link I can find to document this is here. The yellow-highlighted text with the big arrow pointing to it is the key.
