Counting the frequency of letters in a string slices

Posted on

Problem

While preparing a solution for the CodeChef – magic tricks, an easy question about counting the frequency of given character from the string slice, I came up with the following solution:

input_str = input("")
t = int(input(""))
for i in range(t):
    c, l1, l2 = input("").split()
    l, r = int(l1),int(l2)
    print(input_str[l-1:r].count(c)) 

But It turns out to be inefficient because it’s not meeting the time constraint. How can I better implement it?

(here, the indexing starts from 1)

Solution

Rewriting your code with Caridorc’s suggestions gets me something like

input_str = input()

iterations = int(input())
for _ in range(iterations):
    try:
        character, start, end = input().split()
    except ValueError:
        raise ValueError("Need exactly three items (character, start, end)")

    start, end = int(start), int(end)
    print(input_str[start-1:end].count(character))

Note that you should probably put this in a main method.

The slowest part is going to involve the inner iteration; namely

    try:
        character, start, end = input().split()
    except ValueError:
        raise ValueError("Need exactly three items (character, start, end)")

    start, end = int(start), int(end)
    print(input_str[start-1:end].count(character))

Getting input can’t get much faster, I’d assume, so this line to blame is

    print(input_str[start-1:end].count(character))

If start-1 and end are wildly different (they can be up to 10611061 different) then this will take a long time. For a maximum of 106106 iterations, this could take a while.

What you should do instead is pre-process the string so you can quickly do

    print(number_between(start, end, letter))

How? One way is to build cumulative lists, such as

string:  a  b  c  b  a  c  b  a  b  a
     a:  1  1  1  1  2  2  2  3  3  4
     b:  0  1  1  2  2  2  3  3  4  4
     c:  0  0  1  1  1  2  2  2  2  2

That way you can quickly it calculate with

def number_between(start, end, letter):
    return counts[letter][end] - count[letter][start-1]

This initial generation might take a long time, though, as you need to generate a list for every possible letter – and there are a lot of those. If you only generate a list for which there is a letter in the string, you’ve improved your worst case but you’re still generating a large square. This could be expensive.

One other option is to store indexes:

string:  a  b  c  b  a  c  b  a  b  a
     a:  0           4        7     9  =  [0, 4, 7, 9]
     b:     1     3        6     8     =  [1, 3, 6, 8]
     c:        2        5              =  [2, 5]

Then you can do a binary search on the list to find the number of instances before an index, and do

def number_between(start, end, letter):
    number_before_end = ... # binary search
    number_before_start = ... # binary search

    return number_before_end - number_before_start

Depending on how much startup costs affect you, this could either be better or worse than the original strategy.

I’ll stick with the second of these, since it’s less problematic in the face of Unicode. The test suite seems to ignore anything but lowercase ASCII, but I’m not willing to make such a horrible assumption.

So, first we want a bunch of lists indexed by character. This will be easiest as a defaultdict(list), since they all start out empty:

indices = defaultdict(list)

Then we want to go over the characters in the string and add the respective indices to the particular list:

for i, character in enumerate(input_string):
    indices[character].append(i)

We want to do a binary search to find the counts. This is just bisect_right and bisect_left, depending on whether you want to be inclusive:

def number_between(start, end, counts):
    number_before_start = bisect_left(counts, start)
    number_before_end = bisect_right(counts, end)

    return number_before_end - number_before_start

Now all that’s needed is input, giving:

from bisect import bisect_left, bisect_right
from collections import defaultdict

def preprocess(input_string):
    indices = defaultdict(list)

    for i, character in enumerate(input_string):
        indices[character].append(i)

    return indices

def number_between(start, end, counts):
    return bisect_right(counts, end) - bisect_left(counts, start)

def main():
    indices = preprocess(input())

    for _ in range(int(input())):
        try:
            character, start, end = input().split()
        except ValueError:
            raise ValueError("Need exactly three items (character, start, end)")

        # Change to 0-based indexing
        start, end = int(start) - 1, int(end) - 1

        print(number_between(start, end, indices[character]))

main()

Avoid one/two letter variable names

Your code is full of one and two character letters that make it very hard to read

Avoid cryptic failure on bad inputs

  • If the input has not exactly three values and you try to split your code is going to fail badly. It is better to add

:

if input_var.count(',') != 3:
    raise ValueError("Exactly three commas expected, found more or less.")

Miscellanous

  • It is common practice to use _ as a loop variable if you do not care about its value

  • Usually it is better to insert a prompt like var = input("Question?"), if you do not want that, you should anyway remove the "" because it is unnecessary

Performance

I don’t think Python can get faster than this, maybe you should try the low level C.

it turns out that this problem can be better solved by dynamic programming,
here is the better solution

input_str = input("")
d = {x:[0 for y in input_str] for x in 'abcdefghijklmnopqrstuvwxyz'}
test_cases = int(input())
d[input_str[0]][0]=1
index=0
for letter in input_str[1:]:
    index += 1
    for alpha in 'abcdefghijklmnopqrstuvwxyz':
        if alpha == letter:
            d[alpha][index] = d[alpha][index-1]+1
        else:
            d[alpha][index] = d[alpha][index-1]
for cases in range(test_cases):
    letter,start,end = input("").split()
    start=int(start)-1
    end  =int(end)-1

    print(d[letter][end])

Leave a Reply

Your email address will not be published. Required fields are marked *