find_index for first occurrence of pattern in text

Posted on


This is an implementation of find_index for a pattern in text. I got asked this coding question in a coding interview.

This Python function returns the starting index of the first occurrence of pattern in text and returns None if no index is found. For example, find_index('ababc', 'abc') returns 2 and find_index('abc', 'z') returns None.

I have the following implementation that handles most of the cases and I think the worst case running time is O(mn)O(mn).

def find_index(text, pattern):
    assert isinstance(text, str), 'text is not a string: {}'.format(text)
    assert isinstance(pattern, str), 'pattern is not a string: {}'.format(text)
    #if it's empty string, return 0
    if pattern == '':
        return 0
    text_index = 0
    while text_index != len(text):
        for i in range(len(pattern)):
            if text_index + i < len(text):
                if text[text_index + i] != pattern[i]:
                if i == len(pattern) - 1:
                    return text_index
        text_index += 1

I am not sure if I exhausted all of the edge cases, but I ran the following unit test, and it passed all tests.

import unittest

class StringsTest(unittest.TestCase):
   def test_find_index_with_matching_patterns(self):
        assert find_index('abc', '') == 0  # all strings contain empty string
        assert find_index('abc', 'a') == 0  # single letters are easy
        assert find_index('abc', 'b') == 1
        assert find_index('abc', 'c') == 2
        assert find_index('abc', 'ab') == 0  # multiple letters are harder
        assert find_index('abc', 'bc') == 1
        assert find_index('abc', 'abc') == 0  # all strings contain themselves
        assert find_index('aaa', 'a') == 0  # multiple occurrences
        assert find_index('aaa', 'aa') == 0  # overlapping pattern
        assert find_index('thisisabsolutelycrazy', 't') == 0

    def test_find_index_with_non_matching_patterns(self):
        # Negative test cases (counterexamples) with non-matching patterns
        assert find_index('abc', 'z') is None  # remember to test other letters
        assert find_index('abc', 'ac') is None  # important to test close cases
        assert find_index('abc', 'az') is None  # first letter, but not last
        assert find_index('abc', 'abz') is None  # first 2 letters, but not last

if __name__ == '__main__':


Remarks on the Code

[] The use of assert demonstrates the understanding data validation is important. It is a clear strength of the code.

[] The line if pattern == '': is a bit weaker. In part because if pattern: is equivalent. In part because if the reason for using pattern == '' is to express intent, then pattern == '' or pattern == "" more clearly expresses that intent.

[] The code phrase:

if pattern == '':
  return 0
text_index = 0

is probably better as

text_index = 0
if pattern == '':
  return text_index

[] Because the specification of the problem does not define what should happen in when pattern == '', providing an explicit behavior may be a problem. Comments might help people notice that there is an explicit behavior or that the explicit behavior is different from what they expect.

[] A docstring would be a reasonable (and probably better) place to describe the behavior when pattern == '' because docstrings are accessible without examining the source code and ordinarily used by automated documentation tools.

[] Although it is possible to iterate through Python strings as if they were arrays, it is generally better to use the methods of <class str> when possible. Iterating through a string as if it were an array makes Python code look like **old C **code (old because strstr has been part of the C Standard Library since ANSI C or C89).

[] Having unit tests for code that you have written is a strong point of the code. It reflects positively on your approach to programming.

Alternative Implemenation

The problem can be solved with the following code utilizing Python’s Standard Library. There is no guarantee that the code demonstrates what the question was designed to measure.

1:  def find_index(text, pattern):
2:     output = None
3:     try:
4:         output = text.index(pattern)
5:     except ValueError:
6:         pass
7:     return output

2: The use of output provides a clear name for the role of the variable within the function. Assigning it to None makes the default behavior explicit.

3: A try:except block is “the trick” to using the Python Standard Library and returning None if the value is not found. Using try:except blocks is not guaranteed to be what the question was designed to ask (for example if when the question’s purpose is to determine ability to iterate on a string).

5: The only way to know that index raises ValueError when the substring is not found is from the documentation/experiment. Note that this handles no other exception (example: TypeError raised by 'abc'.index(1)). Additional errors can be handled similarly.

7: The function only returns in one location. That location is at the end, which is at least as good as another location in the code.

Semi-Relevant Rambling

Remarks on Interview Coding Questions

Interview coding questions are typically selected for purposes unique to interviews:

  • Can the candidate write a very simple program? These are FizzBuzz questions. Evaluating the code for anything other than whether it works or not is probably a bad idea: both experienced and inexperienced programmers may attack the problem with brute force and both experienced and inexperienced programmers may write clever solutions that address efficiency, generalization, and so on. Fizzbuzz type questions are useful for producing a Boolean value.

  • Is the interviewer superior to the candidate? These are trick questions based on a corner case that the interviewer knows about. When trick questions produce a Boolean. Unlike FizzBuzz they do not reliably provide good technical evaluation. They are more about power relations and office politics and hence bad technical questions.

  • Does the candidate have deep technical knowledge? These can also address corner cases, but can provide information about the process the candidate uses to solve hard problems. They make sense in late-stage interviews after basic technical competence has been established. The same question can be a “trick question” in one context and a “deep technical question” in another. The purpose of the question matters.

  • How much relevant technical experience does the candidate have? These questions produce values in a continuous range and therefore allow for the ranking of candidates. A potential problem with this class of question is that the pressure of a job interview (job versus no-job) on a candidate is different from normal workplace pressure (except when normal workplace pressure means ordinary mistakes are job versus no-job). This means that these questions can rank candidates according to interviewing skill instead of technical ability. Providing this class of question as a take home problem may avoid measuring interviewing skill.

Unfortunately, it is up to the candidate to guess the purpose for which a question is asked (and that involves interviewing skill more than technical skill). But in general, a take home problem will tend to be designed to determine relevant technical experience. It’s also worth noting that companies that are worried about candidates “cheating” do not trust their interview process to produce reliable technical evaluations.

find_index as an interview question

Because the specification of find_index allows for a range of implementations it is probably not a FizzBuzz. It is most likely a How much relevant technical experience does the candidate have? type question.

One of the features of find_index is that it will reflect the candidate’s knowledge of the Python built-in functions much as a similar question in C might reflect a candidate’s knowledge of C’s standard library.

Some candidates will start solving the problem by writing code. Some candidates will start solving the problem by looking at Python’s string module. Depending on what the interviewer is looking for, either approach might be preferred. For example, find_index might be a FizzBuzz type question on iteration in a whiteboard context and a “relevant technical experience” question for Python’s Standard Library in a take-home context.

Python is a duck typed language, so it’s not considered normal to assert instance types. Instead simply assume that the user passed a string, and let the program fail if the input can’t be processed sensibly.

You calculate the length of text and pattern potentially many times during a run. Unless you plan on modifying them during the run you can save the results to a variable and reuse that.

You can get rid of if text_index + i < len(text) by changing your while statement to take into account the length of the pattern.

Apropos runtime: I’m purposefully not commenting on more advanced algorithms for text searching, mainly because I don’t memorize things I have literally never had to use.

You call len(pattern) multiple times in your loops, so you could set it to a variable to avoid recalculating it every time and also keep it easier to maintain.

I think you could clean up your nested while and for loops by replacing your while loop with a for loop and only checking up to where the pattern could still fit into the text.

For example,

len_pattern = len(pattern)
for text_index in range(len(text)-len_pattern+1):
    for i in range(len_pattern):
        if text[text_index + i] != pattern[i]:
        if i == len_pattern - 1:
            return text_index

But rather than get into nested for loops which can get kind of messy, you could just pass over your text once and hash potential substrings, then just check if your pattern is in the hashed list. This can be implemented with a dictionary to keep track of the string index:

def find_index2(text,pattern):
    len_pattern = len(pattern)    

    substrs = {}       
    for ind in range(len(text)-len_pattern+1):
        substr = text[ind:ind+len_pattern]
        if substrs.get(substr,False) is False:
            substrs[substr] = ind
    return substrs.get(pattern,None)

which would be cleaner and simpler.

This also handles 2 of your edge cases by default, when pattern is ” and when pattern is longer than text. The empty string case is handled by default but you could also hard code it to save N time or leave it as is for cleanliness.

As for the 2nd case, your current for loops would still search over the entire text even if pattern is longer than text and cannot possibly included. In both methods I have included, it would terminate at the first evaluation.

Actually you could clean up the for loop with the same logic even more…

len_pattern = len(pattern)
for text_index in range(len(text)-len_pattern+1):
    if text[text_index:text_index+len_pattern] == pattern:
        return text_index

This avoids having to maintain a separate hash.

Leave a Reply

Your email address will not be published. Required fields are marked *