Getting the smallest snippet from content containing all keywords

Problem

This function returns the smallest snippet of the content containing all of the given keywords (in any order). It produces the correct result, but I would like to know if it can be made more efficient.

import itertools, re

def solve(content, keywords):
    content = content.strip()
    contentWords = content.split()
    termPosition = {}
    for term in keywords:
        index = 0
        positions = []
        while index is not None:
            try:
                index = contentWords.index(term)
                positions.append(index)
                contentWords[index] = None
            except:
                index = None
        termPosition[term] = positions
    singlePositions = []
    multiplePositions = []
    for term in termPosition.keys():
        if len(termPosition.get(term)) == 1:
            singlePositions.append(termPosition.get(term)[0])
        else:
            multiplePositions.append(termPosition.get(term))
    snippet = content
    contentWords = content.split(' ')
    for element in itertools.product(*multiplePositions):
        tempList = list(element) + singlePositions
        minPos = min(tempList)
        maxPos = max(tempList) + 1
        if len(' '.join(contentWords[minPos:maxPos])) < len(snippet):
            snippet = ' '.join(contentWords[minPos:maxPos])
    return snippet

Solution

First, going over your code

import itertools, re

re is not used


contentWords = content.split()

I would be more explicit about the split, calling split(' ') so your intentions are crystal clear.


index = contentWords.index(term)

Each call to .index() is an O(n) scan of the list; it runs m times per keyword (where m is the number of times that keyword occurs), for each of the x keywords. So in big O (and worst case) we have O(nmx). We can be smarter about this and do it in O(n), in a single pass, by using a little extra memory to store the positions of each keyword in a dict.

So replacing this:

termPosition = {}
for term in keywords:
    index = 0
    positions = []
    while index is not None:
        try:
            index = contentWords.index(term)
            positions.append(index)
            contentWords[index] = None
        except:
            index = None
    termPosition[term] = positions

With:

# Create an empty list for each keyword
# We could achieve the same result with a defaultdict
termPosition = {keyword: [] for keyword in keywords}
# enumerate gives us the indexes of the word
for index, word in enumerate(contentWords):
    # if the word is a keyword, record the index
    if word in termPosition:
        termPosition[word].append(index)

for term in termPosition.keys():
    if len(termPosition.get(term)) == 1:
        singlePositions.append(termPosition.get(term)[0])
    else:
        multiplePositions.append(termPosition.get(term))

You don't need to look the values up from the keys; iterate over .values() directly. Also, .append(x[0]) for a list of size one is the same as extending by that list (you can concatenate lists with either += or .extend):

for positions in termPosition.values():
    if len(positions) == 1:
        singlePositions += positions
    else:
        multiplePositions.append(positions)  # Keeping it as a list of lists

    contentWords = content.split(' ')

No longer needed since we don’t modify contentWords


for element in itertools.product(*multiplePositions):
    tempList = list(element) + singlePositions
    minPos = min(tempList)
    maxPos = max(tempList) + 1
    if len(' '.join(contentWords[minPos:maxPos])) < len(snippet):
        snippet = ' '.join(contentWords[minPos:maxPos])

This is going to be the main slowdown of your code, as itertools.product grows multiplicatively: the number of combinations is the product of the lengths of the position lists, so it explodes quickly when keywords repeat often. I can't think of something that does this better right now, but I'll come back if I do find one. The issue at hand is that it isn't the shortest range of indexes containing all the keywords that we want, it is the shortest snippet. The words vary in length, giving each index a weight. In addition, the joining spaces mean it isn't simply the smallest sum of weights either; it is the minimum of the sum of weights plus the number of separators used.
You could create a list of the length of each word in contentWords and minimise sum(contentWordLens[minPos:maxPos]) + (maxPos - minPos). This should be more performant, as you are adding numbers instead of joining strings; see the sketch below.
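A minimal sketch of that idea (the list name contentWordLens and the helper are my own, not part of the original code):

contentWordLens = [len(word) for word in contentWords]

def snippet_char_len(minPos, maxPos):
    # characters in the words plus one space between each adjacent pair
    return sum(contentWordLens[minPos:maxPos]) + (maxPos - minPos - 1)

Comparing these integer sums inside the loop avoids building a candidate string for every combination; only the winning range needs to be joined at the end.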
As for the code, there are a couple of tricks to squeeze down the time:

# The min and max from singlePositions never change
# We wrap each one in a one-element tuple so it can be concatenated
# with the product element inside the loop, instead of redoing this work every iteration

minSinglePositions = (min(singlePositions),)
maxSinglePositions = (max(singlePositions),)
minSnippetLen = len(snippet)
for element in itertools.product(*multiplePositions):
    minPos = min(element + minSinglePositions)
    maxPos = max(element + maxSinglePositions) + 1
    candidate = ' '.join(contentWords[minPos:maxPos])
    if len(candidate) < minSnippetLen:
        snippet = candidate
        minSnippetLen = len(snippet)

The main improvements come from moving any work that is constant outside the loop. We also avoid repeating work, such as recomputing the length of the current best snippet. For this kind of micro-optimisation, you should profile the function to confirm the changes actually pay off.
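For instance, a quick timing harness with the standard library (the sample content and keywords here are invented for illustration):

import timeit

sample_content = "the quick brown fox jumps over the lazy dog near the fox den"
sample_keywords = ["fox", "dog"]

# Time 10,000 calls of whichever version of solve() is being tested
print(timeit.timeit(lambda: solve(sample_content, sample_keywords), number=10_000))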


A final note: there are two parts to the function, getting the indexes and checking which combination gives the shortest snippet. As such, we can split them into two functions.

A cleaner version of the submitted code:

import itertools

def get_positions(contentWords, keywords):
    termPosition = {keyword: [] for keyword in keywords}
    # enumerate gives us the indexes of the word
    for index, word in enumerate(contentWords):
        # if the word is a keyword, record the index
        if word in termPosition:
            termPosition[word].append(index)

    singlePositions = []
    multiplePositions = []
    # Note: if any of the position lists is empty, that keyword isn't in the content at all
    # That case should probably raise an error; as written, solve() just returns the whole content
    for positions in termPosition.values():
        if len(positions) == 1:
            singlePositions += positions
        else:
            multiplePositions.append(positions)  # Keeping it as a list of lists

    return singlePositions, multiplePositions


def get_shortest_snippet(content, contentWords, singlePositions, multiplePositions):
    # The min and max from singlePositions never change
    # We wrap each one in a one-element tuple so it can be concatenated
    # with the product element inside the loop
    if singlePositions:
        minSinglePositions = (min(singlePositions),)
        maxSinglePositions = (max(singlePositions),)
    else:
        minSinglePositions = tuple()
        maxSinglePositions = tuple()

    snippet = content        
    minSnippetLen = len(snippet)

    for element in itertools.product(*multiplePositions):
        minPos = min(element + minSinglePositions)
        maxPos = max(element + maxSinglePositions) + 1

        candidate = ' '.join(contentWords[minPos:maxPos])
        if len(candidate) < minSnippetLen:
            snippet = candidate
            minSnippetLen = len(snippet)

    return snippet

def solve(content, keywords):
    # Input
    content = content.strip()
    contentWords = content.split(' ')

    singlePositions, multiplePositions = get_positions(contentWords, keywords)
    shortestSnippet = get_shortest_snippet(content, contentWords, singlePositions, multiplePositions)
    return shortestSnippet
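
For example (input invented for illustration, not from the original question):

content = "the quick brown fox jumps over the lazy dog near the fox den"
print(solve(content, ["fox", "dog"]))  # prints: dog near the fox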

Another solution

Whenever you write some_list = [] or some_dict = {} followed by a loop to populate it, pause. It is very likely that you can express yourself better using a list or dict comprehension.

You imported re, but didn’t use it. Instead, you went with str.split(). If the text is long, then I would expect str.split() to be very costly. Furthermore, splitting on whitespace would not be good enough if the text contains punctuation, which may be attached to some words. Therefore, I recommend using regular expression searches instead.
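
For instance, word extraction along those lines could look like this (a sketch of my own, not the answerer's actual code):

import re

content = "The quick brown fox jumps over the lazy dog, near the fox!"
# \w+ matches runs of word characters, so punctuation attached to a word
# ("dog," or "fox!") is not swept into the token
words = re.findall(r"\w+", content)
print(words)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'near', 'the', 'fox']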

If you go with split(), though, you need to do so consistently. You called contentWords = content.split() the first time, then contentWords = content.split(' ') the second time, which isn't the same and can therefore give wrong results. To split on whitespace, str.split() with no argument is better: it also splits on newlines, collapses runs of consecutive whitespace, and ignores leading and trailing whitespace (eliminating the need for .strip()).
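
To illustrate the difference (example string of my own):

text = "  two  spaces\nand a newline  "
print(text.split(' '))  # ['', '', 'two', '', 'spaces\nand', 'a', 'newline', '', '']
print(text.split())     # ['two', 'spaces', 'and', 'a', 'newline']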

I don't see the point in classifying singlePositions vs. multiplePositions. Taking the product of all the position lists directly, itertools.product(*termPosition.values()), works just fine and is no less efficient.
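
For instance (made-up positions, reusing the termPosition dict name from above):

import itertools

termPosition = {"fox": [3, 11], "dog": [8]}
for combo in itertools.product(*termPosition.values()):
    print(combo)  # (3, 8), then (11, 8)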

I don’t think that initializing snippet = content is correct. The only situation in which it makes a difference is when one of the keywords is not present in the text at all, in which case an empty or None result would be more appropriate.

Here is a solution that I wrote as part of another question. Without comments, it is about the same length as your code, but I consider it to be more expressive Python.

Note that PEP 8, the official Python style guide, recommends lower_case_with_underscores for variable names.
