Problem
This returns the smallest snippet from the content containing all of the given keywords (in any order). This provides the correct solution but I would like to know if it can be made more efficient.
import itertools, re
def solve(content, keywords):
content = content.strip()
contentWords = content.split()
termPosition = {}
for term in keywords:
index = 0
positions = []
while index is not None:
try:
index = contentWords.index(term)
positions.append(index)
contentWords[index] = None
except:
index = None
termPosition[term] = positions
singlePositions = []
multiplePositions = []
for term in termPosition.keys():
if len(termPosition.get(term)) == 1:
singlePositions.append(termPosition.get(term)[0])
else:
multiplePositions.append(termPosition.get(term))
snippet = content
contentWords = content.split(' ')
for element in itertools.product(*multiplePositions):
tempList = list(element) + singlePositions
minPos = min(tempList)
maxPos = max(tempList) + 1
if len(' '.join(contentWords[minPos:maxPos])) < len(snippet):
snippet = ' '.join(contentWords[minPos:maxPos])
return snippet
Solution
First, going over your code
import itertools, re
re is not used
contentWords = content.split()
I would be more explicit on the split, calling split(‘ ‘) so your intentions are crystal clear
index = contentWords.index(term)
Each call to this is an O(n) operation on the list, m times where m is the number of times the element occurs, x times where x is the number of keywords. So in big O (and worst case) we have O(nmx). We can be smarter about this, and do it in O(n) and in one iteration. We will use extra memory to store the keywords in a dict
So replacing this:
termPosition = {}
for term in keywords:
index = 0
positions = []
while index is not None:
try:
index = contentWords.index(term)
positions.append(index)
contentWords[index] = None
except:
index = None
termPosition[term] = positions
With:
# Create an empty list for each keyword
# We could achieve the same result with a defaultdict
termPosition = {keyword: [] for keyword in keywords}
# enumerate gives us the indexes of the word
for index, word in enumerate(contentWords):
# if the word is a keyword, record the index
if word in termPosition:
termPosition[word].append(index)
for term in termPosition.keys():
if len(termPosition.get(term)) == 1:
singlePositions.append(termPosition.get(term)[0])
else:
multiplePositions.append(termPosition.get(term))
You don’t need to work out the values from the keys. Also .append(x[0]) for a list of size one is the same as extending by that list. (You can concat lists with either += or .extend)
for positions in termPosition.values():
if len(positions) == 1:
singlePositions += positions
else:
multiplePositions.append(positions) # Keeping it as a list of lists
contentWords = content.split(' ')
No longer needed since we don’t modify contentWords
for element in itertools.product(*multiplePositions):
tempList = list(element) + singlePositions
minPos = min(tempList)
maxPos = max(tempList) + 1
if len(' '.join(contentWords[minPos:maxPos])) < len(snippet):
snippet = ' '.join(contentWords[minPos:maxPos])
This is going to be the main slowdown of your code as product means factorial growth. I can’t think of something that does this better right now, but I’ll come back if I do find one. The issue at hand is that it isn’t the shortest range of indexes which contain all the keywords, it is the shortest snippet. The words might vary in length, giving each index a weight. In addition, spaces mean it isn’t the smallest sum of weights, it is the min sum of weights plus number of additions used.
You could create a list of the length of each word in contentWords, and find the min sum(contentWordsLens[minPos:maxPos] + (maxPos – minPos)). The should be more performant as you are adding numbers instead of strings
As for the code, there are a couple of tricks to squeeze down the time
# The min and max from singlePositions never change
# We convert the single value to a tuple to avoid doing repeatedly in the loop
minSinglePositions = tuple(min(singlePositions))
maxSinglePositions = tuple(max(singlePositions))
minSnippetLen = len(snippet)
for element in itertools.product(*multiplePositions):
minPos = min(element + minSinglePositions)
maxPos = max(element + maxSinglePositions) + 1
candidate = ' '.join(contentWords[minPos:maxPos])
if len(candidate) < minSnippetLen:
snippet = candidate
minSnippetLen = len(snippet)
The main improvements come from moving any work that is constant to outside the loop. We also avoid repeating work like working out the length of the snippet. For this kind of tricking about, profiling should be done to test the function.
A final note, there are two parts to the function, getting the indexes, and checking which combination is the shortest. As such, we can move them to two functions.
A cleaner version of the submitted code:
import itertools
def get_positions(content, keywords):
termPosition = {keyword: [] for keyword in keywords}
# enumerate gives us the indexes of the word
for index, word in enumerate(content):
# if the word is a keyword, record the index
if word in termPosition:
termPosition[word].append(index)
singlePositions = []
multiplePositions = []
# Note, if any of the positions are of len 0, the keyword isn't in the snippet
# This should be raised. As it is currently, it returns the input snippet
for positions in termPosition.values():
if len(positions) == 1:
singlePositions += positions
else:
multiplePositions.append(positions) # Keeping it as a list of lists
return singlePositions, multiplePositions
def get_shortest_snippet(content, contentWords, singlePositions, multiplePositions):
# The min and max from singlePositions never change
# We convert the single value to a tuple to avoid doing repeatedly in the loop
if singlePositions:
minSinglePositions = tuple(min(singlePositions))
maxSinglePositions = tuple(max(singlePositions))
else:
minSinglePositions = tuple()
maxSinglePositions = tuple()
snippet = content
minSnippetLen = len(snippet)
for element in itertools.product(*multiplePositions):
minPos = min(element + minSinglePositions)
maxPos = max(element + maxSinglePositions) + 1
candidate = ' '.join(contentWords[minPos:maxPos])
if len(candidate) < minSnippetLen:
snippet = candidate
minSnippetLen = len(snippet)
return snippet
def solve(content, keywords):
# Input
content = content.strip()
contentWords = content.split(' ')
singlePositions, multiplePositions = get_positions(contentWords, keywords)
shortestSnippet = get_shortest_snippet(content, contentWords, singlePositions, multiplePositions)
return shortestSnippet
Whenever you write some_list = []
or some_dict = {}
followed by a loop to populate it, pause. It is very likely that you can express yourself better using a list or dict comprehension.
You imported re
, but didn’t use it. Instead, you went with str.split()
. If the text is long, then I would expect str.split()
to be very costly. Furthermore, splitting on whitespace would not be good enough if the text contains punctuation, which may be attached to some words. Therefore, I recommend using regular expression searches instead.
If you go with the split()
, though, you need to do so correctly. You called contentWords = content.split()
the first time, then contentWords = content.split(' ')
the second time, which isn’t the same, and therefore would give wrong results. To split on whitespace, str.split()
is better: it also splits on newlines, handles multiple consecutive whitespace characters, and ignores leading and trailing whitespace (eliminating the need for .strip()
.
I don’t see the point in classifying singlePositions
vs. multiplePositions
. Taking the product of *termPositions
will work just fine, and would be no less efficient.
I don’t think that initializing snippet = content
is correct. The only situation in which it makes a difference is when one of the keywords is not present in the text at all, in which case an empty or None
result would be more appropriate.
Here is a solution that I wrote as part of another question. Without comments, it is about the same length as your code, but I consider it to be more expressive Python.
Note that PEP 8, the official Python style guide, recommends lower_case_with_underscores
for variable names.