# Find all needles in the haystack and return their sum

Posted on

Problem

This is based on a programming challenge I recently solved. However, since the writer of the challenge asked me not to disclose the question and/or it’s answer, I modified it a little. If it doesn’t make sense: my mistake.

The target is to find all numbers in a file and sum those. So if the following is found:

``````Sed ut 11 perspiciatis, unde omnis 11 iste natus 20 error sit
``````

The result should be `42`.

Floats like `0.42` will be considered equal to `42` in this challenge.

To generate example input, I took the revised version of my ‘n x on the y’ generator and made it run from 999. That’s almost 5k lines of code, a nice amount to test against. The old code produces the same output, for those interested.

I was quite content with the performance. Now I’m wondering whether I could’ve written it better. Readability and maintainability are key, as usual.

The next iteration of this code will likely try to match keywords instead of arbitrary numbers. More on that in a later question.

### haystack.py

``````# Sum all numbers in a haystack
import re

def getNumbers(inp):
with open(inp, 'r') as f:

def calculateSum(results):
total = 0
for result in results:
total += int(result)

def main(input_file='haystack.txt'):
print calculateSum(getNumbers(input_file))

if __name__ == "__main__":
main()
``````

Solution

You should:

• Use `sum`.
• Make `calculateSum` do one thing. Also instead you could use `sum(int(i) for i in ...)`.
• You can use `re.finditer` to get better memory performance.
• You can use `mmap.mmap`, to reduce memory usage. 5K lines is a lot.
Note: It will error on empty files in Windows.

This can get you:

``````import re
import mmap

def get_matches(data):
return (g.group(0) for g in re.finditer('[0-9]+', data))

def main(input_file='haystack.txt'):
with open(input_file, 'r+') as f:
data = mmap.mmap(f.fileno(), 0)
return sum(int(i) for i in get_matches(data))

if __name__ == "__main__":
print main()
``````

In the next iteration you can change only `get_matches` to handle the keywords, rather than arbitrary numbers.
And you can iterate over data, one char at a time.

First, you should have a look at PEP8 again, specifically the part about function names. It recommends using `lower_case_with_underscores`. It also recommends leaving two blank lines between function definitions on the module level.

Next, `open` already opens the file read-only by default, there is no need for `'r'`. Also, `f.readlines()` returns a `list`, which you then convert into a string representation of that list. However, you can just use `f.read()` here, which directly returns the full string. Note that reading in the full file content at once might be quite memory-consuming, so don’t do it with files larger than your RAM.

Last but not least, your function `calculate_sum` could just be the built-in `sum`, if the elements of `results` were already numbers. So I would change `get_numbers` to return actual numbers (like the name implies), by using `map`:

``````import re

def get_numbers(file_name):
with open(file_name) as f:

def main(file_name='haystack.txt'):
print sum(get_numbers(file_name))

if __name__ == "__main__":
main()
``````

In Python 3.x, you could use the `yield from` keyword to make your `get_numbers` into a generator and avoid the memory issue, because the file is read one line at a time (at the cost of executing the regexp for every line):

``````def get_numbers(file_name):
with open(file_name) as f:
for line in f:
yield from map(int, re.findall('[0-9]+', line))
``````

In Python 2.x this could be implemented like this:

``````def get_numbers(file_name):
with open(file_name) as f:
for line in f:
for item in map(int, re.findall('[0-9]+', line)):
yield item
``````

You could significantly shorten your code while speeding it up using list comprehension and built in python functions.

``````data = 'Sed ut \$11 perspiciatis, unde omnis 11hours and 22mins iste natus 20 error sit' #11+11+22+20 = 64
``````

Method 1

``````n = []
for i in data.split(' '): #we split the list at each index of ' '
if i.isdigit(): #built in method to determine if an object can be an int
n.append(int(i)) #append to list
``````

Method 1.2

List comprehension

``````n = sum((int(i) for i in data.split(' ') if i.isdigit()))
``````

Output

``````print n #--> 20, because it dosen't find \$11, 11hours, 22mins
``````

Method 2
Using `re` we modify the pattern of `Method 1.2`

``````import re
n_2 = sum((int(i) for i in re.findall('[0-9]+', data)))
``````

Output

``````print n_2 #--> 64 finds all.
``````