Find all needles in the haystack and return their sum

Posted on

Problem

This is based on a programming challenge I recently solved. However, since the writer of the challenge asked me not to disclose the question and/or it’s answer, I modified it a little. If it doesn’t make sense: my mistake.

The target is to find all numbers in a file and sum those. So if the following is found:

Sed ut 11 perspiciatis, unde omnis 11 iste natus 20 error sit

The result should be 42.

Floats like 0.42 will be considered equal to 42 in this challenge.

To generate example input, I took the revised version of my ‘n x on the y’ generator and made it run from 999. That’s almost 5k lines of code, a nice amount to test against. The old code produces the same output, for those interested.

I was quite content with the performance. Now I’m wondering whether I could’ve written it better. Readability and maintainability are key, as usual.

The next iteration of this code will likely try to match keywords instead of arbitrary numbers. More on that in a later question.

haystack.py

# Sum all numbers in a haystack
import re


def getNumbers(inp):
    with open(inp, 'r') as f:
        return re.findall('[0-9]+', str(f.readlines()))

def calculateSum(results):
    total = 0
    for result in results:
        total += int(result)
    return total

def main(input_file='haystack.txt'):
    print calculateSum(getNumbers(input_file))

if __name__ == "__main__":
    main()

Solution

You should:

  • Use sum.
  • Make calculateSum do one thing. Also instead you could use sum(int(i) for i in ...).
  • You can use re.finditer to get better memory performance.
  • You can use mmap.mmap, to reduce memory usage. 5K lines is a lot.
    Note: It will error on empty files in Windows.

This can get you:

import re
import mmap

def get_matches(data):
    return (g.group(0) for g in re.finditer('[0-9]+', data))

def main(input_file='haystack.txt'):
    with open(input_file, 'r+') as f:
        data = mmap.mmap(f.fileno(), 0)
        return sum(int(i) for i in get_matches(data))

if __name__ == "__main__":
    print main()

In the next iteration you can change only get_matches to handle the keywords, rather than arbitrary numbers.
And you can iterate over data, one char at a time.

First, you should have a look at PEP8 again, specifically the part about function names. It recommends using lower_case_with_underscores. It also recommends leaving two blank lines between function definitions on the module level.

Next, open already opens the file read-only by default, there is no need for 'r'. Also, f.readlines() returns a list, which you then convert into a string representation of that list. However, you can just use f.read() here, which directly returns the full string. Note that reading in the full file content at once might be quite memory-consuming, so don’t do it with files larger than your RAM.

Last but not least, your function calculate_sum could just be the built-in sum, if the elements of results were already numbers. So I would change get_numbers to return actual numbers (like the name implies), by using map:

import re

def get_numbers(file_name):
    with open(file_name) as f:
        return map(int, re.findall('[0-9]+', f.read()))


def main(file_name='haystack.txt'):
    print sum(get_numbers(file_name))

if __name__ == "__main__":
    main()

In Python 3.x, you could use the yield from keyword to make your get_numbers into a generator and avoid the memory issue, because the file is read one line at a time (at the cost of executing the regexp for every line):

def get_numbers(file_name):
    with open(file_name) as f:
        for line in f:
            yield from map(int, re.findall('[0-9]+', line))

In Python 2.x this could be implemented like this:

def get_numbers(file_name):
    with open(file_name) as f:
        for line in f:
            for item in map(int, re.findall('[0-9]+', line)):
                yield item

You could significantly shorten your code while speeding it up using list comprehension and built in python functions.

data = 'Sed ut $11 perspiciatis, unde omnis 11hours and 22mins iste natus 20 error sit' #11+11+22+20 = 64

Method 1

n = []
for i in data.split(' '): #we split the list at each index of ' '
    if i.isdigit(): #built in method to determine if an object can be an int
        n.append(int(i)) #append to list

Method 1.2

List comprehension

n = sum((int(i) for i in data.split(' ') if i.isdigit()))

Output

print n #--> 20, because it dosen't find $11, 11hours, 22mins

Method 2
Using re we modify the pattern of Method 1.2

import re
n_2 = sum((int(i) for i in re.findall('[0-9]+', data)))

Output

print n_2 #--> 64 finds all.

Leave a Reply

Your email address will not be published. Required fields are marked *