This is based on a programming challenge I recently solved. However, since the writer of the challenge asked me not to disclose the question and/or it’s answer, I modified it a little. If it doesn’t make sense: my mistake.
The target is to find all numbers in a file and sum those. So if the following is found:
Sed ut 11 perspiciatis, unde omnis 11 iste natus 20 error sit
The result should be
0.42 will be considered equal to
42 in this challenge.
To generate example input, I took the revised version of my ‘n x on the y’ generator and made it run from 999. That’s almost 5k lines of code, a nice amount to test against. The old code produces the same output, for those interested.
I was quite content with the performance. Now I’m wondering whether I could’ve written it better. Readability and maintainability are key, as usual.
The next iteration of this code will likely try to match keywords instead of arbitrary numbers. More on that in a later question.
# Sum all numbers in a haystack import re def getNumbers(inp): with open(inp, 'r') as f: return re.findall('[0-9]+', str(f.readlines())) def calculateSum(results): total = 0 for result in results: total += int(result) return total def main(input_file='haystack.txt'): print calculateSum(getNumbers(input_file)) if __name__ == "__main__": main()
calculateSumdo one thing. Also instead you could use
sum(int(i) for i in ...).
- You can use
re.finditerto get better memory performance.
- You can use
mmap.mmap, to reduce memory usage. 5K lines is a lot.
Note: It will error on empty files in Windows.
This can get you:
import re import mmap def get_matches(data): return (g.group(0) for g in re.finditer('[0-9]+', data)) def main(input_file='haystack.txt'): with open(input_file, 'r+') as f: data = mmap.mmap(f.fileno(), 0) return sum(int(i) for i in get_matches(data)) if __name__ == "__main__": print main()
In the next iteration you can change only
get_matches to handle the keywords, rather than arbitrary numbers.
And you can iterate over data, one char at a time.
First, you should have a look at PEP8 again, specifically the part about function names. It recommends using
lower_case_with_underscores. It also recommends leaving two blank lines between function definitions on the module level.
open already opens the file read-only by default, there is no need for
f.readlines() returns a
list, which you then convert into a string representation of that list. However, you can just use
f.read() here, which directly returns the full string. Note that reading in the full file content at once might be quite memory-consuming, so don’t do it with files larger than your RAM.
Last but not least, your function
calculate_sum could just be the built-in
sum, if the elements of
results were already numbers. So I would change
get_numbers to return actual numbers (like the name implies), by using
import re def get_numbers(file_name): with open(file_name) as f: return map(int, re.findall('[0-9]+', f.read())) def main(file_name='haystack.txt'): print sum(get_numbers(file_name)) if __name__ == "__main__": main()
In Python 3.x, you could use the
yield from keyword to make your
get_numbers into a generator and avoid the memory issue, because the file is read one line at a time (at the cost of executing the regexp for every line):
def get_numbers(file_name): with open(file_name) as f: for line in f: yield from map(int, re.findall('[0-9]+', line))
In Python 2.x this could be implemented like this:
def get_numbers(file_name): with open(file_name) as f: for line in f: for item in map(int, re.findall('[0-9]+', line)): yield item
You could significantly shorten your code while speeding it up using list comprehension and built in python functions.
data = 'Sed ut $11 perspiciatis, unde omnis 11hours and 22mins iste natus 20 error sit' #11+11+22+20 = 64
n =  for i in data.split(' '): #we split the list at each index of ' ' if i.isdigit(): #built in method to determine if an object can be an int n.append(int(i)) #append to list
n = sum((int(i) for i in data.split(' ') if i.isdigit()))
print n #--> 20, because it dosen't find $11, 11hours, 22mins
re we modify the pattern of
import re n_2 = sum((int(i) for i in re.findall('[0-9]+', data)))
print n_2 #--> 64 finds all.