Problem
This is based on a programming challenge I recently solved. However, since the writer of the challenge asked me not to disclose the question and/or it’s answer, I modified it a little. If it doesn’t make sense: my mistake.
The target is to find all numbers in a file and sum those. So if the following is found:
Sed ut 11 perspiciatis, unde omnis 11 iste natus 20 error sit
The result should be 42
.
Floats like 0.42
will be considered equal to 42
in this challenge.
To generate example input, I took the revised version of my ‘n x on the y’ generator and made it run from 999. That’s almost 5k lines of code, a nice amount to test against. The old code produces the same output, for those interested.
I was quite content with the performance. Now I’m wondering whether I could’ve written it better. Readability and maintainability are key, as usual.
The next iteration of this code will likely try to match keywords instead of arbitrary numbers. More on that in a later question.
haystack.py
# Sum all numbers in a haystack
import re
def getNumbers(inp):
with open(inp, 'r') as f:
return re.findall('[0-9]+', str(f.readlines()))
def calculateSum(results):
total = 0
for result in results:
total += int(result)
return total
def main(input_file='haystack.txt'):
print calculateSum(getNumbers(input_file))
if __name__ == "__main__":
main()
Solution
You should:
- Use
sum
. - Make
calculateSum
do one thing. Also instead you could usesum(int(i) for i in ...)
. - You can use
re.finditer
to get better memory performance. - You can use
mmap.mmap
, to reduce memory usage. 5K lines is a lot.
Note: It will error on empty files in Windows.
This can get you:
import re
import mmap
def get_matches(data):
return (g.group(0) for g in re.finditer('[0-9]+', data))
def main(input_file='haystack.txt'):
with open(input_file, 'r+') as f:
data = mmap.mmap(f.fileno(), 0)
return sum(int(i) for i in get_matches(data))
if __name__ == "__main__":
print main()
In the next iteration you can change only get_matches
to handle the keywords, rather than arbitrary numbers.
And you can iterate over data, one char at a time.
First, you should have a look at PEP8 again, specifically the part about function names. It recommends using lower_case_with_underscores
. It also recommends leaving two blank lines between function definitions on the module level.
Next, open
already opens the file read-only by default, there is no need for 'r'
. Also, f.readlines()
returns a list
, which you then convert into a string representation of that list. However, you can just use f.read()
here, which directly returns the full string. Note that reading in the full file content at once might be quite memory-consuming, so don’t do it with files larger than your RAM.
Last but not least, your function calculate_sum
could just be the built-in sum
, if the elements of results
were already numbers. So I would change get_numbers
to return actual numbers (like the name implies), by using map
:
import re
def get_numbers(file_name):
with open(file_name) as f:
return map(int, re.findall('[0-9]+', f.read()))
def main(file_name='haystack.txt'):
print sum(get_numbers(file_name))
if __name__ == "__main__":
main()
In Python 3.x, you could use the yield from
keyword to make your get_numbers
into a generator and avoid the memory issue, because the file is read one line at a time (at the cost of executing the regexp for every line):
def get_numbers(file_name):
with open(file_name) as f:
for line in f:
yield from map(int, re.findall('[0-9]+', line))
In Python 2.x this could be implemented like this:
def get_numbers(file_name):
with open(file_name) as f:
for line in f:
for item in map(int, re.findall('[0-9]+', line)):
yield item
You could significantly shorten your code while speeding it up using list comprehension and built in python functions.
data = 'Sed ut $11 perspiciatis, unde omnis 11hours and 22mins iste natus 20 error sit' #11+11+22+20 = 64
Method 1
n = []
for i in data.split(' '): #we split the list at each index of ' '
if i.isdigit(): #built in method to determine if an object can be an int
n.append(int(i)) #append to list
Method 1.2
List comprehension
n = sum((int(i) for i in data.split(' ') if i.isdigit()))
Output
print n #--> 20, because it dosen't find $11, 11hours, 22mins
Method 2
Using re
we modify the pattern of Method 1.2
import re
n_2 = sum((int(i) for i in re.findall('[0-9]+', data)))
Output
print n_2 #--> 64 finds all.