Problem
This program takes the filename as the input and outputs to a file the number of requests that were more than 1000 bytes and the total bytes.
Log file example:
site.com - - [01/Jul/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 183
Any suggestions/recommendations on the code or structure?
import re
import os
filename = input()
filename_regex = '^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(S+)s?(S+)?s?(S+)?" (d{3}|-) (d+|-)s'
nbr_of_requests = 0
sum_of_bytes = 0
def parse_file(filename):
with open(filename, 'r', encoding='utf-8') as fin:
print(f"Opening file: {filename}")
for line in fin.readlines():
if os.path.exists(filename):
line = re.findall(filename_regex, line)
yield line
def save_to_file(filename):
output_filename = "bytes_" + filename
with open(output_filename, 'w+') as fout:
fout.write(str(nbr_of_requests) + 'n' + str(sum_of_bytes))
if __name__ == '__main__':
for line in parse_file(filename):
if line and int(line[0][8]) >= 1000:
nbr_of_requests += 1
sum_of_bytes = sum_of_bytes + int(line[0][8])
save_to_file(filename)
Solution
If you don’t need fields other than size, and have no use for lines with no size, a simpler regex like s(d+)$
is going to be appreciably faster. You could even go with s(d{4,})$
and skip the >= 1000
test.
If you do keep the full regex, I’d simplify the date portion because date formats are notoriously unpredictable. [[^]]+]
does the job more robustly.
(S+)?
is better written as (S*)
.
filename_regex
is better named like log_entry_regex
, since that’s what it matches.
Overwriting the string line
with an array of captures is ugly. And re.findall
is not the best choice, because the regex should never match more than once per line (and your code ignores subsequent matches anyway). Try something like:
matched = re.search(log_entry_regex, line)
if (matched)
yield matched.groups()
This eliminates the array-of-arrays in the result and ensures results are always a set of matches. if line and int(line[0][8]) >= 1000
becomes just if int(line[8]) >= 1000
.
The global variables are better placed inside the functions where they’re used. save_to_file
would need to take arguments then, but it’s three lines and could just be inline in the main function.
When you have an array with many values, where each value means something different, a useful pattern is to create “fake enums” with constant field identifiers. It makes your uses of that array easier to write and easier to read. It doesn’t make much difference here, where the code is short and only one element is actually used; however it becomes invaluable as line count and complexity increase. Python doesn’t have constants, variables will work too:
HOST = 0
…
SIZE = 8
…
if int( line[SIZE] ) >= 1000: