Analyzing a web access log to obtaining totals for requests

Posted on


This program takes the filename as the input and outputs to a file the number of requests that were more than 1000 bytes and the total bytes.

Log file example: - - [01/Jul/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 183

Any suggestions/recommendations on the code or structure?

import re
import os

filename = input()
filename_regex = '^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(S+)s?(S+)?s?(S+)?" (d{3}|-) (d+|-)s'
nbr_of_requests = 0
sum_of_bytes = 0

def parse_file(filename):
    with open(filename, 'r', encoding='utf-8') as fin:
        print(f"Opening file: {filename}")
        for line in fin.readlines():
            if os.path.exists(filename):
                line = re.findall(filename_regex, line)
                yield line

def save_to_file(filename):
    output_filename = "bytes_" + filename
    with open(output_filename, 'w+') as fout:
        fout.write(str(nbr_of_requests) + 'n' + str(sum_of_bytes))

if __name__ == '__main__':
    for line in parse_file(filename):
        if line and int(line[0][8]) >= 1000:
            nbr_of_requests += 1
            sum_of_bytes = sum_of_bytes + int(line[0][8])



If you don’t need fields other than size, and have no use for lines with no size, a simpler regex like s(d+)$ is going to be appreciably faster. You could even go with s(d{4,})$ and skip the >= 1000 test.

If you do keep the full regex, I’d simplify the date portion because date formats are notoriously unpredictable. [[^]]+] does the job more robustly.

(S+)? is better written as (S*).

filename_regex is better named like log_entry_regex, since that’s what it matches.

Overwriting the string line with an array of captures is ugly. And re.findall is not the best choice, because the regex should never match more than once per line (and your code ignores subsequent matches anyway). Try something like:

matched =, line)
if (matched) 
    yield matched.groups()

This eliminates the array-of-arrays in the result and ensures results are always a set of matches. if line and int(line[0][8]) >= 1000 becomes just if int(line[8]) >= 1000.

The global variables are better placed inside the functions where they’re used. save_to_file would need to take arguments then, but it’s three lines and could just be inline in the main function.

When you have an array with many values, where each value means something different, a useful pattern is to create “fake enums” with constant field identifiers. It makes your uses of that array easier to write and easier to read. It doesn’t make much difference here, where the code is short and only one element is actually used; however it becomes invaluable as line count and complexity increase. Python doesn’t have constants, variables will work too:

HOST = 0
SIZE = 8if int( line[SIZE] ) >= 1000:

Leave a Reply

Your email address will not be published. Required fields are marked *