Python script to find large folders

Problem

I wrote this script a few days ago to find all folders taking up more than a certain threshold of disk space. The hard drive on one of my computers had just 6 GB left, so I needed to free up some space. I used this to identify bulky folders and find out where the problem was.

Is there anything I could have done better? I'm particularly displeased with the exception handling in the get_folder_size function, since I ended up handling the same exception twice, just at different levels. I couldn't think of a way around it offhand, though. Efficiency was also an issue with this script. I know I could've invented some framework to keep track of all the folders previously scanned, but that seemed like too monumental an effort for such a trivial script. The recursion was much easier to implement.

Any and all suggestions are welcome!

import os

BYTES_IN_GIGABYTE = 1073741824
GIGABYTE_DISPLAY_LIMIT = 2
ACCESS_ERROR_CODE = 13

def get_folder_size(filepath):
    size = os.path.getsize(filepath)
    try:
        for item in os.listdir(filepath):
            item_path = os.path.join(filepath, item)
            try:
                if os.path.isfile(item_path):
                    size += os.path.getsize(item_path)
                elif os.path.isdir(item_path):
                    size += get_folder_size(item_path)
            except OSError as err:
                if err.errno == ACCESS_ERROR_CODE:
                    print('Unable to access ' + item_path)
                    continue
                else:
                    raise
    except OSError as err:
        if err.errno != ACCESS_ERROR_CODE:
            raise
        else:
            print('Unable to access ' + filepath)
    return size

def get_all_folder_sizes(root_filepath):
    folders = []
    for item in os.walk(root_filepath):
        if os.path.isdir(os.path.join(root_filepath, item[0])):
            folders.append([item[0], get_folder_size(item[0])])
    return folders

def convert_bytes_to_gigabytes(bytes):
    return bytes / BYTES_IN_GIGABYTE

def main():
    folder_sizes = get_all_folder_sizes('C:\\')
    folder_sizes.sort()
    for folder in folder_sizes:
        gigabytes = convert_bytes_to_gigabytes(folder[1])
        if gigabytes > GIGABYTE_DISPLAY_LIMIT:
            print(folder[0] + ' = ' + format(gigabytes, '.2f') + ' GB')

if __name__ == '__main__':
    main()

Solution

ACCESS_ERROR_CODE is errno.EACCES.
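For example (safe_getsize is a hypothetical helper, not part of the original script; the point is only the comparison against errno.EACCES):

import errno
import os

def safe_getsize(path):
    # Return the size of path, or 0 if access is denied.
    try:
        return os.path.getsize(path)
    except OSError as err:
        if err.errno != errno.EACCES:   # named constant instead of the magic 13
            raise
        print('Unable to access ' + path)
        return 0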

BYTES_IN_GIGABYTE is clearer as 2**30. That’s transparent enough that the constant and convert_bytes_to_gigabytes aren’t needed, so I’d just write folder[1] / 2**30.

GIGABYTE_DISPLAY_LIMIT sounds like an upper limit to display. How about GIGABYTE_DISPLAY_THRESHOLD instead?

To reduce the opportunity for unit confusion, I’d keep the size and the threshold in bytes, and only convert to GB when printing.

You may want the largest folders first. folder_sizes.sort(reverse=True) gives that once the entries lead with the size, as suggested below; with the current [path, size] lists you would need to sort on the size with a key function, as in the sketch that follows.
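Putting the last few points together, main could look something like this sketch (DISPLAY_THRESHOLD_BYTES is an illustrative name, and the entries are still the [path, size] lists returned by the current get_all_folder_sizes):

DISPLAY_THRESHOLD_BYTES = 2 * 2**30   # only report folders over 2 GB

def main():
    folder_sizes = get_all_folder_sizes('C:\\')
    # Sort by size, largest first (a key is needed while entries are [path, size]).
    folder_sizes.sort(key=lambda entry: entry[1], reverse=True)
    for path, size in folder_sizes:
        if size > DISPLAY_THRESHOLD_BYTES:
            print(path + ' = ' + format(size / 2**30, '.2f') + ' GB')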

get_all_folder_sizes recurses over directories twice: once in os.walk and once in get_folder_size. This takes time quadratic in the tree depth, so it may be slow. Can you do the recursion just once, by returning a generator of (size, path) pairs, and recursing on subdirectories?

You can avoid the two try blocks by getting directory size in the same place as file size. (Then you don’t even need isfile.)
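Here is a sketch of that single pass, combining the last two points (iter_folder_sizes is an illustrative name; it assumes the entries of a listable directory can be stat'ed, so only os.listdir needs a try):

import errno
import os

def iter_folder_sizes(path):
    # Yield a (size, path) pair for every readable directory at or below path,
    # visiting each directory exactly once. The pair for path itself comes last.
    try:
        entries = os.listdir(path)
    except OSError as err:
        if err.errno != errno.EACCES:
            raise
        print('Unable to access ' + path)
        return
    size = 0
    for name in entries:
        entry_path = os.path.join(path, name)
        # getsize works for files and directories alike, so no isfile check.
        size += os.path.getsize(entry_path)
        if os.path.isdir(entry_path):
            child_total = 0
            for sub_size, sub_path in iter_folder_sizes(entry_path):
                yield sub_size, sub_path
                if sub_path == entry_path:
                    child_total = sub_size   # the subdirectory's own total
            size += child_total
    yield size, path

Then sorted(iter_folder_sizes('C:\\'), reverse=True) gives the (size, path) pairs largest first, and no directory is scanned more than once.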

The continue in the inner exception handler is redundant.
