Problem
I wrote this script a few days ago to find all folders taking up more than a certain threshold of disk space. A hard drive on one of my computers had just 6 GB left, so I needed to free some space, and I used this to identify bulky folders and find out where the problem was.
Is there anything I could have done better? I'm particularly displeased with the exception handling in the get_folder_size function, since I ended up handling the same exception twice, just at different levels. I couldn't think of a way around it offhand, though. Efficiency was also an issue with this script: I know I could have built some mechanism to keep track of all the folders previously scanned, but that seemed like too monumental an effort for such a trivial script, and the recursion was much easier to implement.
Any and all suggestions are welcome!
import os

BYTES_IN_GIGABYTE = 1073741824
GIGABYTE_DISPLAY_LIMIT = 2
ACCESS_ERROR_CODE = 13


def get_folder_size(filepath):
    size = os.path.getsize(filepath)
    try:
        for item in os.listdir(filepath):
            item_path = os.path.join(filepath, item)
            try:
                if os.path.isfile(item_path):
                    size += os.path.getsize(item_path)
                elif os.path.isdir(item_path):
                    size += get_folder_size(item_path)
            except OSError as err:
                if err.errno == ACCESS_ERROR_CODE:
                    print('Unable to access ' + item_path)
                    continue
                else:
                    raise
    except OSError as err:
        if err.errno != ACCESS_ERROR_CODE:
            raise
        else:
            print('Unable to access ' + filepath)
    return size


def get_all_folder_sizes(root_filepath):
    folders = []
    for item in os.walk(root_filepath):
        if os.path.isdir(os.path.join(root_filepath, item[0])):
            folders.append([item[0], get_folder_size(item[0])])
    return folders


def convert_bytes_to_gigabytes(bytes):
    return bytes / BYTES_IN_GIGABYTE


def main():
    folder_sizes = get_all_folder_sizes('C:\\')
    folder_sizes.sort()
    for folder in folder_sizes:
        gigabytes = convert_bytes_to_gigabytes(folder[1])
        if gigabytes > GIGABYTE_DISPLAY_LIMIT:
            print(folder[0] + ' = ' + format(gigabytes, '.2f') + ' GB')


if __name__ == '__main__':
    main()
Solution
ACCESS_ERROR_CODE is errno.EACCES.
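A quick way to see this (the numeric value is platform-defined, but it is 13 on Windows, Linux and macOS, so the comparison in get_folder_size can use the symbolic name directly):

import errno

# The symbolic name for "permission denied", replacing the hard-coded 13:
#     if err.errno == errno.EACCES:
print(errno.EACCES)   # 13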
BYTES_IN_GIGABYTE is clearer as 2**30. That's transparent enough that the constant and convert_bytes_to_gigabytes aren't needed, so I'd just write folder[1] / 2**30.
GIGABYTE_DISPLAY_LIMIT sounds like an upper limit to display. How about GIGABYTE_DISPLAY_THRESHOLD instead?
To reduce the opportunity for unit confusion, I’d keep the size and the threshold in bytes, and only convert to GB when printing.
You may want folder_sizes.sort(reverse=True) to get the largest ones first.
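Putting the last three points together, main might look something like the sketch below. The name BYTES_DISPLAY_THRESHOLD and the key= argument to sort are my own additions (the key is needed while the pairs are still [path, size]; with the (size, path) pairs suggested in the next point, a plain reverse sort would already order by size):

BYTES_DISPLAY_THRESHOLD = 2 * 2**30        # 2 GB, kept in bytes

def main():
    folder_sizes = get_all_folder_sizes('C:\\')
    # Largest first; sort on the size element so the comparison stays in bytes.
    folder_sizes.sort(key=lambda folder: folder[1], reverse=True)
    for path, size in folder_sizes:
        if size > BYTES_DISPLAY_THRESHOLD:
            print(path + ' = ' + format(size / 2**30, '.2f') + ' GB')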
get_all_folder_sizes recurses over directories twice: once in os.walk and once in get_folder_size. This takes time quadratic in the tree depth, so it may be slow. Can you do the recursion just once, by returning a generator of (size, path) pairs, and recursing on subdirectories?
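One possible shape for that single pass is sketched below; the function name and the way subtotals are picked out of the recursive call are my own choices, and permission handling is left out to keep it short:

import os

def folder_sizes(path):
    # Yield (size, path) for this directory and, recursively, for every
    # subdirectory below it, visiting each directory exactly once.
    total = os.path.getsize(path)                  # the directory entry itself
    for item in os.listdir(path):
        item_path = os.path.join(path, item)
        if os.path.isdir(item_path):
            for size, subpath in folder_sizes(item_path):
                if subpath == item_path:           # the subdirectory's own total
                    total += size
                yield size, subpath
        else:
            total += os.path.getsize(item_path)
    yield total, path

Because the size comes first in each pair, sorted(folder_sizes('C:\\'), reverse=True) already puts the largest trees at the front.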
You can avoid the two try blocks by getting directory size in the same place as file size. (Then you don't even need isfile.)
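One way to read that suggestion, sketched as a get_folder_size with a single handler; note that with this structure an unreadable root directory raises to the caller instead of being printed here, which is an assumption on my part:

import errno
import os

def get_folder_size(filepath):
    # Every entry, file or directory, has its size added in the same place,
    # so one handler covers both getsize and the recursive listdir call.
    size = 0
    for item in os.listdir(filepath):
        item_path = os.path.join(filepath, item)
        try:
            size += os.path.getsize(item_path)      # works for files and directories alike
            if os.path.isdir(item_path):
                size += get_folder_size(item_path)  # a listdir failure inside surfaces here
        except OSError as err:
            if err.errno != errno.EACCES:
                raise
            print('Unable to access ' + item_path)
    return size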
The continue in the inner exception handler is redundant.