I wrote this script a few days ago to find all files and folders taking up more than a certain amount of space on the hard drive. A hard drive on one of my computers had just 6 GB left, so I needed to free up some space. I used this to identify bulky folders and find out where the problem was.
Is there anything I could have done better? I'm particularly displeased with the exception handling in the `get_folder_size` function, since I ended up handling the same exception twice, just at different levels. I couldn't think of a way around it offhand, though. Also, efficiency was an issue with this script. I know I could have invented some framework to keep track of all the folders previously scanned, but that seemed like too monumental an effort for such a trivial script. The recursion was much easier to implement.
Any and all suggestions are welcome!
```python
import os

BYTES_IN_GIGABYTE = 1073741824
GIGABYTE_DISPLAY_LIMIT = 2
ACCESS_ERROR_CODE = 13


def get_folder_size(filepath):
    size = os.path.getsize(filepath)
    try:
        for item in os.listdir(filepath):
            item_path = os.path.join(filepath, item)
            try:
                if os.path.isfile(item_path):
                    size += os.path.getsize(item_path)
                elif os.path.isdir(item_path):
                    size += get_folder_size(item_path)
            except OSError as err:
                if err.errno == ACCESS_ERROR_CODE:
                    print('Unable to access ' + item_path)
                    continue
                else:
                    raise
    except OSError as err:
        if err.errno != ACCESS_ERROR_CODE:
            raise
        else:
            print('Unable to access ' + filepath)
    return size


def get_all_folder_sizes(root_filepath):
    folders = []
    for item, dir_names, file_names in os.walk(root_filepath):
        if os.path.isdir(os.path.join(root_filepath, item)):
            folders.append([item, get_folder_size(item)])
    return folders


def convert_bytes_to_gigabytes(bytes):
    return bytes / BYTES_IN_GIGABYTE


def main():
    folder_sizes = get_all_folder_sizes('C:\\')
    folder_sizes.sort()
    for folder in folder_sizes:
        gigabytes = convert_bytes_to_gigabytes(folder[1])
        if gigabytes > GIGABYTE_DISPLAY_LIMIT:
            print(folder[0] + ' = ' + format(gigabytes, '.2f') + ' GB')


if __name__ == '__main__':
    main()
```
- `BYTES_IN_GIGABYTE` is clearer as `2**30`. That's transparent enough that the constant and `convert_bytes_to_gigabytes` aren't needed, so I'd just write `folder / 2**30`.
- `GIGABYTE_DISPLAY_LIMIT` sounds like an upper limit on what to display. How about something like `MINIMUM_GIGABYTES_TO_DISPLAY` instead?
- To reduce the opportunity for unit confusion, I'd keep the size and the threshold in bytes, and only convert to GB when printing.
- You may want `folder_sizes.sort(reverse=True)` to get the largest ones first.
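For example, combining those two ideas (the constant name here is my invention, and this is just a sketch):

```python
MINIMUM_BYTES_TO_DISPLAY = 2 * 2**30  # threshold kept in bytes (2 GiB)


def report(folder_sizes):
    """Print folders over the threshold, largest first.

    folder_sizes is a list of (size_in_bytes, path) pairs, so the
    default tuple sort orders by size and the conversion to GB
    happens only at the print statement.
    """
    for size, path in sorted(folder_sizes, reverse=True):
        if size > MINIMUM_BYTES_TO_DISPLAY:
            print('{} = {:.2f} GB'.format(path, size / 2**30))
```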
- `get_all_folder_sizes` recurses over directories twice: once in `os.walk` and once in `get_folder_size`. This takes time quadratic in the tree depth, so it may be slow. Can you do the recursion just once, by returning a generator of `(size, path)` pairs and recursing on subdirectories?
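Here's one possible shape for that generator, as a sketch (permission-error handling omitted for brevity):

```python
import os


def folder_sizes(root):
    """Yield a (size, path) pair for root and every subdirectory,
    where size includes all contained files and subtrees.

    Each directory is listed exactly once; a directory's pair is
    yielded after all of its descendants' pairs.
    """
    size = os.path.getsize(root)  # the directory entry's own size
    for item in os.listdir(root):
        path = os.path.join(root, item)
        if os.path.isfile(path):
            size += os.path.getsize(path)
        elif os.path.isdir(path):
            last = (0, path)
            for last in folder_sizes(path):
                yield last
            # the final pair from the recursive call is the
            # subdirectory's total, so add it to our own
            size += last[0]
    yield (size, root)
```

The caller can then do `sorted(folder_sizes(root), reverse=True)` to see the biggest subtrees first, without ever scanning a directory twice.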
- You can avoid the two `try` blocks by getting directory size in the same place as file size. (Then you don't even need the `os.path.isfile` check, since `os.path.getsize` works on both.)
- The `continue` in the inner exception handler is redundant.
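A sketch of that single-`try` version (I've swapped the magic number 13 for `errno.EACCES`; treat this as illustrative rather than drop-in):

```python
import errno
import os


def get_folder_size(filepath):
    """Total size of everything inside filepath, with one try block:
    each entry's own size is added the same way whether it is a file
    or a subdirectory, and a PermissionError raised by os.listdir
    inside the recursive call is caught here, in the caller's except.
    """
    size = 0
    for item in os.listdir(filepath):
        item_path = os.path.join(filepath, item)
        try:
            size += os.path.getsize(item_path)  # files and dirs alike
            if os.path.isdir(item_path):
                size += get_folder_size(item_path)
        except OSError as err:
            if err.errno != errno.EACCES:
                raise
            print('Unable to access ' + item_path)
    return size
```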