Code downloads a text file from a website, saves it to local disk, and then loads it into a list for further processing

Posted on

Problem

My program opens a website and downloads a text file. The text file is a simple file with one word per line. I save the file to local disk and then create a list to hold each line of the text file for later processing. I would like to know if I am doing these first steps in a way that would be considered idiomatic Python and have I made any big mistakes that will hamper my efforts to expand on it later.

This is similar to an exercise in Think Python by Allen Downey. He suggests using a browser to download the text file but I wanted to do it with Python.

import requests

def get_webpage(uri):
    return requests.get(uri)


def save_webpagecontent(r, filename):
    """ This function saves the page retrieved by get_webpage. r is the 
    response from the call to requests.get and
    filename is where we want to save the file to in the filesystem."""

    chunk_size = 8388608                # number of bytes to write to disk in each chunk
    with open(filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size):
            fd.write(chunk)


def make_wordlist(filename):
    wordlist = []
    with open(filename) as fd:
        wordlist = fd.readlines()
    return wordlist


def get_mylist(wordlist, num_lines=10):
    if len(wordlist) <= num_lines:
        return wordlist
    return wordlist[:num_lines]


def print_mylist(mylist):
    for word in mylist:
        print(word.strip())
    return None

"""List of words collected and contributed to the public domain by
Grady Ward as part of the Moby lexicon project. See https://en.wikipedia.org/wiki/Moby_Project
"""
uri = 'https://ia802308.us.archive.org/7/items/mobywordlists03201gut/CROSSWD.TXT'
filename = 'wordlist.txt'

r = get_webpage(uri)
save_webpagecontent(r, filename)
wordlist = make_wordlist(filename)
mylist = get_mylist(wordlist)
print_mylist(mylist)

My program works as I expect it to. I have basically found how to do each individual piece by reading this forum and but I would like to know if I’m putting all the pieces together correctly. By correctly I mean something that not only functions as expected but also will be easy to build larger programs and modules from.

I hope it isn’t wrong of me to post this much code. I wasn’t sure how I could trim it down and still show what I am doing. Please let me know if I need to change the format of my question.

Solution

Your code is nice and concise however there are some changes you can make:

  1. You can just return f.readlines() in make_wordlist.
  2. If you’ve done this to show that the result is a list then it’d be better to use the typing module.

    from typing import List
    
    
    def make_wordlist(filename: str) -> List[str]:
        ...
    
  3. get_mylist can be replaced with wordlist[:numlines]. This is because if len(wordlist) is smaller or equal to numlines, then it will return the entire thing anyway.
  4. Performance wise it’s best to use print('n'.join(list)) rather than for item in list: print(item).
  5. I would prefer to be able to change chunk_size in save_webpagecontent and so you can make it a default argument.
  6. IIRC multi-line docstrings shouldn’t start on the same line as the """, nor should they end on the same line either.
import requests
from typing import List

Response = requests.Response


def get_webpage(uri) -> Response:
    return requests.get(uri)


def save_webpagecontent(r: Response, filename: str,
                        chunk_size: int=8388608) -> None:
    """
    This function saves the page retrieved by get_webpage. r is the 
    response from the call to requests.get and
    filename is where we want to save the file to in the filesystem.
    """
    with open(filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size):
            fd.write(chunk)


def read_wordlist(filename: str) -> List[str]:
    with open(filename) as fd:
        return fd.readlines()


def print_mylist(word_list: List[str]) -> None:
    print('n'.join(word.strip() for word in word_list))


"""
List of words collected and contributed to the public domain by
Grady Ward as part of the Moby lexicon project. See https://en.wikipedia.org/wiki/Moby_Project
"""
uri = 'https://ia802308.us.archive.org/7/items/mobywordlists03201gut/CROSSWD.TXT'
filename = 'wordlist.txt'

r = get_webpage(uri)
save_webpagecontent(r, filename)
print_mylist(read_wordlist(filename)[:10])

Your program is concise and well-readable. One thing that is probably less pythonic is storing the received data in a file. If you have no further use for the file, you could just process the data into a wordlist while receiving it. This saves one intermediate step, and your program would not leave a residual wordlist.txt file around.

Code downloads a text file from a website, saves it to local disk, and then loads it into a list for further processing – Version 2.0

In my new version of this code I have separated my code into 3 modules (my start at a 12 factor app):

download.py for handling downloading the text file from the website and saving it as a file to local storage;

config.py for specifying the URI of the website and the filename for local storage;

moby.py is the actual code that reads the words in the text file, 1 per line, into a list. For now all it does is prints out the words from the file, one per line.

The review my code received provided valuable suggestions on how it could be made more Pythonic, more modular, and more efficient.

Motivated by Hans-Martin Mosner to separate the file download code here is that module. Also made the chunk_size a parameter to the save_webpagecontent() function based on as suggested by Peilonrayz

download.py

import requests
from typing import List

Response = requests.Response

def get_webpage(uri) -> Response:
    return requests.get(uri)


def save_webpagecontent(r: Response, filename: str, chunk_size=8388608) -> None:
    """
    This function saves the page retrieved by get_webpage.
    r is the response from the call to requests.get.
    filename is where we want to save the file to in the filesystem.
    chunk_size is the number of bytes to write to disk in each chunk
    """

    with open(filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size):
            fd.write(chunk)

config.py

uri = 'https://ia802308.us.archive.org/7/items/mobywordlists03201gut/CROSSWD.TXT'
filename = 'wordlist.txt'

I feel I made the most gains in my Python profiency as a result of implementing the changes suggested by Peilonrayz where I did away with intermediate function calls and variables and by working on the suggestion by BruceWayne to add an event for failing to open the file. The file opening code turned out to be the most challenging. I wasn’t able to get `opened_w_error() working exactly based on the example from PEP343. Figuring it out was very rewarding.

moby.py

import download_file as df
import config as cfg
from contextlib import contextmanager
from typing import List

filename = cfg.filename
uri = cfg.uri

@contextmanager
def opened_w_error(filename, mode="r"):
    try:
        f = open(filename, mode)
    except OSError as err:
        yield None, err
    else:
        try:
            yield f, None
        finally:
            f.close()


def read_wordlist(filename: str) -> List[str]:
    with opened_w_error(filename, 'r') as (fd, err):
        if type(err) == FileNotFoundError:
            df.save_webpagecontent(df.get_webpage(uri), filename) #since it failed the first time we need to actually download it
            with opened_w_error(filename, 'r') as (fd, err): # if it fails again abort
                if err:
                    print("OSError:", err)
                else:
                    return fd.readlines()
        else:
            return fd.readlines()


def print_mylist(wordlist: List[str]) -> None:
    print('n'.join(word.strip() for word in wordlist))


print_mylist(read_wordlist(filename)[:50])

Thank you to everyone, especially Roland Illig, Hans-Martin Mosner, and Mast for all your help and encouragement and a safe place to learn!

Leave a Reply

Your email address will not be published. Required fields are marked *