Problem
My program opens a website and downloads a text file. The text file is a simple file with one word per line. I save the file to local disk and then create a list to hold each line of the text file for later processing. I would like to know if I am doing these first steps in a way that would be considered idiomatic Python and have I made any big mistakes that will hamper my efforts to expand on it later.
This is similar to an exercise in Think Python by Allen Downey. He suggests using a browser to download the text file but I wanted to do it with Python.
import requests
def get_webpage(uri):
return requests.get(uri)
def save_webpagecontent(r, filename):
""" This function saves the page retrieved by get_webpage. r is the
response from the call to requests.get and
filename is where we want to save the file to in the filesystem."""
chunk_size = 8388608 # number of bytes to write to disk in each chunk
with open(filename, 'wb') as fd:
for chunk in r.iter_content(chunk_size):
fd.write(chunk)
def make_wordlist(filename):
wordlist = []
with open(filename) as fd:
wordlist = fd.readlines()
return wordlist
def get_mylist(wordlist, num_lines=10):
if len(wordlist) <= num_lines:
return wordlist
return wordlist[:num_lines]
def print_mylist(mylist):
for word in mylist:
print(word.strip())
return None
"""List of words collected and contributed to the public domain by
Grady Ward as part of the Moby lexicon project. See https://en.wikipedia.org/wiki/Moby_Project
"""
uri = 'https://ia802308.us.archive.org/7/items/mobywordlists03201gut/CROSSWD.TXT'
filename = 'wordlist.txt'
r = get_webpage(uri)
save_webpagecontent(r, filename)
wordlist = make_wordlist(filename)
mylist = get_mylist(wordlist)
print_mylist(mylist)
My program works as I expect it to. I have basically found how to do each individual piece by reading this forum and but I would like to know if I’m putting all the pieces together correctly. By correctly I mean something that not only functions as expected but also will be easy to build larger programs and modules from.
I hope it isn’t wrong of me to post this much code. I wasn’t sure how I could trim it down and still show what I am doing. Please let me know if I need to change the format of my question.
Solution
Your code is nice and concise however there are some changes you can make:
- You can just return
f.readlines()
inmake_wordlist
. -
If you’ve done this to show that the result is a list then it’d be better to use the
typing
module.from typing import List def make_wordlist(filename: str) -> List[str]: ...
get_mylist
can be replaced withwordlist[:numlines]
. This is because iflen(wordlist)
is smaller or equal tonumlines
, then it will return the entire thing anyway.- Performance wise it’s best to use
print('n'.join(list))
rather thanfor item in list: print(item)
. - I would prefer to be able to change
chunk_size
insave_webpagecontent
and so you can make it a default argument. - IIRC multi-line docstrings shouldn’t start on the same line as the
"""
, nor should they end on the same line either.
import requests
from typing import List
Response = requests.Response
def get_webpage(uri) -> Response:
return requests.get(uri)
def save_webpagecontent(r: Response, filename: str,
chunk_size: int=8388608) -> None:
"""
This function saves the page retrieved by get_webpage. r is the
response from the call to requests.get and
filename is where we want to save the file to in the filesystem.
"""
with open(filename, 'wb') as fd:
for chunk in r.iter_content(chunk_size):
fd.write(chunk)
def read_wordlist(filename: str) -> List[str]:
with open(filename) as fd:
return fd.readlines()
def print_mylist(word_list: List[str]) -> None:
print('n'.join(word.strip() for word in word_list))
"""
List of words collected and contributed to the public domain by
Grady Ward as part of the Moby lexicon project. See https://en.wikipedia.org/wiki/Moby_Project
"""
uri = 'https://ia802308.us.archive.org/7/items/mobywordlists03201gut/CROSSWD.TXT'
filename = 'wordlist.txt'
r = get_webpage(uri)
save_webpagecontent(r, filename)
print_mylist(read_wordlist(filename)[:10])
Your program is concise and well-readable. One thing that is probably less pythonic is storing the received data in a file. If you have no further use for the file, you could just process the data into a wordlist while receiving it. This saves one intermediate step, and your program would not leave a residual wordlist.txt
file around.
Code downloads a text file from a website, saves it to local disk, and then loads it into a list for further processing – Version 2.0
In my new version of this code I have separated my code into 3 modules (my start at a 12 factor app):
download.py
for handling downloading the text file from the website and saving it as a file to local storage;
config.py
for specifying the URI of the website and the filename for local storage;
moby.py
is the actual code that reads the words in the text file, 1 per line, into a list. For now all it does is prints out the words from the file, one per line.
The review my code received provided valuable suggestions on how it could be made more Pythonic, more modular, and more efficient.
Motivated by Hans-Martin Mosner to separate the file download code here is that module. Also made the chunk_size a parameter to the save_webpagecontent() function based on as suggested by Peilonrayz
download.py
import requests
from typing import List
Response = requests.Response
def get_webpage(uri) -> Response:
return requests.get(uri)
def save_webpagecontent(r: Response, filename: str, chunk_size=8388608) -> None:
"""
This function saves the page retrieved by get_webpage.
r is the response from the call to requests.get.
filename is where we want to save the file to in the filesystem.
chunk_size is the number of bytes to write to disk in each chunk
"""
with open(filename, 'wb') as fd:
for chunk in r.iter_content(chunk_size):
fd.write(chunk)
config.py
uri = 'https://ia802308.us.archive.org/7/items/mobywordlists03201gut/CROSSWD.TXT'
filename = 'wordlist.txt'
I feel I made the most gains in my Python profiency as a result of implementing the changes suggested by Peilonrayz where I did away with intermediate function calls and variables and by working on the suggestion by BruceWayne to add an event for failing to open the file. The file opening code turned out to be the most challenging. I wasn’t able to get `opened_w_error() working exactly based on the example from PEP343. Figuring it out was very rewarding.
moby.py
import download_file as df
import config as cfg
from contextlib import contextmanager
from typing import List
filename = cfg.filename
uri = cfg.uri
@contextmanager
def opened_w_error(filename, mode="r"):
try:
f = open(filename, mode)
except OSError as err:
yield None, err
else:
try:
yield f, None
finally:
f.close()
def read_wordlist(filename: str) -> List[str]:
with opened_w_error(filename, 'r') as (fd, err):
if type(err) == FileNotFoundError:
df.save_webpagecontent(df.get_webpage(uri), filename) #since it failed the first time we need to actually download it
with opened_w_error(filename, 'r') as (fd, err): # if it fails again abort
if err:
print("OSError:", err)
else:
return fd.readlines()
else:
return fd.readlines()
def print_mylist(wordlist: List[str]) -> None:
print('n'.join(word.strip() for word in wordlist))
print_mylist(read_wordlist(filename)[:50])
Thank you to everyone, especially Roland Illig, Hans-Martin Mosner, and Mast for all your help and encouragement and a safe place to learn!