Problem
This script is supposed to download query xkcd json interface, find the total number of comics till date, pick a random number in that range, use the xkcd json interface again to find the comic image url, download it into /tmp
filesystem, and use feh image viewer to display the image.
I’d appreciate it if you could tell me if this program is neatly written
#!/usr/bin/env python
import urllib.request
import json
import random
import subprocess
class Xkcd :
def __init__(self) :
self.url = "https://xkcd.com/info.0.json"
self.num_comics = None
self.random_comic_num = None
self.comic_json_url = None
self.image_url = None
def get_num_comics(self) :
html_data = urllib.request.urlopen(self.url)
json_data = json.loads(html_data.read())
self.num_comics = json_data['num']
def download_image(self) :
urllib.request.urlretrieve(self.image_url, "/tmp/image")
def get_random_comic(self) :
self.get_num_comics()
self.random_comic_num = random.randint(1, self.num_comics)
self.comic_json_url = "https://xkcd.com/%s/info.0.json" %(self.random_comic_num)
html_data = urllib.request.urlopen(self.comic_json_url)
json_data = json.loads(html_data.read())
self.image_url = json_data['img']
self.download_image()
def show_image(self) :
subprocess.Popen(["feh", "/tmp/image"])
exit()
if __name__ == "__main__" :
ob = Xkcd()
ob.get_random_comic()
ob.show_image()
Solution
Xkcd
as a class isn’t particularly well-named: you’re actually representing a “site”, “session” or “downloader”, distinct from an individual comic.
Its members are problematic. url
never changes so should be moved to a static, but also url
in its current form is only ever used once so instead you should store the base URL as a static and append paths to it as necessary.
Xkcd
, upon construction, is broken and not ready to use. Exterior code that assumes that num_comics
is valid will fail due to the None. An attempt to call get_random_comic()
requires that it calls get_num_comics
, and this will occur every single time that a comic is fetched. This should be cached instead.
random_comic_num
, comic_json_url
and image_url
have no business existing as persistent members on the class; they should be passed around on a case-by-case basis as parameters and return values.
Avoid using urllib
when requests
does a better job.
Hard-coding /tmp/image
is problematic: you’re guaranteed to steamroll anything else at that path, including files from both your program and other programs that use that path. Also, some operating systems (a) cannot use that directory for temporary files, and (b) cannot infer from the extension-less path what mimetype the image is. The solution is to use the built-in temporary file support in Python and pass it a sane suffix.
It’s relatively easy to capture the known properties of a returned JSON comic payload in a well-typed, well-defined class based off of a NamedTuple
or dataclass, so do this.
It’s also relatively easy to make get_random_comic
more generally useful by making the “random” behaviour optional and accepting an optional index number. If that index number is None
then the method can fill in a random index. Among other advantages, if the comic count logic is made lazy (which you should do anyway), then it will never be called so long as callers only want pre-defined indices.
Don’t hard-code feh
. If it isn’t the default image viewer on your OS, that’s a user problem and not a program problem. Instead make an attempt at using the default OS image viewer, though Python doesn’t make this entirely easy to be portable: use startfile
on Windows, xdg-open
on Linux, open
on Mac, etc. The latter two should be using subprocess.check_call
with shell=False
.
Your downloader/site class should have a requests.Session
and it should be closeable via context management.
An algorithmic note: your random algorithm is reasonable but is not what the site itself uses. The site offers a link https://c.xkcd.com/random/comic/
that takes your normal GET
request and 302-redirects you to a random comic. Given that it requires the same number of network operations and gives you HTML rather than JSON, I think your count-and-choose method is still preferable.
Suggested
#!/usr/bin/env python3
from functools import cache
from mimetypes import guess_type
from os import startfile
from os.path import basename
from random import randint
from shutil import copyfileobj
from subprocess import check_call
from sys import platform
from tempfile import NamedTemporaryFile
from typing import BinaryIO, NamedTuple
from requests import Session
class XkcdComic(NamedTuple):
year: str
month: str
day: str
num: int
title: str
safe_title: str
link: str
img: str
news: str
transcript: str
alt: str
@property
def image_name(self) -> str:
return basename(self.img)
@property
def image_mimetype(self) -> str:
# typically image/png or image/jpeg
mimetype, encoding = guess_type(self.img)
return mimetype
class XkcdSite:
BASE_URL = 'https://xkcd.com/'
def __init__(self) -> None:
self.session = Session()
def get_count() -> int:
return self.get_current_comic().num
self.get_num_comics = cache(get_count)
def __enter__(self) -> 'XkcdSite':
return self
def __exit__(self, exc_type, exc_val, exc_tb) -> None:
self.session.close()
def get_image(self, comic: XkcdComic, file: BinaryIO) -> None:
with self.session.get(
url=comic.img,
headers={'Accept': comic.image_mimetype},
stream=True,
) as response:
response.raise_for_status()
copyfileobj(response.raw, file)
def get_comic(self, index: int) -> XkcdComic:
return self._fetch_comic(prefix=f'{index}/')
def get_current_comic(self) -> XkcdComic:
return self._fetch_comic(prefix='')
def get_random_comic(self) -> XkcdComic:
return self.get_comic(index=randint(1, self.get_num_comics()))
def _fetch_comic(self, prefix: str) -> XkcdComic:
with self.session.get(
url=f'{self.BASE_URL}{prefix}info.0.json',
headers={'Accept': 'application/json'},
) as response:
response.raise_for_status()
return XkcdComic(**response.json())
def show_image(self, comic: XkcdComic) -> None:
with NamedTemporaryFile(
# Needed for Windows since startfile() cannot block
delete=False,
# Needed for Windows since it cannot guess mimetypes based on content alone
suffix='_' + comic.image_name,
) as file:
self.get_image(comic, file)
run_file(file.name)
def run_file(path: str) -> None:
if platform == 'win32': # Windows
startfile(path)
else: # assume Linux-like
check_call(('/usr/bin/xdg-open', path), shell=False)
# This does not cover darwin/mac ('open'), etc.
def main() -> None:
with XkcdSite() as site:
comic = site.get_random_comic()
site.show_image(comic)
if __name__ == '__main__':
main()
The use of a class here is basically strictly a bad idea. This is such a simple script that if I were actually writing it, I would probably just put all the code under if __name__ == "__main__" :
with no functions or classes at all. Of course, since you’re posting here you’re probably at least partly writing this as an intellectual exercise, but even with with that parameter, using a class here really does nothing but make the code worse, for reasons I’ll explain in a second. First I’ll post how I would prefer to structure it. In the spirit of this being an exercise, I will actually split it up into functions in what I think is a reasonable way:
def random_comic_url():
url = "https://xkcd.com/info.0.json"
html_data = urllib.request.urlopen(url)
json_data = json.loads(html_data.read())
num_comics = json_data['num']
random_comic_num = random.randint(1, num_comics)
comic_json_url = "https://xkcd.com/%s/info.0.json" % (self.random_comic_num)
html_data = urllib.request.urlopen(comic_json_url)
json_data = json.loads(html_data.read())
image_url = json_data['img']
return image_url
def show_image(image_url):
urllib.request.urlretrieve(image_url, "/tmp/image")
subprocess.Popen(["feh", "/tmp/image"])
exit()
if __name__ == "__main__":
url = random_comic_url()
show_image(url)
I figure I’ll turn this into a little case study in when and when not to use a class. Here are bad reasons to use a class:
- Because I just thought of a noun that seems intuitively related to the problem, so it should be a class.
- Because I’m writing a function, and every function should be in a class, because that’s cleaner, right?
Here is the only reason you actually should use a class:
- You have a bunch of functions whose behavior needs to depend on which order the other functions got called in previously (and with what arguments), and it makes sense to call those functions in all kinds of different orders.
This is just a very abstract way of saying that you have a bunch of functions whose behavior depends on some piece of state, and such that only those functions are able to modify that state. The problem here is that the italicized part does not apply to our situation. Yes, the behavior of your show_image()
method will depend on which of the other methods you called previously, but only in the sense that it doesn’t make sense to call it at all if you haven’t called all of the other methods.
This gets to the heart of the biggest problem with unnecessary use of classes: you’re adding opportunities for your code to be put into an illegal state. In your case, it doesn’t make sense to call get_num_comics()
, get_random_comic()
, download_image()
and show_image()
in anything other than that order. But how would a user of your class (including you in a week, when you’ve forgotten the details) know that? They’re likely to forget to call one, and end up causing a crash. They’ll need to discover how the code is to be used through trial and error, or you’ll have to write documentation.
If it only ever makes sense to call f1
, f2
and f3
in that order, you may as well concatenate them together into one big function. If for reasons of readability you want to keep them as three separate functions, the thing to do is have f1
return a value which gets passed to f2
, and so on, not put them in a class. This way, it is actually impossible for someone using your code to call f2
without calling f1
first. This is why I split the code into two functions in the way that I did. If for example I had a function download_random_comic()
and a function show_image()
which accepts no argument, this would have been bad because calling show_image()
without calling download_random_comic()
first would be an error.
If the functions could be called in an arbitrary order but don’t depend on one another, just make them functions. If the functions could be called in an arbitrary order, and their behavior depends on the order the other functions were called in previously, then it should be a class. You might wonder: but why would we ever not know in what order some functions are going to get called? It’s our code, we know everything about it. There are two reasons:
- Because the order depends on the result of I/O. I/O here means anything that you can’t know while writing the code: contents of files the program might read, what order the user clicks the buttons in, etc.
- Because we are pretending not to know the order so that when we use the functions, we know that we won’t accidentally cause bugs by calling them in the wrong order, and so that if the code later changes so they have to get called in a different order, we won’t have to re-write this part of the code.
A testament to what kinds of problems unnecessary state can cause is is that while I was writing the classless version above, it was actually quite difficult for me to cut-and-paste the code together in a way that would work, because of the way the state is kind of bouncing around all over the place in your class. I start writing the random_comic_url()
function and go look up how you generate the random comic number, and I see you’re depending on self.num_comics
(worse, you’re depending on it through and accessor function, meaning I have to read through yet another layer of indirection). So now I have to scan through the class looking for how that field is populated. I look at get_num_comics()
, and it depends on self.url
, so I have to go look up how that field gets its value. In this simple example this is annoying but ultimately it just means I have to scroll around more: I just look up the one place in the class where you populate that field. In a more complicated class, it would be a disaster, because for every one of those fields, I have to anticipate every possible way that field could have gotten a value, and basically mentally plot a kind of exponentially growing possibility tree of all the possible paths that could have been taken: num_comics
has three different ways to get a value, one of which depends on the value of url
, and url
can get a value in four different ways, etc… This is fine if the logic that the code is implementing really is just that complicated, and the code merely reflects that, but for this problem there is actually only one possible execution path by which url
and num_comics
can get their value. In a straight-line “functional” version of the code, that knowledge about the program is made explicit and visible at a glance. In the stateful, class-based version, that knowledge is destroyed, and the programmer has to reconstruct it by carefully reading the code.
Another disadvantage to putting things in classes when they don’t need to be is that you are coupling things together that don’t need to be coupled. If you look at my implementation above, we’ve decoupled the code that displays an image from the code that gets an XKCD comic url. After all, showing an image from a URL really has nothing to do with JSON or XKCD. If later you want to extend the code to also display other webcomics, it will be very obvious how to do so: since show_image()
accepts an image URL as argument, you just need to write code to get a URL to the image of the new comic, and pass it to show_image()
. With the class-based implementation, you would be forced to read the class, un-pick the image rendering code form the network and JSON code, and then split the two apart as I did when I wrote this post, which as I commented above, actually turned out to be a slightly annoying process.
An epilogue: now that we understand when classes are appropriate, we can actually take a second look at the problem we’re trying to solve here. There actually is one way in which this problem has some state which could be nicely modelled as a class. This potential statefulness actually comes not from code architecture concerns, but from code performance concerns. If you look at my code, it has to download the main JSON file and figure out how many XKCD comics there are every time you want to download a new random comic. This is a little bit inefficient. In cases like this where you have a function (my random_comic_url()
function) which does some initial relatively heavy work that strictly only needs to be done once, and then does some extra, lighter work, a pattern you can apply is to make that function the sole method of a very small class. The heavy work happens in the class constructor. If we apply this pattern here, we get something like this:
class RandomXKCDSource:
def __init__(self):
url = "https://xkcd.com/info.0.json"
html_data = urllib.request.urlopen(url)
json_data = json.loads(html_data.read())
self.num_comics = json_data['num']
def random_comic_url(self):
random_comic_num = random.randint(1, self.num_comics)
comic_json_url = "https://xkcd.com/%s/info.0.json" % (random_comic_num)
html_data = urllib.request.urlopen(comic_json_url)
json_data = json.loads(html_data.read())
image_url = json_data['img']
return image_url
In terms of our “bunch of functions depending on eachother which could be called in any order” model, the “in any order” component is subtle here: it comes down to the fact that we don’t know how many times random_comic_url()
will be called. If we knew it would only be called once, then there would be no performance loss in merging both these functions into one.
Of course, you could also just have a function that calculates the number of comics, and then pass that number as an argument to a function random_comic_url()
. Now you can call it as many times as you want without re-calculating the number of comics. In many cases, that would be a bette way to write the code, but the class-based way doesn’t have no advtanges. One advantage is that you can construct an instance of this class and pass it off to another piece of code, and that one object has everything that code needs to get a random URL: you don’t need to pass a second variable around. Another advantage is that you are allowing the random_comic_url()
to have an arbitrarily complicated “startup routine”. Right now the “startup routine” is calculating the number of comics. But maybe XKCD changes their API at some point, and while you still need to do some heavy computations which you only have to do once up-front, the result of those computations no longer takes the form of a number of comics. Then you can just change what happens in the constructor, and code using this class doesn’t have to know.
And of course, remember that at the end of the day, the actual correct way to write a script this simple, if we were really just trying to solve the actual problem and not as a code design exercise (and we were sure this script wasn’t going to grow into a bigger project), was just to put it all inline in one script. Good formatting and good variable names give you all the readability you need in a script this short.