Problem
For practice, I decided to write a run-length decoder. For example:
basic_run_length_decode("1A2B3C")
=> 'ABBCCC'
basic_run_length_decode("5A10Z10J")
=> 'AAAAAZZZZZZZZZZJJJJJJJJJJ'
It ended up getting long, but that’s mostly because I wanted try incorporating the partition-by
function from Clojure. It seemed like it would work nicely here, and it did. It required me to write that function though since Python doesn’t seem to have anything equivalent. I also “stole” grouper
from the “recipe” section of the itertools
docs. Clojure has both of these functions as part of its standard library, so that’s what I thought of while writing this.
Basically, it works by cutting the string into groups of digits and non-digits, pairing them into [digits, word]
pairs, then parsing the digits and doing some string multiplication.
What I’d like commented on mainly:
-
partition_by
. I tried to see if there was a sane way of using something liketakewhile
+drop_while
, but both of those functions consume an extra element after the predicate flips, which caused problems. I opted to just do manual iteration, but it’s quite verbose. -
Anything else that I may be doing the hard way.
Note, it actually allows for multiple characters per number.
basic_run_length_decode("10AB10CD")
=> 'ABABABABABABABABABABCDCDCDCDCDCDCDCDCDCD'
I thought about raising an error, but decided to just allow it. This is apparently a type of possible encoding, and it cost me nothing to allow it.
from itertools import zip_longest
from typing import Callable, Any, Iterable, TypeVar, Generator, List
T = TypeVar("T")
# Emulating https://clojuredocs.org/clojure.core/partition-by
def _partition_by(f: Callable[[T], Any], iterable: Iterable[T]) -> Generator[List[T], None, None]:
"""Splits off a new chunk everytime f returns a new value.
list(partition_by(lambda n: n < 5, [1, 2, 3, 6, 3, 2, 1])) => [[1, 2, 3], [6], [3, 2, 1]]
"""
last_value = object() # Dummy object that will never be equal to anything else
chunk = []
for x in iterable:
returned = f(x)
if returned == last_value:
chunk.append(x)
else:
if chunk:
yield chunk
chunk = [x]
last_value = returned
if chunk:
yield chunk
# Recipe from https://docs.python.org/3.8/library/itertools.html#itertools.takewhile
def _grouper(iterable, n, fillvalue=None):
"""Collect data into fixed-length chunks or blocks"""
args = [iter(iterable)] * n
return zip_longest(*args, fillvalue=fillvalue)
def _invalid_encoded_error(message: str) -> ValueError:
return ValueError(f"Invalid encoded string. {message}")
def basic_run_length_decode(encoded_string: str) -> str:
"""Decodes a string in the form of <number><chars><number><chars>. . .
basic_run_length_decode("1A2B3C") => 'ABBCCC'"""
if encoded_string and not encoded_string[0].isdigit():
raise _invalid_encoded_error("First character must be a number.")
chunks = _partition_by(lambda c: c.isdigit(), encoded_string)
acc = []
for raw_n, raw_word in _grouper(chunks, 2):
if not raw_word:
raise _invalid_encoded_error("Trailing numbers without character at end of encoded string.")
n = int("".join(raw_n)) # Should never throw as long as the very first char is a digit?
word = "".join(raw_word)
acc.append(n * word)
return "".join(acc)
Solution
The code looks good to me, it is well documented and I do not have much to say.
Suggestion for _partition_by
In:
if returned == last_value:
chunk.append(x)
else:
if chunk:
yield chunk
chunk = [x]
I’d tried to write chunk = [x]
as chunk = [] ; chunk.append(x)
. This allows some rewriting as you can extract the common line out of the if / else
.
if returned == last_value:
pass
else:
if chunk:
yield chunk
chunk = []
chunk.append(x)
Then revert condition
if returned != last_value:
if chunk:
yield chunk
chunk = []
chunk.append(x)
Then group condition as there is nothing to do when chunk
is already empty.
if chunk and returned != last_value:
yield chunk
chunk = []
chunk.append(x)
There is nothing wrong with the way things were done, this is a pure personal preference.
Details about docstring
The docstrings are slightly inconsistent as we have two different forms for the verb: with and without the final ‘s’. If that can be of any help, there are Docstring conventions for Python in PEP 257.
It suggests:
The docstring is a phrase ending in a period. It prescribes the
function or method’s effect as a command (“Do this”, “Return that”),
not as a description; e.g. don’t write “Returns the pathname …”.
Short and flexible substitution for initial _partition_by
function by using itertools.groupby
magic (to generate/split a consecutive groups by predicate):
from typing import Callable, Any, Iterable, TypeVar, Generator, List
from itertools import groupby
T = TypeVar("T")
def _partition_by(f: Callable[[T], Any], iterable: Iterable[T]) -> Generator[List[T], None, None]:
"""Splits to consecutive chunks by predicate.
list(partition_by(lambda n: n < 5, [1, 2, 3, 6, 3, 2, 1])) => [[1, 2, 3], [6], [3, 2, 1]]
"""
for k, group in groupby(iterable, key=f):
yield list(group)
Use case:
lst = [1, 2, 3, 6, 3, 2, 1]
print(list(_partition_by(lambda n: n < 5, lst)))
The output:
[[1, 2, 3], [6], [3, 2, 1]]
Type Hint
The return type -> Generator[List[T], None, None]
is correct, but it is a little verbose. The documentation suggests a simpler way of documenting generators that just yield values, using Iterator[]
def _partition_by(...) -> Iterator[List[T]]:
The method _grouper()
doesn’t have a return type hint.
The Hard Way
- Anything else that I may be doing the hard way.
Well, you are doing the whole thing The Hard Way™.
You are looking for occurrences of <number><chars>
in a string. This sounds like a job for the regular expression module
. The <number>
pattern is simply (d+)
, and the <chars>
pattern would be characters which are not numbers, which can be expressed concisely as (D+)
… using a capital letter indicates it is matching not-digit characters.
Since you wish to find successive occurrences of these, you would want to use re.findall
, or perhaps re.finditer
:
for m in re.finditer(r"(d+)(D+)", encoded_str)
Now, m.group(1)
is the string containing <number>
characters and m.group(2)
is the string containing the <chars>
character. At this point, we can turn the digits into an integer, multiply the character string by that value, and join successive values together into one large string:
def basic_run_length_decode(encoded_string: str) -> str:
"""
Decodes a string in the form of <number><chars><number><chars>. . .
>>> basic_run_length_decode("1A2B3C")
'ABBCCC'
>>> basic_run_length_decode("10AB10CD")
'ABABABABABABABABABABCDCDCDCDCDCDCDCDCDCD'
"""
return "".join(m.group(2) * int(m.group(1))
for m in re.finditer(r"(d+)(D+)", encoded_string))
Yup. Your function can be simplified to one line of code, with no helper functions. Re-adding error checking left as exercise. 😉
Doctest
I’ve changed the format of your """docstring"""
slightly.
Instead of writing:
basic_run_length_decode("1A2B3C") => 'ABBCCC'
which implies to the reader that when you execute basic_run_length_decode("1A2B3C")
, it will return the value 'ABBCCC'
, I’ve replaced this with:
>>> basic_run_length_decode("1A2B3C")
'ABBCCC'
which implies if you are at Python’s REPL, and you type basic_run_length_decode("1A2B3C")
, it will return the value 'ABBCCC'
, the representation of (ie, with quotes) which will be printed by the REPL.
Basically, exactly the same thing.
Except …
The doctest
module knows how to read docstrings for functions, and extract things that look like Python REPL commands (ie, appear after a >>>
prompt), and will execute those commands and compare the actual result with the indicated result, and raise an error if they don’t match. Built-in unit testing!
Let’s test our code!
aneufeld$ python3 -m doctest run_length_decode.py
aneufeld$
Well, that was anticlimactic; nothing got printed. That’s a good thing. It means the tests passed. We can add a -v
to make the output more verbose:
aneufeld$ python3 -m doctest -v run_length_decode.py
Trying:
basic_run_length_decode("1A2B3C")
Expecting:
'ABBCCC'
ok
Trying:
basic_run_length_decode("10AB10CD")
Expecting:
'ABABABABABABABABABABCDCDCDCDCDCDCDCDCDCD'
ok
1 items had no tests:
run_length_decode
1 items passed all tests:
2 tests in run_length_decode.basic_run_length_decode
2 tests in 2 items.
2 passed and 0 failed.
Test passed.
aneufeld$