Simple run-length encoding decoder

Posted on

Problem

For practice, I decided to write a run-length decoder. For example:

basic_run_length_decode("1A2B3C")
=> 'ABBCCC'

basic_run_length_decode("5A10Z10J")
=> 'AAAAAZZZZZZZZZZJJJJJJJJJJ'

It ended up getting long, but that’s mostly because I wanted try incorporating the partition-by function from Clojure. It seemed like it would work nicely here, and it did. It required me to write that function though since Python doesn’t seem to have anything equivalent. I also “stole” grouper from the “recipe” section of the itertools docs. Clojure has both of these functions as part of its standard library, so that’s what I thought of while writing this.

Basically, it works by cutting the string into groups of digits and non-digits, pairing them into [digits, word] pairs, then parsing the digits and doing some string multiplication.

What I’d like commented on mainly:

  • partition_by. I tried to see if there was a sane way of using something like takewhile + drop_while, but both of those functions consume an extra element after the predicate flips, which caused problems. I opted to just do manual iteration, but it’s quite verbose.

  • Anything else that I may be doing the hard way.

Note, it actually allows for multiple characters per number.

basic_run_length_decode("10AB10CD")
=> 'ABABABABABABABABABABCDCDCDCDCDCDCDCDCDCD'

I thought about raising an error, but decided to just allow it. This is apparently a type of possible encoding, and it cost me nothing to allow it.

from itertools import zip_longest
from typing import Callable, Any, Iterable, TypeVar, Generator, List

T = TypeVar("T")
# Emulating https://clojuredocs.org/clojure.core/partition-by
def _partition_by(f: Callable[[T], Any], iterable: Iterable[T]) -> Generator[List[T], None, None]:
    """Splits off a new chunk everytime f returns a new value.
    list(partition_by(lambda n: n < 5, [1, 2, 3, 6, 3, 2, 1])) => [[1, 2, 3], [6], [3, 2, 1]]
    """
    last_value = object()  # Dummy object that will never be equal to anything else
    chunk = []

    for x in iterable:
        returned = f(x)

        if returned == last_value:
            chunk.append(x)

        else:
            if chunk:
                yield chunk

            chunk = [x]

        last_value = returned

    if chunk:
        yield chunk


# Recipe from https://docs.python.org/3.8/library/itertools.html#itertools.takewhile
def _grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks"""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def _invalid_encoded_error(message: str) -> ValueError:
    return ValueError(f"Invalid encoded string. {message}")


def basic_run_length_decode(encoded_string: str) -> str:
    """Decodes a string in the form of <number><chars><number><chars>. . .
    basic_run_length_decode("1A2B3C") => 'ABBCCC'"""
    if encoded_string and not encoded_string[0].isdigit():
        raise _invalid_encoded_error("First character must be a number.")

    chunks = _partition_by(lambda c: c.isdigit(), encoded_string)

    acc = []
    for raw_n, raw_word in _grouper(chunks, 2):
        if not raw_word:
            raise _invalid_encoded_error("Trailing numbers without character at end of encoded string.")

        n = int("".join(raw_n))  # Should never throw as long as the very first char is a digit?
        word = "".join(raw_word)

        acc.append(n * word)

    return "".join(acc)

Solution

The code looks good to me, it is well documented and I do not have much to say.

Suggestion for _partition_by

In:

    if returned == last_value:
        chunk.append(x)
    else:
        if chunk:
            yield chunk
        chunk = [x]

I’d tried to write chunk = [x] as chunk = [] ; chunk.append(x). This allows some rewriting as you can extract the common line out of the if / else.

    if returned == last_value:
        pass
    else:
        if chunk:
            yield chunk
        chunk = []
    chunk.append(x)

Then revert condition

    if returned != last_value:
        if chunk:
            yield chunk
        chunk = []
    chunk.append(x)

Then group condition as there is nothing to do when chunk is already empty.

    if chunk and returned != last_value:
        yield chunk
        chunk = []
    chunk.append(x)

There is nothing wrong with the way things were done, this is a pure personal preference.

Details about docstring

The docstrings are slightly inconsistent as we have two different forms for the verb: with and without the final ‘s’. If that can be of any help, there are Docstring conventions for Python in PEP 257.

It suggests:

The docstring is a phrase ending in a period. It prescribes the
function or method’s effect as a command (“Do this”, “Return that”),
not as a description; e.g. don’t write “Returns the pathname …”.

Short and flexible substitution for initial _partition_by function by using itertools.groupby magic (to generate/split a consecutive groups by predicate):

from typing import Callable, Any, Iterable, TypeVar, Generator, List
from itertools import groupby

T = TypeVar("T")

def _partition_by(f: Callable[[T], Any], iterable: Iterable[T]) -> Generator[List[T], None, None]:
    """Splits to consecutive chunks by predicate.
    list(partition_by(lambda n: n < 5, [1, 2, 3, 6, 3, 2, 1])) => [[1, 2, 3], [6], [3, 2, 1]]
    """
    for k, group in groupby(iterable, key=f):
        yield list(group)

Use case:

lst = [1, 2, 3, 6, 3, 2, 1]
print(list(_partition_by(lambda n: n < 5, lst)))

The output:

[[1, 2, 3], [6], [3, 2, 1]]

Type Hint

The return type -> Generator[List[T], None, None] is correct, but it is a little verbose. The documentation suggests a simpler way of documenting generators that just yield values, using Iterator[]

def _partition_by(...) -> Iterator[List[T]]:

The method _grouper() doesn’t have a return type hint.

The Hard Way

  • Anything else that I may be doing the hard way.

Well, you are doing the whole thing The Hard Way™.

You are looking for occurrences of <number><chars> in a string. This sounds like a job for the regular expression module. The <number> pattern is simply (d+), and the <chars> pattern would be characters which are not numbers, which can be expressed concisely as (D+) … using a capital letter indicates it is matching not-digit characters.

Since you wish to find successive occurrences of these, you would want to use re.findall, or perhaps re.finditer:

for m in re.finditer(r"(d+)(D+)", encoded_str)

Now, m.group(1) is the string containing <number> characters and m.group(2) is the string containing the <chars> character. At this point, we can turn the digits into an integer, multiply the character string by that value, and join successive values together into one large string:

def basic_run_length_decode(encoded_string: str) -> str:
    """
    Decodes a string in the form of <number><chars><number><chars>. . .

    >>> basic_run_length_decode("1A2B3C")
    'ABBCCC'
    >>> basic_run_length_decode("10AB10CD")
    'ABABABABABABABABABABCDCDCDCDCDCDCDCDCDCD'
    """

    return "".join(m.group(2) * int(m.group(1))
                   for m in re.finditer(r"(d+)(D+)", encoded_string))

Yup. Your function can be simplified to one line of code, with no helper functions. Re-adding error checking left as exercise. 😉

Doctest

I’ve changed the format of your """docstring""" slightly.

Instead of writing:

    basic_run_length_decode("1A2B3C") => 'ABBCCC'

which implies to the reader that when you execute basic_run_length_decode("1A2B3C"), it will return the value 'ABBCCC', I’ve replaced this with:

    >>> basic_run_length_decode("1A2B3C")
    'ABBCCC'

which implies if you are at Python’s REPL, and you type basic_run_length_decode("1A2B3C"), it will return the value 'ABBCCC', the representation of (ie, with quotes) which will be printed by the REPL.

Basically, exactly the same thing.

Except …

The doctest module knows how to read docstrings for functions, and extract things that look like Python REPL commands (ie, appear after a >>> prompt), and will execute those commands and compare the actual result with the indicated result, and raise an error if they don’t match. Built-in unit testing!

Let’s test our code!

aneufeld$ python3 -m doctest run_length_decode.py
aneufeld$

Well, that was anticlimactic; nothing got printed. That’s a good thing. It means the tests passed. We can add a -v to make the output more verbose:

aneufeld$ python3 -m doctest -v run_length_decode.py 
Trying:
    basic_run_length_decode("1A2B3C")
Expecting:
    'ABBCCC'
ok
Trying:
    basic_run_length_decode("10AB10CD")
Expecting:
    'ABABABABABABABABABABCDCDCDCDCDCDCDCDCDCD'
ok
1 items had no tests:
    run_length_decode
1 items passed all tests:
   2 tests in run_length_decode.basic_run_length_decode
2 tests in 2 items.
2 passed and 0 failed.
Test passed.
aneufeld$ 

Leave a Reply

Your email address will not be published. Required fields are marked *