Genotype as input, give the phenotype

Problem

I want to write a Python script where the input is a genotype (e.g. “CCDd”) based upon the following Punnett square. The output should be the phenotype (e.g. “seal”).

[Image: Punnett square relating genotypes to phenotypes]

My code works, but it doesn’t look very ‘pythonic’.

def colour(input):

    if len(set(input)) == 2 and input[0] != 'C' and input[2] != 'D':
        return('lilac')
    elif len(set(input[2::])) == 1 and input[2] != 'D':
        return('blue')
    elif len(set(input[0:2])) == 1 and input[0] != 'C':
        return('chocolate')
    else:
        return('seal')

Solution

You correctly identified that you need to count the number of lowercase c and d. But using a set to do so is not the right tool. Instead, I suggest using collections.Counter:

from collections import Counter


def color(genotype):
    assert len(genotype) == 4

    count = Counter(genotype)
    blue = count['d'] == 2
    chocolate = count['c'] == 2

    if blue and chocolate:
        return 'lilac'
    if blue:
        return 'blue'
    if chocolate:
        return 'chocolate'
    return 'seal'

I also added a debug check to ensure that the length of the input sequence is always four.
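As a quick illustration of why Counter fits here, this minimal sketch shows what it reports for one sample genotype (the string 'ccDd' is just an example input). A missing key yields 0 rather than raising KeyError, so the comparison with 2 is safe even when no recessive allele is present:

```python
from collections import Counter

count = Counter('ccDd')

# Each distinct character maps to its number of occurrences.
print(count['c'])  # 2
print(count['D'])  # 1
print(count['d'])  # 1

# A missing key yields 0 instead of raising KeyError.
print(count['C'])  # 0
```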


Given the mini-debate that occurred in the comments about variable names, you may consider this slight variation:

from collections import Counter


def color(genotype):
    assert len(genotype) == 4

    count = Counter(genotype)
    # Order follows 'cd': index 0 is cc (chocolate), index 1 is dd (blue).
    recessive_traits = [count[gene] == 2 for gene in 'cd']

    if all(recessive_traits):
        return 'lilac'
    if recessive_traits[0]:
        return 'chocolate'
    if recessive_traits[1]:
        return 'blue'
    return 'seal'

It may be more expressive as to what you’re testing, and why.

The code you have posted produces nonsense results for incorrectly formatted input:

>>> colour('DDCC')
'lilac'
>>> colour('abcdef')
'seal'
>>> colour('aayx')
'chocolate'

It would be better if the function rejected such inputs by raising an appropriate exception. I advise against using assertions for this purpose, for the reasons discussed in the two links.
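As a rough sketch of what rejecting bad input with an exception could look like, here is a hypothetical helper (check_genotype is an illustrative name, not part of the original code). One well-known property of assert is that Python skips assert statements entirely when run with the -O option, whereas an explicit raise is always executed:

```python
def check_genotype(genotype):
    """Hypothetical helper: return genotype unchanged, or raise ValueError."""
    if len(genotype) != 4:
        raise ValueError('Expected 4 letters, got {!r}'.format(genotype))
    # First pair must be C or c, second pair must be D or d.
    if not (all(allele in 'Cc' for allele in genotype[:2])
            and all(allele in 'Dd' for allele in genotype[2:])):
        raise ValueError(
            'Expected two C/c then two D/d, got {!r}'.format(genotype))
    return genotype
```

With this in place, check_genotype('DDCC') raises ValueError instead of letting the input reach the colour logic.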

I suggest separating the two tasks: parsing the genotype from the input string, and determining the phenotype based on the presence of the dominant allele for each of the two genes.

(Essentially the same suggestion is contained in Mathias Ettinger’s answer. However, I interpret “blueness” and “chocolateness” inversely, so that blue and chocolate together give seal, and neither blue nor chocolate gives lilac.)

As an alternative style of coding the phenotype logic, I have chosen a lookup table using a dictionary.

Validating and parsing the input can be combined in a single step using a regular expression.
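To make the idea concrete, here is a minimal sketch of how one pattern can both validate and split the string. The name GENOTYPE is illustrative, and the sketch uses fullmatch (available since Python 3.4) as a variation on anchoring the pattern with ^ and $:

```python
import re

# Two C/c letters followed by two D/d letters, each pair captured as a group.
GENOTYPE = re.compile(r'([Cc]{2})([Dd]{2})')

match = GENOTYPE.fullmatch('CcDd')
print(match.groups())              # ('Cc', 'Dd')

# fullmatch returns None when the string does not fit the pattern.
print(GENOTYPE.fullmatch('DDCC'))  # None
```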

Also note that I have added a docstring to the function, which includes a few doctests that can automatically be checked.

import re

def colour(genotype):
    """Determine the phenotype given the genotype encoded as a four-letter
    string consisting of two C's ("blueness" gene) followed by two D's
    ("chocolateness" gene) where the capitalization of the letters indicates
    whether the dominant (upper-case) or recessive (lower-case) alleles are
    present.

    >>> colour('CCDD')
    'seal'
    >>> colour('CCdd')
    'blue'
    >>> colour('ccDd')
    'chocolate'
    >>> colour('ccdd')
    'lilac'
    >>> colour('abcde')
    Traceback (most recent call last):
    ValueError: Incorrectly formatted input: abcde
    """

    def has_dominant(alleles):
        # Aa, aA, AA -> True, aa -> False
        return not alleles.islower()

    try:
        blue, chocolate = map(has_dominant,
            re.match(r'^([Cc]{2})([Dd]{2})$', genotype).groups())
    except AttributeError:
        raise ValueError('Incorrectly formatted input: {}'.format(genotype))

    return {
        (True,  True):  'seal',
        (True,  False): 'blue',
        (False, True):  'chocolate',
        (False, False): 'lilac'}[
        (blue,  chocolate)]
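For completeness, here is one way the doctests could be run automatically. This is only a sketch: it repeats the has_dominant helper at module level with doctests of its own so that the block is self-contained. Running `python -m doctest` on the file from the command line is another option.

```python
import doctest


def has_dominant(alleles):
    """Return True if at least one allele in the pair is upper-case.

    >>> has_dominant('Cc')
    True
    >>> has_dominant('dd')
    False
    """
    return not alleles.islower()


if __name__ == '__main__':
    # Runs the examples in the docstring; prints nothing when all pass.
    doctest.run_docstring_examples(has_dominant, globals())
```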
