Problem
I want to write a Python script where the input is a genotype (e.g. “CCDd”) based upon the following Punnett square. The output should be the phenotype (e.g. “seal”).
My code works, but doesn’t look that ‘pythonic’ at all.
def colour(input):
    if len(set(input)) == 2 and input[0] != 'C' and input[2] != 'D':
        return('lilac')
    elif len(set(input[2::])) == 1 and input[2] != 'D':
        return('blue')
    elif len(set(input[0:2])) == 1 and input[0] != 'C':
        return('chocolate')
    else:
        return('seal')
Solution
You correctly identified that you need to count the number of lowercase c and d. But using a set to do so is not the right tool. Instead, I suggest using collections.Counter:
from collections import Counter

def color(genotype):
    assert len(genotype) == 4
    count = Counter(genotype)
    blue = count['d'] == 2
    chocolate = count['c'] == 2
    if blue and chocolate:
        return 'lilac'
    if blue:
        return 'blue'
    if chocolate:
        return 'chocolate'
    return 'seal'
I also added a debug check to ensure that the length of the input sequence is always four.
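To illustrate why Counter suits this task, here is a minimal sketch (the sample genotype is only an illustration): it reports how often each allele letter occurs, and letters that do not occur at all simply count as zero.

```python
from collections import Counter

# Counter maps each character of the string to its number of occurrences.
count = Counter('Ccdd')

print(count['d'])  # 2: both alleles of the D gene are recessive
print(count['c'])  # 1: only one recessive allele of the C gene
print(count['x'])  # 0: a missing key counts as zero instead of raising KeyError
```

Because missing keys yield 0, tests such as count['d'] == 2 need no prior membership check.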
Given the mini debate that occurred in the comments about variable names, you may consider this slight variation:
from collections import Counter

def color(genotype):
    assert len(genotype) == 4
    count = Counter(genotype)
    # Order matters: index 0 is the d gene (blue), index 1 the c gene (chocolate).
    recessive_traits = [count[gene] == 2 for gene in 'dc']
    if all(recessive_traits):
        return 'lilac'
    if recessive_traits[0]:
        return 'blue'
    if recessive_traits[1]:
        return 'chocolate'
    return 'seal'
It may be more expressive as to what you’re testing, and why.
The code you have posted produces nonsense results for incorrectly formatted input:
>>> colour('DDCC')
'lilac'
>>> colour('abcdef')
'seal'
>>> colour('aayx')
'chocolate'
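It would be better if the function rejected such inputs by raising an appropriate exception. I advise against using assertions for this purpose: among other reasons, assert statements are skipped entirely when Python runs with the -O option, so they suit debugging checks rather than input validation.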
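As a minimal sketch of rejecting bad input explicitly, assuming a hypothetical helper named check_genotype (it is not part of the posted code), the validation could look like this:

```python
def check_genotype(genotype):
    """Hypothetical helper: return genotype unchanged, or raise ValueError
    unless it consists of two C alleles followed by two D alleles."""
    if len(genotype) != 4:
        raise ValueError('Expected four letters, got: {!r}'.format(genotype))
    # Compare case-insensitively: each gene may be dominant or recessive.
    if genotype[:2].upper() != 'CC' or genotype[2:].upper() != 'DD':
        raise ValueError(
            'Expected two C alleles then two D alleles, got: {!r}'.format(genotype))
    return genotype


for candidate in ('CcDd', 'DDCC', 'abcdef', 'aayx'):
    try:
        check_genotype(candidate)
        print(candidate, 'accepted')
    except ValueError as error:
        print(candidate, 'rejected:', error)
```

Of the four candidates, only 'CcDd' is accepted; the three inputs that produced nonsense results above are all rejected.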
I suggest separating two tasks: parsing the genotype from the input string, and determining the phenotype based on the presence of the dominant allele for each of the two genes.
(Essentially the same suggestion appears in Mathias Ettinger’s answer. However, I interpret the “blueness” and “chocolateness” inversely, so that blue and chocolate gives seal, and neither blue nor chocolate gives lilac.)
As an alternative style of coding the phenotype logic, I have chosen a lookup table using a dictionary.
Validating and parsing the input can be combined in a single step using a regular expression.
Also note that I have added a docstring to the function, which includes a few doctests that can automatically be checked.
import re

def colour(genotype):
    """Determine the phenotype given the genotype encoded as a four-letter
    string consisting of two C's ("blueness" gene) followed by two D's
    ("chocolateness" gene) where the capitalization of the letters indicates
    whether the dominant (upper-case) or recessive (lower-case) alleles are
    present.

    >>> colour('CCDD')
    'seal'
    >>> colour('CCdd')
    'blue'
    >>> colour('ccDd')
    'chocolate'
    >>> colour('ccdd')
    'lilac'
    >>> colour('abcde')
    Traceback (most recent call last):
    ValueError: Incorrectly formatted input: abcde
    """
    def has_dominant(alleles):
        # Aa, aA, AA -> True, aa -> False
        return not alleles.islower()

    try:
        blue, chocolate = map(has_dominant,
            re.match(r'^([Cc]{2})([Dd]{2})$', genotype).groups())
    except AttributeError:
        raise ValueError('Incorrectly formatted input: {}'.format(genotype))
    return {
        (True, True): 'seal',
        (True, False): 'blue',
        (False, True): 'chocolate',
        (False, False): 'lilac'}[
            (blue, chocolate)]
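The combined validate-and-parse step can also be looked at in isolation. The following is a sketch only: split_genotype and GENOTYPE_PATTERN are hypothetical names, and it uses re.fullmatch with an explicit None check as one possible alternative to catching AttributeError.

```python
import re

# Two C-gene alleles followed by two D-gene alleles, each in either case.
GENOTYPE_PATTERN = re.compile(r'([Cc]{2})([Dd]{2})')


def split_genotype(genotype):
    """Hypothetical helper: return (c_alleles, d_alleles) or raise ValueError."""
    match = GENOTYPE_PATTERN.fullmatch(genotype)
    if match is None:
        raise ValueError('Incorrectly formatted input: {}'.format(genotype))
    return match.groups()


print(split_genotype('CcDd'))  # ('Cc', 'Dd')
```

One small behavioural difference: fullmatch requires the entire string to match, whereas a pattern anchored with $ also matches just before a trailing newline, so 'CCDD\n' would be rejected here.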