Sentence editor

Posted on

Problem

I wrote a simple sentence editor which capitalises first word, deletes redundant spaces and adds a dot in the end unless another sign of punctuation is already present.

Please help me to improve my code, maybe by using some libraries.

def correct_sentence(text: str) -> str:
    """
      returns a corrected sentence which starts with a capital letter
      and ends with a dot if needed. Also deletes redundant spaces.
    """
    marks = ['!', '?']
    text = text.strip()
    # checking if the last symbol is any of marks
    for i in range(len(marks)):
        if text[-1] != '.':
            if text[-1] == marks[i]:
                break
            else:
                text += '.'
    text_l = text.split()
    # deleting redundant spaces
    for word in text_l:
        if word == ',':
            indw = text_l.index(word)
            text_l[indw - 1] += word
            text_l.remove(word)
    text_l[0] = text_l[0].title()
    text = ' '.join(text_l)
    return text

As an example of how the code should work, the sentence:

"Hello   ,  my name is test"

Would be corrected to:

"Hello, my name is test."

Solution

Your code looks good, has a docstring explaining precisely what is intended and typing annotations.

This is great and it also makes the review much more pleasant.

Tests

Before reviewing anything, I quite like adding a few test cases to see how the code behave and have a quick feedback loop if I try to change things in the code.

Here, based purely on the docstring, I wrote:

assert correct_sentence("a") == "A."
assert correct_sentence("abc") == "Abc."
assert correct_sentence("abc...") == "Abc..."
assert correct_sentence("abc .  .   .") == "Abc . . ."
assert correct_sentence(" abc  de f  ") == "Abc de f."
assert correct_sentence("aBcDe FgHiJk") == "Abcde FgHiJk." # Expected ?
correct_sentence("") # What should we do ?

Everything looks good to me except for the last 2 cases which you may want to look at.

More tests

The list above is not comprehensive and should be enhanced as we go through the review/add features. While going through the code, I realised that ‘?’ should be considered as a proper sentence end. I wrote the following test cases which do not really convince me that things are working fine.

assert correct_sentence("a?") == "A?."
assert correct_sentence("abc.?.") == "Abc.?."
assert correct_sentence("abc..?") == "Abc..?."

Disclaimer: the following part of the review is based on the assumption that the current behavior is the one you want. It may not be the case but I hope the lessons provided will be interesting anyway.

Look like a native

I highly recommend Ned Batchelder’s talk called “Loop Like A Native”. The main point is that whenever we are using “range(len(iterable))” there is usually a better (more pythonic, more concise, more efficient, clearer).

Here:

# checking if the last symbol is any of marks
for i in range(len(marks)):
    if text[-1] != '.':
        if text[-1] == marks[i]:
            break
        else:
            text += '.'

can be rewritten:

# checking if the last symbol is any of marks
for mark in marks:
    if text[-1] != '.':
        if text[-1] == mark:
            break
        else:
            text += '.'

Stopping early

When text[-1] == '.', the loop above will have no effect. Thus, we could break as soon as we do text += '.'.

We get

        if text[-1] == mark:
            break
        else:
            text += '.'
            break

which can be rewritten:

        if text[-1] != mark:
            text += '.'
        break

Reorganise the logic

Now it is clearer that text[-1] will stay unchanged during the whole loop as we exit the loop as soon as we do something that would change the value.

We can use a variable for this, we can move the check if text[-1] != '.' before the loop.

We get something like:

last = text[-1]
if last != '.':
    for mark in marks:
        if last != mark:
            text += '.'
        break

Now, it is obvious that only the first elements from marks will ever be considered (we break at the end of the first iteration) and this explains the surprising behavior described above.

Loop with the ,

The second loop is commented # deleting redundant spaces but we could literally remove it and still have the redundant spaces removed. The comment needs to be adjusted to the relevant part.

It looks like what you are trying to achieve is removing space before comma.

I’ve added the following test cases:

assert correct_sentence("a,b,c,d") == "A,B,C,D."
assert correct_sentence("a, b ,c , d") == "A, b ,c, d."
assert correct_sentence("a, b ,c , d , , , ,e") == "A, b ,c, d,, , ,e."

Question: should we do this only when word == "," or should when word starts wich "," ?

Let’s see what we can try to improve.

  1. You are trying to update the list as you iterate through it. This makes everything quite fragile.
  2. You are calling word to know the position of the word you are considering. This could be provided by enumerate.

There are many things that are still to be improved but I think there are many defects you should try to fix first before going any further.

At this stage, the code looks like:

def correct_sentence(text: str) -> str:
    """
      returns a corrected sentence which starts with a capital letter
      and ends with a dot if needed. Also deletes redundant spaces.
    """
    text = text.strip()
    if not text:
        return ''

    # checking if the last symbol is any of marks
    marks = ['!', '?']
    last = text[-1]
    if last != '.':
        for mark in marks:
            if last != mark:
                text += '.'
            break

    text_l = text.split()
    for word in text_l:
        if word == ',':
            indw = text_l.index(word)
            text_l[indw - 1] += word
            text_l.remove(word)
    text_l[0] = text_l[0].title()
    # deleting redundant spaces with split & join
    return ' '.join(text_l)


assert correct_sentence("a") == "A."
assert correct_sentence("a?") == "A?."
assert correct_sentence("abc") == "Abc."
assert correct_sentence("abc...") == "Abc..."
assert correct_sentence("abc.?.") == "Abc.?."
assert correct_sentence("abc..?") == "Abc..?."
assert correct_sentence("abc .  .   .") == "Abc . . ."
assert correct_sentence(" abc  de f  ") == "Abc de f."
assert correct_sentence("aBcDe FgHiJk") == "Abcde FgHiJk." # Expected ?
assert correct_sentence("") == '' # What should we do ?
assert correct_sentence("a,b,c,d") == "A,B,C,D."
assert correct_sentence("a, b ,c , d") == "A, b ,c, d."
assert correct_sentence("a, b ,c , d , , , ,e") == "A, b ,c, d,, , ,e."

Organisation

If you want to make you life easier, you may want to split your function into smaller functions with a single responsability. This will also make testing easier.

In my oppinion, your code looks quite good. You have a consistent style, appropriate documentation, and a well-structured function with clear code.

Nevertheless, I would like to point out a few points that can make your code more Pythonic.

The way you check for the final mark can be improved and made more concise. Instead of checking all the elements of marks by hand, you can use str.endswith(...) with multiple suffixes like so:

marks = ('!', '?', '.')
if not text.endswith(marks):
    text += '.'

Note: marks is now a tuple and not a list anymore.

Handling those ,‘s kept me thinking longer than I would like to admit. I thought I would be easy to also handle cases like , , , or ,,, but as always, nothing is quite as easy as it seems. It’s always a bit tricky to index and/or iterate lists that gets modified in the meantime. One approach is to use another list to store the results, another very common approach is to index the list from the back. I chose the later one and came up with the following solution:

# avoid whitespace in front of , as well as multiple ,
for i in reversed(range(len(text_l))):
    word = text_l[i]
    if re.fullmatch(r',+', word):
        if not text_l[i-1].endswith(',') and i > 0:
            text_l[i-1] += ','
        del text_l[i]

I chose to work with Python’s regex module to take care of repeated , characters. text_l[i-1].endswith(',') was added to make sure that a word does not get more than one , and i > 0 was added to handle the edge case of a leading ,, which would otherwise wrap around.

The capitalization might also need a slight retouch, since str.title() has some quirks which are listed in the docs. This can be seen in a simple example: print("They're".title()) prints They'Re, which is probably not what you are looking for. A quick-and-dirty fix a came up with was to use text_l[0] = text_l[0][0].upper()+text_l[0][1:] to capitalize only the first letter. This even works for single letter words like i:

test = "i"
print(test.upper()+test[1:])
>>> I

With all these changes the code now looks as follows:

def correct_sentence(text: str) -> str:
    """
    returns a corrected sentence which starts with a capital letter
    and ends with a dot if needed. Also deletes redundant spaces.
    """
    text = text.strip()

    # checking if the last symbol is any of marks
    marks = ('!', '?', '.')
    if not text.endswith(marks):
        text += '.'

    text_l = text.split()
    # avoid whitespace in front of , as well as multiple ,
    for i in reversed(range(len(text_l))):
        word = text_l[i]
        if re.fullmatch(r',+', word):
            if not text_l[i-1].endswith(',') and i > 0:
                text_l[i-1] += ','
            del text_l[i]

    # capitalize first letter only, since .title() has some quirks.
    text_l[0] = text_l[0][0].upper()+text_l[0][1:]

    return ' '.join(text_l)

I also took the liberty to change the comment to describe what you actually want to do.

Leave a Reply

Your email address will not be published. Required fields are marked *