Re-arranging an obfuscated address

Problem

I’m receiving obfuscated address input (physical addresses, not digital ones) that looks like the following:

The plaintext version:

'39 Jerrabomberra Ave. Narrabundah Canberra 2604 Australia'

The obfuscated version:

['39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ', 'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ', 'Australia']

Usually the obfuscation is simple duplication and rearrangement, which my script catches. It still misses a few edge cases, which I’m working on.

However, my solution feels like it could be simplified.

The logic follows this order:

  1. Join the array into one long string with a space as the ‘glue’ character.
  2. Use re.sub to find all commas and remove them.
  3. Split by space
  4. Add each non-empty component to the components array if it is not already in there.
  5. Join the components together.

import re

...

address = fooGetAddress(foo[bar]) #returns an array
address_components = []
for component in re.sub(",", "", " ".join(address)).split(" "):
    if component not in address_components and component is not "":
        address_components.append(component)
address = " ".join(address_components)

Solution

Not bad. However, the whole test (if component not in address_components and component is not "") can be eliminated.

A better way to check if component not in address_components would be to use collections.OrderedDict:

An OrderedDict is a dict that remembers the order that keys were first inserted. If a new entry overwrites an existing entry, the original insertion position is left unchanged.

That’s exactly what we want. (Well, almost exactly. What we really want is an ordered set, but we can just use the keys of an OrderedDict and ignore the values.)
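As a minimal sketch of the ordered-set idea, the snippet below deduplicates a small list while keeping first-seen order. The sample list and the variable names are made up for illustration; OrderedDict.fromkeys() is the standard-library class method that builds a dict from an iterable of keys.

```python
from collections import OrderedDict

# Hypothetical sample data: some words appear more than once.
words = ["Canberra", "2604", "Canberra", "Australia", "2604"]

# A repeated key does not create a new entry, so each word keeps
# the position of its first appearance.
unique_words = list(OrderedDict.fromkeys(words))

print(unique_words)  # ['Canberra', '2604', 'Australia']
```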

We can eliminate the need for component is not "" by using str.split() instead of str.split(" "):

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
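To make the difference concrete, here is a short comparison on a made-up string with leading, trailing, and repeated spaces. The variable names are illustrative only.

```python
# Hypothetical input: one leading space, three spaces in the middle, two trailing.
text = " Canberra   2604  "

# An explicit separator treats every single space as a boundary,
# so empty strings appear in the result.
with_sep = text.split(" ")
print(with_sep)  # ['', 'Canberra', '', '', '2604', '', '']

# With no separator, runs of whitespace collapse and no empty strings remain.
without_sep = text.split()
print(without_sep)  # ['Canberra', '2604']

# A whitespace-only string splits to an empty list.
print("   ".split())  # []
```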

To get rid of the commas, you don’t need a regular expression. str.replace() will do.
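For a fixed one-character pattern like the comma, the two calls should give the same result, as this small check on an illustrative string suggests:

```python
import re

# Hypothetical fragment of an address.
line = "Canberra 2604, Australia"

via_regex = re.sub(",", "", line)     # regular-expression substitution
via_replace = line.replace(",", "")   # plain string replacement

print(via_replace)  # Canberra 2604 Australia
```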

It’s bad practice to use the same variable (address) for two different purposes, especially when the type changes (from a list of strings to a string).

With those changes, we can write the solution as just a single expression.

from collections import OrderedDict

obfuscated_address = …
address = ' '.join(
    OrderedDict(
        (component, None) for component in
        ' '.join(obfuscated_address).replace(',', '').split()
    ).keys()
)
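For a version that can be run directly, the sketch below wraps the same idea in a function and feeds it the sample list from the question. The function name deobfuscate is made up here, and OrderedDict.fromkeys() is used as a shorthand for the generator of (component, None) pairs.

```python
from collections import OrderedDict


def deobfuscate(parts):
    """Join the parts, drop commas, and keep each word's first occurrence."""
    words = " ".join(parts).replace(",", "").split()
    return " ".join(OrderedDict.fromkeys(words))


# Sample input taken from the question.
obfuscated_address = [
    '39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ',
    'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ',
    'Australia',
]

address = deobfuscate(obfuscated_address)
print(address)  # 39 Jerrabomberra Ave. Narrabundah Canberra 2604 Australia
```

Like the original script, this removes every repeated word, so an address in which a word legitimately appears twice would lose the second occurrence.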

The is comparison is buggy

is tests object identity, not equality, so it should be used only with singletons; in practice, about 99% of its uses are comparisons with None. Even 1000 is 1000 is not guaranteed to evaluate to True.
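A small illustration of the difference, using lists because two separately built lists are always distinct objects. The variable names and the keep_non_empty helper are hypothetical; the helper only shows the equality-based form of the emptiness check.

```python
first = [1000]
second = [1000]

print(first == second)  # True: the contents are equal
print(first is second)  # False: they are two different objects

# None is a singleton, so an identity test is the right tool here.
value = None
print(value is None)  # True


def keep_non_empty(components):
    """Filter out empty strings by comparing values, not identities."""
    return [component for component in components if component != ""]


print(keep_non_empty(["Canberra", "", "2604"]))  # ['Canberra', '2604']
```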
