Problem
I’m getting address (physical address, not digital) input that’s obfuscated, and looks like the following:
The plaintext version:
'39 Jerrabomberra Ave. Narrabundah Canberra 2604 Australia'
The obfuscated version:
['39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ', 'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ', 'Australia']
Usually the obfuscation is simple duplication and rearranging, which my script catches. A few edge cases are still missed, and I’m working on catching those.
However, my solution feels like it could be simplified.
The logic follows this order:
- Join the array into one long string with a space as the ‘glue’ character.
- Use re.sub to find all commas and remove them.
- Split by space.
- Add each non-empty component to the components array if it is not already in there.
- Join the components together.
import re
...
address = fooGetAddress(foo[bar]) #returns an array
address_components = []
for component in re.sub(",", "", " ".join(address)).split(" "):
    if component not in address_components and component is not "":
        address_components.append(component)
address = " ".join(address_components)
Solution
Not bad. However, we can do away with the condition if component not in address_components and component is not "".
A better way to check if component not in address_components would be to use collections.OrderedDict:
An OrderedDict is a dict that remembers the order that keys were first inserted. If a new entry overwrites an existing entry, the original insertion position is left unchanged.
That’s exactly what we want. (Well, almost exactly. What we really want is an ordered set, but we can just use the keys of an OrderedDict and ignore the values.)
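As a minimal sketch of the ordered-set idea, the snippet below uses OrderedDict.fromkeys, which builds the keys from an iterable and sets every value to None. The helper name unique_in_order and the sample words are illustrative assumptions, not part of the original script.

```python
from collections import OrderedDict


def unique_in_order(items):
    """Return the distinct items, keeping the order of first appearance."""
    # Hypothetical helper: fromkeys() stores each item as a key (value None),
    # so a repeated item keeps its original position and is not added again.
    return list(OrderedDict.fromkeys(items))


words = ['Canberra', '2604', 'Australia', 'Canberra', '2604']
print(unique_in_order(words))  # ['Canberra', '2604', 'Australia']
```

Since Python 3.7 a plain dict also preserves insertion order, so dict.fromkeys(items) would behave the same way there.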
We can eliminate the need for component is not "" by using str.split() instead of str.split(" "):
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
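To make the difference concrete, here is a small comparison of the two calls on a made-up string with leading, trailing, and doubled spaces. The sample text is an assumption chosen for illustration.

```python
text = " Canberra  2604  "

# An explicit separator treats every single space as a boundary,
# so empty strings appear wherever spaces are adjacent or at the ends.
with_sep = text.split(" ")
print(with_sep)     # ['', 'Canberra', '', '2604', '', '']

# With no separator, runs of whitespace collapse and the ends are trimmed.
without_sep = text.split()
print(without_sep)  # ['Canberra', '2604']
```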
To get rid of the commas, you don’t need a regular expression. str.replace() will do.
It’s bad practice to use the same variable (address) for two different purposes, especially when the type changes (from a list of strings to a string).
With those changes, we can write the solution as just a single expression.
from collections import OrderedDict
obfuscated_address = …
address = ' '.join(
    OrderedDict(
        (component, None) for component in
        ' '.join(obfuscated_address).replace(',', '').split()
    ).keys()
)
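For a runnable check, the sketch below wraps the same expression in a function and feeds it the obfuscated sample from the question. The function name deobfuscate is a hypothetical choice, and the input list stands in for whatever fooGetAddress returns.

```python
from collections import OrderedDict


def deobfuscate(parts):
    """Join the parts, drop commas, and keep each token's first occurrence."""
    # Hypothetical wrapper around the single-expression solution.
    tokens = ' '.join(parts).replace(',', '').split()
    return ' '.join(OrderedDict((token, None) for token in tokens).keys())


obfuscated_address = [
    '39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ',
    'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ',
    'Australia',
]
print(deobfuscate(obfuscated_address))
# 39 Jerrabomberra Ave. Narrabundah Canberra 2604 Australia
```

One caveat of deduplicating token by token: an address that legitimately repeats a word (for example "New York New York") would lose the second occurrence, exactly as in the original loop.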
The is comparison is buggy
The is operator tests object identity, not equality, so it should be used only for comparisons against singletons; in practice, 99% of its usage is comparing to None. 1000 is 1000 is not guaranteed to output True.
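A short sketch of the distinction, assuming nothing beyond the standard library: equality checks compare values, while identity checks depend on how the interpreter happens to allocate objects. The helper name non_empty is illustrative only.

```python
def non_empty(components):
    """Keep the components that are not equal to the empty string."""
    # != compares values; "is not" would compare object identity instead.
    return [c for c in components if c != ""]


parts = "Canberra  2604".split(" ")
print(parts)             # ['Canberra', '', '2604']
print(non_empty(parts))  # ['Canberra', '2604']

a = int("1000")
b = int("1000")
print(a == b)  # True: the values are equal
print(a is b)  # implementation-dependent; do not rely on the result
```

Recent Python versions also emit a SyntaxWarning when is or is not is used with a string or number literal, which is another hint that == or != was intended.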