I’m getting obfuscated address input (physical addresses, not digital ones) that looks like the following:
The plaintext version:
'39 Jerrabomberra Ave. Narrabundah Canberra 2604 Australia'
The obfuscated version:
['39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ', 'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ', 'Australia']
Usually the obfuscation is simple duplication and rearranging, which my script catches. There are a few edge cases it misses, which I’m working on catching.
Even so, my solution feels like it could be simplified.
The logic follows this order:
- Join the array into one long string with a space as the ‘glue’ character.
- Use `re.sub` to find all commas and remove them.
- Split by space
- Add each non-empty component to the components array if it is not already in there.
- Join the components together.
    import re
    ...
    address = fooGetAddress(foo[bar])  # returns an array
    address_components = []
    for component in re.sub(",", "", " ".join(address)).split(" "):
        if component not in address_components and component is not "":
            address_components.append(component)
    address = " ".join(address_components)
Not bad. However, we can do away with `if component not in address_components and component is not ""`.
A better way to check `if component not in address_components` would be to use `collections.OrderedDict`. From the documentation: "An `OrderedDict` is a dict that remembers the order that keys were first inserted. If a new entry overwrites an existing entry, the original insertion position is left unchanged."
That’s exactly what we want. (Well, almost exactly. What we really want is an ordered set, but we can just use the keys of an `OrderedDict` and ignore the values.)
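As a minimal sketch of the ordered-set idea, `OrderedDict.fromkeys()` builds the keys in one call and keeps only the first occurrence of each one. The sample tokens below are made up for illustration. On Python 3.7 and later a plain `dict` also preserves insertion order, so `dict.fromkeys()` should behave the same way here.

```python
from collections import OrderedDict

# Hypothetical sample tokens, with duplicates, for illustration only.
tokens = ['Canberra', '2604', 'Australia', 'Canberra', '2604']

# fromkeys() maps every token to None; a repeated key keeps its first position.
unique_tokens = list(OrderedDict.fromkeys(tokens))

print(unique_tokens)  # ['Canberra', '2604', 'Australia']
```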
We can eliminate the need for `component is not ""` by using `str.split()` instead of `str.split(" ")`. From the documentation: "If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns []."
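A small comparison may make the difference concrete. The sample string is invented for illustration; it has leading, trailing, and repeated spaces.

```python
# Hypothetical input with leading, trailing, and repeated spaces.
text = '  Canberra   2604 '

# An explicit separator yields an empty string for every extra space.
with_sep = text.split(' ')

# No separator: runs of whitespace collapse and no empty strings appear.
without_sep = text.split()

print(with_sep)     # ['', '', 'Canberra', '', '', '2604', '']
print(without_sep)  # ['Canberra', '2604']
```

With the no-argument form, the empty-string check in the loop has nothing left to filter.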
To get rid of the commas, you don’t need a regular expression; `str.replace()` will do.
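For a fixed, literal pattern such as a comma, the two calls should give the same result. The joined string below is a made-up sample based on the address in the question.

```python
import re

# Sample input for illustration, based on the address in the question.
joined = '39 Jerrabomberra Ave., Narrabundah Canberra 2604, Australia'

via_regex = re.sub(',', '', joined)       # regular-expression substitution
via_replace = joined.replace(',', '')     # plain string replacement

print(via_regex == via_replace)  # True
```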
It’s bad practice to use the same variable (`address`) for two different purposes, especially when the type changes (from a list of strings to a string).
With those changes, we can write the solution as just a single expression.
    from collections import OrderedDict

    obfuscated_address = …
    address = ' '.join(
        OrderedDict(
            (component, None) for component in
            ' '.join(obfuscated_address).replace(',', '').split()
        ).keys()
    )
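To check the idea against the sample from the question, here is a sketch that wraps the same steps in a function. The name `deobfuscate_address` is hypothetical, and `OrderedDict.fromkeys()` is swapped in for the generator of `(component, None)` pairs; it should be equivalent.

```python
from collections import OrderedDict


def deobfuscate_address(fragments):
    """Join fragments, drop commas, and keep the first occurrence of each token."""
    tokens = ' '.join(fragments).replace(',', '').split()
    # Iterating an OrderedDict yields its keys in first-insertion order.
    return ' '.join(OrderedDict.fromkeys(tokens))


# The obfuscated sample from the question.
obfuscated_address = [
    '39 Jerrabomberra Ave., Narrabundah', 'Canberra', ' ', '2604', ', ',
    'Australia', '39 Jerrabomberra Ave., Narrabundah', 'Canberra 2604, ',
    'Australia',
]

print(deobfuscate_address(obfuscated_address))
# 39 Jerrabomberra Ave. Narrabundah Canberra 2604 Australia
```

One caveat: deduplicating token by token also removes legitimate repeats, so an address in which the same word genuinely appears twice would lose the second occurrence.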
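`is` is buggy
`is` tests object identity, not equality, so it should be used only for singletons; in practice that means 99% of its uses are comparisons to `None`. To compare against the empty string, use `==` or `!=` instead.
`1000 is 1000` is not guaranteed to evaluate to `True`.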
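A short sketch of the difference, assuming nothing beyond the standard library. The helper name `is_blank` is made up for illustration. The identity result is deliberately left without an expected value because it depends on the interpreter. Recent Python versions (3.8 and later) also emit a `SyntaxWarning` when `is` is used with a literal.

```python
def is_blank(component):
    """Compare by value; this does not depend on how strings are cached."""
    return component == ""


# Build two equal integers at runtime so they are separate computations.
a = int("1000")
b = int("1000")

values_equal = (a == b)  # value comparison: always True here
same_object = (a is b)   # identity: an implementation detail, do not rely on it

print(values_equal)
print(is_blank(""), is_blank("Canberra"))
```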