Problem
This snippet processes every regular (\w+) word in a text and reinserts the processed version:
import re
import time


def process(word: str) -> str:
    time.sleep(0.05)  # Processing takes a while...
    return word.title()


text = """This is just a text!
Newlines should work.
Multiple ones as well, as well as arbitrary spaces.
Super-hyphenated long-lasting and overly-complex words should also work.
Arbitrary punctuation; has to work... Because? Why not!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
"""

processed_items = []
items = re.split(r"(\w+)", text)
for item in items:
    # Have to check for word *again* in order to skip unnecessary processing, even
    # though we just matched/found all words.
    is_word = re.match(r"\w+", item) is not None
    if is_word:
        item = process(item)
    processed_items.append(item)

processed_text = "".join(processed_items)
print(processed_text)
Processing is expensive, so we would like to skip non-word elements, of which there can be many, and of arbitrary kinds.
In the current version, that requires matching with a word regex twice: once to split the text and once more to check each element. There should be a way to process/split the input only once, cutting the regex effort in half. That would require some more structure. A possible data structure I had in mind is a list of (eventually named) tuples like:
items = [
    ("Hello", True),
    ("World", True),
    ("!", False),
]
where the second element indicates whether the element is a word. This would spare us from having to re.match(r"\w+", item) a second time. However, as before, splitting "Hello World!" into the above three elements requires word-splitting in the first place.
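
For concreteness, a minimal sketch of building such a structure, relying on the documented behaviour of re.split with a single capturing group (the result alternates between unmatched text and captured words, so the words end up at odd indices):

import re

parts = re.split(r"(\w+)", "Hello World!")  # ['', 'Hello', ' ', 'World', '!']
# Odd indices hold the captured words, even indices the text in between.
items = [(part, index % 2 == 1) for index, part in enumerate(parts)]
# items == [('', False), ('Hello', True), (' ', False), ('World', True), ('!', False)]

Note that re.split also yields an empty string before a leading word, hence the extra ('', False) entry compared with the three-element example above.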
Solution
Your biggest problem is the use of split(). It indiscriminately mixes matches and non-matches. Instead, just use finditer() and explicitly define two groups: words and non-words.
import re
import time
from typing import Iterator


def process(word: str) -> str:
    time.sleep(0.05)  # Processing takes a while...
    return word.title()


WORD_PAT = re.compile(
    r'''
    (?P<notword>\W*)  # named capturing group: non-word characters
    (?P<word>\w*)     # named capturing group: word characters
    ''',
    re.VERBOSE,
)


def split_and_process(text: str) -> Iterator[str]:
    for match in WORD_PAT.finditer(text):
        yield match.group('notword')
        yield process(match.group('word'))


def test() -> None:
    text = """This is just a text!
Newlines should work.
Multiple ones as well, as well as arbitrary spaces.
Super-hyphenated long-lasting and overly-complex words should also work.
Arbitrary punctuation; has to work... Because? Why not!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
"""
    processed_text = "".join(split_and_process(text))
    print(processed_text)


if __name__ == '__main__':
    test()
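
As a quick sanity check, here is a small illustration (a sketch, assuming the same WORD_PAT as above) of how the pattern pairs each run of non-word characters with the word that follows it:

for match in WORD_PAT.finditer("Hello World!"):
    print(repr(match.group('notword')), repr(match.group('word')))

This should pair '' with 'Hello', ' ' with 'World', and '!' with an empty word (plus possibly a trailing all-empty pair, since both groups can match the empty string); split_and_process simply yields those pieces in order, passing the word group through process().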