Extract regular words from string but retain all other elements and record their type


Problem

This snippet processes every regular (\w+) word in a text and reinserts the processed version:

import re
import time


def process(word: str) -> str:
    time.sleep(0.05)  # Processing takes a while...
    return word.title()


text = """This is just a text!
Newlines should work.


Multiple ones as well,           as well as arbitrary spaces.
Super-hyphenated long-lasting and overly-complex words should also work.

Arbitrary punctuation; has to work... Because? Why not!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
"""

processed_items = []

items = re.split(r"(\w+)", text)

for item in items:
    # Have to check for word *again* in order to skip unnecessary processing, even
    # though we just matched/found all words.
    is_word = re.match(r"\w+", item) is not None
    if is_word:
        item = process(item)
    processed_items.append(item)

processed_text = "".join(processed_items)

print(processed_text)

Processing is expensive, so we would like to skip non-word elements, of which there can be many of arbitrary types.

In the current version, this requires matching with a word regex twice. There should be a way to split and process the input only once, cutting the regex effort in half. That would require some more structure. One possible data structure I had in mind is a list of (possibly named) tuples like:

items = [
    ("Hello", True),
    ("World", True),
    ("!", False),
]

where the second element indicates whether the element is a word. This would spare us from having to call re.match(r"\w+", item) a second time. However, as before, splitting "Hello World!" into the above three elements requires word-splitting in the first place.

Solution

Your biggest problem is the use of split(), which indiscriminately mixes matches and non-matches. Instead, use finditer and explicitly define two groups: words and non-words.

import re
import time
from typing import Iterator


def process(word: str) -> str:
    time.sleep(0.05)  # Processing takes a while...
    return word.title()


WORD_PAT = re.compile(
    r'''
        (?P<notword>\W*)  # named capturing group: non-word characters
        (?P<word>\w*)     # named capturing group: word characters
    ''',
    re.VERBOSE,
)


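def split_and_process(text: str) -> Iterator[str]:
    for match in WORD_PAT.finditer(text):
        yield match.group('notword')
        word = match.group('word')
        # The word group can be empty (e.g. at the end of the text); skip the
        # expensive call in that case.
        yield process(word) if word else word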


def test() -> None:
    text = """This is just a text!
Newlines should work.


Multiple ones as well,           as well as arbitrary spaces.
Super-hyphenated long-lasting and overly-complex words should also work.

Arbitrary punctuation; has to work... Because? Why not!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
"""

    processed_text = "".join(split_and_process(text))
    print(processed_text)


if __name__ == '__main__':
    test()
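
If you actually want the (text, is_word) structure from the question rather than processing on the fly, the same single-pass idea can produce it. The following is only a sketch: the names Item, TOKEN_PAT and tag_items are made up for illustration, and it uses an alternation of two non-empty groups instead of the pattern above, relying on match.lastgroup to tell which group matched. Note that, unlike the three-element example in the question, whitespace shows up as its own non-word item so that joining the items reproduces the input.

```python
import re
from typing import Iterator, NamedTuple


class Item(NamedTuple):
    # Hypothetical record type: the matched text and whether it is a word.
    text: str
    is_word: bool


# Each match is either a run of word characters or a run of non-word
# characters; exactly one of the two named groups participates.
TOKEN_PAT = re.compile(r"(?P<word>\w+)|(?P<notword>\W+)")


def tag_items(text: str) -> Iterator[Item]:
    for match in TOKEN_PAT.finditer(text):
        # lastgroup holds the name of the group that matched.
        yield Item(match.group(), match.lastgroup == "word")


items = list(tag_items("Hello World!"))
print(items)
```

Each item carries its type, so a later loop can call process() only where is_word is true, without running the regex a second time.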
