Filtering string by given keywords

Problem

I'm currently working on keyword filtering. I have two lists, which I call POSITIVE and NEGATIVE.

Scenarios:

  1. If POSITIVE matches and NEGATIVE does not match = True
  2. If POSITIVE matches and NEGATIVE matches = False
  3. If POSITIVE does not match (then there is no need to check NEGATIVE) = False

I have done something like this:

import re
from typing import List

# In the future it will be a longer list of keywords
POSITIVE: List[str] = [
    "hello+world",
    "python",
    "no"
]

# In the future it will be a longer list of keywords
NEGATIVE: List[str] = [
    "java",
    "c#",
    "ruby+js"
]


def filter_keywords(string) -> bool:
    def check_all(sentence, words):
        return all(re.search(r'\b{}\b'.format(word), sentence) for word in words)

    if not any(check_all(string.lower(), word.split("+")) for word in POSITIVE) or \
            any(check_all(string.lower(), word.split("+")) for word in NEGATIVE):
        filtered = False
    else:
        filtered = True

    print(f"Your text: {string} is filtered as: {filtered}")
    return filtered


filter_keywords("Hello %#U&¤AU(1! World ¤!%!java")
filter_keywords("Hello %#U&¤AU(1! World ¤!%!")
filter_keywords("I love Python more than ruby and js")
filter_keywords("I love Python more than myself!")

Regarding the + in the list: it means that if both hello and world are in the string, then it's a positive match.

It does work, of course, but I believe I am doing too many “calculations” and that the code could be shortened. Is that possible?

Solution

That’s a lot of regexes to compute. I suggest that you run only one regex, to split the string into words, and from there express the filter as a series of set operations.

If I understand it correctly, you basically have two different cases in your filter words:

  • A single word, which produces a match
  • Multiple words, which – regardless of the order found – all need to be present to produce a match

In the second case, I don’t think it’s a good idea to include a magic plus-sign to indicate string separation in static application data. Just write the separation in yourself, as represented by a multi-element set.
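
If the keywords already exist in the plus-separated form, a one-time conversion could produce the set-based structure instead of rewriting them by hand. This is only a minimal sketch: split_keywords is a hypothetical helper name, not something from the original code.

```python
from typing import Iterable, List, Set, Tuple


def split_keywords(keywords: Iterable[str]) -> Tuple[Set[str], List[Set[str]]]:
    """Split "+"-separated keywords into single words and multi-word sets.

    Hypothetical helper, shown for illustration only.
    """
    single: Set[str] = set()
    multi: List[Set[str]] = []
    for keyword in keywords:
        parts = set(keyword.lower().split("+"))
        if len(parts) == 1:
            # A keyword without "+" matches on its own.
            single |= parts
        else:
            # All parts must be present, in any order.
            multi.append(parts)
    return single, multi


single, multi = split_keywords(["hello+world", "python", "no"])
print(single)
print(multi)
```

The conversion runs once at start-up, so the filter itself never has to parse the plus-sign.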

The single-word match case is going to execute much more quickly than what you have now, because it takes only one disjointness check between sets for each of the positive and negative terms. If that check finds a match, the multi-word case is short-circuited and never computed. The multi-word case is slower because every keyword set has to be tested as a subset of the words in the string, but I still expect it to be faster than the iterative-regex approach.
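
The two set operations mentioned above can be tried in isolation. The sketch below uses made-up sample data (words, single_positive, and multi_negative are illustrative names) to show what each check reports.

```python
# Words as they might come out of the tokenizing step, already lower-cased.
words = {"i", "love", "python", "more", "than", "js"}

single_positive = {"python", "no"}
multi_negative = ({"ruby", "js"},)

# isdisjoint() is True when the two sets share no element, so
# "not isdisjoint" means at least one single keyword is present.
has_single = not words.isdisjoint(single_positive)

# "a <= b" is the subset test: every word of the phrase must be present.
has_multi = any(phrase <= words for phrase in multi_negative)

print(has_single)  # True: "python" is in the text
print(has_multi)   # False: "js" is present but "ruby" is not
```

Because the two checks are joined with or in the suggested code, Python skips the subset tests whenever the disjointness check already found a match.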

Also note that you should remove your boolean variable and print statement in your function, and simplify your boolean expression by applying De Morgan’s Law to switch up your True and False and produce a single return expression.
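
To make the De Morgan step concrete, here is a small sketch comparing the shape of the original if/else with the single return expression. The function names original and simplified are illustrative only; the two booleans stand in for "a positive keyword matched" and "a negative keyword matched".

```python
from itertools import product


def original(positive: bool, negative: bool) -> bool:
    # Shape of the original code: an if/else that assigns a flag.
    if not positive or negative:
        filtered = False
    else:
        filtered = True
    return filtered


def simplified(positive: bool, negative: bool) -> bool:
    # not (not P or N)  ==  P and not N   (De Morgan's law)
    return positive and not negative


# Check all four combinations of the two inputs.
for positive, negative in product((False, True), repeat=2):
    assert original(positive, negative) == simplified(positive, negative)
```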

Suggested

import re
from typing import Tuple, Iterable, Set


# In the future it will be a longer list of keywords
SINGLE_POSITIVE: Set[str] = {
    "python",
    "no",
}
MULTI_POSITIVE: Tuple[Set[str], ...] = (
    {"hello", "world"},
)

# In the future it will be a longer list of keywords
SINGLE_NEGATIVE: Set[str] = {
    "java",
    "c#",
}
MULTI_NEGATIVE: Tuple[Set[str], ...] = (
    {"ruby", "js"},
)


find_words = re.compile(r'\w+').findall


def filter_keywords(string: str) -> bool:
    words = {word.lower() for word in find_words(string)}

    def matches(single: Set[str], multi: Iterable[Set[str]]) -> bool:
        return (
            (not words.isdisjoint(single))
            or any(multi_word <= words for multi_word in multi)
        )

    return (
        matches(SINGLE_POSITIVE, MULTI_POSITIVE) and not
        matches(SINGLE_NEGATIVE, MULTI_NEGATIVE)
    )


def test() -> None:
    for string in (
        "Hello %#U&¤AU(1! World ¤!%!java",
        "Hello %#U&¤AU(1! World ¤!%!",
        "I love Python more than ruby and js",
        "I love Python more than myself!",
        "I love Python more than js",
    ):
        filtered = filter_keywords(string)
        print(f"Your text: {string} is filtered as: {filtered}")


if __name__ == '__main__':
    test()

Output

Your text: Hello %#U&¤AU(1! World ¤!%!java is filtered as: False
Your text: Hello %#U&¤AU(1! World ¤!%! is filtered as: True
Your text: I love Python more than ruby and js is filtered as: False
Your text: I love Python more than myself! is filtered as: True
Your text: I love Python more than js is filtered as: True
