Problem
I'm currently working on keyword filtering. I have two lists, which I call POSITIVE and NEGATIVE.
Scenarios:
- if POSITIVE matches and NOT NEGATIVE matches = True
- if POSITIVE matches and NEGATIVE matches = False
- if NOT POSITIVE (then no need to check NEGATIVE) = False
I have done something like this:
import re
from typing import List

# In the future it will be a longer list of keywords
POSITIVE: List[str] = [
    "hello+world",
    "python",
    "no"
]

# In the future it will be a longer list of keywords
NEGATIVE: List[str] = [
    "java",
    "c#",
    "ruby+js"
]

def filter_keywords(string) -> bool:
    def check_all(sentence, words):
        return all(re.search(r'\b{}\b'.format(word), sentence) for word in words)

    if not any(check_all(string.lower(), word.split("+")) for word in POSITIVE) or \
            any(check_all(string.lower(), word.split("+")) for word in NEGATIVE):
        filtered = False
    else:
        filtered = True

    print(f"Your text: {string} is filtered as: {filtered}")
    return filtered

filter_keywords("Hello %#U&¤AU(1! World ¤!%!java")
filter_keywords("Hello %#U&¤AU(1! World ¤!%!")
filter_keywords("I love Python more than ruby and js")
filter_keywords("I love Python more than myself!")
Regarding the + in the list: it means that if both hello and world are in the string, then it's a positive match.
It does work, of course, but I believe I am doing too many “calculations” and that it could be shortened. Is that possible?
Solution
That’s a lot of regexes to compute. I’m going to suggest that you only run one regex, to split the string up into words; and from there represent this as a series of set operations.
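As a rough sketch of that first step, a single `\w+` pattern can tokenize the input once, after which everything is plain set membership. The names `WORD_PATTERN` and `tokenize` are made up for illustration and are not part of the original code.

```python
import re

# Hypothetical helper: one regex pass that extracts the word tokens.
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text: str) -> set:
    """Return the set of lowercase word tokens found in text."""
    return {token.lower() for token in WORD_PATTERN.findall(text)}


tokens = tokenize("Hello %#U&¤AU(1! World ¤!%!java")
# Punctuation and symbols act as separators, so "hello", "world" and
# "java" all end up as separate tokens.
print(sorted(tokens))
```

Because the result is a set, repeated words cost nothing extra and later lookups do not depend on word order.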
If I understand it correctly, you basically have two different cases in your filter words:
- A single word, which produces a match
- Multiple words, which – regardless of the order found – all need to be present to produce a match
In the second case, I don’t think it’s a good idea to include a magic plus-sign to indicate string separation in static application data. Just write the separation in yourself, as represented by a multi-element set.
The single-word match case is going to execute much more quickly than what you have now, because it takes only one set-disjointness check for each of the positive and negative terms. If such a match is found, the multi-word case is short-circuited away and never computed. The multi-word case is slower because every sub-set needs to be checked, but I still expect it to be faster than the iterative-regex approach.
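To make the two checks concrete, here is a small sketch using hand-written example sets (the variable names are illustrative only). `isdisjoint` covers the single-word case and the subset operator `<=` covers the multi-word case.

```python
# Example token set, as it might come out of the word-splitting step.
words = {"i", "love", "python", "more", "than", "js"}

positive_single = {"python", "no"}
negative_single = {"java"}
negative_multi = ({"ruby", "js"},)

# Single-word case: a hit as soon as the two sets share any element.
positive_hit = not words.isdisjoint(positive_single)
negative_single_hit = not words.isdisjoint(negative_single)

# Multi-word case: every word of at least one group must be present.
negative_multi_hit = any(group <= words for group in negative_multi)

print(positive_hit, negative_single_hit, negative_multi_hit)  # True False False
```

When the two checks are joined with `or`, Python evaluates the subset tests only if the disjointness check found nothing, which is the short-circuiting described above.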
Also, remove the boolean variable and the print statement from your function, and simplify the boolean expression by applying De Morgan’s Law to swap your True and False branches, leaving a single return expression.
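The De Morgan step can be checked exhaustively. The sketch below uses two illustrative functions (not from the original code) that take the positive and negative match results as plain booleans and confirms that both forms agree on every input.

```python
from itertools import product


def original_form(positive: bool, negative: bool) -> bool:
    # Mirrors the if/else structure from the question.
    if not positive or negative:
        return False
    return True


def simplified_form(positive: bool, negative: bool) -> bool:
    # De Morgan: not (not P or N) is equivalent to P and not N.
    return positive and not negative


# Compare the two forms over all four combinations of inputs.
for positive, negative in product((False, True), repeat=2):
    assert original_form(positive, negative) == simplified_form(positive, negative)
```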
Suggested
import re
from typing import Tuple, Iterable, Set

# In the future it will be a longer list of keywords
SINGLE_POSITIVE: Set[str] = {
    "python",
    "no",
}
MULTI_POSITIVE: Tuple[Set[str], ...] = (
    {"hello", "world"},
)

# In the future it will be a longer list of keywords
SINGLE_NEGATIVE: Set[str] = {
    "java",
    "c#",
}
MULTI_NEGATIVE: Tuple[Set[str], ...] = (
    {"ruby", "js"},
)

find_words = re.compile(r'\w+').findall

def filter_keywords(string: str) -> bool:
    words = {word.lower() for word in find_words(string)}

    def matches(single: Set[str], multi: Iterable[Set[str]]) -> bool:
        return (
            (not words.isdisjoint(single))
            or any(multi_word <= words for multi_word in multi)
        )

    return (
        matches(SINGLE_POSITIVE, MULTI_POSITIVE) and not
        matches(SINGLE_NEGATIVE, MULTI_NEGATIVE)
    )

def test() -> None:
    for string in (
        "Hello %#U&¤AU(1! World ¤!%!java",
        "Hello %#U&¤AU(1! World ¤!%!",
        "I love Python more than ruby and js",
        "I love Python more than myself!",
        "I love Python more than js",
    ):
        filtered = filter_keywords(string)
        print(f"Your text: {string} is filtered as: {filtered}")

if __name__ == '__main__':
    test()
Output
Your text: Hello %#U&¤AU(1! World ¤!%!java is filtered as: False
Your text: Hello %#U&¤AU(1! World ¤!%! is filtered as: True
Your text: I love Python more than ruby and js is filtered as: False
Your text: I love Python more than myself! is filtered as: True
Your text: I love Python more than js is filtered as: True
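One caveat to check before relying on these keyword sets: `\w+` never produces a token containing `#`, so an entry such as "c#" cannot match as written. A possible workaround, sketched here under the assumption that `#` should count as part of a word, is to widen the character class. The pattern and variable names below are illustrative only.

```python
import re

plain_words = re.compile(r"\w+").findall
# Assumption: treat '#' as a word character so tokens like "c#" survive.
hash_words = re.compile(r"[\w#]+").findall

text = "I prefer C# over Java"
plain_tokens = {word.lower() for word in plain_words(text)}
hash_tokens = {word.lower() for word in hash_words(text)}

print("c#" in plain_tokens)  # False: the '#' was dropped, leaving just "c"
print("c#" in hash_tokens)   # True
```

The trade-off is that stray `#` characters in noisy input now stick to neighbouring letters (for example, `#U` becomes the token `#u` rather than `u`), so whether this is acceptable depends on the data being filtered.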