# Find if a dataset of words contains any words containing all vowels by using exactly 5 calls to str_detect

Posted on

Problem

Take the `words` dataset from `stringr`. An R4DS exercise challenges the reader to check if there are any words in that dataset that contain every vowel. My presumably correct solution, using five calls to `str_detect`, was as follows:

``````library(stringr)
words[apply(sapply(c("a", "e", "i", "o", "u"), function(x) str_detect(words, x)), 1, all)]
``````

Under the restriction that we must call `str_detect` exactly five times, once for each vowel, is there a better way to do this? The `sapply` seems like the right idea – it produces a 980*5 logical matrix with each row being a logical vector showing if the corresponding vowel was in the corresponding word. Collapsing it with `all` also seems like the right idea, but any time that I have to call `apply` rather than any other member of the apply family or something like `Filter`, I become concerned about my approach. R’s higher-order functions typically work best by column rather than by row and `apply` is a sign that I’ve failed to work like that.

Suggestions from either base R or the Tidyverse will be accepted. My primary reason for asking this question is that I have a strong feeling that the `apply` could have been avoided, perhaps by construction the logical matrix differently.

Solution

Sample data

``````vowels <- c("a", "e", "i", "o", "u")
input <- c("aeiou", "uoiea", "aeiouc", "aeioo", "coeiauc1", "eu")
``````

Solution 1. Use `reduce` and `&`

These create 5 vectors, but only have 2 loaded at a time.

``````# 1A: with str_detect
input[Reduce(function(x, y) x & str_detect(input, y), vowels, init = TRUE)]

# 1B: with grepl
input[Reduce(function(x, y) x & grepl(y, input), vowels, init = TRUE)]
``````

You can find a complete explanation here. We are essentially checking for the presence of all the vowels.

``````# 2A: with str_detect
input[str_detect(input, "(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u)")]

# 2B: with grep
grep("(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u)", input, perl = TRUE, value = TRUE)
``````

Solution 3. Use `strsplit` and `match`

We can split all the words into individual characters, and then compare `vowels` to the letters of each word. All letters that aren’t matched become `0`. Since `0` converts to `FALSE` as a logical, we can use `all` to see if all vowels are present.

``````input[sapply(strsplit(input, ""), function(x) all(match(vowels, x, nomatch = 0L)))]
``````

Solution 4. `lapply` and `grep`

We can also choose to generate a vector for each vowel, but just a short one, as `grep` will only return the indices for which the conditions apply. We then use `Reduce` and `intersect` to get our answer. Alternatively, we can use skip `lapply` (solution 4B), but this is slower.

``````# 4A: with lapply
Reduce(intersect, lapply(vowels, grep, input, value = TRUE))

# 4B: without lapply
Reduce(function(x, y) intersect(x, grep(y, input, value = TRUE)), vowels, init = input)
``````

Benchmark of all solutions

``````Unit: microseconds
expr    min      lq      mean median      uq     max neval    cld
original(words) 1766.7 1936.75 2238.6696 2082.7 2237.00 60182.5 10000      f
solution_1a(words)  706.7  746.10  839.5126  802.6  868.70  4362.7 10000   c
solution_1b(words)  492.5  515.40  586.4063  555.5  605.15 30031.6 10000 a
solution_2a(words)  779.4  817.20  906.0936  875.8  940.55 29357.0 10000    d
solution_2b(words)  707.4  737.70  819.2046  790.7  854.10  3072.7 10000   c
solution_3(words) 1226.4 1376.55 1667.8316 1470.9 1600.95 76907.0 10000     e
solution_4a(words)  592.2  641.00  736.0374  689.4  755.20 29589.7 10000  b
solution_4b(words)  649.5  704.20  805.9942  757.8  828.80 31104.2 10000   c
``````

Memory allocation with `bench::mark`:

``````                Code    Memory allocation (KB)
1    original(words)                    100.15
2 solution_1a(words)                     42.63
3 solution_1b(words)                     42.63
4 solution_2a(words)                      7.76
5 solution_2b(words)                      3.88
6  solution_3(words)                     38.96
7 solution_4a(words)                     75.67
8 solution_4b(words)                    106.70
``````