Find if a dataset of words contains any words containing all vowels by using exactly 5 calls to str_detect

Posted on

Problem

Take the words dataset from stringr. An R4DS exercise challenges the reader to check if there are any words in that dataset that contain every vowel. My presumably correct solution, using five calls to str_detect, was as follows:

library(stringr)
words[apply(sapply(c("a", "e", "i", "o", "u"), function(x) str_detect(words, x)), 1, all)]

Under the restriction that we must call str_detect exactly five times, once for each vowel, is there a better way to do this? The sapply seems like the right idea – it produces a 980*5 logical matrix with each row being a logical vector showing if the corresponding vowel was in the corresponding word. Collapsing it with all also seems like the right idea, but any time that I have to call apply rather than any other member of the apply family or something like Filter, I become concerned about my approach. R’s higher-order functions typically work best by column rather than by row and apply is a sign that I’ve failed to work like that.

Suggestions from either base R or the Tidyverse will be accepted. My primary reason for asking this question is that I have a strong feeling that the apply could have been avoided, perhaps by construction the logical matrix differently.

Solution

Sample data

vowels <- c("a", "e", "i", "o", "u")
input <- c("aeiou", "uoiea", "aeiouc", "aeioo", "coeiauc1", "eu")

Solution 1. Use reduce and &

These create 5 vectors, but only have 2 loaded at a time.

# 1A: with str_detect
input[Reduce(function(x, y) x & str_detect(input, y), vowels, init = TRUE)]

# 1B: with grepl
input[Reduce(function(x, y) x & grepl(y, input), vowels, init = TRUE)]

Solution 2. Use positive lookaheads

You can find a complete explanation here. We are essentially checking for the presence of all the vowels.

# 2A: with str_detect
input[str_detect(input, "(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u)")]

# 2B: with grep
grep("(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u)", input, perl = TRUE, value = TRUE)

Solution 3. Use strsplit and match

We can split all the words into individual characters, and then compare vowels to the letters of each word. All letters that aren’t matched become 0. Since 0 converts to FALSE as a logical, we can use all to see if all vowels are present.

input[sapply(strsplit(input, ""), function(x) all(match(vowels, x, nomatch = 0L)))]

Solution 4. lapply and grep

We can also choose to generate a vector for each vowel, but just a short one, as grep will only return the indices for which the conditions apply. We then use Reduce and intersect to get our answer. Alternatively, we can use skip lapply (solution 4B), but this is slower.

# 4A: with lapply
Reduce(intersect, lapply(vowels, grep, input, value = TRUE))

# 4B: without lapply
Reduce(function(x, y) intersect(x, grep(y, input, value = TRUE)), vowels, init = input)

Benchmark of all solutions

Unit: microseconds
               expr    min      lq      mean median      uq     max neval    cld
    original(words) 1766.7 1936.75 2238.6696 2082.7 2237.00 60182.5 10000      f
 solution_1a(words)  706.7  746.10  839.5126  802.6  868.70  4362.7 10000   c   
 solution_1b(words)  492.5  515.40  586.4063  555.5  605.15 30031.6 10000 a     
 solution_2a(words)  779.4  817.20  906.0936  875.8  940.55 29357.0 10000    d  
 solution_2b(words)  707.4  737.70  819.2046  790.7  854.10  3072.7 10000   c   
  solution_3(words) 1226.4 1376.55 1667.8316 1470.9 1600.95 76907.0 10000     e 
 solution_4a(words)  592.2  641.00  736.0374  689.4  755.20 29589.7 10000  b    
 solution_4b(words)  649.5  704.20  805.9942  757.8  828.80 31104.2 10000   c      

Memory allocation with bench::mark:

                Code    Memory allocation (KB)
1    original(words)                    100.15
2 solution_1a(words)                     42.63
3 solution_1b(words)                     42.63
4 solution_2a(words)                      7.76
5 solution_2b(words)                      3.88
6  solution_3(words)                     38.96
7 solution_4a(words)                     75.67
8 solution_4b(words)                    106.70

Leave a Reply

Your email address will not be published. Required fields are marked *