Problem
Take the words
dataset from stringr
. An R4DS exercise challenges the reader to check if there are any words in that dataset that contain every vowel. My presumably correct solution, using five calls to str_detect
, was as follows:
library(stringr)
words[apply(sapply(c("a", "e", "i", "o", "u"), function(x) str_detect(words, x)), 1, all)]
Under the restriction that we must call str_detect
exactly five times, once for each vowel, is there a better way to do this? The sapply
seems like the right idea – it produces a 980*5 logical matrix with each row being a logical vector showing if the corresponding vowel was in the corresponding word. Collapsing it with all
also seems like the right idea, but any time that I have to call apply
rather than any other member of the apply family or something like Filter
, I become concerned about my approach. R’s higher-order functions typically work best by column rather than by row and apply
is a sign that I’ve failed to work like that.
Suggestions from either base R or the Tidyverse will be accepted. My primary reason for asking this question is that I have a strong feeling that the apply
could have been avoided, perhaps by construction the logical matrix differently.
Solution
Sample data
vowels <- c("a", "e", "i", "o", "u")
input <- c("aeiou", "uoiea", "aeiouc", "aeioo", "coeiauc1", "eu")
Solution 1. Use reduce
and &
These create 5 vectors, but only have 2 loaded at a time.
# 1A: with str_detect
input[Reduce(function(x, y) x & str_detect(input, y), vowels, init = TRUE)]
# 1B: with grepl
input[Reduce(function(x, y) x & grepl(y, input), vowels, init = TRUE)]
Solution 2. Use positive lookaheads
You can find a complete explanation here. We are essentially checking for the presence of all the vowels.
# 2A: with str_detect
input[str_detect(input, "(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u)")]
# 2B: with grep
grep("(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u)", input, perl = TRUE, value = TRUE)
Solution 3. Use strsplit
and match
We can split all the words into individual characters, and then compare vowels
to the letters of each word. All letters that aren’t matched become 0
. Since 0
converts to FALSE
as a logical, we can use all
to see if all vowels are present.
input[sapply(strsplit(input, ""), function(x) all(match(vowels, x, nomatch = 0L)))]
Solution 4. lapply
and grep
We can also choose to generate a vector for each vowel, but just a short one, as grep
will only return the indices for which the conditions apply. We then use Reduce
and intersect
to get our answer. Alternatively, we can use skip lapply
(solution 4B), but this is slower.
# 4A: with lapply
Reduce(intersect, lapply(vowels, grep, input, value = TRUE))
# 4B: without lapply
Reduce(function(x, y) intersect(x, grep(y, input, value = TRUE)), vowels, init = input)
Benchmark of all solutions
Unit: microseconds
expr min lq mean median uq max neval cld
original(words) 1766.7 1936.75 2238.6696 2082.7 2237.00 60182.5 10000 f
solution_1a(words) 706.7 746.10 839.5126 802.6 868.70 4362.7 10000 c
solution_1b(words) 492.5 515.40 586.4063 555.5 605.15 30031.6 10000 a
solution_2a(words) 779.4 817.20 906.0936 875.8 940.55 29357.0 10000 d
solution_2b(words) 707.4 737.70 819.2046 790.7 854.10 3072.7 10000 c
solution_3(words) 1226.4 1376.55 1667.8316 1470.9 1600.95 76907.0 10000 e
solution_4a(words) 592.2 641.00 736.0374 689.4 755.20 29589.7 10000 b
solution_4b(words) 649.5 704.20 805.9942 757.8 828.80 31104.2 10000 c
Memory allocation with bench::mark
:
Code Memory allocation (KB)
1 original(words) 100.15
2 solution_1a(words) 42.63
3 solution_1b(words) 42.63
4 solution_2a(words) 7.76
5 solution_2b(words) 3.88
6 solution_3(words) 38.96
7 solution_4a(words) 75.67
8 solution_4b(words) 106.70