Problem
I’m working on a port of NLTK to Rust.
I am fairly new to Rust, so I wanted to post a small file to check if I was writing idiomatic Rust. I've included the original Python. The Python has docstrings that describe usage, so for brevity I didn't include comments in the Rust.
This file passes all of the tests we’ve written, so we’re certain that it mostly works.
util.py
```python
from re import finditer


def string_span_tokenize(s, sep):
    r"""
    Return the offsets of the tokens in *s*, as a sequence of ``(start, end)``
    tuples, by splitting the string at each occurrence of *sep*.

        >>> from nltk.tokenize.util import string_span_tokenize
        >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
        ... two of them.\n\nThanks.'''
        >>> list(string_span_tokenize(s, " "))
        [(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37),
        (38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]

    :param s: the string to be tokenized
    :type s: str
    :param sep: the token separator
    :type sep: str
    :rtype: iter(tuple(int, int))
    """
    if len(sep) == 0:
        raise ValueError("Token delimiter must not be empty")
    left = 0
    while True:
        try:
            right = s.index(sep, left)
            if right != 0:
                yield left, right
        except ValueError:
            if left != len(s):
                yield left, len(s)
            break
        left = right + len(sep)


def regexp_span_tokenize(s, regexp):
    r"""
    Return the offsets of the tokens in *s*, as a sequence of ``(start, end)``
    tuples, by splitting the string at each successive match of *regexp*.

        >>> from nltk.tokenize import WhitespaceTokenizer
        >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
        ... two of them.\n\nThanks.'''
        >>> list(WhitespaceTokenizer().span_tokenize(s))
        [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
        (38, 44), (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]

    :param s: the string to be tokenized
    :type s: str
    :param regexp: regular expression that matches token separators
    :type regexp: str
    :rtype: iter(tuple(int, int))
    """
    left = 0
    for m in finditer(regexp, s):
        right, next = m.span()
        if right != 0:
            yield left, right
        left = next
    yield left, len(s)


def spans_to_relative(spans):
    r"""
    Return a sequence of relative spans, given a sequence of spans.

        >>> from nltk.tokenize import WhitespaceTokenizer
        >>> from nltk.tokenize.util import spans_to_relative
        >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
        ... two of them.\n\nThanks.'''
        >>> list(spans_to_relative(WhitespaceTokenizer().span_tokenize(s)))
        [(0, 4), (1, 7), (1, 4), (1, 5), (1, 2), (1, 3), (1, 5), (2, 6),
        (1, 3), (1, 2), (1, 3), (1, 2), (1, 5), (2, 7)]

    :param spans: a sequence of (start, end) offsets of the tokens
    :type spans: iter(tuple(int, int))
    :rtype: iter(tuple(int, int))
    """
    prev = 0
    for left, right in spans:
        yield left - prev, right - left
        prev = right
```
util.rs
```rust
extern crate regex;

use regex::Regex;

pub fn string_span_tokenize(s: &str, sep: &str) -> Result<Vec<(usize, usize)>, String> {
    if sep.len() == 0 {
        Err(String::from("Error! Separator has a length of 0!"))
    } else {
        // TODO: we'll likely want to do some error checking
        // to ensure s.len() and str.len() don't exceed usize::MAX
        let strlen = s.len();
        let seplen = sep.len();
        let mut result: Vec<(usize, usize)> = Vec::new();
        let mut left = 0;
        let mut r_idx;
        loop {
            let right = s[left..].find(sep); // TODO: Will this work on unicode?
            match right {
                Some(right_idx) => {
                    if right_idx != 0 {
                        result.push((left, right_idx));
                    }
                    r_idx = right_idx;
                },
                None => {
                    if left != strlen {
                        result.push((left, strlen));
                    }
                    break;
                }
            }
            left = r_idx + seplen;
        }
        return Ok(result);
    }
}

pub fn regexp_span_tokenize(s: &str, regexp: &regex::Regex) -> Vec<(usize, usize)> {
    let mut result: Vec<(usize, usize)> = Vec::new();
    let strlen = s.len();
    let mut left = 0;
    for (right, next) in regexp.find_iter(s) {
        if right != 0 {
            result.push((left, right));
        }
        left = next
    }
    result.push((left, strlen));
    return result;
}

pub fn spans_to_relative(spans: Vec<(usize, usize)>) -> Vec<(usize, usize)> {
    let mut prev = 0;
    let mut result: Vec<(usize, usize)> = Vec::new();
    for (left, right) in spans {
        result.push((left - prev, right - left));
        prev = right;
    }
    return result;
}
```
We decided to return `Vec`s instead of using generators because generators are verbose in Rust.
Solution
- You didn't include any of your Rust tests, so I can't make any guarantees that I preserved the semantics of your code. Additionally, these tests would have greatly helped me understand what your code was trying to do.

- > passes all of the tests we've written

  You will want to double-check your tests, then. Running `string_span_tokenize("hello world", "l")` leads to an infinite loop. Because of this (and the lack of documentation and tests), I can't tell what behavior you actually want for this method, so I just guessed.

- You have inconsistent handling of zero-width spans. `string_span_tokenize` prevents zero-width spans at the beginning and end, but `regexp_span_tokenize` only prevents them at the beginning. Neither method prevents zero-width spans in the middle of the string. For my rewrite, I allow zero-width spans everywhere. I'd expect you could add a `filter` call to remove all zero-width spans before `collect`ing if that is needed (see the first sketch after this list).

- You have inconsistent handling of zero-width separators. The string method complains if you pass `""`, but the regex version can accept an empty regex.

- You use `(usize, usize)` to mean both absolute and relative spans. This is an invitation to confuse the two and pass one where the other is expected. I did not make this change, but you should consider creating newtypes like `AbsoluteSpan(usize, usize)` and `RelativeSpan(usize, usize)`. You can have adapter methods to and from a tuple, and it would be a place to hang the conversion methods (see the second sketch after this list).

- > ensure `s.len()` and `str.len()` don't exceed `usize::MAX`

  Beyond the fact that you cannot exceed `usize::MAX` (not sure how you would exceed the maximum), remember that `usize` represents the native machine width (32-bit or 64-bit). If you could somehow exceed this, that means you'd have a string that was over 4 GiB or 16 EiB. It is likely you will run into other problems before you reach this point.

- Seeing many `push` calls should usually give you pause. There are often higher-level functions like `map` that are easier to understand and can actually be faster! For example, the `Vec` implementation of `collect` uses `size_hint` to pre-allocate the array.

- Avoid slicing strings and slices when possible and still understandable. These have a tiny amount of overhead, as they have to do out-of-bounds checks. Instead, try to use iterators, where the bounds checks can be avoided, giving you a little bit of a speedup.

- Don't top-load all your `let`s. Instead, define them as close as possible to where they are used. This keeps their scope smaller, which reduces the chance of programmer error and also tends to make refactoring easier in the long run. In Rust specifically, it often means you can avoid making things mutable.

- Avoid functions that are one complete if-else statement, especially if one branch is much longer than the other. Use an early return (a guard clause) for the short branch and dedent the happy path.

- It's very rare to need a `Vec<T>` as an argument. Instead, accept a `&[T]`. It's more general, as anything that looks like a slice can be used; for example, you can pass a reference to an array.

- Don't use explicit `return` statements at the end of methods.

- Don't use explicit types on variables unless the compiler tells you to add them. Specifically, your `Vec::new` calls don't need them.

- Use `is_empty` instead of `len() == 0`. It's easier to read and shorter.
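To make the `filter` suggestion concrete, here's a minimal sketch of my own (it's not part of the rewrite below): given spans from either tokenizer, one `filter` before `collect` drops the empty ones.

```rust
fn main() {
    // Pretend these spans came out of one of the tokenizers.
    let spans = vec![(0, 2), (3, 3), (4, 9), (10, 11)];

    // A span is zero-width exactly when its start equals its end.
    let non_empty: Vec<(usize, usize)> = spans
        .into_iter()
        .filter(|&(left, right)| left != right)
        .collect();

    assert_eq!(non_empty, [(0, 2), (4, 9), (10, 11)]);
}
```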
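And here is roughly what the newtype idea could look like. To be clear, this is only a sketch: the names `AbsoluteSpan` and `RelativeSpan` and their methods are illustrative, not anything from the code under review.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AbsoluteSpan(pub usize, pub usize);

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RelativeSpan(pub usize, pub usize);

impl AbsoluteSpan {
    // Adapter back to a plain tuple, for code that still wants one.
    pub fn as_tuple(&self) -> (usize, usize) {
        (self.0, self.1)
    }

    // The conversion logic gets an obvious home: a relative span is
    // (distance from the previous span's end, this span's width).
    pub fn to_relative(&self, prev_end: usize) -> RelativeSpan {
        RelativeSpan(self.0 - prev_end, self.1 - self.0)
    }
}

fn main() {
    let span = AbsoluteSpan(5, 12);
    assert_eq!(span.as_tuple(), (5, 12));
    assert_eq!(span.to_relative(4), RelativeSpan(1, 7));
}
```

With all of that said, here's my rewrite: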
```rust
extern crate regex;

use regex::Regex;

pub fn string_span_tokenize(s: &str, sep: &str) -> Result<Vec<(usize, usize)>, String> {
    if sep.is_empty() {
        return Err(String::from("Error! Separator has a length of 0!"));
    }

    let seplen = sep.len();

    let mut left = 0;
    let spans = s.split(sep).map(|piece| {
        let right = left + piece.len();
        let span = (left, right);
        left = right + seplen;
        span
    }).collect();

    Ok(spans)
}

pub fn regexp_span_tokenize(s: &str, regexp: &regex::Regex) -> Vec<(usize, usize)> {
    let mut left = 0;
    let mut spans: Vec<_> = regexp.find_iter(s).map(|(right, next)| {
        let span = (left, right);
        left = next;
        span
    }).collect();

    spans.push((left, s.len()));

    spans
}

pub fn spans_to_relative(spans: &[(usize, usize)]) -> Vec<(usize, usize)> {
    let mut prev = 0;
    spans.iter().map(|&(left, right)| {
        let span = (left - prev, right - left);
        prev = right;
        span
    }).collect()
}

fn main() {
    println!("{:?}", string_span_tokenize("hello world", "l"));
    // Ok([(0, 2), (3, 3), (4, 9), (10, 11)])

    println!("{:?}", regexp_span_tokenize("hello world", &regex::Regex::new("l").unwrap()));
    // [(0, 2), (3, 3), (4, 9), (10, 11)]

    println!("{:?}", spans_to_relative(&[(1,2), (3,4), (100, 200)]));
    // [(1, 1), (1, 1), (96, 100)]
}
```
string_span_tokenize
```rust
if sep.len() == 0 {
    Err(String::from("Error! Separator has a length of 0!"))
} else {
    // ...

    return Ok(result);
}
```
You use an implicit and an explicit return, but both could be expressed as implicit returns. Alternatively, change the `Err` path to use `return`; then you don't need an `else` block and you can reduce indentation:
```rust
if sep.len() == 0 {
    return Err(String::from("Error! Separator has a length of 0!"));
}

// ...

Ok(result)
```
```rust
// TODO: we'll likely want to do some error checking
// to ensure s.len() and str.len() don't exceed usize::MAX
```
You don't need to do that: `len()` returns a `usize`, and `usize::MAX` is the maximum value a `usize` can hold. A condition such as `s.len() > usize::MAX` would always be false.
```rust
let mut result: Vec<(usize, usize)> = Vec::new();
```
The type annotation here is not strictly necessary.
```rust
let right = s[left..].find(sep); // TODO: Will this work on unicode?
```
Strings in Rust are encoded in UTF-8, and the standard library's string functions understand that encoding, so both the string and the separator can contain fancy characters. One thing to keep in mind is that `find` returns byte offsets rather than character counts, so for non-ASCII text the spans will differ from what the Python version reports.
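As a quick illustration of the byte-offset behavior (my own example, not from the original post):

```rust
fn main() {
    let s = "naïve caïman";

    // `find` returns a byte offset: 'ï' takes two bytes in UTF-8,
    // so "caïman" starts at byte 7, not character index 6.
    assert_eq!(s.find("caïman"), Some(7));

    // Offsets returned by `find` always fall on character boundaries,
    // so slicing with them cannot panic.
    assert_eq!(&s[7..], "caïman");
}
```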
```rust
match right {
    Some(right_idx) => {
        if right_idx != 0 {
            result.push((left, right_idx));
        }
        r_idx = right_idx;
    },
    None => {
        if left != strlen {
            result.push((left, strlen));
        }
        break;
    }
}
```
Three things:
- I would call the `right_idx` variable `right`, shadowing the outer `right` variable. Inside the `Some` branch, there's no reason to access the outer `right` variable, so why not reuse the name?

- Instead of a `match` expression, you could use an `if let` expression. For example:

  ```rust
  if let Some(right) = right {
      if right != 0 {
          result.push((left, right));
      }
      r_idx = right;
  } else {
      if left != strlen {
          result.push((left, strlen));
      }
      break;
  }
  ```

- You don't need to define the `r_idx` variable as mutable if you move it inside the loop and take advantage of the fact that `match` and `if let` are expressions:

  ```rust
  let r_idx = if let Some(right) = right {
      if right != 0 {
          result.push((left, right));
      }
      right
  } else {
      if left != strlen {
          result.push((left, strlen));
      }
      break;
  };
  ```

  Here, the `if` branch evaluates to `right` and the `else` branch diverges. When the `else` branch is taken, the `break` statement breaks out of the loop, so the expression never produces a value. Rust handles this gracefully without us having to put a dummy value on the `else` branch just to get the types to match.
regexp_span_tokenize
Nothing to say about this function.
spans_to_relative
```rust
pub fn spans_to_relative(spans: Vec<(usize, usize)>) -> Vec<(usize, usize)> {
    // ...
}
```
The function takes a `Vec` by value, but doesn't take advantage of its storage; it forces a move for no reason. To make the function more general, it should take a slice instead:
```rust
pub fn spans_to_relative(spans: &[(usize, usize)]) -> Vec<(usize, usize)> {
    // ...
}
```
When you call the function, you'll need to write `spans_to_relative(&x)` instead of `spans_to_relative(x)`.
```rust
let mut result: Vec<(usize, usize)> = Vec::new();
```
Here, you know exactly how many elements the `Vec` will have before you return it, so you could allocate it with `Vec::with_capacity(spans.len())` to avoid reallocations while filling it.
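For the `push`-based version, that could look like this sketch (the `map`/`collect` rewrite above already gets the pre-allocation for free through `size_hint`):

```rust
pub fn spans_to_relative(spans: &[(usize, usize)]) -> Vec<(usize, usize)> {
    // The output has exactly one entry per input span, so we can
    // size the allocation up front and never reallocate.
    let mut result = Vec::with_capacity(spans.len());

    let mut prev = 0;
    for &(left, right) in spans {
        result.push((left - prev, right - left));
        prev = right;
    }

    result
}
```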