Problem
In my current BASH script project I am processing a string, which cannot have spaces.
At one point I get to where it is possible to have a series of digits. The series can be of any length, including zero, up to the abilities of the shell to handle command line arguments, and need not be a valid decimal integer. Leading zeros are acceptable, decimal points are not, comma separators could be handled for human compatibility if needed.
At the end I should have two strings: the collected digits (stripped of commas if allowed) and the remainder of the original string following the last digit in the series. The original string has been partially consumed by other sections, and the remainder string will be consumed by other sections, which may include a return to this section if another series of digits becomes valid. The collection of digits will be handed off to another section to deal with, including any validation which might be needed.
The snippet I’ve settled on is this:
original_group="$1"
source_group="$original_group"
number_group=''
while case "${source_group::1}" in
[[:digit:]] )
number_group+="${source_group::1}"
source_group="${source_group:1}"
;;
# ',' )
# source_group="${source_group:1}"
# ;;
* )
false
esac; do :; done
# Output not normally in the script.
results_top="Removing numbers from the head of %s gives:n"
results_list="tDigits: %sntBalance: %sn"
printf "$results_top" "'$original_group'"
printf "$results_list" "'$number_group'" "'$source_group'"
The possible contents of the source, once it reaches this point, are in one of four types:
- An empty string, (the last character could have sent it here and the empty has not been checked for)
- Results should be a pair of empty strings
- A non-empty string which begins with a non-digit character
- Results should be an empty string for the number list and the untouched source for the remainder
- A string with one, or more, digits, and nothing more
- Results should be the digits for the number list and an empty string for the remainder
- A string with one, or more, digits followed by one, or more, additional non-digit characters
- Results should be the digits for the number list and the sub-string beginning with the first non-digit character for the remainder
In all four cases, there cannot be an error triggered by the process. (There are no invalid characters, and the series of digits may not be a ‘number’ in a mathematical sense; it could be an account number, for example.) Validation of the number belongs elsewhere, and the contents of the string are consumed, whether or not the results are valid for some other purpose.
Performance is not a major issue, it is a shell script. In addition, this section is unlikely to be executed more than 3 times in any execution, commonly once if at all. Likewise, Bashisms are not a concern here, as the remainder of the script is replete with such.
Part of my reasoning for using this snippet is the convenience of adding/removing support for commas in the number series, and the ease of changing from [[:digit:]]
to [[:xdigit:]]
should I need to implement the same general logic for hexadecimal digits later.
Is there something inherently wrong with this method? Or, is there a different method which is simply better, cleaner, or easier?
I avoided RegEx as either extractor, or loop conditional, under the belief that when not needed, RegEx introduces needless overhead. (Yes, performance is not a major issue here, but bad habits are worth avoiding as well.)
Solution
bash
is slow to append and index into strings: both operations are O(n)
. Your approach does as many indexes and appends as there are digits in the input, making it O(n2)
. And processing a string by chomping the first character repeatedly will be O(n2)
in almost every language.
Have a look at these timings for large inputs, using your code:
$ for x in 10000 20000 40000 80000; do echo $x; time bash dig.sh $( tr -dc 0-9 < /dev/urandom | head -c $x ) >/dev/null ; done
input size (1000's of digits) runtime (seconds)
10 1.1
20 4.4
40 17.7
80 71.6
Almost perfectly quadratic! It’s possible to do better, and in fewer lines.
Your code is implementing a state machine, just like a regular expression would, but much less efficiently. So the regex isn’t needless overhead; it’s very necessary (and profitable) “underhead”:
dig() {
[[ $1 =~ ^([0-9,]*)(.*) ]] # this always matches
digits="${BASH_REMATCH[1]//,}"
remain="${BASH_REMATCH[2]}"
printf "inputt'%s'tdigitst'%s'tremaint'%s'n" "$1" "$digits" "$remain"
}
+ dig ''
input '' digits '' remain ''
+ dig 1234
input '1234' digits '1234' remain ''
+ dig 1,234
input '1,234' digits '1234' remain ''
+ dig abc12
input 'abc12' digits '' remain 'abc12'
+ dig 12z34
input '12z34' digits '12' remain 'z34'
$ for x in 10000 20000 40000 80000; do echo $x; time dig $( tr -dc 0-9 < /dev/urandom | head -c $x ) >/dev/null ; done
input size (1000's of digits) runtime (seconds)
10 0.006
20 0.010
40 0.018
80 0.032
Behold: a two-thousand-fold speedup of the slowest case!