Problem
I have a text file with the following three-line pattern, with blank lines in between. My script sorts the blocks alphabetically by each person’s last name and preserves the formatting. I would love to see other options to improve this in Bash. For example, the group command that redirects into final.txt repeats a lot. Also, it would be nice to have the content of output.txt in a variable instead of creating a file.
Sally Smith
UniqueStringSmith_1
UniqueStringSmith_2

Wally Wilson
UniqueStringWilson_1
UniqueStringWilson_2

Tod Taylor
UniqueStringTaylor_1
UniqueStringTaylor_2

Judy Johnson
UniqueStringJohnson_1
UniqueStringJohnson_2
The result looks like the following, sorted alphabetically by last name:
Judy Johnson
UniqueStringJohnson_1
UniqueStringJohnson_2

Sally Smith
UniqueStringSmith_1
UniqueStringSmith_2

Tod Taylor
UniqueStringTaylor_1
UniqueStringTaylor_2

Wally Wilson
UniqueStringWilson_1
UniqueStringWilson_2
Here is my script:
#!/bin/bash
# Get the number of lines in the document.
lines=$(cat my-file.txt | wc -l)
# This is the starting range and end range. Each section is three lines.
x=1
y=3
until [ "$x" -gt "$lines" ]; do
# Store the three lines to one line.
block=$(awk 'NR=="'"$x"'",NR=="'"$y"'"' my-file.txt)
# Echo each instance into my file.
# The $block variable is not double quotes so new lines are not honored.
echo $block >> output.txt
# Increment so it goes on to the next block.
x=$((x+4))
y=$((y+4))
done
# Sort the output file in place by the second column.
sort -k2 output.txt -o output.txt
# Put it back into original formatting.
while read i; do
(echo "$i" | awk '{ print $1 " " $2 }'; echo "$i" | awk '{ print $3 }'; echo "$i" | awk '{ print $4 }'; echo "") >> final.txt
done < output.txt
# Remove the unnecessary file.
rm output.txt
Solution
Usability
Hardcoded input and output filenames are not easy to use.
This script only works with one specific input file name,
and it may inadvertently overwrite a file.
It would be better to take the input file as a command line argument,
and write the output to stdout,
letting the user redirect it to any file.
Error handling
If the input file doesn’t exist, the script prints a bunch of error messages:
cat: my-file.txt: No such file or directory
sort: open failed: output.txt: No such file or directory
script.sh: line 29: output.txt: No such file or directory
rm: output.txt: No such file or directory
It would be better to check first that the file exists and fail early.
Keep in mind that after an error in one of the commands,
the script continues to run and execute the rest of the commands anyway.
I’ve seen cases where this caused real damage,
for example with rm -fr commands that assumed they were running in a different directory, which was not the case due to earlier errors.
So it’s important to look out for possible errors, check the exit code of commands and halt execution early.
You could do something like this:
input=$1
if ! test -f "$input"; then
echo "fatal: input file argument missing or not a file: $input" >&2
echo "usage: $0 input" >&2
exit 1
fi
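The same "check and stop early" idea applies to individual commands inside the script. The following is only a minimal sketch: count_lines is a hypothetical helper name, not something from the original script. It replaces the cat ... | wc -l step and returns a non-zero status instead of carrying on when the file cannot be read.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): print the number of
# lines in a file, or return 1 if the file cannot be read.
count_lines() {
    local file="$1" n
    # If the redirection or wc fails, stop here instead of continuing
    # with an empty value.
    n=$(wc -l < "$file") || return 1
    echo "$n"
}
```

A caller could then write lines=$(count_lines "$input") || exit 1, so that a failure ends the script at the point where it happened.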
Bash arithmetic
The -gt operator in [ ... ] works, but in Bash the arithmetic context ((...)) is easier to read. Instead of:
until [ "$x" -gt "$lines" ]; do
You can write:
until (( x > lines )); do
Simpler quoting
You can simplify the quoting here:
block=$(awk 'NR=="'"$x"'",NR=="'"$y"'"' "$input")
Like this:
block=$(awk "NR==$x,NR==$y" "$input")
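Another option, sketched here under the assumption that you would rather not splice shell variables into the awk program text at all, is to pass the bounds with awk's -v option. print_range is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash

# Hypothetical helper: print lines start..end of a file.
# The bounds are passed as awk variables with -v, so the awk program
# itself can stay inside single quotes.
print_range() {
    local file="$1" start="$2" end="$3"
    awk -v start="$start" -v end="$end" 'NR == start, NR == end' "$file"
}
```

With this helper, the assignment above could become block=$(print_range "$input" "$x" "$y").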
Initializing output.txt
In the until loop, you append to output.txt.
What if the file already existed before running the script?
You will get funny results.
To make sure the file is empty, you can do this:
> output.txt
But this is still not great. A file with that name may exist, and now its content will be destroyed.
Instead of using a temporary file in the current folder,
it would be better to use one in a temporary directory, for example ${TMPDIR:-/tmp}/output.txt.
And to avoid clashing with other scripts that might do the same,
you can add the process ID to the filename, for example ${TMPDIR:-/tmp}/output-$$.txt.
But the best solution is to use the mktemp command:
tmpfile=$(mktemp)
Deleting temporary files at the end
One problem with deleting temporary files at the end of the script, as you did with rm output.txt, is that you might forget to do it.
Another problem is that the end of the script might not be reached
if it gets interrupted by a signal, for example when the user presses Control-C.
You can protect against this by using the trap builtin:
tmpfile=$(mktemp)
trap "rm -f '$tmpfile'; exit 1" 1 2 3 15
I repeated the line that creates the temporary file
because it’s best to put the trap command right after it,
so it won’t be forgotten.
The first parameter of trap is the command to run,
typically a sequence of several commands,
and it’s important that the last one is exit.
The other parameters are the signals to trap.
1, 2, 3 and 15 (SIGHUP, SIGINT, SIGQUIT and SIGTERM) are the typical ones; for example, 2 is SIGINT,
which is sent when the user presses Control-C while the script is running.
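A common variant, shown here as a sketch rather than as a replacement for the lines above, also traps EXIT, so that the cleanup runs when the script ends normally as well. The function name with_tmpfile and the subshell wrapper are assumptions made only to keep the example self-contained; in a real script the two trap lines would sit at the top level, right after mktemp.

```shell
#!/bin/bash

# Hypothetical demo: work with a temporary file that is removed however
# the subshell ends.
with_tmpfile() {
    (
        tmpfile=$(mktemp) || exit 1
        # The EXIT trap runs on normal termination and after an explicit exit.
        trap 'rm -f "$tmpfile"' EXIT
        # Turn the usual signals into an exit, so the EXIT trap still runs.
        trap 'exit 1' HUP INT QUIT TERM

        echo "some data" > "$tmpfile"
        # Print the file name so the caller can verify it was cleaned up.
        echo "$tmpfile"
    )
}
```

After name=$(with_tmpfile), the file named in $name should no longer exist, because the EXIT trap removed it when the subshell finished.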
More Bash arithmetic
Instead of this:
x=$((x+4))
y=$((y+4))
You can simplify to:
((x+=4))
((y+=4))
Fewer variables
y is not really necessary. Instead of incrementing it by 4 in parallel with x, you can just increment x and use x + 2 as the end of the range in awk, as in the loop shown in the next section.
Fewer redirections
Instead of redirecting output in every iteration of the until loop,
you could redirect the entire loop, just once:
until (( x > lines )); do
block=$(awk "NR==$x,NR==$x+2" "$input")
echo $block
((x+=4))
done > "$tmpfile"
Fewer processes
Instead of running an awk process for every block in the file in an until loop,
you could move the same logic inside awk itself,
and achieve the same using a single process:
awk '{printf "%s ", $0} NR % 4 == 0 {print ""}' "$input" > "$tmpfile"
In the while loop too, there is some waste.
Multiple commands are on one line, separated by ; and enclosed within (...).
It’s equivalent to this:
while read i; do
echo "$i" | awk '{ print $1 " " $2 }'
echo "$i" | awk '{ print $3 }'
echo "$i" | awk '{ print $4 }'
echo
done < "$tmpfile"
Note that i is a poor name for a variable that contains a line; line would be clearer.
But the bigger issue is that a single line with awk could replace the 4 lines of echo:
echo "$line" | awk '{ print $1 " " $2; print $3; print $4; print ""; }'
Even better, a single awk process could replace the entire loop:
awk '{ print $1 " " $2; print $3; print $4; print ""; }' "$tmpfile"
Putting it together
At this point, we have:
- An until loop that creates $tmpfile
- A sort that sorts $tmpfile
- An awk command that processes $tmpfile
We can chain them all into a pipeline, and get rid of $tmpfile altogether.
With the above changes and unnecessary elements removed,
the script becomes:
#!/bin/bash
input=$1
if ! test -f "$input"; then
echo "fatal: input file argument missing or not a file: $input" >&2
echo "usage: $0 input" >&2
exit 1
fi
awk '{printf "%s ", $0} NR % 4 == 0 {print ""}' "$input" |
sort -k2 |
awk '{print $1 " " $2; print $3; print $4; print ""}'
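The question also asked about keeping the content of output.txt in a variable instead of a file. The pipeline above makes that unnecessary, but if the intermediate form is wanted for other reasons, a command substitution can hold it. This is only a sketch: sort_by_last_name and joined are hypothetical names, and it assumes the input follows the three-lines-plus-blank-line pattern from the question.

```shell
#!/bin/bash

# Hypothetical helper: sort three-line blocks by last name, keeping the
# intermediate one-line-per-block form in a variable rather than a file.
sort_by_last_name() {
    local input="$1" joined
    # Join each block onto one line, then sort by the second column
    # (the last name).
    joined=$(awk '{printf "%s ", $0} NR % 4 == 0 {print ""}' "$input" | sort -k2)
    # Restore the original three-line layout from the variable.
    printf '%s\n' "$joined" |
        awk '{print $1 " " $2; print $3; print $4; print ""}'
}
```

Calling sort_by_last_name my-file.txt > final.txt would then produce the sorted blocks without creating any intermediate file.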
Since you call awk four times in your script, I think you could write a cleaner solution using just Awk to achieve your goal, since Awk is a more capable text-processing language than Bash.
(This code was adapted from an answer on Stack Overflow to the question “Sorting lines in a file alphabetically using awk and/or sed”. It uses asort, so it needs GNU awk.)
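#!/usr/bin/awk -f
# Requires GNU awk (gawk) for asort().
BEGIN {
    RS=""; FS="\n";
}
{
    # Split the name line on whitespace; the last token is the last name.
    tokens=split($1, name, " ")
    key[NR]=name[tokens] "\t" NR
    block[NR]=$0
}
END {
    asort(key)
    for (i=1; i<=NR; i++) {
        split(key[i],name,"\t")
        print block[name[2]]
        printf "\n"
    }
}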
The BEGIN block sets the Record Separator (RS) to the empty string, which makes blank lines separate records, and the Field Separator (FS) to a newline. This is the awk idiom for dealing with multi-line records separated by blank lines.
In the processing block, the first field (the line with the person’s name) is split on whitespace. The last token from the split (name[tokens], where tokens is the number of strings resulting from the split) is used as the sort key, with the record number appended after a tab so that every key is unique and can be traced back to its record. The entire matched record is stored in the block array.
After all the records are read and indexed, the END block sorts the key array, then splits each value in key on the tab to recover the record number, which is the lookup key in the block array, and prints the entire block for each last name in sorted order.
Assuming you save the script as sortblock.awk and make it executable with chmod +x sortblock.awk, you can invoke it with
./sortblock.awk data.txt
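Because asort is a GNU awk extension, here is a sketch of the same idea for systems where awk is not gawk. It swaps the in-awk sort for a decorate, sort, undecorate pipeline: the last name is put in front of each record as a tab-separated key, sort orders the records, and a second awk restores the layout. The function name sort_blocks is hypothetical, and the sketch assumes the data contains no tab characters and that sort supports -s (stable), which is common but not required by POSIX.

```shell
#!/bin/bash

# Hypothetical helper: sort blank-line-separated blocks by the last word
# of their first line, without relying on gawk's asort().
sort_blocks() {
    # Decorate: emit "lastname<TAB>line1<TAB>line2<TAB>..." per block.
    awk 'BEGIN { RS = ""; FS = "\n" }
         {
             n = split($1, name, " ")
             line = name[n]
             for (i = 1; i <= NF; i++) line = line "\t" $i
             print line
         }' "$1" |
    # Sort on the key only; -s keeps equal last names in input order.
    sort -s -t "$(printf '\t')" -k1,1 |
    # Undecorate: drop the key, print one line per field, then a blank line.
    awk 'BEGIN { FS = "\t" }
         { for (i = 2; i <= NF; i++) print $i; print "" }'
}
```

It would be called as sort_blocks data.txt, writing the sorted blocks to stdout.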