Sorting a file with three-line blocks by the second word of the first line in each block

Problem

I have a text file with the following three-line pattern, with blank lines in between. My script sorts the blocks alphabetically by each person’s last name and preserves the formatting. I would love to see other options to improve this in Bash. For example, the group command that redirects into final.txt repeats a lot. Also, it would be nice to have the content of output.txt in a variable instead of creating a file.

Sally Smith
UniqueStringSmith_1
UniqueStringSmith_2

Wally Wilson
UniqueStringWilson_1
UniqueStringWilson_2

Tod Taylor
UniqueStringTaylor_1
UniqueStringTaylor_2

Judy Johnson
UniqueStringJohnson_1
UniqueStringJohnson_2

The result looks like the following, sorted alphabetically by last name:

Judy Johnson
UniqueStringJohnson_1
UniqueStringJohnson_2

Sally Smith
UniqueStringSmith_1
UniqueStringSmith_2

Tod Taylor
UniqueStringTaylor_1
UniqueStringTaylor_2

Wally Wilson
UniqueStringWilson_1
UniqueStringWilson_2

Here is my script:

#!/bin/bash

# Get the number of lines in the document.
lines=$(cat my-file.txt | wc -l)

# This is the starting range and end range. Each section is three lines.
x=1
y=3

until [ "$x" -gt "$lines" ]; do
    # Store the three lines to one line. 
    block=$(awk 'NR=="'"$x"'",NR=="'"$y"'"' my-file.txt)
    # Echo each instance into my file. 
    # The $block variable is not double-quoted so newlines are not honored. 
    echo $block >> output.txt
    # Increment so it goes on to the next block.
    x=$((x+4)) 
    y=$((y+4)) 
done 

# Sort the output file in place by the second column.
sort -k2 output.txt -o output.txt

# Put it back into original formatting.
while read i; do 
    (echo "$i" | awk '{ print $1 " " $2 }'; echo "$i" | awk '{ print $3 }'; echo "$i" | awk '{ print $4 }'; echo "") >> final.txt
done < output.txt

# Remove the unnecessary file. 
rm output.txt

Solution

Usability

Hardcoded input and output filenames are not easy to use.
This script only works with one specific input file name,
and it may inadvertently overwrite a file.
It would be better to take the input file as a command line argument,
and write the output to stdout,
letting the user redirect to any file.
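For example, if the script is saved as script.sh, the user could run something like:

./script.sh my-file.txt > sorted.txt

Here sorted.txt is just an example name; the output can be redirected anywhere.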

Error handling

If the input file doesn’t exist, the script prints a bunch of error messages:

cat: my-file.txt: No such file or directory
sort: open failed: output.txt: No such file or directory
script.sh: line 29: output.txt: No such file or directory
rm: output.txt: No such file or directory

It would be better to check first that the file exists and fail early.

Keep in mind that after an error in one of the commands,
the script continues to run and execute the rest of the commands anyway.
I’ve seen cases where this caused real damage,
for example with rm -fr commands that were assumed to run in a different directory, which was not the case due to earlier errors.
So it’s important to look out for possible errors, check the exit codes of commands, and halt execution early.
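For example, you could check the exit status of the sort command and stop right away if it fails (just as an illustration):

if ! sort -k2 output.txt -o output.txt; then
    echo "error: sort failed" >&2
    exit 1
fi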

For checking the input file, you could do something like this:

input=$1

if ! test -f "$input"; then
    echo "fatal: input file argument missing or not a file: $input"
    echo "usage: $0 input"
    exit 1
fi

Bash arithmetic

The -gt operator in [ ... ] is the old-style test syntax; in Bash a more readable way is arithmetic evaluation with ((...)). Instead of:

until [ "$x" -gt "$lines" ]; do

You can write like this:

until (( x > lines )); do

Simpler quoting

You can simplify the quoting here:

    block=$(awk 'NR=="'"$x"'",NR=="'"$y"'"' "$input")

Like this:

block=$(awk "NR==$x,NR==$y" "$input")

Initializing output.txt

In the until loop, you append to output.txt.
What if the file already existed before running the script?
You will get funny results.

To make sure the file is empty, you can do this:

> output.txt

But this is still not great. A file with that name may exist, and now its content will be destroyed.

Instead of using a temporary file in the current folder,
it would be better to use one in the system temporary directory, such as ${TMPDIR:-/tmp}/output.txt.
And to avoid clashing with other scripts that might do the same,
you can add the process ID to the filename, for example ${TMPDIR:-/tmp}/output-$$.txt.
But the best solution is to use the mktemp command:

tmpfile=$(mktemp)

Deleting temporary files at the end

One problem with deleting temporary files at the end of the script, like you did with rm output.txt, is that you might forget to do it.
Another problem is that the end of the script might not be reached
if the script is interrupted by an error, a signal, or the user pressing Control-C.
You can protect against these by using the trap builtin:

tmpfile=$(mktemp)
trap "rm -f '$tmpfile'; exit 1" 1 2 3 15

I repeated the line that creates the temporary file
because it’s best to put the trap command immediately after it,
so it won’t be forgotten.

The first parameter of trap is a command to run,
typically several commands,
and it’s important that the last one is exit.
The other parameters are the signals that will be trapped.
1, 2, 3, and 15 are typical signals to trap; for example, 2 is SIGINT,
which is sent when the user presses Control-C while the script is running.
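Incidentally, trap also accepts signal names instead of numbers, which some find easier to read:

trap "rm -f '$tmpfile'; exit 1" HUP INT QUIT TERM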

More Bash arithmetic

Instead of this:

    x=$((x+4)) 
    y=$((y+4)) 

You can simplify to:

((x+=4))
((y+=4))

Fewer variables

y is not really necessary. Instead of incrementing it by 4 in parallel with x, you can just increment x, and use x + 2 in awk:
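
block=$(awk "NR==$x,NR==$x+2" "$input")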

Fewer redirections

Instead of redirecting output in every iteration of the until loop,
you could redirect the entire loop, just once:

until (( x > lines )); do
    block=$(awk "NR==$x,NR==$x+2" "$input")
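    # $block is intentionally unquoted so its newlines collapse into spaces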
    echo $block
    ((x+=4))
done > "$tmpfile"

Fewer processes

Instead of running an awk process for every block in the file in an until loop,
you could move the same logic inside awk itself,
and achieve the same using a single process:

awk '{printf "%s ", $0} NR % 4 == 0 {print ""}' "$input" > "$tmpfile"

In the while loop too, there is some waste.
Multiple commands are on one line separated by ;,
and enclosed within (...).
It’s equivalent to this:

while read i; do 
    echo "$i" | awk '{ print $1 " " $2 }'
    echo "$i" | awk '{ print $3 }'
    echo "$i" | awk '{ print $4 }'
    echo
done < "$tmpfile"

Note that i is a poor name for a variable that contains a line.

But the bigger issue is that a single line with awk could replace the 4 lines of echo:

echo "$line" | awk '{ print $1 " " $2; print $3; print $4; print ""; }'

Even better, a single awk process could replace the entire loop:

awk '{ print $1 " " $2; print $3; print $4; print ""; }' "$tmpfile"

Putting it together

At this point, we have:

  • An until loop (or the equivalent single awk command) that creates $tmpfile
  • A sort that sorts $tmpfile
  • An awk command that processes $tmpfile

We can chain them all into a pipeline, and get rid of $tmpfile altogether.

With the above changes and unnecessary elements removed,
the script becomes:

#!/bin/bash

input=$1

if ! test -f "$input"; then
    echo "fatal: input file argument missing or not a file: $input"
    echo "usage: $0 input"
    exit 1
fi

awk '{printf "%s ", $0} NR % 4 == 0 {print ""}' "$input" | 
sort -k2 | 
awk '{print $1 " " $2; print $3; print $4; print ""}'

Since you already call awk four times in your script, I think you could write a cleaner solution using just Awk; it is a more capable text-processing language than Bash.

(This code was adapted from this answer on Stack Overflow, in response to “Sorting lines in a file alphabetically using awk and/or sed”.)

#!/usr/bin/awk -f

BEGIN {
    RS=""; FS="\n";
}
{
    tokens=split($1, name, " ")
    key[NR]=name[tokens] "\t" NR
    block[NR]=$0
}

END {
    asort(key)
    for (i=1; i<=NR; i++) {
        split(key[i],name,"\t")
        print block[name[2]]
        printf "\n"
    }
}

The BEGIN block sets the Record Separator (RS) to the empty string and the Field Separator (FS) to a newline. This is the awk idiom (paragraph mode) for dealing with multi-line records separated by blank lines.

In the processing block, the first field (the line with the person’s name) is split on whitespace. The last token from the split (name[tokens], where tokens is the number of strings resulting from the split) is used as the sort key, with the record number appended to keep the sort stable. The entire record is stored in the block array.

After all the records have been read and indexed, the END block sorts the key array, splits the record number back out of each value, and uses it as the index into the block array, printing the blocks in order of last name. Note that asort is a GNU Awk (gawk) extension, so the script requires gawk.

Assuming you save the script to sortblock.awk, and chmod +x sortblock.awk, you can just invoke it with

./sortblock.awk data.txt
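
Alternatively, because asort needs GNU Awk, you can run it through gawk directly, without making the file executable:

gawk -f sortblock.awk data.txt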
