Stack Code Review

Read group of keywords from file, modify value, store new group to variable

Problem

I would like to use the following routine in my submission script for GAMESS calculations. I am not entirely sure if this is the optimum way to go.

This function would need one of the messaging routines, message, to function properly, but I decided to put echo in front of it, since this is basically what it does any way. It is taken from an earlier review of me. I am not posting the whole script due to its length, and I am not yet finished. But in order to avoid rewriting it I hope to get some insight here. I need to program similar routines for different keywords of the program, so I’d like to use this one as a template.

The function parses an input file for the GAMESS program and substitutes part of it if present with values supplied through the script in order to match the request, that is later given to the queueing system. I’ll include one input file in this post.

What I am completely unsure about is that I am calling tr in more than one occasion and if there is a batter way to extract the command group(s) in the first place. Note that there could be multiple occurrences throughout the whole input.


#!/bin/sh

#Values obtained from main script
file=$1
mem=$2

#
# Parsing the specified inputfile
#

modifySystemGroup ()
{
    # $1 is the used inputfile
    # $2 is the requested memory in mwords
    local readSystemGroup
    local truncSystemGroup
    # Extracting the SYSTEM command group, delete newlines
    readSystemGroup=$(sed -n "/[[:space:]]+$[sS][yY][sS][tT][eE][mM][ ]*/,/$[eE][nN][dD]/p" $1
                      | tr -d "rn" )
    if [ -z "$readSystemGroup" ]; then
        echo message "No $SYSTEM group detected. Default will be added."
        else
            echo read : '"$readSystemGroup"'
            #Delete any $SYSTEM an $END words
            truncSystemGroup=${readSystemGroup//$[sS][yY][sS][tT][eE][mM]/}
            truncSystemGroup=${truncSystemGroup//$[eE][nN][dD]/}
        #Delete any MWORDS statement since it will be replaced by script values
        if [[ $truncSystemGroup =~ [mM][wW][oO][rR][dD][sS]=[[:digit:]]*[[:space:]] ]]; then
            truncSystemGroup=${truncSystemGroup//${BASH_REMATCH[0]}/}
        fi
    fi

    writeSystemGroup=$(echo " $SYSTEM MWORDS=$2 $truncSystemGroup $END" | tr -s [:space:] )
    echo message "Applied 'MWORDS=$2' to the input file."
}

modifySystemGroup $file $mem
echo write: '"$writeSystemGroup"'

Sample input file (It is very important, that the keywords $... are indented by at least on space. GAMESS would ignore them otherwise.)

 $CONTRL
 EXETYP=CHECK
 $END
 $SYSTEM MWORDS=100
 $END
 $CONTRL
 SCFTYP=RHF
 RUNTYP=OPTIMIZE
 $END
 $SYSTEM Parall=.t.
 $END

 $GUESS  GUESS=HUCKEL $END
 $DATA
Water
C1
HYDROGEN   1.0   -0.754909     0.000000     0.563845
S   3
  1     13.0107010              0.19682158E-01
  2      1.9622572              0.13796524
  3      0.44453796             0.47831935
S   1
  1      0.12194962             1.0000000
P   1
  1      0.8000000              1.0000000

HYDROGEN   1.0    0.754909     0.000000     0.563845
S   3
  1     13.0107010              0.19682158E-01
  2      1.9622572              0.13796524
  3      0.44453796             0.47831935
S   1
  1      0.12194962             1.0000000
P   1
  1      0.8000000              1.0000000

OXYGEN     8.0    0.000000     0.000000    -0.087201
S   5
  1   2266.1767785             -0.53431809926E-02
  2    340.87010191            -0.39890039230E-01
  3     77.363135167           -0.17853911985
  4     21.479644940           -0.46427684959
  5      6.6589433124          -0.44309745172
S   1
  1      0.80975975668          1.0000000
S   1
  1      0.25530772234          1.0000000
P   3
  1     17.721504317            0.43394573193E-01
  2      3.8635505440           0.23094120765
  3      1.0480920883           0.51375311064
P   1
  1      0.27641544411          1.0000000
D   1
  1      1.2000000              1.0000000

 $END

Here are alternative headers (part until the empty line), that all need to result in the same line returned from the function:

 $CONTRL
 EXETYP=CHECK
 $END
 $SYSTEM
 MWORDS=100
 $END
 $CONTRL
 SCFTYP=RHF
 RUNTYP=OPTIMIZE
 $END

 $CONTRL
 EXETYP=CHECK
 $END
 $SYSTEM
 MWORDS=100 Parall=.t.
 $END
 $CONTRL
 SCFTYP=RHF
 RUNTYP=OPTIMIZE
 $END

 $CONTRL
 EXETYP=CHECK
 $END
 $SYSTEM
 MWORDS=100 Parall=.t. $END
 $CONTRL
 SCFTYP=RHF
 RUNTYP=OPTIMIZE
 $END

 $CONTRL
 EXETYP=CHECK
 $END
 $SYSTEM
MWORDS=100 Parall=.t. $END
 $CONTRL
 SCFTYP=RHF
 RUNTYP=OPTIMIZE
 $END

 $CONTRL
 EXETYP=CHECK
 $END
 $SYSTEM
  MWORDS=100
  Parall=.t.
 $END
 $CONTRL
 SCFTYP=RHF
 RUNTYP=OPTIMIZE
 $END

There could be many, many more cases, that split the command group in different statements, adding other keywords, spanning multiple lines, and more that I cannot think of.
For my purposes, even when the MWORDS=... statement is missing, it needs to be inserted, hence the following must also produce the same result as the other examples:

 $CONTRL
 EXETYP=CHECK
 $END
 $CONTRL
 SCFTYP=RHF
 RUNTYP=OPTIMIZE
 $END
 $SYSTEM Parall=.t.
 $END

Expected output

The output of the function itself must produce be the same. The examples differ obviously in what is read from the file. Hence echo read : '"$readSystemGroup"' produces different things, but this is just a control statement. That can be ignored – I will probably delete it in the final version anyway, so the log file is a bit cleaner.
The important information is in any case what is stored in $writeSystemGroup, which must be in any case the same.
Currently the above routine produces this, given $mem=500:

read : '  ...  '
message Applied 'MWORDS=500' to the input file.
write: ' $SYSTEM MWORDS=500 Parall=.t. $END'

And that is the behaviour I desire.

Solution

Consider replacing with Perl

It looks to me that essentially you are converting input of this format:

 $SYSTEM MWORDS=100
 ...
 $SYSTEM Parall=.t.
 ...

To output like this:

$SYSTEM MWORDS=NNNN Parall=.t. $END

Where NNNN is a parameter of the script.

The logic seems to be:

  • Keep the values that appear after $SYSTEM
  • … except if it’s MWORDS=...
  • Form the output line by joining the values found

The scrips achieves in a fairly complicated way, using a series of sed, awk, Bash pattern substitution, and conditional statements.

Here’s a Perl that compresses all that into a single line:

values=$(perl -ne '/ $system (.*)/i && do { print "$1 " if $1 !~ /^mwords/i; }' "$1")

From which you can create the final writeSystemGroup value like this:

writeSystemGroup=" $SYSTEM MWORDS=$2 $values$END"

How does this work and why is it better?

  • Simpler logic, focused on the $SYSTEM line that seems to carry all the information we need
  • Simpler regex, thanks to the /.../i flag of Perl
  • Simpler extraction of the values using the capture group (.*), and the conditional to exclude the MWORDS value
  • A single process instead of multiple processes
  • This Perl works in osx too, unlike the original sed

Update

After you added more variations of inputs, the Perl needs a bit more work.
Here’s the complete rewritten modifySystemGroup function:

modifySystemGroup() {
    local values
    values=$(
        perl -e '
        chomp(@_ = <>);
        $_ = join(" ", @_);
        s/ {2,}/ /g;
        @matches = / $system (.*?) $end/ig;
        for (@matches) {
            $value = "";
            for (split(/ /)) {
                $value .= "$_ " if !/^mwords=/i;
            }
            print $value if $value;
        }' "$1"
    )
    writeSystemGroup=" $SYSTEM MWORDS=$2 $values$END"
    echo message "Applied 'MWORDS=$2' to the input file."
}

Other issues and improvements

The script starts with #!/bin/sh, but actually it’s using several Bash features that not all implementations of /bin/sh will have. So it’s better to make that clear by making the first line #!/bin/bash.

Instead of this:

    echo read : '"$readSystemGroup"'

This is exactly the same, just simpler to write and read:

    echo read : "'$readSystemGroup'"

Remember to enclose path variables within double-quotes. You failed to do that for $1 here:

readSystemGroup=$(sed -n "/[[:space:]]+$[sS][yY][sS][tT][eE][mM][ ]*/,/$[eE][nN][dD]/p" $1 | tr -d "rn" )
Exit mobile version