Problem
I would like to use the following routine in my submission script for GAMESS calculations. I am not entirely sure if this is the optimum way to go.
This function would need one of the messaging routines, message
, to function properly, but I decided to put echo in front of it, since this is basically what it does any way. It is taken from an earlier review of me. I am not posting the whole script due to its length, and I am not yet finished. But in order to avoid rewriting it I hope to get some insight here. I need to program similar routines for different keywords of the program, so I’d like to use this one as a template.
The function parses an input file for the GAMESS program and substitutes part of it if present with values supplied through the script in order to match the request, that is later given to the queueing system. I’ll include one input file in this post.
What I am completely unsure about is that I am calling tr
in more than one occasion and if there is a batter way to extract the command group(s) in the first place. Note that there could be multiple occurrences throughout the whole input.
#!/bin/sh
#Values obtained from main script
file=$1
mem=$2
#
# Parsing the specified inputfile
#
modifySystemGroup ()
{
# $1 is the used inputfile
# $2 is the requested memory in mwords
local readSystemGroup
local truncSystemGroup
# Extracting the SYSTEM command group, delete newlines
readSystemGroup=$(sed -n "/[[:space:]]+$[sS][yY][sS][tT][eE][mM][ ]*/,/$[eE][nN][dD]/p" $1
| tr -d "rn" )
if [ -z "$readSystemGroup" ]; then
echo message "No $SYSTEM group detected. Default will be added."
else
echo read : '"$readSystemGroup"'
#Delete any $SYSTEM an $END words
truncSystemGroup=${readSystemGroup//$[sS][yY][sS][tT][eE][mM]/}
truncSystemGroup=${truncSystemGroup//$[eE][nN][dD]/}
#Delete any MWORDS statement since it will be replaced by script values
if [[ $truncSystemGroup =~ [mM][wW][oO][rR][dD][sS]=[[:digit:]]*[[:space:]] ]]; then
truncSystemGroup=${truncSystemGroup//${BASH_REMATCH[0]}/}
fi
fi
writeSystemGroup=$(echo " $SYSTEM MWORDS=$2 $truncSystemGroup $END" | tr -s [:space:] )
echo message "Applied 'MWORDS=$2' to the input file."
}
modifySystemGroup $file $mem
echo write: '"$writeSystemGroup"'
Sample input file (It is very important, that the keywords $...
are indented by at least on space. GAMESS would ignore them otherwise.)
$CONTRL
EXETYP=CHECK
$END
$SYSTEM MWORDS=100
$END
$CONTRL
SCFTYP=RHF
RUNTYP=OPTIMIZE
$END
$SYSTEM Parall=.t.
$END
$GUESS GUESS=HUCKEL $END
$DATA
Water
C1
HYDROGEN 1.0 -0.754909 0.000000 0.563845
S 3
1 13.0107010 0.19682158E-01
2 1.9622572 0.13796524
3 0.44453796 0.47831935
S 1
1 0.12194962 1.0000000
P 1
1 0.8000000 1.0000000
HYDROGEN 1.0 0.754909 0.000000 0.563845
S 3
1 13.0107010 0.19682158E-01
2 1.9622572 0.13796524
3 0.44453796 0.47831935
S 1
1 0.12194962 1.0000000
P 1
1 0.8000000 1.0000000
OXYGEN 8.0 0.000000 0.000000 -0.087201
S 5
1 2266.1767785 -0.53431809926E-02
2 340.87010191 -0.39890039230E-01
3 77.363135167 -0.17853911985
4 21.479644940 -0.46427684959
5 6.6589433124 -0.44309745172
S 1
1 0.80975975668 1.0000000
S 1
1 0.25530772234 1.0000000
P 3
1 17.721504317 0.43394573193E-01
2 3.8635505440 0.23094120765
3 1.0480920883 0.51375311064
P 1
1 0.27641544411 1.0000000
D 1
1 1.2000000 1.0000000
$END
Here are alternative headers (part until the empty line), that all need to result in the same line returned from the function:
$CONTRL
EXETYP=CHECK
$END
$SYSTEM
MWORDS=100
$END
$CONTRL
SCFTYP=RHF
RUNTYP=OPTIMIZE
$END
$CONTRL
EXETYP=CHECK
$END
$SYSTEM
MWORDS=100 Parall=.t.
$END
$CONTRL
SCFTYP=RHF
RUNTYP=OPTIMIZE
$END
$CONTRL
EXETYP=CHECK
$END
$SYSTEM
MWORDS=100 Parall=.t. $END
$CONTRL
SCFTYP=RHF
RUNTYP=OPTIMIZE
$END
$CONTRL
EXETYP=CHECK
$END
$SYSTEM
MWORDS=100 Parall=.t. $END
$CONTRL
SCFTYP=RHF
RUNTYP=OPTIMIZE
$END
$CONTRL
EXETYP=CHECK
$END
$SYSTEM
MWORDS=100
Parall=.t.
$END
$CONTRL
SCFTYP=RHF
RUNTYP=OPTIMIZE
$END
There could be many, many more cases, that split the command group in different statements, adding other keywords, spanning multiple lines, and more that I cannot think of.
For my purposes, even when the MWORDS=...
statement is missing, it needs to be inserted, hence the following must also produce the same result as the other examples:
$CONTRL
EXETYP=CHECK
$END
$CONTRL
SCFTYP=RHF
RUNTYP=OPTIMIZE
$END
$SYSTEM Parall=.t.
$END
Expected output
The output of the function itself must produce be the same. The examples differ obviously in what is read from the file. Hence echo read : '"$readSystemGroup"'
produces different things, but this is just a control statement. That can be ignored – I will probably delete it in the final version anyway, so the log file is a bit cleaner.
The important information is in any case what is stored in $writeSystemGroup
, which must be in any case the same.
Currently the above routine produces this, given $mem=500
:
read : ' ... '
message Applied 'MWORDS=500' to the input file.
write: ' $SYSTEM MWORDS=500 Parall=.t. $END'
And that is the behaviour I desire.
Solution
Consider replacing with Perl
It looks to me that essentially you are converting input of this format:
$SYSTEM MWORDS=100
...
$SYSTEM Parall=.t.
...
To output like this:
$SYSTEM MWORDS=NNNN Parall=.t. $END
Where NNNN
is a parameter of the script.
The logic seems to be:
- Keep the values that appear after
$SYSTEM
- … except if it’s
MWORDS=...
- Form the output line by joining the values found
The scrips achieves in a fairly complicated way, using a series of sed
, awk
, Bash pattern substitution, and conditional statements.
Here’s a Perl that compresses all that into a single line:
values=$(perl -ne '/ $system (.*)/i && do { print "$1 " if $1 !~ /^mwords/i; }' "$1")
From which you can create the final writeSystemGroup
value like this:
writeSystemGroup=" $SYSTEM MWORDS=$2 $values$END"
How does this work and why is it better?
- Simpler logic, focused on the
$SYSTEM
line that seems to carry all the information we need - Simpler regex, thanks to the
/.../i
flag of Perl - Simpler extraction of the values using the capture group
(.*)
, and the conditional to exclude theMWORDS
value - A single process instead of multiple processes
- This Perl works in osx too, unlike the original
sed
Update
After you added more variations of inputs, the Perl needs a bit more work.
Here’s the complete rewritten modifySystemGroup
function:
modifySystemGroup() {
local values
values=$(
perl -e '
chomp(@_ = <>);
$_ = join(" ", @_);
s/ {2,}/ /g;
@matches = / $system (.*?) $end/ig;
for (@matches) {
$value = "";
for (split(/ /)) {
$value .= "$_ " if !/^mwords=/i;
}
print $value if $value;
}' "$1"
)
writeSystemGroup=" $SYSTEM MWORDS=$2 $values$END"
echo message "Applied 'MWORDS=$2' to the input file."
}
Other issues and improvements
The script starts with #!/bin/sh
, but actually it’s using several Bash features that not all implementations of /bin/sh
will have. So it’s better to make that clear by making the first line #!/bin/bash
.
Instead of this:
echo read : '"$readSystemGroup"'
This is exactly the same, just simpler to write and read:
echo read : "'$readSystemGroup'"
Remember to enclose path variables within double-quotes. You failed to do that for $1
here:
readSystemGroup=$(sed -n "/[[:space:]]+$[sS][yY][sS][tT][eE][mM][ ]*/,/$[eE][nN][dD]/p" $1 | tr -d "rn" )