Optimize text search in files with Bash

Problem

I would like some suggestions for improving the performance of a simple project I wrote in Bash on Linux.

The goal is to read all the *.desktop files and extract the Name, Exec, Icon and Comment entries, which are then displayed in a GTK yad list.

I have written the code in two versions. Both versions work, but they are very slow.

Version 1:

Read the files one by one and grep each field separately.
This version needs about 9 seconds to read and grep 300 desktop files.

TimeStarted=$(date +%s.%N)
files=/usr/share/applications/*.desktop
fileindex=1 
for i in $(ls $files); do
  readarray -t executable < <(grep -m 1 "^Exec=" $i |cut -f 2 -d '=')
  readarray -t comment < <(grep -m 1 "^Comment=" $i |cut -f 2 -d '=')
  readarray -t comment2 < <(grep -m 1 "^GenericName=" $i |cut -f 2 -d '=')
  readarray -t mname < <(grep -m 1 "^Name=" $i |cut -f 2 -d '=')
  readarray -t icon < <(grep -m 1 "^Icon=" $i |cut -f 2 -d '=')
  if [[ $comment = "" ]]; then
     comment=$comment2
  fi
  yadlist+=( "$fileindex" "${icon[0]}" "${mname[0]}" "$i" "${executable[0]}" "${comment[0]}" ) #this sets double quotes in each variable.
  fileindex=$(($fileindex+1))
done
TimeFinished=$(date +%s.%N); TimeDiff=$(echo "$TimeFinished - $TimeStarted" | bc -l)

Version 2:

Grep all the files at once for the required fields.
This version is faster and needs about 2 seconds to grep 300 desktop files.

TimeStarted=$(date +%s.%N)
files=/usr/share/applications/*.desktop
i=$files 
fileindex=1
IFS=$'\n'
readarray -t fi < <(printf '%s\n' $i)
readarray -t executable < <(grep  -m 1 '^Exec=' $i)
readarray -t noexecutable < <(grep  -L '^Exec=' $i)
readarray -t comment < <(grep -m 1 "^Comment=" $i )
readarray -t nocomment < <(grep -L "^Comment=" $i ) 
readarray -t comment2 < <(grep -m 1 "^GenericName=" $i )
readarray -t nocomment2 < <(grep -L "^GenericName=" $i )
readarray -t mname < <(grep -m 1 "^Name=" $i )
readarray -t nomname < <(grep -L "^Name=" $i )  
readarray -t icon < <(grep -m 1 "^Icon=" $i )
readarray -t noicon < <(grep -L "^Icon=" $i )   

for items1 in ${noexecutable[@]}; do
    executable+=($(echo "$items1"":Exec=None"))
done

for items2 in ${nocomment[@]}; do
    comment+=($(echo "$items2"":Comment=None"))
done
for items3 in ${nocomment2[@]}; do
    comment2+=($(echo "$items3"":GenericName=None"))
done

for items4 in ${nomname[@]}; do
    mname+=($(echo "$items4"":Name=None"))
done

for items5 in ${noicon[@]}; do
    icon+=($(echo "$items5"":Icon=None"))
done

sortexecutable=($(sort <<<"${executable[*]}"))
sortcomment=($(sort <<<"${comment[*]}"))
sortcomment2=($(sort <<<"${comment2[*]}"))
sortmname=($(sort <<<"${mname[*]}"))
sorticon=($(sort <<<"${icon[*]}"))

trimexecutable=($(grep  -Po '(?<=Exec=)[ --0-9A-Za-z/]*' <<<"${sortexecutable[*]}"))
trimcomment=($(grep -Po '(?<=Comment=)[ --0-9A-Za-z/]*' <<<"${sortcomment[*]}"))
trimcomment2=($(grep -Po '(?<=GenericName=)[ --0-9A-Za-z/]*' <<<"${sortcomment2[*]}"))
trimmname=($(grep -Po '(?<=Name=)[ --0-9A-Za-z/]*' <<<"${sortmname[*]}"))
trimicon=($(grep -Po '(?<=Icon=)[ --0-9A-Za-z/]*' <<<"${sorticon[*]}"))

unset IFS

ae=0
for aeitem in ${fi[@]};do
    if [[ ${trimcomment[ae]} = "None" ]]; then
        trimcomment[ae]=${trimcomment2[ae]}
    fi
    yadlist+=( "$fileindex" "${trimicon[$ae]}" "${trimmname[$ae]}" "${fi[$ae]}" "${trimexecutable[$ae]}" "${trimcomment[$ae]}" ) #this sets double quotes in each variable.
    fileindex=$(($fileindex+1))
    ae=$(($ae+1))
done
TimeFinished=$(date +%s.%N); TimeDiff=$(echo "$TimeFinished - $TimeStarted" | bc -l)

Remarks:

a) Some .desktop files do not include all the required fields.

b) Timings refer to a 64-bit Intel Celeron N3050 machine with 4 GB RAM, running 64-bit Debian 8 Sid with XFCE, GNU bash 4.4.0(1) and GNU grep 2.26. PS: The timings of 9 and 2 seconds were also verified with time ./script.sh.

c) Version 2 runs in under 0.5 seconds if I remove the “for” sections, but then yadlist becomes chaotic because some .desktop files are missing fields.

Result:

In my opinion, even 2 seconds to grep 300 files is still too long for such a small number of files.

Is it possible to further optimize this script's performance in Bash?

As a sample, you can have a look at this caja.desktop file, taken from my system. Notice that the Comment entry is missing.

[Desktop Entry]
Name=Caja
Name[af]=Caja
<More Name entries for different locale>
GenericName=File Manager
GenericName[af]=Lêerbestuurder
<more GenericName entries for different locale>
Exec=caja
Icon=system-file-manager
Terminal=false
Type=Application
StartupNotify=true
NoDisplay=true
OnlyShowIn=MATE;

In other .desktop files, the comment entry (if present) looks like this:

Comment=View multi-page documents
<various Comment entries for different locale>

Solution

Review w.r.t the algorithm only (language independent):

  • You run five greps per file to extract what you need. Instead, search for all five keys in one call: grep -E "A|B|C|D|E". If this doesn’t suit your requirements, write a simple program that reads each file once and extracts all five parameters in that single pass instead of five.
  • Calculate comment2 only if [[ $comment = "" ]].
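
The single-read idea from the list above could be sketched in Bash roughly as follows. This is only an illustration under assumptions, not the asker's final script: the function names parse_desktop_file and build_yadlist and the d_* result variables are hypothetical, and "None" is used as the placeholder for missing fields as in version 2. It keeps the first occurrence of each key, which mirrors grep -m 1, and falls back to GenericName only when Comment is empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one .desktop file once and keep the first
# occurrence of each wanted key (same effect as grep -m 1 per key).
# Results are left in the global d_* variables.
parse_desktop_file() {
    local file=$1 key value
    d_name='' d_exec='' d_icon='' d_comment='' d_generic=''
    # IFS='=' splits at the first '='; the rest of the line stays in $value.
    # The extra test handles a last line without a trailing newline.
    while IFS='=' read -r key value || [[ -n $key ]]; do
        case $key in
            Name)        [[ -n $d_name ]]    || d_name=$value ;;
            Exec)        [[ -n $d_exec ]]    || d_exec=$value ;;
            Icon)        [[ -n $d_icon ]]    || d_icon=$value ;;
            Comment)     [[ -n $d_comment ]] || d_comment=$value ;;
            GenericName) [[ -n $d_generic ]] || d_generic=$value ;;
        esac
    done < "$file"
    # Use GenericName only when Comment is missing.
    [[ -n $d_comment ]] || d_comment=$d_generic
}

# Hypothetical driver: fill the yadlist array for every .desktop file
# in a directory, substituting "None" for any missing field.
build_yadlist() {
    local dir=${1:-/usr/share/applications} file fileindex=1
    yadlist=()
    for file in "$dir"/*.desktop; do
        [[ -f $file ]] || continue   # skip the literal pattern if nothing matches
        parse_desktop_file "$file"
        yadlist+=( "$fileindex" "${d_icon:-None}" "${d_name:-None}" "$file" "${d_exec:-None}" "${d_comment:-None}" )
        fileindex=$((fileindex + 1))
    done
}
```

Calling build_yadlist with no argument would scan /usr/share/applications. Because every file is parsed on its own, a missing field cannot shift the other columns, and no grep, cut or sort process is started at all. Whether this beats version 2 on a given machine would need to be measured.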

After a lot of research and “observation” I found the problem.

The real cause of the script's limited performance was CPU frequency scaling.
As soon as I pushed the processor to run at full speed (1.6 GHz), version 2 finished in 0.5 seconds!

All I had to do was check the script's performance on another machine, and I was lucky that this “other” machine did not have CPU scaling enabled.

From a programmer's point of view there is no doubt that version 2 is MUCH faster than version 1. It also seems that version 2 is the most we can get out of bash.

PS1: I adopted the recommendation of “thepace” to calculate comment2 only when needed. That improved the script's performance by a few milliseconds.

PS2: To make my CPU run at full speed I had to disable intel_pstate and apply the performance governor with cpufrequtils (cpufreq-set -c 0 -g performance, and the same for -c 1) or, even better, pin the CPU at its maximum frequency using cpufreq-set -c 0 -f 1600000.
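
To see whether frequency scaling is affecting a benchmark before changing anything, the active governor and current frequency can be read from sysfs. The following is a minimal sketch that assumes the kernel exposes the standard cpufreq files under /sys/devices/system/cpu; show_cpufreq is a hypothetical helper name, and its optional argument exists only so it can be pointed at another directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the active governor and current frequency
# for every CPU that exposes cpufreq information in sysfs.
show_cpufreq() {
    local root=${1:-/sys/devices/system/cpu} dir cpu governor khz
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip CPUs (or an unmatched pattern) without readable cpufreq files.
        [[ -r $dir/scaling_governor && -r $dir/scaling_cur_freq ]] || continue
        read -r governor < "$dir/scaling_governor"
        read -r khz < "$dir/scaling_cur_freq"   # value is reported in kHz
        cpu=${dir%/cpufreq}
        cpu=${cpu##*/}
        printf '%s: governor=%s freq=%d MHz\n' "$cpu" "$governor" "$((khz / 1000))"
    done
}
```

Running show_cpufreq before and after a timing run gives a quick hint as to whether the CPU was clocked down while the script was being measured.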

PS3: The performance governor is also available with intel_pstate enabled (the default setting), but in practice intel_pstate keeps reducing the CPU speed even under the performance governor, as shown by cpufreq-info (although it behaves better than the default powersave governor). With intel_pstate disabled and the performance governor applied, the CPU stays at 1.6 GHz.

PS4: I had no idea that cpufrequtils is installed by default in Debian 8…

For those who want to give it a try, the full script can be found here:
https://github.com/gevasiliou/PythonTests/blob/master/appslist.sh

If you don’t have .desktop files on your system (they are usually found in /usr/share/applications/) you can download this folder with around 300 files for testing:
https://github.com/gevasiliou/PythonTests/tree/master/appsfiles
