Problem
I just wrote an advanced CSV parser that translates CSV into SQL (SQLite-compatible) INSERT statements, which are then run against the database. It expands variable names in the CSV to values defined in the script.
Example CSV
0,1,An empty black tile,${ASSET_PATH}/BlackTile.png
1,1,Grassy Tile,${ASSET_PATH}/Grass.png
2,1,Lava Tile,${ASSET_PATH}/Lava.png
3,1,Stone Brick Texture,${ASSET_PATH}/StoneBrick.png
The code
# Variables that are used in the CSV files
ASSET_PATH=~/spacegame/assets
# $1 - File to read. Named same as table to insert to plus .csv
# $2 -
function parseCsvAdv2Db {
local oldIFS=$IFS
local table=$(echo $1 | cut -d'.' -f 1)
local ins="INSERT INTO $table$2 VALUES"
IFS='|'
while read line
do
# Preprocess the line
local data=$(eval echo "$line" |
awk 'BEGIN { FS=","; OFS="|"; } { $1=$1; print $0; }')
local tmpdata='('
for field in $data
do
tmpdata+="'$field'"
done
tmpdata+=')'
ins+="$tmpdata"
done < $1
ins=$(echo $ins | sed -e 's/)(/),(/g' -e "s/''/','/g")
ins+=';'
sqlite3 $dbfile "$ins"
# Restore state
IFS=$oldIFS
}
parseCsvAdv2Db test.csv '(id,type,descr,path)'
Solution
First of all, it should be noted that this script would be vulnerable to arbitrary command execution as well as SQL injection. It might be OK if you trust the CSV data not to contain malicious shell commands or characters with special significance in SQL.
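To make the command-execution risk concrete, here is a minimal sketch (the CSV line is invented for illustration): any field an attacker controls passes through eval, so a command substitution embedded in the data gets executed.

```shell
# A CSV line with an embedded command substitution in one field:
line='3,1,$(echo pwned),tile.png'

# The script's preprocessing step runs the line through eval, which
# executes the embedded command before the value ever reaches SQL:
expanded=$(eval echo "$line")
echo "$expanded"   # the third field now contains the command's output
```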
Several features make this code hard to follow:
- Mixing Bash and AWK. It should either be pure Bash or mostly AWK (with a thin Bash wrapper to invoke AWK with the right parameters). Calling AWK like this, especially with one invocation per line, is both confusing and bad for performance.
- What does the | character have to do with anything? It seems to be a secret delimiter used for Bash-AWK communication. That’s bad: will a literal | in the data break the script?
- If you are going to override IFS temporarily, use the set-this-variable-for-one-command syntax.
- Why is there post-processing done using sed?
- $(echo $1 | cut -d'.' -f 1) can be better expressed in Bash using ${1%%.*}.
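The last two points can be seen in a short sketch (variable names here are just for illustration):

```shell
f='test.csv'
table=${f%%.*}          # strips everything from the first '.' onward
echo "$table"

# IFS set for this one read only; the global IFS is untouched afterwards:
IFS=',' read -r a b c <<< '0,1,foo'
echo "$a/$b/$c"
```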
A corner case is that the code generates a malformed INSERT
statement if the CSV file is empty.
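One way to handle that corner case is a guard at the top of the function; -s is true only for a non-empty file. This is only a sketch: the body below is a placeholder standing in for the real INSERT-building code.

```shell
parseCsvAdv2Db() {
    # Bail out early rather than emit a malformed "INSERT INTO t VALUES;"
    [ -s "$1" ] || return 0
    echo "would build and run the INSERT here"   # placeholder body
}
```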
Suggested solution
# $1 - Name of CSV file to read. Table name is inferred from this by
# dropping the filename extension.
# $2 - Optional (column, names) for the INSERT statement
# $dbfile - SQLite filename
parseCsvAdv2Db() {
(
local rec_sep="INSERT INTO ${1%%.*}$2 VALUES"
while IFS=',' read -r -a fields ; do
local field_sep=
echo "$rec_sep ("
for field in "${fields[@]}" ; do
echo -n "$field_sep'$(eval echo "$field")'"
field_sep=', '
done
echo -n ")"
rec_sep=,
done < "$1"
echo ';'
) | sqlite3 "$dbfile"
}
Alternatively, define this function to do just one thing, generating the INSERT statement, and let the caller pipe the result: parseCsvAdv2Db test.csv | sqlite3 "$dbfile".
parseCsvAdv2Db() {
local rec_sep="INSERT INTO ${1%%.*}$2 VALUES"
while IFS=',' read -r -a fields ; do
local field_sep=
echo "$rec_sep ("
for field in "${fields[@]}" ; do
echo -n "$field_sep'$(eval echo "$field")'"
field_sep=', '
done
echo -n ")"
rec_sep=,
done < "$1"
echo ';'
}
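The rec_sep/field_sep pattern above is the usual print-the-separator-before-each-item idiom, which avoids a trailing comma without any post-processing. A standalone sketch (join_commas is my name for it, not part of the original script):

```shell
join_commas() {
    local sep='' item
    for item in "$@"; do
        printf '%s%s' "$sep" "$item"
        sep=', '        # only becomes non-empty after the first item
    done
    printf '\n'
}

join_commas "'0'" "'1'" "'foo'"   # -> '0', '1', 'foo'
```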
Your CSV parsing is somewhat simplistic: it assumes the commas are “simple” (no quoted commas, etc.). This is not a bad thing, just an observation, but it leads on to the next thing… the cumbersome splitting and rejoining of the field values using awk. I have a confession to make here: I dislike using awk. To clarify, there are often simple ways to do things, and then there are times the simple routes are not enough. You don’t need awk for the simple stuff, and you need Perl or Python for the complicated stuff. I like the concept of awk; I just don’t like having to know three tools when two are enough.
So, I immediately look at your code and think “simplify” it.
If you are using a simple CSV parse, you may as well do it all at once, and use a simple regex and expand your sed use….
I assume you are running on a reasonably recent linux version.
- The basename program has a suffix argument: basename -s .csv file/to/test.csv will return test (the base name of the file with the suffix .csv removed).
- You are tagged as bash, so I will recommend here-strings instead of the echo commands you use. Instead of $(echo $1 | cut -d'.' -f 1), do $(cut -d'.' -f 1 <<< "$1"). It saves an echo.
- You should have a shebang line #!/bin/bash as the first line.
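A quick check of the first two suggestions (the path is made up):

```shell
t1=$(basename -s .csv path/to/test.csv)   # suffix stripped by basename
t2=$(cut -d'.' -f 1 <<< 'test.csv')       # here-string instead of echo | cut
echo "$t1 $t2"                            # both yield the bare table name
```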
So, to strip out all the awk, and more, I would have:
function parseCsvAdv2Db {
local table=$(basename -s .csv $1)
local ins="INSERT INTO $table$2 VALUES"
local data=""
while read line
do
# expand variable names inside the line.
line=$(eval echo $line)
# add parenthesis and quotes to the data.
# Internal quotes still missing.
data+="('$line')"
done < $1
# Add internal quotes, and add comma between value sets.
data=$(sed -e "s/,/','/g" -e 's/)(/),(/g' <<< "$data")
ins+="$data;"
sqlite3 $dbfile "$ins"
#echo $ins
}
By sacrificing awk’s per-field parsing, you instead gain the simplicity of bulk-modifying the comma-separated values using sed, expanding them to quote their respective values. Combined with adding the surrounding ('...')
structure for each line, it just works….
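To see the two sed expressions at work on a couple of already-wrapped rows (sample data invented):

```shell
data="('0,1,foo')('2,3,bar')"
# First quote each comma boundary, then put a comma between value tuples:
result=$(sed -e "s/,/','/g" -e 's/)(/),(/g' <<< "$data")
echo "$result"   # ('0','1','foo'),('2','3','bar')
```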