Advanced CSV-to-SQLite converter

Problem

I just wrote an advanced CSV parser that translates CSV into SQL (SQLite-compatible) statements, which are then executed against the database. It also expands variable names in the CSV to values defined in the script.

Example CSV

0,1,An empty black tile,${ASSET_PATH}/BlackTile.png
1,1,Grassy Tile,${ASSET_PATH}/Grass.png
2,1,Lava Tile,${ASSET_PATH}/Lava.png
3,1,Stone Brick Texture,${ASSET_PATH}/StoneBrick.png
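
With ASSET_PATH pointing at ~/spacegame/assets (shown expanded to a made-up /home/user below), the statement the script is meant to produce looks roughly like this:

INSERT INTO test(id,type,descr,path) VALUES
('0','1','An empty black tile','/home/user/spacegame/assets/BlackTile.png'),
('1','1','Grassy Tile','/home/user/spacegame/assets/Grass.png'),
('2','1','Lava Tile','/home/user/spacegame/assets/Lava.png'),
('3','1','Stone Brick Texture','/home/user/spacegame/assets/StoneBrick.png');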

The code

# Variables that are used in the CSV files
ASSET_PATH=~/spacegame/assets

# $1 - File to read. Named the same as the table to insert into, plus .csv
# $2 - Optional (column, names) for the INSERT statement
function parseCsvAdv2Db {
    local oldIFS=$IFS
    local table=$(echo $1 | cut -d'.' -f 1)
    local ins="INSERT INTO $table$2 VALUES"
    IFS='|'
    while read line
    do
        # Preprocess the line
        local data=$(eval echo "$line" | 
            awk 'BEGIN { FS=","; OFS="|"; } { $1=$1; print $0; }')
        local tmpdata='('
        for field in $data
        do
            tmpdata+="'$field'"
        done
        tmpdata+=')'
        ins+="$tmpdata"
    done < $1

    ins=$(echo $ins | sed -e 's/)(/),(/g' -e "s/''/','/g")
    ins+=';'

    sqlite3 $dbfile "$ins"

    # Restore state
    IFS=$oldIFS
}

parseCsvAdv2Db test.csv '(id,type,descr,path)'

Solution

First of all, it should be noted that this script would be vulnerable to arbitrary command execution (through its use of eval) as well as SQL injection. It might be OK if you trust the CSV data not to contain malicious shell commands or characters with special significance in SQL.
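
If you do keep this approach, the standard SQL-side mitigation is to double any embedded single quotes before splicing a field into the statement. A minimal sketch (the sql_quote helper is hypothetical, and it does nothing about the eval problem):

# Quote one field for use in a SQL string literal.
# SQLite escapes a single quote inside '...' by doubling it.
sql_quote() {
    local q="'"
    printf "'%s'" "${1//$q/$q$q}"
}

sql_quote "O'Brien"    # prints 'O''Brien'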

Several features make this code hard to follow:

  • Mixing Bash and AWK. It should either be pure Bash or mostly AWK (with a thin Bash wrapper to invoke AWK with the right parameters). Calling AWK like this, especially with one invocation per line, is both confusing and bad for performance.
  • What does the | character have to do with anything? It seems to be a secret delimiter used for Bash-AWK communication. That’s bad: will a literal | in the data break the script?
  • If you are going to override IFS temporarily, use the set-this-variable-for-one-command syntax (illustrated after this list).
  • Why is there post-processing done using sed?
  • $(echo $1 | cut -d'.' -f 1) can be better expressed in Bash using ${1%%.*}.
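
To illustrate those last two points, a quick sketch of both idioms:

# IFS is set for the read command only; the global IFS is untouched.
while IFS=',' read -r -a fields; do
    printf '%s\n' "${fields[@]}"
done < "$1"

# Parameter expansion instead of a subshell with echo and cut:
table=${1%%.*}    # "test.csv" -> "test"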

A corner case is that the code generates a malformed INSERT statement if the CSV file is empty.
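
A cheap guard against that, keeping the original structure, is to bail out at the top of the function before building the statement:

# Skip empty input so we never emit a bare "INSERT INTO table VALUES;".
[ -s "$1" ] || return 0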

Suggested solution

# $1 - Name of CSV file to read. Table name is inferred from this by
#      dropping the filename extension.
# $2 - Optional (column, names) for the INSERT statement
# $dbfile - SQLite filename
parseCsvAdv2Db() {
    (
        local rec_sep="INSERT INTO ${1%%.*}$2 VALUES"
        while IFS=',' read -r -a fields ; do
            local field_sep=
            echo "$rec_sep ("
            for field in "${fields[@]}" ; do
                echo -n "$field_sep'$(eval echo "$field")'"
                field_sep=', '
            done
            echo -n ")"
            rec_sep=,
        done < "$1"
        echo ';'
    ) | sqlite3 "$dbfile"
}
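
Usage is unchanged; dbfile just has to be set in the calling scope (game.db is a made-up name here):

dbfile=game.db
parseCsvAdv2Db test.csv '(id,type,descr,path)'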

Alternatively, define this function to do just one thing — generate the INSERT statement — and let the caller pipe the result: parseCsvAdv2Db test.csv | sqlite3 "$dbfile".

parseCsvAdv2Db() {
    local rec_sep="INSERT INTO ${1%%.*}$2 VALUES"
    while IFS=',' read -r -a fields ; do
        local field_sep=
        echo "$rec_sep ("
        for field in "${fields[@]}" ; do
            echo -n "$field_sep'$(eval echo "$field")'"
            field_sep=', '
        done
        echo -n ")"
        rec_sep=,
    done < "$1"
    echo ';'
}
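
A nice side effect of the decoupled version is that you can inspect the generated SQL before running it:

# Dry run: just print the generated statement.
parseCsvAdv2Db test.csv '(id,type,descr,path)'

# Pipe it into sqlite3 once it looks right.
parseCsvAdv2Db test.csv '(id,type,descr,path)' | sqlite3 "$dbfile"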

Your CSV parsing is somewhat simplistic: it assumes the commas are “simple” (no quoted commas, etc.). That is not a bad thing, just an observation, but it leads on to the next point: the cumbersome splitting and rejoining of the field values using awk. I have a confession to make here: I dislike using awk. To clarify, there are often simple ways to do things, and then there are times the simple routes are not enough. You don’t need awk for the simple stuff, and you need Perl or Python for the complicated stuff. I like the concept of awk; I just don’t like having to know three tools when two are enough.

So, I immediately look at your code and think: “simplify it”.

If you are using a simple CSV parse, you may as well do it all at once: use a simple regex and expand your use of sed.

I assume you are running on a reasonably recent Linux version.

  • the basename program has the suffix argument: basename -s .csv file/to/test.csv will return test (the base name of the file with the suffix .csv removed).
  • the question is tagged bash, so I will recommend here-strings instead of the echo commands you use. Instead of $(echo $1 | cut -d'.' -f 1), do $(cut -d'.' -f 1 <<< "$1"). It saves an echo.
  • you should have a shebang line #!/bin/bash as the first line (all three tips are sketched together below).
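
Put together, those tips look like this (a sketch, not the full function):

#!/bin/bash

# basename -s strips the extension as well as the directory part.
table=$(basename -s .csv "$1")     # "path/to/test.csv" -> "test"

# Here-string instead of echo piped into cut:
prefix=$(cut -d'.' -f 1 <<< "$1")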

So, to strip out all the awk, and more, I would have:

function parseCsvAdv2Db {
    local table=$(basename -s .csv $1)
    local ins="INSERT INTO $table$2 VALUES"
    local data=""

    while read line
    do
        # expand variable names inside the line.
        line=$(eval echo $line)

        # add parenthesis and quotes to the data.
        # Internal quotes still missing.
        data+="('$line')"
    done < $1

    # Add internal quotes, and add comma between value sets.
    data=$(sed -e "s/,/','/g" -e 's/)(/),(/g' <<< "$data")

    ins+="$data;"

    sqlite3 $dbfile "$ins"
    #echo $ins

}

By sacrificing awk’s per-field parsing, you instead gain the simplicity of bulk-modifying the commas with sed, expanding them into the quotes around their respective values. Combined with adding the surrounding ('...') structure for each line, it just works.
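
To see why, trace the first sample line through the function (home directory hypothetical):

# Line after eval:   0,1,An empty black tile,/home/user/spacegame/assets/BlackTile.png
# After wrapping:    ('0,1,An empty black tile,/home/user/spacegame/assets/BlackTile.png')
# After s/,/','/g:   ('0','1','An empty black tile','/home/user/spacegame/assets/BlackTile.png')
# s/)(/),(/g then inserts the commas between consecutive rows.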
