Problem
This is a project to pull domain & IP blacklists from various sources and compile them into one list. There are some whitelists included that are applied when a blacklist is built.
It draws blacklist and whitelist source entries from this JSON document. The “rule” that gets applied is intended to reduce list entries into what is strictly either a domain, IPv4 address, or IPv6 address. Such a format is designated by the “format” field. Each format is placed into its own list, except the “domain” format which is included in all three.
Notes on improvements from any related aspect are welcome and appreciated!
create_builds.sh
#!/usr/bin/env bash
set -euo pipefail
downloads=$(mktemp -d)
trap 'rm -rf "$downloads"' EXIT || exit 1
# params: list name, sort column, cache dir
sort_list() {
sort -o "$1" -k "$2" -u -S 90% --parallel=4 -T "$3" "$1"
}
for color in 'white' 'black'; do
cache_dir="${downloads}/${color}"
jq --arg color "$color" 'to_entries[] | select(.value.color == $color)' sources.json |
jq -r -s 'from_entries | keys[] as $k | "\($k)#\(.[$k] | .mirrors)"' |
while IFS=$'#' read -r key mirrors; do
echo "$mirrors" | tr -d '[]"' | tr -s ',' "\t" | gawk -v key="$key" '{
if ($0 ~ /.tar.gz$/ || /.zip$/) {
printf "%s\n out=%s.%s\n",$0,key,gensub(/^(.*[/])?[^.]*[.]?/, "", 1, $0)
} else {
printf "%s\n out=%s.txt\n",$0,key
}
}'
done | aria2c --conf-path='./configs/aria2.conf' -d "$cache_dir"
for format in 'domain' 'ipv4' 'ipv6'; do
list_name="${color}_${format}.txt"
jq --arg color "$color" --arg format "$format" 'to_entries[] | select(.value.color == $color and .value.format == $format)' sources.json |
jq -r -s 'from_entries | keys[] as $k | "\($k)#\(.[$k] | .rule)"' |
while IFS=$'#' read -r key rule; do
fpath=$(find -P -O3 "$cache_dir" -type f -name "$key*")
case $fpath in
*.tar.gz)
# Both Shallalist and Ut-capitole adhere to this format
# If any archives are added that do not, this line needs to change
tar -xOzf "$fpath" --wildcards-match-slash --wildcards '*/domains'
;;
*.zip) zcat "$fpath" ;;
*) cat "$fpath" ;;
esac |
gawk --sandbox -O -- "$rule" | # apply the regex rule
gawk '!x[$0]++' | # filter duplicates out
gawk -v format="$format" -v color="$color" '{
switch (format) {
case "domain":
print $0 >> color "_domain.txt"
break
case "ipv4":
print "0.0.0.0 " $0 >> color "_ipv4.txt"
break
case "ipv6":
print ":: " $0 >> color "_ipv6.txt"
break
default:
break
}
}'
done
if test -f "$list_name"; then
if [[ "$format" == "domain" ]]; then
sort_list "$list_name" 1 "$cache_dir"
else
sort_list "$list_name" 2 "$cache_dir"
fi
if [[ "$color" == "black" ]]; then
if test -f "white_${format}.txt"; then
grep -Fxvf "white_${format}.txt" "black_${format}.txt" | sponge "black_${format}.txt"
fi
tar -czf "black_$format.tar.gz" "black_$format.txt"
fi
if [[ "$format" == "domain" ]]; then
gawk '{ print "0.0.0.0 " $0 }' "${color}_domain.txt" >"${color}_ipv4.txt"
gawk '{ print ":: " $0 }' "${color}_domain.txt" >"${color}_ipv6.txt"
fi
fi
done
done
Solution
Minor nitpicks
While assigning variables from unquoted command substitutions is apparently safe, I personally prefer double quoting my command substitutions regardless of context. I find rules easier to remember if I follow them consistently. So I would have double quoted the mktemp near the start and the find when handling formats, but you should be just fine if you choose not to.
You may want to set a restrictive umask before creating any files, to make sure access to those files is as restricted as possible. mktemp usually restricts the permissions of the files it creates on its own as well, but getting in the habit of doing umask 077 or something similar at the start of scripts doesn't hurt.
The way aria2c is used, it should definitely use stdin as its input file. This appears to be handled in configs/aria2.conf, since no input file is provided by the script itself. This means that unless that config file includes the line input-file=-, the aria2c call will fail due to getting no input, or at least that's how it behaves on my system. It seems neater not to rely on that line being present in the config file, since we would never want aria2c to read its input from anywhere but stdin here, so explicitly calling aria2c -i - seems like it might be a good idea.
The file's name probably shouldn't end in .sh. The language an executable is written in is an implementation detail the caller shouldn't have to care about, so removing the extension should be fine. And even if you do want an extension, this is not an sh script but a bash script.
The second series of jq calls is a bit strange. To me there's no obvious reason to split the processing between two jq processes, or to use from_entries for that matter. As the first jq ends we have a stream of objects that look like
{
key: "we want this",
value: {
rule: "we want this too"
}
}
Making those objects into strings should just be a single extra step, resulting in something similar to
jq -r --arg color "$color" --arg format "$format" 'to_entries[] | select(.value.color == $color and .value.format == $format) | "\(.key)#\(.value.rule)"' sources.json
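Going back to the first two nitpicks, here is a minimal sketch of what the top of the script could look like with a restrictive umask and a quoted command substitution. The example.txt file and the perms variable are hypothetical and exist only to show the effect of the umask; stat -c is the GNU form and is an assumption about your system.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Restrict permissions on everything the script creates from here on.
umask 077

# Quoted command substitution: not strictly required in an assignment,
# but consistent with quoting everywhere else.
downloads="$(mktemp -d)"
trap 'rm -rf "$downloads"' EXIT

# Hypothetical file, created only to show the effect of the umask.
touch "$downloads/example.txt"
perms="$(stat -c '%a' "$downloads/example.txt")"  # GNU stat (assumption)
echo "example.txt mode: $perms"
```

With umask 077 in place, files created by the script end up readable and writable by the owner only.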
The downloading part
I feel like if you're going to use jq to parse your data, you might as well take advantage of the fact that you already have parsed, structured data and a rather expressive language to manipulate it with. Serializing it to a less structured format may be useful if you need to do things jq itself can't do (such as interacting with the file system), but here we don't do anything jq can't do on its own. We just lose access to the structured data, named fields and well-defined objects for no clear reason, when we could simply stay in jq.
For example, in the first step, we want the mirrors as a tab-separated list, and we want an output line which depends on the file extension of one of the mirrors. As they are mirrors they presumably all provide the same file, so it shouldn't matter which mirror's extension we look at if there are several. Your implementation looks at the last one, and I don't see much of a reason to change that, so I'm doing the same here.
So, what do we do? Well, having filtered our entries, we can keep operating on them. We only want .key and .value.mirrors, so we can discard the rest. Then, since jq has regexes, we can look at the mirror links to figure out the file extension and add that to the object. And then, with a list of mirrors, a file name and a file extension, outputting the two lines we want per key is simple with the help of the , operator. All in all, we can fold all the commands leading into the aria2c call into a single jq filter:
to_entries[]
| select(.value.color == $color)
| {key, mirrors: .value.mirrors}
| .extension = (.mirrors[-1] | match("\\.(tar\\.gz|zip)").captures[0].string // "txt")
| (.mirrors | join("\t")), " out=\(.key).\(.extension)"
Because of the simplified regex, this is subtly different from your implementation in cases like the alexa_anti_porn list: your implementation produces alexa_anti_porn.csv.zip, this one produces alexa_anti_porn.zip. It's not obvious which one you intended, but the rest of the program doesn't seem to care one way or the other as far as I can tell.
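To make that difference concrete, here is a rough bash emulation of the two extension rules. It is not the gawk or jq code itself, only a sketch of what each one keeps, and the URL is a hypothetical stand-in since the real mirror isn't shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical mirror URL ending in a double extension.
url='https://example.com/lists/top-sites.csv.zip'

# Roughly what the gensub() call keeps: everything after the first dot
# of the file name.
base="${url##*/}"
full_ext="${base#*.}"

# Roughly what the simplified match() keeps: only "tar.gz" or "zip",
# falling back to "txt" when neither is found.
if [[ "$url" =~ \.(tar\.gz|zip) ]]; then
  short_ext="${BASH_REMATCH[1]}"
else
  short_ext='txt'
fi

echo "$full_ext"
echo "$short_ext"
```

The first rule keeps the whole csv.zip suffix, the second only zip. Either works for the case statement later in the script, since both still end in .zip.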
All in all, 8 of the 9 commands in that whole pipeline can be neatly folded into a single jq call, which we then pipe into aria2c:
jq -r --arg color "$color" '
to_entries[]
| select(.value.color == $color)
| {key, mirrors: .value.mirrors}
| .extension = (.mirrors[-1] | match("\\.(tar\\.gz|zip)").captures[0].string // "txt")
| (.mirrors | join("\t")), " out=\(.key).\(.extension)"
' sources.json | aria2c -i - -d "$cache_dir" --conf-path=./configs/aria2.conf
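For reference, this is the shape of input that aria2c -i - expects for each download: one line with all mirrors separated by tabs, followed by an indented option line. The sketch below builds one such entry in plain bash; the key and the mirror URLs are hypothetical stand-ins for one entry of sources.json.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical key and mirrors, standing in for one entry of sources.json.
key='example_list'
mirrors=(
  'https://mirror-a.example/list.tar.gz'
  'https://mirror-b.example/list.tar.gz'
)

# Line 1: all mirrors of one download, joined by tabs.
# Line 2: an option line; the leading space is what marks it as an option.
entry="$(
  (IFS=$'\t'; printf '%s\n' "${mirrors[*]}")
  printf ' out=%s.%s\n' "$key" 'tar.gz'
)"

printf '%s\n' "$entry"
```

If the leading space on the out= line goes missing, aria2c would treat that line as another URI rather than as an option for the download above it.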
AWK and writing the lists
Those last two gawk calls seem a bit odd to me. The first one just acts as a filter on the second, and as far as I can tell there's no reason those need to be separate. The second one also uses the color and format variables to rebuild the list_name variable that's already in scope when the gawk script runs. The use of switch is also a bit awkward: all the branches fundamentally do the same thing, they just operate on slightly different data. Using a map of sorts for that feels more natural to me. I think we should be able to get the same results with
gawk -v format="$format" -v filename="$list_name" '
BEGIN {
prefixes["ipv4"] = "0.0.0.0 "
prefixes["ipv6"] = ":: "
prefixes["domain"] = ""
}
!seen[$0]++ {
print prefixes[format] $0 >> filename
}
'
or even
gawk -v format="$format" '
BEGIN {
prefixes["ipv4"] = "0.0.0.0 "
prefixes["ipv6"] = ":: "
prefixes["domain"] = ""
}
!seen[$0]++ {
print prefixes[format] $0
}
' > "$list_name"
Granted, I’ve been wrong before and I’m no less tired now than I was then, so it’s very possible that I’m missing something here.
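One quick way to check the prefix map is to run it on a few sample lines. This is a minimal sketch: the sample addresses and the prefix_and_dedupe function name are hypothetical, and it uses plain awk because this program doesn't need any gawk extensions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample input containing one duplicate line.
sample=$'192.0.2.1\n192.0.2.2\n192.0.2.1'

# Same program as above, reading stdin and writing stdout.
prefix_and_dedupe() {
  awk -v format="$1" '
    BEGIN {
      prefixes["ipv4"] = "0.0.0.0 "
      prefixes["ipv6"] = ":: "
      prefixes["domain"] = ""
    }
    # Print each distinct line once, with the prefix for this format.
    !seen[$0]++ { print prefixes[format] $0 }
  '
}

result="$(printf '%s\n' "$sample" | prefix_and_dedupe ipv4)"
printf '%s\n' "$result"
```

One thing to watch with the > "$list_name" variant: the gawk call sits inside the while loop, so a plain > on the gawk command would truncate the list for every source. Keeping >> there, or moving the redirection onto the loop's done, avoids that. The seen array also only removes duplicates within a single source, because each iteration starts a new process; the later sort -u takes care of the rest.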