Problem
The following command series is used to gather every bit of valuable information from Exodus trackers. The ultimate goal would be to get this all into one `jq` statement, and if that's not possible, then to simplify as much as possible.

I'm aware of the `sub` and `gsub` functions available in `jq`, and tried to use them in the statement that parses network signatures. However, operating on the backslashes (even though they were properly escaped) didn't work, hence the `sed` cop-out.

Feedback from any angle is welcome!
JSON input sample:
{
"trackers": [
{
"name": "ACRCloud",
"code_signature": "com.acrcloud",
"network_signature": "acrcloud.com|hb-minify-juc1ugur1qwqqqo4.stackpathdns.com",
"website": "https://acrcloud.com/"
},
{
"name": "ADLIB",
"code_signature": "com.mocoplex.adlib.",
"network_signature": "adlibr\\.com",
"website": "https://adlibr.com"
},
{
"name": "ADOP",
"code_signature": "com.adop.sdk.",
"network_signature": "",
"website": "http://adop.cc/"
},
{
"name": "fullstory",
"code_signature": "com.fullstory.instrumentation.|com.fullstory.util.|com.fullstory.jni.|com.fullstory.FS|com.fullstory.rust.|com.fullstory.FSSessionData",
"network_signature": "",
"website": "https://www.fullstory.com/"
}
]
}
exodus.bash:
curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' -o trackers.json
jq -r '.trackers[].code_signature | split("|") | reverse | .[] | split(".") | reverse | join(".") | ltrimstr(".")' trackers.json >tmp.txt
jq -r '.trackers[].network_signature | split("|") | .[]' trackers.json | sed -e 's/\\\././g' -e 's/\\//g' -e 's/^\.//' >>tmp.txt
jq -r '.trackers[].website' trackers.json | mawk -F/ '{print $3}' >>tmp.txt
gawk '/^([[:alnum:]_-]{1,63}\.)+[[:alpha:]]+([[:space:]]|$)/{print tolower($1)}' tmp.txt >exodus_domains.txt
gawk '/^([0-9]{1,3}\.){3}[0-9]{1,3}(\/|:|$)/{sub(/:.*/,"",$1); print $1}' tmp.txt >exodus_ips.txt
rm trackers.json
rm tmp.txt
Solution
Since you said the order of the outputs doesn't really matter, the first `reverse` in the `.code_signature` path seems useless – as far as I can tell, all it does is change the order of certain outputs (the second `reverse`, on the other hand, is useful).

In `jq`, piping to `.[]` is fine, but maybe a bit awkward. `[]` can be applied on top of most simple filters, and I don't believe `split("|")[]` is any less clear than `split("|") | .[]`. But that's definitely a matter of opinion, and the `| .[]` approach is valid too.
The first of your `sed` patterns seems a bit redundant. It replaces `\.` with `.` – so basically it removes some `\`s. But then the very next pattern removes all `\`s anyway, so the first one does end up seeming a bit pointless.
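To see that on a made-up line: the second pattern alone already strips the backslash, with or without the first pattern in front of it.

```shell
# One pattern or two, the result is the same:
printf 'adlibr\\.com\n' | sed -e 's/\\//g'                  # adlibr.com
printf 'adlibr\\.com\n' | sed -e 's/\\\././g' -e 's/\\//g'  # adlibr.com
```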
All you need to replicate the last two `sed` patterns in `jq` should be `gsub("\\\\"; "") | ltrimstr(".")`. But if you do want that first pattern as well, `gsub("\\\\\\."; ".")` seems to work fine. Yes, that's a lot of `\`s, but each `\\` pair in a string literal represents just a single `\` in the actual string.
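For example, on a made-up signature string (note the doubled backslashes in the jq string literals):

```shell
# Remove every backslash -- the regex \\ is the jq string "\\\\":
jq -nr '"adlibr\\.com" | gsub("\\\\"; "")'        # adlibr.com
# Or mimic the first sed pattern -- the regex \\\. is the jq string "\\\\\\.":
jq -nr '"adlibr\\.com" | gsub("\\\\\\."; ".")'    # adlibr.com
```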
The `mawk` call appears to just split its input on `/` and then take the 3rd item. We can do that as well without leaving `jq`, by simply doing `split("/")[2]`.
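Using one of the sample websites: splitting on `/` yields `["https:", "", "www.fullstory.com", ""]`, and index 2 is the host.

```shell
# Index 2 of the split is the host part of the URL:
jq -nr '"https://www.fullstory.com/" | split("/")[2]'   # www.fullstory.com
```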
If you move both the `sed` and the `mawk` into `jq`, we then have 3 `jq` commands with a common source (`trackers.json`) and a common destination (`tmp.txt`). At that point, it becomes possible to combine the three commands into one – `,` seems like a good tool for that.
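To illustrate `,` on a toy object: it feeds the same input to several filters and concatenates their output streams.

```shell
# .a produces 1; .b[] produces 2 and 3; `,` chains the streams together:
jq -cn '{a: 1, b: [2, 3]} | (.a, .b[])'
```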
Now, I’m fairly sure there are ways to do that without changing the order of the outputs, but since you said the order doesn’t particularly matter, the following approach feels the most natural to me:
.trackers[] |
(.code_signature | split("|")[] | split(".") | reverse | join("."))
, (.network_signature | split("|")[] | gsub("\\\\"; ""))
, (.website | split("/")[2])
| ltrimstr(".")
Having done that, we no longer need to save `trackers.json` – we can pipe `curl`'s output straight into `jq`:
curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' |
jq -r '.trackers[] |
(.code_signature | split("|")[] | split(".") | reverse | join("."))
, (.network_signature | split("|")[] | gsub("\\\\"; ""))
, (.website | split("/")[2])
| ltrimstr(".")
' > tmp.txt
On a similar note, it's also possible to combine the `gawk`s into a single command. One possible approach might look like:
{ switch ($1) {
case /^([[:alnum:]_-]{1,63}\.)+[[:alpha:]]+([[:space:]]|$)/:
print tolower($1) > "exodus_domains.txt"
break
case /^([0-9]{1,3}\.){3}[0-9]{1,3}(\/|:|$)/:
sub(/:.*/, "", $1)
print $1 > "exodus_ips.txt"
break
} }
I'm sure there are cleaner ways to get similar results, but either way, if we end up making it a single command, we can then go on to pipe our `jq` straight into that. That way we can get rid of `tmp.txt` as well:
curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' |
jq -r '.trackers[] |
(.code_signature | split("|")[] | split(".") | reverse | join("."))
, (.network_signature | split("|")[] | gsub("\\\\"; ""))
, (.website | split("/")[2])
| ltrimstr(".")
' |
gawk '{ switch ($1) {
case /^([[:alnum:]_-]{1,63}\.)+[[:alpha:]]+([[:space:]]|$)/:
print tolower($1) > "exodus_domains.txt"
break
case /^([0-9]{1,3}\.){3}[0-9]{1,3}(\/|:|$)/:
sub(/:.*/, "", $1)
print $1 > "exodus_ips.txt"
break
} }'
And now we don’t need to worry about creating and cleaning up temporary files, because all the action happens within a single pipeline.
Though it does get a bit unwieldy. It might be worth it at this point to break the `jq` and `awk` scripts out into separate files and do something closer to:
curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' |
jq -f parse-ETIP-hosts.jq -r |
awk -f partition-domains-and-IPs.awk \
    -v ip_file=exodus_ips.txt \
    -v domain_file=exodus_domains.txt