Extracting domains and IPs from Exodus trackers JSON report

Problem

The following command series is used to gather every bit of valuable information from Exodus trackers. The ultimate goal would be to get this all into one jq statement, and if that’s not possible then to just simplify as much as possible.

I’m aware of the sub and gsub functions available in jq, and tried to use them in the statement that parses network signatures. However, operating on the backslashes (even though they were properly escaped) didn’t work, hence the sed cop-out.

Feedback from any angle is welcome!

JSON Input

Sample:

{
  "trackers": [
    {
      "name": "ACRCloud",
      "code_signature": "com.acrcloud",
      "network_signature": "acrcloud.com|hb-minify-juc1ugur1qwqqqo4.stackpathdns.com",
      "website": "https://acrcloud.com/"
    },
    {
      "name": "ADLIB",
      "code_signature": "com.mocoplex.adlib.",
      "network_signature": "adlibr\.com",
      "website": "https://adlibr.com"
    },
    {
      "name": "ADOP",
      "code_signature": "com.adop.sdk.",
      "network_signature": "",
      "website": "http://adop.cc/"
    },
    {
      "name": "fullstory",
      "code_signature": "com.fullstory.instrumentation.|com.fullstory.util.|com.fullstory.jni.|com.fullstory.FS|com.fullstory.rust.|com.fullstory.FSSessionData",
      "network_signature": "",
      "website": "https://www.fullstory.com/"
    }
  ]
}

Domain Output

IP Output

exodus.bash

curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' -o trackers.json

jq -r '.trackers[].code_signature | split("|") | reverse | .[] | split(".") | reverse | join(".") | ltrimstr(".")' trackers.json >tmp.txt
jq -r '.trackers[].network_signature | split("|") | .[]' trackers.json | sed -e 's/\\\././g' -e 's/\\//g' -e 's/^\.//' >>tmp.txt
jq -r '.trackers[].website' trackers.json | mawk -F/ '{print $3}' >>tmp.txt

gawk '/^([[:alnum:]_-]{1,63}\.)+[[:alpha:]]+([[:space:]]|$)/{print tolower($1)}' tmp.txt >exodus_domains.txt
gawk '/^([0-9]{1,3}\.){3}[0-9]{1,3}+(\/|:|$)/{sub(/:.*/,"",$1); print $1}' tmp.txt >exodus_ips.txt

rm trackers.json
rm tmp.txt

Solution

Since you said the order of the outputs doesn’t really matter, the first reverse in the .code_signature path seems useless – as far as I can tell, all it does is change the order of certain outputs (the second reverse on the other hand is useful).
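To see why, here is the inner split/reverse/join applied to a single signature (a quick check at the shell, assuming jq is installed):

```shell
# Reversing the dot-separated components turns a package-style
# signature into a domain-style one; the outer reverse over the
# array of signatures never affects this result, only output order.
jq -rn '"com.acrcloud" | split(".") | reverse | join(".")'
# prints: acrcloud.com
```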

In jq, piping to .[] is fine, but maybe a bit awkward. [] can be applied on top of most simple filters, and I don’t believe split("|")[] is any less clear than split("|") | .[]. But that’s definitely a matter of opinion, and the | .[] approach is valid too.
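For example, both spellings produce exactly the same stream:

```shell
# split("|")[] and split("|") | .[] both iterate the resulting array,
# emitting one element per output line.
jq -rn '"a|b|c" | split("|")[]'
jq -rn '"a|b|c" | split("|") | .[]'
```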

The first of your sed patterns seems a bit redundant. It replaces \. with . – so basically it removes some backslashes. But then the very next pattern removes all backslashes anyway, so the first one does end up seeming a bit pointless.

All you need to replicate the last two sed patterns in jq should be gsub("\\\\"; "") | ltrimstr("."). But if you do want that first pattern as well, gsub("\\\\\\."; ".") seems to work fine. Yes, that's a lot of backslashes, but each \\ pair in a string literal represents just a single \ in the actual string.

The mawk call appears to just split its input on / and then take the 3rd item. We can do that as well without leaving jq by simply doing split("/")[2].
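A quick check of that indexing (splitting a URL on / leaves the scheme at index 0, an empty string at index 1, and the host at index 2):

```shell
# Splitting a URL on "/" yields ["https:", "", "host", ...],
# so index 2 is the host part that mawk -F/ '{print $3}' extracted.
jq -rn '"https://www.fullstory.com/" | split("/")[2]'
# prints: www.fullstory.com
```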

If you move both the sed and the mawk into jq, we then have 3 jq commands with a common source (trackers.json) and a common destination (tmp.txt). At that point, it becomes possible to combine the three commands into one – the , operator seems like a good tool for that.
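As a tiny illustration of the , operator: each side runs against the same input, and the output streams are concatenated.

```shell
# The comma operator feeds the same input to each filter in turn
# and emits both results, one after the other.
jq -rn '"x" | (. + "1"), (. + "2")'
# prints: x1 then x2
```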

Now, I’m fairly sure there are ways to do that without changing the order of the outputs, but since you said the order doesn’t particularly matter, the following approach feels the most natural to me:

.trackers[] |
        (.code_signature | split("|")[] | split(".") | reverse | join("."))
      , (.network_signature | split("|")[] | gsub("\\\\"; ""))
      , (.website | split("/")[2])
    | ltrimstr(".")

Having done that, we no longer need to save trackers.json – we can pipe curl's output straight into jq:

curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' |
    jq -r '.trackers[] |
          (.code_signature | split("|")[] | split(".") | reverse | join("."))
        , (.network_signature | split("|")[] | gsub("\\\\"; ""))
        , (.website | split("/")[2])
      | ltrimstr(".")
    ' > tmp.txt
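As a quick sanity check, here is that filter run against a one-tracker excerpt of the sample input (a heredoc stands in for the curl output):

```shell
# The one tracker yields three lines: the reversed code signature,
# the network signature with backslashes stripped, and the website host.
jq -r '.trackers[] |
      (.code_signature | split("|")[] | split(".") | reverse | join("."))
    , (.network_signature | split("|")[] | gsub("\\\\"; ""))
    , (.website | split("/")[2])
  | ltrimstr(".")
' <<'EOF'
{"trackers": [{"name": "ADLIB",
  "code_signature": "com.mocoplex.adlib.",
  "network_signature": "adlibr\\.com",
  "website": "https://adlibr.com"}]}
EOF
# prints: adlib.mocoplex.com, adlibr.com, adlibr.com (one per line)
```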

On a similar note, it’s also possible to combine the gawks into a single command. One possible approach might look like:

{ switch ($1) {
    case /^([[:alnum:]_-]{1,63}\.)+[[:alpha:]]+([[:space:]]|$)/:
        print tolower($1) > "exodus_domains.txt"
        break
    case /^([0-9]{1,3}\.){3}[0-9]{1,3}+(\/|:|$)/:
        sub(/:.*/, "", $1)
        print $1 > "exodus_ips.txt"
        break
} }

I’m sure there are cleaner ways to get similar results, but either way, if we end up making it a single command we can then go on to pipe our jq straight into that. That way we can get rid of tmp.txt as well:

curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' |
    jq -r '.trackers[] |
          (.code_signature | split("|")[] | split(".") | reverse | join("."))
        , (.network_signature | split("|")[] | gsub("\\\\"; ""))
        , (.website | split("/")[2])
      | ltrimstr(".")
    ' |
    gawk '{ switch ($1) {
        case /^([[:alnum:]_-]{1,63}\.)+[[:alpha:]]+([[:space:]]|$)/:
            print tolower($1) > "exodus_domains.txt"
            break
        case /^([0-9]{1,3}\.){3}[0-9]{1,3}+(\/|:|$)/:
            sub(/:.*/, "", $1)
            print $1 > "exodus_ips.txt"
            break
    } }'

And now we don’t need to worry about creating and cleaning up temporary files, because all the action happens within a single pipeline.

Though it does get a bit unwieldy. It might be worth it at this point to break the jq and awk scripts out into separate files and do something closer to

curl -s 'https://etip.exodus-privacy.eu.org/trackers/export' |
    jq -f parse-ETIP-hosts.jq -r |
    awk -f partition-domains-and-IPs.awk \
        -v ip_file=exodus_ips.txt \
        -v domain_file=exodus_domains.txt
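A sketch of what the partitioning could look like with those -v variables wired in, inlined here with sample input for illustration. The file names come from the variables instead of being hard-coded, the patterns are simplified stand-ins for the fuller ones above, and plain pattern-action rules replace the gawk-only switch so any POSIX awk works:

```shell
# Hypothetical sketch of partition-domains-and-IPs.awk, shown inline.
# domain_file and ip_file arrive via -v; patterns simplified for brevity.
printf '%s\n' 'Example.COM' '10.0.0.1:443' |
awk -v domain_file=exodus_domains.txt -v ip_file=exodus_ips.txt '
    # Looks like a domain: lowercase it and file it away.
    $1 ~ /^([A-Za-z0-9_-]+\.)+[A-Za-z]+$/ { print tolower($1) > domain_file; next }
    # Looks like an IPv4 address: strip any port suffix, then file it.
    $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/ { sub(/:.*/, "", $1); print $1 > ip_file }
'
```

After the run, exodus_domains.txt contains example.com and exodus_ips.txt contains 10.0.0.1.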
