Excel to JSON parser with http download

Posted on

Problem

I’ve been working on a project (link) to download a spreadsheet of known ransomware and properties and turn it into json so I can better consume the information within early detection projects.

I’m new to python – what can I be doing better? The destination json I’m converting can be found here.

update_json.py (entry point)

from excel_to_json import excel_to_json
from download_file import download_file

SOURCESHEET = 'https://docs.google.com/spreadsheets/d/1TWS238xacAto-fLKh1n5uTsdijWdCEsGIM0Y0Hvmc5g/pub?output=xlsx'
OUTPUTSHEET = '../RansomwareOverview.xlsx'
JSONFILE = '../ransomware_overview.json'

def write_json_file(json_data, filename):
    output = open(filename, 'w')
    output.writelines(json_data)

def generate_json(source_file, download_destination, json_file):
    download_file(source_file, download_destination)
    write_json_file(excel_to_json(download_destination), json_file)

generate_json(SOURCESHEET, OUTPUTSHEET, JSONFILE)

download_file.py

import urllib.request

def download_file(source, destination):
    try:
        urllib.request.urlretrieve(source, destination)
    except IOError:
        print('An error occured trying to write an updated spreadsheet. Do you already have it open?')
    except urllib.error.URLError:
        print('An error occured trying to download the file. Please check the source and try again.')

excel_to_json.py

import simplejson as json
import xlrd
from collections import OrderedDict

def excel_to_json(filename):
    wb = xlrd.open_workbook(filename)
    sh = wb.sheet_by_index(0)
    mw = wb.sheet_by_index(2)

    # List to hold dictionaries
    c_list = []

    # Iterate through each row in worksheet and fetch values into dict
    for rownum in range(1, sh.nrows):
        wares = OrderedDict()
        row_values = sh.row_values(rownum)

        if row_values[6]=="":
            name = row_values[0]
            gre=[name]
        elif "," in row_values[6]:
            e=row_values[6].split(",")
            ge = [row_values[0]]
            gre=e+ge
        else:
            gre=[row_values[0],row_values[6]]

        wares['name'] = gre
        wares['extensions'] = row_values[1]
        wares['extensionPattern'] = row_values[2]
        wares['ransomNoteFilenames'] = row_values[3]
        wares['comment'] = row_values[4]
        wares['encryptionAlgorithm'] = row_values[5]
        wares['decryptor'] = row_values[7]
        if row_values[8]=="":
            wares['resources'] = [row_values[9]]
        elif row_values[9]=="":
            wares['resources']=[row_values[8]]
        else:
            wares['resources'] = [row_values[8], row_values[9]]
        wares['screenshots'] = row_values[10]

        for r in range(1, mw.nrows):
            rowe = mw.row_values(r)

            if row_values[0] == rowe[0]:
                wares['microsoftDetectionName']=rowe[1]
                wares['microsoftInfo'] = rowe[2]
                wares['sandbox'] = rowe[3]
                wares['iocs'] = rowe[4]
                wares['snort'] = rowe[5]

        c_list.append(wares)

    # Serialize the list of dicts to JSON
    return json.dumps(c_list, indent=4)

Solution

Performance tips:

  • ujson can bring more speed
  • since both simplejson and xlrd are pure-python, you may get performance improvements “for free” by switching to PyPy
  • you may (or not) see speed and memory usage improvements if switching to openpyxl and especially in the “read-only” mode
  • in the excel_to_json function, you are accessing the same values from row_values by index multiple times. Defining intermediate variables (e.g. defining name = row_values[6] and using name variable later on) and avoiding accessing an element by index more than once might have a positive impact
  • I’m not sure I completely understand the inner for r in range(1, mw.nrows) loop. Can you break once you get the if row_values[0] == rowe[0] evaluated to True?
  • are you sure you need the OrderedDict and cannot get away with a regular dict? (there is a serious overhead for CPythons prior to 3.6)
  • instead of .dumps() and a separate function to dump a JSON string to a file – use .dump() method to dump to a file directly – make sure to use with context manager when opening a file

Code Style notes:

  • follow PEP8 guidelines in terms of whitespace usage in expressions and statements
  • properly organize imports
  • if row_values[6]=="": can be simplified to if not row_values[6]: (similar for some other if conditions later on)
  • the generate_json() call should be put into the if __name__ == '__main__': to avoid it being executed on import
  • the excel_to_json() function is not quite easy to grasp – see if you can add a helpful docstring and/or comments to improve on clarity and readability

Other notes:

  • improve variable naming. Variables like sh, mw, rowe are very close to being meaningless. I would also replace wb with a more explicit workbook
  • have you considered using pandas.read_excel() to read the contents into the dataframe and then dumping it via .to_json() (after applying the desired transformations)?
    row_values = sh.row_values(rownum)

    if row_values[6]=="":
        name = row_values[0]
        gre=[name]
    elif "," in row_values[6]:
        e=row_values[6].split(",")
        ge = [row_values[0]]
        gre=e+ge
    else:
        gre=[row_values[0],row_values[6]]

    wares['name'] = gre
    wares['extensions'] = row_values[1]
    wares['extensionPattern'] = row_values[2]
    wares['ransomNoteFilenames'] = row_values[3]
    wares['comment'] = row_values[4]
    wares['encryptionAlgorithm'] = row_values[5]
    wares['decryptor'] = row_values[7]
    if row_values[8]=="":
        wares['resources'] = [row_values[9]]
    elif row_values[9]=="":
        wares['resources']=[row_values[8]]
    else:
        wares['resources'] = [row_values[8], row_values[9]]
    wares['screenshots'] = row_values[10]

This can be improved in many ways:

  • Since you’re going to use all 10 values of a row (some several times even), you can unpack the whole row at once and avoid several lookups later on:

    name, extensions, ext_pattern, filenames, comment, algorithm, name_extra, decryptor, resources_part_1, resources_part_2, screenshots = sh.row_values(rownum)
    

    However, if each row contains more than 10 columns, you’d have to account for that and discard extra ones using *_:

    (name, extensions, ext_pattern, filenames, comment,
     algorithm, name_extra, decryptor, resources_part_1,
     resources_part_2, screenshots, *_) = sh.row_values(rownum)
    
  • split will always return a list with at least one element: the original string if the separator was not found.

  • You can simplify the tests for the empty string by using not as the empty string evaluates to False in a boolean context.

All in all, you can write:

(name, extensions, ext_pattern, filenames, comment,
 algorithm, name_extra, decryptor, resources_part_1,
 resources_part_2, screenshots, *_) = sh.row_values(rownum)

wares = {
    'name': [name],
    'extensions': extensions,
    'extensionPattern': ext_pattern,
    'ransomNoteFilenames': filenames,
    'comment': comment,
    'encryptionAlgorithm': algorithm,
    'decryptor': decryptor,
    'screenshots': screenshots,
}
if name_extra:
    wares['name'].extend(name_extra.split(','))
if not resources_part_1:
    wares['resources'] = [resources_part_2]
elif not resources_part_2:
    wares['resources'] = [resources_part_1]
else:
    wares['resources'] = [resources_part_1, resources_part_2]

But the resource part feels ugly so I think I’d rather write it as a list-comprehension:

(name, extensions, ext_pattern, filenames, comment,
 algorithm, name_extra, decryptor, resources_part_1,
 resources_part_2, screenshots, *_) = sh.row_values(rownum)

wares = {
    'name': [name],
    'extensions': extensions,
    'extensionPattern': ext_pattern,
    'ransomNoteFilenames': filenames,
    'comment': comment,
    'encryptionAlgorithm': algorithm,
    'decryptor': decryptor,
    'screenshots': screenshots,
    'resources': [r for r in (resources_part_1, resources_part_2) if r]
}
if name_extra:
    wares['name'].extend(name_extra.split(','))

Leave a Reply

Your email address will not be published. Required fields are marked *