Problem
I’ve been working on a project (link) to download a spreadsheet of known ransomware and its properties and turn it into JSON, so I can better consume the information within early-detection projects.
I’m new to Python; what can I be doing better? The destination JSON I’m converting to can be found here.
`update_json.py` (entry point)

```python
from excel_to_json import excel_to_json
from download_file import download_file

SOURCESHEET = 'https://docs.google.com/spreadsheets/d/1TWS238xacAto-fLKh1n5uTsdijWdCEsGIM0Y0Hvmc5g/pub?output=xlsx'
OUTPUTSHEET = '../RansomwareOverview.xlsx'
JSONFILE = '../ransomware_overview.json'

def write_json_file(json_data, filename):
    output = open(filename, 'w')
    output.writelines(json_data)

def generate_json(source_file, download_destination, json_file):
    download_file(source_file, download_destination)
    write_json_file(excel_to_json(download_destination), json_file)

generate_json(SOURCESHEET, OUTPUTSHEET, JSONFILE)
```
`download_file.py`

```python
import urllib.request

def download_file(source, destination):
    try:
        urllib.request.urlretrieve(source, destination)
    except IOError:
        print('An error occured trying to write an updated spreadsheet. Do you already have it open?')
    except urllib.error.URLError:
        print('An error occured trying to download the file. Please check the source and try again.')
```
`excel_to_json.py`

```python
import simplejson as json
import xlrd
from collections import OrderedDict

def excel_to_json(filename):
    wb = xlrd.open_workbook(filename)
    sh = wb.sheet_by_index(0)
    mw = wb.sheet_by_index(2)
    # List to hold dictionaries
    c_list = []
    # Iterate through each row in worksheet and fetch values into dict
    for rownum in range(1, sh.nrows):
        wares = OrderedDict()
        row_values = sh.row_values(rownum)
        if row_values[6]=="":
            name = row_values[0]
            gre=[name]
        elif "," in row_values[6]:
            e=row_values[6].split(",")
            ge = [row_values[0]]
            gre=e+ge
        else:
            gre=[row_values[0],row_values[6]]
        wares['name'] = gre
        wares['extensions'] = row_values[1]
        wares['extensionPattern'] = row_values[2]
        wares['ransomNoteFilenames'] = row_values[3]
        wares['comment'] = row_values[4]
        wares['encryptionAlgorithm'] = row_values[5]
        wares['decryptor'] = row_values[7]
        if row_values[8]=="":
            wares['resources'] = [row_values[9]]
        elif row_values[9]=="":
            wares['resources']=[row_values[8]]
        else:
            wares['resources'] = [row_values[8], row_values[9]]
        wares['screenshots'] = row_values[10]
        for r in range(1, mw.nrows):
            rowe = mw.row_values(r)
            if row_values[0] == rowe[0]:
                wares['microsoftDetectionName']=rowe[1]
                wares['microsoftInfo'] = rowe[2]
                wares['sandbox'] = rowe[3]
                wares['iocs'] = rowe[4]
                wares['snort'] = rowe[5]
        c_list.append(wares)
    # Serialize the list of dicts to JSON
    return json.dumps(c_list, indent=4)
```
Solution
Performance tips:

- `ujson` can bring more speed
- since both `simplejson` and `xlrd` are pure-Python, you may get performance improvements “for free” by switching to `PyPy`
- you may (or may not) see speed and memory usage improvements if switching to `openpyxl`, especially in the “read-only” mode
- in the `excel_to_json` function, you are accessing the same values from `row_values` by index multiple times. Defining intermediate variables (e.g. defining `name = row_values[6]` and using the `name` variable later on) and avoiding accessing an element by index more than once might have a positive impact
- I’m not sure I completely understand the inner `for r in range(1, mw.nrows)` loop. Can you `break` once `row_values[0] == rowe[0]` evaluates to `True`?
- are you sure you need the `OrderedDict` and cannot get away with a regular `dict`? (there is a serious overhead for CPythons prior to 3.6)
- instead of `.dumps()` and a separate function to dump a JSON string to a file, use the `.dump()` method to dump to a file directly; make sure to use the `with` context manager when opening a file
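The last tip can be sketched as follows (a minimal sketch; `write_json_file` is the helper name from the question, and the standard-library `json` is used here in place of `simplejson`):

```python
import json

def write_json_file(data, filename):
    # The context manager closes the file even if dump() raises,
    # and json.dump() serializes the object straight into the file,
    # so no intermediate JSON string is needed.
    with open(filename, 'w') as output:
        json.dump(data, output, indent=4)

write_json_file([{'name': ['Locky']}], 'ransomware_overview.json')
```

With this in place, the data can be passed around as plain lists and dicts until the very last step.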
Code Style notes:

- follow PEP8 guidelines in terms of whitespace usage in expressions and statements
- properly organize imports
- `if row_values[6]=="":` can be simplified to `if not row_values[6]:` (similar for some other if conditions later on)
- the `generate_json()` call should be put into the `if __name__ == '__main__':` guard to avoid it being executed on import
- the `excel_to_json()` function is not quite easy to grasp; see if you can add a helpful docstring and/or comments to improve on clarity and readability
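The truthiness simplification and the import guard can be illustrated together (`is_blank` is a made-up name for the example):

```python
def is_blank(cell):
    # The empty string is falsy, so `not cell` replaces `cell == ""`.
    return not cell

if __name__ == '__main__':
    # This block only runs when the file is executed directly,
    # not when it is imported from another module.
    print(is_blank(''), is_blank('Locky'))
```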
Other notes:

- improve variable naming. Variables like `sh`, `mw`, `rowe` are very close to being meaningless. I would also replace `wb` with a more explicit `workbook`
- have you considered using `pandas.read_excel()` to read the contents into a dataframe and then dumping it via `.to_json()` (after applying the desired transformations)?
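A rough sketch of the pandas route; `sheet_name=0` and the record-oriented dump are assumptions, and the name/resources transformations from the question would still have to be applied to the DataFrame first:

```python
import pandas as pd

def frame_to_json(frame):
    # orient='records' produces one JSON object per spreadsheet row:
    # [{"column": value, ...}, ...]
    return frame.to_json(orient='records')

def excel_to_json(filename):
    # sheet_name=0 reads only the first worksheet; the original code
    # also pulls Microsoft-related columns from the third sheet.
    return frame_to_json(pd.read_excel(filename, sheet_name=0))
```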
```python
row_values = sh.row_values(rownum)
if row_values[6]=="":
    name = row_values[0]
    gre=[name]
elif "," in row_values[6]:
    e=row_values[6].split(",")
    ge = [row_values[0]]
    gre=e+ge
else:
    gre=[row_values[0],row_values[6]]
wares['name'] = gre
wares['extensions'] = row_values[1]
wares['extensionPattern'] = row_values[2]
wares['ransomNoteFilenames'] = row_values[3]
wares['comment'] = row_values[4]
wares['encryptionAlgorithm'] = row_values[5]
wares['decryptor'] = row_values[7]
if row_values[8]=="":
    wares['resources'] = [row_values[9]]
elif row_values[9]=="":
    wares['resources']=[row_values[8]]
else:
    wares['resources'] = [row_values[8], row_values[9]]
wares['screenshots'] = row_values[10]
```
This can be improved in many ways:

- Since you’re going to use all 11 values of a row (some of them several times, even), you can unpack the whole row at once and avoid several lookups later on:

  ```python
  (name, extensions, ext_pattern, filenames, comment, algorithm,
   name_extra, decryptor, resources_part_1, resources_part_2,
   screenshots) = sh.row_values(rownum)
  ```

  However, if a row contains more than 11 columns, you’d have to account for that and discard the extra ones using `*_`:

  ```python
  (name, extensions, ext_pattern, filenames, comment, algorithm,
   name_extra, decryptor, resources_part_1, resources_part_2,
   screenshots, *_) = sh.row_values(rownum)
  ```

- `split` will always return a list with at least one element: the original string if the separator was not found.
- You can simplify the tests for the empty string by using `not`, as the empty string evaluates to `False` in a boolean context.
All in all, you can write:

```python
(name, extensions, ext_pattern, filenames, comment,
 algorithm, name_extra, decryptor, resources_part_1,
 resources_part_2, screenshots, *_) = sh.row_values(rownum)

wares = {
    'name': [name],
    'extensions': extensions,
    'extensionPattern': ext_pattern,
    'ransomNoteFilenames': filenames,
    'comment': comment,
    'encryptionAlgorithm': algorithm,
    'decryptor': decryptor,
    'screenshots': screenshots,
}
if name_extra:
    wares['name'].extend(name_extra.split(','))
if not resources_part_1:
    wares['resources'] = [resources_part_2]
elif not resources_part_2:
    wares['resources'] = [resources_part_1]
else:
    wares['resources'] = [resources_part_1, resources_part_2]
```
But the resource part feels ugly, so I think I’d rather write it as a list comprehension:

```python
(name, extensions, ext_pattern, filenames, comment,
 algorithm, name_extra, decryptor, resources_part_1,
 resources_part_2, screenshots, *_) = sh.row_values(rownum)

wares = {
    'name': [name],
    'extensions': extensions,
    'extensionPattern': ext_pattern,
    'ransomNoteFilenames': filenames,
    'comment': comment,
    'encryptionAlgorithm': algorithm,
    'decryptor': decryptor,
    'screenshots': screenshots,
    'resources': [r for r in (resources_part_1, resources_part_2) if r],
}
if name_extra:
    wares['name'].extend(name_extra.split(','))
```
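Wrapped in a helper so it can be tested without a workbook (`parse_row` is a hypothetical name; the column order is the one from the question, and rows are assumed to have at least 11 cells):

```python
def parse_row(row):
    """Turn one worksheet row into a record dict.

    Extra trailing columns, if any, are discarded via *_.
    """
    (name, extensions, ext_pattern, filenames, comment,
     algorithm, name_extra, decryptor, resources_part_1,
     resources_part_2, screenshots, *_) = row
    wares = {
        'name': [name],
        'extensions': extensions,
        'extensionPattern': ext_pattern,
        'ransomNoteFilenames': filenames,
        'comment': comment,
        'encryptionAlgorithm': algorithm,
        'decryptor': decryptor,
        'screenshots': screenshots,
        # Keep only the non-empty resource cells.
        'resources': [r for r in (resources_part_1, resources_part_2) if r],
    }
    if name_extra:
        wares['name'].extend(name_extra.split(','))
    return wares
```

The main loop then reduces to `c_list.append(parse_row(sh.row_values(rownum)))`, keeping the row-to-record logic in one easily testable place.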