Problem
I am dealing with a large number of XML files which I obtained from https://clinicaltrials.gov/ct2/resources/download#DownloadAllData. The download yields around 300,000 XML files of similar structure, which I eventually want to load into a single dataframe / CSV. The code below gives the result I want: each row is a single XML file, while the columns are the categories/variable names derived from the XML XPaths, filled with the text of each XML tag.
So far this is my code:
#Import packages.
import pandas as pd
from lxml import etree
import os
from os import listdir
from os.path import isfile, join
from tqdm import tqdm
from functools import partial
from pprint import pprint
#Set options for displaying results
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
def run(file, content):
    data = etree.parse(file)
    #get all paths from the XML
    get_path = lambda x: data.getpath(x)
    paths = list(map(get_path, data.getroot().getiterator()))
    content = [
        data.getroot().xpath(path)
        for path in paths
    ]
    get_text = lambda x: x.text
    content = [list(map(get_text, i)) for i in content]
    dictionary = dict(zip(paths, content))
    df = pd.DataFrame([dictionary])
    global df_final
    df_final = df_final.append(df)
def write_csv(df_name, csv):
    df_name.to_csv(csv, sep=";")
#RUN
mypath = '/Users/marcelwieting/Documents/AllPublicXML'
folder_all = os.listdir(mypath)
for folder in tqdm(folder_all):
    df_final = pd.DataFrame()
    mypath2 = mypath + "/" + folder
    if os.path.isdir(mypath2):
        file = [f for f in listdir(mypath2) if isfile(join(mypath2, f))]
        output = "./Output/" + folder + ".csv"
        for x in tqdm(file):
            dir = mypath2 + "/" + x
            df_name = x.split(".", 1)[0]
            run(dir, df_name)
        write_csv(df_final, output)
I am an absolute Python newbie and this code is the result of some painful mixing and matching from various forum entries and tutorials. It runs through, but given the size of the data source it takes really long, presumably because my code has many iterations that are certainly avoidable. (Is there maybe some vector-based solution rather than iterating?) It would be very nice to get some feedback on how to improve speed, and maybe some general remarks on how to structure this code better. I know it is not good, but it's all I could pull off for now. 🙂
Cheers!
One sample XML file here:
<clinical_study>
<!--
This xml conforms to an XML Schema at:
https://clinicaltrials.gov/ct2/html/images/info/public.xsd
-->
<required_header>
<download_date>
ClinicalTrials.gov processed this data on March 20, 2020
</download_date>
<link_text>Link to the current ClinicalTrials.gov record.</link_text>
<url>https://clinicaltrials.gov/show/NCT03261284</url>
</required_header>
<id_info>
<org_study_id>2017-P-032</org_study_id>
<nct_id>NCT03261284</nct_id>
</id_info>
<brief_title>
D-dimer to Guide Anticoagulation Therapy in Patients With Atrial Fibrillation
</brief_title>
<acronym>DATA-AF</acronym>
<official_title>
D-dimer to Determine Intensity of Anticoagulation to Reduce Clinical Outcomes in Patients With Atrial Fibrillation
</official_title>
<sponsors>
<lead_sponsor>
<agency>Wuhan Asia Heart Hospital</agency>
<agency_class>Other</agency_class>
</lead_sponsor>
</sponsors>
<source>Wuhan Asia Heart Hospital</source>
<oversight_info>
<has_dmc>Yes</has_dmc>
<is_fda_regulated_drug>No</is_fda_regulated_drug>
<is_fda_regulated_device>No</is_fda_regulated_device>
</oversight_info>
<brief_summary>
<textblock>
This was a prospective, three arms, randomized controlled study.
</textblock>
</brief_summary>
<detailed_description>
<textblock>
D-dimer testing is performed in AF Patients receiving warfarin therapy (target INR:1.5-2.5) in Wuhan Asia Heart Hospital. Patients with elevated d-dimer levels (>0.5ug/ml FEU) were SCREENED AND RANDOMIZED to three groups at a ratio of 1:1:1. First, NOAC group,the anticoagulant was switched to Dabigatran (110mg,bid) when elevated d-dimer level was detected during warfarin therapy.Second,Higher-INR group, INR was adjusted to higher level (INR:2.0-3.0) when elevated d-dimer level was detected during warfarin therapy. Third, control group, patients with elevated d-dimer levels have no change in warfarin therapy. Warfarin is monitored once a month by INR ,and dabigatran dose not need monitor. All patients were followed up for 24 months until the occurrence of endpoints, including bleeding events, thrombotic events and all-cause deaths.
</textblock>
</detailed_description>
<overall_status>Enrolling by invitation</overall_status>
<start_date type="Anticipated">March 1, 2019</start_date>
<completion_date type="Anticipated">May 30, 2020</completion_date>
<primary_completion_date type="Anticipated">February 28, 2020</primary_completion_date>
<phase>N/A</phase>
<study_type>Interventional</study_type>
<has_expanded_access>No</has_expanded_access>
<study_design_info>
<allocation>Randomized</allocation>
<intervention_model>Parallel Assignment</intervention_model>
<primary_purpose>Treatment</primary_purpose>
<masking>None (Open Label)</masking>
</study_design_info>
<primary_outcome>
<measure>Thrombotic events</measure>
<time_frame>24 months</time_frame>
<description>
Stroke, DVT, PE, Peripheral arterial embolism, ACS etc.
</description>
</primary_outcome>
<primary_outcome>
<measure>hemorrhagic events</measure>
<time_frame>24 months</time_frame>
<description>cerebral hemorrhage,Gastrointestinal bleeding etc.</description>
</primary_outcome>
<secondary_outcome>
<measure>all-cause deaths</measure>
<time_frame>24 months</time_frame>
</secondary_outcome>
<number_of_arms>3</number_of_arms>
<enrollment type="Anticipated">600</enrollment>
<condition>Atrial Fibrillation</condition>
<condition>Thrombosis</condition>
<condition>Hemorrhage</condition>
<condition>Anticoagulant Adverse Reaction</condition>
<arm_group>
<arm_group_label>DOAC group</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
<description>
Patients with elevated d-dimer levels was switched to DOAC (dabigatran 150mg, bid).
</description>
</arm_group>
<arm_group>
<arm_group_label>Higher-INR group</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
<description>
Patients' target INR was adjusted from 1.5-2.5 to 2.0-3.0 by adding warfarin dose.
</description>
</arm_group>
<arm_group>
<arm_group_label>Control group</arm_group_label>
<arm_group_type>No Intervention</arm_group_type>
<description>
Patients continue previous strategy without change.
</description>
</arm_group>
<intervention>
<intervention_type>Drug</intervention_type>
<intervention_name>Dabigatran Etexilate 150 MG [Pradaxa]</intervention_name>
<description>Dabigatran Etexilate 150mg,bid</description>
<arm_group_label>DOAC group</arm_group_label>
<other_name>Pradaxa</other_name>
</intervention>
<intervention>
<intervention_type>Drug</intervention_type>
<intervention_name>Warfarin Pill</intervention_name>
<description>Add warfarin dose according to INR values.</description>
<arm_group_label>Higher-INR group</arm_group_label>
</intervention>
<eligibility>
<criteria>
<textblock>
Inclusion Criteria: - Patients with non-valvular atrial fibrillation - Receiving warfarin therapy Exclusion Criteria: - Patients who had suffered from recent (within 3 months) myocardial infarction, ischemic stroke, deep vein thrombosis, cerebral hemorrhages, or other serious diseases. - Those who had difficulty in compliance or were unavailable for follow-up.
</textblock>
</criteria>
<gender>All</gender>
<minimum_age>18 Years</minimum_age>
<maximum_age>75 Years</maximum_age>
<healthy_volunteers>No</healthy_volunteers>
</eligibility>
<overall_official>
<last_name>Zhenlu ZHANG, MD,PhD</last_name>
<role>Study Director</role>
<affiliation>Wuhan Asia Heart Hospital</affiliation>
</overall_official>
<location>
<facility>
<name>Zhang litao</name>
<address>
<city>Wuhan</city>
<state>Hubei</state>
<zip>430022</zip>
<country>China</country>
</address>
</facility>
</location>
<location_countries>
<country>China</country>
</location_countries>
<verification_date>March 2019</verification_date>
<study_first_submitted>August 22, 2017</study_first_submitted>
<study_first_submitted_qc>August 23, 2017</study_first_submitted_qc>
<study_first_posted type="Actual">August 24, 2017</study_first_posted>
<last_update_submitted>March 6, 2019</last_update_submitted>
<last_update_submitted_qc>March 6, 2019</last_update_submitted_qc>
<last_update_posted type="Actual">March 7, 2019</last_update_posted>
<responsible_party>
<responsible_party_type>Sponsor</responsible_party_type>
</responsible_party>
<keyword>D-dimer</keyword>
<keyword>Nonvalvular atrial fibrillation</keyword>
<keyword>Direct thrombin inhibitor</keyword>
<keyword>INR</keyword>
<condition_browse>
<!--
CAUTION: The following MeSH terms are assigned with an imperfect algorithm
-->
<mesh_term>Atrial Fibrillation</mesh_term>
<mesh_term>Thrombosis</mesh_term>
<mesh_term>Hemorrhage</mesh_term>
</condition_browse>
<intervention_browse>
<!--
CAUTION: The following MeSH terms are assigned with an imperfect algorithm
-->
<mesh_term>Warfarin</mesh_term>
<mesh_term>Dabigatran</mesh_term>
<mesh_term>Fibrin fragment D</mesh_term>
</intervention_browse>
<!--
Results have not yet been posted for this study
-->
</clinical_study>
Solution
TL;DR: it can be done much more quickly and much more concisely: in 28 minutes end-to-end, and in 12 minutes when all XML files are already on disk. The key trick: using the long rather than the wide format for each data frame.
There are two aspects to review in your code: style and performance. Let me post my approach and comment on why I coded it this way.
For imports, I separate Python standard-library modules from third-party libraries and put each group in alphabetical order. This makes the code more readable.
import os
import zipfile
import click
from lxml import etree
import pandas as pd
import requests
You didn’t post it, so I had to write my own code to download and unzip all the ClinicalTrials.gov XML files:
def download_data(url, file):
    response = requests.get(url, stream=True)
    assert response.status_code == 200
    with open(file, 'wb') as dst:
        for chunk in response.iter_content(chunk_size=4096):
            dst.write(chunk)

def unzip_data(file, path):
    with zipfile.ZipFile(file) as src:
        src.extractall(path)
url = 'https://clinicaltrials.gov/AllPublicXML.zip'
file = url.split('/')[-1]
path = file.split('.')[0]
download_data(url, file)
unzip_data(file, path)
Note that I didn’t write out the full path of my working directory; everything is handled implicitly relative to the directory from which you run the code (yes, explicit is better than implicit, but in this case I ignore the Zen of Python).
Next, there is the issue of parsing a single XML file. Your approach used two iterations over the parse tree and some unnecessary lambdas stored in variables (why not do it in place?). It is possible to iterate just once over the parse tree and extract both the XPath and the text in the same pass. Furthermore, it turns out that storing the result in long format, as a 3-column data frame with a variable number of rows, gains an order of magnitude when concatenating these data frames later on.
def parse_xml(file):
    # The ClinicalTrials.gov XML files are named NCTyyyyxxxx; this extracts the 8 yyyyxxxx digits
    id = int(os.path.splitext(os.path.basename(file))[0][-8:])
    tree = etree.parse(file)
    return pd.DataFrame(
        data=[
            (id, tree.getpath(elem), elem.text)
            for elem in tree.iter()
        ],
        columns=['id', 'key', 'value']
    )
Note that there is no longer a need to have an intermediate dict. The code above just parses the tree into a list of tuples and initializes a data frame with it.
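For concreteness, a small usage sketch on the sample study from the question; the subfolder in the path below is a placeholder, not the real folder layout of the download:

sample = parse_xml('AllPublicXML/NCTxxxx/NCT03261284.xml')  # placeholder path
# One row per XML element, e.g. (3261284, '/clinical_study/id_info/nct_id', 'NCT03261284')
print(sample.head())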
Next, the iteration over the directory tree is most conveniently done using the standard library, in particular os.walk
. Note that in your code, you not only had to manually check for directory and files, but you also manually combined directory names, file names and extensions rather than with the standard os.path.join
(which would have made it platform independent).
xml_files = [
    os.path.join(dirpath, file)
    for dirpath, _, filenames in os.walk(path)
    for file in filenames
    if file.endswith('.xml')
]
The code above stores the 335K XML file paths in a list (takes less than 1 second).
Finally, combining the above code is simply a loop over the files, parsing each one and concatenating the results. I use click.progressbar to see how fast it runs; in this case the total time is a mere 12 minutes.
with click.progressbar(xml_files) as bar:
    df = pd.concat((
        parse_xml(f)
        for f in bar
    ))
df.to_csv('output.csv')
Inspecting this data frame shows that it has almost 100M rows and 3 columns. In contrast, the wide format would have had 335K rows but 730K columns. This is why your original code ran so slowly: concatenating/appending all these differently laid-out data frames requires an inordinate amount of data copying to align them in the final data frame, whereas the long format just appends the same 3 known columns.
The total output file on disk is 11 GB.
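If you later need the wide layout for a subset of studies, it can be recovered from the long format. A rough sketch (my addition, assuming the subset fits in memory; the (id, key) pairs should be unique because getpath adds positional indices to repeated tags):

subset = df[df['id'].isin([3261284])]                              # pick the studies you need
wide = subset.pivot(index='id', columns='key', values='value')    # one row per study, one column per path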
EDIT: in the end I didn’t manage to run your posted code because of memory overflow (335K rows x 730K columns will do that unless you have a TB of RAM). Note that I didn’t use a global variable to which I append() each parsed XML file; the Pandas docs state that pd.concat() should be more performant, and it was for my approach: the progress bar indicated that appending to an initially empty data frame would have increased the total time to over 2 hours.
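To make the difference concrete, here is a rough sketch of the two patterns (illustrative only, not taken from the timed runs; note that DataFrame.append has since been deprecated in newer pandas versions):

# Anti-pattern: growing a frame step by step re-allocates and copies on every iteration
df_slow = pd.DataFrame()
for f in xml_files:
    df_slow = df_slow.append(parse_xml(f))

# Better: build all pieces lazily and let pd.concat allocate the result once
df_fast = pd.concat(parse_xml(f) for f in xml_files)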
Seems like it might be a good use case for parallelizing with the multiprocessing library. Something like this (untested):
from multiprocessing import Pool

# your other code goes here

mypath = '/Users/marcelwieting/Documents/AllPublicXML'
folder_all = os.listdir(mypath)

def process_folder(folder):
    global df_final          # run() appends to this module-level frame
    df_final = pd.DataFrame()
    mypath2 = mypath + "/" + folder
    if os.path.isdir(mypath2):
        file = [f for f in listdir(mypath2) if isfile(join(mypath2, f))]
        output = "./Output/" + folder + ".csv"
        for x in tqdm(file):
            dir = mypath2 + "/" + x
            df_name = x.split(".", 1)[0]
            run(dir, df_name)
        write_csv(df_final, output)

if __name__ == "__main__":
    with Pool() as p:
        # imap_unordered is lazy: consume the iterator so the folders are actually processed
        list(p.imap_unordered(process_folder, folder_all))
Avoid doing explicit appends in your for loop and use dictionary (or list) comprehension instead; this makes it run more than 3x faster on my machine.
That is, do something like
def run(file, content):
    data = etree.parse(file)
    root = data.getroot()

    # build the paths and their text contents with comprehensions instead of map/lambda
    paths = [data.getpath(elem) for elem in root.getiterator()]
    content = [
        [elem.text for elem in root.xpath(path)]
        for path in paths
    ]

    bundle = dict(zip(paths, content))
    df = pd.DataFrame([bundle])

    global df_final
    df_final = df_final.append(df)
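If you also want to get rid of the global df_final and the per-file append entirely, one possible sketch (my own variation, reusing the names mypath2, file, output and write_csv from the question's folder loop) is to collect one dict per XML in a list comprehension and build the data frame once per folder:

def parse_to_dict(path):
    data = etree.parse(path)
    root = data.getroot()
    paths = [data.getpath(elem) for elem in root.getiterator()]
    texts = [[elem.text for elem in root.xpath(p)] for p in paths]
    return dict(zip(paths, texts))

# inside the folder loop, instead of calling run() for every file:
records = [parse_to_dict(join(mypath2, name)) for name in file]
df_final = pd.DataFrame(records)
write_csv(df_final, output)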