Merging multiple JSON files using Python

Problem

I have multiple (1000+) JSON files, each of which contains a JSON array. I want to merge all these files into a single file.

I came up with the following, which reads each of those files and collects all of their contents into a single list. I then write this list to a new file.

Is this approach efficient? Is there a better way to do this?

import json

head = []  # accumulates the contents of every input array
with open("result.json", "w") as outfile:
    for f in file_list:  # file_list holds the paths of the input files
        with open(f, 'rb') as infile:
            file_data = json.load(infile)
            head += file_data  # append this file's array to the running list
    json.dump(head, outfile)

Solution

  1. First off, if you want reusability, turn this into a function. The function should take the output path and the list of input files as arguments.
  2. Secondly, instead of accumulating all of the JSON data in a variable before writing, write the contents of each file to the merged file as you go (see the sketch after this list). This keeps memory usage bounded.
  3. Finally, a few nitpicks on your variable naming. head would be clearer as something like merged_files, and f is a poor choice for a loop variable; something like json_file would be better.
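
A minimal sketch combining those three points (the name merge_json_files is my own, and I assume every input file holds a JSON array):

import json

def merge_json_files(output_path, json_files):
    # Merge JSON files that each contain an array into one array,
    # writing items out as they are parsed rather than buffering them all.
    with open(output_path, "w") as outfile:
        outfile.write("[")
        first = True
        for json_file in json_files:
            with open(json_file) as infile:
                for item in json.load(infile):
                    if not first:
                        outfile.write(",")
                    json.dump(item, outfile)
                    first = False
        outfile.write("]")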

This is essentially alexwlchan’s comment spelled out:

Parsing and serializing JSON doesn’t come for free, so you may want to avoid it. I think you can just output "[", the first file, ",", the second file etc., "]" and call it a day. If all inputs are valid JSON, unless I’m terribly mistaken, this should also be valid JSON.

In code, version 1:

def cat_json(outfile, infiles):
    # Strip each file's outer brackets, then join everything inside one pair.
    open(outfile, "w").write(
        "[%s]" % ",".join([mangle(open(f).read()) for f in infiles])
    )

def mangle(s):
    # Drop the surrounding whitespace and the outer [ ] of a JSON array.
    return s.strip()[1:-1]

Version 2:

def cat_json(output_filename, input_filenames):
    with open(output_filename, "w") as outfile:
        # Write the opening bracket up front so that an empty input list
        # still produces valid JSON ("[]").
        outfile.write("[")
        first = True
        for infile_name in input_filenames:
            with open(infile_name) as infile:
                if first:
                    first = False
                else:
                    # Separate the contents of consecutive files with a comma.
                    outfile.write(",")
                outfile.write(mangle(infile.read()))
        outfile.write("]")
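
Usage is the same for either version; a hypothetical invocation, assuming the input files live in a data/ directory:

import glob

# Merge every .json file under data/ into result.json.
cat_json("result.json", sorted(glob.glob("data/*.json")))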

The second version has a few advantages: its memory requirements should be on the order of the largest input file, whereas the first needs roughly twice the sum of all the file sizes at once. The number of simultaneously open file handles is also smaller, so it should work for any number of files.

By using with, it also releases file handles deterministically (and immediately!) upon leaving each with block, even on Python implementations with non-immediate garbage collection, such as PyPy and Jython.
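
The concatenation trick does assume every input really holds a JSON array; as a quick sanity check (my addition, not part of the original answer), you can re-parse the merged file:

import json

# Re-parse the merged output to confirm it is still valid JSON.
with open("result.json") as f:
    merged = json.load(f)
print("merged", len(merged), "items")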
