Text parser implemented as a generator

Problem

I often need to parse tab-separated text (usually from a huge file) into records. I wrote a generator to do that for me; is there anything that could be improved in it, in terms of performance, extensibility or generality?

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
  header = next(in_stream).rstrip(endl).split(sep)
  for lineno, line in enumerate(in_stream):
    if line == endl:
      continue # ignore blank lines
    if line[0] == comment:
      continue # ignore comments
    fields = line.rstrip(endl).split(sep)
    try:
      # could have done this outside the loop instead:
      # if types is None: types = {c : (lambda x : x) for c in header}
      # but it nearly doubles the run-time if types actually is None
      if types is None:
        record = {col : fields[no] for no, col in enumerate(header)}
      else:
        record = {col : types[col](fields[no]) for no, col in enumerate(header)}
    except IndexError:
      print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
      raise
    yield record

Solution

One way to reduce the amount of work in the loop is to hoist the following branch out of it: choose the record-building function once, before the loop starts.

  if types is None:
    record = {col : fields[no] for no, col in enumerate(header)}
  else:
    record = {col : types[col](fields[no]) for no, col in enumerate(header)}

Something like this (not tested, but you should get the idea):

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
  header = next(in_stream).rstrip(endl).split(sep)
  # Build the (index, column) pairs once. They must be stored in a list:
  # a bare enumerate() object would be exhausted after the first line.
  enumheader = list(enumerate(header))
  if types is None:
    def recorder(fields):
      return {col : fields[no] for no, col in enumheader}
  else:
    def recorder(fields):
      return {col : types[col](fields[no]) for no, col in enumheader}

  for lineno, line in enumerate(in_stream):
    if line == endl:
      continue # ignore blank lines
    if line[0] == comment:
      continue # ignore comments
    fields = line.rstrip(endl).split(sep)
    try:
      record = recorder(fields)
    except IndexError:
      print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
      raise
    yield record

EDIT: from my first version (read comments)

Tiny thing:

    if types is None:

I suggest

    if not types:
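
Note that the two checks are not interchangeable when an empty mapping is passed. A small sketch of the difference, using two hypothetical helpers (skip_conversion_is_none and skip_conversion_not are not part of the parser, they only isolate the condition):

```python
def skip_conversion_is_none(types):
    # Hypothetical helper: True only when no mapping was passed at all.
    return types is None


def skip_conversion_not(types):
    # Hypothetical helper: True for None and also for any empty mapping.
    return not types


# Each entry is (result of "is None" check, result of "not" check).
results = {
    'None': (skip_conversion_is_none(None), skip_conversion_not(None)),
    'empty dict': (skip_conversion_is_none({}), skip_conversion_not({})),
    'non-empty dict': (skip_conversion_is_none({'count': int}),
                       skip_conversion_not({'count': int})),
}

for label, (is_none, falsy) in results.items():
    print('{:<15} is None: {!s:<6} not types: {}'.format(label, is_none, falsy))
```

In the parser this means that with types = {} the "is None" version goes on to look up types[col] and raises KeyError, while the "not types" version quietly returns the fields unconverted. Which behaviour is preferable depends on whether an empty mapping should count as a caller error.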

You could also use the csv module to iterate over your file. The code would probably be faster because csv's parser is implemented in C, and cleaner without line.rstrip(endl).split(sep).
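
A minimal sketch of what the csv-based version might look like. The name table_parser_csv and the sample data are made up for illustration, and quoting=csv.QUOTE_NONE is an assumption chosen so that quote characters are left alone, as they are with a plain split(sep). It has not been benchmarked against the original.

```python
import csv
import io


def table_parser_csv(in_stream, types=None, sep='\t', comment=None):
    """Yield one dict per data row, letting csv.reader split the lines."""
    reader = csv.reader(in_stream, delimiter=sep, quoting=csv.QUOTE_NONE)
    header = next(reader)
    for row in reader:
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        if comment is not None and row[0].startswith(comment):
            continue  # ignore comment lines
        if len(row) < len(header):
            raise ValueError('Insufficient columns in line #{}: {!r}'.format(
                reader.line_num, row))
        if types is None:
            yield dict(zip(header, row))
        else:
            yield {col: types[col](value) for col, value in zip(header, row)}


# Usage with an in-memory stream standing in for a huge file.
sample = io.StringIO('name\tcount\n# a comment\n\nfoo\t3\nbar\t7\n')
records = list(table_parser_csv(sample,
                                types={'name': str, 'count': int},
                                comment='#'))
print(records)
```

When reading from a real file, the csv documentation recommends opening it with newline='' so that the reader handles line endings itself.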
