Problem
I often need to parse tab-separated text (usually from a huge file) into records. I wrote a generator to do that for me; is there anything that could be improved in it, in terms of performance, extensibility or generality?
def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            # could have done this outside the loop instead:
            # if types is None: types = {c : (lambda x : x) for c in headers}
            # but it nearly doubles the run-time if types actually is None
            if types is None:
                record = {col : fields[no] for no, col in enumerate(header)}
            else:
                record = {col : types[col](fields[no]) for no, col in enumerate(header)}
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record
Solution
One thing you could try, to reduce the amount of code in the loop, is to turn these two branches into a function that is chosen once, before the loop starts:
    if types is None:
        record = {col : fields[no] for no, col in enumerate(header)}
    else:
        record = {col : types[col](fields[no]) for no, col in enumerate(header)}
Something like this (not tested, but you should get the idea):
def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    # No need to do this every time. It must be a list, though:
    # a bare enumerate() iterator would be exhausted after the first record.
    enumheader = list(enumerate(header))
    # Pick the record builder once, instead of testing types on every line.
    if types is None:
        def recorder(fields):
            return {col : fields[no] for no, col in enumheader}
    else:
        def recorder(fields):
            return {col : types[col](fields[no]) for no, col in enumheader}
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            record = recorder(fields)
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record
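One caveat when hoisting enumerate(header) out of the loop: enumerate returns a one-shot iterator, so it has to be materialized (for example with list) before it can be reused for every line. A minimal illustration, where the two-column header is made up for the example:

```python
header = ["name", "age"]  # hypothetical header, for illustration only

# enumerate() returns a one-shot iterator: a second pass over it sees nothing.
one_shot = enumerate(header)
first_pass = [(no, col) for no, col in one_shot]
second_pass = [(no, col) for no, col in one_shot]

# A list can be iterated as many times as needed.
reusable = list(enumerate(header))
first_again = [(no, col) for no, col in reusable]
second_again = [(no, col) for no, col in reusable]
```

Here second_pass ends up empty, while first_again and second_again hold the same pairs. In the parser this would mean every record after the first comes out as an empty dict if the iterator were reused directly.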
EDIT: from my first version (read comments)
Tiny thing: instead of
    if types is None:
I suggest
    if not types:
which also treats an empty types mapping as "no conversion".
You could also use the csv module to iterate over your file. The code would likely be faster thanks to its C implementation, and cleaner without line.rstrip(endl).split(sep).
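A rough sketch of what the csv-based version might look like. The name csv_table_parser is made up for this example, quoting=csv.QUOTE_NONE assumes the fields never rely on quote characters, and the explicit length check raising ValueError is my own substitute for the original IndexError (zip would otherwise silently drop missing columns):

```python
import csv
import io


def csv_table_parser(in_stream, types=None, sep='\t', comment=None):
    """Yield one dict per data row, letting csv.reader split the fields."""
    reader = csv.reader(in_stream, delimiter=sep, quoting=csv.QUOTE_NONE)
    header = next(reader)
    for fields in reader:
        if not fields:
            continue  # csv.reader yields an empty list for a blank line
        if comment is not None and fields[0].startswith(comment):
            continue  # ignore comment lines
        if len(fields) < len(header):
            # Assumption: fail loudly on short rows, as the original does.
            raise ValueError('Insufficient columns in line #{}: {}'.format(
                reader.line_num, fields))
        if types is None:
            yield dict(zip(header, fields))
        else:
            yield {col: types[col](value) for col, value in zip(header, fields)}


# Usage with an in-memory stream standing in for a real file.
sample = io.StringIO("name\tage\n# a comment\n\nalice\t30\nbob\t25\n")
records = list(csv_table_parser(sample, types={"name": str, "age": int}, comment="#"))
```

When reading from a real file, the csv documentation recommends opening it with newline='' so the reader handles line endings itself. Whether this is actually faster on your data is worth measuring rather than assuming.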