Class to parse a file sanitizing each line with a set of methods

I am parsing a set of lines from a file. After reading the file content with Parser.read_file(), I apply a set of methods to fix different issues each line may contain (removing newlines, comments, etc.) and store the result in the parser's self.lines instance variable.

Filter methods are defined in Parser.filters as a list of names and are executed after the file is read, via self._sanitize().

import re
import os
import io
from operator import methodcaller

class Parser:

    # List of filters
    filters = ['_filter_comments',
               '_filter_new_lines',
               '_filter_empty_entries']

    def __init__(self, **kwargs):
        self.lines = kwargs.get('lines', [])
        self.file_path = kwargs.get('file_path', "data/data_ts.utf8")
        self.lines_count = kwargs.get('entries_count', None)

    def read_file(self):
        """ Reads file data applying filters from self.filters """
        __location__ = os.path.realpath(
            os.path.join(os.getcwd(), os.path.dirname(__file__)))
        with, self.file_path),
                     "r", encoding='utf-8') as f:
            self.lines = f.readlines()
        self._sanitize()

    def _sanitize(self):
        """ Apply each filter method named in self.filters """
        for fn in self.filters:
            f = methodcaller(fn)
            f(self)

    def _filter_comments(self):
        """ remove lines starting with # or #! """
        self.lines = [line for line in self.lines
                      if not line.startswith(("#", "#!"))]

    def _filter_new_lines(self):
        self.lines = [line.strip('\n') for line in self.lines]

    def _filter_empty_entries(self):
        self.lines = [line for line in self.lines if line.strip()]


  • I think that the pythonic way of writing those filters in one line, as [line.strip('\n') for line in self.lines if line.strip() and not line.startswith(("#", "#!"))], would make them harder to read and to unit test. Or would it be better that way?
  • Should the _sanitize method be called after reading the file, or would it make more sense to put it in the setter property of self.lines, like:

    @lines.setter
    def lines(self, lines):
        self._lines = lines
  • Is there any better option to arrange the above code, or a design pattern, like putting the sanitize method outside the Parser class so that it receives a list of lines and returns the sanitized list instead of always working with self.lines?
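To make the last bullet concrete, a minimal sketch of such a standalone function (the name and filter order are just illustrative):

```python
def sanitize(lines):
    """Return a new, sanitized list; the caller's list is left untouched."""
    lines = [line for line in lines if not line.startswith("#")]  # comments
    lines = [line.rstrip("\n") for line in lines]                 # newlines
    return [line for line in lines if line.strip()]               # empties

print(sanitize(["# comment\n", "data\n", "\n"]))  # ['data']
```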


    def __init__(self, **kwargs):

I don’t understand why you didn’t write def __init__(self, lines=None, file_path='data/data_ts.utf8', entries_count=None) — using None rather than a mutable [] default, which would be shared across instances.

    __location__ = os.path.realpath(
        os.path.join(os.getcwd(), os.path.dirname(__file__)))

Better to just name it location. Also, the current working directory can change between calls to your class, so this code seems “too tricky”. Better to have __init__() nail down a fully qualified pathname once, and store it.
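A sketch of that idea, assuming a simplified constructor (the parameter names follow the question):

```python
import os

class Parser:
    def __init__(self, file_path="data/data_ts.utf8", lines=None):
        # Resolve to an absolute path once, up front; later os.chdir()
        # calls can no longer change which file read_file() would open.
        self.file_path = os.path.abspath(file_path)
        self.lines = lines if lines is not None else []
```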

filters = ['_filter_comments',
           '_filter_new_lines',
           '_filter_empty_entries']

It’s not obvious why quoting these, and using methodcaller(), is a Good Thing. Why not simply have a list of three function references, rather than three quoted strings?

The contract on each filter is “assign some or all of lines back into lines”. Rather than evaluate each filter for side effects, you might have their API return some or all of lines, and make the _sanitize driving loop responsible for assignment. Or better, you might make them functions accepting a single line, that return a single modified line or None to suppress it. When you eventually write unit tests you’ll find such an API is simpler to test.

                  if not line.startswith(("#", "#!"))]

If it starts with ‘#’, then startswith has already returned True and never goes on to consider ‘#!’, so the second prefix is redundant.

    self.lines = [line.strip('\n') for line in self.lines]

Idiomatic code would say rstrip.
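The two only differ when a line has leading newlines, but rstrip states the intent:

```python
print("\ndata\n".strip('\n'))   # 'data'   -- strips '\n' from both ends
print("\ndata\n".rstrip('\n'))  # '\ndata' -- trailing newline only
```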

You contemplated putting _sanitize() in the setter property of self.lines. That is feasible, but a bit odd, and would put a greater documentation burden on you, starting with renaming to class SanitizingParser. Perhaps there is some motivating use case where that makes life easier on the caller, but I’m not yet seeing it.

Consider writing generator(s) that yield a line at a time, rather than manipulating the entire list of lines. Then you can process giant files using a small memory footprint.

EDIT: Suppose the three filter functions have an API of accepting a str (not None) and returning that same input line or a trimmed line or None. Then a generator might look like this:

filters = [filter_comments, filter_new_lines, filter_empty_entries]

def read_sanitized_file(file_path):
    with, encoding='utf-8') as fin:
        for line in fin:
            for fn in filters:  # don't shadow the builtin filter()
                if line:
                    line = fn(line)
            if line:
                yield line
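Under that contract, the three filters, plus a variant of the loop that takes an iterable of lines instead of a path (handy for unit tests), might look like this; the names are illustrative:

```python
def filter_comments(line):
    # '#' already covers '#!', so one prefix test is enough.
    return None if line.startswith("#") else line

def filter_new_lines(line):
    return line.rstrip("\n")

def filter_empty_entries(line):
    return line if line.strip() else None

filters = [filter_comments, filter_new_lines, filter_empty_entries]

def sanitize(lines):
    """Generator over an iterable of lines; no file needed to test it."""
    for line in lines:
        for fn in filters:
            if line is not None:
                line = fn(line)
        if line is not None:
            yield line

print(list(sanitize(["#! /bin/sh\n", "data\n", "\n"])))  # ['data']
```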
