I am parsing a set of lines from a file, after reading the file content with
Parser.read_file() I apply a set of methods to solve different issues each line may contain, like removing new lines, comments etc, and set a list of lines in the parser
self.lines instance variable.
Filter methods are defined in
Parser.filter as a list of names and executed after the file is read with
import re import os import io from operator import methodcaller class Parser: # List of filters filters = ['_filter_comments', '_filter_new_lines', '_filter_empty_entries'] def __init__(self, **kwargs): self.lines = kwargs.get('lines', ) self.file_path = kwargs.get('file_path', "data/data_ts.utf8") self.lines_count = kwargs.get('entries_count', None) def read_file(self): """ Reads file data applying filters from self.filters """ __location__ = os.path.realpath( os.path.join(os.getcwd(), os.path.dirname(__file__))) with io.open(os.path.join(__location__, self.file_path), "r", encoding='utf-8') as f: self.lines = f.readlines() self._sanitize() def _sanitize(self): for fn in self.filters: f = methodcaller(fn) f(self) def _filter_comments(self): """ remove lines starting with # or #! """ self.lines = [line for line in self.lines if not line.startswith(("#", "#!"))] def _filter_new_lines(self): self.lines = [line.strip('n') for line in self.lines] def _filter_empty_entries(self): self.lines = [line for line in self.lines if line.strip()]
- I think that the pythonic way to write those filters in one line as
[line.strip('n') for line in self.lines if line.strip() and if not line.startswith(("#", "#!")]would make it harder to read and unit test them or it would be better that way?
_sanitizemethod be called after reading the file or would it make more sense to put it in the setter property of
def lines(self, lines): self.lines = lines self._sanitize()
Is there any better option to arrange the above code or a design pattern like putting the sanitize method outside the Parser class and receiving a list of lines and returning the sanitized list instead of working always with
def __init__(self, **kwargs):
I don’t understand why you didn’t write
def __init__(self, lines=, file_path='data/data_ts.utf8', entries_count=None)
__location__ = os.path.realpath( os.path.join(os.getcwd(), os.path.dirname(__file__)))
Better to just name it
location. Also, cwd can change between calls to your class, so this code seems “too tricky”. Better to have init() nail down a fully qualified pathname once, and store it.
filters = ['_filter_comments', '_filter_new_lines', '_filter_empty_entries']
It’s not obvious why quoting these, and using
methodcaller(), is a Good Thing. Why not simply have a list of three function references, rather than three quoted strings?
The contract on each filter is “assign some or all of lines back into lines”. Rather than evaluate each filter for side effects, you might have their API return some or all of lines, and make the _sanitize driving loop responsible for assignment. Or better, you might make them functions accepting a single line, that return a single modified line or None to suppress it. When you eventually write unit tests you’ll find such an API is simpler to test.
if not line.startswith(("#", "#!"))]
If it starts with ‘#’, then we already saw True and won’t consider ‘#!’.
self.lines = [line.strip('n') for line in self.lines]
Idiomatic code would say
You contemplated putting _sanitize() in the setter property of self.lines. That is feasible, but a bit odd, and would put a greater documentation burden on you, starting with renaming to
class SanitizingParser. Perhaps there is some motivating use case where that makes life easier on the caller, but I’m not yet seeing it.
Consider writing generator(s) that
yield a line at a time, rather than manipulating the entire list of lines. Then you can process giant files using a small memory footprint.
EDIT: Suppose the three filter functions have an API of accepting a str (not None) and returning that same input line or a trimmed line or None. Then a generator might look like this:
filters = [ _filter_comments, _filter_new_lines, _filter_empty_entries, ] def read_sanitized_file(file_path): with io.open(file_path) as fin: for line in fin: for filter in filters: if line: line = filter(line) if line: yield line