Problem
I’ve created a Python script that tries to properly indent and dedent ASP code. Similar to http://www.aspindent.com/
Since I’ve began using it I’ve noticed some edge cases that I hadn’t anticipated. The reason these edge-cases appear is because of the fact that I’m not parsing straight ASP code. There are instances of HTML and Javascript in the file, thus, some of this is causing indentations to occur sporadically throughout the final output. I realize that part of the problem stems from my regex use. I’m here to see if there is a cleaner way to go about this.
import re, sys
class Indenter:
def __init__(self, string):
self.space = 0
self.count = 0
self.string = string
def print_ln(self, string):
sys.stdout.write(" " * self.space + str(string))
sys.stdout.flush()
def indent(self):
self.print_ln(self.string[self.count])
self.space += 4
def dedent(self):
self.space -= 4
self.print_ln(self.string[self.count])
def dedent_indent(self):
self.space -= 4
self.print_ln(self.string[self.count])
self.space += 4
def main(self):
while self.count < len(self.string):
if re.search("^s*if.*then", str(self.string[self.count]), re.IGNORECASE):
self.indent()
elif re.search("^s*for", str(self.string[self.count]), re.IGNORECASE):
self.indent()
elif re.search("^s*with", str(self.string[self.count]), re.IGNORECASE):
self.indent()
elif re.search("^s*do until", str(self.string[self.count]), re.IGNORECASE):
self.indent()
elif re.search("^s*do$", str(self.string[self.count]), re.IGNORECASE):
self.indent()
elif re.search("^s*Select Case", str(self.string[self.count]), re.IGNORECASE):
self.indent()
elif re.search("^s*End Select", str(self.string[self.count]), re.IGNORECASE):
self.dedent()
elif re.search("^s*loop", str(self.string[self.count]), re.IGNORECASE):
self.dedent()
elif re.search("^s*end with", str(self.string[self.count]), re.IGNORECASE):
self.dedent()
elif re.search("^s*end if", str(self.string[self.count]), re.IGNORECASE):
self.dedent()
elif re.search("^s*next", str(self.string[self.count]), re.IGNORECASE):
self.dedent()
elif re.search("^s*Case", str(self.string[self.count]), re.IGNORECASE):
self.dedent_indent()
elif re.search("^s*else", str(self.string[self.count]), re.IGNORECASE):
self.dedent_indent()
elif re.search("^s*elseif.*then", str(self.string[self.count]), re.IGNORECASE):
self.dedent_indent()
else:
self.print_ln(self.string[self.count])
self.count += 1
with open("scratch.html") as s:
ind = Indenter(s.readlines())
ind.main()
Solution
You should always write the main program protected within a if __name__ == '__main__':
block, so that you can import the file in another program without automatically executing code. Eg)
if __name__ == '__main__':
ind = Indenter(s.readlines())
ind.main()
Don’t write classes where you create the class object just to call one instance method of the object. See the video Stop Writing Classes.
Your code has:
ind = Indenter(s.readlines())
which reads every single line of the file into memory, and stores it in a list. Then, you call the Indenter.main()
function which executes …
while self.count < len(self.string):
# ... many lines of code omitted ...
self.count += 1
… sequentially loops through every element of the list (line of the file) for processing one at a time. Each member function, including this main()
function, uses self.string[self.count]
to reference the current line. This must have been maddening to write, having to type (or copy/paste) the same code over and over. It is frightening to read.
Let’s clean this up.
print_ln(self, string)
is structured correctly. It takes the current line string
as input. Let’s duplicate that in indent()
, dedent()
, and dedent_indent()
, by adding the string
argument to them, like:
def indent(self, string):
self.print_ln(string)
self.space += 4
In the main()
function, we now need to pass the line to each function, just like it was passed to self.print_ln(self.string[self.count])
. But that is a lot of duplicate self.string[self.count]
access, so let’s make a temporary variable for it. We can use that temporary variable in the re.search(....)
calls too.
while self.count < len(self.string):
string = self.string[self.count]
if re.search("^s*if.*then", string, re.IGNORECASE):
self.indent(string)
elif # ... many lines of code omitted ...
else:
self.print_ln(string)
self.count += 1
Note that each string
is a string, so the str(...)
calls can just be omitted.
Now that self.count
is only used in main()
, we can make it a local variable, instead of an instance member.
count = 0
while count < len(self.string):
string = self.string[count]
# ... many lines of code omitted ...
count += 1
A while
loop which counts from 0 up to some limit an be replaced by a for ... in range(...)
loop:
for count in range(len(self.string)):
string = self.string[count]
# ... many lines of code omitted ...
And since count
is only ever used to index into the list
that we are looping over the indices of, we can omit the indices altogether, and just loop over the list of strings itself.
for string in self.string:
# ... many lines of code omitted ...
Now that we are looping directly over self.string
, we don’t need to explicitly know that it is a list of strings. It can be any iterable object that returns, one-at-a-time, all the string we are interested in. Like a file object. And that file object can be passed to the main()
function.
class Indenter:
# ... omitted ...
def main(self, lines):
for line in lines:
if re.search("^s*if.*then", line), re.IGNORECASE):
self.indent(line)
elif # ... many lines of code omitted ...
else:
self.print_ln(line)
if __name__ == '__main__':
with open('scratch.html') as file:
ind = Indenter()
ind.main(file)
Now we are no longer reading the entire file into memory. Rather, we are opening the file, and reading it line-by-line, processing each line as it is read in, and then discarding it.
What does your code do if it encounters a file which is already indented? It indents it some more!
That is probably undesired behaviour. You probably want to strip off any existing indentation before adding the new indentation.
for line in lines:
line = line.strip()
# ... if/elif/else statements omitted ...
Note that this removes the newline from the end of the lines as well, so you’ll want to add that back in, perhaps by changing the sys.stdout.write(...)
sys.stdout.flush()
into a simple print(...)
statement.
With each line stripped of any previous indentation, you can remove the s*
from the start of all of the reg-ex patterns.
There are 6 reg-ex patterns where you want to indent, 5 where you want to dedent, and 3 where you want to do both. You always print. You dedent before printing. You indent after printing.
Instead of writing many lines of code which do the same thing with different patterns, it is much, much easier to write a loop, which loops through the patterns. Let’s put the patterns into lists:
indent_patterns = ['^if.*then', '^for', '^with', '^do until', '^do$', '^select case',
'^case', '^else', '^elseif.*then']
dedent_patterns = ['^end select', '^loop', '^end with', '^end if', '^next',
'^case', '^else', '^elseif.*then']
Now we can say if the current line matches any of the dedent_patterns
, reduce space
. Then print the line. Then, if the current line matches any of the indent_patterns
, increase space
.
if any(re.search(pattern, line, re.IGNORECASE) for pattern in dedent_patterns):
self.space -= 4
self.print_ln(line)
if any(re.search(pattern, line, re.IGNORECASE) for pattern in indent_patterns):
self.space += 4
… and the indent
, dedent
and dedent_indent
functions just became unnecessary.
Since print_ln()
is called in one spot, we can write than inline as well, which eliminates the need for self.space
to be a member variable.
With no member variables left, the entire class can be replaced with one function.
import re
def indent(lines, spacing=4):
indent_patterns = ['^if.*then', '^for', '^with', '^do until', '^do$', '^select case',
'^case', '^else', '^elseif.*then']
dedent_patterns = ['^end select', '^loop', '^end with', '^end if', '^next',
'^case', '^else', '^elseif.*then']
space = 0
for line in lines:
line = line.strip()
if any(re.search(pattern, line, re.IGNORECASE) for pattern in dedent_patterns):
space -= spacing
print(" "*space + line)
if any(re.search(pattern, line, re.IGNORECASE) for pattern in indent_patterns):
space += spacing
if __name__ == '__main__':
with open('scratch.html') as file:
indent(file)
And like what was said above, stop writing classes.
This does not address the sporadic indentation caused by HTML and Javascript in the file, but
- this is a Code Review, which only addresses the code you have actually written. This is not a forum where you come to get help writing new code, but to get a review of code you have already written.
- with a cleaner implementation of the indent code, you should have an easier time addressing that issue. When you have finished, feel free to post a follow-up question.