Counting Characters from an HTML File with Python


I just completed level 2 of The Python Challenge. I am still learning Python, so please bear with me and any silly mistakes I may have made.

I am looking for some feedback about what I could have done better in my code. Two areas specifically:

  • How could I have identified the comment section of the HTML file more easily? I used a roundabout method that scans backward from the end of the page until it finds the start of the comment. It also picked up some extra characters, which I recognized and anticipated (the extra "-->" and "-"). What condition would have found this comment more directly, so I could put it in a new string to be counted?

This is what I wrote:

from collections import Counter
import requests

page = requests.get('')

pagetext = ""
pagetext = (page.text)
#find out what number we are going back to

i = 1
x = 4
testchar = ""
testcharstring = ""

while x == 4:
    testcharstring = pagetext[-i:]
    testchar = testcharstring[0]
    if testchar == "-":
        testcharstring = pagetext[-(i+1)]
        testchar = testcharstring[0]
        if testchar == "-":
            testcharstring = pagetext[-(i+2)]
            testchar = testcharstring[0]
            if testchar == "!":
                testcharstring = pagetext[-(i+3)]
                testchar = testcharstring[0]
                if testchar == "<":
                    x = 3
                i += 1
                x = 4
            i += 1
            x = 4
        i += 1

newstring = pagetext[-i:]

charcount = Counter(newstring)


And this is the source HTML:

  <link rel="stylesheet" type="text/css" href="../style.css">
<center><img src="ocr.jpg">
<br><font color="#c03000">
recognize the characters. maybe they are in the book, <br>but MAYBE they 
are in the page source.</center>


<font size="-1" color="gold">
General tips:
<li>Use the hints. They are helpful, most of the times.</li>
<li>Investigate the data given to you.</li>
<li>Avoid looking for spoilers.</li>
Forums: <a href=""/>Python Challenge Forums</a>, 
read before you post.
IRC: #pythonchallenge
To see the solutions to the previous level, replace pc with pcc, i.e. go 


find rare characters in the mess below:


Followed by thousands of characters, and the comment concludes with "-->".


I don’t have enough reputation to comment, so I must say this in an answer.
It looks clunky to use

    while x == 4:

and then do

    x = 3

whenever you want to break out of the loop.
It looks better to do

    while True:

and when you want to break out of the loop do

    break

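As a rough sketch of that pattern (not the original poster's code), here is a backward scan for a marker written with `while True` and `break`. The function name `find_last_marker` and the sample string are made up for illustration.

```python
def find_last_marker(text, marker="<!--"):
    """Return the index where the last occurrence of marker starts, or -1."""
    start = len(text) - len(marker)
    while True:
        if start < 0:
            # Ran off the front of the string without finding the marker.
            return -1
        if text[start:start + len(marker)] == marker:
            # Found it: leave the loop directly instead of changing a flag.
            break
        start -= 1
    return start


sample = "<p>intro</p><!-- first --><!-- second -->"
print(find_last_marker(sample))
```

The loop condition no longer depends on a placeholder variable such as `x`, so the exit points are visible where they happen.
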
Redundant Code

pagetext = ""
pagetext = (page.text)

The first line assigns an empty string to pagetext. The second line ignores the contents already in pagetext and assigns a different value to the variable.

Why bother with the first statement? It simply makes the code longer, slower, and harder to understand.

Why bother with the (...) around page.text? They also are not serving any purpose.

Variable Names

Variables like i are a double-edged sword. You’re using it as a loop index, and then you’re using it to reference a found location after the loop terminates. But i by itself doesn’t have much meaning. posn might be clearer. last_comment_posn would be much clearer, though very verbose.

PEP-8 recommends using underscores to separate words in variable names; i.e., use char_count, not charcount, and so on.

Searching for a string of characters

Python strings have built-in methods for searching for a substring in a larger string. For instance, str.find returns the index of the first occurrence of <!-- in the page text, or -1 if it does not occur.

i = pagetext.find("<!--")

But you’re not looking for the first one; you’re looking for the last one. Python again has you covered, with the reverse find function: str.rfind.

i = pagetext.rfind("<!--")

But this only gives the index where the last comment marker starts. You want the characters after the marker, so skip forward past its 4 characters:

if i >= 0:
    newstring = pagetext[i+4:]
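
To see how these calls behave without fetching the page, here is a small sketch on a made-up string. The `sample` text and the variable names are assumptions for illustration only.

```python
sample = "<html><!-- a --><body></body><!-- xyz -->"

first = sample.find("<!--")      # index where the first marker starts
last = sample.rfind("<!--")      # index where the last marker starts
missing = sample.find("<?php")   # find returns -1 when nothing matches

if last >= 0:
    # Skip past the 4-character marker itself.
    tail = sample[last + len("<!--"):]
    print(repr(tail))
```

Note that `tail` still includes the closing `-->`, because the slice runs to the end of the string.
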

Improved code

import requests
from collections import Counter

page = requests.get('')
page.raise_for_status()  # Crash if the request didn't succeed
page_text = page.text

posn = page_text.rfind("<!--")

if posn >= 0:
    comment_text = page_text[posn+4:]    # Fix!  This is to end of string, not end of comment!
    char_count = Counter(comment_text)
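
One possible way to address the "Fix!" note above is to search for the closing `-->` after the opening marker and slice between the two. This is a sketch under assumptions: the helper name `last_comment_body` and the in-memory `sample_page` are hypothetical stand-ins, and the real page text would come from `requests` as shown earlier.

```python
from collections import Counter


def last_comment_body(text):
    """Return the text between the last '<!--' and the '-->' that follows it."""
    start = text.rfind("<!--")
    if start < 0:
        return ""                      # no comment in the text
    start += len("<!--")
    end = text.find("-->", start)      # search only after the opening marker
    if end < 0:
        end = len(text)                # unterminated comment: take the rest
    return text[start:end]


# Stand-in for page.text so the example runs without a network request.
sample_page = "<html><!-- hint --><!--\n%%@@a%%b@@\n-->\n"

body = last_comment_body(sample_page)
char_count = Counter(body)

# Characters that occur exactly once, in the order they appear.
rare = [ch for ch in body if char_count[ch] == 1]
print("".join(rare))
```

Because the slice stops at the closing marker, the counts no longer include the trailing `-->` characters.
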
