Problem
I’m looking for an effective way to parse HTML content in Node.js.
The objective is to extract data from inside the HTML, not to manipulate DOM objects.
It is a server-side thing.
I tried using jsdom and I found two major issues:
- Massive memory usage (probably some memory leak)
- It won’t parse properly if the HTML is malformed.
So I’m considering using regex to search inside the HTML stream.
In the code below I slim down the HTML stream by removing extra spaces and new lines, so the regex will cost less to match:
html = html.replace(/\r?\n|\s{2,}/g,' ');
console.log(html.match(/<my regex>/));
I also thought of putting it in a function that would narrow things down even more by getting only the part of the HTML that matters, like:
<html>
<!-- a lot of irrelevant code -->
<table id="fooTable"> </table>
<!-- a lot of irrelevant code -->
</html>
This would shrink the input so that applying the regex match costs even less:
var i = html.indexOf('fooTable');
var chunk = html.substring(i);
Please have your say.
Would regex be an elegant/effective way to parse large HTML content? Is it CPU-expensive to run a regex on a very large string?
Solution
First, you don’t parse HTML with RegEx. It’s a known fact. Don’t even try.
If you meant manipulating HTML as some arbitrary string (ignoring the structure, semantics, rules and all that jazz), that’s another thing. RegEx might help you, but not without problems.
Here are potential problems that you’ll be facing:
- The preciseness of your pattern with respect to the HTML spec. HTML is more forgiving than XML. That means there are quirks that keep the markup valid even when it doesn’t look valid. Your pattern might not pick up certain cases.
  html-minifier is a good example of a library that knows about (and takes advantage of) quirks in HTML to minify HTML. It has a table that summarizes some of HTML’s quirks.
- The input you’ll be receiving. I’ll assume it’s arbitrary and/or external (otherwise, you wouldn’t be manipulating it this way). A common problem is when the string isn’t what you expect it to be. An example is jQuery expecting JSON, but the server responding with the HTML of an HTTP 500 error page. jQuery runs JSON.parse, then blows up.
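To illustrate the second point, here is a minimal sketch of checking the input before trusting it. `safeParseJson` is a hypothetical helper name made up for this example, not part of jQuery or Node.js; the same idea applies to checking that a string really is the HTML you expect before running a pattern on it.

```javascript
// Hypothetical helper: parse a response body as JSON without throwing.
// Returns { ok: true, value } on success, { ok: false, error } otherwise.
function safeParseJson(body) {
  try {
    return { ok: true, value: JSON.parse(body) };
  } catch (error) {
    return { ok: false, error };
  }
}

// What the caller expected to receive.
const good = safeParseJson('{"rows": 3}');

// What the server actually sent when it failed: an HTML error page.
const bad = safeParseJson('<html><body><h1>500 Internal Server Error</h1></body></html>');

console.log(good.ok, good.value.rows);
console.log(bad.ok, bad.error instanceof SyntaxError);
```

The caller can then branch on `ok` instead of letting the exception take the process down.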
Here are some other problems:
html = html.replace(/\r?\n|\s{2,}/g,' ');
This will blow away content that is sensitive to white-space, like the contents of <pre>. It will also blow away any content that intentionally contains multiple white-space characters, like content coming from a WYSIWYG editor.
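A quick sketch of that damage, assuming the intended pattern is `/\r?\n|\s{2,}/g` (a newline, or a run of two or more white-space characters). The `collapse` function and the sample markup are made up for illustration.

```javascript
// Assumed pattern: replace each newline, or each run of 2+ white-space
// characters, with a single space.
const collapse = (html) => html.replace(/\r?\n|\s{2,}/g, ' ');

// Hypothetical input: a code sample whose layout matters.
const source = '<pre>if (x) {\n    return 1;\n}</pre>';
const collapsed = collapse(source);

// The line breaks and indentation inside <pre> are gone.
console.log(collapsed);
console.log(collapsed.includes('\n'));
```

Once the string has been flattened like this, there is no way to get the original layout back.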
console.log(html.match(/<my regex>/));
As mentioned earlier, the accuracy of your pattern.
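To make the accuracy problem concrete, here is a small sketch. The `naive` pattern is a hypothetical stand-in for whatever goes in place of `<my regex>`, and it targets the `fooTable` element from the question. Every string in `variants` is markup a browser would treat as the same opening tag.

```javascript
// Hypothetical pattern: expects the tag to be written exactly one way.
const naive = /<table id="fooTable">/;

// Variants that HTML treats as the same element.
const variants = [
  '<table id="fooTable">',              // the form the pattern was written for
  "<table id='fooTable'>",              // single-quoted attribute value
  '<table id=fooTable>',                // unquoted attribute value
  '<TABLE ID="fooTable">',              // tag and attribute names are case-insensitive
  '<table class="wide" id="fooTable">', // another attribute comes first
  '<table\n  id="fooTable">',           // line break between tag name and attribute
];

// One boolean per variant: did the pattern find the tag?
const results = variants.map((markup) => naive.test(markup));
console.log(results);
```

Only the first variant matches. You can keep loosening the pattern to cover each case, but at that point you are slowly re-implementing an HTML tokenizer.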