XSLT 2.0: Crawl HTML and add links


Problem

Background: I have 4 GB of text data spread across 250,000 HTML files. I want to interlink the files with <a> elements that the reader can click. I have a 12 MB file of regex patterns that identify where the <a> elements should go.

Situation: I have developed a working proof of concept, three files:

  • an XML file of regex patterns marking where we want to place a link <a>
  • a test HTML file
  • an XSLT file that reads the regex patterns and applies them to the HTML file

Concern: I have slow performance when I apply the proof of concept to full production data.

The regex patterns (test-anchor-sites.xml):

<regexes>
    <!-- validated list of HREF and IDs where a reader would want to click to -->
    <regex match="Chapter 1" href="../chapter1.html"/>
    <regex match="Chapter 2" href="../chapter2.html"/>
    <regex match="Chapter 3" href="../chapter3.html"/>
    <regex match="Chapter 4" href="../chapter4.html"/>
    <regex match="laminectomy" href="../chapter1.html" id="#d2e1346"/>
</regexes>

The test HTML:

<!DOCTYPE HTML>
<html>
    <head>
        <title>Set Anchor IDs: Test File</title>
    </head>
    <body>
        <div class="cover">
            <div><b>S</b>pinal <b>S</b>urgery</div>
        </div>
        <div class="intro">
            <div>Degeneration of one or more disc(s) of the spine is called <i>degenerative disc disease</i> (DDD).</div>
            <div>Often, degenerative DDD can be successfully treated without surgery. Chapter 1 describes these <b>non</b>-surgical treatments.</div>
            <div>Chapter 2 describes a Laminectomy, which is a surgical procedure that removes a portion of the vertebral bone called the lamina.</div>
            <div>
                <p>A discectomy is the surgical removal of herniated disc material that presses on a nerve root or the spinal cord. It is covered in Chapter 3 and Chapter 4.</p>
                <p>Open discectomy is done through a large incision, and is described in Chapter 3.</p>
                <p>Microdiscectomy is minimally invasive surgery, described in Chapter 4, and is often the most appropriate treatment after conservative treatments fail to provide relief.</p>
                <div>A percutaneous discectomy is a surgical procedure in which the central portion of an intervertebral disc is accessed and removed through a cannula.</div>
            </div>
        </div>
    </body>
</html>

The style sheet to load the regex patterns and apply them to the HTML:

<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

    <xsl:output
        method="xhtml"
        html-version="5.0"
        omit-xml-declaration="yes"
        encoding="UTF-8"
        indent="yes" />

    <xsl:strip-space elements="*"/>

    <xsl:variable name="regexes" 
        select="document('test-anchor-sites.xml')"/>
    <xsl:variable name="regex-matches" 
        select="string-join($regexes//regex/@match, '|')"/>

    <xsl:key name="id-lookup" match="regex" use="@match"/>

    <!-- start -->

    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*/text()">
        <xsl:analyze-string select="." regex="{$regex-matches}">
            <xsl:matching-substring>
                <a>
                    <xsl:attribute name="href" select="key('id-lookup',.,$regexes)/@href"/>
                    <xsl:value-of select="."/>
                </a>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>

Result: The code runs and does what I want, but it takes a very long time. Based on a partial run, I estimate it would take 24 hours to apply the 12 MB of regex patterns to the 4 GB of HTML.

Is there a more efficient way to do this?

Design notes:

  1. Yes, it has occurred to me: maybe 24 hours is OK. After all, applying thousands of regex patterns to 250K html files is a tall order.
  2. I will add predicates to the <xsl:template match="*/text()"> code to narrow down the text that is crawled, such as [not(ancestor::toc)] or [ancestor::chapter]. (A test like [not(self::toc)] would filter nothing, because a text node is never itself a toc element.) I expect this to trim run-time by about 10%, not a big change, but it helps.
  3. Not shown here: a stylesheet that applies the XSLT above to a document collection. I don’t think this is the problem; that code has worked in another system for a very long time.
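
Note 2 could be sketched as predicates on the match pattern of the text template. This is only an illustration: toc and chapter are the element names mentioned in the note and are assumed to exist in the production markup (the test HTML above has neither, so nothing in it would be linked). The predicates use the ancestor axis, since a text node is never itself a toc element. The template relies on $regexes, $regex-matches and the id-lookup key from the stylesheet above.

    <!-- Sketch: replaces the match="*/text()" template in the stylesheet above.
         Text nodes that fail the predicates fall through to the identity
         template and are copied unchanged, without any regex work. -->
    <xsl:template match="*/text()[ancestor::chapter][not(ancestor::toc)]">
        <xsl:analyze-string select="." regex="{$regex-matches}">
            <xsl:matching-substring>
                <a>
                    <xsl:attribute name="href" select="key('id-lookup',.,$regexes)/@href"/>
                    <xsl:value-of select="."/>
                </a>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>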

Solution

If it does run in 24 hours, that might well be the best way to do it. The only way I can think of to speed it up is to build an index (using xsl:key) of the words that appear in the link patterns, and then pre-filter each text node to see whether any of its words are in the index before applying the regular expressions. This of course won’t give quite the same result: the current code doesn’t take word boundaries into account, so a pattern can match inside a longer word, whereas a word-based filter would skip a text node that contains the pattern only as part of a longer word.
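
One way the word index might look is sketched below. It rests on assumptions: it is not tested against the production data, it treats every @match as a plain phrase (as in the test file) rather than a regex with metacharacters, and word-lookup is a name made up for this illustration. It reuses $regexes and the id-lookup key from the stylesheet above. Whether building a smaller regex per text node is faster in practice would need to be measured.

    <!-- Sketch: index every <regex> by each word in its @match.
         The [.] predicate drops empty tokens. -->
    <xsl:key name="word-lookup" match="regex"
        use="tokenize(@match, '\W+')[.]"/>

    <!-- Sketch: replaces the match="*/text()" template in the stylesheet above. -->
    <xsl:template match="*/text()">
        <!-- The distinct words of this text node -->
        <xsl:variable name="words"
            select="distinct-values(tokenize(., '\W+')[.])"/>
        <!-- Only the patterns that share at least one word with the text -->
        <xsl:variable name="candidates"
            select="key('word-lookup', $words, $regexes)"/>
        <xsl:choose>
            <xsl:when test="exists($candidates)">
                <!-- Apply just the candidate patterns, not all of them -->
                <xsl:analyze-string select="."
                    regex="{string-join($candidates/@match, '|')}">
                    <xsl:matching-substring>
                        <a>
                            <xsl:attribute name="href" select="key('id-lookup',.,$regexes)/@href"/>
                            <xsl:value-of select="."/>
                        </a>
                    </xsl:matching-substring>
                    <xsl:non-matching-substring>
                        <xsl:value-of select="."/>
                    </xsl:non-matching-substring>
                </xsl:analyze-string>
            </xsl:when>
            <xsl:otherwise>
                <!-- No shared word: copy the text without running any regex -->
                <xsl:value-of select="."/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

With the test files, a text node containing "Chapter 3" shares the words "Chapter" and "3" with the patterns, so all four chapter patterns become candidates, while a node with no pattern words at all is copied straight through.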
