Manipulate HTML document loaded into WebBrowser control

Problem

I have developed my own custom solutions for this. The first uses XPath queries; the second, conceptually similar to the first, uses CSS queries processed by sizzle.js.

Here is the sample code for the second solution:

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Windows.Forms;

namespace myTest.WinFormsApp
{
    public partial class MainFormForSizzleTesting : Form
    {
        public MainFormForSizzleTesting()
        {
            InitializeComponent();
        }

        private void MainForm_Load(object sender, EventArgs e)
        {
            webBrowser1.DocumentText = @"
                <html>
                <body>
                <img alt=""0764547763 Product Details"" 
                    src=""http://ecx.images-amazon.com/images/I/51AK1MRIi7L._AA160_.jpg"">
                <hr/>
                <h2>Product Details</h2>
                <ul>
                <li><b>Paperback:</b> 648 pages</li>
                <li class=""test""><b>Publisher:</b> Wiley; Unlimited Edition edition (October 15, 2001)</li>
                <li><b>Language:</b> English</li>
                <li class=""test""><b>ISBN-10:</b> 0764547763</li>
                </ul>
                </body>
                </html>
            ";
        }

        private void cmdTest_Click(object sender, EventArgs e)
        {
            var processor = new WebBrowserControlCSSQueriesProcessor(webBrowser1);

            // change attributes of the first element of the list
            {
                var li = processor.GetHtmlElement("li");
                li.innerHTML = string.Format("<span style='text-transform: uppercase;font-family:verdana;color:green;'>{0}</span>", li.innerText);
            }

            // change attributes of the <li> elements with class = "test"
            var list = processor.GetHtmlElements("li.test");
            foreach (var li in list)
            {
                li.innerHTML = string.Format("<span style='text-transform: uppercase;font-family:verdana;color:blue;'>{0}</span>", li.innerText);
            }

        }

        /// <summary>
        /// Enables IE WebBrowser control to evaluate CSS queries
        /// by injecting sizzle.js (http://cdnjs.cloudflare.com/ajax/libs/sizzle/1.9.1/sizzle.min.js)
        /// and to return CSS queries results to the calling C# code as strongly typed
        /// mshtml.IHTMLElement and IEnumerable<mshtml.IHTMLElement>
        /// </summary>
        public class WebBrowserControlCSSQueriesProcessor
        {
            private System.Windows.Forms.WebBrowser _webBrowser;
            public WebBrowserControlCSSQueriesProcessor(System.Windows.Forms.WebBrowser webBrowser)
            {
                _webBrowser = webBrowser;
                injectScripts();
            }

            private void injectScripts()
            {
                // Thanks to: https://stackoverflow.com/questions/7998996/how-to-inject-javascript-in-webbrowser-control
                HtmlElement head = _webBrowser.Document.GetElementsByTagName("head")[0];
                HtmlElement scriptEl = _webBrowser.Document.CreateElement("script");
                mshtml.IHTMLScriptElement element = (mshtml.IHTMLScriptElement)scriptEl.DomElement;
                element.src = "http://cdnjs.cloudflare.com/ajax/libs/sizzle/1.9.1/sizzle.min.js";
                head.AppendChild(scriptEl);

                string javaScriptText = @"
                        function GetElementsByCSSQuery (cssQuery) {
                            var items = Sizzle(cssQuery);
                            var elements = new Object();
                            var elementIndex = 1;
                            for (i = 0; i < items.length; i++) {
                              elements[elementIndex++] = items[i];
                            }
                            elements.length = elementIndex -1;
                            return elements;
                            };
                       ";
                scriptEl = _webBrowser.Document.CreateElement("script");
                element = (mshtml.IHTMLScriptElement)scriptEl.DomElement;
                element.text = javaScriptText;
                head.AppendChild(scriptEl);            
            }

            /// <summary>
            /// Gets Html element's mshtml.IHTMLElement object instance using CSS query
            /// </summary>
            public mshtml.IHTMLElement GetHtmlElement(string cssQuery)
            {
                string code = string.Format("Sizzle('{0}')[0];", cssQuery);
                return _webBrowser.Document.InvokeScript("eval", new object[] { code }) as mshtml.IHTMLElement;
            }

            /// <summary>
            /// Gets Html elements' IEnumerable<mshtml.IHTMLElement> object instance using CSS query
            /// </summary>
            public IEnumerable<mshtml.IHTMLElement> GetHtmlElements(string cssQuery)
            {
                // Thanks to: https://stackoverflow.com/questions/5278275/accessing-properties-of-javascript-objects-using-type-dynamic-in-c-sharp-4
                var comObject = _webBrowser.Document.InvokeScript("eval", new object[] { string.Format("GetElementsByCSSQuery('{0}')", cssQuery) });
                Type type = comObject.GetType();
                int length = (int)type.InvokeMember("length", BindingFlags.GetProperty, null, comObject, null);

                for (int i = 1; i <= length; i++)
                {
                    yield return type.InvokeMember(i.ToString(), BindingFlags.GetProperty, null, comObject, null) as mshtml.IHTMLElement;
                }
            }
        }

    }
}

When I run the sample code above, I get the expected test results:

[Screenshot: test results]

Could you please review the code and post your notes and remarks on what could be improved in it?

Solution

private void cmdTest_Click(object sender, EventArgs e)

It looks like your button is called cmdTest. Why? Hungarian notation is generally considered bad practice, and even if you use it, why cmd for a button? I think a good name for that button would be TestButton.


WebBrowserControlCSSQueriesProcessor

That name is way too long. Why not shorten it to something like WebBrowserCssProcessor?


li.innerHTML = string.Format("<span style='text-transform: uppercase;font-family:verdana;color:green;'>{0}</span>", li.innerText);
li.innerHTML = string.Format("<span style='text-transform: uppercase;font-family:verdana;color:blue;'>{0}</span>", li.innerText);

These two lines are almost the same; consider extracting them into a method:

private static void ChangeStyle(mshtml.IHTMLElement element, string color)
{
    element.innerHTML = string.Format(
        "<span style='text-transform: uppercase;font-family:verdana;color:{1};'>{0}</span>",
        element.innerText, color);
}

And use it like this:

ChangeStyle(li, "green");
ChangeStyle(li, "blue");
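
Going one step further, the string building can be separated from the DOM assignment, so the markup can be checked without a WebBrowser at all. This is only a sketch: the class name StyleMarkup and the method BuildStyledSpan are hypothetical names I made up for illustration, not part of your code.

```csharp
using System;

public static class StyleMarkup
{
    // Hypothetical helper (not in the original code): builds the <span> markup
    // that ChangeStyle assigns to innerHTML, without touching any DOM object.
    public static string BuildStyledSpan(string text, string color)
    {
        return string.Format(
            "<span style='text-transform: uppercase;font-family:verdana;color:{1};'>{0}</span>",
            text, color);
    }

    public static void Main()
    {
        // Prints the markup that would be assigned to a blue list item.
        Console.WriteLine(BuildStyledSpan("Language: English", "blue"));
    }
}
```

With that in place, ChangeStyle would shrink to a single assignment of StyleMarkup.BuildStyledSpan(element.innerText, color) to element.innerHTML.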

/// Enables IE WebBrowser control to evaluate CSS queries
/// by injecting sizzle.js (http://cdnjs.cloudflare.com/ajax/libs/sizzle/1.9.1/sizzle.min.js)

Why link the included version of the file here? If I need to know that, I can look at the source code. I would either give no link here (and assume people can google sizzle.js) or link to the main page.


System.Windows.Forms.WebBrowser

No need to spell out the whole namespace every time, when you have using System.Windows.Forms at the top of your file.

The same applies to the mshtml namespace: you should put that into a using.


HtmlElement scriptEl = _webBrowser.Document.CreateElement("script");
mshtml.IHTMLScriptElement element = (mshtml.IHTMLScriptElement)scriptEl.DomElement;
element.src = "http://cdnjs.cloudflare.com/ajax/libs/sizzle/1.9.1/sizzle.min.js";

…

scriptEl = _webBrowser.Document.CreateElement("script");
element = (mshtml.IHTMLScriptElement)scriptEl.DomElement;
element.text = javaScriptText;

I don’t have much experience with WebBrowser or mshtml, but why are you using mshtml here in the first place? Why not just use the HtmlElement directly:

scriptEl.SetAttribute("src", "http://cdnjs.cloudflare.com/ajax/libs/sizzle/1.9.1/sizzle.min.js");

scriptEl.InnerText = javaScriptText;

Also, reusing variables (scriptEl and element) like this is not great. You should use different variables here (e.g. sizzleScriptElement and functionScriptElement).


_webBrowser.Document.InvokeScript("eval", new object[] { code }) as mshtml.IHTMLElement
_webBrowser.Document.InvokeScript("eval", new object[] { string.Format("GetElementsByCSSQuery('{0}')", cssQuery) })

Repeated code again, so extract it into a method again:

public T Eval<T>(string code)
{
    return (T)_webBrowser.Document.InvokeScript("eval", new object[] { code });
}

Notice that I used a cast and not as. That’s because when the result has an unexpected type, a cast immediately gives you a clear InvalidCastException, while as returns null and gives you a confusing NullReferenceException later.
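
To make the difference concrete, here is a small console sketch that does not involve WebBrowser. The boxed integer stands in for a script result of an unexpected type, and the class and method names (CastVersusAsDemo, FailureWithAs, FailureWithCast) are made up for this illustration.

```csharp
using System;

public static class CastVersusAsDemo
{
    // Converts with 'as', then uses the result; returns the name of the
    // exception that is thrown, or the string length if nothing fails.
    public static string FailureWithAs(object value)
    {
        try
        {
            string text = value as string;  // null when value is not a string
            return text.Length.ToString();  // the failure only surfaces here
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
    }

    // Converts with a cast; a type mismatch fails at the cast itself.
    public static string FailureWithCast(object value)
    {
        try
        {
            string text = (string)value;    // throws here when value is not a string
            return text.Length.ToString();
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
    }

    public static void Main()
    {
        object scriptResult = 42; // stand-in for an unexpected script result
        Console.WriteLine(FailureWithAs(scriptResult));   // NullReferenceException
        Console.WriteLine(FailureWithCast(scriptResult)); // InvalidCastException
    }
}
```

In real code the null produced by as can travel far from the conversion before anything dereferences it, which is what makes the resulting exception hard to trace.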


public IEnumerable<mshtml.IHTMLElement> GetHtmlElements(string cssQuery)
{
    // Thanks to: http://stackoverflow.com/questions/5278275/accessing-properties-of-javascript-objects-using-type-dynamic-in-c-sharp-4
    var comObject = _webBrowser.Document.InvokeScript("eval", new object[] { string.Format("GetElementsByCSSQuery('{0}')", cssQuery) });
    Type type = comObject.GetType();
    int length = (int)type.InvokeMember("length", BindingFlags.GetProperty, null, comObject, null);

    for (int i = 1; i <= length; i++)
    {
        yield return type.InvokeMember(i.ToString(), BindingFlags.GetProperty, null, comObject, null) as mshtml.IHTMLElement;
    }
}

From the linked question, it seems that accessing length using dynamic works if you access it from JS first. I would do that, since it means you avoid writing all that reflection code.
