Problem
I have run into an encoding problem in a project that converts certain HTML into an image format. It happens when the HTML we receive contains accented characters and other non-standard ones.
To combat this I have the following method, which finds all of the text between HTML tags and converts every character that isn't already part of an encoded entity (e.g. &#43;) to its HTML-escaped numeric representation, i.e. + becomes &#43;.
private string EncodeToHtml(string contents)
{
    Regex textRegex = new Regex("(?<!<[^>]*)(?<Text>[^<>]*)", RegexOptions.Compiled);
    Regex innerRegex = new Regex("(?<=^|;)([^&]|&(?=[^&;]*(?:&|$)))+", RegexOptions.Compiled);
    return textRegex.Replace(contents, new MatchEvaluator(m =>
    {
        return innerRegex.Replace(m.Groups["Text"].Value, new MatchEvaluator(m2 =>
        {
            string result = string.Empty;
            foreach (char c in m2.Value)
            {
                result += $"&#{(int)c};";
            }
            return result;
        }));
    }));
}
I’d appreciate any comments on the code, especially any way to make it more efficient.
Solution
You can make the loop more efficient by using a StringBuilder together with the Aggregate extension, which avoids allocating a new string on every concatenation:
return m2.Value.Aggregate(
    new StringBuilder(),
    (current, next) => current.Append($"&#{(int)next};")
).ToString();
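One detail worth checking in the Aggregate version: interpolating a char emits the character itself, so the numeric reference needs an explicit cast to int, as in the question's original loop. Below is a minimal sketch of that point. The names NumericEntityDemo, WithoutCast and Encode are made up for illustration and do not appear in the original code.

```csharp
using System.Linq;
using System.Text;

public static class NumericEntityDemo
{
    // Interpolating the char directly keeps the character: '+' gives "&#+;".
    public static string WithoutCast(char c) => $"&#{c};";

    // Casting to int first gives the code point: '+' gives "&#43;".
    public static string Encode(string text) =>
        text.Aggregate(
            new StringBuilder(),
            (current, next) => current.Append("&#").Append((int)next).Append(';')
        ).ToString();
}
```

For example, NumericEntityDemo.WithoutCast('+') returns &#+; while NumericEntityDemo.Encode("+") returns &#43;.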
You don’t need the new MatchEvaluator wrapper; passing just the lambda is OK:
textRegex.Replace(contents, m => ..
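This works because a lambda with a matching signature converts implicitly to the MatchEvaluator delegate type. Here is a small sketch showing that both spellings behave the same. The class LambdaEvaluatorDemo and the digit pattern are hypothetical and only serve the illustration.

```csharp
using System.Text.RegularExpressions;

public static class LambdaEvaluatorDemo
{
    // Illustrative pattern only: one or more ASCII digits.
    private static readonly Regex Digits = new Regex("[0-9]+");

    // Explicit delegate construction, as in the question.
    public static string Explicit(string input) =>
        Digits.Replace(input, new MatchEvaluator(m => "#"));

    // The lambda alone; the compiler converts it to MatchEvaluator.
    public static string Implicit(string input) =>
        Digits.Replace(input, m => "#");
}
```

Both methods return a#b# for the input a1b22.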
You can even compress it by removing all the return statements:
return textRegex.Replace(contents, m =>
    innerRegex.Replace(m.Groups["Text"].Value, m2 =>
        m2.Value.Aggregate(
            new StringBuilder(),
            (current, next) => current.Append($"&#{(int)next};")
        ).ToString())
);
Regex textRegex = ..
Regex innerRegex = ..
I’m not sure whether these two regexes are rebuilt each time the method is called. To be safe, it might be better to move them out of the method and make them static, so they are constructed only once.
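A minimal sketch of that suggestion, assuming the two patterns are kept exactly as in the question. The class name HtmlEncoderWithStaticRegexes is invented here, and the method is made public only so it can be called from outside.

```csharp
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public class HtmlEncoderWithStaticRegexes
{
    // Static field initializers run once, when the type is first used,
    // so neither regex is constructed again on later calls.
    private static readonly Regex textRegex =
        new Regex("(?<!<[^>]*)(?<Text>[^<>]*)", RegexOptions.Compiled);

    private static readonly Regex innerRegex =
        new Regex("(?<=^|;)([^&]|&(?=[^&;]*(?:&|$)))+", RegexOptions.Compiled);

    public string EncodeToHtml(string contents)
    {
        return textRegex.Replace(contents, m =>
            innerRegex.Replace(m.Groups["Text"].Value, m2 =>
                m2.Value.Aggregate(
                    new StringBuilder(),
                    (current, next) => current.Append("&#").Append((int)next).Append(';')
                ).ToString()));
    }
}
```

Regex instances are documented as thread-safe for matching, so sharing them through static readonly fields should be fine even if the method is called from several threads.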
One final thought: you could split this method into three so that you can write finer-grained tests and reuse each part on its own.
This method assembles both helpers to encode HTML:
public string EncodeToHtml(string contents)
{
    return ReplaceHtmlText(contents, m => EncodeText(m.Groups["Text"].Value));
}
This method just allows you to replace the text:
public string ReplaceHtmlText(string text, MatchEvaluator m)
{
    return textRegex.Replace(text, m);
}
This method allows you to encode the text:
public string EncodeText(string text)
{
    return encodeRegex.Replace(text, m =>
        m.Value.Aggregate(
            new StringBuilder(),
            (current, next) => current.Append($"&#{(int)next};")
        ).ToString());
}