Given a blob of text (that is a complete html page) return all email(s) present [closed]

Posted on

Problem

Okay so this is NOT my code, I just inherited it. Seems it has an obvious bug that it will only return the first email despite being called “GetEmails”. My main concern is actually that it takes forever on some pages. The regex looks over complicated, but regex are something I have never really ‘got into’ hence posting it here:

public string GetEmails(string text)
{
    try
    {
        const string MatchEmailPattern =
       @"(([w-]+.)+[w-]+|([a-zA-Z]{1}|[w-]{2,}))@"
       + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])."
         + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
       + @"([a-zA-Z]+[w-]+.)+[a-zA-Z]{2,4})";
        Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        // Find matches.
        MatchCollection matches = rx.Matches(text);
        // Report the number of matches found.
        int noOfMatches = matches.Count;
        // Report on each match.

        foreach (Match match in matches)
        {
            return match.Value.ToString();
        }

    }
    catch (Exception ex)
    {
        CLog.LogMessage(string.Format("{0} {1}", ex.Message, ex.StackTrace));
    }
    return null;

}

Solution

Your are returning string
And you are returning only the first string

return match.Value.ToString();

I suspect you want to return an IEnumerable

public IEnumerable<string> GetEmails(string text)
{
    try
    {
        const string MatchEmailPattern =
              @"(([w-]+.)+[w-]+|([a-zA-Z]{1}|[w-]{2,}))@"
            + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])."
            + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
            + @"([a-zA-Z]+[w-]+.)+[a-zA-Z]{2,4})";
        Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        // Find matches.
        MatchCollection matches = rx.Matches(text);
        // Report on each match.
        foreach (Match match in matches)
        {
            yield return match.Value.ToString();
        }   
    }
    catch (Exception ex)
    {
        CLog.LogMessage(string.Format("{0} {1}", ex.Message, ex.StackTrace));
    }
}

Taking as a fact that your method returns null in case of nothing found, please read the 1st solution. Otherwise suggested 2nd solution.

1st solution

You should have a list declared before the loop as:

List<string> emails = new List<string>();

And the loop should fill in this list:

foreach (Match match in matches)
{
    emails.Add(match.Value.ToString());
}

And after the loop, just return the list:

return emails;

The rest should remain as they are.

public List<string> GetEmails(string text)
{
    try
    {
        const string MatchEmailPattern =
            @"(([w-]+.)+[w-]+|([a-zA-Z]{1}|[w-]{2,}))@"
            + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])."
            + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
            + @"([a-zA-Z]+[w-]+.)+[a-zA-Z]{2,4})";
        Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        // Find matches.
        MatchCollection matches = rx.Matches(text);
        // Report the number of matches found.
        int noOfMatches = matches.Count;
        // Report on each match.
        List<string> emails = new List<string>();
        foreach (Match match in matches)
        {
            emails.Add(match.Value.ToString());
        }
        return emails;
    }
    catch (Exception ex)
    {
        CLog.LogMessage(string.Format("{0} {1}", ex.Message, ex.StackTrace));
    }
    return null;
}

2nd solution

I would suggest that you give a solution returning an empty list in case of nothing found.
The differences in the code are:

  • you return only once (better maintainable structure), and
  • the returned value cannot be null so you will not have to check for nullability in the calling code.

The code should be like:

public List<string> GetEmails(string text)
{
    List<string> emails = new List<string>();
    try
    {
        const string MatchEmailPattern =
            @"(([w-]+.)+[w-]+|([a-zA-Z]{1}|[w-]{2,}))@"
            + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])."
            + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9]).([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
            + @"([a-zA-Z]+[w-]+.)+[a-zA-Z]{2,4})";
        Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        // Find matches.
        MatchCollection matches = rx.Matches(text);
        // Report the number of matches found.
        int noOfMatches = matches.Count;
        // Report on each match.
        foreach (Match match in matches)
        {
            emails.Add(match.Value.ToString());
        }
    }
    catch (Exception ex)
    {
        CLog.LogMessage(string.Format("{0} {1}", ex.Message, ex.StackTrace));
    }
    return emails;
}

Leave a Reply

Your email address will not be published. Required fields are marked *