Using Regex to parse a chat transcript

Problem

I need to classify each line as an announcement, a whisper, or chat. Once I have that sorted out, I need to extract certain values to be processed.

Right now my regex is as follows:

var regex = new Regex(@"^\[(\d{2}:\d{2}:\d{2})\]\s*(?:(\[System Message\])?\s*<([^>]*)>|((.+) Whisper You :))\s*(.*)$");
  • Group 0 is the entire message.
  • Group 1 is the time the message was sent.
  • Group 2 indicates whether it was an announcement or chat.
  • Group 3 is who sent the announcement or chat message.
  • Group 4 indicates whether it was a whisper.
  • Group 5 is who sent the whisper.
  • Group 6 is the message sent by the user or system.

Classify each line:

if group 4 matches
    it is a whisper
else if group 2 matches
    it is an announcement
else
    it is normal chat

Should I change anything in my regex to make the matches more precise?

Sample data:

[02:33:03] John Whisper You :  Heya
[02:33:03] John Whisper You :  How is it going
[02:33:12] <John> [02:33:16] [System Message] bla bla
[02:33:39] <John> heya
[02:33:40] <John> hello :S
[02:33:57] <John> hi
[02:33:57] [System Message] <John> has left the room 
[02:33:57] [System Message] <John> has entered the room 

Solution

You can always break the pattern across multiple lines to make it more readable. You can also use named groups, which take the “magic” out of the group numbers (4 == whisper, 3 == normal, etc.).

        var regex = new Regex(@"^\[(?<TimeStamp>\d{2}:\d{2}:\d{2})\]\s*" +
            @"(?:" +
                @"(?<SysMessage>\[System Message\])?\s*" +
                @"<(?<NormalWho>[^>]*)>|" +
                @"(?<Whisper>(?<WhisperWho>.+) Whisper You :))\s*" +
            @"(?<Message>.*)$");

        string data = @"[02:33:03] John Whisper You :  Heya
[02:33:03] John Whisper You :  How is it going
[02:33:12] <John> [02:33:16] [System Message] bla bla
[02:33:39] <John> heya
[02:33:40] <John> hello :S
[02:33:57] <John> hi
[02:33:57] [System Message] <John> has left the room 
[02:33:57] [System Message] <John> has entered the room";

        foreach (var msg in data.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            Match match = regex.Match(msg);
            if (match.Success)
            {
                if (match.Groups["Whisper"].Success)
                {
                    Console.WriteLine("[whis from {0}]: {1}", match.Groups["WhisperWho"].Value, msg);
                }
                else if (match.Groups["SysMessage"].Success)
                {
                    Console.WriteLine("[sys msg]: {0}", msg);
                }
                else
                {
                    Console.WriteLine("[normal from {0}]: {1}", match.Groups["NormalWho"].Value, msg);
                }
            }
        }
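
Since the question also asks about extracting values for further processing, here is a minimal sketch of how the named groups could be turned into typed values instead of being printed. The names `LineKind`, `TranscriptParser` and `TryParse` are made up for this illustration and are not part of the code above; it also assumes the timestamp is always a valid `hh:mm:ss` time and that trailing whitespace in the message is not wanted.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical names for illustration only.
enum LineKind { Whisper, Announce, Chat }

static class TranscriptParser
{
    // Same pattern as in the answer, built once and reused for every line.
    static readonly Regex LineRegex = new Regex(
        @"^\[(?<TimeStamp>\d{2}:\d{2}:\d{2})\]\s*" +
        @"(?:" +
            @"(?<SysMessage>\[System Message\])?\s*" +
            @"<(?<NormalWho>[^>]*)>|" +
            @"(?<Whisper>(?<WhisperWho>.+) Whisper You :))\s*" +
        @"(?<Message>.*)$");

    // Returns false when the line does not look like a transcript line
    // or when the timestamp is not a valid time (e.g. "99:99:99").
    public static bool TryParse(string line, out LineKind kind, out TimeSpan time,
                                out string who, out string text)
    {
        kind = LineKind.Chat;
        time = default(TimeSpan);
        who = string.Empty;
        text = string.Empty;

        Match m = LineRegex.Match(line);
        if (!m.Success)
            return false;

        if (!TimeSpan.TryParseExact(m.Groups["TimeStamp"].Value, @"hh\:mm\:ss",
                                    CultureInfo.InvariantCulture, out time))
            return false;

        if (m.Groups["Whisper"].Success)
        {
            kind = LineKind.Whisper;
            who = m.Groups["WhisperWho"].Value;
        }
        else
        {
            // The sender group is shared by announcements and normal chat.
            kind = m.Groups["SysMessage"].Success ? LineKind.Announce : LineKind.Chat;
            who = m.Groups["NormalWho"].Value;
        }

        // Assumption: trailing whitespace is not part of the message.
        text = m.Groups["Message"].Value.TrimEnd();
        return true;
    }
}

static class Program
{
    static void Main()
    {
        string[] samples =
        {
            "[02:33:03] John Whisper You :  Heya",
            "[02:33:39] <John> heya",
            "[02:33:57] [System Message] <John> has left the room "
        };

        foreach (string line in samples)
        {
            LineKind kind;
            TimeSpan time;
            string who, text;
            if (TranscriptParser.TryParse(line, out kind, out time, out who, out text))
                Console.WriteLine("{0} at {1} from {2}: {3}", kind, time, who, text);
        }
    }
}
```

Keeping the classification in one method means the group names only appear in one place, so the rest of the program can work with `kind`, `who` and `text` without knowing about the regex.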
