Process a line in chunks via Regular Expressions with Ruby

Problem

I’m writing a Ruby script to reverse engineer message log files. They come from an external system and have the following characteristics:

  • Each line of the log file has at least one message.
  • Each line of the log file can have multiple messages.
  • Each message consists of a set of numbers separated by spaces (e.g. 30 0 -1 1 2 1).
  • Each message can have one of many different templates (e.g. some contain five numbers, others contain six).

The approach I’m using is to process each line, one at a time, via a method that takes the string to work on as an argument. It saves a copy of the initial input (for later comparison), then tries to match known patterns. When a pattern matches, the matched text is removed from the string. If there is nothing left, or if no more matches are found, the method exits. Otherwise, it calls itself with the remainder of the string. Here’s the code I came up with, along with an example.

#!/usr/bin/env ruby

def parse_line remainder_of_line

  puts "Processing: #{remainder_of_line}"

  # Save a copy of the initial input for later comparison 
  initial_snapshot = remainder_of_line.dup

  # Look for known pattern matches, removing them if found
  if remainder_of_line.gsub!(/^(\d+) 0 -1 1 (\d+) \d+\s*/, '')
    puts " - Matched format 1 - found: #{$1} - #{$2}\n\n"
  elsif remainder_of_line.gsub!(/^\d+ 0 -1 2 (\d+) \d+\s*/, '')
    puts " - Matched format 2 - found: #{$1}\n\n"

  ### More patterns here. 
  end

  # If nothing changed, then no matches were found.
  if initial_snapshot.eql? remainder_of_line
    puts " - Line still has data but no matches found. (Left with: #{remainder_of_line})\n\n"
  # Keep going if there is anything left.  
  elsif !remainder_of_line.empty?
    parse_line remainder_of_line
  end

end


line = "11 0 -1 2 13560 2 11 0 -1 2 13564 2 11 0 -1 1 36880 106 91 0 -1 1 36881 106 36881 106 91 1 13556 2 36880 106 36880 106 11 1 734 11 0 -1 1 36884 106 91 0 -1 1 36885 106 36885 106 91 1 13556 2 36884 106 36884 106 11 1 735 13556 2 31 18 799 13556 2 31 25 799 "

parse_line line

This works but I’m wondering if there is a better way.

Solution

  • Because you’re using the “bang” version of gsub, parse_line modifies the string you pass to it, which is generally not a good idea. I wouldn’t expect a parsing method to “eat” my input.

  • Since there’s only one line and your regexes are anchored to the start of it, there’s little point in using gsub (i.e. global substitution): you’ll only ever match one occurrence of the pattern, so sub would do the same job.

  • Don’t bother with all the newline literals. puts will automatically add one, and if you want an extra one, you should be able to just say puts with no argument in a strategic location (i.e. after having tried all the patterns).
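
The first two points can be seen in a short sketch. The sample string and the variable names below are made up for illustration; the pattern is format 2 from the question.

```ruby
# Illustrative sketch: the sample data and variable names are assumptions.
format_2 = /^\d+ 0 -1 2 (\d+) \d+\s*/
original = "11 0 -1 2 13560 2 11 0 -1 2 13564 2 "

# sub (no bang) returns a new string and leaves the receiver untouched.
remainder = original.sub(format_2, '')
puts "original:  #{original.inspect}"
puts "remainder: #{remainder.inspect}"

# The pattern is anchored, so gsub removes the same single match as sub.
same_result = original.gsub(format_2, '') == remainder
puts "gsub and sub agree: #{same_result}"

# gsub! (with bang) changes the receiver in place, which is why the
# original parse_line "eats" the string passed in by the caller.
eaten = original.dup
eaten.gsub!(format_2, '')
puts "after gsub!: #{eaten.inspect}"
```

Only the copy that received gsub! changes; the string passed to sub is still intact afterwards.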

This seems like a good fit for Ruby’s case statement (aka switch), since its when clauses can match against regexes directly. Ruby also sets other magic variables besides $1 and $2 whenever a regex matches, such as $' (the part of the string that follows the match). There’s no reason to make the method recursive, though. A simple loop would do nicely too.

For instance:

def parse_line(line)
  puts "Processing: #{line}"

  # Loop until the string's empty (or we hit the return below)
  until line.empty?
    # Try matching the line
    case line
    when /^(\d+) 0 -1 1 (\d+) \d+\s*/
      puts " - Matched format 1 - found: #{$1} - #{$2}"
    when /^\d+ 0 -1 2 (\d+) \d+\s*/
      puts " - Matched format 2 - found: #{$1}"

    # more patterns...

    else # no match
      puts " - Line still has data but no matches found. (Left with: #{line})"
      return # stop here
    end
    line = $' # set line to the *unmatched* part, i.e. the remainder
    puts "" # output an extra blank line
  end

  puts "Entire line matched, yay"
end
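
If you would rather collect the messages than print them, the same loop can be written without the magic variables. The following is only a sketch under some assumptions: parse_messages, the patterns table, and the shape of the returned hashes are hypothetical names chosen for this example. It uses Regexp#match and MatchData#post_match in place of $1 and $', and \A instead of ^, which is equivalent for single-line input.

```ruby
# Sketch only: parse_messages and the patterns table are hypothetical.
# Returns the parsed messages plus whatever part of the line was not matched.
def parse_messages(line)
  # Only the two formats from the question; more would be added here.
  patterns = {
    format_1: /\A(\d+) 0 -1 1 (\d+) \d+\s*/,
    format_2: /\A\d+ 0 -1 2 (\d+) \d+\s*/
  }

  messages = []
  rest = line

  until rest.empty?
    match = nil
    # find stops at the first pattern whose match is non-nil
    name, = patterns.find { |_name, regex| match = regex.match(rest) }
    break unless match # no known format fits; stop and report the leftover

    messages << { format: name, values: match.captures.map(&:to_i) }
    rest = match.post_match # the unmatched remainder, as a new string
  end

  [messages, rest]
end

messages, leftover = parse_messages("11 0 -1 2 13560 2 11 0 -1 1 36880 106 91 1 13556 2 ")
messages.each { |m| puts "#{m[:format]}: #{m[:values].inspect}" }
puts "Unparsed: #{leftover.inspect}" unless leftover.empty?
```

Because post_match returns a new string, the caller’s input is never modified, and an empty leftover means the entire line matched.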
