Small BM25+ implementation for probibalistic result ranking

Posted on

Problem

I came up with this small implementation of an Okapi BM25+ ranker using Ruby. I cooked it up in a very short time so it’s very simple but I’m trying to think whether there’s a better way to write it. In particular, I’m wondering if I should have something like a Ranker class — intuitively I feel as though I should but it seems to work fine this way too.

The two classes used are Document and Collection. A Collection can contain many Documents. Very simple.

document.rb

class Document
  attr_accessor :body, :rank

  def initialize(params={:body=>'', :rank=>nil})
    @body = params[:body]
    @rank = params[:rank]
  end

  def length
    @body.length
  end

  def include?(term)
    @body.include?(term)
  end

  def term_freq(term)
    # Will need something better here to clean out the document.
    # This is temporary ;)
   @body.gsub(/[^sp{Alnum}p{Han}p{Katakana}p{Hiragana}p{Hangul}]/,'')
    .downcase
    .split
    .count(term) 
  end
end

collection.rb

class Collection

  def initialize(params={:docs=>[], :query=>nil})
    @docs = params[:docs]
    @query = params[:query]
  end

  def containing_term(term)
    total = 0
    @docs.each do |doc|
      total += 1 if doc.include?(term) 
    end
    total
  end

  def avg_dl
    doc_lengths = 0
    @docs.each do |doc|
      doc_lengths += doc.length
    end
    doc_lengths / total_docs
  end

  def total_docs
    @docs.size
  end

  def idf(term)
    numerator = total_docs - containing_term(term) + 0.5
    denominator = containing_term(term) + 0.5
    Math.log(numerator / denominator)
  end

  def bm25(params={:k => 1.2, :b => 0.75, :delta => 1.0})
    @k = params[:k]
    @b = params[:b]
    @delta = params[:delta]

    @docs.each do |doc|
      score = 0
      dl = doc.length
      query_terms = @query.split

      query_terms.each do |term|
        dtf = doc.term_freq(term)
        numerator = dtf * (@k + 1)
        denominator = dtf + @k * (1 - @b + @b * (doc.length / avg_dl))
        score += idf(term) * (numerator/denominator) + @delta
      end
      doc.rank = score
    end
    @docs
  end

  def sort_by_rank
    @docs.sort! {|a, b| a.rank <=> b.rank}
  end
end

Solution

You should use built-ins to simplify the code:

def containing_term(term)
  total = 0
  @docs.each do |doc|
    total += 1 if doc.include?(term) 
   end
 total
end

Becomes:

def containing_term(term)
  @docs.count {|doc| doc.include?(term) }
end

And

def avg_dl
  doc_lengths = 0
  @docs.each do |doc|
    doc_lengths += doc.length
  end
  doc_lengths / total_docs
end

Becomes:

def avg_dl
  @docs.map(&:length).inject(:+) / total_docs
end

You are close to immutable, make the last step

In Collection you use sort!, disallowing method chaining, sort and returning a new object would be more expected.

Use a Hash for an easy but massive optimization

The method:

def term_freq(term)
   # Will need something better here to clean out the document.
   # This is temporary ;)
  @body.gsub(/[^sp{Alnum}p{Han}p{Katakana}p{Hiragana}p{Hangul}]/,'')
   .downcase
   .split
   .count(term) 
end

Should be a Hash lookup, where the Hash is created at init time, time complexity would become O(1) with an Hash instead of O(n).

Leave a Reply

Your email address will not be published. Required fields are marked *