Problem
I came up with this small implementation of an Okapi BM25+ ranker using Ruby. I cooked it up in a very short time so it’s very simple but I’m trying to think whether there’s a better way to write it. In particular, I’m wondering if I should have something like a Ranker
class — intuitively I feel as though I should but it seems to work fine this way too.
The two classes used are Document
and Collection
. A Collection
can contain many Document
s. Very simple.
document.rb
class Document
attr_accessor :body, :rank
def initialize(params={:body=>'', :rank=>nil})
@body = params[:body]
@rank = params[:rank]
end
def length
@body.length
end
def include?(term)
@body.include?(term)
end
def term_freq(term)
# Will need something better here to clean out the document.
# This is temporary ;)
@body.gsub(/[^sp{Alnum}p{Han}p{Katakana}p{Hiragana}p{Hangul}]/,'')
.downcase
.split
.count(term)
end
end
collection.rb
class Collection
def initialize(params={:docs=>[], :query=>nil})
@docs = params[:docs]
@query = params[:query]
end
def containing_term(term)
total = 0
@docs.each do |doc|
total += 1 if doc.include?(term)
end
total
end
def avg_dl
doc_lengths = 0
@docs.each do |doc|
doc_lengths += doc.length
end
doc_lengths / total_docs
end
def total_docs
@docs.size
end
def idf(term)
numerator = total_docs - containing_term(term) + 0.5
denominator = containing_term(term) + 0.5
Math.log(numerator / denominator)
end
def bm25(params={:k => 1.2, :b => 0.75, :delta => 1.0})
@k = params[:k]
@b = params[:b]
@delta = params[:delta]
@docs.each do |doc|
score = 0
dl = doc.length
query_terms = @query.split
query_terms.each do |term|
dtf = doc.term_freq(term)
numerator = dtf * (@k + 1)
denominator = dtf + @k * (1 - @b + @b * (doc.length / avg_dl))
score += idf(term) * (numerator/denominator) + @delta
end
doc.rank = score
end
@docs
end
def sort_by_rank
@docs.sort! {|a, b| a.rank <=> b.rank}
end
end
Solution
You should use built-ins to simplify the code:
def containing_term(term)
total = 0
@docs.each do |doc|
total += 1 if doc.include?(term)
end
total
end
Becomes:
def containing_term(term)
@docs.count {|doc| doc.include?(term) }
end
And
def avg_dl
doc_lengths = 0
@docs.each do |doc|
doc_lengths += doc.length
end
doc_lengths / total_docs
end
Becomes:
def avg_dl
@docs.map(&:length).inject(:+) / total_docs
end
You are close to immutable, make the last step
In Collection
you use sort!
, disallowing method chaining, sort
and returning a new object would be more expected.
Use a Hash for an easy but massive optimization
The method:
def term_freq(term)
# Will need something better here to clean out the document.
# This is temporary ;)
@body.gsub(/[^sp{Alnum}p{Han}p{Katakana}p{Hiragana}p{Hangul}]/,'')
.downcase
.split
.count(term)
end
Should be a Hash lookup, where the Hash is created at init time, time complexity would become O(1) with an Hash instead of O(n).