Optimize mass import of XML to SQLServer in Ruby

Posted on

Problem

I have the following Ruby code that’s designed to update item, price and stock data for items in a MSSQL database. It’s running on a Ruby 1.8.6/Rails 1.2.3 installation, in its own controller (for now)

What I’m looking for is ways to optimize performance.

Right now each item takes about 0,2 seconds (200ms) to process. Edit: There may be 10,000+ items in the XML file, and 10,000,000+ items in the “Items” and “Price” SQL tables.

I’ve been told that

  • composing my own SQL queries directly instead of using the models will be faster, since I’m doing quite a few lookups, how much performance would that lend me?

  • and selecting many rows at once, doing processing, and inserting at once should be faster, instead of doing it one by one. (How would this be done in practice?)


  # counters for statistics
  count = 0
  skips = 0

  # load + parse XML
  file = File.read(XML_PATH)
  doc = REXML::Document.new file

  # loop through products in XML file
  doc.elements.each('products') { |product|
    itemid = product.attributes['id'].to_i
    itemprice = product.elements['price'].text.to_i

    item = Item.find_by_id(itemid)
    if item.nil?
      skips += 1
      next
    end

    # update prices
    price = Price.find(:first, :conditions => { :item_id => itemid })
    # create new price if price not found
    if price.nil?
      price = Price.new
      price.item_id = itemid
    end
    price.price = product.elements['price'].text.to_i
    price.save

    # find + update item stock data
    product.elements.each('stocks/location') { |location|
      item_stock_location_id = location.attributes['location_code']
      item_stock_location = ItemStockLocationCount.find(:first,
                                                    :conditions => {
                                                        :item_stock_location_id => item_stock_location_id,
                                                        :item_id => itemid.to_s
                                                    })

      if item_stock_location.nil?
        item_stock_location = ItemStockLocationCount.new
        item_stock_location.item_stock_location_id = item_stock_location_id
        item_stock_location.item_id = itemid
      end
      item_stock_location.stock_qty = location.elements['stock'].text.to_i
      item_stock_location.save
    }

    # update onsite_stock + offsite_stock
    item_stock_count_on_site = 0
    item_stock_count_off_site = 0
    item_stock_loc_qtys = ItemStockLocationQuantity.find_all({ :item_id => itemid.to_s })
    item_stock_loc_qtys.each { |stock_loc_qty|
      item_stock_count_on_site += stock_loc_qty.stock_qty
      item_stock_count_off_site += stock_loc_qty.stock_qty unless stock_loc_qty.item_stock_location_id == 1
    }
    item.onsite_stock = item_stock_count_on_site
    item.offsite_stock = item_stock_count_off_site

    # update price lock
    item.price_lock = product.elements['price_locked']
    if item.price_lock == 1
      item.vat_price = product.elements['price']
      item.price = (itemprice - ((itemprice * 25)/ (100 + 25)))
    end

    item.save

    count += 1
    # progress bar
    if count%100 == 0
      printf '.'
    end
  }

  puts 'Item Count: ' + count.to_s
  puts 'Items skipped: ' + skips.to_s

Solution

As for performance, I definitely think you want to use SQL’s Bulk Insert from XML functionality. I don’t know how to use it from Ruby, so I’ll leave that to another reviewer.

I do want to point out something else though. I had to scroll down two pages to find the end of the loop that this opens up. doc.elements.each('products') { |product|. In fact, the entirety of the code is inside of this loop. Blocks of code should fit on the screen in their entirety. I call it the “Single Page Principle”.

So, the refactoring step that should be taken is to define a method that takes product as a parameter and put all of the logic necessary to handle a single doc.element inside of it.

def update_product!(product)
    #ALL THE CODEZ
end

# loop through products in XML file
doc.elements.each('products') { |product| update_product!(product) }

The next step would be to break down the steps inside of the new update_product method into smaller methods as well. A good start would be to take the code between the comments and place them inside of a method with a good name. This replaces your comment and breaks the code down into logical chunks.

For example:

Instead of this:

# update prices
price = Price.find(:first, :conditions => { :item_id => itemid })
# create new price if price not found
    if price.nil?
      price = Price.new
      price.item_id = itemid
    end
    price.price = product.elements['price'].text.to_i
    price.save

    # find + update item stock data
    product.elements.each('stocks/location') { |location|
      item_stock_location_id = location.attributes['location_code']
      item_stock_location = ItemStockLocationCount.find(:first,
                                                    :conditions => {
                                                        :item_stock_location_id => item_stock_location_id,
                                                        :item_id => itemid.to_s
                                                    })    

Define this:

def update_price
    # update prices
    price = Price.find(:first, :conditions => { :item_id => itemid })
    # create new price if price not found
    if price.nil?
      price = Price.new
      price.item_id = itemid
    end
    price.price = product.elements['price'].text.to_i
    price.save

    # find + update item stock data
    product.elements.each('stocks/location') { |location|
      item_stock_location_id = location.attributes['location_code']
      item_stock_location = ItemStockLocationCount.find(:first,
                                                    :conditions => {
                                                        :item_stock_location_id => item_stock_location_id,
                                                        :item_id => itemid.to_s
                                                    })
end

And call it from update_product!. Continuing doing this until each method has exactly one responsibility. You should end up with code that looks something like this psuedocode.

def update_product(product)
    update_price
    update_item_stock_data
    update_stock_site
    update_price_lock

    count += 1
    # progress bar
    if count%100 == 0
      printf '.'
    end
end

 # counters for statistics
  count = 0
  skips = 0

  # load + parse XML
  file = File.read(XML_PATH)
  doc = REXML::Document.new file

  # loop through products in XML file
  doc.elements.each('products') { |product| update_product!(product) }

  puts 'Item Count: ' + count.to_s
  puts 'Items skipped: ' + skips.to_s

Leave a Reply

Your email address will not be published. Required fields are marked *