Nokogiri crawler

Posted on

Problem

The following code works but is a mess. But being totally new to Ruby I have had big problems trying to refactor it into something resembling clean OOP code. Could you help with this and explain what you are doing?

require 'nokogiri'
require 'open-uri'
require_relative 'db.rb'        # Not used
require 'mechanize'
require 'digest/md5'
require 'colorize'
require 'win32console'

# Get the Base URL for Groups
# baseURL = '<some site>'
# The Initial Search Point for Groups

#Base URL
baseURL = '<some site>'
pageURL = '<some site>'
doc = Nokogiri::HTML(open(pageURL))

# Search for nodes by css
pages = doc.css('.pages')
thumbs = doc.css('.thumb')
startingPages = pages.css('a').first.text
totalPages = pages.css('a').last.text.to_i
currentPage = 1


## DO THE FIRST PAGE
thumbs.each do |thumb|
imgSrc = thumb.css('.t_img').first['src']
file = open(imgSrc).read.unpack('H*').first 
agent = Mechanize.new
image = agent.get(imgSrc)
md = Digest::MD5.hexdigest(*[file.to_s])
sha = Digest::SHA2.hexdigest(*[file.to_s])
title = thumb.css('.t_img').first['alt']
fileType = image.header["content-type"].to_s
lastModified = image.header["last-modified"].to_s
fileSize = image.header["content-length"].to_s
driver = "Prowler"
source = $pagesPath.to_s
binary = file
fileName = image.filename

# ||, -, *, /, <>, <, >, ,(comma), =, <=, >=, ~=, !=, ^=, (, ) Illegal characters for  sql queries. Check title for these before inserting into the database
title = title.gsub("'", "")


begin
# connect to the MySQL server
dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw') 
sqlText =   "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary) 
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"

  dbh.do( sqlText )
  puts "Record has been created for".green + "t" + sha.yellow
  dbh.commit
rescue DBI::DatabaseError => e
  if(e.errstr == "23000 (3604) [Microsoft][ODBC SQL Server Driver][SQL Server]Duplicate key was ignored.")
    puts "Ignoring Duplicate File".cyan
  else
  puts "An error occurred".red
  puts "Error code:    #{e.err}".red
  puts "Error message: #{e.errstr}".red
  # check update first
  dbh.rollback
  end
ensure
# disconnect from server
dbh.disconnect if dbh
end
end 

puts "--------------------------PAGE 1-----------------------------------"

(2..totalPages-1).each do |k|
$pagesPath = URI.join(baseURL, "/category/0/All/ctr/" + k.to_s + "/")
rdoc = Nokogiri::HTML(open($pagesPath))
tthumbs = rdoc.css('.thumb')
tthumbs.each do |mthumb|
title =  mthumb.css('.t_img').first['alt']
imgSrc = mthumb.css('.t_img').first['src']
file = "0x" + open(imgSrc).read.unpack('H*').first 
agent = Mechanize.new
fop = agent.get(imgSrc)
md = Digest::MD5.hexdigest(*[file.to_s])
sha = Digest::SHA2.hexdigest(*[file.to_s])
fileType = agent.head(imgSrc)["content-type"].to_s
lastModified = agent.head(imgSrc)["last-modified"].to_s
fileSize = agent.head(imgSrc)["content-length"].to_s
driver = "Prowler"
source = $pagesPath.to_s
binary = file
fileName = fop.filename
title = title.gsub("'", "")

  begin
  # connect to the MySQL server
  dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
  sqlText =   "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary) 
  VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"

  dbh.do( sqlText )
  puts "Record has been created for".green + "t" + sha.yellow
  dbh.commit
rescue DBI::DatabaseError => e
 if(e.errstr == "23000 (3604) [Microsoft][ODBC SQL Server Driver][SQL Server]Duplicate key was ignored.")
    puts "Ignoring Duplicate File".cyan
  else
  puts "An error occurred".red
  puts "Error code:    #{e.err}".red
  puts "Error message: #{e.errstr}".red
  # check update first
  dbh.rollback
  end
ensure
# disconnect from server
  dbh.disconnect if dbh
end
end 
puts "-----------------------------PAGE #{k.to_s}------------------------------------"
sleep(2)
end

I would normally, in other languages, have the Database code in a separate file. How would I do that in Ruby, and what else in Ruby is there that I should be doing?

The code goes to a website – hardcoded
Loads the site into doc
which is then parsed using nokogiri.
The pagination for the site is loaded into pages so that we can work out how many pages we are going to process… this is the link at the top or bottom of the page with something like goto: page 1, 2, 3, 4, … 600.
next the page looks for any images on the page – thumbs (thumbnails) with the code put into thumbs variable.
Next the first page is processed…

  1. go through each thumbnail
  2. Get its location
  3. download the file
  4. Hash it
  5. Get info like size, name, type
  6. Save the info and file into the database
  7. Repeat the process for pages 2 .. lastpage

Solution

Object-oriented design may be a noble goal, but I think you have more urgent needs. The three most obvious problems I see are:

Indentation and spacing

Your indentation is pretty haphazard. Be sure to indent each block consistently. Without indentation, the code is incomprehensible and unmaintainable.

DBI misuse and SQL injection vulnerability

A comment says that you are connecting to a MySQL database, but the error handling is for Microsoft SQL Server. That’s weird, isn’t it?

You’re connecting to and disconnecting from the database once per page. That’s wasteful. The entire program should keep reusing the same database connection.

Most importantly, you should never construct SQL statements by string concatenation or interpolation, since that is what leads to SQL injection vulnerabilities. Instead, you should use parameterized queries:

dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
sth = dbh.prepare('INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
                   VALUES (?, ?, ?, 'NA', ?, ?, ?, ?, ?, ?, ?, ?);')
sth.execute(driver, source, title, imgSrc, fileType, fileName, lastModified, fileSize.to_i, md, sha, binary)

The use of 'NA' is odd. Usually, NULL would be used to indicate the lack of a value.

Code duplication

The bulk of the program is repeated code. Such code should live in a function, not just to reduce duplication, but to give it a named purpose.

def scrape_page(dbh, agent, url)
  …
end

In fact, very little code should float freely outside functions, and such code should reveal at a glance what the outline of the entire program is.

require …

baseURL = '<some site>'
pageURL = '<some site>'

begin
  dbh = DBI.connect(…)
  agent = Mechanize.new

  scrape_page(dbi, agent, pageURL)
  2.upto(totalPages-1) do |k|
    scrape_page(dbi, agent, URI.join(baseURL, "/category/0/All/ctr/#{k}/"))
  end
ensure
  dbh.disconnect if dbh
end

I can’t speak to Ruby but I see an improvement that could be done to your MySQL code. Since this operation loops multiple times and passes a query to MySQL each time, you would considerably improve your database execution by using a stored PROCEDURE and passing parameters to it. The purpose of this (besides simplifying the code) is that MySQL would store the execution plan with the procedure after it runs for the first time.

So instead of this:

sqlText =   "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary) 
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"

You could do this (once) in MySQL (or pass it from Ruby, doesn’t matter):

DELIMITER |
DROP PROCEDURE IF EXISTS MyProcedure| -- Use a relevant name
CREATE PROCEDURE MyProcedure -- Use a relevant name
  (
  IN driver VARCHAR, -- Note you can specify character limit if you use VAR(50) or other number
  IN source VARCHAR,
  IN title VARCHAR,
  IN imgSrc VARCHAR,
  IN fileType VARCHAR,
  IN fileName VARCHAR,
  IN lastModified VARCHAR,
  IN [fileSize.to_i] VARCHAR, -- Using brackets due to period character
  IN md VARCHAR,
  IN sha VARCHAR,
  IN binary VARCHAR
  );
AS
BEGIN

INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES (driver, source, title, 'NA', imgSrc,  fileType,  fileName,  lastModified, [fileSize.to_i], md, sha, binary);
END|
DELIMITER ;

Note you can use something else as DELIMITER it’s a wild card. I’ve seen // or $$ used frequently, or you can even use GO as long as it is not ;.

Then in your Ruby script you would just do this instead of what you have now:

sqlText =   "CALL MyProcedure('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"

Leave a Reply

Your email address will not be published. Required fields are marked *