Problem
The following code works but is a mess. But being totally new to Ruby I have had big problems trying to refactor it into something resembling clean OOP code. Could you help with this and explain what you are doing?
require 'nokogiri'
require 'open-uri'
require_relative 'db.rb' # Not used
require 'mechanize'
require 'digest/md5'
require 'colorize'
require 'win32console'
# Get the Base URL for Groups
# baseURL = '<some site>'
# The Initial Search Point for Groups
#Base URL
baseURL = '<some site>'
pageURL = '<some site>'
doc = Nokogiri::HTML(open(pageURL))
# Search for nodes by css
pages = doc.css('.pages')
thumbs = doc.css('.thumb')
startingPages = pages.css('a').first.text
totalPages = pages.css('a').last.text.to_i
currentPage = 1
## DO THE FIRST PAGE
thumbs.each do |thumb|
imgSrc = thumb.css('.t_img').first['src']
file = open(imgSrc).read.unpack('H*').first
agent = Mechanize.new
image = agent.get(imgSrc)
md = Digest::MD5.hexdigest(*[file.to_s])
sha = Digest::SHA2.hexdigest(*[file.to_s])
title = thumb.css('.t_img').first['alt']
fileType = image.header["content-type"].to_s
lastModified = image.header["last-modified"].to_s
fileSize = image.header["content-length"].to_s
driver = "Prowler"
source = $pagesPath.to_s
binary = file
fileName = image.filename
# ||, -, *, /, <>, <, >, ,(comma), =, <=, >=, ~=, !=, ^=, (, ) Illegal characters for sql queries. Check title for these before inserting into the database
title = title.gsub("'", "")
begin
# connect to the MySQL server
dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"
dbh.do( sqlText )
puts "Record has been created for".green + "t" + sha.yellow
dbh.commit
rescue DBI::DatabaseError => e
if(e.errstr == "23000 (3604) [Microsoft][ODBC SQL Server Driver][SQL Server]Duplicate key was ignored.")
puts "Ignoring Duplicate File".cyan
else
puts "An error occurred".red
puts "Error code: #{e.err}".red
puts "Error message: #{e.errstr}".red
# check update first
dbh.rollback
end
ensure
# disconnect from server
dbh.disconnect if dbh
end
end
puts "--------------------------PAGE 1-----------------------------------"
(2..totalPages-1).each do |k|
$pagesPath = URI.join(baseURL, "/category/0/All/ctr/" + k.to_s + "/")
rdoc = Nokogiri::HTML(open($pagesPath))
tthumbs = rdoc.css('.thumb')
tthumbs.each do |mthumb|
title = mthumb.css('.t_img').first['alt']
imgSrc = mthumb.css('.t_img').first['src']
file = "0x" + open(imgSrc).read.unpack('H*').first
agent = Mechanize.new
fop = agent.get(imgSrc)
md = Digest::MD5.hexdigest(*[file.to_s])
sha = Digest::SHA2.hexdigest(*[file.to_s])
fileType = agent.head(imgSrc)["content-type"].to_s
lastModified = agent.head(imgSrc)["last-modified"].to_s
fileSize = agent.head(imgSrc)["content-length"].to_s
driver = "Prowler"
source = $pagesPath.to_s
binary = file
fileName = fop.filename
title = title.gsub("'", "")
begin
# connect to the MySQL server
dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"
dbh.do( sqlText )
puts "Record has been created for".green + "t" + sha.yellow
dbh.commit
rescue DBI::DatabaseError => e
if(e.errstr == "23000 (3604) [Microsoft][ODBC SQL Server Driver][SQL Server]Duplicate key was ignored.")
puts "Ignoring Duplicate File".cyan
else
puts "An error occurred".red
puts "Error code: #{e.err}".red
puts "Error message: #{e.errstr}".red
# check update first
dbh.rollback
end
ensure
# disconnect from server
dbh.disconnect if dbh
end
end
puts "-----------------------------PAGE #{k.to_s}------------------------------------"
sleep(2)
end
I would normally, in other languages, have the Database code in a separate file. How would I do that in Ruby, and what else in Ruby is there that I should be doing?
The code goes to a website – hardcoded
Loads the site into doc
which is then parsed using nokogiri.
The pagination for the site is loaded into pages so that we can work out how many pages we are going to process… this is the link at the top or bottom of the page with something like goto: page 1, 2, 3, 4, … 600.
next the page looks for any images on the page – thumbs (thumbnails) with the code put into thumbs variable.
Next the first page is processed…
- go through each thumbnail
- Get its location
- download the file
- Hash it
- Get info like size, name, type
- Save the info and file into the database
- Repeat the process for pages 2 .. lastpage
Solution
Object-oriented design may be a noble goal, but I think you have more urgent needs. The three most obvious problems I see are:
Indentation and spacing
Your indentation is pretty haphazard. Be sure to indent each block consistently. Without indentation, the code is incomprehensible and unmaintainable.
DBI misuse and SQL injection vulnerability
A comment says that you are connecting to a MySQL database, but the error handling is for Microsoft SQL Server. That’s weird, isn’t it?
You’re connecting to and disconnecting from the database once per page. That’s wasteful. The entire program should keep reusing the same database connection.
Most importantly, you should never construct SQL statements by string concatenation or interpolation, since that is what leads to SQL injection vulnerabilities. Instead, you should use parameterized queries:
dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
sth = dbh.prepare('INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES (?, ?, ?, 'NA', ?, ?, ?, ?, ?, ?, ?, ?);')
sth.execute(driver, source, title, imgSrc, fileType, fileName, lastModified, fileSize.to_i, md, sha, binary)
The use of 'NA'
is odd. Usually, NULL
would be used to indicate the lack of a value.
Code duplication
The bulk of the program is repeated code. Such code should live in a function, not just to reduce duplication, but to give it a named purpose.
def scrape_page(dbh, agent, url)
…
end
In fact, very little code should float freely outside functions, and such code should reveal at a glance what the outline of the entire program is.
require …
baseURL = '<some site>'
pageURL = '<some site>'
begin
dbh = DBI.connect(…)
agent = Mechanize.new
scrape_page(dbi, agent, pageURL)
2.upto(totalPages-1) do |k|
scrape_page(dbi, agent, URI.join(baseURL, "/category/0/All/ctr/#{k}/"))
end
ensure
dbh.disconnect if dbh
end
I can’t speak to Ruby but I see an improvement that could be done to your MySQL code. Since this operation loops multiple times and passes a query to MySQL each time, you would considerably improve your database execution by using a stored PROCEDURE
and passing parameters to it. The purpose of this (besides simplifying the code) is that MySQL would store the execution plan with the procedure after it runs for the first time.
So instead of this:
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"
You could do this (once) in MySQL (or pass it from Ruby, doesn’t matter):
DELIMITER |
DROP PROCEDURE IF EXISTS MyProcedure| -- Use a relevant name
CREATE PROCEDURE MyProcedure -- Use a relevant name
(
IN driver VARCHAR, -- Note you can specify character limit if you use VAR(50) or other number
IN source VARCHAR,
IN title VARCHAR,
IN imgSrc VARCHAR,
IN fileType VARCHAR,
IN fileName VARCHAR,
IN lastModified VARCHAR,
IN [fileSize.to_i] VARCHAR, -- Using brackets due to period character
IN md VARCHAR,
IN sha VARCHAR,
IN binary VARCHAR
);
AS
BEGIN
INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES (driver, source, title, 'NA', imgSrc, fileType, fileName, lastModified, [fileSize.to_i], md, sha, binary);
END|
DELIMITER ;
Note you can use something else as DELIMITER
it’s a wild card. I’ve seen //
or $$
used frequently, or you can even use GO
as long as it is not ;
.
Then in your Ruby script you would just do this instead of what you have now:
sqlText = "CALL MyProcedure('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"