Problem
I wrote this code in Java using the Jaunt library. The program scrapes all words from Wiktionary from category “English_uncountable_nouns”. And after save
each world to text file.
I am not sure that exceptions are handled in best way, so I’m looking for tips on how to make this code easier to understand and more stable. I removed some of the redundancies but it is still strange.
package wiksurfer;
import com.jaunt.*;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.logging.Level;
import java.util.logging.Logger;
public class Wiksurfer {
public static void main(String[] args) {
surfPages();
}
public static void surfPages() {
int i = 0;//counter for number of processed pages
UserAgent userAgent = new UserAgent();
userAgent.settings.autoSaveAsHTML = true; //change settings to autosave last visited page.
Elements elements;
String wikiword;
try {
PrintWriter pw = new PrintWriter(new FileWriter("wkwords.txt"));
try {
String href = "http://en.wiktionary.org/wiki/Category:English_uncountable_nouns";
for (i = 0;; i++) {
userAgent.visit(href);
href = userAgent.doc.getHyperlink("next page").getHref().replaceAll("amp;", "");
//Jaunt has now a strange bug on processing strings, so the fragment replaceAll("amp;", "")
//is devoted to fix this issue
System.out.println("next page:" + href);
elements = userAgent.doc.findEvery("<div class=mw-category-group>").findEach("<li>").findEach("<a>");
//select all links which contan words we need
for (Element el : elements) { //iterate through Results
wikiword = el.getText().replaceAll("amp;", "");
if (isGoodWord(wikiword)) {
pw.println(wikiword);
}
}
}
} catch (JauntException e) {
System.err.println(e);
} finally {
pw.close();
}
} catch (IOException ex) {
Logger.getLogger(Wiksurfer.class.getName()).log(Level.SEVERE, null, ex);
}
}
public static boolean isGoodWord(String s) {
char c;
if (s.split(" +").length > 1) {
return false;
}
c = s.charAt(0);
return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
}
}
Solution
First, you could wrap the PrintWriter pw
into a try-with-resources statement.
Without knowing the library, it is unclear how the loop for (i = 0;; i++)
can be left (only with an exception). Does some method throw a JauntException
?
If so, where? JauntException
also seems to be a quite general exception. Maybe a more specific one can be used. Also if the API provides such methods, it might be a good idea to test if more results are available (like hasNext()
in Iterator
) (instead of relying on exceptions for control flow).
Don’t use exceptions for flow control. I suppose a JauntException
will be thrown when there is no next page, interrupting the otherwise infinite loop. This is not a good way to exit a loop. Use conditionals instead.
I don’t see the loop variable used. If you don’t use it, delete it.
Don’t declare variables at the top. Declare them where you use them. Declaring at the top was an old requirement in C. In modern languages like Java, it’s recommended to declare variables in the smallest block where they are used, and as close as possible where they are used.
Splitting by spaces and checking the length of the resulting array is an awkward way of checking if a string contains spaces. Use s.indexOf(' ') > -1
instead.