Scraper for words from Wiktionary

Posted on

Problem

I wrote this code in Java using the Jaunt library. The program scrapes all words from Wiktionary from category “English_uncountable_nouns”. And after save
each world to text file.
I am not sure that exceptions are handled in best way, so I’m looking for tips on how to make this code easier to understand and more stable. I removed some of the redundancies but it is still strange.

package wiksurfer;
import com.jaunt.*;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Wiksurfer {

    public static void main(String[] args) {
        surfPages();
    }

    public static void surfPages() {
        int i = 0;//counter for number of processed pages
        UserAgent userAgent = new UserAgent();
        userAgent.settings.autoSaveAsHTML = true;  //change settings to autosave last visited page.
        Elements elements;
        String wikiword;

        try {
            PrintWriter pw = new PrintWriter(new FileWriter("wkwords.txt"));
            try {
                String href = "http://en.wiktionary.org/wiki/Category:English_uncountable_nouns";

                for (i = 0;; i++) {
                    userAgent.visit(href);
                    href = userAgent.doc.getHyperlink("next page").getHref().replaceAll("amp;", "");
                    //Jaunt has now a strange bug on processing strings, so the fragment replaceAll("amp;", "")
                    //is devoted to fix this issue
                    System.out.println("next page:" + href);

                    elements = userAgent.doc.findEvery("<div class=mw-category-group>").findEach("<li>").findEach("<a>");
                    //select all links which contan words we need
                    for (Element el : elements) {           //iterate through Results
                        wikiword = el.getText().replaceAll("amp;", "");
                        if (isGoodWord(wikiword)) {
                            pw.println(wikiword);
                        }
                    }
                }

            } catch (JauntException e) {
                System.err.println(e);
            } finally {
                pw.close();
            }
        } catch (IOException ex) {
            Logger.getLogger(Wiksurfer.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static boolean isGoodWord(String s) {
        char c;
        if (s.split(" +").length > 1) {
            return false;
        }
        c = s.charAt(0);
        return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
    }
}

Solution

First, you could wrap the PrintWriter pw into a try-with-resources statement.

Without knowing the library, it is unclear how the loop for (i = 0;; i++) can be left (only with an exception). Does some method throw a JauntException ?
If so, where? JauntException also seems to be a quite general exception. Maybe a more specific one can be used. Also if the API provides such methods, it might be a good idea to test if more results are available (like hasNext() in Iterator) (instead of relying on exceptions for control flow).

Don’t use exceptions for flow control. I suppose a JauntException will be thrown when there is no next page, interrupting the otherwise infinite loop. This is not a good way to exit a loop. Use conditionals instead.

I don’t see the loop variable used. If you don’t use it, delete it.

Don’t declare variables at the top. Declare them where you use them. Declaring at the top was an old requirement in C. In modern languages like Java, it’s recommended to declare variables in the smallest block where they are used, and as close as possible where they are used.

Splitting by spaces and checking the length of the resulting array is an awkward way of checking if a string contains spaces. Use s.indexOf(' ') > -1 instead.

Leave a Reply

Your email address will not be published. Required fields are marked *