Reading a file of URLs, connecting to each one, searching for patterns and generating output in a multi-threaded environment

Problem

I’ve been implementing a little application that will read a text file that holds URL addresses (one address per line).

Example:

http://www.yahoo.com
http://www.cnn.com
http://news.google.com

Based on that file, I am supposed to read the parsed content of the main page for each URL and search for any text that includes a hashtag.

Once I find all the words that match the hashtag pattern, I am supposed to write, for each URL I connected to, an output text file with the results of the lookup.

External libraries being used: Apache Commons Validator (for URL stuff) and JSoup for HTML handling.

So far, this is what I’ve done:

An enum holding the supported patterns; for now I am only using the hashtag one:

public enum SearchPatternTypeEnum {

HASHTAG_PATTERN("#(\\S+)"),
TWITTER_ACCOUNT_PATTERN("@(\\S+)"),
PROPER_NAME_PATTERN("^[\\p{L} .'-]+$");

private String regexPattern;

private SearchPatternTypeEnum(String regexPattern) {
    this.regexPattern = regexPattern;
}

public String getRegexPattern() {
    return regexPattern;
}

public void setRegexPattern(String regexPattern) {
    this.regexPattern = regexPattern;
}

}

A class (singleton) to handle read/write operations on text files

public class TextFileManager {

private static class Loader {
    static TextFileManager INSTANCE = new TextFileManager();
}

private TextFileManager() {

}

public static TextFileManager getInstance() {
    return Loader.INSTANCE;
}

public List<String> readLinesAsList(String fullFileName) throws IOException {
    List<String> dataLines = new ArrayList<>();

    try (Stream<String> stream = Files.lines(Paths.get(fullFileName))) {
        dataLines = stream.collect(Collectors.toList());
    } catch (IOException e) {
        throw e;
    }

    return dataLines;
}

public void writeContentToFile(String folderName, String fileName, String content) throws IOException {
    File targetFile = new File(folderName, fileName);
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(targetFile))) {
        bufferedWriter.write(content);
    } catch (IOException e) {
        throw e;
    }
}

}

Since the application is supposed to support multi-threaded processing, I am creating a “task” that connects to the URL, fetches the HTML using JSoup, searches for the hashtag pattern (regex), and then saves all the matches in a String property of an object that holds other information about the process. I save it as a String because that content must be written to a file.

public class WebPageAnalyzerTask implements Callable<WebPageAnalysisTaskResult>{
private static final int CONNECTION_TIMEOUT_IN_MILLIS = 10000;
private String targetUrl;
private String searchPattern;
private int matchesFound = 0;

public WebPageAnalyzerTask(String targetUrl, String searchPattern) {
    this.targetUrl = targetUrl;
    this.searchPattern = searchPattern;
}

@Override
public WebPageAnalysisTaskResult call() throws Exception {
    WebPageAnalysisTaskResult resultObj = new WebPageAnalysisTaskResult();
    resultObj.setAnalyzedUrl(this.targetUrl);
    long startTime = System.nanoTime();
    String htmlContent = this.getHtmlContentFromUrl();
    String resultContent = this.getAnalysisResults(htmlContent);
    resultObj.setOutputContent(resultContent);
    long endTime = System.nanoTime();
    resultObj.setElapsedTime(startTime, endTime);
    resultObj.setSuccess(true);
    return resultObj;
}

private String getAnalysisResults(String htmlContent) {
    StringBuilder contentResult = new StringBuilder();
    Pattern pattern = Pattern.compile(this.searchPattern);
    Matcher matcher = pattern.matcher(htmlContent);

    while (matcher.find()) {
        matchesFound++;
        contentResult.append(matcher.group(1)).append(System.lineSeparator());
    }
    return contentResult.toString();
}

private String getHtmlContentFromUrl() throws IOException {
    Connection httpConnection = Jsoup.connect(this.targetUrl);
    httpConnection.timeout(CONNECTION_TIMEOUT_IN_MILLIS);
    Document htmlDocument = httpConnection.get();
    return htmlDocument.html();
}

}

And the class that will use the tasks is:

public class WebAnalyzer {
private static final int THREAD_POOL_SIZE = 3;
private static final String[] ALLOWED_URL_SCHEMES = new String[]{"http"};
private static final String RESULT_OUTPUT_FOLDER = "output";
private String inputPath;
private final TextFileManager txtFileMgr = TextFileManager.getInstance();

public WebAnalyzer(String inputPath) {
    this.inputPath = inputPath;
}

public void startAnalysis(SearchPatternTypeEnum patternType) {
    try {
        if (this.inputPath == null || this.inputPath.isEmpty()) {
            throw new InvalidFilePathException("The input file path provided is not valid.");
        }
        List<String> urlsToBeProcessed = txtFileMgr.readLinesAsList(this.inputPath);

        if (urlsToBeProcessed != null && urlsToBeProcessed.size() > 0) {
            List<Callable<WebPageAnalysisTaskResult>> pageAnalysisTasks = this.buildPageAnalysisTasksList(urlsToBeProcessed, patternType.getRegexPattern());
            ExecutorService executor = Executors.newFixedThreadPool(THREAD_POOL_SIZE);
            List<Future<WebPageAnalysisTaskResult>> results = executor.invokeAll(pageAnalysisTasks);
            executor.shutdown();
            writeOutputFiles(results);
        } else {
            throw new NoContentToProcessException("The input file provided does not contain relevant information for analysis.");
        }

    } catch (Exception e) {
        e.printStackTrace();
    }
}



private void writeOutputFiles(List<Future<WebPageAnalysisTaskResult>> results) throws InterruptedException, ExecutionException, IOException {
    String outputFolder = File.listRoots()[0].getPath() + File.separator + RESULT_OUTPUT_FOLDER + File.separator;
    new File(outputFolder).mkdir();

    for (Future<WebPageAnalysisTaskResult> result : results) {
        WebPageAnalysisTaskResult taskResult = result.get();
        if (taskResult != null) {
            String fileName = this.generateFileName();
            txtFileMgr.writeContentToFile(outputFolder, fileName, taskResult.getOutputContent());
        }
    }
}

private List<Callable<WebPageAnalysisTaskResult>> buildPageAnalysisTasksList(List<String> urlsToBeProcessed, String searchPattern) {
    List<Callable<WebPageAnalysisTaskResult>> tasks = new ArrayList<>();
    UrlValidator urlValidator = new UrlValidator(ALLOWED_URL_SCHEMES);

    urlsToBeProcessed.forEach(urlAddress -> {
        if (urlValidator.isValid(urlAddress)) {
            tasks.add(new WebPageAnalyzerTask(urlAddress, searchPattern));
        }
    });

    return tasks;
}

private String generateFileName() {
    return String.format("%s.txt", UUID.randomUUID().toString());

}

}

Main class that will perform the call:

   public static void main(String[] args) {
    WebAnalyzer webAnalyzer = new WebAnalyzer("data/siteList.txt");
    webAnalyzer.startAnalysis(SearchPatternTypeEnum.HASHTAG_PATTERN);
}

So, the way I am approaching this solution is to create a task for each URL being analyzed. Once the whole analysis is done for the URLs, I start writing the content (the results of the analysis) to a file for each URL.

I would like to know what improvements could be made performance-wise (considering the multi-threading requirement).

Any other suggestions to improve the implementation of this application are also welcome.

I’ll be grateful for any suggestion or feedback.

Solution

Some points in addition to ChrisWue’s good answer:

  • It makes no sense to expose a mutator (setRegexPattern) for the regexPattern of SearchPatternTypeEnum.

  • It also makes no sense to suffix a type name with its kind. You don’t use TextFileManagerClass or WebAnalyzerClass, so why SearchPatternTypeEnum?

  • readLinesAsList is useless. It does nothing aside from materializing a nicely lazy IO stream into memory, which is not a benefit. It doesn’t add exception handling, nor does it hide any complexity.

  • writeContentToFile should use the java.nio API instead of the java.io API like this:

    try (BufferedWriter bw = Files.newBufferedWriter(Paths.get(folderName, fileName)
                                                    , StandardCharsets.UTF_8
                                                    , StandardOpenOption.CREATE
                                                    , StandardOpenOption.WRITE)) {
        bw.write(content);
    }

    This way you get the benefit of cleaner and more differentiated exceptions as well as support for a multitude of different filesystems. (Note that specifying only WRITE would require the file to already exist; CREATE covers the usual case.)

  • In summary, these changes IMO make TextFileManager completely superfluous.

  • You’re conflating the responsibilities of WebPageAnalyzerTask and WebPageAnalysisTaskResult. It’s significantly cleaner to create a Result only when you actually have a result, instead of preemptively declaring a resultObj as the first thing in your method. Instead, I’d expose a static factory method on WebPageAnalysisTaskResult that takes the relevant information as parameters and constructs an immutable instance of Result. That way nobody will mess with what’s inside a Result.

    This also allows for properly useful error handling in case of exceptions. Currently the Callable just silently terminates without generating a nice proper error when an Exception occurs.

  • Generally I don’t follow how you deal (or rather do not deal) with exceptions. Instead of passing every exception you encounter up to the caller, you should limit the scope of the blocks where you expect exceptions, and only in extremely rare cases throw a checked exception as a result. This makes for a cleaner and less noisy API.

  • Disallowing "https", but allowing "http" as URL_SCHEME is bad and you should feel bad.
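To illustrate the static-factory point above, here is a rough sketch of what an immutable WebPageAnalysisTaskResult could look like. The field names are taken from the question’s code; the failure variant and the getError accessor are my own assumptions about how error reporting from the Callable could work:

```java
import java.util.Optional;

public final class WebPageAnalysisTaskResult {
    private final String analyzedUrl;
    private final String outputContent;
    private final long elapsedNanos;
    private final Exception error; // null on success

    private WebPageAnalysisTaskResult(String analyzedUrl, String outputContent,
                                      long elapsedNanos, Exception error) {
        this.analyzedUrl = analyzedUrl;
        this.outputContent = outputContent;
        this.elapsedNanos = elapsedNanos;
        this.error = error;
    }

    // Created only when the task actually produced a result.
    public static WebPageAnalysisTaskResult success(String url, String content, long elapsedNanos) {
        return new WebPageAnalysisTaskResult(url, content, elapsedNanos, null);
    }

    // Lets the Callable report a failure instead of silently terminating.
    public static WebPageAnalysisTaskResult failure(String url, Exception error) {
        return new WebPageAnalysisTaskResult(url, "", 0L, error);
    }

    public boolean isSuccess()            { return error == null; }
    public String getAnalyzedUrl()        { return analyzedUrl; }
    public String getOutputContent()      { return outputContent; }
    public long getElapsedNanos()         { return elapsedNanos; }
    public Optional<Exception> getError() { return Optional.ofNullable(error); }
}
```

With no setters, the task’s call() simply returns `WebPageAnalysisTaskResult.success(...)` at the end, or `failure(...)` from a catch block, and the writer loop can skip (or log) failed results.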

I don’t do a lot of Java, so here just a few things which caught my attention:

  1. You re-compile the regex pattern over and over again. It probably doesn’t make a lot of difference compared to the time spent downloading and parsing the web page, but it’s still not the most efficient use of CPU resources. You could simply compile the pattern once in SearchPatternTypeEnum and make getRegexPattern() return the compiled pattern.

  2. I don’t see any reason why TextFileManager would need to be a singleton. Singletons are one of the most abused “design patterns”. They have their uses occasionally but this isn’t one of them as far as I can see.

  3. I don’t quite see the point of simply re-throwing the caught IOExceptions in TextFileManager. Try-with-resources doesn’t require a catch block as far as I know and since you’re not doing anything useful in there you might as well omit it.

  4. WebAnalyzer is given the inputPath, but it’s only sanity-checked when you call startAnalysis. That check should happen right in the constructor. You should probably also check that the file exists and present a slightly nicer error to the user if it doesn’t.
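As a sketch of point 1, the enum could compile each regex once at class-load time and expose the compiled Pattern. The patterns are taken from the question; the matcher convenience method and the shorter name SearchPattern (following the naming remark in the previous answer) are my own suggestions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum SearchPattern {
    HASHTAG("#(\\S+)"),
    TWITTER_ACCOUNT("@(\\S+)"),
    PROPER_NAME("^[\\p{L} .'-]+$");

    private final Pattern compiled; // compiled exactly once, when the enum is loaded

    SearchPattern(String regex) {
        this.compiled = Pattern.compile(regex);
    }

    public Pattern getPattern() {
        return compiled;
    }

    // Convenience so callers never touch Pattern.compile themselves.
    public Matcher matcher(CharSequence content) {
        return compiled.matcher(content);
    }
}
```

The task would then call `patternType.matcher(htmlContent)` instead of compiling the pattern string in every call().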
