Determining the similarity between two documents

Problem

I’ve written some code that reads in text files (each holding a fairly large vector of word frequencies) and stores each element of the vector in an ArrayList for the corresponding team (i.e. the Arsenal vector goes into the Arsenal ArrayList, the Chelsea vector into the Chelsea ArrayList, etc.). It then uses a cosine similarity function to determine the similarity between two documents and writes the result to a file.

What I would like is to make the code that reads in the text files and stores them in their corresponding ArrayLists more efficient, rather than having to change the parameters of the while loop each time I need to use it.

public static Double cosineSimilarity(ArrayList<Integer> vectorOne, ArrayList<Integer> vectorTwo) {

    Double dotProduct = 0.0;
    Double normVecA = 0.0;
    Double normVecB = 0.0;
    for(int i = 0; i < vectorOne.size(); i++) {
        dotProduct += vectorOne.get(i) * vectorTwo.get(i);
        normVecA += Math.pow(vectorOne.get(i), 2);
        normVecB += Math.pow(vectorTwo.get(i), 2);
    }

    return dotProduct / (Math.sqrt(normVecA) * Math.sqrt(normVecB));

}

public static void main(String[] args) throws IOException {

    ArrayList<Integer> arsenal = new ArrayList<Integer>();
    ArrayList<Integer> chelsea = new ArrayList<Integer>();
    ArrayList<Integer> liverpool = new ArrayList<Integer>();
    ArrayList<Integer> manchesterCity = new ArrayList<Integer>();
    ArrayList<Integer> manchesterUnited = new ArrayList<Integer>();
    ArrayList<Integer> tottenham = new ArrayList<Integer>();

    Scanner textFile = new Scanner(new File("Enter textfile  here"));

    while (textFile.hasNext()) {
        arsenal.add(textFile.nextInt());
   }

    Double output = cosineSimilarity(arsenal, chelsea);

    File fileName;
    FileWriter fw;

    // Create a new textfile for listOfWords
    fileName = new File("arsenalCosineSimilarities.txt");
    fw = new FileWriter(fileName, true);

    fw.write(String.valueOf("Chelsea: " + output + "\n"));

    fw.close();
}

Solution

For flexibility, your cosineSimilarity() method should take List<Integer> arguments instead of ArrayList<Integer> arguments. The method doesn’t care how the list is stored, only that it is a List, which provides the .size() and .get(i) methods.

For efficiency, you should use double variables, not Double objects, in the method. Every += on a Double unboxes the value and boxes the result again:

public static double cosineSimilarity(List<Integer> vectorOne,
                                      List<Integer> vectorTwo) {
    double dotProduct = 0.0;
    double normVecA = 0.0;
    double normVecB = 0.0;

    // ... loop and return statement as before ...
}
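
Putting the two suggestions above together, the whole method might look like the sketch below. The class name CosineSketch and the length check are my own additions for illustration, not part of your original code. Note that a vector of all zeros makes the denominator zero, so the method returns NaN in that case.

```java
import java.util.List;

class CosineSketch {

    // Cosine similarity of two equal-length word-frequency vectors.
    static double cosineSimilarity(List<Integer> vectorOne,
                                   List<Integer> vectorTwo) {
        // Assumption: vectors of different lengths are a caller error.
        if (vectorOne.size() != vectorTwo.size()) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }

        double dotProduct = 0.0;
        double normVecA = 0.0;
        double normVecB = 0.0;

        for (int i = 0; i < vectorOne.size(); i++) {
            // Unbox each element once and work with primitives from here on.
            double a = vectorOne.get(i);
            double b = vectorTwo.get(i);
            dotProduct += a * b;
            normVecA += a * a;
            normVecB += b * b;
        }

        return dotProduct / (Math.sqrt(normVecA) * Math.sqrt(normVecB));
    }

    public static void main(String[] args) {
        // Parallel vectors: the result should be very close to 1.0.
        System.out.println(cosineSimilarity(List.of(1, 2, 3), List.of(2, 4, 6)));
        // Orthogonal vectors: the dot product is zero.
        System.out.println(cosineSimilarity(List.of(1, 0), List.of(0, 1)));
    }
}
```

Because the parameters are declared as List<Integer>, the same method accepts an ArrayList, a LinkedList, or an immutable List.of(...) without any changes.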

When you open a Scanner, you should .close() it, to prevent resource leaks. The “try-with-resources” construct will automatically close resources that it opens. So instead of:

Scanner textFile = new Scanner(new File("Enter textfile  here"));

use

try(Scanner textFile = new Scanner(new File("Enter textfile  here"))) {
    // use textFile inside this block.
}
// textFile is automatically closed when the block is exited.

Ditto with FileWriter. Use “try-with-resources” to automatically close the writer.

try(FileWriter fw = new FileWriter(fileName, true)) {
    fw.write( ... );
}
// fw has been automatically closed at this point.

If you are using .nextInt(), you should loop on .hasNextInt(), not .hasNext().

while(textFile.hasNextInt()) { 
     arsenal.add(textFile.nextInt());
} 
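
To see why the loop condition matters, here is a small sketch that scans a String instead of a file. The class name ScannerLoopSketch, the helper readInts, and the sample input are made up for illustration. With .hasNextInt() the loop simply stops at the first token that is not an integer, whereas looping on .hasNext() would call .nextInt() on that token and throw an InputMismatchException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerLoopSketch {

    // Reads leading integers from the text, stopping at the first
    // token that is not an integer (or at the end of the input).
    static List<Integer> readInts(String text) {
        List<Integer> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextInt()) {
                values.add(scanner.nextInt());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // "abc" is not an integer, so reading stops before it.
        System.out.println(readInts("3 0 7 abc 9"));
    }
}
```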

It sounds like you want a Map<String,List<Integer>> to store an ArrayList for each team.

List<String> team_names = List.of("Arsenal", "Chelsea", "Liverpool",
        "ManchesterCity", "ManchesterUnited", "Tottenham");

Map<String,List<Integer>> stats = new HashMap<>();

// Read in all team stats from (for example) "<TeamName>_stats.txt" files

for(String team_name : team_names) {
   List<Integer> team_stats = new ArrayList<>();
   try (Scanner textFile = new Scanner(new File(team_name+"_stats.txt"))) {
      while(textFile.hasNextInt()) {
          team_stats.add(textFile.nextInt());
      }
   }
   stats.put(team_name, team_stats);
}

Then you can use stats.get(team_name) to get each team’s stats for comparison or analysis:

for(String team1_name : team_names) {
   List<Integer> team1_stats = stats.get(team1_name);

   for(String team2_name: team_names) {
      // Skip comparing a team against itself
      if (team1_name.equals(team2_name))
          continue;

      List<Integer> team2_stats = stats.get(team2_name);

      double output = cosineSimilarity(team1_stats, team2_stats);

      // ... display "output", or write to file, or ...
   }
}

Depending on how you want your output written (one file for all comparisons, or one file per team holding its comparisons with all the other teams), you’ll want to open the FileWriter before the outer loop or before the inner loop, respectively.
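
As a rough sketch of the one-file-per-team variant, the FileWriter is opened inside the outer loop but before the inner loop. The class name PerTeamOutputSketch, the method writePerTeamFiles, the output directory parameter, and the sample numbers are all assumptions for illustration. The file name pattern follows the "arsenalCosineSimilarities.txt" idea from your code. Unlike your original, this writer overwrites rather than appends.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerTeamOutputSketch {

    static double cosineSimilarity(List<Integer> vectorOne,
                                   List<Integer> vectorTwo) {
        double dotProduct = 0.0;
        double normVecA = 0.0;
        double normVecB = 0.0;
        for (int i = 0; i < vectorOne.size(); i++) {
            double a = vectorOne.get(i);
            double b = vectorTwo.get(i);
            dotProduct += a * b;
            normVecA += a * a;
            normVecB += b * b;
        }
        return dotProduct / (Math.sqrt(normVecA) * Math.sqrt(normVecB));
    }

    // Writes "<team>CosineSimilarities.txt" into outputDir for every team.
    static void writePerTeamFiles(Map<String, List<Integer>> stats,
                                  Path outputDir) throws IOException {
        for (String team1 : stats.keySet()) {
            Path outFile = outputDir.resolve(team1 + "CosineSimilarities.txt");

            // Opened before the inner loop: one file per team.
            try (FileWriter fw = new FileWriter(outFile.toFile())) {
                for (String team2 : stats.keySet()) {
                    // Skip comparing a team against itself
                    if (team1.equals(team2)) {
                        continue;
                    }
                    double output = cosineSimilarity(stats.get(team1),
                                                     stats.get(team2));
                    fw.write(team2 + ": " + output + "\n");
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up frequency vectors, only to exercise the method.
        Map<String, List<Integer>> stats = new LinkedHashMap<>();
        stats.put("Arsenal", List.of(3, 0, 1));
        stats.put("Chelsea", List.of(1, 2, 0));
        stats.put("Liverpool", List.of(0, 1, 4));

        Path outputDir = Files.createTempDirectory("cosine");
        writePerTeamFiles(stats, outputDir);
        System.out.println("Wrote files to " + outputDir);
    }
}
```

For a single file holding every comparison, move the try-with-resources statement so that it wraps the outer loop, and include team1 in each line that is written.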
