# Determining the similarity between two documents

Problem

I’ve written some code that reads in text files, each holding a fairly large vector of word frequencies, and stores every element of a vector in the `ArrayList` for its team (i.e. the `Arsenal` vector goes to the `Arsenal` `ArrayList`, the `Chelsea` vector to the `Chelsea` `ArrayList`, etc.). It then uses a cosine similarity function to determine the similarity between two documents and writes the result to a file.

What I would like is to make the code that reads in the text files (and stores them in their corresponding `ArrayList`) more efficient, rather than having to change the parameters of the while loop each time I need to use it.

``````
public static Double cosineSimilarity(ArrayList<Integer> vectorOne, ArrayList<Integer> vectorTwo) {

    Double dotProduct = 0.0;
    Double normVecA = 0.0;
    Double normVecB = 0.0;
    for (int i = 0; i < vectorOne.size(); i++) {
        dotProduct += vectorOne.get(i) * vectorTwo.get(i);
        normVecA += Math.pow(vectorOne.get(i), 2);
        normVecB += Math.pow(vectorTwo.get(i), 2);
    }

    return dotProduct / (Math.sqrt(normVecA) * Math.sqrt(normVecB));

}

public static void main(String[] args) throws IOException {

    ArrayList<Integer> arsenal = new ArrayList<Integer>();
    ArrayList<Integer> chelsea = new ArrayList<Integer>();
    ArrayList<Integer> liverpool = new ArrayList<Integer>();
    ArrayList<Integer> manchesterCity = new ArrayList<Integer>();
    ArrayList<Integer> manchesterUnited = new ArrayList<Integer>();
    ArrayList<Integer> tottenham = new ArrayList<Integer>();

    Scanner textFile = new Scanner(new File("Enter textfile  here"));

    while (textFile.hasNext()) {
    }

    Double output = cosineSimilarity(arsenal, chelsea);

    File fileName;
    FileWriter fw;

    // Create a new textfile for listOfWords
    fileName = new File("arsenalCosineSimilarities.txt");
    fw = new FileWriter(fileName, true);

    fw.write(String.valueOf("Chelsea: " + output + "\n"));

    fw.close();
}
``````

Solution

For flexibility, your `cosineSimilarity()` method should take `List<Integer>` arguments instead of `ArrayList<Integer>` arguments. The method doesn’t care how the list is stored, only that it is a list that provides the `.size()` and `.get(i)` methods.

For efficiency, you should use `double` variables, not `Double` objects, in the method; each update of a `Double` unboxes the old value and boxes the new one:

``````
public static double cosineSimilarity(List<Integer> vectorOne,
                                      List<Integer> vectorTwo) {
    double dotProduct = 0.0;
    double normVecA = 0.0;
    double normVecB = 0.0;
    // ... loop and return statement as before ...
}
``````
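Putting those two suggestions together, the whole method might look like the sketch below. The class name `CosineDemo` and the length check are additions for illustration and are not part of the original code.

```java
import java.util.List;

class CosineDemo {

    // Cosine similarity of two equal-length frequency vectors.
    static double cosineSimilarity(List<Integer> vectorOne, List<Integer> vectorTwo) {
        // Assumption: vectors of different lengths are treated as a caller error.
        if (vectorOne.size() != vectorTwo.size()) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dotProduct = 0.0;
        double normVecA = 0.0;
        double normVecB = 0.0;
        for (int i = 0; i < vectorOne.size(); i++) {
            double a = vectorOne.get(i); // unboxed once per element
            double b = vectorTwo.get(i);
            dotProduct += a * b;
            normVecA += a * a;
            normVecB += b * b;
        }
        // Note: the result is NaN if either vector contains only zeros.
        return dotProduct / (Math.sqrt(normVecA) * Math.sqrt(normVecB));
    }

    public static void main(String[] args) {
        // Parallel vectors score (approximately) 1, orthogonal vectors score 0.
        System.out.println(cosineSimilarity(List.of(1, 2, 3), List.of(2, 4, 6)));
        System.out.println(cosineSimilarity(List.of(1, 0), List.of(0, 1)));
    }
}
```

Because the parameters are plain `List<Integer>`, callers can pass an `ArrayList`, a `List.of(...)` literal, or any other list implementation.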

When you open a `Scanner`, you should `.close()` it, to prevent resource leaks. The “try-with-resources” construct will automatically close resources that it opens. So instead of:

``````
Scanner textFile = new Scanner(new File("Enter textfile  here"));
``````

use

``````
try (Scanner textFile = new Scanner(new File("Enter textfile  here"))) {
    // use textFile inside this block.
}
// textFile is automatically closed when the block is exited.
``````

Ditto with `FileWriter`. Use “try-with-resources” to automatically close the writer.

``````
try (FileWriter fw = new FileWriter(fileName, true)) {
    fw.write( ... );
}
// fw has been automatically closed at this point.
``````
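As a small end-to-end illustration of both cases, here is a sketch that writes a line with a `FileWriter` and reads it back with a `Scanner`, each inside its own try-with-resources block. The class `ResourceDemo`, its helper methods, and the temporary file are hypothetical names chosen for this example.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

class ResourceDemo {

    // Appends one line to the file; the writer is closed even if write() throws.
    static void appendLine(File file, String line) throws IOException {
        try (FileWriter fw = new FileWriter(file, true)) {
            fw.write(line + System.lineSeparator());
        }
    }

    // Returns the first line of the file (or "" if it is empty);
    // the scanner is closed when the block exits, including via return.
    static String firstLine(File file) throws IOException {
        try (Scanner scanner = new Scanner(file)) {
            return scanner.hasNextLine() ? scanner.nextLine() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary file stands in for the real output file.
        File temp = File.createTempFile("similarity", ".txt");
        temp.deleteOnExit();
        appendLine(temp, "Chelsea: 0.5");
        System.out.println(firstLine(temp));
    }
}
```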

If you are using `.nextInt()`, you should loop on `.hasNextInt()`, not `.hasNext()`.

``````
while (textFile.hasNextInt()) {
    // read the next value with textFile.nextInt()
}
``````
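To see why the matching check matters, consider input that contains a token that is not an integer. The sketch below (a hypothetical `ScannerDemo` class that scans a `String` instead of a file, purely to stay self-contained) stops cleanly at the stray token; a loop on `.hasNext()` would instead reach `nextInt()` and throw an `InputMismatchException`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDemo {

    // Reads whitespace-separated integers until the first token that is
    // not an integer, or until the end of the input.
    static List<Integer> readInts(String text) {
        List<Integer> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextInt()) {
                values.add(scanner.nextInt());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(readInts("3 1 4 oops 5")); // prints [3, 1, 4]
    }
}
```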

It sounds like you want a `Map<String,List<Integer>>` to store an `ArrayList` for each team.

``````
List<String> team_names = List.of("Arsenal", "Chelsea", "Liverpool",
        "ManchesterCity", "ManchesterUnited", "Tottenham");

Map<String,List<Integer>> stats = new HashMap<>();

// Read in all team stats from (for example) "<TeamName>_stats.txt" files

for (String team_name : team_names) {
    List<Integer> team_stats = new ArrayList<>();
    try (Scanner textFile = new Scanner(new File(team_name + "_stats.txt"))) {
        while (textFile.hasNextInt()) {
            team_stats.add(textFile.nextInt());
        }
    }
    stats.put(team_name, team_stats);
}
``````

Then you can use `stats.get(team_name)` to get each team’s stats for comparison or analysis:

``````
for (String team1_name : team_names) {
    List<Integer> team1_stats = stats.get(team1_name);

    for (String team2_name : team_names) {
        // Skip comparing a team against itself
        if (team1_name.equals(team2_name))
            continue;

        List<Integer> team2_stats = stats.get(team2_name);

        double output = cosineSimilarity(team1_stats, team2_stats);

        // ... display "output", or write to file, or ...
    }
}
``````

Depending on how you want your output written (one file for all comparisons, or one file per team comparing it with all the other teams), you’ll want to open the `FileWriter` before the outer loop or before the inner loop, respectively.
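As one possible arrangement, the sketch below opens the writer before the inner loop, so each team gets its own file. The class `PairwiseDemo`, the method `writeReports`, the `outputDir` parameter, the file-name pattern, and the sample vectors are all assumptions made for illustration. Unlike the original code, it overwrites each file instead of appending to it.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PairwiseDemo {

    static double cosineSimilarity(List<Integer> a, List<Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.size(); i++) {
            double x = a.get(i);
            double y = b.get(i);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Writes one "<Team>CosineSimilarities.txt" file per team into outputDir.
    static void writeReports(Map<String, List<Integer>> stats, File outputDir) throws IOException {
        for (Map.Entry<String, List<Integer>> first : stats.entrySet()) {
            File report = new File(outputDir, first.getKey() + "CosineSimilarities.txt");
            // Writer opened before the inner loop: one file per team.
            try (FileWriter fw = new FileWriter(report)) {
                for (Map.Entry<String, List<Integer>> second : stats.entrySet()) {
                    // Skip comparing a team against itself.
                    if (first.getKey().equals(second.getKey())) {
                        continue;
                    }
                    double similarity = cosineSimilarity(first.getValue(), second.getValue());
                    fw.write(second.getKey() + ": " + similarity + System.lineSeparator());
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample vectors; real ones would be read from the stats files.
        Map<String, List<Integer>> stats = new LinkedHashMap<>();
        stats.put("Arsenal", List.of(1, 0, 2));
        stats.put("Chelsea", List.of(1, 0, 2));
        stats.put("Liverpool", List.of(0, 3, 0));
        File outputDir = Files.createTempDirectory("similarities").toFile();
        writeReports(stats, outputDir);
        System.out.println("Reports written to " + outputDir);
    }
}
```

For a single combined file, the same try-with-resources block would instead wrap the outer loop, and each line would need to name both teams.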