Problem
I’ve written some code that reads in text files (each holding a fairly large vector of word frequencies) and stores each element of a vector in an ArrayList for the corresponding team (i.e. the Arsenal vector goes into the Arsenal ArrayList, the Chelsea vector into the Chelsea ArrayList, etc.). It then uses a cosine similarity function to determine the similarity between two documents and writes the result to a file.
What I would like is to make the code that reads in the text files (and stores them in their corresponding ArrayList) more efficient, rather than having to change the parameters of the while loop each time I need to use it.
public static Double cosineSimilarity(ArrayList<Integer> vectorOne, ArrayList<Integer> vectorTwo) {
    Double dotProduct = 0.0;
    Double normVecA = 0.0;
    Double normVecB = 0.0;
    for(int i = 0; i < vectorOne.size(); i++) {
        dotProduct += vectorOne.get(i) * vectorTwo.get(i);
        normVecA += Math.pow(vectorOne.get(i), 2);
        normVecB += Math.pow(vectorTwo.get(i), 2);
    }
    return dotProduct / (Math.sqrt(normVecA) * Math.sqrt(normVecB));
}

public static void main(String[] args) throws IOException {
    ArrayList<Integer> arsenal = new ArrayList<Integer>();
    ArrayList<Integer> chelsea = new ArrayList<Integer>();
    ArrayList<Integer> liverpool = new ArrayList<Integer>();
    ArrayList<Integer> manchesterCity = new ArrayList<Integer>();
    ArrayList<Integer> manchesterUnited = new ArrayList<Integer>();
    ArrayList<Integer> tottenham = new ArrayList<Integer>();
    Scanner textFile = new Scanner(new File("Enter textfile here"));
    while (textFile.hasNext()) {
        arsenal.add(textFile.nextInt());
    }
    Double output = cosineSimilarity(arsenal, chelsea);
    File fileName;
    FileWriter fw;
    // Create a new textfile for listOfWords
    fileName = new File("arsenalCosineSimilarities.txt");
    fw = new FileWriter(fileName, true);
    fw.write(String.valueOf("Chelsea: " + output + "\n"));
    fw.close();
}
Solution
For flexibility, your cosineSimilarity() method should take List<Integer> arguments instead of ArrayList<Integer> arguments. This method doesn’t care how the list is stored, only that it is a list that implements the .size() and .get(i) methods.
For efficiency, you should use double variables, not Double objects, in the method:
public static double cosineSimilarity(List<Integer> vectorOne,
                                      List<Integer> vectorTwo) {
    double dotProduct = 0.0;
    double normVecA = 0.0;
    double normVecB = 0.0;
    // ... loop and return statement as before
}
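Putting the two suggestions above together, the whole method might look like the following sketch. The class name CosineSketch, the length check, and the main method are assumptions added for illustration; they are not part of the original code.

```java
import java.util.List;

public class CosineSketch {

    // Cosine similarity of two equal-length integer vectors.
    public static double cosineSimilarity(List<Integer> vectorOne,
                                          List<Integer> vectorTwo) {
        // Assumption: mismatched lengths are treated as an error.
        if (vectorOne.size() != vectorTwo.size()) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dotProduct = 0.0;
        double normVecA = 0.0;
        double normVecB = 0.0;
        for (int i = 0; i < vectorOne.size(); i++) {
            // Unbox each element once; doing the arithmetic in double
            // also avoids int overflow for large frequencies.
            double a = vectorOne.get(i);
            double b = vectorTwo.get(i);
            dotProduct += a * b;
            normVecA += a * a;
            normVecB += b * b;
        }
        return dotProduct / (Math.sqrt(normVecA) * Math.sqrt(normVecB));
    }

    public static void main(String[] args) {
        // Parallel vectors: the result should be very close to 1.
        System.out.println(cosineSimilarity(List.of(1, 2, 3), List.of(2, 4, 6)));
    }
}
```

Because the parameters are declared as List<Integer>, the method accepts an ArrayList, a LinkedList, or an immutable List.of(...) without changes. Note that an all-zero vector makes the denominator zero, so the result is NaN in that case.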
When you open a Scanner, you should .close() it to prevent resource leaks. The “try-with-resources” construct automatically closes the resources that it opens. So instead of:
Scanner textFile = new Scanner(new File("Enter textfile here"));
use
try (Scanner textFile = new Scanner(new File("Enter textfile here"))) {
    // use textFile inside this block.
}
// textFile is automatically closed when the block is exited.
Ditto with FileWriter. Use “try-with-resources” to automatically close the writer.
try (FileWriter fw = new FileWriter(fileName, true)) {
    fw.write( ... );
}
// fw has been automatically closed at this point.
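If you are using .nextInt(), you should loop on .hasNextInt(), not .hasNext(). The .hasNext() method returns true for any remaining token, so .nextInt() would throw an InputMismatchException if that token is not an integer.
while (textFile.hasNextInt()) {
    arsenal.add(textFile.nextInt());
}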
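To see why the loop condition matters, here is a small sketch that scans a string instead of a file. The class name, method name, and sample input are made up for illustration. With a .hasNext() loop, the token oops would reach .nextInt() and cause an exception; with .hasNextInt(), the loop simply ends there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class HasNextIntSketch {

    // Reads the leading integers from the text and stops at the first
    // token that cannot be parsed as an int.
    public static List<Integer> readInts(String text) {
        List<Integer> values = new ArrayList<>();
        // Scanner(String) scans the string itself, not a file of that name.
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextInt()) {
                values.add(scanner.nextInt());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical input with a stray non-numeric token.
        System.out.println(readInts("3 1 4 oops 5"));
    }
}
```

Stopping silently at a bad token may hide a corrupt input file, so depending on your data you might prefer to check afterwards whether any tokens remain and report them.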
It sounds like you want a Map<String,List<Integer>> to store an ArrayList for each team.
List<String> team_names = List.of("Arsenal", "Chelsea", "Liverpool",
                                  "ManchesterCity", "ManchesterUnited", "Tottenham");
Map<String,List<Integer>> stats = new HashMap<>();
// Read in all team stats from (for example) "<TeamName>_stats.txt" files
for (String team_name : team_names) {
    List<Integer> team_stats = new ArrayList<>();
    try (Scanner textFile = new Scanner(new File(team_name + "_stats.txt"))) {
        while (textFile.hasNextInt()) {
            team_stats.add(textFile.nextInt());
        }
    }
    stats.put(team_name, team_stats);
}
Then you can use stats.get(team_name) to get each team’s stats for comparison / analysis:
for (String team1_name : team_names) {
    List<Integer> team1_stats = stats.get(team1_name);
    for (String team2_name : team_names) {
        // Skip comparing a team against itself
        if (team1_name.equals(team2_name))
            continue;
        List<Integer> team2_stats = stats.get(team2_name);
        double output = cosineSimilarity(team1_stats, team2_stats);
        // ... display "output", or write to file, or ...
    }
}
Depending on how you want your output written (one file for all comparisons, or one file per team holding its comparison with all other teams), you’ll want to open the FileWriter before the outer loop or before the inner loop, respectively.
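As a rough sketch of the one-file-per-team variant, the writer can be opened inside the outer loop, just before the inner loop. The class name, the writeComparisons method, the output file naming, and the sample vectors are all assumptions for illustration. This sketch also overwrites each file (second FileWriter argument false) rather than appending as the original code does.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerTeamOutputSketch {

    public static double cosineSimilarity(List<Integer> a, List<Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.size(); i++) {
            double x = a.get(i);
            double y = b.get(i);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Writes one file per team into outputDir; each file lists that
    // team's similarity to every other team, one "Name: value" per line.
    public static void writeComparisons(Map<String, List<Integer>> stats,
                                        Path outputDir) throws IOException {
        for (Map.Entry<String, List<Integer>> team1 : stats.entrySet()) {
            // Hypothetical naming scheme: <TeamName>CosineSimilarities.txt
            File out = outputDir.resolve(team1.getKey() + "CosineSimilarities.txt").toFile();
            // One writer per team, opened before the inner loop.
            try (FileWriter fw = new FileWriter(out, false)) {
                for (Map.Entry<String, List<Integer>> team2 : stats.entrySet()) {
                    // Skip comparing a team against itself
                    if (team1.getKey().equals(team2.getKey())) {
                        continue;
                    }
                    double output = cosineSimilarity(team1.getValue(), team2.getValue());
                    fw.write(team2.getKey() + ": " + output + System.lineSeparator());
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up vectors; LinkedHashMap keeps the insertion order stable.
        Map<String, List<Integer>> stats = new LinkedHashMap<>();
        stats.put("Arsenal", List.of(1, 2, 3));
        stats.put("Chelsea", List.of(3, 2, 1));
        stats.put("Liverpool", List.of(0, 1, 0));
        Path dir = Files.createTempDirectory("cosine");
        writeComparisons(stats, dir);
        System.out.println(Files.readString(dir.resolve("ArsenalCosineSimilarities.txt")));
    }
}
```

For a single combined file, the same try block would instead wrap the outer loop, and each line would need to name both teams.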