Problem
I have a very large data file of movie ratings that I am analyzing for work, and I want to do this in a clean and efficient manner. The ratings file contains, column by column:
userID, movieID, rating…
I have parsed the files, and I am now trying to compute the cosine similarity across all 100,000 ratings for each movie. To do that, I'm using a HashMap<String, Double> (as an ADT) to store the ratings of each movie. For each of the roughly 1,000 movies, I then compute the cosine similarity. This is what I have done so far; what do you think?
import java.util.*;
import java.io.*;

public class MovieRatingParser {

    static HashMap<String, Double> ratings = new HashMap<>();

    public void parseMovieFile() throws FileNotFoundException, IOException {
        // Create an ArrayList to store movies
        ArrayList<Movie> movies = new ArrayList<Movie>();
        try {
            // Create a buffered file reader for FileReader to read in movies.dat
            BufferedReader br = new BufferedReader(new FileReader("movies.dat"));
            String readFile = br.readLine();
            while (readFile != null) {
                // Use String split delimiter to load each movie one by one
                // File delimiter is "|", split with the regex "\\|"
                String[] tokenDelimiter = readFile.split("\\|");
                String movieID = tokenDelimiter[0];
                String movieTitle = tokenDelimiter[1];
                Movie movieToAdd = new Movie(movieID, movieTitle);
                movies.add(movieToAdd);
                readFile = br.readLine();
            }
            br.close();
        } catch (FileNotFoundException e) {
            System.out.println("file was not Found!");
        }
        System.out.println("==============================================");
    }

    public static void parseRatingFile() throws FileNotFoundException, IOException {
        try {
            BufferedReader br = new BufferedReader(new FileReader("ratings.dat"));
            String readFile = br.readLine();
            while (readFile != null) {
                String[] tokenDelimiter = readFile.split("\\|");
                String userID = tokenDelimiter[0];
                String movieID = tokenDelimiter[1];
                double rating = Double.parseDouble(tokenDelimiter[2]);
                ratings.put(movieID, rating);
                readFile = br.readLine();
            }
            br.close();
        } catch (FileNotFoundException e) {
            System.out.println("File was not Found!");
        }
    }

    public static double computeCosineSimilarity(HashMap<String, Double> movieA, HashMap<String, Double> movieB) {
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        parseRatingFile();
        for (int j = 0; j < ratings.size(); j++) {
            movieA.put(ratings.get(3), ratings.values());
        }
        for (int i = 0; i < movieA.size(); i++) {
            dotProduct += movieA[i] * movieB[i];
            normA += Math.pow(movieA[i], 2);
            normB += Math.pow(movieB[i], 2);
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
What can I do to improve the code? It looks very sloppy.
Solution
I’m not familiar with the algorithm you’ve implemented, so I can’t point to improvements there. But some things in the code can be improved.
Use informative error messages. For instance, instead of:
...
} catch (FileNotFoundException e) {
    System.out.println("file was not Found!");
}
...
consider something like:
...
} catch (FileNotFoundException e) {
    String detailedMessage =
        String.format("File [%s] was not found. Reason was [%s]!", "movies.dat", e.getMessage());
    // BTW "movies.dat" can be extracted into a constant.
    System.out.println(detailedMessage);
}
...
In the latter snippet, the error message includes detailed information about what really happened. Also note the square brackets [] that surround the variable data: such placeholders not only help you spot corner cases in the log (for example, when an empty input file name was specified by mistake) but also make grep (or any other text search) more effective.
Consider try-with-resources. That will reduce the amount of boilerplate code when dealing with readers, as shown in the sketch below.
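For instance, parseRatingFile() could look like this (a minimal sketch that keeps the static ratings map and the "ratings.dat" file name from your code):

public static void parseRatingFile() throws IOException {
    // The reader is closed automatically, even if an exception is thrown
    try (BufferedReader br = new BufferedReader(new FileReader("ratings.dat"))) {
        String line;
        while ((line = br.readLine()) != null) {
            String[] tokens = line.split("\\|");
            String movieID = tokens[1];
            double rating = Double.parseDouble(tokens[2]);
            ratings.put(movieID, rating);
        }
    } catch (FileNotFoundException e) {
        System.out.println(String.format(
            "File [%s] was not found. Reason was [%s]!", "ratings.dat", e.getMessage()));
    }
}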
Move parsing logic, e.g.:
...
String[] tokenDelimiter = readFile.split("\\|");
String userID = tokenDelimiter[0];
String movieID = tokenDelimiter[1];
double rating = Double.parseDouble(tokenDelimiter[2]);
...
into a separate helper method, as is already done for computeCosineSimilarity().
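One possible shape for such a helper (just a sketch; the Rating class here is hypothetical, not something from your code):

// Hypothetical value class holding one parsed line of ratings.dat
static class Rating {
    final String userID;
    final String movieID;
    final double rating;

    Rating(String userID, String movieID, double rating) {
        this.userID = userID;
        this.movieID = movieID;
        this.rating = rating;
    }
}

static Rating parseRatingLine(String line) {
    String[] tokens = line.split("\\|");
    return new Rating(tokens[0], tokens[1], Double.parseDouble(tokens[2]));
}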
After all the “little” improvements are done, you will see the code more clearly. Then you can concentrate on the algorithm (i.e. the pure logic), add checks for corner cases (such as an empty input file), use strict math for floating-point numbers, handle the encoding of the input files gracefully, improve overall processing speed for large files, etc.
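If it helps once you get to the algorithm: cosine similarity between two movies is usually computed over per-user rating vectors. Below is a sketch of the standard formula, assuming you restructure the data so that each movie maps user IDs to ratings (one HashMap<String, Double> per movie, rather than a single global map):

public static double computeCosineSimilarity(Map<String, Double> movieA, Map<String, Double> movieB) {
    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    // The dot product only involves users who rated both movies
    for (Map.Entry<String, Double> entry : movieA.entrySet()) {
        Double ratingB = movieB.get(entry.getKey());
        if (ratingB != null) {
            dotProduct += entry.getValue() * ratingB;
        }
        normA += entry.getValue() * entry.getValue();
    }
    for (double rating : movieB.values()) {
        normB += rating * rating;
    }
    // Guard against the empty-input corner case: avoid dividing by zero
    if (normA == 0.0 || normB == 0.0) {
        return 0.0;
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}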