Problem
My application takes a .csv (comma separated values) file and serializes the data into a binary file. Then it converts the binary file back to a .csv file (identical to the original file).
For example, consider the following .csv file:
John, Appleseed, 435, 89.2 Debbie, Downer, 924, 70.3 Shirley, Temple, 0235, 93.7 Chatty, Cathy, 9237, 68.3 Goody, Two-Shoes, 8534, 99.6 Michael, Scott, 5432, 74.4 Rick, Deer, 4563, 57.2 Negative, Nancy, 3543, 78.3 East, West, 5343, 84.9 Old, Gregg, 9212, 49.2 Charles, Carmichael, 1495, 96.3 Bruce, Wayne, 9327, 97.7 Violet, Rose, 8326, 91.5
Firstly, I encapsulate student information in this simple Student
class.
Student.h
#ifndef Student_h
#define Student_h
#include <string>
#include <sstream>
class Student
{
private:
std::string firstName;
std::string lastName;
int id;
double gpa;
public:
Student(std::string studentInformation);
std::string getFirstName();
std::string getLastName();
std::string getName(); // returns first name and last name
int getId();
double getGPA();
double getSize();
};
#endif
Student.cpp
#include "Student.h"
using namespace std;
Student::Student(std::string studentInformation)
{
stringstream studentStream(studentInformation); // a stream of student information
studentStream >> firstName;
studentStream >> lastName;
studentStream >> id;
studentStream >> gpa;
}
std::string Student::getFirstName()
{
return firstName;
}
std::string Student::getLastName()
{
return lastName;
}
std::string Student::getName()
{
return firstName + " " + lastName;
}
int Student::getId()
{
return id;
}
double Student::getGPA()
{
return gpa;
}
double Student::getSize()
{
return sizeof(firstName) + sizeof(lastName) + sizeof(gpa) + sizeof(id);
}
I also create a very simple StudentList
class (contains a vector of students):
StudentList.h
#ifndef StudentList_h
#define StudentList_h
#include <string>
#include <vector>
#include <fstream> // file stream
#include <exception>
#include <iostream>
#include <algorithm>
#include "Student.h"
class StudentList
{
public:
StudentList(std::string filePath);
std::vector<Student> exposeStudentVector() const;
private:
std::vector<Student> students;
};
#endif
StudentList.cpp
#include "StudentList.h"
using namespace std;
StudentList::StudentList(string filePath)
{
ifstream studentFile(filePath);
string lineContents;
while (!studentFile.eof())
{
getline(studentFile, lineContents);
lineContents.erase(std::remove(lineContents.begin(), lineContents.end(), ','), lineContents.end());
Student s(lineContents);
students.push_back(s);
}
studentFile.close();
}
std::vector<Student> StudentList::exposeStudentVector() const
{
return students;
}
To actually do the data parsing, I call two methods on a StudentList
object. Firstly, I convert the text file into binary using saveBinary()
, and then I recover the text file using recoverText()
function. Notice that I hard coded the number of read entries in the recoverText()
. I tried using the .eof()
function, but it doesn’t seem to work with binary data file. Can someone suggest a better way? How can I increase the robustness of my code? Suppose the .csv file was massive for example. How can I make the processes more efficient?
Main:
#include <iostream>
#include "Student.h"
#include "StudentList.h"
using namespace std;
void saveBinary(vector<Student> sList); // saves student list to a binary file
void recoverText(); // recreates the original text file (by reading the binary file created by saveBinary())
int main(int argc, const char * argv[])
{
StudentList list("students.csv");
saveBinary(list.exposeStudentVector());
recoverText();
return 0;
}
void saveBinary(const vector<Student> sList)
{
try
{
ofstream datafile;
datafile.open("info.bin", ios::out | ios::binary | ios::trunc);
for (Student s : sList)
{
cout << "nSize of current student is: " << s.getSize() << "n";
datafile.write((char*)&s, sizeof(s));
}
datafile.close();
cout << "nSuccess! Data was saved to binary file.n";
}
catch (exception x)
{
cout << "nError! Could not save data to binary file.";
}
}
void recoverText()
{
try
{
Student* s = new Student("dummy student 22 44");
ifstream datafile;
datafile.open("info.bin", ios::in | ios::binary); // open the binary data
cout << "nSize of current student is: " << s->getSize() << "n";
ofstream recoveredText;
recoveredText.open("n_students.csv", ios::out | ios::app);
for (int counter = 0; counter < 13; counter++)
{
datafile.read((char*)s, sizeof(*s)); // read the binary file data
recoveredText << s->getFirstName() << ", ";
recoveredText << s->getLastName() << ", ";
recoveredText << s->getId() << ", ";
recoveredText << s->getGPA();
recoveredText << "n";
}
recoveredText.close();
cout << "nSuccess! The binary data was recovered.n";
//delete s;
}
catch (exception x)
{
cout << "nError! Could not recover binary data";
}
}
The recovered file looks identical to the original.
Solution
double Student::getSize()
{
return sizeof(firstName) + sizeof(lastName) + sizeof(gpa) + sizeof(id);
}
This is wrong for several reasons:
sizeof
evaluated to astd::size_t
, an unsigned integer type. Not a floating point.- Adding the sizes of a structure’s members is generally not sufficient to get the size of the structure. You need to consider padding.
- The size of a structure doesn’t include the this of things pointed to by its members, directly or indirectly. Here,
sizeof
on the string members gives you the size of your standard library’s implementation of thestd::string
structure, which is of very little use: in particular it is unrelated to the size of the actual string contents. It is a compile-time constant.
Just remove that function altogether. If you need the size of the struct, use sizeof(Student)
.
Student::Student(std::string studentInformation)
{
stringstream studentStream(studentInformation); // a stream of student information
studentStream >> firstName;
studentStream >> lastName;
studentStream >> id;
studentStream >> gpa;
}
Each and every line in that function can fail. You need to do more error checking.
std::vector<Student> exposeStudentVector() const;
This is not a good name. Just call it students()
and use another name for the member. (Some people like m_
prefixes, some a trailing _
for example.)
ifstream studentFile(filePath);
This can fail. You must check if it worked or not and find a way to convey that information to the caller – since you’re in a constructor your only two options are an exception and an out parameter (and this second one isn’t very appetizing since you’re left with an object in a strange state).
while (!studentFile.eof())
This is almost always wrong. See: Why is iostream::eof inside a loop condition considered wrong?.
studentFile.close();
The destructor of studentFile
will take care of that.
datafile.write((char*)&s, sizeof(s));
datafile.read((char*)s, sizeof(*s));
These don’t work at all. While this can work for plain data containers (C structs essentially, but even then this is not portable – see alignment and padding), they will not do the right thing for anything even slightly complicated (e.g. pointers). std::string
contains pointers inside it, you’re saving raw pointer values into an binary file, and then reading those back in. Those pointer values have no reason to be valid in another process. Even in the same process, if the pointers have been freed (objects pointed-to deleted for instance), you’re just pointing at random memory. On top of that, the constructors for the std::string
members aren’t run.
And you’re again failing to check that the operations succeeded for the read and write calls. (Plus an arbitrary limitation to exactly 13 entries.)
If you want to serialize structures, you should look into serialization libraries, it is in fact a non-trivial task to get right. Google Protobuf and boost::serialization come to mind. If you want to do it yourself, you’ll need to really think about how to save and restore every member in a safe way. An example for a string is to write out its length (in binary), then the contents. Another approach is just writing out the zero-terminated string contents, but that’s a bit more of a pain to deserialize and validate.
Make sure you test each part of your code independently, in new processes (i.e. doing a write/read in the same process will hide all kinds of pointer bugs that are actually undefined behavior and could crash in fancy fashions – or not unfortunately). Also test invalid inputs. Also test things like input file not being there at all, wrong permissions that prevent the output files from being generated, etc. Your program should behave correctly, or at least sanely, in these circumstances. Entering an infinite loop is not nice.
Make sure you check the return values of all functions that can fail. For starters: all I/O operations can fail, all memory allocations can fail, all parsing operations can fail, all system calls can fail.