Problem
This is an alternative parser based on the specifications from this question. Briefly stated, the input file is a text file which has at least 33 fields separated by semicolons.
If the fourth field begins with either a T
or an E
, the line is valid and a subset of it written to the output file. Specifically, fields as numbered from 0, should be output in this order: {0,2,3,4,5,6,10,9,11,7,32}, each separated by a comma. All other fields are discarded.
One of the other answers there suggested that one could use a Flex-based parser instead. My own efforts were not faster, but I’m hoping that someone can review this and show me how to extract more speed from this version.
lexer.l
%{
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <algorithm>
#include <experimental/iterator>
#include <iterator>
#undef YY_DECL
#define YY_DECL int FileLexer::yylex()
class FileLexer : public yyFlexLexer {
public:
FileLexer(std::istream& in, std::ostream& out) :
yyFlexLexer{&in, &out},
out{out}
{}
using FlexLexer::yylex;
/// the yylex function is automatically created by Flex.
virtual int yylex();
private:
/// pointer to the current value
std::vector<std::string> vec;
std::ostream& out;
unsigned fieldcount{0};
bool valid{true};
};
%}
%option warn nodefault batch noyywrap c++
%option yyclass="FileLexer"
FIELD [^;n]*
DELIM ;
%%
{DELIM} { }
n {
if (valid && fieldcount >= 33) {
std::copy(vec.begin(), vec.end(), std::experimental::make_ostream_joiner(out, ","));
out << 'n';
}
vec.clear();
fieldcount = 0;
valid = true;
return 1;
}
{FIELD} {
if (valid) {
switch (fieldcount++) {
case 0:
case 1:
case 4:
case 5:
case 6:
case 7:
case 9:
case 32:
vec.push_back(yytext);
break;
case 3:
if (yytext[0] == 'E' || yytext[0] == 'T') {
vec.push_back(yytext);
valid = true;
} else {
valid = false;
}
break;
case 10:
{
auto n{vec.size()};
vec.push_back(yytext);
std::iter_swap(vec.begin()+n, vec.begin()+n-2);
}
break;
case 11:
{
auto n{vec.size()};
vec.push_back(yytext);
std::iter_swap(vec.begin()+n, vec.begin()+n-1);
}
break;
}
}
}
%%
int main(int argc, char *argv[]) {
if (argc >= 3) {
std::ifstream in{argv[1]};
std::ofstream out{argv[2]};
FileLexer lexer{in, out};
while (lexer.yylex() != 0)
{}
}
}
Compile with:
flex -o parsefile.cpp lexer.l
g++ -O2 -std=gnu++17 parsefile.cpp -o parsefile
This works, but it is slow (2.165 s) on my machine, with the same million-line input file as mentioned in my answer to the other question.
I tried it a few different ways, but I was unable to get a version that was faster than the PHP code in the other question. The switch
statement logic is arguably a bit overly clever and stores only the needed fields in the desired order, but the speed was about the same as the straightforward implementation.
If it matters, I’m using gcc
version 10.1 and flex
2.6.4 on a 64-bit Linux machine.
Solution
I see a few small issues in the C++ code, that probably won’t give any large performance benefit. Flex is doing the heavy work of reading the input and parsing it, there’s not much you can do about that.
Iterator arithmetic
Instead of:
case 10:
{
auto n{vec.size()};
vec.push_back(yytext);
std::iter_swap(vec.begin() + n, vec.begin() + n - 2);
}
You can also do iterator arithmetic on the end iterator, thereby avoiding the need to get the size of the vector:
case 10:
vec.push_back(yytext);
std::iter_swap(vec.end() - 1, vec.end() - 3);
Don’t return 1
after reading a newline character
There is no need to return from yylex()
after reading a newline, just remove the return 1
statement. This avoids needing the while
-loop in main()
.
Use emplace_back()
instead of push_back()
This avoids having to create a temporary that is being copied into the vector.
Overview
There is an issue (Bug) here in that yytext
points at the beginning of the lexeme. But the lexeme is not null (‘