Problem
I have to do a pairwise compare of a couple of text files. They have the following format:
id_1|errorcode_1|m|data_1_1|data_1_2|...|data_1_m
id_2|errorcode_2|m|data_2_1|data_2_2|...|data_2_m
.
.
.
id_n|errorcode_n|m|data_n_1|data_n_2|...|data_n_m
The columns are ordered, but the rows can be in an arbitrary order. And there are always a couple of data-fields that have to be ignored.
This is my method that I call twice (switching the target and source array). It takes around 20 minutes to execute with two files where n = 16000 and m = 100.
private static string[] GetDifferences( IEnumerable<string> source
, IEnumerable<string> target
, IEnumerable<int> ignoredIndices)
{
var uniqueInSource = new List<string>();
foreach (var sourceLine in source)
{
var found = false;
var sourceSplit = sourceLine.Split('|').ToList();
// filter out malformed lines
if (sourceSplit.Count > 2 && !int.TryParse(sourceSplit[2], out _))
{
continue;
}
foreach (var index in ignoredIndices)
{
sourceSplit.RemoveAt(index + 3);
}
foreach (var targetLine in source)
{
var targetSplit = targetLine.Split('|').ToList();
if (targetSplit.Count > 2 && !int.TryParse(targetSplit[2], out _))
{
continue;
}
foreach (var index in ignoredIndices)
{
targetSplit.RemoveAt(index + 3);
}
if (targetSplit.SequenceEqual(sourceSplit))
{
found = true;
break;
}
}
if (!found)
{
uniqueInSource.Add(string.Join("|", sourceLine));
}
}
return uniqueInSource.ToArray();
}
I feel that my approach might be too naive since I think it’s somewhere around O(m*n2).
Solution
Your method is currently doing three things at a time and repeating the same logic multiple times:
- parsing logs-1 (parsing a line)
- parsing logs-2 (parsing a line again with the exact same code)
- comparing both logs
It would be much easier to optimize and use it if you refactored it into specialized structures/methods.
You put the first two cases into the static factory method Parse
. It’ll encapsulate the parsing part so you have implemented it only once.
Then you need to encapsulate the equality. You do this by implementing the IEquatable<LogLine>
interface. You should do this very carefuly because both the Equals
and GetHashCode
methods are important for the performance.
This new class can now be very easily tested for: equality, parsing and performance.
public class LogLine : IEquatable<LogLine>
{
public string Errorcode { get; set; }
public string[] Data { get; set; }
// add other properties...
public static LogLine Parse(string log, IEnumerable<int> ignoreColumns)
{
// implement parsing
}
public bool Equals(LogLine other)
{
// implement log equality
}
public override bool Equals(object obj)
{
return base.Equals(obj as LogLine);
}
public override int GetHashCode()
{
// implement log hash-code
}
}
And that’s it because you now have a very LINQ-friendly class that you can use like this where you can parse all lines with a simple Select
:
// dummies
var logs1 = Enumerable.Empty<string>();
var logs2 = Enumerable.Empty<string>();
var ignoreColumns = Enumerable.Empty<int>();
var logLines1 = logs1.Select(l => LogLine.Parse(l, ignoreColumns));
var logLines2 = logs2.Select(l => LogLine.Parse(l, ignoreColumns));
And you can compare them however you like with Except
or Intersect
or use GroupBy
or Distinct
or whatever you want. If you use more then a single extension should call .ToList()
after each parsing so that you don’t parse them multiple times.