Problem
I reviewed this question that in my opinion uses a not so pretty batch-save so I tried improve it and this is what I came up with.
In this more generic solution that allows to reuse it I wanted to implement the using
statement instead of manually disposing and recreating the context.
It is an extension that you can use on a collection of entities. It requires the size of the batch and a factory method for the context. The last parameter is for reporting progress.
Unlike the original solution this one works with double so it is able to report progress for collection that are smaller then 100 items.
Internally it uses two lambdas: one for checking if the batch should be saved and the other one for progress reporting. The context is wrapped with a using statement.
static class EnumerableExtensions
{
public static void BatchSave<T, TContext>(
this IList<T> items,
int batchSize,
Func<TContext> createContext,
IProgress<double> progress
)
where T : class
where TContext : DbContext
{
var count = 0;
var lastPercentage = 0d;
var nextBatch = new Func<bool>(() => ++count % batchSize == 0);
var reportProgress = new Action(() =>
{
var percentage = Math.Round((double)count / (double)items.Count * 100d, 1);
if (percentage > lastPercentage)
{
progress.Report(percentage);
lastPercentage = percentage;
}
});
var enumerator = items.GetEnumerator();
while (enumerator.MoveNext())
{
using (var context = createContext())
{
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;
do
{
//context.Set<T>().Add(enumerator.Current);
Console.WriteLine("Add " + enumerator.Current);
reportProgress();
} while (!nextBatch() && enumerator.MoveNext());
//context.SaveChanges();
Console.WriteLine("Save batch = " + count);
}
}
reportProgress();
}
}
class TestContext : DbContext { }
I needed to comment the two entity framework lines out so that it doesn’t crash with the TestContext
.
Example usage:
void Main()
{
// convert items to string to satisfy the class constraint
var items = Enumerable.Range(0, 23).Select(x => x.ToString()).ToArray();
var batchSize = 10;
var progress = new Progress<double>(percent => Console.WriteLine($"Progress {percent}"));
items.BatchSave(batchSize, () => new TestContext(), progress);
}
Output:
Add 0
Add 1
Add 2
Add 3
Add 4
Add 5
Add 6
Add 7
Add 8
Add 9
Save batch = 10
Add 10
Progress 4,3
Add 11
Add 12
Add 13
Add 14
Add 15
Add 16
Add 17
Add 18
Add 19
Save batch = 20
Add 20
Add 21
Add 22
Save batch = 23
Progress 13
Progress 17,4
Progress 21,7
Progress 26,1
Progress 30,4
Progress 34,8
Progress 39,1
Progress 43,5
Progress 47,8
Progress 52,2
Progress 56,5
Progress 60,9
Progress 65,2
Progress 69,6
Progress 73,9
Progress 78,3
Progress 8,7
Progress 82,6
Progress 91,3
Progress 95,7
Progress 100
Progress 87
Solution
var count = 0;
var lastPercentage = 0d;
var nextBatch = new Func<bool>(() => ++count % batchSize == 0);
It was a clever usage of context capture but I think it ended up playing against you.
Your method has way too much responsibility. You should had at least another method to resolve the batching problem. Because you mixed the two things(batch + db stuff) you ended up by having this:
var enumerator = items.GetEnumerator();
while (enumerator.MoveNext())
{
//...
do{
//...
} while (!nextBatch() && enumerator.MoveNext());
}
And the first thing that I thought about was: “Why the hell is he advancing the enumerator twice? Why couldn’t he use foreach
instead?” And than I realized, without even parsing the code too much: “Ah it must be because of the batching stuff…”.
It doesn’t really matter if I got that right or not, I had to spend time to think about it, asking this kind questions is very useful actually. I would strongly suggest you next time you see yourself using enumerator.MoveNext
to ask the same thing. Why aren’t you using foreach
?
Anyway, as I said you should separate those two concerns. A batch implementation would become like this:
public IEnumerable<IEnumerable<int>> Batch(int total, int batchSize){
for(int i = 0; i < total; i += batchSize){
var currentBatchSize = Math.Min(batchSize, total - i);
yield return Enumerable.Range(i * batchSize, currentBatchSize);
}
}
A similar thing could be said about reportProgress
. You ended up by creating abstraction over a simple calculation that is to calculate the report progress(with a simple mathematical formula). It didn’t really contribute in any positive way to the readability/maintainability of your code.
Putting that all into a new implementation would look like this:
public static IEnumerable<IEnumerable<int>> GetBatches(int total, int batchSize){
for(int i = 0; i < total; i += batchSize){
var currentBatchSize = Math.Min(batchSize, total - i);
yield return Enumerable.Range(i, currentBatchSize);
}
}
public static void BatchSave<T, TContext>(
this IList<T> items,
int batchSize,
Func<TContext> createContext,
IProgress<double> progress
)
where T : class
where TContext : DbContext
{
var batches = GetBatches(items.Count, batchSize).ToList();
int currentBatch = 0;
foreach(var batch in batches){
using (var context = createContext())
{
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;
foreach(var i in batch){
Console.WriteLine("Add " + items[i]);
}
//context.SaveChanges();
progress.Report(++currentBatch * 100.0 / batches.Count);
Console.WriteLine("Save batch = " + currentBatch);
}
}
progress.Report(100);
}
This alternative might yet lack something. For example, if you are really a neat freak, maybe you would try to change the return result of GetBatches
in someway to be more useful. One thing that could be useful is to know which batch is the current batch. Should the return be IEnumerable<IEnumerable<Tuple<int, int>>>
(where tuple can be another classes with two integers if you want…). Would it be more useful to turn it to GetBatches<T>
? with IEnumerable<IEnumerable<T>>
return? Well I will leave that for you to decide, at least I am providing the foundations for it.
One another thing that might be worth pointing is that there isn’t really a reason to createContext()
multiple times, you can create it just one time, and just call SaveChanges
per each batch. Or maybe there is an efficiency reason, that I am not aware of… but you have to prove it by measuring.
I don’t really know C# but this looks pretty nice.
The main part of the code is tightly coupled to a DbContext
.
As I saw a TestContext
,
I was a bit surprised that it wasn’t used more to verify the behavior of the BatchSave
method.
That is, instead of commenting out the lines //context.Set<T>().Add(enumerator.Current);
and //context.SaveChanges();
and replacing them with print statements for the sake of a demo,
it would be great if you could leave the implementation intact and write a TestContext
that would allow verification.
A minor point is that nextBatch
is declared too early,
far from the first place of use.
It would be better to move it closer,
right in front of the declaration of enumerator
.
Finally, a tiny thing, the casting of items.Count
to double
is redundant, as the cast on count
will already make the expression double
.