Problem
I am trying to apply this simple macro to merge two big sized sheets (around 30000 rows each), but the process is too slow and never ends. The macro works perfectly with smaller sheets. Could you give me any advice for the optimization of my code please ?
My Macro consists in an Userform :
Public listChoice As String
'Using your code to get the sheet names for the ListBox rowsource.
Private Sub UserForm_Activate()
For n = 1 To ActiveWorkbook.Sheets.Count
With SelectSheet
.AddItem ActiveWorkbook.Sheets(n).Name
End With
Next n
End Sub
'Including an update event for the ListBox
Private Sub SelectSheet_AfterUpdate()
listChoice = SelectSheet.Text
End Sub
'Including a test just to demonstrate that the result is still retained. You don't need this, it demonstrates the results on the screenshot.
Private Sub CommandButton1_Click()
Dim sh1 As Worksheet
Dim sh2 As Worksheet
Dim sh3 As Worksheet
Dim lc As String
Dim letter2 As String
Dim letter3 As String
Dim mrgKeyRange1 As Range
Dim mrgKeyRange2 As Range
Dim cell As Range
Dim lastC1 As Integer
Dim lastC2 As Integer
Dim lastC3 As Integer
Dim lrow As Integer
Dim currentR As Integer
Dim key As Variant
lc = listChoice
'closing the UserForm
Unload Me
Set mrgKeyRange1 = Application.InputBox("Select the range by wich the rows of the current sheet will be merged with the sheet " & lc, Type:=8) 'type 8 serve a fargli pigliare un range
Set mrgKeyRange2 = Application.InputBox("Select the corresponding range in " & lc, Type:=8)
Application.ScreenUpdating = False
Application.EnableEvents = False
Application.Calculation = xlCalculationManual 'if there are many calculations, it helps to speed the macro up
Set sh1 = ActiveSheet
Set sh2 = ActiveWorkbook.Sheets(lc)
Set sh3 = Sheets.Add
'renaming the new sheet
If Len(sh1.Name) < 26 Then 'the limit of a sheet's title is 31 chars
sh3.Name = "Merged" & sh1.Name
Else
sh3.Name = "MergedSheet"
End If
'adding the headers to the new sheet
sh1.Rows(1).Copy Destination:=sh3.Rows(1)
lastC1 = LastColumn(sh1)
lastC2 = LastColumn(sh2) 'LastCol() is defined in the module LastRowColumn
lastC3 = LastColumn(sh3) + 1
letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3)
'formatting the headers of the new sheet (sh3)
With sh3
lastC3 = LastColumn(sh3)
letter3 = NumberToLetter(lastC3)
.Cells(1, 1).Copy
.Range("B1:" & letter3 & "1").PasteSpecial Paste:=xlPasteFormats
.Range("A1:" & letter3 & "1").Columns.AutoFit
End With
'>>CR note: the macro begins to be veeeeeery slow starting from here:
'For each value in the Merging Key range, it finds the corresponding row in the other sheet
For Each key In mrgKeyRange1
If Trim(key) <> "" Then
With mrgKeyRange2
Set cell = .Find(What:=key, _
LookIn:=xlValues, _
LookAt:=xlWhole, _
SearchDirection:=xlNext, _
MatchCase:=False)
If Not cell Is Nothing Then
lrow = LastRow(sh3)
key.EntireRow.Copy
sh3.Rows(lrow + 1).PasteSpecial xlPasteValues
currentR = cell.row
sh2.Range("A" & currentR & ":" & letter2 & currentR).Copy
sh3.Cells(lrow + 1, lastC1 + 1).PasteSpecial xlPasteValues
End If
End With
End If
Next
Application.ScreenUpdating = True
Application.EnableEvents = True
Application.Calculation = xlCalculationAutomatic
End Sub
In order to call the userform, I use this very simple code, written in another module:
Sub MergeSheets()
Merge_UserForm.Show vbModeless
End Sub
Here is an example of two sheets that I want to Merge by ID. Normally my sheets contain much more columns and the ID is variable (sometimes are numbers, sometimes strings,…)
Solution
@Mat’sMug does an excellent job with covering the OOP aspects of the code, so I’ll just dive straight into the performance issues.
1 – Just a tiny drop in the performance bucket, but you call lastC3 = LastColumn(sh3)
twice in this section:
lastC3 = LastColumn(sh3) + 1
letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3)
'formatting the headers of the new sheet (sh3)
With sh3
lastC3 = LastColumn(sh3)
You could omit the second call by simply calculating the new last column instead:
lastC3 = LastColumn(sh3) + 1
letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3)
'formatting the headers of the new sheet (sh3)
With sh3
lastC3 = lastC3 + lastC2
However, if you glance up about 13 lines of code you’ll notice that LastColumn(sh3)
will always be the same as lastC1
at that point because you’ve just added it on the line Set sh3 = Sheets.Add
and its only contents are the sh1
headers. Best would be to always use the calculated values when you can.
2 – The NumberToLetter
should never be used unless you’re displaying a human readable column address (and in that case you should let Excel do it for you). Computers work with numbers well, but strings, not so much. What happens with code like this…
letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3 + 1)
…is that you take your last column number (let’s say for the sake of argument that it’s 4 like your top data example). You make a function call to convert it to “D”, then concatenate it into a String
with "A1:" & "D" & "1"
, and pass the value to sh.Range
as “A1:D1”. And then… Excel parses the string. It determines that “A” refers to column 1, and “D” refers to column 4. You’re doing a bunch of work with the sole effect of making Excel do more work. Do Excel a solid and use the numeric interfaces:
With sh2
.Range(.Cells(1, 1), .Cells(lastC2, 1)).Copy Destination:=sh3.Cells(1, lastC3 + 1)
End With
3 – Don’t use .Copy
. At all. Ever. It actually does 2 copies internally – one to the clipboard and one to the destination. Then comes the really annoying UX part – it wipes out whatever the user already had in the clipboard. Ctrl-V didn’t do anything? Oh. Somebody used .Copy
in a macro. Too bad I closed that file/navigated away from that page/deselected the 100 unordered items I laboriously Ctrl-clicked… You get the point. Instead, just assign the values directly:
sh3.Range(sh3.Cells(1, lastC3 + 1),sh3.Cells(1, lastC3 + lastC2 + 1)).Value = _
sh2.Range(sh2.Cells(1, 1), sh2.Cells(lastC2, 1)).Value
It’s much faster, and doesn’t blow away the clipboard.
4 – While I have UX on the brain, you shouldn’t make the assumption that the user has Application.Calculation
set to xlCalculationAutomatic
. This is an application wide setting, and it holds its state after the your code completes. Instead, cache the current setting at the start of the code, do what you need to, and then change it back to the setting it was when you found it. Also, if you’re going to alter the state of the application, you should have error handling to make sure that it gets set back if you have an exception somewhere:
On Error GoTo CleanExit
Dim calcState As XlCalculation
Dim eventState As Boolean
calcState = Application.Calculation
Application.Calculation = xlCalculationManual
eventState = Application.EnableEvents
Application.EnableEvents = False
'...
CleanExit:
Application.Calculation = calcState
Application.EnableEvents = eventState
End Sub
5 – Only copy data that needs to be copied. .EntireRow
is well named – it’s every single cell in the row. All 16,384 of them. You already know how wide the data is because you measure it here:
lastC1 = LastColumn(sh1)
lastC2 = LastColumn(sh2)
lastC3 = LastColumn(sh3)
Just figure out which sheet mrgKeyRange1
belongs to (hint, it’s mrgKeyRange1.Parent
) and only copy the appropriate number of columns.
6 – Similar to #1, but with a much, much higher impact on your overall performance is the call lrow = LastRow(sh3)
inside your main loop. This value will never increase by more than 1. You enter the loop with only header rows in sh3
, and you’re adding rows to sh3
one at a time. It can be replaced with this…
lrow = 1
For Each key In mrgKeyRange1
If Trim(key) <> "" Then
With mrgKeyRange2
'...
If Not cell Is Nothing Then
lrow = lrow + 1
key.EntireRow.Copy
sh3.Rows(lrow).PasteSpecial xlPasteValues
7 – Pausing for a brief nitpick, the If
block beginning with If Not cell Is Nothing Then
should be indented another level.
8 – Avoid round trips to the Worksheets, especially in the loops. You can pull all of the values from a range of cells as an array like this:
Dim allValues As Variant
allValues = ActiveSheet.Range("A1:D1000").Value
'allValues is now a 2D array containing all the cell values
'in "A1:D1000".
Reading individual values from a worksheet is an expensive operation. Reading values from an array is really, really cheap. This is where the rubber really hits the road. You’ll see an immediate performance gain by restructuring your main loop like this:
Dim searchItems As Variant
searchItems = mrgKeyRange1.Value
Dim searchRow As Long
Dim searchCol As Long
For searchRow = LBound(searchItems, 1) To UBound(searchItems, 1)
For searchCol = LBound(searchItems, 2) To UBound(searchItems, 2)
key = searchItems(searchRow, searchCol)
'...
Next
Next
9 – I don’t know enough about the data to recommend a specific alternative approach, but my guess is that the .Find
call is another big source of burned CPU cycles. If you’re only checking one column in SheetX against one column in SheetY and at least one contains only unique values, it would probably be fastest to just sort both of them on the target columns and use “seeking” indexes to ladder your way down both sets of data at the same time. Pseudo-code:
Sort both lists.
Do until index1 hits the end of array1
Do until array2(index2) = array1(index1) or index2 hits the end of array2
Increment index2
Loop
If index2 is at the end of array2, Exit.
Process the match.
Do until array1(index1) = array2(index2) or index1 hits the end of array1
Increment index1
Loop
Loop
Otherwise it would probably be faster to read all of the mrgKeyRange2
values into a Dictionary
and use its hash lookups instead of doing individual searches. Store the “key” as the key and the row number (or collection of row numbers if you have duplicates) as the value. Then you can just test to see if they’re in the Dictionary
and retrieve the row number(s) as you iterate through mrgKeyRange1
. There’s an example of something similar to this method here.
The first thing I expect to see in any code module – be it a standard module, a class module, a worksheet module, ThisWorkbook
, or a UserForm
‘s code-behind, is this:
Option Explicit
With this, your code wouldn’t compile… because you’re not declaring every variable that you’re using – in the UserForm_Activate
handler, n
is an implicit Variant
that VBA allocates on the fly… and Variant should be avoided for routine tasks such as looping:
Private Sub UserForm_Activate()
For n = 1 To ActiveWorkbook.Sheets.Count
With SelectSheet
.AddItem ActiveWorkbook.Sheets(n).Name
End With
Next n
End Sub
There are a few things to improve here. A For Each
loop performs best when iterating an array; when iterating a collection of objects (such as the Sheets
collection of the ActiveWorkbook
), it’s best to use a For Each
loop:
Dim sheet As Worksheet
For Each sheet In ActiveWorkbook.Worksheets
SelectSheet.AddItem sheet.Name
Next
Notice this loop is iterating the Worksheets
collection – the Sheets
collection contains sheets that aren’t worksheets, such as chart sheets.
A new sheet
reference is “captured” at each iteration, so you don’t need to access the Sheets
or Worksheets
collection every time – it’s simply given to you by the iteration mechanism; that’s why a For Each
loop performs better with collections. Keep For...Next
loops for iterating arrays.
Public listChoice As String
Given that a UserForm
is really a class module with a designer and a default instance, it should be used as an object – and in object-oriented code, this listChoice
module-level variable is a public field.
A public field makes the String
value readable from the calling code. The problem is that is also makes the value writable from the calling code… which doesn’t always make sense and makes it easier to introduce bugs.
I like that you’ve abstracted away the ListBox
, so the caller doesn’t need to know that the selection is coming from a specific control. A better and more object-oriented way to do exactly that is to expose a property:
Public Property Get SelectedSheetName() As String
With SelectSheet
If .ListIndex = -1 Then Exit Property 'nothing is selected
SelectedSheetName = .List(.ListIndex)
End With
End Property
Notice that this eliminates the need for a public field for the client code to access the selected sheet name.
Now, looking at the rest of the code, it seems the listChoice
field / SelectedSheetName
property can very well be Private
, given it’s only used in the module it’s declared in: variables and members should always have the tightest possible scope – I like that all members are Private
, but that Public
field makes this code possible:
Merge_UserForm.listChoice = "potato"
Merge_UserForm.Show vbModeless
And then the user could click the CommandButton1
without making a valid selection in the SelectSheet
listbox, and then this line would blow up:
Set sh2 = ActiveWorkbook.Sheets(lc)
Side note, what’s the lc
local variable needed for? listChoice
is already there, in scope, waiting to be used… but I’ll get back to this in a moment.
I think the form, CommandButton1
in particular, is responsible for way too many things. A UserForm
is a view, a user interface: a UI exists because a program needs to collect user input.
I think these InputBox
calls are a missed opportunity to have two RefEdit
controls on that form, to collect Range
selections from the user without ugly and annoying InputBox
prompts – not to mention that Excel.Application.InputBox
is a bit confusing when there’s also the standard VBA.Interaction.InputBox
. And you’re not validating that the selected range is actually on the lc
sheet, which can lead to some interesting bugs.
Why do you need that ListBox at all anyway then? Let the user select a range, and extract the sheet’s name from that range!
So back the form’s responsibilities: collecting user input. What are the actual inputs you need? mrgKeyRange1
and mrgKeyRange2
? I’m not sure I completely understand exactly what’s happening (haven’t looked at the actual “do work” code yet), but it seems to me your UI could look something like this:
Then there would be some logic to validate the selected ranges and make sure the form can’t be OK’d without consistent input values (e.g. if the selected columns/rows need to line up, or if the two ranges must be on separate sheets, etc.) – the entire role and purpose of a UserForm
is to collect and validate the user’s input.
So the only code that should be in a form’s code-behind, is simple code that deals with simple mechanics, e.g. making sure that the object instance doesn’t get destroyed when the user decides to X-out and cancel everything:
Private cancelled As Boolean
Public Property Get IsCancelled() As Boolean
IsCancelled = cancelled
End Property
Public Property Get Selection1() As Range 'todo: rename
On Error Resume Next
Set Selection1 = Application.Range(RefEdit1.Value)
On Error GoTo 0
End Property
Public Property Get Selection2() As Range 'todo: rename
On Error Resume Next
Set Selection2 = Application.Range(RefEdit2.Value)
On Error GoTo 0
End Property
Private Sub OkButton_Click()
cancelled = False
Me.Hide
End Sub
Private Sub UserForm_QueryClose(Cancel As Integer, CloseMode As Integer)
If CloseMode = VbQueryClose.vbFormControlMenu Then cancelled = True
Cancel = True
Me.Hide
End Sub
If the input ranges can be invalid and you would want to prevent the user from OK’ing the form with such invalid input, I would suggest to keep that responsibility outside the form still – by implementing it in dedicated class modules.
I don’t know what the rules are, and I don’t know how many there are – so I’ll go and add a new class module and call it IRangeValidationRule
to define a simple interface:
Option Explicit
Public Function Validate(ByVal range1 As Range, ByVal range2 As Range) As Boolean
End Function
And then say one of the rules is that range1
and range2
must refer to distinct sheets – we would have another class module, DistinctSheetRangeValidationRule
:
Option Explicit
Implements IRangeValidationRule
Private Function IRangeValidationRule_Validate(ByVal range1 As Range, ByVal range2 As Range) As Boolean
On Error GoTo CleanFail
Dim result As Boolean
result = Not range1.Parent Is range2.Parent
CleanExit:
IRangeValidationRule_Validate = result
Exit Function
CleanFail:
result = False
Resume CleanExit
End Function
And so on for each validation rule you might have. Then the form could have this:
Private rules As New Collection
Public Sub AddValidationRule(ByVal rule As IRangeValidationRule)
rules.Add rule
End Sub
And you could then determine whether or not the selected ranges are valid by simply iterating the rules:
Private Function IsValid() As Boolean
Dim rule As IRangeValidationRule
For Each rule In rules
If Not rule.Validate(Selection1, Selection2) Then
IsValid = False
Exit Function
End If
Next
IsValid = True
End Function
And then you could handle the two RefEdit
controls’ AfterUpdate
handlers to run the validation and disable the OkButton
until the input is valid:
Private Sub RefEdit1_AfterUpdate()
OkButton.Enabled = IsValid
End Sub
Private Sub RefEdit2_AfterUpdate()
OkButton.Enabled = IsValid
End Sub
So, what would the calling code look like then? This couldn’t work:
Sub MergeSheets()
Merge_UserForm.Show vbModeless
End Sub
First, it’s working off the default instance of the form, and then with the actual “do work” code gone, that wouldn’t be doing anything. We need to create an instance and work with that – but first let’s remove that underscore and keep the form’s name PascalCase
; the underscore makes it look like some event handler or interface member implementation procedure. Then we’ll restrict user interactions to entering valid input or cancelling the form by using vbModal
, so while the form is displayed, the user can’t interact with anything other than the form:
Public Sub MergeSheets()
With New MergeUserForm
.Show vbModal
If Not .IsCancelled Then DoWork .Selection1, .Selection2 'todo: rename all these
End With
End Sub
Notice the conceptual difference here: instead of fire-and-forget displaying a form and not knowing what’s happening afterwards, we can see that we’re displaying a form, collecting Selection1
and Selection2
, allowing the user to cancel everything, and passing the inputs to some DoWork
procedure that’s responsible for the actual work – the form itself doesn’t do much.
If we have implemented validation rules as I’ve shown above, we could have this:
Public Sub MergeSheets()
With New MergeUserForm
.AddValidationRule New DistinctSheetValidationRule
.AddValidationRule New SomeOtherValidationRule
.Show vbModal
If Not .IsCancelled Then DoWork .Selection1, .Selection2 'todo: rename all these
End With
End Sub
The biggest advantage with this approach, is that you’ve decoupled the actual work (and input validation rules) from the UI, and you can even write unit tests for them if you want – and because the DoWork
procedure is taking the user input as parameters rather than prompting for it, you can write unit tests for it too!
I’ll end this answer here and let other reviewers address the actual do work procedure and its performance problems.