Find and copy pasting optimisation

Posted on

Problem

I am trying to apply this simple macro to merge two big sized sheets (around 30000 rows each), but the process is too slow and never ends. The macro works perfectly with smaller sheets. Could you give me any advice for the optimization of my code please ?

My Macro consists in an Userform :

Public listChoice As String

'Using your code to get the sheet names for the ListBox rowsource.

Private Sub UserForm_Activate()

    For n = 1 To ActiveWorkbook.Sheets.Count
        With SelectSheet
            .AddItem ActiveWorkbook.Sheets(n).Name
        End With
    Next n

End Sub

'Including an update event for the ListBox

Private Sub SelectSheet_AfterUpdate()

    listChoice = SelectSheet.Text

End Sub

'Including a test just to demonstrate that the result is still retained. You don't need this, it demonstrates the results on the screenshot.

Private Sub CommandButton1_Click()

    Dim sh1 As Worksheet
    Dim sh2 As Worksheet
    Dim sh3 As Worksheet
    Dim lc As String
    Dim letter2 As String
    Dim letter3 As String
    Dim mrgKeyRange1 As Range
    Dim mrgKeyRange2 As Range
    Dim cell As Range
    Dim lastC1 As Integer
    Dim lastC2 As Integer
    Dim lastC3 As Integer
    Dim lrow As Integer
    Dim currentR As Integer
    Dim key As Variant

    lc = listChoice

    'closing the UserForm
    Unload Me

    Set mrgKeyRange1 = Application.InputBox("Select the range by wich the rows of the current sheet will be merged with the sheet " & lc, Type:=8) 'type 8 serve a fargli pigliare un range
    Set mrgKeyRange2 = Application.InputBox("Select the corresponding range in " & lc, Type:=8)

    Application.ScreenUpdating = False
    Application.EnableEvents = False
    Application.Calculation = xlCalculationManual 'if there are many calculations, it helps to speed the macro up

    Set sh1 = ActiveSheet
    Set sh2 = ActiveWorkbook.Sheets(lc)
    Set sh3 = Sheets.Add

    'renaming the new sheet
    If Len(sh1.Name) < 26 Then 'the limit of a sheet's title is 31 chars
        sh3.Name = "Merged" & sh1.Name
    Else
        sh3.Name = "MergedSheet"
    End If

    'adding the headers to the new sheet
    sh1.Rows(1).Copy Destination:=sh3.Rows(1)
    lastC1 = LastColumn(sh1)
    lastC2 = LastColumn(sh2)            'LastCol() is defined in the module LastRowColumn
    lastC3 = LastColumn(sh3) + 1
    letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
    sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3)

    'formatting the headers of the new sheet (sh3)
    With sh3
        lastC3 = LastColumn(sh3)
        letter3 = NumberToLetter(lastC3)
        .Cells(1, 1).Copy
        .Range("B1:" & letter3 & "1").PasteSpecial Paste:=xlPasteFormats
        .Range("A1:" & letter3 & "1").Columns.AutoFit
    End With

    '>>CR note: the macro begins to be veeeeeery slow starting from here:

    'For each value in the Merging Key range, it finds the corresponding row in the other sheet
    For Each key In mrgKeyRange1
        If Trim(key) <> "" Then
            With mrgKeyRange2
                Set cell = .Find(What:=key, _
                            LookIn:=xlValues, _
                            LookAt:=xlWhole, _
                            SearchDirection:=xlNext, _
                            MatchCase:=False)
                If Not cell Is Nothing Then
                lrow = LastRow(sh3)
                key.EntireRow.Copy
                sh3.Rows(lrow + 1).PasteSpecial xlPasteValues
                currentR = cell.row
                sh2.Range("A" & currentR & ":" & letter2 & currentR).Copy
                sh3.Cells(lrow + 1, lastC1 + 1).PasteSpecial xlPasteValues
                End If
            End With
        End If
    Next

    Application.ScreenUpdating = True
    Application.EnableEvents = True
    Application.Calculation = xlCalculationAutomatic


End Sub

In order to call the userform, I use this very simple code, written in another module:

Sub MergeSheets()
    Merge_UserForm.Show vbModeless
End Sub

Here is an example of two sheets that I want to Merge by ID. Normally my sheets contain much more columns and the ID is variable (sometimes are numbers, sometimes strings,…)

enter image description here
enter image description here

Solution

@Mat’sMug does an excellent job with covering the OOP aspects of the code, so I’ll just dive straight into the performance issues.

1 – Just a tiny drop in the performance bucket, but you call lastC3 = LastColumn(sh3) twice in this section:

lastC3 = LastColumn(sh3) + 1
letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3)

'formatting the headers of the new sheet (sh3)
With sh3
    lastC3 = LastColumn(sh3)

You could omit the second call by simply calculating the new last column instead:

lastC3 = LastColumn(sh3) + 1
letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3)

'formatting the headers of the new sheet (sh3)
With sh3
    lastC3 = lastC3 + lastC2

However, if you glance up about 13 lines of code you’ll notice that LastColumn(sh3) will always be the same as lastC1 at that point because you’ve just added it on the line Set sh3 = Sheets.Add and its only contents are the sh1 headers. Best would be to always use the calculated values when you can.

2 – The NumberToLetter should never be used unless you’re displaying a human readable column address (and in that case you should let Excel do it for you). Computers work with numbers well, but strings, not so much. What happens with code like this…

letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module
sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3 + 1)

…is that you take your last column number (let’s say for the sake of argument that it’s 4 like your top data example). You make a function call to convert it to “D”, then concatenate it into a String with "A1:" & "D" & "1", and pass the value to sh.Range as “A1:D1”. And then… Excel parses the string. It determines that “A” refers to column 1, and “D” refers to column 4. You’re doing a bunch of work with the sole effect of making Excel do more work. Do Excel a solid and use the numeric interfaces:

With sh2
    .Range(.Cells(1, 1), .Cells(lastC2, 1)).Copy Destination:=sh3.Cells(1, lastC3 + 1)
End With

3 – Don’t use .Copy. At all. Ever. It actually does 2 copies internally – one to the clipboard and one to the destination. Then comes the really annoying UX part – it wipes out whatever the user already had in the clipboard. Ctrl-V didn’t do anything? Oh. Somebody used .Copy in a macro. Too bad I closed that file/navigated away from that page/deselected the 100 unordered items I laboriously Ctrl-clicked… You get the point. Instead, just assign the values directly:

sh3.Range(sh3.Cells(1, lastC3 + 1),sh3.Cells(1, lastC3 + lastC2 + 1)).Value = _
    sh2.Range(sh2.Cells(1, 1), sh2.Cells(lastC2, 1)).Value

It’s much faster, and doesn’t blow away the clipboard.

4 – While I have UX on the brain, you shouldn’t make the assumption that the user has Application.Calculation set to xlCalculationAutomatic. This is an application wide setting, and it holds its state after the your code completes. Instead, cache the current setting at the start of the code, do what you need to, and then change it back to the setting it was when you found it. Also, if you’re going to alter the state of the application, you should have error handling to make sure that it gets set back if you have an exception somewhere:

    On Error GoTo CleanExit
    Dim calcState As XlCalculation
    Dim eventState As Boolean

    calcState = Application.Calculation
    Application.Calculation = xlCalculationManual
    eventState = Application.EnableEvents
    Application.EnableEvents = False
    '...
CleanExit:
    Application.Calculation = calcState
    Application.EnableEvents = eventState
End Sub

5 – Only copy data that needs to be copied. .EntireRow is well named – it’s every single cell in the row. All 16,384 of them. You already know how wide the data is because you measure it here:

lastC1 = LastColumn(sh1)
lastC2 = LastColumn(sh2)
lastC3 = LastColumn(sh3)

Just figure out which sheet mrgKeyRange1 belongs to (hint, it’s mrgKeyRange1.Parent) and only copy the appropriate number of columns.

6 – Similar to #1, but with a much, much higher impact on your overall performance is the call lrow = LastRow(sh3) inside your main loop. This value will never increase by more than 1. You enter the loop with only header rows in sh3, and you’re adding rows to sh3 one at a time. It can be replaced with this…

lrow = 1
For Each key In mrgKeyRange1
    If Trim(key) <> "" Then
        With mrgKeyRange2
            '...
            If Not cell Is Nothing Then
                lrow = lrow + 1
                key.EntireRow.Copy
                sh3.Rows(lrow).PasteSpecial xlPasteValues

7 – Pausing for a brief nitpick, the If block beginning with If Not cell Is Nothing Then should be indented another level.

8 – Avoid round trips to the Worksheets, especially in the loops. You can pull all of the values from a range of cells as an array like this:

Dim allValues As Variant
allValues = ActiveSheet.Range("A1:D1000").Value
'allValues is now a 2D array containing all the cell values
'in "A1:D1000".

Reading individual values from a worksheet is an expensive operation. Reading values from an array is really, really cheap. This is where the rubber really hits the road. You’ll see an immediate performance gain by restructuring your main loop like this:

Dim searchItems As Variant
searchItems = mrgKeyRange1.Value

Dim searchRow As Long
Dim searchCol As Long
For searchRow = LBound(searchItems, 1) To UBound(searchItems, 1)
    For searchCol = LBound(searchItems, 2) To UBound(searchItems, 2)
        key = searchItems(searchRow, searchCol)
        '...
    Next
Next

9 – I don’t know enough about the data to recommend a specific alternative approach, but my guess is that the .Find call is another big source of burned CPU cycles. If you’re only checking one column in SheetX against one column in SheetY and at least one contains only unique values, it would probably be fastest to just sort both of them on the target columns and use “seeking” indexes to ladder your way down both sets of data at the same time. Pseudo-code:

Sort both lists.
Do until index1 hits the end of array1
    Do until array2(index2) = array1(index1) or index2 hits the end of array2
        Increment index2
    Loop
    If index2 is at the end of array2, Exit.
    Process the match.
    Do until array1(index1) = array2(index2) or index1 hits the end of array1
        Increment index1
    Loop
Loop

Otherwise it would probably be faster to read all of the mrgKeyRange2 values into a Dictionary and use its hash lookups instead of doing individual searches. Store the “key” as the key and the row number (or collection of row numbers if you have duplicates) as the value. Then you can just test to see if they’re in the Dictionary and retrieve the row number(s) as you iterate through mrgKeyRange1. There’s an example of something similar to this method here.

The first thing I expect to see in any code module – be it a standard module, a class module, a worksheet module, ThisWorkbook, or a UserForm‘s code-behind, is this:

Option Explicit

With this, your code wouldn’t compile… because you’re not declaring every variable that you’re using – in the UserForm_Activate handler, n is an implicit Variant that VBA allocates on the fly… and Variant should be avoided for routine tasks such as looping:

Private Sub UserForm_Activate()

    For n = 1 To ActiveWorkbook.Sheets.Count
        With SelectSheet
            .AddItem ActiveWorkbook.Sheets(n).Name
        End With
    Next n

End Sub

There are a few things to improve here. A For Each loop performs best when iterating an array; when iterating a collection of objects (such as the Sheets collection of the ActiveWorkbook), it’s best to use a For Each loop:

Dim sheet As Worksheet
For Each sheet In ActiveWorkbook.Worksheets
    SelectSheet.AddItem sheet.Name
Next

Notice this loop is iterating the Worksheets collection – the Sheets collection contains sheets that aren’t worksheets, such as chart sheets.

A new sheet reference is “captured” at each iteration, so you don’t need to access the Sheets or Worksheets collection every time – it’s simply given to you by the iteration mechanism; that’s why a For Each loop performs better with collections. Keep For...Next loops for iterating arrays.


Public listChoice As String

Given that a UserForm is really a class module with a designer and a default instance, it should be used as an object – and in object-oriented code, this listChoice module-level variable is a public field.

A public field makes the String value readable from the calling code. The problem is that is also makes the value writable from the calling code… which doesn’t always make sense and makes it easier to introduce bugs.

I like that you’ve abstracted away the ListBox, so the caller doesn’t need to know that the selection is coming from a specific control. A better and more object-oriented way to do exactly that is to expose a property:

Public Property Get SelectedSheetName() As String
    With SelectSheet
        If .ListIndex = -1 Then Exit Property 'nothing is selected
        SelectedSheetName = .List(.ListIndex)
    End With
End Property

Notice that this eliminates the need for a public field for the client code to access the selected sheet name.

Now, looking at the rest of the code, it seems the listChoice field / SelectedSheetName property can very well be Private, given it’s only used in the module it’s declared in: variables and members should always have the tightest possible scope – I like that all members are Private, but that Public field makes this code possible:

Merge_UserForm.listChoice = "potato"
Merge_UserForm.Show vbModeless

And then the user could click the CommandButton1 without making a valid selection in the SelectSheet listbox, and then this line would blow up:

Set sh2 = ActiveWorkbook.Sheets(lc)

Side note, what’s the lc local variable needed for? listChoice is already there, in scope, waiting to be used… but I’ll get back to this in a moment.

I think the form, CommandButton1 in particular, is responsible for way too many things. A UserForm is a view, a user interface: a UI exists because a program needs to collect user input.

I think these InputBox calls are a missed opportunity to have two RefEdit controls on that form, to collect Range selections from the user without ugly and annoying InputBox prompts – not to mention that Excel.Application.InputBox is a bit confusing when there’s also the standard VBA.Interaction.InputBox. And you’re not validating that the selected range is actually on the lc sheet, which can lead to some interesting bugs.

Why do you need that ListBox at all anyway then? Let the user select a range, and extract the sheet’s name from that range!

So back the form’s responsibilities: collecting user input. What are the actual inputs you need? mrgKeyRange1 and mrgKeyRange2? I’m not sure I completely understand exactly what’s happening (haven’t looked at the actual “do work” code yet), but it seems to me your UI could look something like this:

a simple form with two RefEdit controls and an Ok button

Then there would be some logic to validate the selected ranges and make sure the form can’t be OK’d without consistent input values (e.g. if the selected columns/rows need to line up, or if the two ranges must be on separate sheets, etc.) – the entire role and purpose of a UserForm is to collect and validate the user’s input.

So the only code that should be in a form’s code-behind, is simple code that deals with simple mechanics, e.g. making sure that the object instance doesn’t get destroyed when the user decides to X-out and cancel everything:

Private cancelled As Boolean

Public Property Get IsCancelled() As Boolean
    IsCancelled = cancelled
End Property

Public Property Get Selection1() As Range 'todo: rename
    On Error Resume Next
    Set Selection1 = Application.Range(RefEdit1.Value)
    On Error GoTo 0
End Property

Public Property Get Selection2() As Range 'todo: rename
    On Error Resume Next
    Set Selection2 = Application.Range(RefEdit2.Value)
    On Error GoTo 0
End Property

Private Sub OkButton_Click()
    cancelled = False
    Me.Hide
End Sub

Private Sub UserForm_QueryClose(Cancel As Integer, CloseMode As Integer)
    If CloseMode = VbQueryClose.vbFormControlMenu Then cancelled = True
    Cancel = True
    Me.Hide
End Sub

If the input ranges can be invalid and you would want to prevent the user from OK’ing the form with such invalid input, I would suggest to keep that responsibility outside the form still – by implementing it in dedicated class modules.

I don’t know what the rules are, and I don’t know how many there are – so I’ll go and add a new class module and call it IRangeValidationRule to define a simple interface:

Option Explicit

Public Function Validate(ByVal range1 As Range, ByVal range2 As Range) As Boolean
End Function

And then say one of the rules is that range1 and range2 must refer to distinct sheets – we would have another class module, DistinctSheetRangeValidationRule:

Option Explicit
Implements IRangeValidationRule

Private Function IRangeValidationRule_Validate(ByVal range1 As Range, ByVal range2 As Range) As Boolean

    On Error GoTo CleanFail

    Dim result As Boolean
    result = Not range1.Parent Is range2.Parent

CleanExit:
    IRangeValidationRule_Validate = result
    Exit Function

CleanFail:
    result = False
    Resume CleanExit
End Function

And so on for each validation rule you might have. Then the form could have this:

Private rules As New Collection

Public Sub AddValidationRule(ByVal rule As IRangeValidationRule)
    rules.Add rule
End Sub

And you could then determine whether or not the selected ranges are valid by simply iterating the rules:

Private Function IsValid() As Boolean
    Dim rule As IRangeValidationRule
    For Each rule In rules
        If Not rule.Validate(Selection1, Selection2) Then
            IsValid = False
            Exit Function
        End If
    Next
    IsValid = True
End Function

And then you could handle the two RefEdit controls’ AfterUpdate handlers to run the validation and disable the OkButton until the input is valid:

Private Sub RefEdit1_AfterUpdate()
    OkButton.Enabled = IsValid
End Sub

Private Sub RefEdit2_AfterUpdate()
    OkButton.Enabled = IsValid
End Sub

So, what would the calling code look like then? This couldn’t work:

Sub MergeSheets()
    Merge_UserForm.Show vbModeless
End Sub

First, it’s working off the default instance of the form, and then with the actual “do work” code gone, that wouldn’t be doing anything. We need to create an instance and work with that – but first let’s remove that underscore and keep the form’s name PascalCase; the underscore makes it look like some event handler or interface member implementation procedure. Then we’ll restrict user interactions to entering valid input or cancelling the form by using vbModal, so while the form is displayed, the user can’t interact with anything other than the form:

Public Sub MergeSheets()
    With New MergeUserForm
        .Show vbModal
        If Not .IsCancelled Then DoWork .Selection1, .Selection2 'todo: rename all these
    End With
End Sub

Notice the conceptual difference here: instead of fire-and-forget displaying a form and not knowing what’s happening afterwards, we can see that we’re displaying a form, collecting Selection1 and Selection2, allowing the user to cancel everything, and passing the inputs to some DoWork procedure that’s responsible for the actual work – the form itself doesn’t do much.

If we have implemented validation rules as I’ve shown above, we could have this:

Public Sub MergeSheets()
    With New MergeUserForm
        .AddValidationRule New DistinctSheetValidationRule
        .AddValidationRule New SomeOtherValidationRule
        .Show vbModal
        If Not .IsCancelled Then DoWork .Selection1, .Selection2 'todo: rename all these
    End With
End Sub

The biggest advantage with this approach, is that you’ve decoupled the actual work (and input validation rules) from the UI, and you can even write unit tests for them if you want – and because the DoWork procedure is taking the user input as parameters rather than prompting for it, you can write unit tests for it too!

I’ll end this answer here and let other reviewers address the actual do work procedure and its performance problems.

Leave a Reply

Your email address will not be published. Required fields are marked *