Problem
This is the original question
https://www.geeksforgeeks.org/union-and-intersection-of-two-sorted-arrays-2/
Given two sorted arrays, find their union and intersection.
Example: Input : arr1[] = {1, 3, 4, 5, 7} arr2[] = {2, 3, 5, 6} Output : Union : {1, 2, 3, 4, 5, 6, 7} Intersection : {3, 5} Input : arr1[] = {2, 5, 6} arr2[] = {4, 6, 8, 10} Output : Union : {2, 4, 5, 6, 8, 10} Intersection : {6}
I also added one more case of finding items which are only in one of the two arrays and called it Diff.
Please review for performance.
Please do not comment about code in the same class as the test and the functions not being static. It is just faster for me like this to get to the point of the exercise.
using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace ArrayQuestions
{
/// <summary>
/// https://www.geeksforgeeks.org/union-and-intersection-of-two-sorted-arrays-2/
/// </summary>
[TestClass]
public class UnionAndIntersectionOfTwoSortedArrays2
{
[TestMethod]
public void UnionTest()
{
int[] arr1 = { 1, 3, 4, 5, 7 };
int[] arr2 = { 2, 3, 5, 6 };
int[] union = { 1, 2, 3, 4, 5, 6, 7 };
CollectionAssert.AreEqual(union, Union(arr1, arr2));
}
[TestMethod]
public void IntersectionTest()
{
int[] arr1 = { 1, 3, 4, 5, 7 };
int[] arr2 = { 2, 3, 5, 6 };
int[] intersection = { 3, 5 };
CollectionAssert.AreEqual(intersection, Intersection(arr1, arr2));
}
[TestMethod]
public void DiffTest()
{
int[] arr1 = { 1, 3, 4, 5, 7 };
int[] arr2 = { 2, 3, 5, 6 };
int[] diff = { 1, 2, 4, 6, 7 };
CollectionAssert.AreEqual(diff, Diff(arr1, arr2));
}
private int[] Diff(int[] arr1, int[] arr2)
{
int i = 0;
int j = 0;
int n = arr1.Length;
int m = arr2.Length;
List<int> list = new List<int>();
while (i < n && j < m)
{
if (arr1[i] == arr2[j])
{
i++;
j++;
}
else if (arr1[i] < arr2[j])
{
list.Add(arr1[i]);
i++;
}
else
{
list.Add(arr2[j]);
j++;
}
}
while (i < n)
{
list.Add(arr1[i]);
i++;
}
while (j < m)
{
list.Add(arr2[j]);
j++;
}
return list.ToArray();
}
private int[] Intersection(int[] arr1, int[] arr2)
{
int i = 0;
int j = 0;
int n = arr1.Length;
int m = arr2.Length;
List<int> list = new List<int>();
while (i < n && j < m)
{
if (arr1[i] == arr2[j])
{
list.Add(arr1[i]);
i++;
j++;
}
else if (arr1[i] < arr2[j])
{
i++;
}
else
{
j++;
}
}
return list.ToArray();
}
public int[] Union(int[] arr1, int[] arr2)
{
int i = 0;
int j = 0;
int n = arr1.Length;
int m = arr2.Length;
List<int> list = new List<int>();
while (i < n && j < m)
{
if (arr1[i] < arr2[j])
{
list.Add(arr1[i]);
i++;
}
else if (arr2[j] < arr1[i])
{
list.Add(arr2[j]);
j++;
}
else // equals
{
list.Add(arr1[i]);
i++;
j++;
}
}
//handle the rest
for (; i < n; i++)
{
list.Add(arr1[i]);
}
for (; j < m; j++)
{
list.Add(arr2[j]);
}
return list.ToArray();
}
}
}
Solution
- You can optimize all 3 methods if you initialize
list
‘s capacity to the longest of the two arrays. Resizing a list involves allocating a new internal array and copying old items into the new array, which is something to keep in mind if you care about performance. - Your tests often contain small pieces of manually crafted data. That’s not a very scalable approach. It’s easy to generate large amounts of test data with a bit of Linq:
Enumerable.Range(0, 1000).Select(i => random.Next(0, 1000)).OrderBy(n => n).ToArray()
gives you an input array. Verification can be done with a few loops (is every number inarr1
andarr2
present inunion
, and is every number inunion
present in eitherarr1
orarr2
?). For the duplicate-handling implementation verification is even easier with a combination of Linq’sUnion
,Intersect
andOrderBy
methods. - Names like
i
,j
,n
andm
are not very descriptive – specifically which array they’re associated with is not clear from their names. Changing them to something likeindex1
,index2
,length1
andlength2
will make that relationship clear. - If you’re storing array lengths in local variables, then why not also store the results of
arr1[i]
andarr2[j]
in local variables before comparing them? Both of these are micro-optimizations that affect the readability of the code, so I would only do this for performance-sensitive code, and only when profiling shows that it’s actually an improvement.