Minimum number of copies of an application a company needs to purchase

Posted on

Problem

Some applications from vendors are allowed to be installed on multiple computers per user with specific
restrictions. In our scenario, each copy of the application (ID 374) allows the user to install the
application on to two computers if at least one of them is a laptop. Given the provided data, create a C#
utility that would calculate the minimum number of copies of the application the company must
purchase.

Examples

Example 1

Given the following scenario

Table 1

Only one copy of the application is required as the user has installed it on two computers, with one of
them being a laptop.

Example 2

In the following scenario

Table 2

Three copies of the application are required as UserID 2 has installed the application on two computers,
but neither of them is a laptop and thus both computers require a purchase of the application.

Example 3

Occasionally the data may contain duplicate records, in the following scenario

Table 3

Only two copies of the application are required as the data from the second and third rows are
effectively duplicates even though the ComputerType is lower case and the comment is different.

Expectations

Please provide a C# solution that calculates the minimum of copies of the application with ID 374 a
company must purchase.


My C# Solution

Program.cs

using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;

namespace minimum_app_copies_to_purchase
{
    public class Installation
    {
        public int ComputerID {get;set;}
        public int UserID {get;set;}
        public int ApplicationID {get;set;}
        public string ComputerType {get;set;}
        public string Comment {get;set;}
    }
    class Program
    {
        static void Main(string[] args)
        {
            string[] csv = File.ReadAllLines(args[0]);
            

        List<Installation> rows = ParseCSV(csv);

        IEnumerable<Installation> applicationId374Query =
            from row in rows
            where row.ApplicationID == 374
            orderby row.UserID
            select row;

        IEnumerable<IGrouping<int, Installation>> groupUsersQuery =
            from installation in applicationId374Query
            group installation by installation.UserID;

        Int64 copiesNeeded = 0;

        foreach (IGrouping<int, Installation> group in groupUsersQuery)
        {
            if (group.Count() == 1)
            {
                copiesNeeded += 1;
            } else if (group.Count() == 2)
            {
                if (group.ContainsComputerType("LAPTOP")
                 && group.ContainsComputerType("DESKTOP"))
                 {
                     copiesNeeded += 1;
                 } else if (group.ContainsComputerType("LAPTOP")
                    && group.ContainsComputerType("LAPTOP"))
                    {
                        copiesNeeded += 1;
                    } else {
                        copiesNeeded += 2;
                    }
            }
        }

        Console.WriteLine("Total number of copies of application id 374 needed by the company: {0}",
                        copiesNeeded);
    }

  

    public static List<Installation> ParseCSV(string[] csv)
    {
            IEnumerable<string> queryRows = csv.Skip(1);
            IEnumerable<Installation> parseCsvQuery =
            from line in queryRows
            let values = line.ToUpper().Split(',')
            select new Installation {ComputerID = Int32.Parse(values[0]),
                UserID = Int32.Parse(values[1]), ApplicationID = Int32.Parse(values[2]),
                ComputerType = values[3], Comment = values[4]};

            IEnumerable<Installation> queryDistinctRows = parseCsvQuery.Distinct(new Comparer());
            List<Installation> rows = queryDistinctRows.ToList();
            
            return rows;
        }
    }
}

Comparer.cs

using System.Collections.Generic;

namespace minimum_app_copies_to_purchase
{
    public class Comparer: IEqualityComparer<Installation>
    {
        bool IEqualityComparer<Installation>.Equals(Installation a, Installation b)
        {
            return a.ComputerID == b.ComputerID && a.UserID == b.UserID &&
                a.ApplicationID == b.ApplicationID && a.ComputerType.Equals(b.ComputerType) &&
                a.Comment.Equals(b.Comment);
        }

        public int GetHashCode(Installation obj)
        {
            string s = $"{obj.ComputerID}{obj.UserID}{obj.ApplicationID}{obj.ComputerType}" +
            $"{obj.Comment}";
            
            return s.GetHashCode();
        }
    }
}

Extensions.cs

using System.IO;
using System.Collections.Generic;
using System.Linq;

namespace minimum_app_copies_to_purchase
{
    public static class Extensions
    {
        public static bool ContainsComputerType(this IGrouping<int, Installation> source, string computerType)
        {
            bool contains = false;

            foreach (Installation obj in source)
            {
                
                if (obj.ComputerType.Equals(computerType))
                {
                    contains = true;
                }
            }

            return contains;
        }
    }
}

Sample Data

The output and the time it takes for the program to execute when the input is the 1.1 GB sample-large.csv file is shown below.

First run:

$ time dotnet run sample-large.csv
Total number of copies of application id 374 needed by the company: 6077

real    1m10.500s
user    1m9.715s
sys     0m4.266s

Second run:

$ time dotnet run sample-large.csv
Total number of copies of application id 374 needed by the company: 6077

real    1m43.551s
user    1m19.293s
sys     0m5.288s

Third run:

$ time dotnet run sample-large.csv
Total number of copies of application id 374 needed by the company: 6077

real    4m8.570s
user    1m21.497s
sys     0m8.317s

The program takes too long when the input is the large csv file which is 1.1 GB in size, and my laptop also starts lagging. Memory usage by this program reaches 79%, and CPU usage spikes to around 76% intermittently.

Solution

First I believe there is a bug in the code. Where the code calculating the copiesNeeded only counts if the group count is 1 or 2. But the data has users that have more computers than just 1 or 2. The Count() in the group would be the number of computers the user has, which is more than 2. I added this code to throw an exception and the program threw.

else
{
    throw new Exception("Missing Computers" + group.Count().ToString());
}

To help with performance the entire file is being loaded into memory. File.ReadAllLines will return the entire file back into a string array. Also calling ToList will load the entire results into memory.

IEnumerable can be deferred so that only as data is needed will it be loaded into memory. Lucky that File.ReadLines does just that. Also we don’t need all the data in the file just for the count.

public class Installation
{
    public Installation(int userId, bool desktop)
    {
        UserId = userId;
        Desktop = desktop;
    }

    public int UserId { get; }
    public bool Desktop { get; }
}

Just basic info in the class now. The UserId it is assigned to and a bool if a desktop or laptop since that is the only two valid options.

public static int GetCount(string file, int applicationId)
{
    var distinct = new HashSet<int>();
    var results = File.ReadLines(file)
        .Skip(1)
        .Select(x => x.Split(','))
        .Where(x => Convert.ToInt32(x[2]) == applicationId && distinct.Add(Convert.ToInt32(x[0])))
        .Select(x => new Installation(Convert.ToInt32(x[1]), x[3].Equals("Desktop", StringComparison.OrdinalIgnoreCase)))
        .GroupBy(x => x.UserId)
        .Select(x => new
        {
            Desktops = x.Count(x => x.Desktop),
            Laptops = x.Count(x => !x.Desktop)
        })
        .Sum(x =>
        {
            int desktopLicense;
            int laptopLicense;
            if (x.Desktops >= x.Laptops)
            {
                // all free licenses used by desktops
                desktopLicense = x.Desktops - x.Laptops;
                laptopLicense = x.Laptops;
            }
            else
            {
                // take laptop minus desktop to get remaining free license
                //   need to divide by 2 because use a laptop for a laptop
                desktopLicense = 0;
                laptopLicense = x.Laptops - (x.Laptops - x.Desktops) / 2;
            }

            return desktopLicense + laptopLicense;
        });

    return results;
} 

Here I’m creating a HashSet to store the computerId. If the key really is computerid + userId would make the HashSet of Tuple<int, int>, but looking at the data does seem computerId is unique.

Then before making a class just check to make sure it’s the applicationId we want and it’s not a duplicate — that’s the Where statement

Then create our class that holds the barebones data — that’s the Select statement

We need to group by the user so that we can do the math per user — that’s the Group statement

Since we have a bool we will count how many desktops the user has and how many laptops the user has — that’s the select statement.

Now sum up the required licenses. Put comments in the code to explain what the math is.

This way the comparer and extension methods are not required. And best part this returns 13927 license needed for the large file for 374 and runs in 10 seconds on my machine while the code review code returns back 6077, again I believe this is the wrong number, and runs in a minute and 20 seconds on my machine.

License counting

foreach (IGrouping<int, Installation> group in groupUsersQuery)
{         
    if (group.Count() == 1)
    {             
        copiesNeeded += 1;
    } else if (group.Count() == 2)
    {
        if (group.ContainsComputerType("LAPTOP")
         && group.ContainsComputerType("DESKTOP"))
             {
             copiesNeeded += 1;
         } else if (group.ContainsComputerType("LAPTOP")
            && group.ContainsComputerType("LAPTOP"))
            {
                copiesNeeded += 1;
            } else {
                copiesNeeded += 2;
            }
    }
}

I am not a fan of the foreach, because it’s an unnecessary computational load.
When possible, it’s better to do a calculation once as opposed to iteratively approaching it.

A classic example here is calculating the sum of a series of consecutive numbers.
Instinctively, you’d be inclined to approach this using an iteration (for(int i = start; i <= end; i++) { result += i; }), but it’s much faster to realize that you can get the answer using a single calculation (result = ((end-start)/2)*(start+end), no matter the size of the collection.

Finding that formula means that your calculation is not impacted by the amount of machines any given user has, as the formula remains the same operation (just with bigger numbers).

Secondly, I also find there’s too much if-branching here. You’re hardcoding every possible permutation, as opposed to looking for the more elegant algorithm that covers your use case. While this is a simple example and easy to hand-write, this is going to bite you in real-life projects which are considerably less bite-sized about their logic and complexity.

What it all boils down to is that you can decide the number of licenses (for a given user) based solely on the count of desktops and laptops that the user owns. Let’s use two examples to help intuitively find the formula.

6 desktops, 2 laptops

The answer is 6 licenses. One for each desktop, as these can never share a license. The laptops can be ignored, as they are the plus one for a desktop. Therefore, the laptops don’t require an additional license.

We’ve uncovered one immovable fact: you always need at least as many licenses as you have desktops.

But what if there’s more laptops than you can hide as plus ones for the desktops? In other words, what if there’s more laptops than desktops?

2 desktops, 7 laptops

Well, we can still apply the desktop rule. 2 desktops can each have a plus one, so 2 laptops are already handled. Our calculation isn’t done yet, but we already can solve part of it

2 desktops, 7 laptops
=> 2 licenses + 5 laptops remaining    (using the "desktop plus one" rule)

For the remaining laptops, you can simply pair them off with each other. This is no more than remainingLaptops / 2, except that you do have to round up.

How do you get the remaining laptops? Well, it’s the initial amount minus the plus ones that already were assigned with the desktops. Therefore, remainingLaptops = initialLaptops - initialDesktops.

2 desktops, 7 laptops
=> 2 licenses + 5 laptops remaining    (using the "desktop plus one" rule)
=> 2 licenses + 3 licenses             (using the "laptop pairing" rule)

This solves the entire thing. We can now write out the logic:

public int CountRequiredLicenses(int desktopCount, int laptopCount)
{
    int licenseCount = 0;

    // Desktop plus one rule

    licenseCount += desktopCount;

    // Laptop pairing rule

    int remainingLaptops = laptopCount - desktopCount;

    // Only if there are remaining (non-plus-one) laptops
    if(remainingLaptops > 0)
    {
        int laptopPairsCount = Math.Ceiling((double)remainingLaptops/2);
        licenseCount += laptopPairsCount;
    }

    return licenseCount;
}

As to how you need to implement this CountRequiredLicenses method in the code you already have, you can keep most of your data aggregation logic, with the one exception that you need to count the amount of desktops/laptops in each user’s group.

The grand total equals the sum of the CountRequiredLicenses value for each user.


Performance

As to the performance issue, if it matters to your teacher, they are partly to blame for this issue. Asking for sorting and grouping of a massive CSV file is no simple task, and it’s leveraging the wrong technology if performance is key. A database server is much better outfitted for these kinds of large data operations.

If performance doesn’t matter to your teacher, then you shouldn’t be trying to break your back over this.

Assuming we can reduce computer equality to a simple match on computerID (as discussed in a bullet point in the next section of this answer), then you can improve performance by processing each row individually as opposed to trying to perform complex data operations on the entire set (group by etc).

Instead, since we’re only really interested in the count of desktops/laptops in a user’s name anyway, we can simply process the rows one at a time, and tally up the records. First, we define the record we want to keep (per user):

public class UserMachineCount
{
    public int UserId { get; }
    public int DesktopCount { get; set; }
    public int LaptopCount { get; set; }

    public UserMachineCount(int userId)
    {
        this.UserId = userId;
        this.DesktopCount = 0;
        this.LaptopCount = 0;
    }
}

Then we track these records in a collection. I picked a dictionary because of the lookup performance.

// int = userId
private Dictionary<int, UserMachineCount> userMachineCounts = new ...; 

We’ll also track the computer IDs we’ve already processed. I picked a hash set due to good lookup performance.

private HashSet<int> processedComputerIds = new ...;

Then we write a method which parses a given line, looks up the existing record, creates one if it doesn’t exist, and increases the tally accordingly.

private void ParseLine(string[] row)
{
    int computerId = Convert.ToInt32(row[0]);
    int userId = Convert.ToInt32(row[1]);
    int applicationId = Convert.ToInt32(row[2]);
    string computerType = row[3];

    // Skip unrelated application IDs
    if(applicationId != 374)
        return;

    // Skip already processed computer Ids
    if(processedComputerIds.Contains(computerId))
        return;

    processedComputerIds.Add(computerId);

    // Check for existing record
    UserMachineCount userRecord;
    if(!userMachineCounts.TryGetValue(userId, out userRecord))
    {
        // If it doesn't exist, create it
        userRecord = new UserMachineRecord(userId);
        userMachineCounts.Add(userId, userRecord);
    }

    // Increment the correct tally
    switch(computerType.ToUpper())
    {
        case "DESKTOP":
            userRecord.DesktopCount++;
            break;
        case "Laptop":
            userRecord.LaptopCount++;
            break;
        default:
            // Maybe throw an exception? Or do nothing?
    }
}

This means that you can reduce your computational complexity and memory footprint significantly, as you now only need to process your input file one row at a time, and you only store the information that you will actually use.


Tying it all together

From here, calculating the grand sum of licenses is actually really easy:

int totalLicenses = userMachineCounts.Values.Sum(userRecord=> CountRequiredLicenses(userRecord.DesktopCount, userRecord.LaptopCount));

In human readable terms, we calculate the CountRequiredLicenses output for each record, and sum them together.


Smaller review points

  • Parsing a CSV is a problem that has been solved many times over, and it has more edge cases than you’ve accounted for. Unless your teacher told you to not bother, it’s generally better to use an existing CSV parsing library so you don’t have to reinvent the wheel.
  • I already implicitly touched on this in the previous explanation, but you forgot to account for situations where a given user owns more than one desktop or more than two laptops. You’re using contains logic, and you’re not accounting for how many of a given type there are.
  • Good use of the IEqualityComparer, but arguably overkill here. You’re only going to use it once, so it’s not very ripe for making reusable. Good exercise though.
  • This is not your fault, but the premise makes it ambiguous what to do if two duplicate records (same computerID) exist but contradict one another (different user id, application id or computer type). This raises a potential red flag in your equality logic.
    • I would argue that computer equality should be done solely on computerID, and any conflict between the other columns should either raise an exception or have some implemented fallback logic to handle these kinds of contradictions.
  • You’ve hardcoded your logic for application ID 374. It’s not hard to parametrize this, and it would make a lot of sense to be able to choose the application for which to count the licenses.
  • ContainsComputerType is reinventing what LINQ’s Any() can already do for you. It’s not a bad implementation but it’s rather unnecessary to make it from scratch.

Leave a Reply

Your email address will not be published. Required fields are marked *