firstDuplicate of array

Posted on

Problem

This question has been asked for other languages, but I haven’t seen it for C++. Most notably here and here.

Given an array a that contains only numbers in the range from 1 to a.length, find the first duplicate number for which the second occurrence has the minimal index. In other words, if there are more than 1 duplicated numbers, return the number for which the second occurrence has a smaller index than the second occurrence of the other number does. If there are no such elements, return -1.

My tests are running correctly, but I’m getting a timeout from the website codesignalDotCom. I’m wondering what I’m doing that could be taking so much time, and what could be done to fix it.

int firstDuplicate(std::vector<int> a) {
    std::vector<int> newVector;  
    std::cout << a.size() << std::endl;
    for(int i = 0; i < a.size(); i++) {
        if(std::find(newVector.begin(), newVector.end(), a.at(i)) != newVector.end()){
            return a.at(i);
        }
        newVector.push_back(a.at(i));
    }
    return -1;
}

Solution

Find more appropriate data structures

This question has been asked for other languages, but I haven’t seen it for C++.

Getting timeout usually indicates an inefficient algorithm.
The language doesn’t matter, because algorithms are language agnostic.
Some of the posts you linked mention the word “performance”,
and they shed light to the problem,
with suggestions that could be applied here too,
even if the language is different.

The first thing to ask when a solution is too slow,
what’s the time complexity of the algorithm?

quadratic: O(n2)O(n2)

That’s not so good. Looping over all elements is a linear operation (O(n)O(n)),
and checking if an element is in a vector is also a linear operation.

In other words, the weak spot is using a vector to store the elements already seen.
A vector doesn’t provide a fast way to search for random elements.
Other data structures can provide faster than linear performance to lookup elements:
a tree set can do it in logarithmic time (O(logn)O(logn), making the overall complexity of your algorithm log-linear O(nlogn)O(nlogn)),
or a hash set can do it in constant time (O(1)O(1), making the overall complexity of your algorithm linear O(n)O(n)).

Coding style

The name newVector is not great, for several reasons:

  • What’s “new” about it?
  • Including the name of the data structure in the name of the variable usually doesn’t give much new information to the reader, and therefore doesn’t help understand its purpose.

This vector is used to store elements already seen. I would have called it seen.


The name a is even worse: it tells nothing to the reader about what it is.


The main loop goes over indexes of the input vector.
You could loop over the values of the input,
which would read more natural:

for (int value : a) {
    if (std::find(newVector.begin(), newVector.end(), value) != newVector.end()){
        return value;
    }
    newVector.push_back(value);
}

int firstDuplicate(std::vector<int> a) {

You’ve defined your function to take parameter a. Your program simply inspects the elements of a. a itself isn’t mutated, so copying a in the parameter of the function is not necessary. For cheap and impossible to copy parameter types, taking them by copy is fine. For parameters where copies become moderately expensive or the cost is unknown at compile time, prefer passing your parameter by reference-to-const.

Try to give your parameter a name that at least hints what values it expects. In this case, numbers from 1 upto some length. We can call them natural numbers.

int firstDuplicate(const std::vector<int>& natural_numbers)

    std::cout << a.size() << std::endl;

Unnecessary debugging artifact.


    std::vector<int> newVector;

Before you push values into this newVector, can any safe assumptions be made about this container? If we encounter the worst case, a set of natural numbers with no duplicates, then the size of newVector after inserting the elements of a will be the same.

    std::vector<int> values_seen;
    values_seen.reserve(natural_numbers.size());

    for(int i = 0; i < a.size(); i++) {

If you are not doing math that requires indices, then prefer range-based for loops instead of raw loops.

    for (auto number : natural_numbers) {

    for(int i = 0; i < a.size(); i++) {
        if(std::find(newVector.begin(), newVector.end(), a.at(i)) != newVector.end()){
            return a.at(i);
        }
    }

Since i is known to be a valid index (0i<size0i<size), you do not require bounds-checking from std::vector::at. Use std::vector::operator[] When you don't need bounds checking.


As for improving the overall algorithmic approach, you could speed up the lookups. Keep the elements seen sorted and binary search the insertion point. Hashing the seen values is another option. Both of those options require extra space. An inplace algorithm does exist. Go back to the problem statement and it describes the set of numbers as being 1 to a length. So you have a bunch of positive values being stored as int. Assuming that the largest value in the set of natural numbers (one-based) is smaller than the size of the array (zero-based), we can map values to the indices and can mark it through the signbit.

for (int value : natural_numbers) {
    const auto original_value = value & std::numeric_limits<int>::max();

    // The values in the array are one-based, but index-map zero-based
    if (natural_numbers[original_value - 1] < 0) {
         return original_value;
    }

    natural_numbers[original_value - 1] |= (1 << 31);
}

Note - The original numbers must be natural numbers (1, 2, ..., length). This approach does not work for negative numbers and is undefined behavior for zero.

Leave a Reply

Your email address will not be published. Required fields are marked *