DBSCAN in C++ for general and Android use

Posted on

Problem

I’ve implemented a templated DBSCAN for general use. At the moment, it’s going to be used on Android through the JNI. I used Wikipedia’s pseudocode and a little bit of the DBSCAN paper for reference. It’s pretty naive, so I’m wondering how I can speed it up, and what I can do to make it perform reasonably well on a phone. How can I improve my code?

template <typename T>
    struct node {
        T val;
        bool visited;
        bool clustered;
        node() : visited(false), clustered(false) {}
    };

    template <typename T>
    using cluster = std::vector<T>;

    template <typename T>
    using adj_list = std::list<node<T>*>;

    template <typename T>
    using graph = std::map<node<T>*, adj_list<T>>;

    template <typename T>
    std::vector<cluster<T>> DBSCAN(T* const& dataset, size_t const& dataset_size, double const& eps, size_t const& min_pts, double(*distance_function)(T const& lhs, T const& rhs)) {

        std::vector<node<T>> node_list(dataset_size);
        for (auto i = 0; i < dataset_size; ++i) {
            node_list[i].val = dataset[i];
        }

        graph<T> g;
        for (auto i = 0; i < node_list.size(); ++i) {
            for (auto j = 0; j < i; ++j) {
                if (distance_function(node_list[i].val, node_list[j].val) < eps) {
                    g[&node_list[i]].push_back(&node_list[j]);
                    g[&node_list[j]].push_back(&node_list[i]);
                }
            }
        }

        std::vector<cluster<T>> clusters;
        cluster<T> C;
        for (node<T>& n : node_list) {
            if (n.visited) continue;
            n.visited = true;
            adj_list<T> neighbour_pts = g[&n];
            if (neighbour_pts.size() >= min_pts) {
                expandCluster(n, neighbour_pts, C, g, min_pts);
                clusters.push_back(C);
                C = cluster<T>();
            }
        }

        return clusters;
    }

    template <typename T>
    void expandCluster(node<T>& point, adj_list<T>& neighbourhood, cluster<T>& C, graph<T>& g, unsigned const& min_pts) {
        C.push_back(point.val);
        point.clustered = true;
        for (node<T>*& n : neighbourhood) {
            if (!n->visited) {
                n->visited = true;
                adj_list<T> next_neighbourhood = g[n];
                if (next_neighbourhood.size() >= min_pts) neighbourhood.splice(neighbourhood.end(), next_neighbourhood);
            }
            if (!n->clustered) {
                C.push_back(n->val);
                n->clustered = true;
            }
        }
    }

Solution

Naming things

Why you have all caps function name?

Why you have variable with capital letter in expandCluster()?
At first I thought it is template parameter i miss.

Use auto more often:

// looks ugly
for (node<T>*& n : neighbourhood) {
}

// looks better
for (auto & n : neighbourhood) {
     // adjust for * here
}

// or you can do with using
using mynode = node<T>;

// still looks better
for (mynode* & n : neighbourhood) {
}

DBSCAN()

Make this a type:

template<class T>
using something = std::vector<cluster<T>>

Why you pass dataset_size by reference?

size_t is as big as the pointer, but there will be additional hidden dereference that will slow down the execution.
keep the const, remove &

size_t const dataset_size

Do the same for all non class parameters.

dataset

probably I am wrong, but if you type this:

T* const& dataset

like this:

const T* dataset

will be same, but more readable? Not sure what T is. If T is a container or array, why not just:

const CONTAINER dataset

Why you pass C like function pointers?

Do it with functor class. Or with std::function.

I am not very experienced with this, so I will do it with functor, but will add it into template parameters.

expandCluster()

more or less similar comments.

Leave a Reply

Your email address will not be published. Required fields are marked *