Problem
I’ve implemented a templated DBSCAN for general use. At the moment, it’s going to be used on Android through the JNI. I used Wikipedia’s pseudocode and a little bit of the DBSCAN paper for reference. It’s pretty naive, so I’m wondering how I can speed it up, and what I can do to make it perform reasonably well on a phone. How can I improve my code?
template <typename T>
struct node {
T val;
bool visited;
bool clustered;
node() : visited(false), clustered(false) {}
};
template <typename T>
using cluster = std::vector<T>;
template <typename T>
using adj_list = std::list<node<T>*>;
template <typename T>
using graph = std::map<node<T>*, adj_list<T>>;
template <typename T>
std::vector<cluster<T>> DBSCAN(T* const& dataset, size_t const& dataset_size, double const& eps, size_t const& min_pts, double(*distance_function)(T const& lhs, T const& rhs)) {
std::vector<node<T>> node_list(dataset_size);
for (auto i = 0; i < dataset_size; ++i) {
node_list[i].val = dataset[i];
}
graph<T> g;
for (auto i = 0; i < node_list.size(); ++i) {
for (auto j = 0; j < i; ++j) {
if (distance_function(node_list[i].val, node_list[j].val) < eps) {
g[&node_list[i]].push_back(&node_list[j]);
g[&node_list[j]].push_back(&node_list[i]);
}
}
}
std::vector<cluster<T>> clusters;
cluster<T> C;
for (node<T>& n : node_list) {
if (n.visited) continue;
n.visited = true;
adj_list<T> neighbour_pts = g[&n];
if (neighbour_pts.size() >= min_pts) {
expandCluster(n, neighbour_pts, C, g, min_pts);
clusters.push_back(C);
C = cluster<T>();
}
}
return clusters;
}
template <typename T>
void expandCluster(node<T>& point, adj_list<T>& neighbourhood, cluster<T>& C, graph<T>& g, unsigned const& min_pts) {
C.push_back(point.val);
point.clustered = true;
for (node<T>*& n : neighbourhood) {
if (!n->visited) {
n->visited = true;
adj_list<T> next_neighbourhood = g[n];
if (next_neighbourhood.size() >= min_pts) neighbourhood.splice(neighbourhood.end(), next_neighbourhood);
}
if (!n->clustered) {
C.push_back(n->val);
n->clustered = true;
}
}
}
Solution
Naming things
Why you have all caps function name?
Why you have variable with capital letter in expandCluster()
?
At first I thought it is template parameter i miss.
Use auto more often:
// looks ugly
for (node<T>*& n : neighbourhood) {
}
// looks better
for (auto & n : neighbourhood) {
// adjust for * here
}
// or you can do with using
using mynode = node<T>;
// still looks better
for (mynode* & n : neighbourhood) {
}
DBSCAN()
Make this a type:
template<class T>
using something = std::vector<cluster<T>>
Why you pass dataset_size
by reference?
size_t is as big as the pointer, but there will be additional hidden dereference that will slow down the execution.
keep the const, remove &
size_t const dataset_size
Do the same for all non class parameters.
dataset
probably I am wrong, but if you type this:
T* const& dataset
like this:
const T* dataset
will be same, but more readable? Not sure what T
is. If T
is a container or array, why not just:
const CONTAINER dataset
Why you pass C like
function pointers?
Do it with functor
class. Or with std::function
.
I am not very experienced with this, so I will do it with functor
, but will add it into template parameters.
expandCluster()
more or less similar comments.