statistics classes with number and object inputs

Posted on

Problem

I want to create two classes that accept values and perform operations on them. However, one of them is accepting numerical values – RollingStatisticsNumbers; and the other objects – RollingStatisticsObjects, with a specified weight. So they both implement Statistical interface that specifies calculation methods, but they have different add method signatures for adding values, that’s why I’m creating two other interfaces that extend the Statistical… So I want for RollingStatisticsNumbers to have a type of Number and for RollingStatisticsObjects just any comparable type. The problem is that it violates Liskov Substitution Principle because of stricter preconditions on children, but I have no idea what other approaches to undertake.

Interfaces:

public interface Statistical<T extends Comparable<T>> {
    Double getMean();
    Double getMedian();
    Double getSum();
    Double getStandardDeviation();
    List<T> getData();
}
public interface StatisticalWithoutWeight<T extends Number & Comparable<T>> extends Statistical<T> {
    void add(T value);
}
public interface StatisticalWithWeight<T extends Comparable<T>> extends Statistical<T> {
    void add(T value, int weight);
}

Classes:

public class RollingStatisticsNumbers<T extends Number & Comparable<T>> implements StatisticalWithoutWeight<T> {

    private List<T> values = new ArrayList<>();

    @Override
    public void add(T value) {
        int left = 0;
        int right = values.size() - 1;
        if (right == -1) {
            values.add(value);
            return;
        }
        int mid = (left + right) / 2;
        while (left <= right) {
            mid = (left + right) / 2;
            if (value.compareTo(values.get(mid)) < 0) right = mid - 1;
            else if (value.compareTo(values.get(mid)) > 0) left = mid + 1;
            else break;
        }
        if (value.compareTo(values.get(mid)) > 0) mid += 1;
        values.add(mid, value);
    }

    @Override
    public Double getMean() {
        BigDecimal mean = new BigDecimal(0);
        mean = BigDecimal.valueOf(getSum()).divide(new BigDecimal(values.size()), 2, RoundingMode.FLOOR);
        return mean.doubleValue();
    }

    @Override
    public Double getMedian() {
        int length = values.size();
        if (length % 2 == 0) {
            int lowMidIndex = (length / 2) - 1;
            BigDecimal low = BigDecimal.valueOf(values.get(lowMidIndex).doubleValue());
            BigDecimal high = BigDecimal.valueOf(values.get(lowMidIndex + 1).doubleValue());
            BigDecimal median = low.add(high).divide(new BigDecimal(2), 2, RoundingMode.FLOOR);
            return median.doubleValue();
        }
        return values.get(length/2).doubleValue();
    }

    @Override
    public Double getSum() {
        BigDecimal sum = new BigDecimal(0);
        for (T value: values) {
            sum = sum.add(BigDecimal.valueOf(value.doubleValue()));
        }
        return sum.doubleValue();
    }

    @Override
    public Double getStandardDeviation() {
        Double mean = getMean();
        BigDecimal deviation = new BigDecimal(0);
        for (T value: values) {
            BigDecimal e = BigDecimal.valueOf((value.doubleValue() - mean));
            e = e.pow(2);
            deviation = deviation.add(e);
        }
        deviation = deviation.divide(new BigDecimal(values.size()), 2, RoundingMode.FLOOR);
        MathContext mc = new MathContext(10);
        deviation = deviation.sqrt(mc);
        return deviation.doubleValue();
    }

    @Override
    public List<T> getData() {
        return values;
    }
}
public class RollingStatisticsObjects<T extends Comparable<T>> implements StatisticalWithWeight<T> {

    private RollingStatisticsNumbers<Integer> rollingStatisticsNumbers = new RollingStatisticsNumbers<>();
    private List<T> values = new ArrayList<>();

    @Override
    public void add(T value, int weight) {
        rollingStatisticsNumbers.add(weight);
        values.add(value);
    }

    @Override
    public Double getMean() {
        return rollingStatisticsNumbers.getMean();
    }

    @Override
    public Double getMedian() {
        return rollingStatisticsNumbers.getMedian();
    }

    @Override
    public Double getSum() {
        return rollingStatisticsNumbers.getSum();
    }

    @Override
    public Double getStandardDeviation() {
        return rollingStatisticsNumbers.getStandardDeviation();
    }

    @Override
    public List<T> getData() {
        return values;
    }
}

Example of usage:

public class App 
{
    public static void main( String[] args )
    {
        StatisticalWithoutWeight<Double> statisticalWithoutWeight = RollingStatistics.getStatisticsObjectNumbers();
        statisticalWithoutWeight.add(2.0);
        statisticalWithoutWeight.add(2.0);
        statisticalWithoutWeight.add(2.0);
        statisticalWithoutWeight.add(7.0);
        statisticalWithoutWeight.add(3.0);
        System.out.println(statisticalWithoutWeight.getData());
        System.out.println(statisticalWithoutWeight.getMean());
        System.out.println(statisticalWithoutWeight.getMedian());
        System.out.println(statisticalWithoutWeight.getSum());
        System.out.println(statisticalWithoutWeight.getStandardDeviation());

        Person person1 = new Person("John", "male", 25);
        Person person2 = new Person("Mia", "female", 27);
        Person person3 = new Person("Bob", "male", 21);

        StatisticalWithWeight<Person> statisticalWithWeight = RollingStatistics.getStatisticsObject();
        statisticalWithWeight.add(person1, person1.getAge());
        statisticalWithWeight.add(person2, person2.getAge());
        statisticalWithWeight.add(person3, person3.getAge());
        System.out.println(statisticalWithWeight.getData());
        System.out.println(statisticalWithWeight.getMean());
        System.out.println(statisticalWithWeight.getMedian());
        System.out.println(statisticalWithWeight.getSum());
        System.out.println(statisticalWithWeight.getStandardDeviation());
    }
}

public class Person implements Comparable<Person>{
    private String name;
    private String sex;
    private int age;

    public Person(String name, String sex, int age) {
        this.name = name;
        this.sex = sex;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getSex() {
        return sex;
    }

    public void setSex(String sex) {
        this.sex = sex;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }


    @Override
    public int compareTo(Person o) {
        return this.age - o.age;
    }

    @Override
    public String toString() {
        final StringBuilder sb = new StringBuilder("Person{");
        sb.append("name='").append(name).append(''');
        sb.append(", sex='").append(sex).append(''');
        sb.append(", age=").append(age);
        sb.append('}');
        return sb.toString();
    }
}

output:
[2.0, 2.0, 2.0, 3.0, 7.0]
3.2
2.0
16.0
1.939071943
[Person{name=’John’, sex=’male’, age=25}, Person{name=’Mia’, sex=’female’, age=27}, Person{name=’Bob’, sex=’male’, age=21}]
24.33
25.0
73.0
2.493992783

Solution

The others gave excellent answers, here’s a little more tweaking.

Always add curly braces to loop & if

In my opinion, it’s a bad practice to have a block of code not surrounded by curly braces; I saw so many bugs in my career related to that, if you forget to add the braces when adding code, you break the logic / semantic of the code.

Use the constants that java.math.BigDecimal / java.math.BigInteger gives you instead of creating a new instance.

Those’s classes expose some of the most uses numbers (BigDecimal.ZERO, ect) that you can use instead of new BigDecimal(0).

Instead of using java.math.BigDecimal#pow(int), use the multiplication

In some cases, the pow method tend to be slower than using the multiplication.

Before

e = e.pow(2);

After

e.multiply(e);

RollingStatisticsNumbers class

Extract some of the logic to methods.

When you have logic that does the same thing, you can generally move it into a method and reuse it.

I suggest the following:

AdivideBy method to divide a number and floor it.

private BigDecimal divideBy(BigDecimal number, int size) {
   return number.divide(new BigDecimal(size), 2, RoundingMode.FLOOR);
}

A getValueOf method to create a new java.math.BigDecimal from the template.

private BigDecimal getValueOf(T t) {
   return BigDecimal.valueOf(t.doubleValue());
}

RollingStatisticsNumbers#getMean method

The double initialization of the mean variable is useless; you can remove the new BigDecimal(0) since the RollingStatisticsNumbers#divideBy will override the result in all cases.

Before

@Override
public Double getMean() {
   BigDecimal mean = new BigDecimal(0);
   mean = divideBy(BigDecimal.valueOf(getSum()), values.size());
   return mean.doubleValue();
}

After

@Override
public Double getMean() {
   return divideBy(BigDecimal.valueOf(getSum()), values.size()).doubleValue();
}

Refactored code

public class RollingStatisticsNumbers<T extends Number & Comparable<T>> implements StatisticalWithoutWeight<T> {

    private List<T> values = new ArrayList<>();

    @Override
    public void add(T value) {
        int left = 0;
        int right = values.size() - 1;
        if (right == -1) {
            values.add(value);
            return;
        }

        int mid = (left + right) / 2;
        while (left <= right) {
            mid = (left + right) / 2;
            if (value.compareTo(values.get(mid)) < 0) {
                right = mid - 1;
            } else if (value.compareTo(values.get(mid)) > 0) {
                left = mid + 1;
            } else {
                break;
            }
        }
        if (value.compareTo(values.get(mid)) > 0) {
            mid += 1;
        }
        values.add(mid, value);
    }

    @Override
    public Double getMean() {
        return divideBy(BigDecimal.valueOf(getSum()), values.size()).doubleValue();
    }

    @Override
    public Double getMedian() {
        int length = values.size();
        if (length % 2 == 0) {
            int lowMidIndex = (length / 2) - 1;
            BigDecimal low = getValueOf(values.get(lowMidIndex));
            BigDecimal high = getValueOf(values.get(lowMidIndex + 1));
            BigDecimal median = divideBy(low.add(high), 2);
            return median.doubleValue();
        }
        return values.get(length / 2).doubleValue();
    }

    @Override
    public Double getSum() {
        BigDecimal sum = BigDecimal.ZERO;
        for (T value : values) {
            sum = sum.add(getValueOf(value));
        }
        return sum.doubleValue();
    }

    @Override
    public Double getStandardDeviation() {
        Double mean = getMean();
        BigDecimal deviation = BigDecimal.ZERO;
        for (T value : values) {
            BigDecimal e = BigDecimal.valueOf((value.doubleValue() - mean));
            e = e.pow(2);
            deviation = deviation.add(e);
        }
        deviation = divideBy(deviation, values.size());
        MathContext mc = new MathContext(10);
        deviation = deviation.sqrt(mc);
        return deviation.doubleValue();
    }

    private BigDecimal getValueOf(T t) {
        return BigDecimal.valueOf(t.doubleValue());
    }

    private BigDecimal divideBy(BigDecimal number, int size) {
        return number.divide(new BigDecimal(size), 2, RoundingMode.FLOOR);
    }

    @Override
    public List<T> getData() {
        return values;
    }
}

Inheritance is for sharing an interface. The only reason to create a base class Statistical would be if you’re planning to write some polymorphic code

void doSomeComputation(Statistical stats) {
    Double stddev = stats.getStandardDeviation();
    ...
}

The code you posted doesn’t do any polymorphism; therefore it doesn’t need any inheritance relationships. Just make RollingStatisticsNumbers and RollingStatisticsObjects two different classes, with no inheritance at all.

I’m a little bit confused by how you decided to structure and approach this.

From what I can see, StatisticalWithoutWeight and StatisticalWithWeight are actually the same thing but you decided to treat them different for some reason. At least from your example, everything boils down to Numbers. So what you very likely want, is a single class, with no interfaces, which accepts a single Object that can tell the class what value to use.

public interface DataPoint {
    public int getStatisticalValue();
}

public class Statistical {
    public Double getMean();
    public Double getMedian();
    public Double getSum();
    public Double getStandardDeviation();
    public List<T> getData();
    public void add(DataPoint dataPoint);
}

With that in place, you can create yourself a convenience DataPoint:

public class ProvidingDataPoint implements DataPoint {
    public <TYPE> ProvidingDataPoint(TYPE instance, IntFunction<TYPE> valueProvider);
}

That would allow you to do:

statistical.add(new ProvidingDataPoint(person, (person) -> person.getAge()));

Or with a convenience method on statistical itself:

statistical.add(person, Person::getAge);

Of course, that’s not perfect. Ideally, DataPoint would return a BigDecimal. Overall, your usage of Double is odd, I’d expect either ints or BigDecimal, given that you do your calculations already in that you might as well provide the user of your class with the more precise values.


Thinking a little bit more about it, a better alternative would most likely be to have the provider of values added to the Statistical class and specialize it on a single type of value, like this:

public class Statistical<DATAPOINT_TYPE> {
    public Statistical(Function<DATAPOINT_TYPE, BigDecimal> valueProvider);
    public BigDecimal getMean();
    public BigDecimal getMedian();
    public BigDecimal getSum();
    public BigDecimal getStandardDeviation();
    public List<DATAPOINT_TYPE> getData();
    public Statistical<DATAPOINT_TYPE> add(DATAPOINT_TYPE datapoint);
}

That means that you can use it like this:

Statistical<Person> statistical = new Statistical<>((person) -> person.getAge());
statistical.add(personA);

As said before, I’d expose all values as BigDecimal to provide the user of the class with most precision.


    public List<T> getData() {
        return values;
    }

You’re exposing internal state through this. Somebody can use the returned List to add Objects without having to call add, which might or might not be wanted.

Ideally, you’d return a Collections.unmodifiableList(values) here, or a copy.


MathContext mc = new MathContext(10);

Unlikely, yes, but could still not be enough depending on what values are being sampled.


Having talked about that you should not expose internal state and that you should control adding values through the add function, it might be beneficial to cache some of these values when possible, so that multiple calls to the function do not recalculate the values. Otherwise I’d make clear that the operation might be expensive, for example by naming the methods correctly calculateMean.

Previous answers already deal with code quality, I’ll focus on algorithm and performance here.

If one doesn’t need the median, then this class is extremely wasteful and slow. It’s O(n) time to add a value because of your sorted insert and it having to move values around in the array list. Not to mention that it’s O(n) memory in the number of samples this is limiting it’s usability. Using a linked list wouldn’t help as you lose O(1) random access and gain O(n) random access which makes the binary search suffer instead. A set wouldn’t work either as it would prohibit adding duplicate values which is unreasonable for counting statistics.

I would split the interface into FastStatistics and MedianStatistics or something similar and for the fast statistics class use the mean and sum of squares method to compute the mean and variance in O(1) time and memory. And only if the median is needed keep the samples in a list, at indicated by using the median statistics object.

As for computing the median, I wouldn’t insertion sort into the vector, insertion sort is among the slower sorts.

In fact, if you want the median of n samples, you need to insert n times, meaning you’ll have O(n^2) time complexity. If you instead quicksort when calling getmean (possibly keeping a isSorted flag to avoid resorting every time you call it, although quicksort is usually fast on an already sorted array) you’ll instead have O(n*log(n)) time complexity which is heaps faster.

If you need the median frequently, like between each sample the above could possibly be slower. We’re talking O(n^2) of insertion sort vs O(n^2*log(n)) of quicksort, quicksort is slower by a little bit but I think the constant factor may play a large role here and depending on your quicksort implementation it might be faster on almost sorted arrays… Benchmarking needed.

Leave a Reply

Your email address will not be published. Required fields are marked *