# Counting the number of unique IP addresses in a very large file

Posted on

Problem

I made a test job to the position of Junior Java Developer. I did not receive any answer from the employer, so I would like to get a review here.

A simple text file with IPv4 addresses is given. One line is one address, something like this:

145.67.23.4
8.34.5.23
89.54.3.124
89.54.3.124
3.45.71.5.
...


The file size is not limited and can occupy tens and hundreds of gigabytes.

It is necessary to calculate the number of unique IP addresses in this file, consuming as little memory and time as possible. There is a “naive” algorithm for solving this task (we read strings and put it in HashSet), it is desirable that your implementation is better than this simple, naive algorithm.

## Test case

WARNING! This file is about 20GB, and is unpacking approximately 120GB. It consists of 8 billions lines.

## My approach

I emerge from the fact that there are 2 ^ 32 valid unique ip addresses. We can use a bit array of 2 ^ 32 bits (unsigned integer) and put each bit of this array in line with one ip address. Such an array will take exactly 512 megabytes of memory and will be allocated once at the start of the program. Its size does not depend on size of the input data.

Unfortunately, Java does not have an unsigned int type and also there are no convenient bit operations in it, so I use BitSet for my purposes.

## Thank you!

Please feel free to tell me all the flaws that you can find.

## Code

Main class

public class IpCounterApp {

private static String parseFileName(String[] args) {
String fileName = null;
if (args.length == 2 && "-file".equals(args[0])) {
fileName = args[1];
}
return fileName;
}

public static void main(String[] args) {
String fileName = parseFileName(args);
if (fileName == null) {
System.out.println("Wrong arguments. Use '-file file_name' to specify file for processing");
return;
}

UniqueIpCounter counter = new BitSetUniqueIpCounter();
long numberOfUniqueIp = counter.countUniqueIp(fileName);
if (numberOfUniqueIp != -1) {
System.out.println("Found " + numberOfUniqueIp + " unique IP's");
} else {
System.out.println("Some errors here. Check log for details.");
}
}
}



UniqueIpCounter interface

public interface UniqueIpCounter {

/*
In total there are 2 ^ 32 valid IP addresses exists.
*/
long NUMBER_OF_IP_ADDRESSES = 256L * 256 * 256 * 256;

/*
Map string representing the IP address in format 0-255.0-255.0-255.0-255 to number
in the range of 0..2^32-1 inclusive.
It is guaranteed that the input string contains a valid IP address.
*/
static long toLongValue(String ipString) {
StringBuilder field = new StringBuilder(3);
int startIndex = 0;
long result = 0;

for (int i = 0; i < 3; i++) {
int spacerPosition = ipString.indexOf('.', startIndex);
field.append(ipString, startIndex, spacerPosition);
int fieldValue = Integer.parseInt(field.toString());
field.setLength(0);
result += fieldValue * Math.pow(256, 3 - i);
startIndex = spacerPosition + 1;
}
result += Integer.parseInt(ipString.substring(startIndex));

return result;
}

/*
Returns the number of unique IP addresses in the file whose name is pass by the argument.
Returns the number from 0 to 2 ^ 32 inclusive.
Returns -1 in case of any errors.
*/
long countUniqueIp(String fileName);
}


BitSetUniqueIpCounter implementation

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.BitSet;
import java.util.logging.Level;
import java.util.logging.Logger;

public class BitSetUniqueIpCounter implements UniqueIpCounter {

private final Logger logger = Logger.getLogger("BitSetUniqueIpCounter");

/*
To count unique IP's use a bit array, where each bit is set in accordance with one IP address.
In Java, there is no unsigned int and maximum BitSet size is integer.MAX_VALUE therefore we use two arrays.
*/
private final BitSet bitSetLow = new BitSet(Integer.MAX_VALUE); // 0 - 2_147_483_647
private final BitSet bitSetHi = new BitSet(Integer.MAX_VALUE); // 2_147_483_648 - 4_294_967_295
private long counter = 0;

private void registerLongValue(long longValue) {
int intValue = (int) longValue;
BitSet workingSet = bitSetLow;
if (longValue > Integer.MAX_VALUE) {
intValue = (int) (longValue - Integer.MAX_VALUE);
workingSet = bitSetHi;
}

if (!workingSet.get(intValue)) {
counter++;
workingSet.set(intValue);
}
}

@Override
public long countUniqueIp(String fileName) {
logger.log(Level.INFO, "Reading file: " + fileName);
long linesProcessed = 0;
String line;
// If already counted 2 ^ 32 unique addresses, then to the end of the file there will be only duplicates
registerLongValue(UniqueIpCounter.toLongValue(line));
linesProcessed++;
}
logger.log(Level.INFO, "Total lines processed: " + linesProcessed);
} catch (FileNotFoundException e) {
counter = -1;
} catch (IOException e) {
logger.log(Level.WARNING, "IOException occurs", e);
counter = -1;
}
return counter;
}
}



Solution

I’m being extra pedantic as this is a job interview question. Your code is better than 90% of what I have seen in job interviews. It shows that you have a fairly good understanding of how to solve the problem but there are improvements to be made in both minor details and the overall design of the solution. (This answer is a bit unstructured as I just make observations as I read through the code.)

The file name parsing employs unnecessary variables and the boolean statements changes logical order mid sentence. The reversed order is used when guarding against null values but I find it difficult to read (increases cognitive complexity). You have also already gone through all possible error conditions by referencing args when checking its length, so the reverse comparison order does not have any functional purpose any more. A more readable way to guard against nulls would be Objects.requireNonNull(args).

private static String parseFileName(String[] args) {
Objects.requireNonNull(args, "args must not be null");
if (args.length == 2 && args[0].equals("-file")) {
return args[1];
}
return null;
}


Variables that are not intended to be changed should always be final.

Error messages should be written to System.err. Also, you’re inconsistently using System.out in one place and java.util.logging.Logger in another place and logging fatal errors as warnings.

You should avoid placing constants in interfaces as it leaks implementation details into the interface’s API.

You’re writing comments but not using the standard JavaDoc formatting. This reduces the usability of the comments as they will not be copied to automatically produced documentation or IDE tool tips.

The static toLongValue method for converting the input tries to force the UniqueIpCounter implementation to work with long values. I would not want to see static methods in interfaces and the internal representation of the data should not be exposed here. Leave the conversion to the responsibility of the specific implementation.

Forcing the UniqueIpCounter implementation to work with files, represented by file names, unnecessarily limits the reusability of the implementation and loads the implementation with multiple responsibilities, which violates the single responsibility principle. Reading the file should have been placed in a separate class. Parsing the read strings into IP addresses should have been another standalone class. The IP counter should thus only deal with the parsed input. Also, placing the responsibility of reading the file on the UniqueIpCounter you make unit testing the class very hard because now every test input has to be represented by a physical file on the disk.

The registerLongValue(long) seems an odd method in an interface named UniqueIpCounter. The naming should be consistent. So either registerIpAddress(...) or UniqueLongValueCounter.

Manuel suggested using InetAddress for parsing the input. When presented with a task of minimizing running time, you should read the source code of the libraries you use. It is obvious that InetAddress is a generic class that trades usability to performance. So you should implement a specialized string parsing algorithm and document why you wrote one instead of using a ready made library. You might even consider skipping the conversion to strings in between and create an InputStream reader that produces long values directly. In the end, the most important part is that you document why you did not choose to go with an easily readable and maintainable code. That said… your parsing algorithm is incredibly inefficient. To parse an IPv4 address to long you only need to read characters and append the digit value into an “accumulator”:

Set parsedIpAddress and currentByte to 0;
If digit, multiply currentByte by 10 and add digit.
If dot or new line, multiply parsedIpAddress by 256 and add current byte.
If new line, parsedIpAddress is complete.
Otherwise repeat from line 2.


Using BitSet was, in theory, the correct choice but as Manuel pointed out, even having two sets does not cover the whole address range. Thus you have to reimplement it without the size limitation. That being said, the variable names need improvement. lowAddresses and highAddresses would tell the reader that they are used to store addresses already found. Use normal JavaDoc comments instead of end-of-line comments. EOL-comments are hard to read and maintain and you always have to make a compromise on the usefulness of the content to make it fit.

“Return -1 on error” is an anti-pattern inherited from C, which should not be repeated. The correct way to handle errors is to throw an exception. I’m not sure it it was required in the task but there was no error handling in the input parsing.

Generic advice for job interview code

Job interview programming tasks are always too vague. It’s either because they want to see how you handle incomplete specifications or because they don’t know what they are doing. Regardless of the reason, this puts you into a position where you have to either ask for clarifications (which is usually a good thing) or make assumptions if you are not allowed to ask. When you make assumptions you have to document that you had multiple choices and list the reasons why you chose the one you implemented. If the reviewer does not agree with your approach they at least know that you made an intentional choice between several options instead of “doing the only one thing you knew.”

The maximum int value is $$231−1231−12^{31}-1$$, so your BitSets have $$231−1231−12^{31}-1$$ entries each, so together they only cover $$232−2232−22^{32}-2$$ IPs, missing two. And you documented their ranges incorrectly. If the file contains 255.255.255.255, you crash like this:

Exception in thread "main" java.lang.IndexOutOfBoundsException: bitIndex < 0: -2147483648
at java.base/java.util.BitSet.get(BitSet.java:624)
at BitSetUniqueIpCounter.registerLongValue(BitSetUniqueIpCounter.java:29)
at BitSetUniqueIpCounter.countUniqueIp(BitSetUniqueIpCounter.java:43)
at IpCounterApp.main(IpCounterApp.java:19)


I’d instead use a single int[] seen = new int[1 << 27] ($$2272272^{27}$$ ints with $$25252^{5}$$ bits each, so $$2322322^{32}$$ bits total) and treat it like this:

    private void registerLongValue(long longValue) {
int index = (int) (longValue >> 5);
int bit = 1 << (longValue & 31);
if ((seen[index] & bit) == 0) {
counter++;
seen[index] |= bit;
}
}


Your toLongValue is rather complicated, you could use InetAddress (might also be faster):

    static long toLongValue(String ipString) throws UnknownHostException {
long result = 0;
result = (result << 8) | (b & 255);
return result;
}


I’m agree with TorbenPutkonen’s considerations about your code and with Manuel’s observation about edge ipaddresses cases like 255.255.255.255 that can break your code, I focused about your method toLongValue translating your ipv4 address to an unique long:

static long toLongValue(String ipString) {
StringBuilder field = new StringBuilder(3);
int startIndex = 0;
long result = 0;

for (int i = 0; i < 3; i++) {
int spacerPosition = ipString.indexOf('.', startIndex);
field.append(ipString, startIndex, spacerPosition);
int fieldValue = Integer.parseInt(field.toString());
field.setLength(0);
result += fieldValue * Math.pow(256, 3 - i);
startIndex = spacerPosition + 1;
}
result += Integer.parseInt(ipString.substring(startIndex));

return result;
}


Your method can be rewritten dividing your String ipv4 address with String.split and left shifting the octets with the logical | operator like below:

public static long toLongValue(String ipString) {
String[] octets = ipString.split("\.");
return Long.parseLong(octets[0]) << 24 |
Long.parseLong(octets[1]) << 16 |
Long.parseLong(octets[2]) <<  8 |
Long.parseLong(octets[3]);
}


The absence of a primitive type unsigned int in java forces to use the primitive long type, so your BitSet will break when you pass your long converted to a negative int. The idea in itself for me is a correct choice but it has to be refined to include all the addresses. My first idea for performance was about the construction of a trie, it seems that as usual Knuth solved the problem creating the Patricia tree, you can check wikipedia notes for more details.

Since it’s a job interview question, you probably want to:

• specify a package instead of using the default one. E.g. package chpter.one;
• put only the necessary code (BitSetUniqueIpCounter.java, IpCounterApp.java, UniqueIpCounter.java) in src/main/java. Anything that is only used for tests should be in src/test/java or src/test/resources/.
• Don’t forget to close the file, even if Files.lines fails.
• Don’t dump the whole stacktrace during a test which is expected to fail: tests are green but the console is still full of java.io.FileNotFoundException: no file....
• use a linter (e.g. SonarLint) to clean up the code and look for potential bugs.

├── pom.xml
└── src
├── main
│   ├── java
│   │   ├── BitSetUniqueIpCounter.java
│   │   ├── IpCounterApp.java
│   │   ├── NaiveStreamApiUniqueIpCounter.java
│   │   ├── TestFileGenerator.java
│   │   └── UniqueIpCounter.java
│   └── resources
│       └── million.txt
└── test
└── java
├── BitSetUniqueIpCounterTest.java
└── UniqueIpCounterTest.java


And it could look like:

├── pom.xml
└── src
├── main
│   └── java
│       └── chpter
│           └── one
│               ├── BitSetUniqueIpCounter.java
│               ├── IpCounterApp.java
│               └── UniqueIpCounter.java
└── test
├── java
│   └── chpter
│       └── one
│           ├── BitSetUniqueIpCounterTest.java
│           ├── NaiveStreamApiUniqueIpCounter.java
│           ├── TestFileGenerator.java
│           └── UniqueIpCounterTest.java
└── resources
└── million.txt


Note that the test folder looks fuller, and src/main is much leaner.

I see one little problem with the code you gave, even though I feel like it’s a pretty good take at the problem you have.

Best case/worst case, you always ask for 512Mb memory. Say I give you a file that has 1 million times the same address, the code still initialized a 514 megabytes array.

Instead of using an array structure, I believe a tree structure would be more appropriate. I will not write it, as I’ve not written Java in awhile, but let’s look at this.

There are 256 possible values for each of the 4 parts of an IP address, which leads you 256^4 possibilities (the 512 megabytes you use).

So, what if we had a tree structure with four levels of depth and 256 leafs per node?

In the worst case scenario where you’d use all possible values-ish (256^4 – 255 different addressed), you would require 512 megabytes.

But, depending on the distribution of the addresses you have in your file, you could take as little as 256*4 bits of memory.

The pseudo-algorithm I’d use would look something like this:

treeStructure = new treeStructure()
countUnique(list):
for element in list:
part1, part2, part3, part4 = parseIPString(element)

if doesnt exist treeStructure[part1]:
treeStructure[part1] = new treeStructure()

if doesnt exist treeStructure[part1][part2]:
treeStructure[part1][part2] = new treeStructure()

if doesnt exist treeStructure[part1][part2][part3]:
treeStructure[part1][part2][part3] = array[256]

if treeStructure[part1][part2][part3][part4] == 0:
treeStructure[part1][part2][part3][part4] = 1
counter++


Now, I know this code is poorly written and I wouldn’t show this in a job interview. It should be pretty easy to make it clean by yourself as you’ve done quite well in your original post. The goal of the pseudo-code is to show the algorithm.

This way, you allocate memory in chunks of 256 bits when you need it instead of allocating a whole 512 megabytes at first.