Problem
I need to parse the data coming from the URL which looks like this:
hasProcess=true
version=1
DATACENTER=abc
TotalNumberOfServers:4
primary:{0=1, 1=2, 2=1, 3=2, 4=1, 5=2, 6=1, 7=2, 8=1, 9=2, 10=1, 11=2, 12=1, 13=2}
secondary:{0=0, 1=0, 2=0, 3=1, 4=0, 5=0, 6=0, 7=1, 8=0, 9=0, 10=0, 11=1, 12=0, 13=0}
hosttomachine:{3=machineA, 2=machineB, 1=machineC, 4=machineD}
DATACENTER=pqr
TotalNumberOfServers:2
primary:{0=1, 1=2, 2=1, 3=2, 4=1, 5=2, 6=1, 7=2, 8=1, 9=2, 10=1, 11=2, 12=1, 13=2, 14=1}
secondary:{0=0, 1=0, 2=0, 3=1, 4=0, 5=0, 6=0, 7=1, 8=0, 9=0, 10=0, 11=1, 12=0, 13=0, 14=0}
hosttomachine:{1=machineP, 4=machineQ}
DATACENTER=tuv
TotalNumberOfServers:0
primary:{}
secondary:{}
hosttomachine:{}
After parsing the data I need to store each datacenter's data in a Map like this:
HashMap<String, Map<Integer, String>> primaryData
For example, the key of primaryData is abc and its value is
{0=1, 1=2, 2=1, 3=2, 4=1, 5=2, 6=1, 7=2, 8=1, 9=2, 10=1, 11=2, 12=1, 13=2}
which is for primary.
Similarly, another Map for secondary for each datacenter:
HashMap<String, Map<Integer, String>> secondaryData
For example, the key of secondaryData is abc and its value is
{0=0, 1=0, 2=0, 3=1, 4=0, 5=0, 6=0, 7=1, 8=0, 9=0, 10=0, 11=1, 12=0, 13=0}
which is for secondary.
And lastly, one more map for the hosttomachine mapping for each datacenter:
HashMap<String, Map<Integer, String>> hostMachineMapping
For example, the key of hostMachineMapping is abc and its value is
{3=machineA, 2=machineB, 1=machineC, 4=machineD}
which is for hosttomachine.
All of the above maps will have an entry for each datacenter; since there are three datacenters in the example above, each map will hold three entries. Also, I will parse the above response only when hasProcess is equal to true. If it is not true, then I won't parse anything.
This code takes more than 200 ms to parse the data and store it in the corresponding HashMaps. Is there any way to parse the above data more efficiently and store it in the particular HashMaps?
private void parseResponse(String response) throws Exception {
    if (response != null) {
        Map<String, Map<Integer, String>> primaryData = null;
        Map<String, Map<Integer, String>> secondaryData = null;
        Map<String, Map<Integer, String>> hostMachineMapping = null;
        long version = 0L;
        boolean changed = false;
        String splitResponse[] = response.split("DATACENTER=");
        boolean flag = false;
        for (String sr : splitResponse) {
            if (!flag) {
                flag = true;
                String[] header = sr.split("\n");
                changed = Boolean.parseBoolean(header[0].split("=")[1]);
                if (!changed) {
                    return;
                } else {
                    version = Integer.parseInt(header[1].split("=")[1]);
                    primaryData = new HashMap<String, Map<Integer, String>>();
                    secondaryData = new HashMap<String, Map<Integer, String>>();
                    hostMachineMapping = new HashMap<String, Map<Integer, String>>();
                }
            } else {
                generateDataCenterMapping(sr, primaryData, secondaryData, hostMachineMapping);
            }
        }
        if (changed) {
            Mapping.setPrimaryData(primaryData);
            Mapping.setSecondaryData(secondaryData);
            Mapping.setHostMachineMapping(hostMachineMapping);
            Mapping.setVersion(version);
        }
    }
}

private void generateDataCenterMapping(String sr, Map<String, Map<Integer, String>> primaryData,
                                       Map<String, Map<Integer, String>> secondaryData,
                                       Map<String, Map<Integer, String>> hostMachineMapping) throws Exception {
    String[] data = sr.split("\n\t");
    String dcName = data[0];
    int numOfServers = Integer.parseInt(data[1].split(":")[1]);
    if (numOfServers > 0) {
        primaryData.put(dcName, generateMap(data[2]));
        secondaryData.put(dcName, generateMap(data[3]));
        hostMachineMapping.put(dcName, generateMap(data[4]));
    }
}

private Map<Integer, String> generateMap(String map) throws Exception {
    String tableString = map.split(":")[1];
    Map<Integer, String> table = new HashMap<Integer, String>();
    tableString = tableString.substring(1, tableString.length() - 1);
    String[] entries = tableString.split(", ");
    for (String e : entries) {
        String[] entryVal = e.split("=");
        table.put(Integer.parseInt(entryVal[0]), entryVal[1]);
    }
    return table;
}
Solution
I started to write a comment, but it became too long.
Questions
- 200 ms is pretty close to eternity; how long is your input?
- Can you measure how long parseResponse alone takes?
- Could you post your data? Someone may try harder to optimize.
Potential slowness causes
I guess your String.split could be the culprit (with more than one character, a regex gets created). You should use
private static final Pattern COMMA_BLANK_PATTERN = Pattern.compile(", ");
for splitting, and use a split limit of 2. Probably faster would be using Guava's
private static final Splitter COMMA_BLANK_SPLITTER = Splitter.on(", ");
Even better would be piece-wise processing instead of splitting.
Note also that Integer.parseInt is internationalized and therefore slightly slower than necessary. Probably unimportant.
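For illustration, here is a minimal sketch of what the precompiled-pattern approach could look like for the inner map parsing; the class and method names are placeholders, not taken from the original code.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class EntryTableParser {
    // Compiled once and reused for every call, instead of recompiling per split.
    private static final Pattern COMMA_BLANK_PATTERN = Pattern.compile(", ");
    private static final Pattern EQUALS_PATTERN = Pattern.compile("=");

    // Parses a string like "0=1, 1=2, 2=1" into a map.
    Map<Integer, String> parseEntries(String tableString) {
        Map<Integer, String> table = new HashMap<>();
        for (String entry : COMMA_BLANK_PATTERN.split(tableString)) {
            // Limit of 2: split only at the first '=' of each entry.
            String[] keyValue = EQUALS_PATTERN.split(entry, 2);
            table.put(Integer.parseInt(keyValue[0]), keyValue[1]);
        }
        return table;
    }
}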
Review
if (response != null) {
As Mat's Mug said, this is wrong. I'm using Guava's Preconditions with a static import, writing simply
checkNotNull(response);
This makes it fail-fast rather than hiding the problem to bite you later.
String tableString = map.split(":")[1];
Map<Integer, String> table = new HashMap<Integer, String>();
tableString = tableString.substring(1, tableString.length() - 1);
This is pretty confusing: something else is squeezed between the two lines that define tableString.
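A simple rearrangement keeps the related lines together (same logic as the original, just reordered):

String tableString = map.split(":")[1];
tableString = tableString.substring(1, tableString.length() - 1);
Map<Integer, String> table = new HashMap<>();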
Summary
Overall, it’s not bad. When speed is important, I’d go for something like
MyParser parser = new MyParser(response).skip("hasProcess=");
boolean hasProcess = parser.readBoolean();
parser.skip("\nversion=");
int version = parser.readInt();
while (parser.skipIfLookingAt("\nDATACENTER=")) {
    parser.parseDatacenter(parser, whatever...);
}
It's just an idea (damn similar to java.util.Scanner, which isn't exactly known for speed), but I guess that string splitting is the problem (simply because I can't see anything else).
An example method could look like
MyParser skip(String prefix) {
    for (int i = 0; i < prefix.length(); ++i) {
        if (content.charAt(index++) != prefix.charAt(i)) {
            throw ...
        }
    }
    return this;
}
where content and index are instance variables. This part doesn't need any substring creation at all (and gets a bit unreadable because of this). Warning: Doing something like content = content.substring(index) (so you could use startsWith) would make it clearer but terribly slow, as you'd copy a big part of the string a lot of times.
Filling the maps
Now, I'd probably move all the functionality into MyParser, so that the parsing would be just
private void parseResponse(String response) {
    new MyParser(response).parse();
}
The parser would have fields like
private final Map<String, Map<Integer, String>> primaryData;
so that I could write
void parseDatacenter() {
    String datacenter = readTill('\n');
    skip("\n\tTotalNumberOfServers:");
    int totalNumberOfServers = readInt();
    skip("\n\tprimary:{");
    primaryData.put(datacenter, readMap());
    skip("\n\t");
}
Map<Integer, String> readMap() {
    Map<Integer, String> result = new HashMap<>();
    while (true) {
        String key = readTillAndSkip('=');
        String value = readTillOneOf("},");
        result.put(Integer.parseInt(key), value);
        if (lookingAt('}')) {
            break;
        }
        skip(", ");
    }
    return result;
}
As an example, see this simple method
String readTillOneOf(String terminators) {
    int start = index;
    while (terminators.indexOf(content.charAt(index)) == -1) {
        ++index;
    }
    return content.substring(start, index);
}
I ignored quite some details like parsing numbers and I can’t tell how many parsing methods you’ll need, but basically it’s all pretty simple. Just skip what’s fixed and process what you want while looking for a terminator.
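For completeness, a readInt in the same style could look roughly like this; it follows the same content/index convention as above and is just a sketch, not part of the original answer.

int readInt() {
    int start = index;
    // Accept an optional minus sign followed by digits.
    if (content.charAt(index) == '-') {
        ++index;
    }
    while (index < content.length() && Character.isDigit(content.charAt(index))) {
        ++index;
    }
    return Integer.parseInt(content.substring(start, index));
}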
It could get a bit more complicated if you needed to be error-tolerant, like allowing multiple spaces, but then you could let some of your methods skip over them. Or, if it gets difficult, use regexes. But I don’t think it gets complicated.
Note that I’m unsure why your parser is slow. There’s a lot of splitting and this is the probable cause, but it’s just a guess.
I find it awkward that you're wrapping the entire method's body in an if block:
if (response != null) {
In fact, I find it awkward that you're doing nothing with a null response, especially given an indicator such as throws Exception.
I would expect the method to throw some kind of ArgumentNullException (sorry if that's not in Java, I'm a C# guy, but I'm sure Java has something similar) given a null response string.
The typical way to do this, is to have a guard clause at the start of the method:
if (response == null) {
    throw new ArgumentNullException("response");
}
And then, go on with the method body, knowing you have a valid response. Notice that this just eliminated a whole nesting level across the entire method!
I find your logic rather convoluted, and I would expect parseResponse to return an object (or a collection of objects) holding the parsed data… not to have direct (or indirect) side-effects: I think your code is doing too many things.
The fact that parseResponse isn't a public member smells IMO, because parsing a response is a testable concern of its own, and there should be a class dedicated to making this work, where parseResponse would be public. Having it buried as an internal implementation detail of something bigger tells me that the only way you have of testing (and benchmarking) your parsing logic is by running the entire thing, presumably without any way of really knowing how much of the 200 ms is actually spent sending the request and receiving the response. You want to be able to feed that method with a "fake" response string and test-run all edge cases, without having to send an actual request to some remote server that could be offline for maintenance while you're running your tests!
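As a rough illustration of that separation (the class and field names here are made up, not the poster's actual design), the parsing could live in its own class and return a plain result object:

import java.util.HashMap;
import java.util.Map;

// Hypothetical: a dedicated, testable parser that returns its results
// instead of writing them into static state.
final class ResponseParser {
    public ParsedMappings parse(String response) {
        if (response == null) {
            throw new IllegalArgumentException("response must not be null");
        }
        ParsedMappings result = new ParsedMappings();
        // ... the existing parsing logic would go here, filling 'result' ...
        return result;
    }
}

// Hypothetical value object holding everything the parser produced.
final class ParsedMappings {
    final Map<String, Map<Integer, String>> primaryData = new HashMap<>();
    final Map<String, Map<Integer, String>> secondaryData = new HashMap<>();
    final Map<String, Map<Integer, String>> hostMachineMapping = new HashMap<>();
    long version;
}

A unit test or micro-benchmark can then call parse with a hand-written response string, with no network involved.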
Unfortunately, there is quite a bit of fuzziness with respect to the performance issues here. First, what is actually taking 200 ms? Is it just this method, exclusively? Is it this method plus the external methods it calls (the Mapping.set* methods)? Is it the entire request and parsing? Second, has there been any profiling done to narrow down some candidates for what exactly might be the bottleneck? It could be the string manipulation, the HashMap, or something else completely. Third, what kind of input are we looking at? There is a sample, but are there hundreds of datacenters or hundreds of thousands? If the number is truly large, 200 ms may actually be completely reasonable. There is also the possibility of being bound by some component of the hardware, but we can hand-wave that.
Given the constraints listed, here are a few suggestions. They are not tested because of the potential variability of the above. The shape and size of your actual data could lead to significant differences between the suggestions.
Strings
There is obviously a lot of string manipulation going on, particularly String#split. maaartinus goes into some detail on why this can lead to performance troubles, especially if it is using regexes internally. In place of some of the splits, String#substring, String#startsWith, and String#endsWith might be more performant.
It might also be faster to handle lines individually using a combination of StringReader and BufferedReader, or Scanner. Each of these has its own set of performance properties and optimizations to consider. They could even make your code slower depending on how they handle things internally, but it might be worth trying them.
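A minimal sketch of the line-wise approach, assuming the response uses plain '\n' separators as in the sample (the class name is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class LineWiseParser {
    void parse(String response) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(response))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("DATACENTER=")) {
                    String dcName = line.substring("DATACENTER=".length());
                    // ... dispatch to per-datacenter handling here ...
                }
            }
        }
    }
}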
HashMaps
HashMap is generally a great default choice for a map in Java. It provides O(1) insertion and retrieval in most cases. But it does have some quirks. First, if you are putting a lot of data in sequentially, the HashMap internally needs to keep growing its internal structures to keep up. This could cause performance problems because each time it grows, it needs to allocate a new chunk of memory, then rehash all the keys to put them into the new chunk. We could instead just tell HashMap what we would like the initial capacity to be, accounting for the load factor, so that this is not an issue. I have seen data to suggest that this may not be as big of an issue as suggested in the documentation, though.
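For example, a small helper along these lines could pre-size the per-datacenter maps; the expectedEntries parameter is a stand-in for however many entries you expect:

import java.util.HashMap;
import java.util.Map;

class PreSizedMaps {
    // expectedEntries is hypothetical: however many entries you expect per map.
    static Map<Integer, String> newPreSizedMap(int expectedEntries) {
        float loadFactor = 0.75f; // HashMap's default load factor
        // Capacity chosen so expectedEntries fit without triggering a resize.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        return new HashMap<>(initialCapacity, loadFactor);
    }
}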
Collisions can also affect the performance of a HashMap, but you are using built-in Java objects for keys. Generally, the built-in hashing algorithms produce few collisions on the general set of inputs.
The hashing algorithms themselves, though, may be a bottleneck depending on the shape of your data (the length of the keys). The built-in hashing algorithms are designed to be cheap, but they are not free. There is a possibility that the actual hashing is what is taking so long. This is doubtful, as the algorithms are cheap and the results can be cached, but it is a possibility.
Since you did not post what Mapping.set* does, I will make a suggestion based on a potentiality. If you are iterating over the HashMaps in those methods, you could be experiencing your performance losses there. Iterating over a HashMap is dependent on the capacity rather than the number of elements. A HashMap with a capacity much larger than the number of elements will slow down.
You can try to vary the type of Map you use to see if one of these is affecting the performance. For example, you could use a LinkedHashMap if iteration is the problem. You could test with a TreeMap to see if a completely different structure may actually be more suitable in practice. Because you have two levels of Map with different key types, they may each benefit from different optimizations.
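One lightweight way to run that experiment (purely an illustration, not from the original code) is to let the parsing code take the map constructor as a Supplier, so a benchmark can swap implementations without touching the parser:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

class MapFactoryExperiment {
    // The parsing code asks the supplier for a fresh map instead of
    // hard-coding 'new HashMap<>()'.
    static Map<Integer, String> buildTable(Supplier<Map<Integer, String>> mapSupplier) {
        Map<Integer, String> table = mapSupplier.get();
        table.put(0, "1");
        return table;
    }

    public static void main(String[] args) {
        // Swap the implementation here when benchmarking.
        buildTable(HashMap::new);
        buildTable(LinkedHashMap::new);
        buildTable(TreeMap::new);
    }
}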
Other things
I agree with @Mat’s Mug about using guard clauses to reduce the indentation, and I see more places where similar strategies may be appropriate.
I see you have a variable named flag, and I don't like that. It's used awkwardly, and the fact that it has such a nondescript name tells me that it doesn't have a good purpose. Let's look at what it actually does:
boolean flag = false;
for (String sr : splitResponse) {
    if (!flag) {
        flag = true;
        String[] header = sr.split("\n");
        changed = Boolean.parseBoolean(header[0].split("=")[1]);
        if (!changed) {
            return;
        } else {
            version = Integer.parseInt(header[1].split("=")[1]);
            primaryData = new HashMap<String, Map<Integer, String>>();
            secondaryData = new HashMap<String, Map<Integer, String>>();
            hostMachineMapping = new HashMap<String, Map<Integer, String>>();
        }
    } else {
        generateDataCenterMapping(sr, primaryData, secondaryData, hostMachineMapping);
    }
}
It is used in the loop to switch behavior. The conditional is a negated conditional with an else clause, which is kind of a smell. Generally, I would start by reversing the conditional, but here there's more. Immediately after the conditional, flag is set to true, and then never set again. I see what this is. This is a "just run me the first time" chunk of code. We can take this out, and just adjust the loop to skip the first iteration.
if (splitResponse.length == 0) {
    return;
}
String[] header = splitResponse[0].split("\n");
changed = Boolean.parseBoolean(header[0].split("=")[1]);
if (!changed) {
    return;
}
version = Integer.parseInt(header[1].split("=")[1]);
primaryData = new HashMap<>();
secondaryData = new HashMap<>();
hostMachineMapping = new HashMap<>();
for (int i = 1; i < splitResponse.length; i++) {
    String sr = splitResponse[i];
    generateDataCenterMapping(sr, primaryData, secondaryData, hostMachineMapping);
}
I pulled that code out, but I also did a few other things. Because the code is now outside the loop, I had to add a guard to make sure that there is at least one element in the splitResponse array. Since the conditional on !changed returned, the else was superfluous, so I just removed it and un-tabbed the code inside. I also abbreviated the HashMap instantiations with the diamond operator (Java 7+). The loop is much simpler and the boolean flag is completely gone.
The changed flag can also be eliminated through similar steps.
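A sketch of what parseResponse might look like once both flags are gone; it reuses the poster's existing helpers and Mapping setters, so treat it as an outline rather than drop-in code:

private void parseResponse(String response) throws Exception {
    Objects.requireNonNull(response); // guard clause, as suggested above
    String[] splitResponse = response.split("DATACENTER=");
    if (splitResponse.length == 0) {
        return;
    }
    String[] header = splitResponse[0].split("\n");
    // Bail out immediately when hasProcess is false; no 'changed' flag
    // needs to survive past this point.
    if (!Boolean.parseBoolean(header[0].split("=")[1])) {
        return;
    }
    long version = Integer.parseInt(header[1].split("=")[1]);
    Map<String, Map<Integer, String>> primaryData = new HashMap<>();
    Map<String, Map<Integer, String>> secondaryData = new HashMap<>();
    Map<String, Map<Integer, String>> hostMachineMapping = new HashMap<>();
    for (int i = 1; i < splitResponse.length; i++) {
        generateDataCenterMapping(splitResponse[i], primaryData, secondaryData, hostMachineMapping);
    }
    Mapping.setPrimaryData(primaryData);
    Mapping.setSecondaryData(secondaryData);
    Mapping.setHostMachineMapping(hostMachineMapping);
    Mapping.setVersion(version);
}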
I might also suggest (though the changes would be more significant) that, instead of having three separate Maps, you could create an object to hold the three pieces of data for each key, then have a single map from the key to the object. This would make the association between those data more explicit.
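For example (the class and field names here are hypothetical):

import java.util.Map;

// Hypothetical per-datacenter holder, so a single map can replace the three
// parallel maps keyed by datacenter name.
final class DatacenterData {
    final Map<Integer, String> primary;
    final Map<Integer, String> secondary;
    final Map<Integer, String> hostToMachine;

    DatacenterData(Map<Integer, String> primary,
                   Map<Integer, String> secondary,
                   Map<Integer, String> hostToMachine) {
        this.primary = primary;
        this.secondary = secondary;
        this.hostToMachine = hostToMachine;
    }
}

Parsing would then build a single Map<String, DatacenterData> keyed by datacenter name (e.g. "abc"), instead of filling primaryData, secondaryData, and hostMachineMapping separately.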