Problem
This code sorts string that may contain numbers in natural order, that’s it, “item 2” comes before than “item 10”. It currently ignores case. I plan to implement options to handle case sensitivity and descending order.
For me it’s fast enough, faster than other implementations I’ve seen using regexes. Apart from the first lines, it’s JavaScript.
I made it returning the actual sorting function so we can cache its config object.
I’d like to know if it have any issues regarding Unicode, code style, etc. I plan to handle descending order like this, codified into a helper function.
return 0 - result;
For case sensitivity, I’d strip toUpperCase()
.
"use strict";
interface NaturalKeyFunc {
(obj: any): any;
}
interface NaturalSortConfig {
keyFunc?: NaturalKeyFunc;
//desc?: boolean;
//caseInsensitive?: boolean;
}
function naturalSort(config?: NaturalSortConfig) {
return function(a, b) {
if (typeof a === "number" && typeof b === "number")
return a - b;
if (a === null || b === null)
return NaN;
if (a === undefined || b === undefined)
return NaN;
if (Array.isArray(a) || Array.isArray(b))
return NaN;
if (Object.prototype.toString.call(a) === "[object Object]" || typeof Object.prototype.toString.call(b) === "[object Object]") {
if (!config && !config.keyFunc)
return NaN;
if (typeof a === "object")
a = config.keyFunc(a);
if (typeof b === "object")
b = config.keyFunc(b);
}
if (typeof a !== "string" || typeof b !== "string")
return a - b;
if (a.length === 0 || b.length === 0)
return a.length - b.length;
var s1: string = a,
s2: string = b,
indexA = 0, indexB = 0,
numberA, numberB,
charA, charB,
difference;
do {
numberA = 0;
while (indexA < s1.length) {
charA = s1[indexA];
indexA++;
if (!(charA >= "0" && charA <= "9")) break;
numberA = numberA * 10 + parseInt(charA);
}
numberB = 0;
while (indexB < s2.length) {
charB = s2[indexB];
indexB++;
if (!(charB >= "0" && charB <= "9")) break;
numberB = numberB * 10 + parseInt(charB);
}
if (charA === " " && charB === " ") {
while (indexA < s1.length && s1[indexA] === " ") {
indexA++;
}
while (indexB < s2.length && s2[indexB] === " ") {
indexB++;
}
}
difference = numberA - numberB;
if (difference !== 0) return difference;
if (charA !== charB) {
if (charA === ".") return -1;
if (charB === ".") return 1;
return charA.toUpperCase().charCodeAt(0) - charB.toUpperCase().charCodeAt(0);
}
} while (indexA < s1.length && indexB < s2.length);
return (indexB - s2.length) - (indexA - s1.length);
}
}
Some tests:
let natu = naturalSort();
it("can sort numbers inside strings", function () {
expect(natu("a1", "a2")).toBeLessThan(0);
expect(natu("1 item", "1 item")).toEqual(0);
expect(natu("item 1", "item 1")).toEqual(0);
expect(natu("item 1", "item 2")).toBeLessThan(0);
expect(natu("item 1", "item 10")).toBeLessThan(0);
expect(natu("item 2", "item 10")).toBeLessThan(0);
expect(natu("item 1 item a", "item 1 item a")).toEqual(0);
expect(natu("item 1 item 1", "item 1 item 10")).toBeLessThan(0);
expect(natu("item 01", "item 1")).toEqual(0);
});
Solution
Dead code
interface NaturalSortConfig {
keyFunc?: NaturalKeyFunc;
//desc?: boolean;
//caseInsensitive?: boolean;
}
Dead code does nothing, except take up screen space, and confuse people who don’t know why it was written and removed in the first place.
Braces
This is a bit of a holy-war. However, at least one bug (the Apple SSL bug) was caused by missing braces. Personally, I use them:
if (typeof a === "number" && typeof b === "number") {
return a - b;
}
Instead of:
if (typeof a === "number" && typeof b === "number")
return a - b;
Functions
You may want to consider splitting this into more methods. One possible method might be to move the code to determine whether to early-return into its own function:
function IsValidData (o) {
return o === null || o === undefined || Array.isArray(o);
}
Most of the code in naturalSort()
is duplicated – more functions would really help here.
Early exit
As mentioned in the comments in the previous (deleted) answer, you can early-return as soon as you know which value is sorted higher. For example, sorting “aa” against “bb” should be able to return after checking the first character of each, while “aaa” vs. “aba” should be able return after checking the second, and so on.