Problem
Just started learning Java Strings. Tried to implement the Soundex algorithm.
Package to hold the String related functions
package com.java.strings;
public class StringFunctions {
/**
* Removes all the spaces in a given String.
* E.g: : "A B CDD" becomes "ABCDD"
*/
public static String squeeze (String in) {
String temp = "";
StringBuilder sb = new StringBuilder(in.trim());
int i = 0;
while (i < sb.length()) {
if (sb.charAt(i) == ' ') {
// Starting with the current position, shift all
// the characters from right to left.
for (int j=i; j < sb.length() - 1; j++)
sb.setCharAt(j, sb.charAt(j+1));
// The length of string is reduced by 1
temp = sb.substring(0, sb.length()-1);
sb.setLength(0);
sb.append(temp);
}
// After shifting the characters from right to left, the new
// character in the current position might be a Space. If so,
// the same position has to be processed again .
if (sb.charAt(i) != ' ')
i++;
}
return sb.toString();
}
/**
* Removes Continuous Duplicate characters in a string.
* E.g: "AAABCCCDDDBB" becomes "ABCDB"
*/
public static String removeContDupChars(String in ) {
String temp = "";
StringBuilder sb = new StringBuilder(in);
int i = 0;
char prevChar;
while (i < sb.length()) {
prevChar = sb.charAt(i);
for (int j=i+1; j<sb.length(); j++) {
// As long as there are same characters, Replace all the Duplicates
// with Space.
if (prevChar == sb.charAt(j))
sb.setCharAt(j, ' ');
else
// Where there is a different char, break the inner loop.
break;
}
i++;
}
return squeeze(sb.toString());
}
}
Soundex algorithm implementation – https://en.wikipedia.org/wiki/Soundex
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.java.strings.*;
class SoundExClass {
public final static Map<Character, Character> map = new HashMap<Character, Character>() {{
put('B', '1');
put('F', '1');
put('P', '1');
put('V', '1');
put('C', '2');
put('G', '2');
put('J', '2');
put('K', '2');
put('Q', '2');
put('S', '2');
put('X', '2');
put('Z', '2');
put('D', '3');
put('T', '3');
put('L', '4');
put('M', '5');
put('N', '5');
put('R', '6');
put('A', 'A');
put('E', 'E');
put('I', 'I');
put('O', 'O');
put('U', 'U');
put('Y', 'Y');
put('H', 'H');
put('W', 'W');
}};
public char getValue(char ch) {
char value = map.get(ch);
return value;
}
public final static char[] vowels = {'A', 'E', 'I', 'O', 'U'};
public final static char[] notvowels = {'H', 'W'};
public final static char[] dropChars = {'A', 'E', 'I', 'O', 'U', 'Y', 'H', 'W'};
private String inputString;
public SoundExClass(String inputString) {
// this.inputString = Arrays.asList(inputString);
this.inputString = inputString.toUpperCase().trim();
}
public String implementSoundEx() {
StringBuilder sb = new StringBuilder(inputString);
char currentChar, nextChar = ' ';
// Remove continuous duplicate characters from the String.
// E.g, After the function call, a String 'AAABBBCCCDD' becomes 'ABCD'
String temp = StringFunctions.removeContDupChars(sb.toString());
sb.setLength(0);
sb.append(temp);
// If the first and second characters have the same number
// Replace the second character with Space.
if (getValue(sb.charAt(0)) == getValue(sb.charAt(1)))
sb.setCharAt(1, ' ');
// Replace the consonants with their corresponding mapped numbers.
// Do this for all characters, except the First char.
for (int i=1; i < sb.length(); i++) {
if (sb.charAt(i) != ' ' )
sb.setCharAt(i, getValue(sb.charAt(i)));
}
// Repeat the processing starting with the Second character.
for (int i=1; i < sb.length(); i++) {
currentChar = sb.charAt(i);
// If the current character is any of A,E,I,O,U,Y,H,W or Space, ignore it.
// Process the next character.
if ((Arrays.binarySearch(dropChars, currentChar) != -1) || (currentChar == ' '))
continue;
// If two or more letters with the same number are adjacent in, only retain the first letter;
// For this, put a space in the next character.
if (i+1 < sb.length()) {
nextChar = sb.charAt(i+1);
if (currentChar == nextChar)
sb.setCharAt(i+1, ' ');
}
// Two letters with the same number separated by 'H' or 'W' are coded as a single number
if (i+2 < sb.length()) {
if (currentChar == sb.charAt(i+2))
if (Arrays.binarySearch(notvowels, nextChar) != -1)
sb.setCharAt(i+2, ' ');
}
}
// Replace all the vowels with Spaces.
for (int i=1; i < sb.length(); i++) {
if (Arrays.binarySearch(vowels, sb.charAt(i)) != -1)
sb.setCharAt(i, ' ');
}
//
String x = sb.toString();
return StringFunctions.squeeze(x).substring(0, 4);
}
}
class SoundEx {
public static void main(String ...strings) {
SoundExClass soundex = new SoundExClass("Ashcraft");
System.out.println(soundex.implementSoundEx());
}
}
Solution
Bugs
The Soundex code for “Jackson” should be “J250”. You fail to elide the “C” and the “K”, and as a result, your code returns “J225” instead.
The Soundex code for “Wu” should be “W000”, and the code for “Google” should be “G240”. Your code crashes with a StringIndexOutOfBoundsException
for both.
If the input contains a non-alphabetic character, then SoundExClass.getValue()
crashes with a NullPointerException
due to unboxing a null Character
.
Organization
The com.java.strings
package name infringes on someone else’s namespace — assuming that you are not the owner of the java.com
domain.
The SoundExClass
class should be public
— how else will other people call your code? But …Class
is a pretty cumbersome Hungarian suffix. Furthermore, I think that there isn’t much point in forcing your users to split what should be a simple function call into an object instantiation and a method call. I would just make it
public class Soundex {
// Suppress default constructor
private Soundex() {}
public static String soundex(String name) {
…
}
}
In SoundExClass
, variables map
, vowels
, notvowels
, and dropChars
should not be public. You don’t want any other code to be able to alter their contents. (Note that being final
does not make them unmodifiable.)
The getValue()
method should be private, since it’s an implementation detail that nobody outside the class should be concerned about.
Implementation
removeContDupChars()
is superfluous. There is no point in removing consecutive characters in the input string. Just map the characters into their respective digits — you will eventually elide them anyway when you get to // If two or more letters with the same number are adjacent in, only retain the first letter
.
squeeze()
would be a lot simpler if you took advantage of StringBuilder.deleteCharAt()
. I think that the squeeze()
function should operate directly on a StringBuilder
.
The comments in implementSoundEx()
are helpful, but what would be even better is if each small code block were its own function operating on a StringBuilder
. That would make the functionality even clearer. Taking a cue from this Haskell solution, I would suggest rewriting implementSoundEx()
to look more like this:
public static String soundex(String s) {
s = s.toUpperCase().trim();
if (s.isEmpty()) throw new IllegalArgumentException();
StringBuilder sb = new StringBuilder(s);
digitize(sb);
removeContDupChars(sb);
squeeze(sb);
return sb.setCharAt(0, s.charAt(0)).append("000").setLength(4).toString();
}
StringFunctions
The squeeze ()
method is IMO a little bit over the top. The same can be achieved by using the replaceAll()
method of the String
class by using the regex patter s
like so
public static String squeeze (String in) {
return in.replaceAll("\s+","");
}
which looks a lot cleaner.
The Let me rephrase this: the removeContDupChars()
method is named wrong because it doesn’t remove continous duplicate chars but replaces them with a space
char.removeContDupChars()
contains misleading comments which lets one assume that it replaces continous duplicated chars by a space
char.
IMO a cleaner way would be to have a replaceContinousDuplicateChars(string in, char replacement)
which is doing exactly what the name implies. Leaving the fomrer removeContDupChars()
method like so
public static String removeContDupChars(String in) {
string result = replaceContinousDuplicateChars(in, ' ');
result = squeeze(result);
return result;
}
public static String replaceContinousDuplicateChars(String in, char replacement) {
String temp = "";
StringBuilder sb = new StringBuilder(in);
int i = 0;
while (i < sb.length()) {
char prevChar = sb.charAt(i);
for (int j = i + 1; j < sb.length(); j++) {
// As long as there are same characters, Replace all the Duplicates
// with Space.
if (prevChar == sb.charAt(j)) {
sb.setCharAt(j, ' ');
} else {
// Where there is a different char, break the inner loop.
break;
}
}
i++;
}
return sb.toString();
}
As you will notice I have added braces
to if..else
to make the indent more prominent. Another reason for braces is, although they might be optional, to prevent the code to become error prone. I have given your variables some space to breathe to make it easier to read.
I have moved the declaration of char prevChar
inside the while
loop and deleted the unused String temp
variable. You should declare your variables as near to their usage. This prevents such unused variables (if you ignore your IDE which usually tells you) and one doesn’t need to scroll arround to find that variable.
In addition you shouldn’t use abbreviations to name variables/methods or classes because it makes your code harder to read and understand.
I venture myself to tackle the implementSoundEx
. Let’s get to it:
StringBuilder sb = new StringBuilder(inputString);
char currentChar, nextChar = ' ';
// Remove continuous duplicate characters from the String.
// E.g, After the function call, a String 'AAABBBCCCDD' becomes 'ABCD'
String temp = StringFunctions.removeContDupChars(sb.toString());
sb.setLength(0);
sb.append(temp);
currentChar
and nextChar
don’t have to be declared here, they belong to your for
loop.
Try to keep variables decalarations to the most inner scope.
Also I probably could live without those variables, since you are using sb.charAt(i+2)
anyway.
Starting the builder with inputString
is useless here because you are clearing it later.
It would be better to call removeContDupChars
with inputString
and start the builder with temp
.
Also that kind of comment suits very well to the XML documentation but not here in the caller method.
// If the first and second characters have the same number
// Replace the second character with Space.
if (getValue(sb.charAt(0)) == getValue(sb.charAt(1)))
sb.setCharAt(1, ' ');
I don’t think that getValue
add many value to your code when it’s so easy to write map.get
.
It also takes the necessity to jump to the getValue
method to see what it does and IMO the intent of what you’re trying to do comes quite clear.
for (int i=1; i < sb.length(); i++){
//...
if (i+1 < sb.length())
//...
if (i+2 < sb.length()){
if (currentChar == sb.charAt(i+2))
if (Arrays.binarySearch(notvowels, nextChar) != -1)
//..
}
It turns out that you can trim the first if
if you don’t go until the end of your builder.
Just subtract 1
to your exiting condition and you’re good to go.
Also try to give some spacing between declarations and math operators.
The &&
operator is your friend here, with a line wrap it becomes a manageable condition and reduces code indentation.
If you don’t like the line wrap in the if you can always leave it to a boolean variable.
You could also include some braces, even though some people think that is a personal choice.
String x = sb.toString();
Common sb.toString()
is not that long that wouldn’t fit the next line.
But if you’re afraid of consuming much space or trying to be tidy you could always wrap the return in the dots (.
).
And now it’s time to post the resulting code:
public String implementSoundEx() {
String temp = StringFunctions.removeContDupChars(inputString);
StringBuilder sb = new StringBuilder(temp);
// If the first and second characters have the same number
// Replace the second character with Space.
if (map.get(sb.charAt(0)) == map.get(sb.charAt(1))){
sb.setCharAt(1, ' ');
}
// Replace the consonants with their corresponding mapped numbers.
// Do this for all characters, except the First char.
for (int i = 1; i < sb.length(); i++) {
if (sb.charAt(i) != ' ' ){
sb.setCharAt(i, map.get(sb.charAt(i)));
}
}
// Repeat the processing starting with the Second character.
for (int i = 1; i < sb.length() - 1; i++) {
// If the current character is any of A,E,I,O,U,Y,H,W or Space, ignore it.
// Process the next character.
if ((Arrays.binarySearch(dropChars, sb.charAt(i)) != -1) || (sb.charAt(i) == ' ')){
continue;
}
// If two or more letters with the same number are adjacent in, only retain the first letter;
// For this, put a space in the next character.
if (sb.charAt(i) == sb.charAt(i + 1)){
sb.setCharAt(i + 1, ' ');
}
// Two letters with the same number separated by 'H' or 'W' are coded as a single number
if (i + 2 < sb.length() &&
sb.charAt(i) == sb.charAt(i + 2) &&
Arrays.binarySearch(notvowels, sb.charAt(i + 1)) != -1
){
sb.setCharAt(i + 2, ' ');
}
}
// Replace all the vowels with Spaces.
for (int i = 1; i < sb.length(); i++) {
if (Arrays.binarySearch(vowels, sb.charAt(i)) != -1){
sb.setCharAt(i, ' ');
}
}
return StringFunctions.squeeze(sb.toString()).substring(0, 4);
}
Here’s an alternative implementation of your removeContDupChars
method.
public static String removeContDupChars(String in) {
StringBuilder sb = new StringBuilder();
char prev;
boolean first = true;
for (char next : in.toCharArray()) {
if (first || next != prev) {
if (next != ' ') {
// Remove condition to keep first space
sb.append(next)
}
prev = next;
}
first = false;
}
return sb.toString();
}
Taking advantage of the fact that StringBuilder
objects can expand their capacity dynamically, there is no need to start with anything in it, or to give spaces unintended remove-me meanings, or to do costly O(n2)
remove-and-shift operations: start with an empty one and append what you need as you go.
You could get rid of the first
flag by replicating the logic inside the loop before it for the first char only. But you would then have to explicitly check that the string is not empty. My Python background makes me like this approach better, YMMV.