Problem
I have been doing this test task for C# Developer role at UBS.London recently and have been rejected. No reason stated yet. Was wondering if you can take a look and provide feedback on what I can do to improve solution. The task description is below and solution itself is on GitHub. Also attached code for two main classes I have implemented. Unit tests are in solution. No UI provided as I thought it is not required as Unit tests are self explanatory.
Hope it will be helpful for someone else doing interviews for C# Dev roles.
Create a .net application that will solve the following problem. There
are no time constraints and you are free to use any resources at your
disposal.The Problem As an author I want to know the number of times each word
appears in a sentence So that I can make sure I’m not repeating myselfAcceptance Criteria Given a sentence When the program is run Then I’m
returned a distinct list of words in the sentence and the number of
times they have occurredExample Input: “This is a statement, and so is this.” Output: this – 2
is – 2 a – 1 statement – 1 and – 1 so – 1
I am also using Netfx-Guard library from Nuget
Solutions contains two projects: main project with SentenceParser
and StringSplitter
classes and unit testing projects.
SentenceParser
and StringSplitter
classes are below.
using System.Collections.Generic;
using System.Linq;
using UBS.TextParsing.Interfaces;
namespace UBS.TextParsing
{
public class SentenceParser
{
private IStringSplitter _splitter;
public SentenceParser(IStringSplitter splitter)
{
_splitter = splitter;
}
public IDictionary<string, int> Parse(string sentence)
{
Guard.NotNullOrEmpty(() => sentence, sentence);
var words = _splitter.SplitIntoWords(sentence);
var wordsCounts = from word in words
group word by word.ToLower()
into g
select new {Word = g.Key, Count = g.Count()};
return wordsCounts.ToDictionary(wc => wc.Word, wc => wc.Count);
}
}
}
Here is StringSplitter
class which splits sentence into words.
using System;
using System.Collections.Generic;
using System.Text;
using UBS.TextParsing.Interfaces;
namespace UBS.TextParsing
{
public class StringSplitter : IStringSplitter
{
public IEnumerable<string> SplitIntoWords(string sentence)
{
Guard.NotNull(() => sentence, sentence);
var stringList = new List<string>();
var currentWordSb = new StringBuilder();
foreach (var chr in sentence)
{
if (Char.IsPunctuation(chr) || Char.IsSeparator(chr) || Char.IsWhiteSpace(chr))
{
AddToList(currentWordSb, stringList);
currentWordSb.Clear();
}
else
{
currentWordSb.Append(chr);
}
}
AddToList(currentWordSb, stringList);
return stringList;
}
private void AddToList(StringBuilder stringBuilder, ICollection<string> collection)
{
var word = stringBuilder.ToString();
if(!string.IsNullOrEmpty(word))
collection.Add(word);
}
}
}
Solution
Guard.NotNullOrEmpty(() => sentence, sentence);
This introduces a dependency on a third-party library, which is 100 lines. Why not just write
if (string.IsNullOrEmpty(sentence))
{
throw new ...
}
The ISplitter
interface is overkill. As is the StringSplitter
class, probably — as @NickUdell mentioned, a regular expression looks like the right tool for the job here.
You have the core of the solution right here:
var words = _splitter.SplitIntoWords(sentence);
var wordsCounts = from word in words
group word by word.ToLower()
into g
select new {Word = g.Key, Count = g.Count()};
return wordsCounts.ToDictionary(wc => wc.Word, wc => wc.Count);
Which can be simplified to
var words = _splitter.SplitIntoWords(sentence.ToLower());
return words.GroupBy(word => word)
.ToDictionary(group => group.Key, group => group.Count());
Besides being a little shorter, we are creating fewer objects. Let’s say there are n
words returned by SplitIntoWords
, d
of which are distinct (up to case). The first version creates 2n
string objects (the n
returned by SplitIntoWords
and these same n
words lower-cased) and d
instances of the anonymous type. The second version creates n+1
string objects. This might not be a concern, but since it’s an easy win, we should just take it.
SplitIntoWords
could be written using yield
. Again, not a big difference but it makes the code a bit more re-usable, since client code can use Take
, First
, etc without having to have generate every word in the string.
public IEnumerable<string> SplitIntoWords(string sentence)
{
var word = new StringBuilder();
foreach (var chr in sentence)
{
if (Char.IsPunctuation(chr) || Char.IsSeparator(chr) || Char.IsWhiteSpace(chr))
{
if (word.Length > 0)
{
yield return word.ToString();
word.Clear();
}
}
else
{
word.Append(chr);
}
}
if (word.Length > 0)
{
yield return word.ToString();
}
}
The purpose of the solution should be to get a distinct list of words and the number of occurance in the sentence.
SplitIntoWords() method
The code is iterating over each character inside the String
sentence and checks if the character is either a punctuation
, a separation character
or a white space
character. If this is true, you assume the end of a word.
A better approach would have been to check if Char.IsLetter()
and if that is not true the end of a word is reached. In this way not only the speratation, punctuation and white space is handled but also digits, control characters etc.
AddToList() method
For interview coding tests you should always use braces {}
also for single if
statements.
Parse() method
The checking with Guard.NotNullOrEmpty()
should at least be replaced with Guard.NotNull()
, as this is the only case where an exception should be thrown.
But as mjolka alreday said, it would be better to use the built in methods. Also consider that for an input parameter which only holds whitespace an empty Dictionary<String,int>
should be returned.
So this should be changed to
if (sentence == null) { throw new ArgumentNullException("sentence"); }
if (string.IsNullOrWhiteSpace(sentence)) { return new Dictionary<String, int>(); }
You are calling word.ToLower()
on each word you got from the call of the SplitIntoWords()
method. It would be better to call the ToLower()
method of the provided input parameter.
Naming
It’s unwise to include types in your naming, unless it’s unavoidable.
var stringList = new List<string>();
The variable name tells us nothing about it’s purpose. It’s a list of strings? Of course it is, that’s what the data type is!
Instead you should name it:
var words = new List<string>();
Because that’s what it actually represents.
Additionally your function AddStringToList
isn’t very helpful. Firstly, it’s not even taking a List
parameter, it’s a Collection
, and secondly, you don’t mention that it actually gets a string from the StringBuilder
parameter and adds it to the list.
Comments
You haven’t commented anywhere. It’s probably unnecessary to add metadata comments to the functions here, unless you’re building a documentation file, but since it’s an interview question you should be going all out.
Code
I have no idea what your guard statement looks like, but what is the reasoning behind passing in an Action
that gets the sentence
variable along with the sentence
variable itself?
EDIT: Upon looking at the full source it appears to be from an external library.
Is there any reason why you don’t use Regex
here? It’s kinda what it was designed for. Most of your string splitting code could be replaced with a single Regex.Split
call, or even a couple of string.Split
calls.
Your method AddToList
doesn’t depend on any instance data and so should be static. In fact, the entire class could be static, if it wasn’t for the fact you built it to an interface.
Globalization
You should specify a CultureInfo
for your .ToLower
calls. For this, I recommend CultureInfo.CurrentCulture
since you’re dealing with word that will be shown on screen. There is some argument for InvariantCulture
because you’re performing comparisons on them. I find it unlikely the culture would change during execution, but to be safest, it would be best to cache the value of CultureInfo.CurrentCulture
before execution.
The Good
You’re coding to interfaces, you’re trying to stay DRY (even if I doubt the helpfulness of AddToList
), using LINQ and guarding your arguments. That’s already a great starting point.
your naming makes me stop and think much harder than I should…
foreach (var chr in sentence)
are you iterating over each character, in the sentence? looking at this I think logically I would iterate through the words of the sentence and then the characters of the words… so this confuses me a little.
your parameter shouldn’t be sentence
it should be something like inputString
or stringToBeSplit
.
and the variable in the foreach statement should be character
instead of chr
just to be absolutely clear that you are iteration by character. heck, I would even go as far as saying you could have saved it also by writing it like this
foreach (char chr in sentence)
a string
is a char
array so it will allow you to do this.
Instead of all that fun stuff, let’s get rid of all the punctuation to begin with
You should use this to replace
- Commas
- Periods
- Double and single quotes
- Parenthesis
- Brackets
- etc.
I would probably make this a Static Method in which you can code all of these in and call it on the entire sentence, and a Regex Replace might do well here to limit the number of lines of code inside this (new)method.
Again, this should be done before you try to split the sentence into words. it wouldn’t make sense to try and split on punctuation in the Sentence.
Then, I would use String.Split
and create a string array, maybe create a dictionary of the distinct words from the first array, then loop through the first loop and increment an integer value the corresponds to that word in the dictionary.
When you are done looping then print out your Dictionary.
Looks like this is what you were doing, but then you loop through the Sentence character by character and that doesn’t make sense; you have characters, then you have words, then you have sentences, then you have Paragraphs, then you pages, then chapters, then it can go to acts or sections and then books. you skipped from characters to Sentences… Why?
I imagine that you were worried about contractions, that should be a word-level
function/method to tell other word objects being compared to it that there are actually 2 words inside that object.
You should create a word object(class) and create some operator overloads for it, so that you can compare contractions with non-contractions (don't = do
or don't = not
) where it should be counted in 2 columns.
you have the general idea of how to do it, but you went off on that one weird tangent. at least I would say it’s tangential.
Now comes the issue with breaking the sentences out of the paragraphs.
You receive a big blob of text, so String.Split
it on periods (as long as you are certain there will not be any laying around in numbers….of course you could exclude the numbers from the start as well).
You could still run the initial Punctuation Replace before you separate into Sentences by excluding Periods from the punctuation list.