# Counting occurrences of each word in a sentence

Posted on

Problem

I have been doing this test task for C# Developer role at UBS.London recently and have been rejected. No reason stated yet. Was wondering if you can take a look and provide feedback on what I can do to improve solution. The task description is below and solution itself is on GitHub. Also attached code for two main classes I have implemented. Unit tests are in solution. No UI provided as I thought it is not required as Unit tests are self explanatory.

Hope it will be helpful for someone else doing interviews for C# Dev roles.

Exercise description

Create a .net application that will solve the following problem. There
are no time constraints and you are free to use any resources at your
disposal.

The Problem As an author I want to know the number of times each word
appears in a sentence So that I can make sure I’m not repeating myself

Acceptance Criteria Given a sentence When the program is run Then I’m
returned a distinct list of words in the sentence and the number of
times they have occurred

Example Input: “This is a statement, and so is this.” Output: this – 2
is – 2 a – 1 statement – 1 and – 1 so – 1

Solution is on GitHub

I am also using Netfx-Guard library from Nuget

Solutions contains two projects: main project with SentenceParser and StringSplitter classes and unit testing projects.

SentenceParser and StringSplitter classes are below.

using System.Collections.Generic;
using System.Linq;
using UBS.TextParsing.Interfaces;

namespace UBS.TextParsing
{
public class SentenceParser
{
private IStringSplitter _splitter;

public SentenceParser(IStringSplitter splitter)
{
_splitter = splitter;
}

public IDictionary<string, int> Parse(string sentence)
{
Guard.NotNullOrEmpty(() => sentence, sentence);

var words = _splitter.SplitIntoWords(sentence);

var wordsCounts = from word in words
group word by word.ToLower()
into g
select new {Word = g.Key, Count = g.Count()};

return wordsCounts.ToDictionary(wc => wc.Word, wc => wc.Count);
}
}
}


Here is StringSplitter class which splits sentence into words.

using System;
using System.Collections.Generic;
using System.Text;
using UBS.TextParsing.Interfaces;

namespace UBS.TextParsing
{
public class StringSplitter : IStringSplitter
{
public IEnumerable<string> SplitIntoWords(string sentence)
{
Guard.NotNull(() => sentence, sentence);

var stringList = new List<string>();

var currentWordSb = new StringBuilder();

foreach (var chr in sentence)
{
if (Char.IsPunctuation(chr) || Char.IsSeparator(chr) || Char.IsWhiteSpace(chr))
{
currentWordSb.Clear();
}
else
{
currentWordSb.Append(chr);
}
}

return stringList;
}

private void AddToList(StringBuilder stringBuilder, ICollection<string> collection)
{
var word = stringBuilder.ToString();
if(!string.IsNullOrEmpty(word))
}
}
}


Solution

Guard.NotNullOrEmpty(() => sentence, sentence);


This introduces a dependency on a third-party library, which is 100 lines. Why not just write

if (string.IsNullOrEmpty(sentence))
{
throw new ...
}


The ISplitter interface is overkill. As is the StringSplitter class, probably — as @NickUdell mentioned, a regular expression looks like the right tool for the job here.

You have the core of the solution right here:

      var words = _splitter.SplitIntoWords(sentence);

var wordsCounts = from word in words
group word by word.ToLower()
into g
select new {Word = g.Key, Count = g.Count()};

return wordsCounts.ToDictionary(wc => wc.Word, wc => wc.Count);


Which can be simplified to

var words = _splitter.SplitIntoWords(sentence.ToLower());
return words.GroupBy(word => word)
.ToDictionary(group => group.Key, group => group.Count());


Besides being a little shorter, we are creating fewer objects. Let’s say there are n

$n$

$n$ words returned by SplitIntoWords, d

$d$

$d$ of which are distinct (up to case). The first version creates 2n

$2n$

$2n$ string objects (the n

$n$

$n$ returned by SplitIntoWords and these same n

$n$

$n$ words lower-cased) and d

$d$

$d$ instances of the anonymous type. The second version creates n+1

$n+1$

$n + 1$ string objects. This might not be a concern, but since it’s an easy win, we should just take it.

SplitIntoWords could be written using yield. Again, not a big difference but it makes the code a bit more re-usable, since client code can use Take, First, etc without having to have generate every word in the string.

public IEnumerable<string> SplitIntoWords(string sentence)
{
var word = new StringBuilder();
foreach (var chr in sentence)
{
if (Char.IsPunctuation(chr) || Char.IsSeparator(chr) || Char.IsWhiteSpace(chr))
{
if (word.Length > 0)
{
yield return word.ToString();
word.Clear();
}
}
else
{
word.Append(chr);
}
}

if (word.Length > 0)
{
yield return word.ToString();
}
}


The purpose of the solution should be to get a distinct list of words and the number of occurance in the sentence.

SplitIntoWords() method

The code is iterating over each character inside the String sentence and checks if the character is either a punctuation, a separation character or a white space character. If this is true, you assume the end of a word.

A better approach would have been to check if Char.IsLetter() and if that is not true the end of a word is reached. In this way not only the speratation, punctuation and white space is handled but also digits, control characters etc.

For interview coding tests you should always use braces {} also for single if statements.

Parse() method

The checking with Guard.NotNullOrEmpty() should at least be replaced with Guard.NotNull(), as this is the only case where an exception should be thrown.

But as mjolka alreday said, it would be better to use the built in methods. Also consider that for an input parameter which only holds whitespace an empty Dictionary<String,int> should be returned.
So this should be changed to

if (sentence == null) { throw new ArgumentNullException("sentence"); }
if (string.IsNullOrWhiteSpace(sentence)) { return new Dictionary<String, int>(); }


You are calling word.ToLower() on each word you got from the call of the SplitIntoWords() method. It would be better to call the ToLower() method of the provided input parameter.

Naming

It’s unwise to include types in your naming, unless it’s unavoidable.

var stringList = new List<string>();

The variable name tells us nothing about it’s purpose. It’s a list of strings? Of course it is, that’s what the data type is!

var words = new List<string>();

Because that’s what it actually represents.

Additionally your function AddStringToList isn’t very helpful. Firstly, it’s not even taking a List parameter, it’s a Collection, and secondly, you don’t mention that it actually gets a string from the StringBuilder parameter and adds it to the list.

You haven’t commented anywhere. It’s probably unnecessary to add metadata comments to the functions here, unless you’re building a documentation file, but since it’s an interview question you should be going all out.

Code

I have no idea what your guard statement looks like, but what is the reasoning behind passing in an Action that gets the sentence variable along with the sentence variable itself?

EDIT: Upon looking at the full source it appears to be from an external library.

Is there any reason why you don’t use Regex here? It’s kinda what it was designed for. Most of your string splitting code could be replaced with a single Regex.Split call, or even a couple of string.Split calls.

Your method AddToList doesn’t depend on any instance data and so should be static. In fact, the entire class could be static, if it wasn’t for the fact you built it to an interface.

Globalization

You should specify a CultureInfo for your .ToLower calls. For this, I recommend CultureInfo.CurrentCulture since you’re dealing with word that will be shown on screen. There is some argument for InvariantCulture because you’re performing comparisons on them. I find it unlikely the culture would change during execution, but to be safest, it would be best to cache the value of CultureInfo.CurrentCulture before execution.

The Good

You’re coding to interfaces, you’re trying to stay DRY (even if I doubt the helpfulness of AddToList), using LINQ and guarding your arguments. That’s already a great starting point.

your naming makes me stop and think much harder than I should…

foreach (var chr in sentence)


are you iterating over each character, in the sentence? looking at this I think logically I would iterate through the words of the sentence and then the characters of the words… so this confuses me a little.

your parameter shouldn’t be sentence it should be something like inputString or stringToBeSplit.

and the variable in the foreach statement should be character instead of chr just to be absolutely clear that you are iteration by character. heck, I would even go as far as saying you could have saved it also by writing it like this

foreach (char chr in sentence)


a string is a char array so it will allow you to do this.

Instead of all that fun stuff, let’s get rid of all the punctuation to begin with

You should use this to replace

• Commas
• Periods
• Double and single quotes
• Parenthesis
• Brackets
• etc.

I would probably make this a Static Method in which you can code all of these in and call it on the entire sentence, and a Regex Replace might do well here to limit the number of lines of code inside this (new)method.

Again, this should be done before you try to split the sentence into words. it wouldn’t make sense to try and split on punctuation in the Sentence.

Then, I would use String.Split and create a string array, maybe create a dictionary of the distinct words from the first array, then loop through the first loop and increment an integer value the corresponds to that word in the dictionary.

When you are done looping then print out your Dictionary.

Looks like this is what you were doing, but then you loop through the Sentence character by character and that doesn’t make sense; you have characters, then you have words, then you have sentences, then you have Paragraphs, then you pages, then chapters, then it can go to acts or sections and then books. you skipped from characters to Sentences… Why?

I imagine that you were worried about contractions, that should be a word-level function/method to tell other word objects being compared to it that there are actually 2 words inside that object.

You should create a word object(class) and create some operator overloads for it, so that you can compare contractions with non-contractions (don't = do or don't = not) where it should be counted in 2 columns.

you have the general idea of how to do it, but you went off on that one weird tangent. at least I would say it’s tangential.

Now comes the issue with breaking the sentences out of the paragraphs.

You receive a big blob of text, so String.Split it on periods (as long as you are certain there will not be any laying around in numbers….of course you could exclude the numbers from the start as well).

You could still run the initial Punctuation Replace before you separate into Sentences by excluding Periods from the punctuation list.