Problem
I am new to C# and would appreciate it if you review my code. This code finds the anagrams in a list of string. Where can I be more efficient in my code? is there any way I can save memory?
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Linq;
using System.Text;
namespace RA
{
class Anagram
{
//private List<List<string>> anagrams;
public static List<List<string>> find(List<string> strList)
{
var anagrams = new List<List<string>>();
var anagramsTable = new Dictionary<string, List<string>>();
foreach (string str in strList)
{
var sortedStr = string.Concat(str.OrderBy(c => c));
string.Concat(sortedStr.OrderBy(c => c));
if (anagramsTable.ContainsKey(sortedStr))
{
anagramsTable[sortedStr].Add(str);
}
else
{
var values = new List<string>();
values.Add(str);
anagramsTable.Add(sortedStr, values);
}
}
foreach (var pair in anagramsTable)
{
anagrams.Add(pair.Value);
}
return anagrams;
}
static void Main(string[] args)
{
var words = new List<string>();
words.Add("ACT");
words.Add("CAT");
words.Add("TAC");
var anagrams = find(words);
foreach (var sublist in anagrams)
{
foreach (var item in sublist)
Console.Write(item + " ");
}
}
}
}
Solution
I won’t repeat the germane critiques of the other answers, which are good. Rather than concentrating on the line-by-line improvements, I have some suggestions for the general organization of the code.
First off, your general technique here is sound. But I want to describe two techniques you can use to make the code easier to read and understand, and to make more powerful tools that you can use in other projects.
The first is to notice that you have created a new data structure but done it in a completely ad-hoc way. You have built a data structure called “multidictionary” — that is, a dictionary that is a map from a key to a set of values — out of a dictionary. That’s great, but your code would be easier to read and better organized if you actually just made a multidictionary type!
sealed class Multidictionary<K, V>
{
private Dictionary<K, List<V>> dictionary = ...
// Can you fill in the rest of this class?
// Here are some methods you'll probably want:
public void Add(K k, V v) { ... }
public IEnumerable<K> Keys { get ... }
public IEnumerable<IEnumerable<V>> Values { get ... }
}
(Of course you can always download a professional multidictionary class; there are plenty of such packages available. But for the learner it is skill-building to make your own.)
Once you have that class, look at how much nicer your main algorithm gets:
var anagramsTable = new Multidictionary<string, string>();
foreach (string str in strList)
anagramsTable.Add(string.Concat(str.OrderBy(c => c), str);
var anagrams = anagramsTable.Values
.Select(vs => vs.ToList()).ToList();
By offloading the boring mechanical details into its own class, the meaning of the code becomes very clear. What are we doing? Putting each string into a multidictionary keyed on a canonicalization, then extracting the values of that multidictionary into a list of lists. The code makes it very clear what is happening.
Some of the other answers have suggested that you can do the same with GroupBy
. That is completely accurate. You now have some insight into how GroupBy
is implemented. GroupBy
constructs a multidictionary behind the scenes!
Let’s for the sake of pedagogy assume that GroupBy
does not exist or you are not familiar with it. Can we improve this further? Yes, by again seeing that there is a generalization. What are you really doing here, in general? You are taking a set and partitioning it into equivalence classes.
In case you are not familiar with this jargon, let me give you a brief explanation. An equivalence relation is a function that takes two things and returns a Boolean that means “equivalent” if true and “not equivalent” if false.
Equivalence relations follow three rules:
- X is always equivalent to itself no matter what X is. This is the reflexive property.
- The equivalence of X and Y is the same as the equivalence of Y and X. This is the symmetric property.
- If X is equivalent to Y and Y is equivalent to Z then X must be equivalent to Z. This is the transitive property.
The two most basic equivalence relations are the two extremes. Equality — two things are equivalent if they are exactly equal and unequivalent otherwise — is the equivalence relation where the least number of things are equivalent to each other, and “everything is equivalent to everything” is the one with the most.
The tricky bit comes when we have equivalences that are somewhere in the middle. “Two strings are equivalent if they are anagrams of each other” is an equivalence relation (exercise: check that all three properties are met).
If you have a set of things — strings in your case — then a partition into equivalence classes is what you are doing. You are finding all the subsets of those strings which are equivalent to each other; that subset is called an “equivalence class”.
As you have discovered, an efficient algorithm for partitioning a set into its equivalence classes is the algorithm:
- For each member of the set, find the “canonical value” of the equivalence class, even if that member is not in the set itself. That is in your case, the smallest string in lexicographic order that is equivalent to the given string.
- Group the members of the set on equality of their associated canonical value.
So my challenge to you is: can you implement this algorithm?
static List<List<T>> Partition(
this IEnumerable<T> items,
Func<T, T> canonicalize)
{ ... }
If you can, then your specific problem becomes a one-liner!
var anagrams = items.Partition(s => string.Concat(s.OrderBy(c => c));
And you will then have a new tool in your toolbox. I use the partition-by-canonical-value function all the time in my work.
Again, this is just a special case of GroupBy
, as noted in other answers. However, consider now this challenge. Suppose we do not have a canonicalizer but we do have a relation:
static List<List<T>> Partition(
this IEnumerable<T> items,
Func<T, T, bool> relation)
{ ... }
Can you come up with an efficient implementation of the partition function only having a predicate which tells you when two things are equivalent? Remember that you are allowed to take advantage of the reflexive, symmetric and transitive properties of the relation. A naive algorithm is straightforward but this is a surprisingly tricky problem to make efficient. The paper https://link.springer.com/chapter/10.1007%2F978-3-319-21840-3_36 has an analysis of this problem if it interests you.
Another exercise to consider: suppose your original list contains duplicates. Is it correct to have duplicates in the output list of lists? Or should they be deduplicated? Can you modify the algorithms you’ve created so far to ensure that they are efficiently deduplicated?
Some quick remarks:
-
Please follow the Capitalization Conventions.
-
Use meaningful names.
strList
is bad in various ways: not only does it contain the name of its type (List
), it also contains a pointless abbreviation. Why not name this input parameterwords
, the term you actually use in yourMain()
?- Same for
str
orsortedStr
. anagramsTable
does not tell me what it actually is.values
is too generic.RA
is an unintelligible namespace name.
-
If you need to work with the value of a key in a dictionary, use
TryGetValue()
instead ofContainsKey()
. -
The
foreach (var pair in anagramsTable)
section can be replaced by a simple LINQ query.
var sortedStr = string.Concat(str.OrderBy(c => c));
string.Concat(sortedStr.OrderBy(c => c));
I think the second line is redundant as you don’t bind its result to anything.
var anagrams = new List<List<string>>();
This is also redundant, because you actually have the lists of anagram items in the dictionary. So instead of moving the anagram lists from the dictionary to this list, you can just query the dictionary like:
return anagramsTable.Select(kvp => kvp.Value).ToList();
at the end of the method.
Instead of this:
if (anagramsTable.ContainsKey(sortedStr))
{
anagramsTable[sortedStr].Add(str);
}
else
{
var values = new List<string>();
values.Add(str);
anagramsTable.Add(sortedStr, values);
}
You should use (as BCdotWEB mentions) TryGetValue()
:
if (!anagramsTable.TryGetValue(sortedStr, out var anagrams))
{
anagramsTable[sortedStr] = anagrams = new List<string>();
}
anagrams.Add(sortedStr);
You could go the full linq way though:
public static List<List<string>> find(List<string> strList)
{
return strList.GroupBy(s => string.Concat(s.OrderBy(c => c))).Select(gr => gr.ToList()).ToList();
}
Other answers have pointed out the code issues, but you also asked about potential improvements in efficiency. Currently this program runs comparisons in O(n log n) time (per comparison), mainly due to the sorting algorithm. (Any default library sorting algorithm is going to have average case O(n log n) time.) Can we do better? Sure. This problem can be solved in O(2n + 2m) time (n being the number of characters per word and m being the number of unique letters in each word, strictly equal to or less than n by definition, so worst case O(4n)):
- Make a
Dictionary<char, int>
. (Maybe changeint
tolong
if you’re testing really big data…) - For each character in the string, if its key is not in the dictionary, add it, setting the value to 1. If it is in the dictionary, add 1 to its value. This gives you a list of counts for each character in the string.
- For each character in the other strings to compare, simply repeat 1 and 2.
- Loop through keys on the dictionaries and compare your counts. If any key exists in one dictionary and not the other, or there are two values with different counts, the words are not anagrams. Otherwise, they are.
Note: This comes at the expense of a some memory to store those dictionaries, but that’s usually the case when you trade one resource (memory) for another (time). The benefits most likely won’t be noticeable for small data sets like yours (your longest example has only 4 letters), but for larger data sets, this algorithm will scale more gracefully.
You could also improve on this with some obvious sanity checks: e.g. if the strings are not the same length you can immediately return false
without even bothering to check anything else, etc.
If you’re only comparing 2 words, you could also combine the last 2 steps – only keep one dictionary, and on the second word you decrease the counts in the dictionary for each letter (immediately shortcut returning false
if a key is not found in the second word, and removing any entry whose count is reduced to 0), then check that the dictionary is empty at the end. (This only works for comparing just 2 words, because it destroys the dictionary in the process, but might be more efficient overall.)
Answering questions of @Eric Lippert, I have done some refactoring.
The Multidictionary
class I implemented as follow
public class Multidictionary<K, V>
{
private Dictionary<K, List<V>> dictionary;
public Multidictionary()
{
dictionary = new Dictionary<K, List<V>>();
}
public void Add(K k, V v)
{
List<V> list;
dictionary.TryGetValue(k, out list);
if (list == null)
{
list = new List<V>();
dictionary.Add(k, list);
}
list.Add(v);
}
public IEnumerable<K> Keys { get { return dictionary.Keys; } }
public IEnumerable<IEnumerable<V>> Values
{
get { return dictionary.Values; }
}
}
Now to find anagrams, we can use Multidictionary
.
var words= new List<string>() { "litu", "flit", "lift", "lute", "tule", "etui", "lieu", "lite", "tile", "flue", "fuel", "felt", "left", "file", "lief", "life", "flub", "bute", "tube", "blue", "lube", "belt", "blet", "bite", "bile", "luau", "latu", "alit", "lati", "tail", "tali", "tufa", "flat", "fiat", "alif", "fail", "fila", "late", "tael", "tale", "teal", "tela", "ilea", "fate", "feat", "feta", "alef", "feal", "flea", "leaf", "abut", "tabu", "tuba", "blat", "bait", "bail", "flab", "beau", "abet", "bate", "beat", "beta", "able", "bale", "blae" };
var anagramDictionary = new Multidictionary<string, string>();
words.ForEach(word => anagramDictionary.Add(string.Concat(word.OrderBy(c => c)), word));
var anagrams = anagramDictionary.Values.Select(vs => vs);
Another way, we can write Partition
function as @Eric Lippert suggested using GroupBy
.
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, Func<T, T> canonicalize)
{
return items.GroupBy(_ => canonicalize(_)).Select(_ => _);
}
In this way solution is one liner
var groups = words.Partition(s => string.Concat(s.OrderBy(c => c)));