Problem
(This problem is related to Simple string-split by root and sufix algorithm)
There are many ways to find a “common root” of a list of similar strings, that begins with the same substring… The question is what the best way (faster and elegant)? Do you know a book or library with these algorithms?
Example: in the set {"foo2", "foo1", "foofo"}
the root is "foo"
.
In practical use we need also ignore extrange itens, so we can expect the same "foo"
root in the set {"hello", "foo2", "foo1", "foofo", "bye"}
. A “root minimal length” parameter is also a practical issue.
Here is a sample and a first candidate algorithm:
$words = array('i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
'should', 'now');
/**
* Get the common start-string of two strings.
* @param $a string.
* @param $b string with probable common root.
* @return string $root or empty.
* @see this function can feed the root for str_splitByRoot() function.
*/
function str_getRoot($a,$b) {
$rootLen = strspn($b ^ $a, " "); // length before difference.
return substr($a,0,$rootLen);
}
/**
* Get the common start-string of many strings.
* The algorithm splits the string by its differences.
* @param $a array of strings.
* @param $b string with probable common root, or integer for array index.
* @return array of strings with $root and family.
*/
function array_getRoot($a,$b='',$minLen=1) {
if (is_integer($b))
$b = $a[$b];
if (!$b) $b=$a[0];
$family = array();
foreach ($a as $item) {
$r = str_getRoot($item,$b);
if (strlen($r)>=$minLen) {
$family[]=$item;
if ($r!=$b) $b=$r;
} //if
} // for
array_unshift($family,$b);
return $family;
}
Applying in the example:
$list = array("foo2", "hello", "foo1", "foofo", "bye");
$fam = array_getRoot($list);
print "nThe root of list is ".array_shift($fam); // foo
Using to show the roots of the $words
list:
sort($words); // not necessary
foreach($words as $i){
$fam4 = array_getRoot($words,$i,4);
$root = array_shift($fam4);
if (count($fam4)>1)
print "n * $i - ($root) ".join(', ',$fam4);
else { // repeate da same procedure:
$fam3 = array_getRoot($words,$i,3);
$root = array_shift($fam3);
if (count($fam3)>1)
print "n * $i - ($root) ".join(', ',$fam3);
}
}
Some results:
* about - (abo) about, above * again - (again) again, against * her - (her) her, here, hers, herself * hers - (hers) hers, herself * you - (you) you, your, yours, yourself, yourselves * yourself - (your) your, yours, yourself, yourselves
Solution
(this is not a good answer because… the answer needs performance analysis and comparison with alternative algorithms)
Only to show a tested and “unified” (str+array) algorithm… A “code” based on the defined (see question’s statement) functions,
function str_getRoot2($a,$b='',$minLen=1) { // "2" to avoid recurrence
if (is_array($a)) // and $b is a seed.
return array_getRoot($a,$b,$minLen);
else // $b is a root
return str_getRoot($a,$b);
}
Complete code, with an aditional retFamily
parameter, and recurrence:
/**
* Get the commom start-string of two or more strings.
* @version v1.2 of 2015-06-11.
* @param $a mix, string for final operation, or array for scan.
* @param $b string with probable commom root, or integer for array index.
* @param $retFamily boolean for show "family".
* @return string $root.
* @see http://codereview.stackexchange.com/q/92700
*/
function str_getRoot($a,$b='',$minLen=1,$retFamily=false) {
if (is_array($a)) { // $b is a seed.
$family = array();
if (is_integer($b))
$b = $a[$b];
if (count($a)>1) {
foreach ($a as $item) if ($item) {
if (is_array($item)) die("array of array is invalid");
if (!$b) $b=$item; // or trim($b)..trim($item)
$r = str_getRoot($item,$b,$minLen); // go to str
if ($r) {
if ($retFamily) $family[]=$item;
if ($r!=$b) $b=$r;
} //if
} // for
} elseif (isset($a) && count($a)>0)
$family[]= $b = $a[0];
if ($retFamily) {
array_unshift($family,$b);
return $family;
} else
return $b;
} else { // $b is a root
$rooLen = strspn($b ^ $a, " "); // split by diff
return ($rooLen>=$minLen)? substr($a,0,$rooLen): '';
}
}
Examples of use: get family array with $fam4 = str_getRoot($words,'yourself',4);
or get specific root of a pair by str_getRoot('yourself','your');