Simple string-root detection in a string-family

Posted on

Problem

(This problem is related to Simple string-split by root and sufix algorithm)

There are many ways to find a “common root” of a list of similar strings, that begins with the same substring… The question is what the best way (faster and elegant)? Do you know a book or library with these algorithms?

Example: in the set {"foo2", "foo1", "foofo"} the root is "foo".

In practical use we need also ignore extrange itens, so we can expect the same "foo" root in the set {"hello", "foo2", "foo1", "foofo", "bye"}. A “root minimal length” parameter is also a practical issue.

Here is a sample and a first candidate algorithm:

  $words = array('i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
   'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
   'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
   'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
   'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
   'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
   'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
   'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
   'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
   'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
   'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
   'should', 'now');

  /**
   * Get the common start-string of two strings.  
   * @param $a string. 
   * @param $b string with probable common root.
   * @return string $root or empty.
   * @see this function can feed the root for str_splitByRoot() function.
   */
  function str_getRoot($a,$b) {
        $rootLen = strspn($b ^ $a, ""); // length before difference.
        return substr($a,0,$rootLen);
  }

  /**
   * Get the common start-string of many strings.  
   * The algorithm splits the string by its differences.
   * @param $a array of strings. 
   * @param $b string with probable common root, or integer for array index.
   * @return array of strings with $root and family.
   */
  function array_getRoot($a,$b='',$minLen=1) {
    if (is_integer($b)) 
        $b = $a[$b];
    if (!$b) $b=$a[0];
    $family = array();                
    foreach ($a as $item) {
        $r = str_getRoot($item,$b);
        if (strlen($r)>=$minLen) {
            $family[]=$item;
            if ($r!=$b) $b=$r;
        } //if
    } // for
    array_unshift($family,$b);
    return  $family;
  }

Applying in the example:

$list = array("foo2", "hello", "foo1", "foofo", "bye");
$fam = array_getRoot($list);
print "nThe root of list is ".array_shift($fam); // foo

Using to show the roots of the $words list:

  sort($words); // not necessary
  foreach($words as $i){
       $fam4 = array_getRoot($words,$i,4);
       $root = array_shift($fam4);
       if (count($fam4)>1)
            print "n * $i - ($root) ".join(', ',$fam4);
       else { // repeate da same procedure:
            $fam3 = array_getRoot($words,$i,3);
            $root = array_shift($fam3);
            if (count($fam3)>1)
                 print "n * $i - ($root) ".join(', ',$fam3);
       }
  }

Some results:

* about - (abo) about, above
* again - (again) again, against
* her   - (her) her, here, hers, herself
* hers  - (hers) hers, herself
* you   - (you) you, your, yours, yourself, yourselves
* yourself - (your) your, yours, yourself, yourselves

Solution

(this is not a good answer because… the answer needs performance analysis and comparison with alternative algorithms)

Only to show a tested and “unified” (str+array) algorithm… A “code” based on the defined (see question’s statement) functions,

  function str_getRoot2($a,$b='',$minLen=1) { // "2" to avoid recurrence
    if (is_array($a)) // and $b is a seed.
       return array_getRoot($a,$b,$minLen);
    else  // $b is a root
       return str_getRoot($a,$b);
  }

Complete code, with an aditional retFamily parameter, and recurrence:

  /**
   * Get the commom start-string of two or more strings.  
   * @version v1.2 of 2015-06-11.
   * @param $a mix, string for final operation, or array for scan. 
   * @param $b string with probable commom root, or integer for array index.
   * @param $retFamily boolean for show "family".
   * @return string $root.
   * @see http://codereview.stackexchange.com/q/92700
   */
  function str_getRoot($a,$b='',$minLen=1,$retFamily=false) {
    if (is_array($a)) { // $b is a seed.
        $family = array();  
        if (is_integer($b))
            $b = $a[$b];
        if (count($a)>1) {
            foreach ($a as $item) if ($item) {
                if (is_array($item)) die("array of array is invalid");
                if (!$b) $b=$item; // or trim($b)..trim($item)
                $r = str_getRoot($item,$b,$minLen); // go to str
                if ($r) {
                   if ($retFamily) $family[]=$item;
                   if ($r!=$b) $b=$r;
                } //if
            } // for            
        } elseif (isset($a) && count($a)>0)
            $family[]= $b = $a[0];
        if ($retFamily) {
            array_unshift($family,$b);
            return $family;
        } else
            return $b;
    } else {  // $b is a root
        $rooLen = strspn($b ^ $a, ""); // split by diff
        return ($rooLen>=$minLen)? substr($a,0,$rooLen): '';        
    }
  }

Examples of use: get family array with $fam4 = str_getRoot($words,'yourself',4); or get specific root of a pair by str_getRoot('yourself','your');

Leave a Reply

Your email address will not be published. Required fields are marked *