Map String to list of string

Posted on

Problem

Input: String : Title; List Tags

Problem statement:

Need to map title to tags, e.g. title is "china man japan beijing bell" and tags are ["china man" "beijing bell"], output should be ["china man" "japan" "beijing bell"]

What I am doing:

List<String> getTitleSplits(String title, List<String> tags) {
        List<String> titleSplits = title.split(" "));
        List<String> ret = new ArrayList<String>();
        LinkedHashMap<String, Boolean> splitValueMap = new LinkedHashMap<String, Boolean>();
        for(String split : titleSplits) {
            splitValueMap.put(split, FALSE);
        }
        for(String titleSplit : titleSplits) {
                boolean flag = false;
                for(String tag : tags) {
                    if(tag.contains(titleSplit) || tag.contains(titleSplit.toLowerCase())) {
                        List<String> tagSplits = Arrays.asList(tag.split((" ")));
                        List<String> list = new ArrayList<String>();
                        for (String tagSplit : tagSplits) {
                            if (splitValueMap.get(tagSplit) == FALSE &&
                                    titleSplits.contains(tagSplit)) {
                                list.add(tagSplit);
                                splitValueMap.put(tagSplit, TRUE);
                                flag = true;
                            }
                        }
                        if(flag) {
                            ret.add(StringUtils.join(list, " ").trim());
                            splitValueMap.put(titleSplit, TRUE);
                            break;
                        }
                    }
                }
                if(!flag && splitValueMap.get(titleSplit) == FALSE) {
                    ret.add(titleSplit);
                }
            }
        return ret;
}

Please review this.

Solution

This problem intrigued me, and it has aspects which relate to a real-world problem I am working on. Performance is a significant concern of mine, and reviewing your code, and the current answer by wassgren (+1), I suspect there’s a better way to do it.

In both of your cases, you break the title down to words, and then build up the words in ways to see if they match tags. This creates a challenge which scales badly (I believe the solutions are about O(nt2)O(nt2) (n the size of the title, t the number of tags), and will have poor performance for reasonably sized titles, or tag sets.

A better way would be to only ‘split’ the title on the tag boundaries… locate each tag inside the title, and, if it is there, identify that location, then look in the next spots.

Still, this requires matching many tags, against many places in the title, which then becomes an order of O(nt)O(nt)

Regular expressions, though, are able to significantly improve that by using Automata to match the text. A single regular expression on the title will approach O(n)O(n) performance, and not significantly depend on the number of tags.

Thus, by building a complex regular expression, you can reduce the algorithmic complexity substantially. So, a regular expression like china man|beijing bell will match the first of those tags in the title.

Regular expressions can also work on word-boundaries, and be case insensitive. Thus, there is no need to do any splitting of the input, and conversion to lower-case….

Because regular expressions have structured orders of evaluation, we can build up a regex that encompasses all the required tags.

There are thus two parts… given a regular expression that matches all the tags, your code becomes simply:

    List<String> result = new ArrayList<>();
    Matcher mat = **SOME_PATTERN**.matcher(text);
    int lastPos = 0;
    while (mat.find()) {
        addNotEmpty(text, lastPos, mat.start(), result);
        result.add(mat.group());
        lastPos = mat.end();
    }
    addNotEmpty(text, lastPos, text.length(), result);

    return result.toArray(new String[result.size()]);

The code to build the regular expression is just a clever way of “orring” the tags:

    List<String> quoted = sources.stream().map(tag -> "\b" + Pattern.quote(tag) + "\b")
            .collect(Collectors.toList());
    String pat = String.join("|", quoted);

    return new CompiledTagSet(Pattern.compile(pat, Pattern.CASE_INSENSITIVE), sources.toString());

Note, you look for word boundaries on the tag, and ‘quote’ it so that any tags containing special regex characters are not going to break things…. Then you concatenate them with the regex ‘or’ character |.

Finally, you compile it to be case-insensitive.

Putting this all together, you can have a system like:

package tagger;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TagSplitter {

    public static final class CompiledTagSet {
        private final Pattern pattern;
        private final String sources;


        private CompiledTagSet(Pattern pattern, String sources) {
            this.pattern = pattern;
            this.sources = sources;
        }

        private Pattern getPatterns() {
            return pattern;
        }

        @Override
        public String toString() {
            return sources;
        }

    }

    public static final CompiledTagSet compileTags(Collection<String> tags) {
        // strip out duplicate, empty, and null tags.
        List<String> sources = tags.stream()
                .filter(tag -> tag != null)
                .map(tag -> tag.trim())
                .filter(tag -> !tag.isEmpty())
                .distinct()
                .collect(Collectors.toList());

        // THIS IS CRITICAL
        /*
         * Sort the 'tags' in reverse order alphabetically. The reason is as follows:
         * If one tag is a prefix of another, like "bobby" and "bobby tables", then we
         * want our regex to match "bobby tables" if it exists, and not just "bobby".
         * Add another tag "tables", and we have three: "bobby", "bobby tables" and
         * "tables". If we were to match the text:
         *     I am bobby tables
         * we want the match to be "bobby tables", and not "bobby" and "tables"
         * Java regular expressions are "alternate ordered" which means, if you have
         * OR conditions in the regex, the first one (left most) that matches, wins.
         * So, if we have the regular expression "bobby|bobby tables|tables" and match
         * that against "I am bobby tables", it will match "bobby" first.
         * 
         *  The regular expression "bobby tables|bobby|tables", on the other hand, will
         *  match "bobby tables" first.
         *  
         *  By sorting the tags in reverse order, longer tags will happen first.
         *  
         *  I believe this will also make track-back mechanisms faster in the automaton
         *  if closely related tags are closer together.
         */
        sources.sort(Comparator.reverseOrder());

        List<String> quoted = sources.stream().map(tag -> "\b" + Pattern.quote(tag) + "\b").collect(Collectors.toList());
        String pat = String.join("|", quoted);

        return new CompiledTagSet(Pattern.compile(pat, Pattern.CASE_INSENSITIVE), sources.toString());
    }

    private static final void addNotEmpty(String text, int left, int right, List<String> target) {
        if (right <= left) {
            return;
        }
        String candidate = text.substring(left, right).trim();
        if (candidate.isEmpty()) {
            return;
        }
        target.add(candidate);
    }

    public static String[] splitByTags(final String text, final CompiledTagSet tags) {

        List<String> result = new ArrayList<>();
        Matcher mat = tags.getPatterns().matcher(text);
        int lastPos = 0;
        while (mat.find()) {
            addNotEmpty(text, lastPos, mat.start(), result);
            result.add(mat.group());
            lastPos = mat.end();
        }
        addNotEmpty(text, lastPos, text.length(), result);

        return result.toArray(new String[result.size()]);

    }

    public static String[] splitByTags(final String text, String...tags) {
        // filter and process the tags.
        // remove null values, duplicates, empty strings, etc.
        return splitByTags(text, compileTags(Arrays.asList(tags)));
    }

    public static void main(String[] args) {
        String[][] tagSets = {
                {"china man", "beijing bell", "man eater"}

        };
        String[] inputs = {"China man eater Japan Beijing bell"};

        List<CompiledTagSet> compiled = Arrays.stream(tagSets).map(tags -> compileTags(Arrays.asList(tags))).collect(Collectors.toList());

        for (String input : inputs) {
            for (CompiledTagSet ts : compiled) {
                System.out.printf("Split: %s%nUsing: %s%nMatch: %s",
                        input, ts, Arrays.toString(splitByTags(input, ts)));
            }
        }

    }
}

Note that there is special handling for situations where one tag is a prefix of another tag.

The above code produces the output:

Split: China man eater Japan Beijing bell
Using: [man eater, china man, beijing bell]
Match: [China man, eater Japan, Beijing bell]

EDIT: After fine input from the OP and others the answer has been revised to fulfil the requirements of order and to include words separated by space.

I have tested the code and it doesn’t seem to work the way you described in the text.

First I fixed some small compiler errors, then I found the first problem. The split-function does not return a List, it returns a String array (String[]). See below for an example of how to transform an array to a List.

List<String> titleSplits = Arrays.asList(title.split(" "));

I have provided an alternative implementation to your problem that will work if you have a reasonable number of tags. This is not the most efficient solution for large datasets (i.e. large amount of tags or really long titles) but it is clean, simple to maintain and works for smaller sets of data.

public List<String> getTitleSplits(final String title, final List<String> tags) {
    final String lowerCaseTitle = title.toLowerCase();

    // Retrieve all words from the title
    final Set<String> allWords =
            Arrays.stream(lowerCaseTitle.split(" ")).collect(Collectors.toSet());

    // Find all "tags" in the title
    final Set<String> tagsInTitle = tags.stream()
            .filter(tag -> lowerCaseTitle.contains(tag.toLowerCase()))
            .collect(Collectors.toSet());

    // "Whichever part of title is not in tags list, would be splitted based on space"
    final Set<String> wordsThatAreNotPartOfTags =
            allWords.stream().filter(word -> tagsInTitle.stream().noneMatch(tag -> tag.contains(word))).collect(Collectors.toSet());

    // Merge, sort and return
    return Stream.of(wordsThatAreNotPartOfTags, tags)
            .flatMap(set -> set.stream()) // Merge the result collections
            .sorted((tag1, tag2) -> Integer.valueOf(lowerCaseTitle.indexOf(tag1)).compareTo(lowerCaseTitle.indexOf(tag2))).collect(Collectors.toList());
}

Furthermore, I also provided a test case to verify the behaviour of the code:

@Test
public void testTitleSplits() {
    final List<String> result1 = getTitleSplits("china man japan beijing bell", Arrays.asList("china man", "beijing bell"));
    Assert.assertEquals(3, result1.size());
    Assert.assertEquals(0, result1.indexOf("china man"));
    Assert.assertEquals(1, result1.indexOf("japan"));
    Assert.assertEquals(2, result1.indexOf("beijing bell"));

    // The extra word "man" should not be part of the output since "china man" contains "man"
    final List<String> result2 = getTitleSplits("china man japan man beijing bell", Arrays.asList("china man", "beijing bell"));
    Assert.assertEquals(3, result2.size());
    Assert.assertEquals(0, result2.indexOf("china man"));
    Assert.assertEquals(1, result2.indexOf("japan"));
    Assert.assertEquals(2, result2.indexOf("beijing bell"));
}

The suggested solution is based on Java 8.

Leave a Reply

Your email address will not be published. Required fields are marked *