Splitting string by words to new strings

Posted on

Problem

Function:

char **strsplit (char source[])
{
    char **strings = calloc(1, sizeof(char *));
    int nStrings = 0, nChars = 0;

    strings[0] = calloc(2, sizeof(char));

    while('' != *source)
    {
        if(*source == ' ')
        {
            nStrings++;
            nChars = 0;
            strings = realloc(strings, (1 + nStrings) * sizeof(char *));
            strings[nStrings] = calloc(2, sizeof(char));
            source++;
            continue;
        }

        strings[nStrings][nChars] = *source;
        strings[nStrings][nChars + 1] = '';

        nChars++;
        source++;
        strings[nStrings] = realloc(strings[nStrings], (2 + nChars) * sizeof(char));
    }

    return(strings);
}

Example usage:

char *source = "Hello Code Review";
char **words = strsplit(source);

printf("Second word is: %s", words[1]);

Output:

Code

Solution

Add some documentation to function.

While the name strsplit is pretty descriptive in itself, you should still add documentation to at least explain the argument and the return value.

For example, I was a little confused when I saw strsplit because I thought it should’ve taken another argument that was the delimiter.


Add some more comments to your code!

I had to rubberduck this code a couple times before I understood what was happening and where.

I think at least having a comment for what each variable was doing would help a lot.


Allocating and re-allocating memory is very inefficient. You should set one size of memory that will definitely fit the amount of memory needed.

For example, you could just allocate, for each output string, the size of the input string. However, this eats up a lot of memory for big strings.

You may have to come up with another algorithm for this.


You are sort-of reinventing the wheel here, and I don’t think that that is your intent.

<string.h> already has a function char *strtok(char *str, const char *delim). However, this function only grabs tokens one at a time, so you’d have to take these steps:

  1. Grab a token
  2. Put it in the string array
  3. Go to 1

Is roughly what you’d have to do.

I wrote out an implementation in C. If you’d like to see it, then leave a comment.

Wait, do I read it right this code was not written just for practice? Unfortunately the actual problem was not stated. What expectations are there when it comes to the content of the string? How about spaces at the beginning, end, or multiple spaces between words?

The function is unusable. There is no indication how many words were found. It should either return a number or have the array NULL-terminated (or preferably both).

As was mentioned by someone else reallocs here are a non-starter. Apart from the issue of not checking for failed allocation, you copy data a lot.

Instead you can just traverse the string counting how many words are there, allocate appropriate table and have a second pass where you allocate + fill elements.

Except chances you don’t even want to do that. Maybe source string is expected to be modifiable and all strings fried at one, in which case you can just have an array to substrings where you replace spaces with zeros in the original string.

It maybe you don’t want that either, maybe you are fine enough with a pair addr + length for each string, who knows.

On to the code.

char **strsplit (char source[])
{
    char **strings = calloc(1, sizeof(char *));

Ouch.

    int nStrings = 0, nChars = 0;

    strings[0] = calloc(2, sizeof(char));

Why? For an empty string?

    while('' != *source)
    {

Yoda-style comparison with no merit….

        if(*source == ' ')

… followed by normal comparison on the next line.

        {
            nStrings++;
            nChars = 0;
            strings = realloc(strings, (1 + nStrings) * sizeof(char *));
            strings[nStrings] = calloc(2, sizeof(char));
            source++;
            continue;
        }

        strings[nStrings][nChars] = *source;
        strings[nStrings][nChars + 1] = '';

        nChars++;
        source++;
        strings[nStrings] = realloc(strings[nStrings], (2 + nChars) * sizeof(char));

This is barely readable and very inefficient.
}

    return(strings);
}

Leave a Reply

Your email address will not be published. Required fields are marked *