Remove control and non-ASCII characters from large files

Problem

I get huge incoming files (up to 6 GB) that are littered with control and non-ASCII characters. I need to strip them out, so I wrote the routine below. The problem is that it is insanely slow. I would appreciate any thoughts or advice on how to speed it up.

public void StripHighBitCharacters(string fn)
{
    string writeFile = fn + "B";
    using (var reader = new StreamReader(fn))
    using (var writer = new StreamWriter(writeFile))
    {
        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            if (line.Length > 0)
            {
                writer.WriteLine(BuildClearString(line));
            }
            else
            {
                writer.WriteLine(line);
            }
        }
    }
    File.Copy(writeFile, fn, true);
    File.Delete(writeFile);
}
public string BuildClearString(string line)
{

    StringBuilder sb = new StringBuilder();

    foreach (char c in line)
    {
        if (c >= 32 && c <= 175)
        {
            sb.Append(c);
        }
    }

    return (sb.ToString());
}

Solution

General

  • You are using using statements, which is always good.
  • You have small, well-named methods, which is good as well, but the StripHighBitCharacters() method doesn’t do what its name implies. Based on the names, BuildClearString() is the method actually doing the stripping.
  • The method parameter of StripHighBitCharacters is poorly named. Why not call it fileName?
  • You should be consistent in your use of var. Why didn’t you use it, e.g., for string writeFile?

@1201ProgramAlarm mentioned reusing the StringBuilder in his/her answer, which is the way to go for a performance boost, but I would take this further.

  • I would initialize the StringBuilder with a starting capacity of at least 4 KB, because your filesystem usually stores its data in 4 KB blocks. But because you expect really big files, you should increase the capacity further, e.g. to 4 MB.

  • Instead of creating a new file with the name fn + "B", you should use
    Path.GetTempFileName(); after the content is written, delete the original and move the temp file to the original destination.

Implementing the points above leads to:

private const int maxCapacity = 4096 * 1024;
private StringBuilder sb = new StringBuilder(maxCapacity);

public void CleanFile(string fileName)
{
    var tempFileName = Path.GetTempFileName();
    using (var reader = new StreamReader(fileName))
    using (var writer = new StreamWriter(tempFileName))
    {
        sb.Length = 0;
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            if (line.Length + sb.Length > maxCapacity)
            {
                writer.Write(sb.ToString());
                sb.Length = 0;
            }
            StripHighBitCharacters(line);
        }

        // Don't forget to flush whatever is left in the buffer
        // before the writer is disposed.
        writer.Write(sb.ToString());
    }

    File.Delete(fileName);
    File.Move(tempFileName, fileName);
}

public void StripHighBitCharacters(string value)
{
    foreach (var c in value.Where(c => c > 31 && c < 176))
    {
       sb.Append(c);
    }

    sb.AppendLine();
}

After some poor man’s profiling (using a Stopwatch) I found that the StripHighBitCharacters() method above, which uses LINQ, took around 39 seconds.

Using just a loop and an if, like so:

public void StripHighBitCharacters(string value)
{
    foreach (var c in value)
    {
        if (c > 31 && c < 176)
        {
            sb.Append(c);
        }
    }

    sb.AppendLine();
}  

the measurements improved: it took only 22 seconds.

Both tests were done with a 1.3 GB file.
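For reference, the poor man’s profiling mentioned above amounts to something like this (a sketch only; the file path and the CleanFile call are placeholders, not from the original post):

```csharp
using System;
using System.Diagnostics;

// Wrap the call under test in a Stopwatch and print the elapsed time.
var sw = Stopwatch.StartNew();
CleanFile(@"C:\data\bigfile.txt"); // placeholder path and method under test
sw.Stop();
Console.WriteLine($"Elapsed: {sw.Elapsed.TotalSeconds:F1} s");
```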

One source of performance degradation is frequent memory allocations. In your case, the StringBuilder will allocate space for every line in your file, and may allocate additional space (along with a data copy) for longer lines.

You can eliminate all of that by reusing the StringBuilder object. At the start of BuildClearString, clear it (sb.Clear(); or sb.Length = 0;), then follow up with a capacity check:

if (sb.Capacity < line.Length)
    sb.Capacity = line.Length;

By changing the capacity you ensure that you have a buffer big enough to hold all the characters you will be adding, so you won’t incur any memory allocations when processing a line. By reusing it, you keep the existing allocated memory, thus avoiding any allocations for later lines unless the line is longer than any you’ve already encountered. You can also set an initial capacity on the StringBuilder object.

Why not just use BinaryReader/BinaryWriter? If you have a lot of line breaks, ReadLine() means more iterations of your loop; BinaryReader minimizes that overhead and eliminates the need for a StringBuilder or for estimating the buffer size. Note that this operates on raw bytes, so it only matches the original char-based filter when the file uses a single-byte encoding.

private void StripUnwantedChars(string InFile, string OutFile, int readSize = 1048576)
{
    using (var fsInFile = File.Open(InFile, FileMode.Open, FileAccess.Read))
    using (var bReader = new BinaryReader(fsInFile))
    using (var fsOutfile = File.Open(OutFile, FileMode.Create))
    using (var bWriter = new BinaryWriter(fsOutfile))
    {
        while (fsInFile.Position != fsInFile.Length)
        {
            byte[] bytes = bReader.ReadBytes(readSize);
            foreach (byte checkByte in bytes)
            {
                if (((checkByte >= 32) && (checkByte <= 175)) || (checkByte == 13) || (checkByte == 10))
                {
                    bWriter.Write(checkByte);
                }
            }
        }
    }
}

EDIT: Added check for line break and carriage return characters.

This should improve performance, as long as the StreamWriter.Write(char) implementation does not have particularly high overhead.

NB, this will remove the need for any intermediate StringBuilder and associated temporary arrays.

public void StripHighBitCharacters(string fn)
{
  string writeFile = fn + "B";
  using (var reader = new StreamReader(fn))
  using (var writer = new StreamWriter(writeFile))
  {
    while (!reader.EndOfStream)
    {
      string line = reader.ReadLine();
      if (line.Length > 0)
      {
        foreach (var c in line.Where(c => c >= 32 && c <= 175)) { writer.Write(c); }
      }

      writer.WriteLine();
    }
  }
  // You may wish to consider moving `fn` to a temp location and then deleting it after the `File.Move(writeFile, fn)` line succeeds
  File.Delete(fn);
  File.Move(writeFile, fn);
}

Consider a producer/consumer pattern, e.g. with BlockingCollection<T>. Read a line and strip it in the producer; write the clean lines in the consumer. This keeps the disk active, and the stripping is essentially free. Use a bounded capacity so the producer does not get too far ahead of the consumer.
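A minimal sketch of that idea (the class name, the bounded capacity of 1000, and the temp-file handling are my assumptions, not from the original post):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class ProducerConsumerCleaner
{
    // Bounded capacity keeps the producer from racing ahead of the writer.
    private const int BoundedCapacity = 1000; // assumption: tune for your workload

    public static void CleanFile(string fileName)
    {
        var tempFileName = Path.GetTempFileName();
        using (var lines = new BlockingCollection<string>(BoundedCapacity))
        {
            // Consumer: writes cleaned lines as they become available.
            var writerTask = Task.Run(() =>
            {
                using (var writer = new StreamWriter(tempFileName))
                {
                    foreach (var line in lines.GetConsumingEnumerable())
                    {
                        writer.WriteLine(line);
                    }
                }
            });

            // Producer: reads and strips on the current thread.
            var sb = new StringBuilder();
            using (var reader = new StreamReader(fileName))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    sb.Clear();
                    foreach (var c in line)
                    {
                        if (c >= 32 && c <= 175) { sb.Append(c); }
                    }
                    lines.Add(sb.ToString());
                }
            }

            lines.CompleteAdding(); // signal the consumer that no more lines are coming
            writerTask.Wait();
        }

        File.Delete(fileName);
        File.Move(tempFileName, fileName);
    }
}
```

Add(…) blocks when the collection is full and GetConsumingEnumerable() blocks when it is empty, so the bounded capacity gives you the back-pressure described above for free.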

As has been said, just have one StringBuilder (note that C# is case-sensitive, so it is sb.Clear(), not sb.clear()):

private StringBuilder sb = new StringBuilder();
public string BuildClearString(string line)
{
    sb.Clear();

If you don’t need leading and trailing white-space characters, then use the String.Trim method.

var line = reader.ReadLine().Trim();

This might be faster. But I doubt it.

foreach (char c in line)
{
    if (c < 32)
    {
        continue;
    }
    if (c > 175)
    {
        continue;
    }
    sb.Append(c);
}

Without the producer/consumer part, I would trim it down. Those empty-line checks take time.

public void StripHighBitCharacters(string fn)
{
    string writeFile = fn + "B";
    using (var reader = new StreamReader(fn))
    using (var writer = new StreamWriter(writeFile))
    {
        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine().Trim();
            writer.WriteLine(BuildClearString(line));
        }
    }
    File.Delete(fn);
    File.Move(writeFile, fn);
}

private StringBuilder sb = new StringBuilder();
public string BuildClearString(string line)
{
    sb.Clear();  
    foreach (char c in line.Where(c => c >= 32 && c <= 175))
    {
        sb.Append(c);
    }   
    return (sb.ToString());
}
