Remove non-printable ASCII characters from a string in C#


The following TrimNonAscii extension method removes the non-printable ASCII characters from a string.

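// Requires: using System.Text.RegularExpressions;
// Extension methods must be declared inside a static class.
public static string TrimNonAscii(this string value)
{
    // Match one or more characters outside the range space through ~.
    string pattern = "[^ -~]+";
    Regex reg_exp = new Regex(pattern);
    return reg_exp.Replace(value, "");
}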

In ASCII, the printable characters lie between space (“ ”) and “~”. The code makes a regular expression that matches any run of one or more characters outside of that range. It uses the expression to create a Regex object and then uses its Replace method to replace each such run with an empty string. The method then returns the resulting string.
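To show the method in context, here is a minimal sketch of a complete program. The class names AsciiStringExtensions and AsciiDemo are made up for illustration, and storing the Regex in a static field (so it is built only once) is a variation on the code above, not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container class; extension methods must live in a static class.
public static class AsciiStringExtensions
{
    // Matches one or more characters outside the printable ASCII range (space through ~).
    // Assumption: caching the Regex in a static field avoids rebuilding it on every call.
    private static readonly Regex NonPrintableAscii = new Regex("[^ -~]+");

    public static string TrimNonAscii(this string value)
    {
        return NonPrintableAscii.Replace(value, "");
    }
}

// Hypothetical demo class.
public static class AsciiDemo
{
    public static void Main()
    {
        string raw = "Tab\there, bell\u0007, pound £, heart ♥.";

        // The tab, the bell character, £, and ♥ all lie outside space..~, so all are removed.
        Console.WriteLine(raw.TrimNonAscii());
        // prints "Tabhere, bell, pound , heart ."
    }
}
```

Note that tabs and newlines are also outside the printable range, so the method removes them too.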

Note that this method also removes many useful Unicode characters such as £, Æ, and ♥, as well as entire scripts such as Cyrillic and Kanji. It’s mostly useful for standard English text.

I don’t know of a simple way to remove non-printable Unicode characters in bulk. You would probably need to make a table of characters that you do or do not want to include and then either loop through the string looking for them or use a Regex object to remove the ones you don’t want.
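One possible way to sketch the "loop through the string" approach is to use .NET's Unicode category lookup in place of a hand-built table. This is only a sketch under a stated assumption: it treats "non-printable" as meaning the Control and Format categories, which may not match what you need. The names UnicodeStringExtensions, TrimUnicodeNonPrintable, and UnicodeDemo are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical container class for the sketch.
public static class UnicodeStringExtensions
{
    // Hypothetical helper: copies every character except those in the
    // Control and Format Unicode categories.
    public static string TrimUnicodeNonPrintable(this string value)
    {
        var result = new StringBuilder(value.Length);
        foreach (char ch in value)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(ch);

            // Assumption: only these two categories count as "non-printable".
            bool unwanted = category == UnicodeCategory.Control
                         || category == UnicodeCategory.Format;

            if (!unwanted) result.Append(ch);
        }
        return result.ToString();
    }
}

// Hypothetical demo class.
public static class UnicodeDemo
{
    public static void Main()
    {
        // \u0007 is a control character and \u200B is a zero-width space (a format character).
        string raw = "a\u0007b £\u200BÆ ♥";

        // The result is "ab £Æ ♥": the two invisible characters are gone, the rest is kept.
        Console.WriteLine(raw.TrimUnicodeNonPrintable());
    }
}
```

Like TrimNonAscii, this sketch removes tabs and newlines because they are control characters. Add or remove categories in the test to suit your text.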



7 Responses to Remove non-printable ASCII characters from a string in C#

  1. balu says:

    Excellent…

  2. Marcone says:

    Thanks, this was the only code that resolved my problem.

  3. Joseph says:

    This works great, but it seems to remove Unicode characters, which I don’t want. I only want to remove non-printable characters, and the following character is being removed:

  4. Rod Stephens says:

    What character code is that? I only see a sort of underscore.

    Unfortunately Unicode defines hundreds of non-printable characters such as control characters and formatting characters, and many are kind of scattered around instead of having nice contiguous values. So I don’t know how to filter them out.

    There’s also the issue of which font you are using. For example, you may not have a font for a particular locale installed. In that case text in that font might appear as non-printable on your system.

    Anyway, I don’t know if there’s a simple solution to this issue. This example sort of does this, although it’s not trivial:

  5. andre says:

    thanks works great

  6. Adrian says:

    The Regular Expression in the code uses the “*” (asterisk) quantifier. It means zero-or-more repetitions of the preceding thing. For this character removal task it would be better to use the “+” (plus) quantifier which means one or more repetitions.

    For the given replacement string (i.e. the empty string) the final result is the same. But, having the zero-or-more means that the replacement string will be inserted between every pair of input characters. This is easily seen by changing the return expression to be:

    reg_exp.Replace(value, "=");

    My experiments show the code using “[^ -~]*” runs significantly slower than the version using “[^ -~]+”. However I suspect that for most programs this speed difference is not important. It would only matter for programs that do large amounts of processing with regular expressions.
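
The difference between the two quantifiers described in this comment can be checked with a small sketch. The class QuantifierDemo and the helper MarkMatches are hypothetical names, and the input "ab" is chosen to contain no characters that need removing, so every "=" in the output marks an empty match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class.
public static class QuantifierDemo
{
    // Hypothetical helper: replaces every match of the pattern with "="
    // so that each match becomes visible in the result.
    public static string MarkMatches(string pattern, string value)
    {
        return new Regex(pattern).Replace(value, "=");
    }

    public static void Main()
    {
        // "*" also matches the empty string at every position, so "=" is inserted everywhere.
        Console.WriteLine(MarkMatches("[^ -~]*", "ab")); // prints "=a=b="

        // "+" needs at least one character outside space..~, so nothing matches here.
        Console.WriteLine(MarkMatches("[^ -~]+", "ab")); // prints "ab"
    }
}
```

With an empty replacement string both patterns return the same text, but the "*" version performs a replacement at every position in the input.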
