The following TrimNonAscii extension method removes every character that is not a printable ASCII character from a string.
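// Requires: using System.Text.RegularExpressions;
// Extension methods must be declared inside a static class.
public static string TrimNonAscii(this string value)
{
    // Match one or more characters outside the range space (32) through ~ (126).
    string pattern = "[^ -~]+";
    Regex reg_exp = new Regex(pattern);
    return reg_exp.Replace(value, "");
}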
In ASCII, the printable characters lie between space (“ ”, code 32) and “~” (code 126). The code builds a regular expression that matches any run of one or more characters outside that range. It uses the expression to create a Regex object and then calls the object’s Replace method to replace each run with an empty string. The method then returns the resulting string.
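As a quick illustration, here is a minimal, self-contained sketch that wraps the same pattern in a static class and calls it on a sample string. The class names StringExtensions and Demo and the sample text are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringExtensions
{
    // Same pattern as the TrimNonAscii method described above.
    public static string TrimNonAscii(this string value)
    {
        return Regex.Replace(value, "[^ -~]+", "");
    }
}

public static class Demo
{
    public static void Main()
    {
        // Sample text (an assumption): accented e, tab, bell character, pound sign.
        string input = "Caf\u00E9\tprice:\u0007 \u00A35";

        // Everything outside space..~ is dropped, including the accented e and the pound sign.
        Console.WriteLine(input.TrimNonAscii());   // Prints: Cafprice: 5
    }
}
```

The tab and bell are control characters, so they disappear as intended, but the é and £ are removed as well.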
Note that this method removes many useful Unicode characters such as £, Æ, and ♥, in addition to entire scripts such as Cyrillic and kanji. It’s mostly useful for standard English text.
I don’t know of a simple way to remove non-printable Unicode characters in bulk. You would probably need to make a table of characters that you do or do not want to include and then either loop through the string looking for them or use a Regex object to remove the ones you don’t want.
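One possible way to build such a table is to lean on the Unicode general categories that .NET exposes through char.GetUnicodeCategory. The following is only a sketch: the choice of which categories count as unwanted is an assumption, and the names UnicodeFilter and RemoveByCategory are made up for this example. It works one UTF-16 code unit at a time and deliberately leaves surrogates alone so that characters outside the Basic Multilingual Plane stay intact.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class UnicodeFilter
{
    // Assumption: these categories are the ones to treat as non-printable.
    private static readonly HashSet<UnicodeCategory> Unwanted = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.Control,          // e.g. tab, bell, DEL
        UnicodeCategory.Format,           // e.g. zero-width space
        UnicodeCategory.PrivateUse,
        UnicodeCategory.OtherNotAssigned
    };

    // Loop through the string and keep only characters whose category is not in the table.
    public static string RemoveByCategory(string value)
    {
        var result = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (!Unwanted.Contains(char.GetUnicodeCategory(c)))
                result.Append(c);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // A, zero-width space, B, bell, space, pound sign, space, heart.
        string input = "A\u200BB\u0007 \u00A3 \u2665";
        string cleaned = RemoveByCategory(input);

        // The zero-width space and bell are removed; the pound sign and heart survive.
        Console.WriteLine(cleaned == "AB \u00A3 \u2665");   // True
    }
}
```

Whether this counts as "printable" on a given system still depends on the installed fonts, so treat the category list as a starting point rather than a complete answer.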




Excellent…
Thanks, this was the only code that resolved my problem.
This works great, but it also seems to remove Unicode characters, which I don’t want. I only want to remove non-printable characters, and the following character is being removed:
What character code is that? I only see a sort of underscore.
Unfortunately Unicode defines hundreds of non-printable characters such as control characters and formatting characters, and many are kind of scattered around instead of having nice contiguous values. So I don’t know how to filter them out.
There’s also the issue of which font you are using. For example, you may not have a font for a particular locale installed. In that case text in that font might appear as non-printable on your system.
Anyway, I don’t know if there’s a simple solution to this issue. This example sort of does this, although it’s not trivial:
thanks works great
The Regular Expression in the code uses the “*” (asterisk) quantifier. It means zero-or-more repetitions of the preceding thing. For this character removal task it would be better to use the “+” (plus) quantifier which means one or more repetitions.
For the given replacement string (i.e. the empty string) the final result is the same. But zero-or-more means that the pattern also matches the empty string, so the replacement string will be inserted between every pair of input characters, as well as at the start and end of the string. This is easily seen by changing the return expression to be:
reg_exp.Replace(value, "=");
My experiments show the code using “[^ -~]*” runs significantly slower than the version using “[^ -~]+”. However I suspect that for most programs this speed difference is not important. It would only matter for programs that do large amounts of processing with regular expressions.
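The difference between the two quantifiers can be reproduced with a small stand-alone program. This is just a sketch; the class name QuantifierDemo and the sample input are made up, and it does not attempt to measure the speed difference.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    public static void Main()
    {
        // Only printable ASCII, so a correct removal should change nothing.
        string input = "abc";

        // Zero-or-more: the pattern also matches the empty string at every position.
        Console.WriteLine(Regex.Replace(input, "[^ -~]*", "="));   // Prints: =a=b=c=

        // One-or-more: the pattern finds nothing to match in this input.
        Console.WriteLine(Regex.Replace(input, "[^ -~]+", "="));   // Prints: abc

        // With an empty replacement string both versions return the same text.
        bool same = Regex.Replace(input, "[^ -~]*", "") == Regex.Replace(input, "[^ -~]+", "");
        Console.WriteLine(same);                                   // Prints: True
    }
}
```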
Excellent point, Adrian! I’ll update this. (Others: note that the code above has been changed as Adrian suggested.)
Your regex only keeps the values 32 – 126, so all char values > 126 will be filtered out. You can do this instead:
string pattern = $"[{(char)0}-{(char)31}{(char)127}]+";
This removes all characters in the range 0 – 31, the control characters. It additionally removes the DEL character (127). This ensures your Unicode characters stay intact. The one drawback is that it does not filter out the non-printable Unicode characters.
Note: The “+” in your regex is not strictly needed; without it the function removes the same characters one at a time and returns the same result. I don’t know whether there are any speed advantages either way.
-Daniel Pinski
The goal was to only keep printable ASCII characters, but your version looks good if you also want other Unicode characters.
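For anyone who wants to try the control-character variant, here is a minimal sketch. It assumes the escape form [\x00-\x1F\x7F], which covers the same characters (0 – 31 and 127) as the interpolated string in the comment above. The names ControlCharExtensions, TrimControlChars, and ControlCharDemo are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlCharExtensions
{
    // Hypothetical name. Removes only ASCII control characters (0-31) and DEL (127).
    public static string TrimControlChars(this string value)
    {
        return Regex.Replace(value, @"[\x00-\x1F\x7F]+", "");
    }
}

public static class ControlCharDemo
{
    public static void Main()
    {
        // Accented e, tab, bell character, pound sign.
        string input = "Caf\u00E9\tprice:\u0007 \u00A35";
        string cleaned = input.TrimControlChars();

        // The tab and bell are gone; the accented e and the pound sign remain.
        Console.WriteLine(cleaned == "Caf\u00E9price: \u00A35");   // Prints: True
    }
}
```

As noted in the comment, non-printable characters above 127, such as a zero-width space, pass through this version unchanged.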
I think I looked around at one time to see if there was a way to remove non-printable Unicode characters and I vaguely remember reading that there was no easy way to do that. I suppose you could print onto a bitmap and see if any of the pixels were changed.