Title: List unique words in a text file in C#
This example uses regular expressions and LINQ to list the unique words contained in a text file in C#.
When you enter the name of a file and click List Words, the following code executes.
// Requires: using System.IO; using System.Linq;
// and using System.Text.RegularExpressions;

// List the words in the file.
private void btnListWords_Click(object sender, EventArgs e)
{
    // Get the file's text.
    string txt = File.ReadAllText(txtFile.Text);

    // Use regular expressions to replace characters
    // that are not letters or numbers with spaces.
    Regex reg_exp = new Regex("[^a-zA-Z0-9]");
    txt = reg_exp.Replace(txt, " ");

    // Split the text into words, skipping the empty
    // entries produced by runs of adjacent spaces.
    string[] words = txt.Split(
        new char[] { ' ' },
        StringSplitOptions.RemoveEmptyEntries);

    // Use LINQ to get the unique words.
    var word_query =
        (from string word in words
         orderby word select word).Distinct();

    // Display the result.
    string[] result = word_query.ToArray();
    lstWords.DataSource = result;
    lblSummary.Text = result.Length + " words";
}
The code first uses File.ReadAllText to copy the file's text into a string.
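Next, the code uses a regular expression to replace every character that is not a letter or digit with a space. The pattern is [^a-zA-Z0-9]. The square brackets define a character class, and a ^ at the start of the class means "any character except the following." The a-zA-Z0-9 part lists the lowercase letters, the uppercase letters, and the digits. The Regex object's Replace method replaces each character that matches the pattern with a space character. Note that the pattern covers only unaccented ASCII letters, so an apostrophe or an accented letter splits a word in two.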
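To see what the pattern does on its own, here is a minimal console sketch. The class name RegexCleanupDemo and the method ReplaceNonAlphanumerics are made up for illustration and are not part of the example program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for this illustration.
public static class RegexCleanupDemo
{
    // Replace every character that is not an ASCII letter
    // or digit with a single space.
    public static string ReplaceNonAlphanumerics(string text)
    {
        Regex reg_exp = new Regex("[^a-zA-Z0-9]");
        return reg_exp.Replace(text, " ");
    }

    public static void Main()
    {
        // The apostrophe is not a letter or digit, so it becomes a space.
        Console.WriteLine(ReplaceNonAlphanumerics("Don't"));    // Don t

        // Each matching character is replaced separately,
        // so ", " turns into two spaces.
        Console.WriteLine(ReplaceNonAlphanumerics("one, two")); // one  two
    }
}
```

If the file may contain accented or non-Latin letters, a pattern such as [^\p{L}\p{N}] (any character that is not a Unicode letter or number) is one possible alternative. That is a suggestion, not what the example uses.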
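The code then uses Split to break the text into an array of words. The StringSplitOptions.RemoveEmptyEntries option discards the empty strings that would otherwise appear wherever two or more spaces are adjacent. At this point the array can still contain duplicates.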
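The following sketch shows the Split call in isolation. The names SplitDemo and SplitIntoWords are hypothetical. The point is that RemoveEmptyEntries drops the empty strings between adjacent spaces, while repeated words stay in the array.

```csharp
using System;

// Hypothetical helper class, used only for this illustration.
public static class SplitDemo
{
    // Split on spaces and drop the empty entries that
    // runs of adjacent spaces would otherwise produce.
    public static string[] SplitIntoWords(string text)
    {
        return text.Split(
            new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string[] words = SplitIntoWords("the  cat   the hat ");

        // The repeated word "the" is still present twice.
        Console.WriteLine(words.Length);            // 4
        Console.WriteLine(string.Join("|", words)); // the|cat|the|hat
    }
}
```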
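The code uses a LINQ query to select all of the words from the array and sort them. It then calls the Distinct method to remove duplicates. The comparison is case-sensitive, so "The" and "the" are listed as two different words.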
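The sketch below shows the same idea with LINQ method syntax. It differs from the example in two ways, both of which are assumptions made for illustration: it calls Distinct before sorting, and it sorts with StringComparer.Ordinal so the output does not depend on the current culture. The class and method names are hypothetical. The second method is one possible way to ignore case, which the example does not do.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, used only for this illustration.
public static class UniqueWordsDemo
{
    // Case-sensitive: "Apple" and "apple" are different words.
    public static string[] UniqueSorted(string[] words)
    {
        return words
            .Distinct()
            .OrderBy(w => w, StringComparer.Ordinal)
            .ToArray();
    }

    // Case-insensitive variant: lowercase every word first
    // so that "Apple" and "apple" collapse into one entry.
    public static string[] UniqueSortedIgnoreCase(string[] words)
    {
        return words
            .Select(w => w.ToLowerInvariant())
            .Distinct()
            .OrderBy(w => w, StringComparer.Ordinal)
            .ToArray();
    }

    public static void Main()
    {
        string[] words = { "banana", "apple", "Apple", "banana" };

        // Ordinal order puts uppercase letters before lowercase ones.
        Console.WriteLine(string.Join(", ", UniqueSorted(words)));
        // Apple, apple, banana

        Console.WriteLine(string.Join(", ", UniqueSortedIgnoreCase(words)));
        // apple, banana
    }
}
```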
Finally, the code displays the unique words in a ListBox and displays the number of unique words in a Label.
Download the example to experiment with it and to see additional details.