Examine the unique words in a Microsoft Word file in C#


This example is a modification of the earlier post “List the unique words in a Microsoft Word file in C#.” That program reads the words in a Microsoft Word file, sorts them, and then displays the unique words in a list box.

I sometimes use that program to look for spelling errors in the books that I write. It’s sometimes hard to notice that Word has flagged a particular word as misspelled. I can use the previous example to make a list of all of the words and then look for any that are misspelled.

I’m working on a new book now (I’ll post details in the next few weeks) and I decided to modify that example to make it more useful. This version adds two new features.

First, it writes the unique words into a file. I can then copy and paste its contents into a new Microsoft Word document and let Word check the individual words for spelling errors.

Second, this version splits words that are in Pascal case or camel case. For example, a piece of code might contain the word “PreviwMouseDown.” This kind of typo is particularly difficult to detect because Microsoft Word flags all Pascal case and camel case words as spelling errors, so it becomes too easy to ignore the warnings. The program splits this word into “Previw Mouse Down.” Now when Word flags only the first part, “Previw,” as an error, it’s much easier to see that it really is misspelled.

The following code shows how the program processes the indicated file’s words.

// List the words in the file.
// Requires: using System.IO; using System.Linq;
// using System.Text.RegularExpressions;
private void btnListWords_Click(object sender, EventArgs e)
{
    Cursor = Cursors.WaitCursor;
    try
    {
        // Get the file's text.
        FileInfo file_info = new FileInfo(txtFile.Text);
        string extension = file_info.Extension.ToLower();
        string txt;
        if ((extension == ".doc") ||
            (extension == ".docx"))
        {
            txt = GrabWordFileWords(txtFile.Text);
        }
        else
        {
            txt = File.ReadAllText(txtFile.Text);
        }

        // Use a regular expression to replace characters
        // that are not ASCII letters or digits with spaces.
        Regex reg_exp = new Regex("[^a-zA-Z0-9]");
        txt = reg_exp.Replace(txt, " ");

        // Split the text into words.
        string[] words = txt.Split(
            new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);

        // Use LINQ to split each word at its capital
        // letters and get the unique results.
        var word_query =
            (from string word in words
             orderby word
             select ToProperCase(word)).Distinct();

        // Save the result in words.txt and display it.
        string[] result = word_query.ToArray();
        File.WriteAllText("words.txt",
            string.Join(Environment.NewLine, result));
        lstWords.DataSource = result;
        lblSummary.Text = result.Length + " words";
    }
    finally
    {
        // Restore the cursor even if an error occurs.
        Cursor = Cursors.Default;
    }
}

This code uses the GrabWordFileWords method described in the earlier post to get the words contained in the file. (See that post for details.)

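Next the code uses a regular expression to replace every character that is not an ASCII letter or digit with a space. That includes special characters such as _ and &, other punctuation, line breaks, and accented letters. It then uses the string class’s Split method to divide the text into words separated by spaces, discarding any empty entries.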
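To make the effect of those two steps concrete, here is a minimal console sketch that applies the same regular expression and the same Split call to a short string. The class name SplitDemo, the method name ExtractWords, and the sample text are hypothetical and are not part of the example program.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Replace every character that is not an ASCII letter or digit
    // with a space, then split on spaces and drop empty entries.
    public static string[] ExtractWords(string text)
    {
        string cleaned = Regex.Replace(text, "[^a-zA-Z0-9]", " ");
        return cleaned.Split(
            new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string sample = "Handle the PreviewMouseDown_event & re-draw.";

        // Prints one word per line:
        // Handle, the, PreviewMouseDown, event, re, draw
        foreach (string word in ExtractWords(sample))
            Console.WriteLine(word);
    }
}
```

Note that hyphens and apostrophes count as separators too, so “re-draw” becomes the two words “re” and “draw.” For a spelling check that is usually harmless, but it does mean the list can contain fragments that are not words on their own.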

The program then uses a LINQ query that sorts the words, calls the ToProperCase method (described shortly) for each word, and uses Distinct to remove duplicates. Note that the query orders the results by the original words rather than by the values returned by ToProperCase. (It probably should order them by ToProperCase(word), but it probably doesn’t make much difference to the ordering.)
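If you do want the list ordered by the split text, one possible approach is to split first, remove duplicates, and sort last. The following is only a sketch under some assumptions: the names UniqueWordsDemo and UniqueSplitWords are hypothetical, the ToProperCase copy is included just to make the sketch self-contained, and the choice of StringComparer.OrdinalIgnoreCase is mine (the example program uses the default comparison).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UniqueWordsDemo
{
    // Same splitting rule as the post's ToProperCase method.
    public static string ToProperCase(string word)
    {
        string result = word[0].ToString();
        for (int i = 1; i < word.Length; i++)
        {
            if (char.IsUpper(word[i])) result += " ";
            result += word[i];
        }
        return result;
    }

    // Split each word, remove duplicates, and then sort by the split text.
    public static string[] UniqueSplitWords(IEnumerable<string> words)
    {
        return words
            .Select(w => ToProperCase(w))
            .Distinct()
            .OrderBy(w => w, StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical input with one duplicate.
        string[] words = { "MouseUp", "Mousetrap", "MouseUp" };

        // Sorted by the unsplit text, Mousetrap would precede MouseUp.
        // Sorted by the split text, "Mouse Up" comes first because
        // the space sorts before the letter t.
        foreach (string w in UniqueSplitWords(words))
            Console.WriteLine(w);
    }
}
```

Sorting after Distinct also avoids relying on Distinct to preserve the order of its input, which it does in practice for in-memory sequences but which is easy to overlook.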

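The program finishes by writing the words into the file words.txt, one per line, and displaying them in a list box along with a count. Because the code uses a relative file name, the file is created in the program’s current directory, which is normally the directory that contains the executable.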

The following code shows the ToProperCase method.

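// Insert a space before each capital letter
// except the first character.
private string ToProperCase(string word)
{
    // Guard against empty input so word[0] cannot fail.
    if (string.IsNullOrEmpty(word)) return word;

    string result = word[0].ToString();
    for (int i = 1; i < word.Length; i++)
    {
        char ch = word[i];
        if (char.IsUpper(ch)) result += " ";

        result += ch.ToString();
    }
    return result;
}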

This method creates a result string and adds the input word’s first character to it. The code then loops through the word’s remaining characters. When it sees a capital letter, it adds a space to the result before that letter. The code then adds the letter to the result.

After it finishes processing the word’s letters, the method returns the finished result.
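The following console sketch shows the same splitting rule applied to a few inputs. It is not the method from the example program: the names ProperCaseDemo and SplitAtCapitals are hypothetical, and this version uses a StringBuilder instead of string concatenation, which is a common choice when building a string in a loop.

```csharp
using System;
using System.Text;

public static class ProperCaseDemo
{
    // Insert a space before every uppercase letter
    // except the first character of the word.
    public static string SplitAtCapitals(string word)
    {
        if (string.IsNullOrEmpty(word)) return word;

        StringBuilder result = new StringBuilder();
        result.Append(word[0]);
        for (int i = 1; i < word.Length; i++)
        {
            if (char.IsUpper(word[i])) result.Append(' ');
            result.Append(word[i]);
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(SplitAtCapitals("PreviwMouseDown")); // Previw Mouse Down
        Console.WriteLine(SplitAtCapitals("camelCase"));       // camel Case
        Console.WriteLine(SplitAtCapitals("HTMLParser"));      // H T M L Parser
    }
}
```

The last call shows a side effect of this simple rule: a run of capital letters such as an acronym is broken into single letters. That is fine for spotting typos, but the list will contain some one-letter entries.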

(Out of the 6,600+ words that the program found, it helped me identify a dozen or so typos that might otherwise have been missed.)


About RodStephens

Rod Stephens is a software consultant and author who has written more than 30 books and 250 magazine articles covering C#, Visual Basic, Visual Basic for Applications, Delphi, and Java.
