Use LINQ to select words of certain lengths from a file in C#


This example uses LINQ to read a file, remove unwanted characters, select words of a specified length, and save the result in a new file.

Recently I needed a big word list, so I searched for public domain dictionaries. I found one that was close to what I needed in the file 6of12.txt in the 12Dicts package. That file has several problems that make it not quite perfect for my use:

  • It contains words that are too short or too long for my purposes.
  • It includes non-alphabetic characters at the end of some words to give extra information about them.
  • Some words contain embedded non-alphabetic characters as in A-bomb and bric-a-brac.

The following code shows how the program processes the file.

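// Requires: using System.IO; using System.Linq; using System.Text.RegularExpressions;
// Select words whose lengths are between the given minimum and maximum.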
private void btnSelect_Click(object sender, EventArgs e)
{
    // Remove non-alphabetic characters at the ends of words.
    Regex end_regex = new Regex("[^a-zA-Z]+$");
    string[] all_lines = File.ReadAllLines("6of12.txt");
    var end_query =
        from string word in all_lines
        select end_regex.Replace(word, "");

    // Remove words that still contain non-alphabetic characters.
    Regex middle_regex = new Regex("[^a-zA-Z]");
    var middle_query =
        from string word in end_query
        where !middle_regex.IsMatch(word)
        select word;

    // Make a query to select lines of the desired length.
    int min_length = (int)nudMinLength.Value;
    int max_length = (int)nudMaxLength.Value;
    var length_query =
        from string word in middle_query
        where (word.Length >= min_length) &&
              (word.Length <= max_length)
        select word;

    // Write the selected lines into a new file.
    string[] selected_lines = length_query.ToArray();
    File.WriteAllLines("Words.txt", selected_lines);

    MessageBox.Show("Selected " + selected_lines.Length +
        " words out of " + all_lines.Length + ".");
}

The code starts by using a LINQ query to remove non-alphabetic characters from the ends of words.
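To see the first step in isolation, here is a minimal sketch that applies the same pattern to single words. It assumes C# 9 or later (top-level statements). The helper name StripTrailingNonLetters and the sample strings are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not in the original program): removes any run of
// non-letters at the end of a single word.
static string StripTrailingNonLetters(string word)
{
    // "+" requires at least one non-letter, so a word that already ends
    // in a letter produces no match and is returned unchanged.
    return Regex.Replace(word, "[^a-zA-Z]+$", "");
}

// Invented sample inputs, not taken from 6of12.txt.
Console.WriteLine(StripTrailingNonLetters("abandon+"));  // abandon
Console.WriteLine(StripTrailingNonLetters("A-bomb"));    // A-bomb
```

Note that the pattern is anchored with $, so it only touches the end of the word. The embedded hyphen in A-bomb survives this step and is handled by the next query.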

It then uses a second LINQ query to select only words that now contain no non-alphabetic characters. (That eliminates A-bomb and bric-a-brac.)
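The same filter can be sketched with LINQ method syntax instead of query syntax. This assumes C# 9 or later (top-level statements). The helper name KeepPurelyAlphabetic and the sample array are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps only the words that contain no non-letters.
static string[] KeepPurelyAlphabetic(string[] words)
{
    Regex nonLetter = new Regex("[^a-zA-Z]");
    return words.Where(w => !nonLetter.IsMatch(w)).ToArray();
}

// Invented sample data for illustration.
string[] sample = { "apple", "A-bomb", "bric-a-brac", "zebra" };
Console.WriteLine(string.Join(", ", KeepPurelyAlphabetic(sample)));  // apple, zebra
```

One thing to keep in mind: an empty string contains no non-letters, so it passes this filter. If a line becomes empty after its trailing characters are stripped, only the later length check removes it, and only when the minimum length is at least 1.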

Next, a third LINQ query selects words with lengths between the minimum and maximum values indicated by the user.
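As a rough sketch, the length filter could also be written as a small reusable method. This assumes C# 9 or later (top-level statements). The name SelectByLength and its parameters are hypothetical stand-ins for the values the program reads from its two NumericUpDown controls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keeps words whose lengths fall in [minLength, maxLength].
static IEnumerable<string> SelectByLength(
    IEnumerable<string> words, int minLength, int maxLength)
{
    return words.Where(w => w.Length >= minLength && w.Length <= maxLength);
}

// Invented sample data for illustration.
string[] sample = { "a", "cat", "house", "encyclopedia" };
Console.WriteLine(string.Join(", ", SelectByLength(sample, 3, 5)));  // cat, house
```

Both bounds are inclusive, matching the >= and <= comparisons in the original query. If the minimum is larger than the maximum, no word can satisfy both tests and the result is simply empty.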

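Finally, the code calls the last query's ToArray method. LINQ queries are not executed until their results are enumerated, so this call is what actually runs all three queries and converts the results into an array of words. The code then uses File.WriteAllLines to write the words into a new file named Words.txt.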
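LINQ queries like these use deferred execution: defining a query does no work until something enumerates it, which here is the ToArray call. The following minimal sketch illustrates that with a counter. It assumes C# 9 or later (top-level statements), and the variable names and sample words are invented for illustration.

```csharp
using System;
using System.Linq;

int evaluated = 0;  // Counts how many times the filter runs.
string[] words = { "one", "three", "seven" };

// Defining the query does not run the filter.
var query = words.Where(w => { evaluated++; return w.Length > 3; });
Console.WriteLine(evaluated);      // 0

// ToArray enumerates the query, so the filter runs once per word.
string[] result = query.ToArray();
Console.WriteLine(evaluated);      // 3
Console.WriteLine(result.Length);  // 2
```

Because the work happens at enumeration time, calling ToArray once and reusing the array (as the original code does for both the file and the message box) avoids running the regular expressions a second time.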

The code finishes by displaying the number of words in the new and original files.



2 Responses to Use LINQ to select words of certain lengths from a file in C#

  1. Adrian says:

    The regular expression in the line:
    Regex end_regex = new Regex("[^a-zA-Z]*$");
    will match on every line because it uses the "*" quantifier. Hence every line without trailing non-letters will have an empty match replaced by an empty string. Changing the regular expression to "[^a-zA-Z]+$", with a "+" quantifier, makes the code match the requirement more clearly. It may also make it a little faster.

    • RodStephens says:

      You’re right about the intent. As for speed, this example is so fast (about 0.004 seconds on my computer) that it doesn’t matter. If you run the code in a loop with 100 or so trials, however, the difference seems to be significant.

      I’ll update the post. Thanks for pointing this out.
