Title: Find duplicate files in C#, Part 4 of 4
The last three posts described an application that searches for duplicate files and removes them. The program seems to work fairly well, at least for small test directories. When I tried it on a directory containing around 8,000 files, however, it took around 229 seconds (3 minutes 49 seconds). That may not be the end of the world because you probably won't need to check a particular directory for duplicates very often, but it still seemed like a long time.
The reason the program takes so long is that it computes a hash code for each file. To compute a hash code, the cryptographic hashing methods must open the file, read it from beginning to end, and produce the code. Reading every file in its entirety takes a while.
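The earlier posts define a GetHashString helper that does this work. For reference, here is a minimal sketch of what such a helper might look like, assuming SHA-256 and the System.IO and System.Security.Cryptography namespaces. (The earlier posts may use a different hash algorithm, so see them for the actual code.)
private string GetHashString(string filename)
{
    // Open the file and hash its entire contents.
    using (SHA256 sha = SHA256.Create())
    using (FileStream stream = File.OpenRead(filename))
    {
        byte[] hash = sha.ComputeHash(stream);

        // Convert the hash bytes into a printable string.
        return BitConverter.ToString(hash);
    }
}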
Another test you can perform to see whether two files are the same is to compare their sizes. This test is much less conclusive than a hash code because many files happen to have the same size without having the same contents. However, comparing file sizes is much faster than hashing two files and comparing their hash codes. If two files have different sizes, then they are definitely not the same. If they have the same size, they might still be different; in that case, you can use the slower hash code to see whether they really are the same.
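To see why the size test is so cheap, note that FileInfo.Length reads only the file system's metadata, so the files' contents are never touched. The following snippet sketches the idea; path_a and path_b are placeholder variables.
// Comparing lengths reads only file metadata, not file contents.
FileInfo info_a = new FileInfo(path_a);
FileInfo info_b = new FileInfo(path_b);
if (info_a.Length != info_b.Length)
{
    // Different sizes, so the files are definitely not the same.
}
else
{
    // Same size, so fall back on the slower hash comparison.
}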
This version of the program starts by grouping the files by size. If a group contains a single file, then no other file has the same size, so that file cannot have a duplicate and the program can eliminate it from consideration. After eliminating those files, the program uses the previous method of comparing hash codes to see which of the remaining files are duplicates.
The following code shows the LINQ queries that this version uses to identify duplicates.
// Get FileInfos representing the files.
var get_infos =
    from string filename in Directory.GetFiles(txtDirectory.Text)
    select new FileInfo(filename);

// Group the FileInfos by file length.
var fileinfo_groups =
    from FileInfo file_info in get_infos
    group file_info by file_info.Length into g
    where g.Count() > 1
    select g;

// Flatten the result to get a list of FileInfos.
var flattened = fileinfo_groups.SelectMany(x => x).ToList();

// Get a list of the files and their hash values.
var get_info =
    from FileInfo file_info in flattened
    select new
    {
        Name = file_info.FullName,
        Hash = GetHashString(file_info.FullName)
    };

// Group the files by hash value.
var group_infos =
    from info in get_info
    group info by info.Hash into g
    where g.Count() > 1
    //orderby g.Key
    select g;

// Loop through the files.
int num_groups = 0;
int num_files = 0;
foreach (var g in group_infos)
{
    ...
}
The get_infos query loops through all of the files in the directory and selects FileInfo objects representing the files.
Next, the fileinfo_groups query groups the FileInfo objects by file length. It keeps only the groups that contain more than one file, because those are the only groups that might contain duplicates.
The code then calls the SelectMany method on the result of the fileinfo_groups query to select the items in the groups that the query built. SelectMany combines the values in the groups to make a single flattened sequence containing all of the selected FileInfo objects. The program then calls ToList to convert the result into a list.
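If SelectMany is unfamiliar, the following small example (not part of the program) shows the idea with integers instead of FileInfo objects.
int[] values = { 1, 2, 2, 3, 3, 3 };

// Group the values: { {1}, {2, 2}, {3, 3, 3} }.
var groups = values.GroupBy(v => v);

// Flatten the groups back into a single sequence: { 1, 2, 2, 3, 3, 3 }.
List<int> flat = groups.SelectMany(g => g).ToList();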
Now the program returns to something similar to the previous version. The get_info query loops through the remaining files and selects their names and hash codes.
Finally, the group_infos query groups the files by hash code and selects only the groups that contain more than one file.
After this point, the code populates the TreeView control as in the previous examples. See the previous posts for details.
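For completeness, here is a sketch of what the elided loop body might look like. The TreeView name trvMatches and the node layout are guesses, so see the previous posts for the actual code.
foreach (var g in group_infos)
{
    num_groups++;

    // Add a parent node for this group of duplicates, labeled by hash.
    TreeNode group_node = trvMatches.Nodes.Add(g.Key);
    foreach (var info in g)
    {
        num_files++;

        // Add a child node for each duplicate file.
        group_node.Nodes.Add(info.Name);
    }
}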
With this change, the program processes the 8,000-file directory in about 61 seconds, much faster than the previous version's 229 seconds!
Note that some of the information that the two versions collect seems to be cached, perhaps by the operating system or by the disk drive itself. That means the run times may vary greatly if you run the program repeatedly.
In any case, the new version is fast enough to be useful for searching moderately large directories. Click the Download button to get the new version of the program.
Download the example to experiment with it and to see additional details.