
Title: Find duplicate files in C#, Part 1 of 4

This example lets you find and remove duplicate files. It's fairly complex, so I'm going to cover its more interesting pieces in several posts. I won't cover some of the less interesting pieces at all.

Enter a directory path or click the ellipsis button to browse for a folder. When you click Search, the program searches for duplicate files in that directory and displays them in the TreeView control on the left. It groups the files by hash code.

If two files have the same hash code, then they are very likely to be the same, but there is a tiny chance (roughly 1 in 2^128 for two unrelated files, because the MD5 hash used here is 128 bits long) that the two files are different. If you click on a file, the program displays it on the right. (The program can display several kinds of text and graphic files.) By clicking on the files, you can determine whether they are actually different.
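If you would rather not rely on visual inspection, you could confirm a match by comparing the files' contents directly. The following is a minimal sketch of that idea and is not part of the example program. The class name FileComparer and the method name FilesAreIdentical are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;

public static class FileComparer
{
    // Hypothetical helper (not in the example program).
    // Return true if the two files have identical contents.
    public static bool FilesAreIdentical(string path1, string path2)
    {
        // Files with different lengths cannot be identical.
        if (new FileInfo(path1).Length != new FileInfo(path2).Length)
            return false;

        using (FileStream stream1 = File.OpenRead(path1))
        using (FileStream stream2 = File.OpenRead(path2))
        {
            // Compare the files one byte at a time.
            // ReadByte returns -1 at the end of the stream.
            int byte1;
            do
            {
                byte1 = stream1.ReadByte();
                int byte2 = stream2.ReadByte();
                if (byte1 != byte2) return false;
            } while (byte1 != -1);
        }
        return true;
    }
}
```

You would only need to call a method like this for files that already share a hash code, so the extra reads apply to a small fraction of the directory.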

If you click the Select Duplicates button, the program checks all of the files except the first one in each hash group. You can also check and uncheck the files individually by hand.

If you click the Delete Selected button, the program moves the files you selected into the recycle bin.

This post explains how the program finds the duplicated files. When you click the Search button, the following code executes.

    // Search the directory for duplicates.
    private void btnSearch_Click(object sender, EventArgs e)
    {
        Stopwatch watch = new Stopwatch();
        watch.Start();

        Cursor = Cursors.WaitCursor;
        trvFiles.Visible = false;
        trvFiles.Nodes.Clear();
        lblNumDuplicates.Text = "";
        Refresh();

        try
        {
            // Get a list of the files and their hash values.
            var get_info =
                from string filename in Directory.GetFiles(txtDirectory.Text)
                select new
                {
                    Name = filename,
                    Hash = BytesToString(GetHash(filename))
                };

            // Group the files by hash value.
            var group_infos =
                from info in get_info
                group info by info.Hash into g
                where g.Count() > 1
                //orderby g.Key
                select g;

            // Loop through the files.
            int num_groups = 0;
            int num_files = 0;
            foreach (var g in group_infos)
            {
                num_groups++;
                TreeNode hash_node = trvFiles.Nodes.Add(
                    g.Key.ToString());
                foreach (var info in g)
                {
                    num_files++;
                    TreeNode file_node = new TreeNode(info.Name);
                    file_node.Tag = new FileInfo(info.Name);
                    hash_node.Nodes.Add(file_node);
                }
            }

            // Display the number of duplicates.
            lblNumDuplicates.Text =
                (num_files - num_groups).ToString() +
                " duplicate files";

            // Expand all nodes.
            trvFiles.ExpandAll();
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
        finally
        {
            // Scroll to the top.
            if (trvFiles.Nodes.Count > 0)
                trvFiles.Nodes[0].EnsureVisible();
            trvFiles.Visible = true;
            Cursor = Cursors.Default;
        }

        watch.Stop();
        Console.WriteLine(
            watch.Elapsed.TotalSeconds.ToString("0.00") +
            " seconds");
    }

The code performs some initialization tasks such as creating a Stopwatch and displaying the wait cursor.

Next, the code creates a LINQ query that loops over the files returned by the Directory.GetFiles method, which returns the names of the files in a directory. For each file, the query selects the file's name and the file's MD5 hash code. (See the example Calculate hash codes for a file in C# for information on how to get the hash code and convert it into a string.)
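The GetHash and BytesToString methods are not shown in this post. The following sketch shows one plausible way to write them, assuming an MD5 hash and a hexadecimal string format. The bodies here are my assumptions for illustration, and the versions in the hash code example may differ in their details. The wrapper class name HashHelper is hypothetical.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class HashHelper
{
    // Assumed implementation: compute the MD5 hash of the file's contents.
    public static byte[] GetHash(string filename)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(filename))
        {
            return md5.ComputeHash(stream);
        }
    }

    // Assumed implementation: convert the bytes into a string of
    // hexadecimal digits with no separators.
    public static string BytesToString(byte[] bytes)
    {
        // BitConverter.ToString returns values such as "5D-41-40",
        // so remove the dashes.
        return BitConverter.ToString(bytes).Replace("-", "");
    }
}
```

Converting the hash to a string gives the query a simple key to group on and a value that can be displayed directly in the TreeView control.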

The code then uses another LINQ query to group the results of the first query by hash code. It selects only the groups that contain more than one file. The result is a query that returns groups containing at least two files with the same hash code.
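To see the grouping query in isolation, here is a small sketch that runs the same kind of query over in-memory data instead of real files. The class name GroupingDemo, the method name FindDuplicateGroups, and the dictionary input are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupingDemo
{
    // Hypothetical helper. The dictionary maps file names to hash strings.
    // Return the groups of file names that share a hash string.
    public static List<IGrouping<string, string>> FindDuplicateGroups(
        IDictionary<string, string> hashes_by_name)
    {
        var group_infos =
            from pair in hashes_by_name
            group pair.Key by pair.Value into g
            where g.Count() > 1     // Skip hashes that occur only once.
            select g;

        // ToList executes the query. Until then nothing is evaluated.
        return group_infos.ToList();
    }
}
```

For example, if a.txt and c.txt share the hash string "H1" while b.txt and d.txt have unique hashes, the method returns a single group with key "H1" containing a.txt and c.txt. Note that LINQ queries use deferred execution, so in the example program no hashes are calculated until the foreach loop enumerates the grouping query.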

(You can uncomment the orderby clause if you want to sort the groups by hash code. That slows the program down, though, and the hash codes don't really mean anything so sorting them isn't very useful. I put that in there so I could compare the results to the results of another version of the program that I'll describe in a later post.)

Having built the grouping query, the program loops through the groups. For each group, the program adds a top-level node to the TreeView control displaying the hash code. It then loops through the group's contents. Each item within the group contains a file's name and hash code. The code creates a child node for the file, setting its text equal to the file's name. It also sets the child node's Tag property to a FileInfo object representing the file. Later it can use that object to display or delete the file.

As it processes the groups, the code keeps track of the number of groups and the total number of files in those groups. It then reports [# files] - [# groups] as the number of duplicate files. (It basically assumes that one file in each group should be kept and the rest are duplicates.)
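The following sketch isolates that calculation so you can check the arithmetic. The class name DuplicateCounter and the method name CountDuplicates are hypothetical, and the groups are plain collections of file names rather than the LINQ groupings that the program uses.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DuplicateCounter
{
    // Hypothetical helper. Each inner collection holds the names of
    // files that share a hash code.
    public static int CountDuplicates(
        IEnumerable<IEnumerable<string>> groups)
    {
        int num_groups = 0;
        int num_files = 0;
        foreach (IEnumerable<string> g in groups)
        {
            num_groups++;
            num_files += g.Count();
        }

        // Keep one file per group and count the rest as duplicates.
        return num_files - num_groups;
    }
}
```

For example, if one group holds three files and another holds two, there are five files in two groups, so the method returns 5 - 2 = 3 duplicates.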

Next, the code expands all of the TreeView control's nodes so you can see all of the duplicate files. It also scrolls to the top of the TreeView control. The code finishes by displaying the elapsed time in the Console window.

In the next few posts, I'll explain how the program:

Displays file previews and selects the duplicate files when you click the Select Duplicates button
Deletes files when you click the Delete Selected button

See the following examples for more information on specific techniques that the program uses.

Download the example to experiment with it and to see additional details.