[Beginning Database Design Solutions, Second Edition]

Title: Use k-means clustering to find clusters of data in C#

[Use k-means clustering to find clusters of data in C#]

I was recently editing John Mueller and Luca Massaron's book Machine Learning For Dummies, Second Edition. (This link is for the first edition. I'll try to remember to update it when the new edition is available.) The book is about building machine learning algorithm programs in the Python or R programming languages, but this algorithm is so simple that you can implement it easily without any of the special libraries that Python and R provide.

The Goal

The goal is to partition data into clusters. For example, suppose you have a database of customers and their video purchases. If you can cluster the data, then you might be able to identify customers with similar interests. If a customer is in a cluster that seems to like the cartoon sci-fi movies like Monsters vs. Aliens, then you might recommend the movie Megamind to that customer.

The idea is to partition N pieces of data into k clusters. To do that, we need to find centers or centroids for the clusters and assign each piece of data to the center that is closest.

Figuring out how to represent qualitative data such as liking a particular movie is tricky. For example, you might use a different dimension for each movie and assign a customer the values 1 through 5 along that dimension.

[Use k-means clustering to find clusters of data in C#]

It's much easier to see clusters in two-dimensional point data, so that's what this example considers. The picture on the right shows 400 data points in a two-dimensional plane. It's easy to see that the points should probably be grouped into three clusters. This example will identify the clusters in cases such as this one and in harder cases where the clusters are not as clearly defined.

The Example

Right-click to create data points.

The Make Points button lets you create many random points all at once. If the points were distributed completely randomly across the program's picture box, the result wouldn't have obvious clusters so it wouldn't be as interesting. To create non-randomly positioned points, left-click to define seed points. (Those are shown as crosses on the program.) After you have defined some seeds, clicking the Make Points button will make the program generate points that are random but grouped around the seeds.

After you have defined some points, enter the number of clusters you want to create and click the Make Clusters button. The program runs the algorithm, coloring the clusters as it goes.

The label at the bottom of the form shows the current score and the final solution's score (I'll say more about that later) and the number of steps it took to find the clusters.

Click the Clear button to remove the points and seeds so you can start over.

K-Means Clustering

The k-means clustering algorithm is one way to find clusters for the points. There are other versions of this algorithm, but because this one is so common, it's often called simply "the k-means algorithm." It's also called Lloyd's algorithm, named after Stuart Lloyd who first proposed the algorithm at Bell Labs in 1957. There are faster versions of the algorithm, so this version is also sometimes called the "standard" or "naive" version.

This version of the algorithm is remarkably simple.

To find k clusters, pick k of the data points randomly to be the initial cluster centers.
Repeat:
1. For each data point P, find the closest cluster center and assign the point to that cluster.
2. For each cluster, move the center to the centroid of the points assigned to that cluster.
Repeat step 2 until the cluster centers no longer move significantly.

The result you get depends on the initial centers that you pick randomly, so sometimes the algorithm will reach a stable result even though it may not be the best possible solution. For example, the following picture shows two stable solutions for the same points.

[Use k-means clustering to find clusters of data in C#]

To find the best solution, repeat the algorithm several times with different initial cluster centers and pick the one that gives the best result. I'll talk later about how you might decide which solution is best.

Source Code

The algorithm itself is simple, but there are a lot of details that you need to handle to make the example work nicely. I'm not going to cover them here. Instead I'll focus on the details that are needed for the algorithm itself. Download the example to see the rest.

PointData

The program uses the following PointData class to store the data points.

public class PointData { public PointF Location; public int ClusterNum; public PointData(PointF location, int cluster_num) { Location = location; ClusterNum = cluster_num; } public PointData(float x, float y, int set) : this(new PointF(x, y), set) { } }

This class defines two key properties: Location and ClusterNum. The Location property gives the data point's location in two-dimensional space. The ClusterNum property gives the index of the cluster to which the point has been assigned by the k-means clustering algorithm.

MakePoints

The program uses the following MakePoints method to generate random points grouped around the seeds that you have defined.

private void MakePoints() { // Make sure there is at least one seed. if (Seeds.Count < 1) { MessageBox.Show("Please define at least one seed first."); return; } // Clear the Centroids list. Centroids.Clear(); // Make the points. Random rand = new Random(); double max_r = Math.Min( picItems.ClientSize.Width, picItems.ClientSize.Height) / 6; int num_points = int.Parse(txtNumPoints.Text); for (int i = 0; i < num_points; i++) { int seed_num = rand.Next(Seeds.Count); double r = max_r * rand.NextDouble() + max_r * rand.NextDouble(); double theta = rand.Next(360); float x = Seeds[seed_num].X + (float)(r * Math.Cos(theta)); float y = Seeds[seed_num].Y + (float)(r * Math.Sin(theta)); Points.Add(new PointData(x, y, 0)); } picItems.Refresh(); }

This method does some setup, such as clearing the current centroid list, and then loops through the desired number of points. For each point, it picks a random seed point. It then generates two values and an angle. It places the new point a distance equal to the sum of the two values away from the chosen point. It uses the cosine and sine of the angle to move the point away in the random direction.

Assigning Points

The code that assigns the points to clusters is the most interesting part of the program. When you click the Make Clusters button, the following code executes.

private void btnMakeClusters_Click(object sender, EventArgs e) { int num_clusters = int.Parse(txtNumClusters.Text); if (Points.Count < num_clusters) return; // Reset the data. // Pick random centroids. Centroids = new List<PointF>(); Points.Randomize(); for (int i = 0; i < num_clusters; i++) Centroids.Add(Points[i].Location); foreach (PointData point_data in Points) point_data.ClusterNum = 0; NumSteps = 0; picItems.Refresh(); lblScore.Text = ""; Cursor = Cursors.WaitCursor; tmrUpdate.Enabled = true; }

This code creates a new Centroids list to hold the centroids for the clusters. It calls the Randomize extension method to randomize the points. (See the code for details of how that works.) It then uses the first points as the initial centroids for the clusters.

The code then loops through the points and assigns them all initially to cluster number 0. It resets the NumSteps counter and enables the tmrUpdate timer component.

The following code shows the tmrUpdate timer's Tick event handler.

private void tmrUpdate_Tick(object sender, EventArgs e) { NumSteps++; UpdateSolution(); picItems.Refresh(); }

This code simply increments the step counter, calls UpdateSolution to do the interesting work, and refreshes the display.

The following code shows the UpdateSolution method.

private void UpdateSolution() { // Find new centroids. int num_clusters = Centroids.Count; PointF[] new_centers = new PointF[num_clusters]; int[] num_points = new int[num_clusters]; foreach (PointData point in Points) { double best_dist = Distance(point.Location, Centroids[0]); int best_cluster = 0; for (int i = 1; i < num_clusters; i++) { double test_dist = Distance(point.Location, Centroids[i]); if (test_dist < best_dist) { best_dist = test_dist; best_cluster = i; } } point.ClusterNum = best_cluster; new_centers[best_cluster].X += point.Location.X; new_centers[best_cluster].Y += point.Location.Y; num_points[best_cluster]++; } // Calculate the new centroids. List<PointF> new_centroids = new List<PointF>(); for (int i = 0; i < num_clusters; i++) { new_centroids.Add(new PointF( new_centers[i].X / num_points[i], new_centers[i].Y / num_points[i])); } // See if the centroids have moved. bool centroids_changed = false; for (int i = 0; i < num_clusters; i++) { const float min_change = 0.5f; if ((Math.Abs(Centroids[i].X - new_centroids[i].X) > min_change) || (Math.Abs(Centroids[i].Y - new_centroids[i].Y) > min_change)) { centroids_changed = true; break; } } if (!centroids_changed) { tmrUpdate.Enabled = false; //System.Media.SystemSounds.Beep.Play(); lblScore.Text = "Score: " + Score().ToString() + ", # Steps: " + NumSteps.ToString(); Cursor = Cursors.Default; return; } // Update the centroids. Centroids = new_centroids; }

This method starts with some initialization. It then loops through the data points. For each point, it loops through the current cluster centroids to see which is closest to the point. It then assigns that point to the closest cluster.

This loop also adds the point's X and Y coordinates to the new_centers value for the cluster. After it has assigned all of the points, the program will divide those values by the number of points assigned to each cluster to find the cluster's new centroid.

After it finishes the first loop, the method loops through the centroids. As I mentioned in the preceding paragraph, it divides the values stored in the new_centroids array by the number of points in each cluster to find the cluster's new centroid.

Next the program determines whether the new centroids are significantly different from the previous ones. To do that, it loops through the clusters and calculates the X and Y distances between the old and new centroids. If either coordinate has moved by more than 0.5, the program considers the centroid to have moved. (Note that the distances here are pixels, so if the centroids move by less than 0.5 pixels, they are essentially fixed.)

If the centroids have not moved, then the method disables the tmrUpdate timer and displays the results.

Score

I mentioned earlier that you may get different results from different starting centers. To find the best solution, you need to repeat the algorithm several times and pick the best arrangement of clusters. To do that, you need to be able to somehow score the solutions.

The following Score method calculates a score for a solution.

private double Score(List centroids, List points) { double total = 0; foreach (PointData point_data in points) { total += Distance2(point_data.Location, centroids[point_data.ClusterNum]); } return total; }

This method calculates the total of the squares of the distances between each point and the center of its cluster. It simply loops through the points, calls the Distance2 method (not shown) to calculate the distance between the point and its assigned center, and adds the result to a total.

Summary

The k-means clustering algorithm shown here is fairly simple. It usually produces a result that looks reasonably good in only a few steps. Download the example and give it a try. Some of the most interesting runs happen when the randomly selected initial cluster centers are close to each other so they need to move several times before they settle down.

[Use k-means clustering to find clusters of data in C#]

Note that this algorithm does not pick the number of clusters that it should make. In the earlier examples shown here, it's fairly obvious that the points should be grouped into three clusters, but consider the picture on the right. Here the algorithm used three clusters when it probably should have used two. Basically one of the two obvious clusters is separated into two pieces. If you run the program several times, it will sometimes divide the left cluster and sometimes the right.

In these simple two-dimensional examples, it's usually easy to guess the correct number of clusters, but it might be much harder for higher-dimensional data. For example, suppose the points were in six-dimensional space. It could be very hard to figure out how many clusters to use.

To solve that problem, you will probably need to come up with a new scoring system that can compare solutions containing different numbers of clusters. You couldn't just use the simple Score method shown earlier, because it will favor using a separate cluster for each data point.

Download the example to experiment with it and to see additional details.