###### K-Means Step by Step in F#

Recently we had a discussion on clustering techniques at one of our weekly tech talks and the k-means clustering algorithm came up. K-means is considered one of the simplest unsupervised machine learning techniques and I thought it would be cool to try my hand at an F# implementation of it.

In short, k-means is a way to put data into groups based on distance between nearest neighbors. Technically it works with any dimension but for this post I’ll stick with 1-d data to make things easy.

## K-means background

Imagine you have some 1-dimensional data like

```1, 2, 5, 14, 17, 19, 20
```

And you want to group this into two groups. Pretty easily you can see that 1, 2, and 5 go into one group, and 14, 17, 19, 20 should go in another group. But why? For this small data set we intuitively figured out that 1, 2, and 5 are close together, but at the same time far away from 14, 17, 19, and 20, which are also close together. Really what we did was we calculated the centroid of each group and assigned the values to the centroid. A centroid is a fancy way of saying center point of a group. It’s calculated by taking the average of a group.

For example, if we have the group

```1, 2, 5
```

What is its centroid? The answer is

```(1 + 2 + 5) / 3 = 2.666
```

So with k-means you end up with k centroids and a bunch of data grouped to that centroid. It’s grouped to that centroid because that data is closest to that centroid vs any other centroid. To calculate the centroids and grouping k-means uses an iterative process:

• Pick k random points. So if you are doing a k=2 clustering, just pick 2 random points from your data set. These are going to be our initial centroids. It doesn’t matter if it’s right, or even if the points are the same. The algorithm is going to fine tune things for us. Technically this is called the Forgy Method of initialization. There are a bunch of other ways to initialize your centroids, but Forgy seems to be pretty common.
• For every point, calculate its distance from the centroid. So lets say we picked points 1 and 2 as our centroids. The distances then look like this:
```     Δ1      Δ2
1    0        1
2    1        0
5    4        3
14   13       12
17   16       15
19   18       17
20   19       18
```
• Now what we do is assign each data point to its nearest centroid. So, obviously, point 1 is closest to the initial arbitrary point 1 centroid. And you can see that basically everything else is closer to the arbitrary point 2 centroid. So we’re left with a grouping like this now:
```data around centroid 1: 1
data around centroid 2: 2, 5, 14, 17, 19, 20
```
• Now, the next step is to calculate new centroids based on these groups. For this example, we can take the average of all the points together to find the new center. So the new centroids become 1 and 12.83. Now we go back to step 2 using the new centroids. If you keep track of the current centroid and the previous centroid, then you can stop the k-means calculation when centroids don’t change between iterations. At this point everything has converged and you’ve got your k clusters

Unfortunately k-means doesn’t always converge, so you can add a convergence delta (i.e. if the centroid between the current and previous iterations hasn’t really changed much, so it’s within some acceptable range) or an iteration limit (only try to cluster 100 times) so you can set bounds on the clustering.

## Implementation

The first thing to do for my implementation is to define a structure representing a single data point. I created a `DataPoint` class that takes a float list which represents the data points dimensions. So, a 1-dimesional data point would have a float list of one element. A 2-d data point has one of two points (x and y), etc. To give credit where credit is due, I took the dimensional array representation from another F# implementation.

```type DataPoint(input:float list) =
member this.Data = input
member this.Dimensions = List.length input
override x.Equals(yobj) =
match yobj with
| :? DataPoint as y -> (x.Data = y.Data)
| _ -> false
static member Distance (d1:DataPoint) (d2:DataPoint) =
if List.length d1.Data <> List.length d2.Data then
0.0
else
let sums = List.fold2(fun acc d1Item d2Item ->
acc + (d2Item - d1Item)**2.0
) 0.0 d1.Data d2.Data
sqrt(sums)
```

It has its data, the dimensions, an equality overload (so I can compare data points by their data and not by reference), as well as a static helper method to calculate the euclidean distance between two n-dimensional points. Here I’m using the `fold2` method which can fold two lists simultaneously.

## Alias data types

For fun, I wanted to use F#’s data aliasing feature, so I created some helper types to make dealing with tuples and other data pairing easier. I’ve defined the following

```type Centroid = DataPoint

type Cluster = Centroid * DataPoint List

type Distance = float

type DistanceAroundCentroid = DataPoint * Centroid * Distance

type Clusters = Cluster List

type ClustersAndOldCentroids = Clusters * Centroid list
```

The only unfortunate thing is you need to explicitly mention aliased types for the type inference to work properly. The reason is because F# will prefer native types to typed aliases (why choose `Distance` over float unless you are told to), but it would’ve been nice if it had some way to prefer local type aliases in a module.

## Centroid assignment

This next major step is to take the current raw data list and what the current centroids are and cluster them. A cluster is a centroid and its associated data points. To make things clearer, I created a helper function `distFromCentroid` which takes a point, a centroid, and returns a tuple of point * centroid * distance.

```let private distFromCentroid pt centroid : DistanceAroundCentroid = (pt, centroid, (DataPoint.Distance centroid pt))
```

The bulk of the k-means work is in `assignCentroids`. First, for every data point, we calculate its distance from the current centroid. Then we select the centroid which is closest to our data points and create a list of point * centroid tuples. Once we have a list of point * centroid tuples, we want to group this list by the second value in the tuple: the centroid. Here, I’m leveraging the pipe operator to do the grouping and mapping. First, we group the lists by the centroid. This gives us a list of `centroid * (datapoint * centroid)` items. But this isn’t quite right yet. We want to end up with a list of `centroid * dataPoint list` so I passed this temporary list through some other minor filtering and formatting.

```(*
Takes the input list and the current group of centroids.
Calculates the distance of each point from each centroid
then assigns the data point to the centroid that is closest.
*)
let private assignCentroids (rawDataList:DataPoint list) (currentCentroids:Clusters) =
let dataPointsWithCentroid =
rawDataList
|> List.map(fun pt ->
let currentPointByEachCentroid = List.map(fun (centroid, _) -> distFromCentroid pt centroid) currentCentroids
let (_, nearestCentroid, _) = List.minBy(fun (_, _, distance) -> distance) currentPointByEachCentroid
(pt, nearestCentroid)
)

(*
|> Group all the data points by their centroid
|> Select centroid * dataPointList
|> For each previous centroid, calculate a new centroid based on the aggregated list
*)

List.toSeq dataPointsWithCentroid
|> Seq.groupBy (fun (_, centr) -> centr)
|> Seq.map(fun (centr, dataPointsWithCentroid) ->
let dataPointList = Seq.toList (Seq.map(fun (pt, _) -> pt) dataPointsWithCentroid)
(centr, dataPointList)
)
|> Seq.map(fun (cent, dataPointList) ->
let newCent = calculateCentroidForPts dataPointList
(newCent, dataPointList)
)
|> Seq.toList

```

## Calculating new centroids

Since we’ve grouped data points by centroid, we now need to calculate what the new centroid for that list is. This is where I employ a disclaimer. For my implementation of n-dimensional centroids I averaged each dimension together to get a new value. I’m not totally sure this is actually valid for arbitrary geometrical shapes (given by n-dimensions), but if it’s not at least this is the only method that needs to be updated.

First I create an empty n-dimension array of all zeros, initialized by the dimensions of the first point in the list (all the dimensions should be the same so it doesn’t matter which point we choose). This is the seed for list for the next fold. I’m not actually going to fold the value into a single element; I’m going to fold all indexes of the data point list together into a new array by adding each dimension together.

For example, if I have two points of two dimensions [1, 2] and [3, 4] I add 1 and 3 together to make the first dimension, then 2 and 4 together to make the second dimension. This gives me an intermediate vector of [4,6]. Then I can average each dimension by the number of data points used, so in our example we used two data points, so the new centroid would be [2, 3]. Since a centroid is the same as a data point (even though one that doesn’t exist in our original raw set), we can return the new centroid as a data point. Having everything treated as data points makes the calculations easier.

```(*
take each float list representing an n-dimesional space for every data point
and average each dimension together.  I.e. element 1 for each point gets averaged together
and element 2 gets averaged together.  This new float list is the new nDimesional centroid
*)

let private calculateCentroidForPts (dataPointList:DataPoint List) =
let nDimesionalEmptyList = List.init firstElem.Dimensions (fun i-> 0.0)
nDimesionalEmptyList
dataPointList

// scale the nDimensional sums by the size
let newCentroid = List.map(fun pt -> pt/((float)(List.length dataPointList))) addedDimensions
new DataPoint(newCentroid)
```

Here is the function to aggregate list elements together

```(*  goes through two lists and adds each element together
i.e. index 1 with index 1, index 2 with index2
and returns a new list of the items aggregated
*)

let private addListElements list1 list2 operator =
let rec addListElements' list1 list2 accumulator operator =
if List.length list1 = 0 then
accumulator
else
```

I’m using a hidden accumulator so that the function will be tail recursive.

## Cluster runner

The above code only does one single cluster iteration though . What we need is a clustering runner that does the clustering iterations. This runner will be responsible for determining when to stop the clustering. There are four functions here:

• `getClusters` and `getOldCentroids` are helper methods to pull data out of the tuples (fst and snd are built in functions)
• `cluster` is a basic method that will cluster as many times as necessary until the centroids stop changing
• `clusterWithIterationLimit` is the bulk method that takes in the raw data, the clustering amount (k), the max number of clustering iterations, and a convergence delta to avoid infinite loops

For each iteration create a new set of clusters, then test to see if if the centroids are the same. If they are, we’ve converged and we can end. If not, see if the clusters are “close enough” (the convergence delta). The final base case is if we’ve clustered the max iterations then return.

```(*
but also take in a max iteration limit as well as a converge
delta.

continue to cluster until the centroids stop changing or
the iteration limit is reached OR the converge delta is achieved

Returns a new "Clusters" type representing the centroid
and all the data points associated with it
*)

let private getClusters (cluster:ClustersAndOldCentroids) = fst cluster
let private getOldCentroids (cluster:ClustersAndOldCentroids) = snd cluster

let clusterWithIterationLimit data k limit delta =
let initialClusters:Clusters = initialCentroids (List.toArray data) k
|> Seq.toList
|> List.map (fun i -> (i,[]))

let rec cluster' (clustersWithOldCentroids:ClustersAndOldCentroids) count =
if count = 0 then
getClusters clustersWithOldCentroids
else
let newClusters = assignCentroids data (getClusters clustersWithOldCentroids)
let previousCentroids = getOldCentroids clustersWithOldCentroids
let newCentroids = extractCentroidsFromClusters newClusters

// clusters didnt change
if listsEqual newCentroids previousCentroids then
newClusters
else if centroidsDeltaMinimzed previousCentroids newCentroids delta then
newClusters
else
cluster' (newClusters, newCentroids) (count - 1)

cluster' (initialClusters, extractCentroidsFromClusters initialClusters) limit

(*
continue to cluster until the centroids stop changing
Returns a new "Clusters" type representing the centroid
and all the data points associated with it
*)

let cluster data k = clusterWithIterationLimit data k Int32.MaxValue 0.0
```

## Demo

Using the sample data from the beginning of this post, let’s run it through my k-means clusterer:

```module KDataTest

open System
open KMeans

let kClusterValue = 2

let sampleData : KMeans.DataPoint list = [new DataPoint([1.0]);
new DataPoint([2.0]);
new DataPoint([5.0]);
new DataPoint([14.0]);
new DataPoint([17.0]);
new DataPoint([19.0]);
new DataPoint([20.0])]

KMeans.cluster sampleData kClusterValue
|> Seq.iter(KMeans.displayClusterInfo)

```

And the output is

```Centroid [2.66666666666667], with data points:
[1], [2], [5]

Centroid [17.5], with data points:
[14], [17], [19], [20]
```

Full source available at our github