class kMeans
n: number of features
m: number of datapoints
K: number of clusters
X: the m×n dataset
class kMeans
Initialize a new k-means clustering object. Pass the following options, and proceed to clustering. Options:
K
(Integer) Default: 5
The number of clusters.
maxIterations
(Integer) Default: 100
The maximum number of iterations before the algorithm stops.
enableConvergenceTest
(Boolean) Default: true
Enable the convergence test. The test can be computationally intensive, but
may also save many iterations.
tolerance
(Float) Default: 1e-9
Floating point value for the convergence tolerance.
initialize
(Function)
The function used to initialize the centroids.
Default: the Forgy method, which chooses K random datapoints as the
initial centroids. You can also write your own with the following signature
fn(X, K, m, n), where:
X
(2D Array): The set of datapoints. Remember this is passed by reference and is therefore **mutable**.
K
(Integer): The number of centroids to initialize.
m
(Integer): The number of datapoints in X.
n
(Integer): The number of features of each datapoint in X.
distanceMetric
(Function)
The function used to measure the distance between a centroid and the points in its cluster.
constructor: (options = {}) ->
@K = options.K ? 5
@maxIterations = options.maxIterations ? 100
@enableConvergenceTest = options.enableConvergenceTest ? true
@tolerance = options.tolerance ? 1e-9
@initialize = options.initialize ? kMeans.initializeForgy
@distanceMetric = options.distanceMetric ? @sumSquared
if not (1 <= @K < Infinity)
throw "K must be in the interval [1, Infinity)"
if not (1 <= @maxIterations < Infinity)
throw "maxIterations must be in the interval [1, Infinity)"
Initialize clustering over dataset X. X should be an m×n matrix of m data rows and n feature columns.
cluster: (@X) ->
@prevCentroids = []
@clusters = []
@currentIteration = 0
[@m, @n] = [@X.length, @X[0]?.length]
if not @m? or not @n? or @n < 1
throw "Data must be of the format [[x_11, ... , x_1n], ... , [x_m1, ... , x_mn]]"
@centroids = @initialize(@X, @K, @m, @n)
if @centroids.length != @K or @centroids[0].length != @n
throw "`initialize` must return a K×n matrix"
Used when custom logic is interleaved within the clustering process. See the autoCluster(X) method below for an example.
step: ->
@currentIteration++ < @maxIterations
Cluster dataset X by means of the standard algorithm:
1. Find the closest centroid for each datapoint and assign the datapoint to that centroid's cluster
2. Move each centroid to the mean of its cluster
3. (Optional) Check for convergence by measuring the distance each centroid has moved since the last iteration
autoCluster: (X) ->
@cluster X
while @step()
@findClosestCentroids()
@moveCentroids()
break if @hasConverged()
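If you need to interleave custom logic, the same loop can be driven by hand instead of calling autoCluster. A minimal sketch (the progress logging is purely illustrative, and `data` is assumed to be an m×n array you already have):
km = new kMeans K: 3
km.cluster data
while km.step()
  km.findClosestCentroids()
  km.moveCentroids()
  console.log "iteration #{km.currentIteration}"   # custom logic goes here
  break if km.hasConverged()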
The Forgy method uses K random datapoints as the initial centroids. Accessed as kMeans.initializeForgy. O(K).
@initializeForgy: (X, K, m, n) ->
  # Copy each chosen row, so that moving a centroid later does not mutate X
  X[Math.floor Math.random() * m].slice(0) for k in [0...K]
This initialization places K centroids at random within the range of the datapoints. Accessed as kMeans.initializeInRange. O(n·m + K·n).
@initializeInRange: (X, K, m, n) ->
min = (Infinity for i in [0...n])
max = (-Infinity for i in [0...n])
for x in X
for d, i in x
min[i] = Math.min min[i], d
max[i] = Math.max max[i], d
(for k in [0...K]
(for d in [0...n]
Math.random() * (max[d] - min[d]) + min[d]
)
)
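To use this initializer instead of the default Forgy method, pass it via options.initialize, for example:
km = new kMeans
  K: 4
  initialize: kMeans.initializeInRange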
Assign each datapoint to the cluster of its closest centroid. This is done by adding the datapoint's index to an array of clusters, where the index of each cluster corresponds to the index of that cluster's centroid.
findClosestCentroids: ->
if @enableConvergenceTest
@prevCentroids = (r.slice(0) for r in @centroids) # Clone rows, so the convergence test can compare old and new positions
Datapoints are assigned to clusters (rather than storing a cluster per datapoint) to optimize the move step. However, this means that finding the cluster of a specific datapoint requires probing every element of every cluster.
@clusters = ([] for i in [0...@K])
for x, i in @X
cMin = 0
xMin = Infinity
for c, j in @centroids
dist = @distanceMetric c, x
if dist < xMin
  cMin = j
  xMin = dist
@clusters[cMin].push i
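If you later need the inverse mapping, i.e. the cluster index for a given datapoint, it can be rebuilt from @clusters after clustering. A hypothetical helper (not part of the class) might look like:
labelsFor = (km) ->
  labels = new Array km.m
  for cl, c in km.clusters
    labels[i] = c for i in cl
  labels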
Iterate through each cluster and move its centroid to the mean of all the datapoints in that cluster.
moveCentroids: ->
for cl, i in @clusters
continue if cl.length < 1 #Avoid division by 0
for j in [0...@n]
sum = 0
for d in cl
sum += @X[d][j]
@centroids[i][j] = sum/(cl.length)
Check whether any element of the absolute difference between the previous and current centroid positions is greater than the set tolerance. If none are greater, the algorithm has converged.
hasConverged: ->
return false if not @enableConvergenceTest
for i in [0...@K]
  for j in [0...@n]
    absDelta = Math.abs @prevCentroids[i][j] - @centroids[i][j]
    # A centroid that moved more than the tolerance means no convergence yet
    return false if absDelta > @tolerance
return true
The default distance metric, used by the findClosestCentroids step. Takes the square of the Euclidean distance between two points. A custom metric can be used by passing options.distanceMetric when initializing kMeans.
sumSquared: (X, Y) ->
sum = 0
n = X.length
while n--
sum += (Y[n] - X[n]) * (Y[n] - X[n])
sum
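Any function with the same (X, Y) signature can be substituted. For instance, a Manhattan (L1) distance could be plugged in like this (a sketch, not part of the library):
manhattan = (X, Y) ->
  sum = 0
  n = X.length
  while n--
    sum += Math.abs Y[n] - X[n]
  sum

km = new kMeans distanceMetric: manhattan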
Check whether module.exports or exports is present; otherwise attach the class to the window object.
if module?.exports? or exports?
module.exports = exports = kMeans
else
window.kMeans = kMeans
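Putting it all together, a typical Node.js session might look like the following (the require path and the dataset are made-up examples):
kMeans = require './kmeans'   # assumed path; adjust to your setup

data = [
  [1,   2  ]
  [1.5, 1.8]
  [8,   8  ]
  [9,   11 ]
]

km = new kMeans K: 2
km.autoCluster data
console.log km.centroids   # K×n array of centroid positions
console.log km.clusters    # K arrays of datapoint indices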