Problem
Let X∈Rd×n be a data matrix with examples in its
columns (e.g. there are n examples with d features each). Let C∈Rd×k be a matrix of k-means centroids (cluster centers).
Recall the k-means algorithm iteratively computes two high-level steps:
- Find the distances from every example to every centroid.
- Update each centroid to be the mean of the examples assigned to it. The
examples are assigned to the centroid which they are closest to.
Write a single line vectorized implementation of step 1.
Solution
show
We compute the distances from the i-th example to the j-th centroid in
the (i,j)-th element of a matrix D∈Rn×k. Note
also that the below is not computing the distance exactly, but the relative
distances (all that we need for k-means) are preserved. The exact distances
could be computed with a third term similar to the second but with the matrix
X. However this is unecessary work.
D=XTK+11T(K⊙K)
In the above ⊙ is the Hadamard (elementwise) product and the 1 vectors have the size necessary for the syntax of the above statement to be
sensical (namely n and d respecitvly).
The line could be implemented in code with something like (using numpy
syntax)
np.dot(X.T, K) + np.sum(K**2, axis=1, keepdims=True)