Problem
Let $X \in \mathbb{R}^{d \times n}$ be a data matrix with examples in its
columns (i.e. there are $n$ examples with $d$ features each). Let $K \in \mathbb{R}^{d \times k}$ be a matrix of $k$-means centroids (cluster centers), also stored in columns.
Recall that the k-means algorithm iteratively alternates between two high-level steps:
- Find the distance from every example to every centroid.
- Update each centroid to be the mean of the examples assigned to it. Each
example is assigned to the centroid it is closest to.
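For reference, the two steps can be sketched as a straightforward (deliberately non-vectorized) numpy loop. The names `X`, `K`, and `n_iters` are assumptions matching the shapes described above, not code from the original:

```python
import numpy as np

def kmeans(X, K, n_iters=10):
    """Naive k-means sketch. X: (d, n) data matrix, K: (d, k) initial
    centroids, both with examples/centroids in columns (assumed shapes)."""
    for _ in range(n_iters):
        # Step 1: squared distance from every example to every centroid,
        # computed with an explicit double loop for clarity.
        D = np.array([[np.sum((X[:, i] - K[:, j]) ** 2)
                       for j in range(K.shape[1])]
                      for i in range(X.shape[1])])
        # Step 2: assign each example to its nearest centroid, then
        # update each centroid to the mean of its assigned examples.
        assign = np.argmin(D, axis=1)
        for j in range(K.shape[1]):
            members = X[:, assign == j]
            if members.shape[1] > 0:
                K[:, j] = members.mean(axis=1)
    return K, assign
```

The distance computation in step 1 is the part the problem asks us to vectorize.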
Write a single-line vectorized implementation of step 1.
Solution
We compute the distance from the $i$-th example to the $j$-th centroid in
the $(i, j)$-th element of a matrix $D \in \mathbb{R}^{n \times k}$. Note
that the expression below does not compute the squared distance exactly: it drops
the term $\|x_i\|^2$, which is constant across centroids for a given example,
so the relative distances (all that we need for k-means) are preserved. The exact
squared distances could be computed by adding a third term similar to the second
but built from the matrix $X$. However, this is unnecessary work.
\begin{align}
D = -2 X^T K + {\bf 1} {\bf 1}^T (K \odot K)
\end{align}
In the above, $\odot$ is the Hadamard (elementwise) product, and the two ${\bf 1}$ vectors have the sizes necessary for the expression to be
well formed (namely $n$ and $d$ respectively).
The line could be implemented in code with something like (using numpy
syntax)
-2 * np.dot(X.T, K) + np.sum(K**2, axis=0, keepdims=True)
Note the sum runs over axis 0 (the feature dimension), producing a $1 \times k$ row of squared centroid norms that broadcasts across the $n \times k$ product.
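As a sanity check, the one-liner should produce the same nearest-centroid assignments as the exact squared distances. A minimal sketch, assuming $X$ is $d \times n$ and $K$ is $d \times k$ as above (the variable names and random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 100, 3
X = rng.standard_normal((d, n))  # examples in columns
K = rng.standard_normal((d, k))  # centroids in columns

# Relative distances: exact squared distances minus the per-example
# constant ||x_i||^2, which does not affect the argmin over centroids.
D_rel = -2 * np.dot(X.T, K) + np.sum(K**2, axis=0, keepdims=True)

# Exact squared distances, computed by brute-force broadcasting.
D_exact = ((X.T[:, None, :] - K.T[None, :, :]) ** 2).sum(axis=2)

# Adding back the dropped per-example term recovers the exact distances,
# and the nearest-centroid assignments agree.
assert np.allclose(D_rel + np.sum(X**2, axis=0)[:, None], D_exact)
assert np.array_equal(np.argmin(D_rel, axis=1), np.argmin(D_exact, axis=1))
```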