k-means Clustering Algorithm

Notice

An error where the horizontal bars for fractions aren’t displayed occurs in this webpage.

Two thirds is displayed like \(\frac{2}{3}\).

What is clustering?

Clustering is the process of grouping a set of vector data so that data in the same group (called a cluster) is more similar than data in other groups (clusters.)

Two types of clustering methods

There are two main types of clustering methods. One is hard clustering (also known as crisp clustering) and the other is fuzzy clustering (also known as soft clustering.) Hard clustering is a process that simply classifies data into multiple clusters. On the other hand, fuzzy clustering method allows a datum to belong to multiple clusters and uses a coefficient \(u\in{[0,1]}\) to express the extent to which a certain datum belongs to a certain cluster. This article explains k-means (also known as Hard c-means), an algorithm for hard clustering.

Summary of k-means algorithm

Step 1. Set initial cluster centres randomly.
Step 2. Assign each datum to a nearest cluster centre.
Step 3. Move cluster centres to the centres of gravity of each clusters.
Step 4. If cluster centres converge, output the clusters. Otherwise, go back to Step 2.

Step 1. Initialise cluster centres

We set initial cluster centres randomly. The number of elements in a cluster centre is the same as the number of elements each datum we are clustering has.
$$V=\{v_1,v_2,\cdots,v_C\}$$
,where \(V\) is a set of clusters, \(C\) is the number of clusters and \(v_k\) is the k-th cluster centre.

Step 2. Assign data

We assign each datum to a nearest cluster centre. Squared euclidean distance is used to determine that which cluster is the nearest to a datum.
In other words, the cluster each datum belongs is obtained by finding the minimum of the following expression achieved for \(u_{i,k}\).
$$\sum_{i=1}^C\sum_{k=1}^Nu_{i,k}\|x_k-v_i\|_2^2$$
,where \(N\) is the number of data, \(x_k\) is the i-th datum and \(u_{i,k}\) shows whether \(x_k\) belongs to the i-th cluster (If \(x_k\) belongs to the i-th cluster, \(u_{i,k}=1\). Otherwise \(u_{i,k}=0\).)

Step 3. Move cluster centres

We move cluster centres to the centres of gravity of each clusters. Cluster centres are obtained by solving the following optimisation problem.
$$minimize_v~~~{\sum_{i=1}^C\sum_{k=1}^Nu_{i,k}\|x_k-v_i\|_2^2}$$ Since \(v_i\in{V}\) and \(v_j\in{V}(i\neq j)\) are independent, \(v_i\) is obtained by solving $$minimize_{v_i}~{\sum_{k=1}^Nu_{i,k}\|x_k-v_i\|_2^2}$$
Since \(v_{i,q}\) and \(v_{i,r} (q\neq r)\) are independent, \(v_{i,q}\) is obtained by solving
$$minimize_{v_{i,q}}~~~{\sum_{k=1}^Nu_{i,k}(x_{k,q}-v_{i,q})^2}$$
,where \(v_{i,q}\) is the q-th element of \(v_i\).
Since the argument of minimize is convex for \(v_{i,q}\), \(v_{i,q}\) is obtained by solving
$$\frac{\partial}{\partial v_{i,q}}\sum_{k=1}^Nu_{i,k}(x_{k,q}-v_{i,q})^2=0$$
$$⇔ -2\sum_{k=1}^Nu_{i,k}(x_{k,q}-v_{i,q})=0$$
$$⇔\sum_{k=1}^Nu_{i,k}(x_{k,q}-v_{i,q})=0$$
$$⇔\sum_{k=1}^Nu_{i,k}x_{k,q}-\sum_{k=1}^Nu_{i,k}v_{i,q}=0$$
$$⇔\sum_{k=1}^Nu_{i,k}x_{k,q}-v_{i,q}\sum_{k=1}^Nu_{i,k}=0$$
$$⇔v_{i,q}=\frac{\sum_{k=1}^Nu_{i,k}x_{k,q}}{\sum_{k=1}^Nu_{i,k}}$$
Therefore,
$$v_i=(v_{i,1},v_{i,2},\cdots,v_{i,Q})$$
,where \(Q\) is the number of elements each datum has.
$$⇔v_i=(\frac{\sum_{k=1}^Nu_{i,k}x_{k,1}}{\sum_{k=1}^Nu_{i,k}},\frac{\sum_{k=1}^Nu_{i,k}x_{k,2}}{\sum_{k=1}^Nu_{i,k}},\cdots,\frac{\sum_{k=1}^Nu_{i,k}x_{k,Q}}{\sum_{k=1}^Nu_{i,k}})$$
$$⇔v_i=\frac{1}{\sum_{k=1}^Nu_{i,k}}(\sum_{k=1}^Nu_{i,k}x_{k,1},\sum_{k=1}^Nu_{i,k}x_{k,2},\cdots,\sum_{k=1}^Nu_{i,k}x_{k,Q})$$
$$⇔v_i=\frac{1}{\sum_{k=1}^Nu_{i,k}}\sum_{k=1}^Nu_{i,k}(x_{k,1},x_{k,2},\cdots,x_{k,Q})$$
$$⇔v_i=\frac{1}{\sum_{k=1}^Nu_{i,k}}\sum_{k=1}^Nu_{i,k}x_k$$

Step 4. Determine whether cluster centres have converged

If cluster centres have converged, it outputs the clusters and this algorithm terminates. Otherwise, we go back to Step 2.

Reference

Homenda, Wladislaw and Pedrycz, Witold. 1953. PATTERN RECOGNITION. New Jersey: John Wiley & Sons, Inc.

That’s all. Thank you for reading.

Kindly leave a comment if you have any doubts.

コメント

タイトルとURLをコピーしました