1. Introduction
Labels are an essential ingredient of supervised algorithms such as Support Vector Machines, which learn a hypothesis function to predict labels from features. K-means clustering, by contrast, is a form of unsupervised learning, used when you have unlabeled data (i.e., data without predefined categories, groups, or response variables).
2. Key Terms

Cluster is a collection of data points aggregated together because of certain similarities.

Cluster Centroid (or simply centroid) is the mean of a cluster; its values are the coordinatewise averages of the data points in that cluster.

Within-Cluster Variance is the sum of the coordinatewise squared deviations from the cluster centroid over all observations belonging to that cluster:

W(C_k) = \sum_{i \in C_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2

In the expression above, x_{ij} denotes the jth predictor of observation i; C_k denotes the set of points belonging to cluster k; and \mu_k denotes the centroid of cluster k.

Total Within-Cluster Variance is the within-cluster variance summed across all clusters:

\sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2

Note that the notation \lVert x_i - \mu_k \rVert means the Euclidean distance between the vectors x_i and \mu_k.
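To make the definitions above concrete, here is a minimal NumPy sketch that computes the total within-cluster variance. The function and array names, and the toy data, are illustrative assumptions, not part of the original text:

```python
import numpy as np

def total_within_cluster_variance(X, labels, centroids):
    """Sum of squared Euclidean distances of each point to its cluster centroid.

    X         : (n, P) data matrix
    labels    : (n,) cluster index of each observation
    centroids : (K, P) cluster centroids
    """
    diffs = X - centroids[labels]     # coordinatewise deviations from each point's centroid
    return float(np.sum(diffs ** 2))  # squared deviations, summed over all clusters

# Tiny example: two clusters of two points each in 2-D
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(total_within_cluster_variance(X, labels, centroids))  # 4.0
```

Each cluster contributes 2 (two points, each a squared distance of 1 from its centroid), so the total is 4.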
3. Data Representation and Preparation
In the formulas above, x_i represents a vector in a P-dimensional space, where P is the number of predictors in the data set. As you can see from the formulas above, cluster assignments are driven entirely by Euclidean distances between observations, so predictors measured on very different scales should typically be standardized before clustering.
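Because Euclidean distance lets predictors on larger numeric scales dominate the clustering, a common preparation step is to standardize each predictor to zero mean and unit variance. A minimal sketch, using hypothetical data that is not from the text:

```python
import numpy as np

# Hypothetical data: one predictor in metres, another in thousands of dollars
X = np.array([[1.70, 55.0],
              [1.80, 60.0],
              [1.65, 90.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_std.std(axis=0))   # and standard deviation 1
```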
4. Algorithm

Step 1: Randomly generate K centroids

Step 2: Assign data points to the cluster of the closest centroid:

Step 3: Compute the mean of each cluster

Step 4: Reassign centroids to respective clusters’ means computed in Step 3

Step 5: If the stop criterion is not satisfied: Go to Step 2
Stop criterion can be one of the following:

Cluster reassignment results in the same clusters

A specified number of iterations is reached

Reassigned centroids lie within a specified distance of the previous centroids

Because K-means converges only to a local optimum, the algorithm should be run multiple times with different random initializations, and the clustering that achieves the smallest total within-cluster variance is kept as the final (approximately global) solution.
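The five steps above can be sketched in plain NumPy as follows. This is a minimal illustration, not a production implementation; it uses the "same clusters" stopping criterion from the list above, and the function and variable names are assumptions:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means sketch following Steps 1-5 above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when reassignment results in the same clusters
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 3-4: move each centroid to the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Two obvious groups; the algorithm should separate them
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X_demo, K=2)
print(labels, centroids)
```

Note the guard for empty clusters: if no point is assigned to a centroid, that centroid is simply left in place for the next iteration.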
Example: Figure 1 illustrates the K-means algorithm. Training examples are shown as dots, and the K cluster centroids are shown as crosses. (a) is the original dataset. (b) shows random initial cluster centroids. (c)-(f) illustrate two iterations of K-means. In each iteration, we assign each training example to the closest cluster centroid (shown by "painting" the training examples the same color as the centroid to which each is assigned); we then move each cluster centroid to the mean of the points assigned to it.
5. Choosing K
There are three common ways of selecting the number of clusters K.

Utilize domain knowledge or any other insight about the data. For instance, if we want to cluster flowers and we know that our data contain exactly 3 species, then K = 3. Another example: when clustering cars sold last year, K could be the number of car manufacturers on the market.

Run the algorithm several times for different values of K and examine the total within-cluster variance as a function of K. Since this variance always decreases as K grows (it reaches zero when every point is its own cluster), we do not simply take the minimum; instead we look for the "elbow", the value of K beyond which the decrease slows markedly.

Perform cross-validation and select the K that performs best on a hold-out dataset.
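The second option is often carried out as an "elbow" inspection: run K-means for a range of K, record the total within-cluster variance of the best restart for each K, and look for the point where the curve levels off. A minimal sketch, where the synthetic data and helper names are assumptions:

```python
import numpy as np

def fit_kmeans(X, K, seed=0, max_iter=100):
    """Run a minimal K-means once; return its total within-cluster variance."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return float(np.sum((X - centroids[labels]) ** 2))

# Synthetic data with three well-separated blobs: the elbow should sit at K = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

# Best of several random restarts for each K (see Section 4 on local optima)
variances = {K: min(fit_kmeans(X, K, seed=s) for s in range(5)) for K in range(1, 7)}
for K, w in variances.items():
    print(K, round(w, 1))
```

The printed variance drops sharply up to K = 3 and only marginally afterwards, which is the elbow we are looking for.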