K Means parameters and results (in R Studio) explained

My visualization of the concept of clustering

Preliminary note: This post is the result of personal study on the subject. I studied computer science years ago at the Politecnico di Milano, but data, especially business intelligence, became my profession. Data science has been the next step, even if my background is more in programming and IT. I realise that there’s a lot to learn in this field, so I tried to write with a beginner/student approach. Any suggestion for improvement is appreciated.

The objective of this article is to explain the results that the K Means algorithm returns when you run it on your data in RStudio. I think most of my comments apply to whatever tool you use for K Means.

  • First of all, you use the K Means algorithm to try to segment data into a number of clusters.
The data set I used: a very simple CSV file with random values (in the range 0..99), organized in rows (observations) and columns (features). I skip the part on how to read it, as I assume this is an easy task.
  • Because you have to provide the number of clusters as input, you can plot the data in advance with any plot function to check whether the clusters are visible in some way, so that you can make an educated guess for k. Alternatively, you can start from a number (3, for example) that you think could be a plausible outcome.
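As a minimal sketch of this preliminary step, the R snippet below builds a stand-in data set and plots it. Everything about the data here is an assumption: 85 rows (the total of the cluster sizes reported later in this post) and two hypothetical columns named x and y with random values in 0..99. The real CSV file, its name and its columns are not shown in this post.

```r
# Assumption: synthetic stand-in for the CSV file used in the post.
# With a real file you would load it instead, for example with
# read.csv() and a file name of your own.
set.seed(42)  # fixed seed so the random data is reproducible

a <- data.frame(
  x = sample(0:99, 85, replace = TRUE),  # hypothetical feature 1
  y = sample(0:99, 85, replace = TRUE)   # hypothetical feature 2
)

# Rows are observations, columns are features
str(a)

# Scatter plot to check by eye whether any groups are visible
plot(a$x, a$y, xlab = "x", ylab = "y", main = "Data before clustering")
```

With purely random data like this, no clear groups should be expected in the plot, which is exactly the situation where you fall back on a guess for k.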

Using K Means is straightforward. You can check the RStudio help, but basically the command is a single line that takes ‘a’ (my variable holding the data set), 3 as the number of clusters, and the two parameters iter.max and nstart, which I explain later.

  • The result of the algorithm goes into the variable ‘cl’. I chose that name, but you can use whatever you prefer.
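A call of this shape could look like the sketch below. The names ‘a’ and ‘cl’ and the 3 clusters come from the text above; the values iter.max = 10 and nstart = 25 are my assumptions for illustration, since the original values are not given here, and the data set is again a synthetic stand-in.

```r
# Assumption: synthetic stand-in for the data set 'a' (85 rows, 2 features)
set.seed(42)
a <- data.frame(
  x = sample(0:99, 85, replace = TRUE),
  y = sample(0:99, 85, replace = TRUE)
)

# K Means with 3 clusters; iter.max and nstart values are illustrative only
cl <- kmeans(a, centers = 3, iter.max = 10, nstart = 25)

# Printing the result shows the cluster sizes, the cluster means,
# the clustering vector and the within-cluster sums of squares
print(cl)
```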

Let’s focus now on the iter.max and nstart parameters, which look easy to understand but require some attention:

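  • iter.max is the maximum number of iterations the algorithm is allowed to perform before results are returned. Do not forget that the algorithm works by searching for a minimum of a cost function through several iteration steps, and it stops earlier if it converges. Setting iter.max=1 will run, but it makes little sense, because in most cases 1 iteration is not enough to converge.
  • nstart is the number of random sets of initial centers that are tried. The algorithm is run once for each of them and the best result (the one with the lowest cost) is kept, which makes the outcome less dependent on a single unlucky starting point.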
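To see the effect of the two parameters, a small experiment like the following can help. It is only a sketch under my own assumptions: the data is a synthetic stand-in, and the specific values (1 versus 50 iterations, 1 versus 25 starts) are chosen just to make the contrast visible.

```r
# Assumption: synthetic stand-in for the data set 'a' (85 rows, 2 features)
set.seed(42)
a <- data.frame(
  x = sample(0:99, 85, replace = TRUE),
  y = sample(0:99, 85, replace = TRUE)
)

# One iteration, one random start: R may warn that it did not converge,
# so the warning is silenced here to keep the output readable
one_iter <- suppressWarnings(
  kmeans(a, centers = 3, iter.max = 1, nstart = 1)
)

# More iterations allowed and 25 random starts; the best run is kept
many_iter <- kmeans(a, centers = 3, iter.max = 50, nstart = 25)

# Number of iterations reported by each result
one_iter$iter
many_iter$iter

# Cost function value (total within-cluster sum of squares): lower is better
one_iter$tot.withinss
many_iter$tot.withinss
```

The second call usually reaches a cost that is equal or lower, but this is a tendency of trying more starts and allowing more iterations, not a guarantee for every data set.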

If you did things right, RStudio should print a summary of the clustering result.

  • Let’s go through the meaning of each part, because the goal here is to understand the output rather than to visualize it.
Sample output of the K Means algorithm run in RStudio. The objective here is not to show the results visually but to explain the terms the algorithm returns.
  • Title (in red): the output says we have our 3 clusters, containing 28, 18 and 39 observations respectively. This is a quick way to check how evenly the data is distributed across the clusters. So it’s a good start.
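The same information can also be read directly from the result object instead of the printed summary. The sketch below assumes a synthetic stand-in data set and illustrative parameter values, so the sizes it prints will differ from the 28, 18 and 39 of my run.

```r
# Assumption: synthetic stand-in for the data set 'a' (85 rows, 2 features)
set.seed(42)
a <- data.frame(
  x = sample(0:99, 85, replace = TRUE),
  y = sample(0:99, 85, replace = TRUE)
)

# Illustrative parameter values, not the ones from the original run
cl <- kmeans(a, centers = 3, iter.max = 10, nstart = 25)

cl$size            # number of observations in each cluster
sum(cl$size)       # adds up to the number of rows in 'a'
table(cl$cluster)  # same counts, derived from the per-row cluster labels
cl$centers         # one row per cluster: the mean of each feature
```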

Well, this completes the analysis of what RStudio outputs when we run K Means.

It is just the beginning of a journey, but it’s a start. This is my first post on this complex subject and I hope you found it interesting.

IT Senior Manager and Consultant. Data Warehouse and Business Intelligence expertise in design and build. Freelance.
