Clustering with multiple features and the optimal k value in k-means clustering

Screen Link: https://app.dataquest.io/m/40/k-means-clustering/17/conclusion

  1. This screen mentions the following (highlighted text in screenshot):


    Can you please tell me where the above topics are covered in the Data Science path?

  2. Also, how is the optimal value of k determined when using the k-means clustering algorithm on data about which we have no prior idea? (Please consider a scenario with more than 2 features, where plotting the data will not be possible.)

  3. How do we know that k-means has converged for a scenario with more than 2 features?

For determining the optimal value of K (basically, the number of clusters), I have used two methods.

The first one is the Elbow method, in which we calculate the within-cluster sum of squares (WCSS) for each cluster and sum them up. It measures the compactness of the clusters, which we want to be as small as possible. We then plot this value for different values of K and take the K value at the bend point of the plot. Something like this below:
[plot of WCSS against the number of clusters K, with the bend ("elbow") marking the suggested K]
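If it helps, here is a minimal sketch of the Elbow method with scikit-learn, where the WCSS of a fitted model is available as the `inertia_` attribute. The `make_blobs` data is just a synthetic stand-in for your own feature matrix (any number of features works).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for your feature matrix (5 features here).
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=42)

# Fit k-means for a range of K and record the WCSS (sklearn calls it inertia_).
wcss = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    wcss.append(model.inertia_)

# Plot WCSS against K and look for the bend ("elbow") in the curve.
plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Within-cluster sum of squares (WCSS)")
plt.title("Elbow method")
plt.show()
```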

The second is the Silhouette score (which varies between -1 and 1); a score closer to 1 means a better-defined cluster. You can learn more about these methods on the internet.
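A similar sketch for the Silhouette score, using `silhouette_score` from scikit-learn on the same kind of synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=42)

# The Silhouette score needs at least 2 clusters, so start the search at k=2.
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)  # ranges from -1 to 1; higher is better
    print(f"k={k}: silhouette score = {score:.3f}")
```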

The Elbow method can be used with a dataset with any number of features. The Silhouette score also works with any number of features, but it requires at least 2 clusters; the reason lies in how the score is calculated.

In the K-means algorithm, you randomly select K points from the dataset as the initial cluster centroids, and then you assign every other point to its nearest centroid according to the Euclidean distance. You calculate the Euclidean distance between two points with n features (i.e., n dimensions) in exactly the same way as you do in 2 dimensions.
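For example, a small NumPy sketch of that distance calculation; the same formula applies whether the points have 2 features or n:

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two points with any number of features."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

# Works the same in 2 dimensions...
print(euclidean_distance([1, 2], [4, 6]))                     # 5.0
# ...and in 5 dimensions (or any n).
print(euclidean_distance([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))   # ~2.236
```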

I hope this helps.


@Prem - Thanks for your response. Can you throw some light on the 1st question as well? Are the topics actually covered somewhere in the path and I am missing them, or is it a wrong statement?


  1. I am also not clear on the convergence part. What I mean to ask is: how many iterations do we run for a particular number of clusters before stopping? In the case of many variables (more than 2), we will not be able to plot the centroids to see whether they are still changing much. How will we know when to stop in that case?

  2. Can you please explain how the WCSS and Silhouette scores are calculated? This seems like something that should have been in the course content but has been missed out.

Hi @vinayak.naik87

Sorry for the late response!

For your first question, I can't see any other clustering algorithm covered after the k-means missions in the course.

There is no fixed rule for determining the number of iterations in the algorithm; you can check how well the model performs and then increase or decrease the number of iterations accordingly. You can see the cluster centroids using the cluster_centers_ attribute of sklearn's KMeans.
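As a rough sketch of what I mean (the attribute names are from scikit-learn's KMeans): `max_iter` caps the iterations, `tol` makes the algorithm stop early once the centroids barely move, and `n_iter_` tells you how many iterations were actually run, so you don't need to plot the centroids to know it has converged, even with many features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for a dataset with more than 2 features.
X, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=42)

# max_iter caps the iterations; tol stops early once centroid movement is tiny.
model = KMeans(n_clusters=4, n_init=10, max_iter=300, tol=1e-4, random_state=42)
model.fit(X)

print(model.n_iter_)           # iterations actually run before convergence
print(model.cluster_centers_)  # final centroid coordinates (one row per cluster)
print(model.inertia_)          # WCSS for the fitted clustering
```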

Looking at this article may help you get an idea about the Elbow method. Check this one for the Silhouette score.

Let me know if you need more help.