Today I'm trying to learn something about K-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the right k, but I do not understand how to use it with scikit-learn. In scikit-learn, I'm clustering things this way:

kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
kmeans.fit(data)

So should I do this several times for n_clusters = 1...n and watch the error rate to get the right k? I think this would be stupid and would take a lot of time.

1 Answer

Best answer
In your case, k-means clustering can be evaluated with the elbow criterion, since the true labels are not known in advance.

Elbow criterion method: the idea behind the elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g. k = 1 to 10) and, for each value of k, calculate the sum of squared errors (SSE). Then plot a line graph of the SSE for each value of k. If the line graph looks like an arm, the "elbow" on the arm is the optimal value of k (the number of clusters).

K-means minimizes SSE. SSE tends to decrease toward 0 as we increase k, and SSE is 0 when k equals the number of data points in the dataset, because then each data point is its own cluster and there is no error between it and its cluster center. So the goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns.

For example (assuming data is a pandas DataFrame of features):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    data["clusters"] = kmeans.labels_
    print(data["clusters"])
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center

plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

[Plot for the above code: line graph of SSE versus number of clusters, with the elbow at k = 3 circled in red.]

We can see in the above plot that 3 is the optimal number of clusters for this dataset, which is indeed correct.

Hope this answer helps.
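For completeness, here is a minimal, self-contained sketch of the same elbow procedure that you can run as-is. It assumes synthetic three-cluster data generated with sklearn.datasets.make_blobs (an illustrative assumption, not your dataset), and reuses the k-means++ / n_init=10 settings from your question:

# Minimal elbow-method sketch on synthetic data (assumption: make_blobs stands in for the real dataset).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 well-separated clusters, purely for illustration.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

sse = {}
for k in range(1, 11):
    # init='k-means++' and n_init=10 mirror the settings from the question.
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42).fit(X)
    sse[k] = kmeans.inertia_  # sum of squared distances to the closest cluster center

plt.plot(list(sse.keys()), list(sse.values()), marker='o')
plt.xlabel("Number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()

With this toy data the SSE curve drops steeply up to k = 3 and flattens afterwards, which is where you would read off the elbow.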
