Utility function used to determine the optimal number of clusters.
Source: R/optimal_clusters.R, R/optimal_clusters_plot.R, R/utils.R
fitmeasures.Rd
This function will calculate a number of metrics used to determine the optimal number of clusters. The result will be a data frame where each row corresponds to the number of clusters (k) and the columns are the various metrics. An S3 plotting function is provided.
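The shape of the result can be illustrated with a minimal base R sketch. This is not the package's implementation: it only builds a data frame with one row per k and a single `wss` column, using `stats::kmeans` on the built-in `iris` data. The choice of data set, the scaling step, and `nstart = 10` are assumptions made for illustration.

```r
# Sketch only: base R, not the package's implementation.
set.seed(42)                      # kmeans uses random starting centers
df <- scale(iris[, 1:4])          # numeric columns only, standardized
max_k <- 9

# One row per k; tot.withinss is the total within-cluster sum of squares.
result <- data.frame(
  k   = seq_len(max_k),
  wss = vapply(
    seq_len(max_k),
    function(k) stats::kmeans(df, centers = k, nstart = 10)$tot.withinss,
    numeric(1)
  )
)
result
```

Plotting `wss` against `k` gives the familiar elbow curve; the other metrics would appear as additional columns in the same layout.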
Usage
optimal_clusters(
  df,
  max_k = 9,
  cluster_fun = stats::kmeans,
  wss = TRUE,
  silhoutte = TRUE,
  gap = TRUE,
  calinski_harabasz = TRUE,
  davies_bouldin = TRUE,
  rand_index = TRUE
)
# S3 method for class 'optimalclusters'
plot(x, ...)
wss(df, k = 9, cluster_fun = stats::kmeans)
silhouette_score(df, k = 9, cluster_fun = stats::kmeans, ...)
calinski_harabasz(df, k = 9, cluster_fun = stats::kmeans, ...)
davies_bouldin(df, k = 9, cluster_fun = stats::kmeans, ...)
rand_index(df, k = 9, cluster_fun = stats::kmeans, ...)
Arguments
- df
data frame to cluster and calculate the metrics from.
- max_k
maximum number of clusters to estimate.
- cluster_fun
clustering function.
- wss
whether to calculate the within sum of squares.
- silhoutte
whether to calculate the Silhouette index.
- gap
whether to calculate the Gap statistic.
- calinski_harabasz
whether to calculate the Calinski-Harabasz index.
- davies_bouldin
whether to calculate the Davies-Bouldin index.
- rand_index
whether to calculate the Rand index.
- x
the result of optimal_clusters().
- ...
other parameters passed to fossil::rand.index().
- k
number of clusters.
Value
a data frame with various metrics used to determine the optimal number of clusters. Each row corresponds to k ranging from 1 to max_k and has the following columns (if the corresponding metric was requested):
- k
Number of clusters
- wss
Within sum of squares
- silhoutte
Silhouette analysis measures the quality of clustering and provides an insight into the separation distance between the resulting clusters. A higher silhouette score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
- gap
The Gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The optimal k is the value that yields the largest Gap statistic.
- calinski_harabasz
The Calinski-Harabasz index evaluates clusters based on their compactness and separation. It is calculated as the ratio of between-cluster variance to within-cluster variance; higher values indicate better-defined clusters.
- davies_bouldin
The Davies-Bouldin index is a measure of how much separation there is between clusters. Lower values of the Davies-Bouldin index indicate a model with better separation.
- rand_index
The Rand index is the proportion of pairs of elements on which two clusterings agree, counting both the pairs assigned to the same cluster in both clusterings and the pairs assigned to different clusters in both. Higher values indicate greater similarity and better clustering quality.
The plot() method returns a ggplot2 expression.
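Two of the column definitions above can be written out directly in base R. The following is a hedged sketch rather than the package's code: the Calinski-Harabasz value is derived from the `betweenss` and `tot.withinss` components of a `stats::kmeans` fit, and `rand_index_sketch()` is a hypothetical helper (not part of the package or of `fossil`) that counts agreeing pairs. Which two clusterings the package actually compares is not stated here, so comparing two independent k-means runs is an assumption for illustration.

```r
# Sketch only: base R, not the package's implementation.
set.seed(42)
x <- scale(iris[, 1:4])
n <- nrow(x)
k <- 3
fit <- stats::kmeans(x, centers = k, nstart = 10)

# Calinski-Harabasz: between-cluster variance over within-cluster variance,
# each divided by its degrees of freedom. Undefined for k = 1.
ch <- (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))

# Rand index between two label vectors: the share of point pairs on which
# both partitions agree (together in both, or apart in both).
# Hypothetical helper name, chosen to avoid clashing with rand_index().
rand_index_sketch <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)      # each unordered pair counted once
  mean(same_a[pairs] == same_b[pairs])
}

# Assumption for illustration: compare the labels with a second k-means run.
fit2 <- stats::kmeans(x, centers = k, nstart = 10)
ri <- rand_index_sketch(fit$cluster, fit2$cluster)
```

Because the Rand index depends only on which points are grouped together, relabelling the clusters (for example swapping labels 1 and 2) does not change its value.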