
Calculates several metrics used to determine the optimal number of clusters. The result is a data frame with one row per number of clusters (k) and one column per requested metric. An S3 plot method is provided.

Usage

optimal_clusters(
  df,
  max_k = 9,
  cluster_fun = stats::kmeans,
  wss = TRUE,
  silhoutte = TRUE,
  gap = TRUE,
  calinski_harabasz = TRUE,
  davies_bouldin = TRUE,
  rand_index = TRUE
)

# S3 method for class 'optimalclusters'
plot(x, ...)

wss(df, k = 9, cluster_fun = stats::kmeans)

silhouette_score(df, k = 9, cluster_fun = stats::kmeans, ...)

calinski_harabasz(df, k = 9, cluster_fun = stats::kmeans, ...)

davies_bouldin(df, k = 9, cluster_fun = stats::kmeans, ...)

rand_index(df, k = 9, cluster_fun = stats::kmeans, ...)

Arguments

df

data frame to cluster and calculate the metrics from.

max_k

maximum number of clusters to estimate.

cluster_fun

clustering function.

wss

whether to calculate the within-cluster sum of squares.

silhoutte

whether to calculate the silhouette index.

gap

whether to calculate the Gap statistic.

calinski_harabasz

whether to calculate the Calinski-Harabasz index.

davies_bouldin

whether to calculate the Davies-Bouldin index.

rand_index

whether to calculate the Rand index.

x

the result of optimal_clusters()

...

other parameters passed to fossil::rand.index().

k

number of clusters.

Value

a data frame with various metrics used to determine the optimal number of clusters. Each row corresponds to k ranging from 1 to max_k and has the following columns (if the corresponding metric was requested):

k

Number of clusters

wss

Within-cluster sum of squares.

silhoutte

Silhouette analysis measures the quality of clustering and provides an insight into the separation distance between the resulting clusters. A higher silhouette score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

gap

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The optimal k is the value that yields the largest gap statistic.

calinski_harabasz

The Calinski-Harabasz index evaluates clusters based on their compactness and separation. It is calculated as the ratio of between-cluster variance to within-cluster variance, with higher values indicating better-defined clusters.

davies_bouldin

The Davies-Bouldin index is a measure of how much separation there is between clusters. Lower values of the Davies-Bouldin index indicate a model with better separation.

rand_index

The Rand index is the proportion of agreement between two clusterings. It counts the pairs of elements that are assigned to the same cluster in both clusterings as well as the pairs that are assigned to different clusters in both. Higher values indicate greater similarity and better clustering quality.
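The pair-counting idea behind the Rand index can be sketched in base R. This is only an illustration of the textbook definition: `rand_index_sketch` is a hypothetical helper, not the package's `rand_index()`, and which two clusterings the package compares is not stated on this page.

```r
# Rand index between two label vectors, written from the textbook definition.
# `rand_index_sketch` is a hypothetical helper, not the package's rand_index().
rand_index_sketch <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  # For every pair (i, j): are the two elements in the same cluster?
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  # A pair agrees when both clusterings group it together or both separate it.
  agree <- same_a == same_b
  pairs <- upper.tri(agree)  # count each unordered pair once
  sum(agree[pairs]) / sum(pairs)
}

labels_1 <- c(1, 1, 2, 2, 3, 3)
labels_2 <- c(2, 2, 1, 1, 3, 3)  # same partition, different label names
labels_3 <- c(1, 1, 1, 2, 2, 2)  # a coarser, different partition

ri_same <- rand_index_sketch(labels_1, labels_2)
ri_diff <- rand_index_sketch(labels_1, labels_3)
```

Because only pair membership is compared, relabelling the clusters does not change the result, so `ri_same` reaches the maximum of 1 while `ri_diff` is lower.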

For the plot method: a ggplot2 expression.
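As a rough illustration of the kind of table described under Value, the sketch below builds a small k-by-metric data frame with `stats::kmeans` on the built-in `iris` data. The column names mirror the ones listed above, but the computations are standard textbook formulas and only an assumption about what the package does internally. `davies_bouldin_sketch` is a hypothetical helper.

```r
# Hypothetical helper: Davies-Bouldin index from a kmeans fit (textbook formula).
davies_bouldin_sketch <- function(x, fit) {
  k <- nrow(fit$centers)
  if (k < 2) return(NA_real_)  # undefined for a single cluster
  # Mean distance of each cluster's points to its own centroid.
  scatter <- sapply(seq_len(k), function(i) {
    pts <- x[fit$cluster == i, , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, fit$centers[i, ])^2)))
  })
  centroid_dist <- as.matrix(dist(fit$centers))
  ratio <- outer(scatter, scatter, "+") / centroid_dist
  diag(ratio) <- -Inf  # ignore a cluster's comparison with itself
  mean(apply(ratio, 1, max))
}

set.seed(42)
df <- scale(iris[, 1:4])  # numeric columns only, standardized
n <- nrow(df)
max_k <- 6

metrics <- do.call(rbind, lapply(seq_len(max_k), function(k) {
  fit <- stats::kmeans(df, centers = k, nstart = 10)
  # Calinski-Harabasz: between-cluster variance over within-cluster variance.
  # It is undefined for k = 1, so NA is reported there.
  ch <- if (k > 1) {
    (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
  } else {
    NA_real_
  }
  data.frame(
    k = k,
    wss = fit$tot.withinss,
    calinski_harabasz = ch,
    davies_bouldin = davies_bouldin_sketch(df, fit)
  )
}))

metrics
```

Reading such a table follows the descriptions above: look for an elbow in `wss`, a high `calinski_harabasz`, and a low `davies_bouldin`.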