Cluster profile validation — cluster

This function takes multiple random samples from the provided data frame, estimate cluster membership, and calculate the mean (the default, a different statistic can be specified using the summary_fun parameter, e.g. median may be appropriate) for each cluster using both the in sample and out-of-bag (i.e. out of sample). For the out-of-bag sample

This function will re-assign cluster labels such that the mean absolute difference to the the cluster labels from the full dataset is as small as possible.

Usage

cluster_validation(
  df,
  n_clusters = 2,
  cluster_fun = stats::kmeans,
  get_cluster_fun = function(x) {
x$cluster
 },
  oob_predict_fun = function(fit, newdata) {
     predict(fit, newdata = newdata)
 },
  summary_fun = mean,
  n_samples = 100,
  sample_size = 0.5 * nrow(df),
  replace = FALSE,
  standardize = TRUE,
  seed,
  verbose = interactive(),
  ...
)

# S3 method for class 'clustervalidation'
plot(
  x,
  plot_complete = TRUE,
  plot_in_sample = TRUE,
  plot_oob_sample = TRUE,
  point_alpha = 0.1,
  point_size = 1,
  line_alpha = 0.1,
  line_width = 1,
  complete_color = "blue",
  complete_size = 1,
  complete_point_size = 2,
  xlab = "",
  ylab = ifelse(attr(x, "standardize"), "Mean Standard Score", "Mean Score"),
  ...
)

# S3 method for class 'clustervalidation'
print(x, ...)

# S3 method for class 'clustervalidation'
summary(object, in_sample = FALSE, oob_sample = TRUE, ...)

# S3 method for class 'cv_summary'
plot(x, ...)

fix_cluster_labels(
  cv,
  greedy = (length(unique(cv$oob_sample$cluster)) > 6),
  var,
  ...
)

plot_distributions(
  cv,
  plot_in_sample = TRUE,
  plot_oob_sample = FALSE,
  nrow = NULL,
  palette = 2,
  ...
)

Arguments

df: data frame to estimate clusters. Note that all columns will be used in the estimation.
n_clusters: the number of clusters to estimate.
cluster_fun: the function used to estimate the clusters.
get_cluster_fun: the function used to get the cluster classes. This function takes one parameter, the result of cluster_fun.
oob_predict_fun: the function used to get predictions from the out-of-bag sample. Function takes two parameters, the first is the results of cluster_fun, the second is the out-of-bag sample data frame.
summary_fun: the function used to calculate the statistic for each cluster and iteration. Defaults to mean.
n_samples: the number of random samples to draw.
sample_size: the size of each random sample. Defaults to 50% of observations.
replace: whether sampling should be done with replacement.
standardize: whether the variables should be standardized before estimating clusters.
seed: random number seed. Note that the seed is set before each iteration to seed + i where i is the iteration number.
verbose: whether the function should print the status while running.
...: currently not used.
x: result of summary.clustervalidation()
plot_complete: whether the profile line using the complete data set should be plotted.
plot_in_sample: whether to plot the in sample (i.e. bootstrap) distributions.
plot_oob_sample: whether to plot the out-of-bag samples.
point_alpha: the alpha (transparency) level for points.
point_size: the size of the points.
line_alpha: the alpha (transparency) level for lines.
line_width: width of the lines.
complete_color: the color of the path for the path using the complete dataset (i.e. plot_complete = TRUE)
complete_size: the size of the path for the path using the complete dataset (i.e. plot_complete = TRUE)
complete_point_size: the point size of the path for the path using the complete dataset (i.e. plot_complete = TRUE)
xlab: label for the x-axis.
ylab: label for the y-axis.
object: the results of cluster_validation.
in_sample: plot the in sample results.
oob_sample: plot the out-of-bag sample results.
cv: the results from cluster_validation().
greedy: if FALSE this will find the optimal cluster pattern by calculating difference for all combinations. If TRUE this will consider only one variable (the one with the largest variance across means).
var: the variable to use if greedy = TRUE.
nrow: number of rows. This is ignored if plot_in_sample = TRUE and plot_oob_sample = TRUE.
palette: If a string, will use that named palette. If a number, will index into the list of palettes of appropriate type. See ggplot2::scale_color_brewer() for more information.

Value

a list with the following elements:

complete_sample: data frame of results using the entire data set.
in_sample: data frame of in sample results.
oob_sample: data frame of out-of-bag results.
complete_model_fit: model fit for the full data set.
in_sample_model_fits: model fits for each sample.

Each of these data frames contain four columns:

iter: the iteration
cluster: the cluster
variable: the variable
value: the mean value of variable for the given cluster and interation

a ggplot2 expression.

the cv object with the cluster labels in the oob_sample data frame reassigned so that the labels match across all iterations as best as possible.

a ggplot2 expression.

Details

The number of rows in the resulting data frames will be equal to: ncol(df) * n_samples * n_clusters.