Cluster profile validation
Source:R/cluster_validation.R
, R/cluster_validation_plot.R
, R/cluster_validation_print.R
, and 3 more
cluster_validation.Rd
This function takes multiple random samples from the provided data frame,
estimate cluster membership, and calculate the mean (the default, a different
statistic can be specified using the summary_fun
parameter, e.g. median
may be appropriate) for each cluster using both the in sample and out-of-bag
(i.e. out of sample). For the out-of-bag sample
This function will re-assign cluster labels such that the mean absolute difference to the the cluster labels from the full dataset is as small as possible.
Usage
cluster_validation(
df,
n_clusters = 2,
cluster_fun = stats::kmeans,
get_cluster_fun = function(x) {
x$cluster
},
oob_predict_fun = function(fit, newdata) {
predict(fit, newdata = newdata)
},
summary_fun = mean,
n_samples = 100,
sample_size = 0.5 * nrow(df),
replace = FALSE,
standardize = TRUE,
seed,
verbose = interactive(),
...
)
# S3 method for class 'clustervalidation'
plot(
x,
plot_complete = TRUE,
plot_in_sample = TRUE,
plot_oob_sample = TRUE,
point_alpha = 0.1,
point_size = 1,
line_alpha = 0.1,
line_width = 1,
complete_color = "blue",
complete_size = 1,
complete_point_size = 2,
xlab = "",
ylab = ifelse(attr(x, "standardize"), "Mean Standard Score", "Mean Score"),
...
)
# S3 method for class 'clustervalidation'
print(x, ...)
# S3 method for class 'clustervalidation'
summary(object, in_sample = FALSE, oob_sample = TRUE, ...)
# S3 method for class 'cv_summary'
plot(x, ...)
fix_cluster_labels(
cv,
greedy = (length(unique(cv$oob_sample$cluster)) > 7),
var,
...
)
plot_distributions(
cv,
plot_in_sample = TRUE,
plot_oob_sample = FALSE,
nrow = NULL,
palette = 2,
...
)
Arguments
- df
data frame to estimate clusters. Note that all columns will be used in the estimation.
- n_clusters
the number of clusters to estimate.
- cluster_fun
the function used to estimate the clusters.
- get_cluster_fun
the function used to get the cluster classes. This function takes one parameter, the result of
cluster_fun
.- oob_predict_fun
the function used to get predictions from the out-of-bag sample. Function takes two parameters, the first is the results of
cluster_fun
, the second is the out-of-bag sample data frame.- summary_fun
the function used to calculate the statistic for each cluster and iteration. Defaults to
mean
.- n_samples
the number of random samples to draw.
- sample_size
the size of each random sample. Defaults to 50% of observations.
- replace
whether sampling should be done with replacement.
- standardize
whether the variables should be standardized before estimating clusters.
- seed
random number seed. Note that the seed is set before each iteration to
seed + i
wherei
is the iteration number.- verbose
whether the function should print the status while running.
- ...
currently not used.
- x
result of
summary.clustervalidation()
- plot_complete
whether the profile line using the complete data set should be plotted.
- plot_in_sample
whether to plot the in sample (i.e. bootstrap) distributions.
- plot_oob_sample
whether to plot the out-of-bag samples.
- point_alpha
the alpha (transparency) level for points.
- point_size
the size of the points.
- line_alpha
the alpha (transparency) level for lines.
- line_width
width of the lines.
- complete_color
the color of the path for the path using the complete dataset (i.e.
plot_complete = TRUE
)- complete_size
the size of the path for the path using the complete dataset (i.e.
plot_complete = TRUE
)- complete_point_size
the point size of the path for the path using the complete dataset (i.e.
plot_complete = TRUE
)- xlab
label for the x-axis.
- ylab
label for the y-axis.
- object
the results of
cluster_validation
.- in_sample
plot the in sample results.
- oob_sample
plot the out-of-bag sample results.
- cv
the results from
cluster_validation()
.- greedy
if
FALSE
this will find the optimal cluster pattern by calculating difference for all combinations. IfTRUE
this will consider only one variable (the one with the largest variance across means).- var
the variable to use if
greedy = TRUE
.- nrow
number of rows. This is ignored if
plot_in_sample = TRUE
andplot_oob_sample = TRUE
.- palette
If a string, will use that named palette. If a number, will index into the list of palettes of appropriate type. See
ggplot2::scale_color_brewer()
for more information.
Value
a list with the following elements:
- complete_sample
data frame of results using the entire data set.
- in_sample
data frame of in sample results.
- oob_sample
data frame of out-of-bag results.
- complete_model_fit
model fit for the full data set.
- in_sample_model_fits
model fits for each sample.
Each of these data frames contain four columns:
- iter
the iteration
- cluster
the cluster
- variable
the variable
- value
the mean value of
variable
for the given cluster and interation
a ggplot2 expression.
a ggplot2 expression.
the cv
object with the cluster labels in the oob_sample
data frame reassigned so
that the labels match across all iterations as best as possible.
a ggplot2
expression.