Train models using different combinations of predictor variables based upon missing data patterns.
Source:R/medley.R
medley.Rd
This function will train a collection of models based upon the pattern of missingness. Each observation will be used in the model with most dependent variables available. For example, consider the following data.frame where `y` is the dependent variable, `x` represents an observed value, and `NA` indicates a missing value:
Usage
medley(
formula,
data,
method = glm,
var_sets = get_variable_sets(data = data, formula = formula, min_set_size =
min_set_size),
min_set_size = 0.1,
exclusive_membership = TRUE,
...
)
# S3 method for class 'medley'
summary(object, ...)
# S3 method for class 'medley'
print(x, ...)
# S3 method for class 'medley'
predict(object, newdata, ...)
# S3 method for class 'medley'
fit(object, ...)
Arguments
- formula
with all possible predictor varaibles to be considered.
- data
data.frame used to estimate the models.
- method
the function used to train the models (e.g. glm, randomForest).
- var_sets
a list of formulas to use for the predictive models.
- min_set_size
the minimum set size as a percentage to incldue as a model.
- exclusive_membership
whether an observation should only be used only in the model for which the most predictor variables are available. If `FALSE` then observations may be used in training more than one model. This is experimental.
- ...
other parameters passed to the `fit()` function.
- object
the results from `medley`.
- x
the results of `medley`.
- newdata
(optional) a new data.frame to get predictions for.
Value
an object with the following elements:
- n_models
the number of models trained.
- formulas
the list of formulas used to train the models.
- models
list of objects returned from the training method.
- data
the data.frame used to train the models.
- model_observations
a data.frame that specifies which observations are used for which model(s).
a vector of predictions.
a vector of fitted values.
Details
“` ID Y Var1 Var2 Var3 1 x x x x 2 x x x NA 3 x x NA NA “`
We can train three different models:
* Model 1: Y ~ Var1 + Var2 + Var3 * Model 2: Y ~ Var1 + Var2 * Model 3: Y ~ Var1
When deciding what model each observation will be used in is determined by examining which model has the most dependent variables that row has values for. In the example above, row 1 would be used with model 1, row 2 would be used with model 2, and row 3 would be used with model 3.
If `exclusive_membership = FALSE` then row 1 would be used in all 3 models and row 2 would be used in models 2 and 3. I do recommend using this parameter with caution as model assumptions are not confirmed, especially independence.
Examples
formulas <- medley::get_variable_sets(daacs, retained ~ .)
medley_out <- medley(data = daacs, formula = retained ~ ., var_sets = formulas)
predicted_values <- predict(medley_out)
#> Warning: Predictions will be returned from the first model only.