
This function trains a collection of models based on the pattern of missingness. Each observation is used in the model with the most predictor variables available for that observation. For example, consider a data.frame (shown under Details) where `Y` is the dependent variable, `x` represents an observed value, and `NA` indicates a missing value.

Usage

medley(
  formula,
  data,
  method = glm,
  var_sets = get_variable_sets(data = data, formula = formula, min_set_size =
    min_set_size),
  min_set_size = 0.1,
  exclusive_membership = TRUE,
  ...
)

# S3 method for class 'medley'
summary(object, ...)

# S3 method for class 'medley'
print(x, ...)

# S3 method for class 'medley'
predict(object, newdata, ...)

# S3 method for class 'medley'
fit(object, ...)

Arguments

formula

a formula with all possible predictor variables to be considered.

data

data.frame used to estimate the models.

method

the function used to train the models (e.g. glm, randomForest).

var_sets

a list of formulas to use for the predictive models.

min_set_size

the minimum set size, given as a proportion (the default is 0.1), required to include a variable set as a model.

exclusive_membership

whether an observation should be used only in the model for which the most predictor variables are available. If `FALSE` then observations may be used in training more than one model. This is experimental.

...

other parameters passed to the `fit()` function.

object

the results from `medley`.

x

the results of `medley`.

newdata

(optional) a new data.frame to get predictions for.

Value

an object with the following elements:

n_models

the number of models trained.

formulas

the list of formulas used to train the models.

models

list of objects returned from the training method.

data

the data.frame used to train the models.

model_observations

a data.frame that specifies which observations are used for which model(s).

`predict()` returns a vector of predictions.

`fit()` returns a vector of fitted values.

Details

ID  Y  Var1  Var2  Var3
1   x  x     x     x
2   x  x     x     NA
3   x  x     NA    NA

We can train three different models:

* Model 1: Y ~ Var1 + Var2 + Var3
* Model 2: Y ~ Var1 + Var2
* Model 3: Y ~ Var1

The model each observation is used in is determined by finding the model with the most predictor variables for which that row has values. In the example above, row 1 would be used with model 1, row 2 would be used with model 2, and row 3 would be used with model 3.
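The assignment rule can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the simulated data.frame `toy` and the helper objects `usable`, `n_pred`, and `assigned` are hypothetical names, and `glm` with a Gaussian family stands in for whatever `method` is supplied.

```r
set.seed(42)
n <- 90

# Hypothetical toy data with three predictors and a continuous outcome
toy <- data.frame(Var1 = rnorm(n), Var2 = rnorm(n), Var3 = rnorm(n))
toy$Y <- 1 + toy$Var1 + 0.5 * toy$Var2 - 0.5 * toy$Var3 + rnorm(n)

# Impose the three missingness patterns from the example table
toy$Var3[31:90] <- NA
toy$Var2[61:90] <- NA

formulas <- list(
  Y ~ Var1 + Var2 + Var3,
  Y ~ Var1 + Var2,
  Y ~ Var1
)

# One column per formula: TRUE where the row has every variable in that formula
usable <- sapply(formulas, function(f) {
  complete.cases(toy[, all.vars(f), drop = FALSE])
})

# Number of predictors in each formula (all variables minus the response)
n_pred <- sapply(formulas, function(f) length(all.vars(f)) - 1)

# For each row, choose the usable formula with the most predictors
assigned <- apply(usable, 1, function(u) {
  if (!any(u)) return(NA_integer_)
  which(u)[which.max(n_pred[u])]
})

# Fit one model per formula, using only the rows assigned to it
models <- lapply(seq_along(formulas), function(i) {
  glm(formulas[[i]], data = toy[which(assigned == i), ], family = gaussian())
})

table(assigned)
```

Each row ends up in exactly one model, which mirrors the default `exclusive_membership = TRUE` behavior described here.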

If `exclusive_membership = FALSE` then row 1 would be used in all 3 models and row 2 would be used in models 2 and 3. I recommend using this parameter with caution because model assumptions, especially independence, are not confirmed.
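To make the non-exclusive case concrete, the sketch below builds a membership matrix for the three-row example in base R. It is an illustration only: the data values, the data.frame `toy`, and the `membership` matrix are hypothetical, and the package's own `model_observations` element may be structured differently.

```r
# Hypothetical values that follow the missingness pattern in the example table
toy <- data.frame(
  ID   = 1:3,
  Y    = c(1, 0, 1),
  Var1 = c(0.2, 1.5, -0.7),
  Var2 = c(3.1, 2.4, NA),
  Var3 = c(10, NA, NA)
)

formulas <- list(
  model1 = Y ~ Var1 + Var2 + Var3,
  model2 = Y ~ Var1 + Var2,
  model3 = Y ~ Var1
)

# TRUE where a row has complete data for every variable in the formula;
# with non-exclusive membership, a row can be TRUE in several columns
membership <- sapply(formulas, function(f) {
  complete.cases(toy[, all.vars(f), drop = FALSE])
})
rownames(membership) <- toy$ID

membership
rowSums(membership)  # number of models each row would contribute to
```

Because row 1 contributes to all three models, the models no longer rest on disjoint sets of observations, which is the independence concern noted above.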

Examples

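library(medley)
# Derive the list of formulas (variable sets) to use for the models
formulas <- medley::get_variable_sets(daacs, retained ~ .)
# Train one model per formula
medley_out <- medley(data = daacs, formula = retained ~ ., var_sets = formulas)
# Get predictions without supplying newdata
predicted_values <- predict(medley_out)
#> Warning: Predictions will be returned from the first model only.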