Introduction
Most predictive modeling strategies require complete data for training and prediction. A common strategy is to impute the missing data before training the models. However, in many situations missing data occurs in a systematic way, violating the missing completely at random (MCAR) assumption. For instance, data may be collected over multiple time periods, or the available data may vary depending on differing collection protocols or selection bias at the individual level. The goal of this package is to provide a framework for conducting predictive modeling that takes the patterns of missing data into account.
The motivating example for this approach to predictive modeling comes from predicting student success in college. Institutions have become increasingly interested in using predictive models to identify students at risk of attrition as early as possible in order to provide targeted interventions and supports. Information about students is collected over several months, beginning at the college application phase, then commitment to enroll, orientation, and finally behavioral data as the student begins coursework. In addition to missing data being related to where the student is in this process, students may opt not to provide certain data elements. This results in a complex system where no single predictive model is appropriate.
Data Source
This paper will explore data from the Diagnostic Assessment and Achievement of College Skills (DAACS; https://daacs.net). DAACS is a suite of technological and social supports designed to optimize student learning. Students complete assessments in self-regulated learning, writing, mathematics, and reading, and upon completion receive immediate feedback in terms of developing, emerging, and mastering. The feedback is tailored to their results. The data for this paper were part of a larger randomized control trial where DAACS was embedded within orientation for the treatment students. Although students were instructed that orientation was required, there were no consequences for not completing it; as a result, approximately 43% of students did not attempt orientation and therefore did not complete DAACS. The goal is to predict term-to-term retention (the `retained` variable) for each student based upon the available data. Table @ref(tab:descriptives) provides descriptive statistics, which reveal that the DAACS results are not complete for some students. The overall retention rate is 56.17%.
Variable | N | Mean / % | Std Dev | Median |
---|---|---|---|---|
retained | 5154 | | | |
… No | 2259 | 44% | | |
… Yes | 2895 | 56% | | |
page_views | | 39 | 24 | 34 |
srl | | 2.8 | 0.43 | 2.8 |
math | | 0.58 | 0.2 | 0.61 |
reading | | 0.85 | 0.17 | 0.89 |
writing | | 0.78 | 0.16 | 0.83 |
income | | 4.4 | 2.4 | 4 |
employment | | 2.6 | 0.75 | 3 |
ell | | 0.93 | 0.26 | 1 |
ed_mother | | 3.3 | 2 | 3 |
ed_father | | 3.2 | 2 | 3 |
ethnicity | 5154 | | | |
… Black or African American | 1049 | 20% | | |
… Hispanic | 699 | 14% | | |
… Other | 548 | 11% | | |
… White | 2858 | 55% | | |
gender | 5154 | | | |
… FEMALE | 2123 | 41% | | |
… MALE | 3031 | 59% | | |
military | 5154 | | | |
… No | 2881 | 56% | | |
… Yes | 2273 | 44% | | |
age | | 35 | 9.3 | 34 |
Missing Data Patterns
To begin, we first explore the patterns of missing data. Figure @ref(fig:upset) is an UpSet plot created from a shadow matrix (a logical matrix indicating which values are observed). Each vertical line corresponds to a set, or combination, of variables; the dots indicate which variables are included in the set. The bars on the top correspond to the number of observations in each set and the bars to the right correspond to the total number of observed values for each variable. The largest set includes only the demographic variables; the second largest includes demographics and all DAACS variables. A third set, which includes self-regulated learning along with demographics, is worth considering since it contains more than 10% of the observations. It should be noted that there is a potential fourth set which includes the demographic variables along with three of the DAACS variables (SRL, reading, and math). However, since this set has fewer than 10% of the observations, we will use a three-set (three-model) solution.
# Shadow matrix: TRUE indicates an observed value, FALSE a missing one
shadow_matrix <- as.data.frame(!is.na(daacs))
ComplexHeatmap::make_comb_mat(shadow_matrix) |> ComplexHeatmap::UpSet()
Baseline Models
There are generally two choices when estimating models with missing data: (1) model using only the available data, or (2) impute the missing data before modeling.
Using available data
In the DAACS data set, as depicted in Figure @ref(fig:upset), the demographic variables were observed for all students. To start, we train logistic regression and random forest models on the training data using only the demographic variables.
lr_out <- glm(data = daacs_train,
formula = retained ~ income + employment + ell + ed_mother + ed_father +
ethnicity + gender + military + age,
family = binomial(link = 'logit'))
rf_out <- randomForest(formula = factor(retained) ~ income + employment + ell + ed_mother + ed_father +
ethnicity + gender + military + age,
data = daacs_train)
We can get predicted values from the validation data set and print the confusion matrices.
lr_predictions <- predict(lr_out, newdata = daacs_valid, type = 'response')
confusion_matrix(observed = daacs_valid$retained,
predicted = lr_predictions > 0.5)
#> predicted
#> observed FALSE TRUE
#> FALSE 295 (19.07%) 386 (24.95%)
#> TRUE 216 (13.96%) 650 (42.02%)
#> Accuracy: 61.09%
rf_predictions <- predict(rf_out, newdata = daacs_valid, type = 'response')
confusion_matrix(observed = daacs_valid$retained,
predicted = rf_predictions)
#> predicted
#> observed FALSE TRUE
#> FALSE 279 (18.03%) 402 (25.99%)
#> TRUE 251 (16.22%) 615 (39.75%)
#> Accuracy: 57.79%
The overall accuracy using only the demographic variables is 61.09% for logistic regression and 57.79% for random forest.
Multiple imputation
Another common approach to modeling with missing data is multiple imputation. The mice package (Multivariate Imputation by Chained Equations) is a robust and popular approach to imputing missing data. For simplicity we will use a single completed data set for comparison.
# Impute the missing values; column 1 (the retained outcome) is excluded
mice_out <- mice::mice(daacs[,-1], m = 5, seed = 2112, printFlag = FALSE)
daacs_complete <- cbind(retained = daacs$retained, mice::complete(mice_out))
daacs_train_complete <- daacs_complete[train_rows,]
daacs_valid_complete <- daacs_complete[-train_rows,]
With the missing DAACS data imputed, we can train logistic regression and random forest models using the full data set.
mice_lr_out <- glm(formula = retained ~ .,
data = daacs_train_complete,
family = binomial(link = 'logit'))
mice_lr_predictions <- predict(mice_lr_out, newdata = daacs_valid_complete, type = 'response')
confusion_matrix(observed = daacs_valid_complete$retained,
predicted = mice_lr_predictions > 0.5)
#> predicted
#> observed FALSE TRUE
#> FALSE 297 (19.20%) 384 (24.82%)
#> TRUE 213 (13.77%) 653 (42.21%)
#> Accuracy: 61.41%
mice_rf_out <- randomForest(formula = factor(retained) ~ .,
data = daacs_train_complete)
mice_rf_predictions <- predict(mice_rf_out, newdata = daacs_valid_complete, type = 'response')
confusion_matrix(observed = daacs_valid_complete$retained,
predicted = mice_rf_predictions)
#> predicted
#> observed FALSE TRUE
#> FALSE 278 (17.97%) 403 (26.05%)
#> TRUE 207 (13.38%) 659 (42.60%)
#> Accuracy: 60.57%
The overall accuracy using the imputed data set is 61.41% for logistic regression and 60.57% for random forest.
Medley models
The `medley_train` function implements a stepwise approach to training models. The `data` and `formula` parameters specify the data set and the full model (i.e. all possible predictor variables to be considered), similar to other modeling functions in R. The `method` parameter indicates which modeling procedure should be used. In this example we will estimate logistic regression models. The `medley_train` function can take any additional parameters that need to be passed to the `method` function.
medley_lr_out <- medley_train(data = daacs_train,
formula = retained ~ .,
method = glm,
family = binomial(link = 'logit'))
Table @ref(tab:model-summaries) provides the baseline retention rate by model along with the number of observations and formula for each model. Before exploring the specifics of the modeling, this reveals that the pattern of missing data is itself predictive of success: students who completed all four DAACS assessments were retained at a rate of 70.98%, compared with 43.20% for students who did not complete any of the assessments.
Model | n | Success (%) | Formula |
---|---|---|---|
1 | 1268 | 70.98 | retained ~ page_views + srl + math + reading + writing + income + employment + ell + ed_mother + ed_father + ethnicity + gender + military + age |
2 | 788 | 58.25 | retained ~ page_views + srl + income + employment + ell + ed_mother + ed_father + ethnicity + gender + military + age |
3 | 1551 | 43.20 | retained ~ income + employment + ell + ed_mother + ed_father + ethnicity + gender + military + age |
The object returned by `medley_train` contains the following elements:

- `n_models` - The number of models estimated.
- `formulas` - A list of the formulas used for each model.
- `models` - A list containing the model output for each model. In this example this contains the results of the `glm` function calls.
- `data` - The full data set used to train the models.
- `model_observations` - A data frame indicating which models each observation was used in. The rows correspond to the rows in `data` and the columns correspond to the models.
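For example, we can confirm the number of models estimated and inspect the formulas and fitted models directly (a brief illustration using the elements described above):

# Number of models estimated (three in this example)
medley_lr_out$n_models
# The formula used for each model
medley_lr_out$formulas
# The fitted glm object for the first (most complete) model
summary(medley_lr_out$models[[1]])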
By default the algorithm will use all sets that have at least 10% of the total observations (see Table @ref(tab:model-summaries)). This can be adjusted using the `min_set_size` parameter (see the `get_variable_sets()` function). Optionally, you can specify the models directly by passing a list of formulas with the `var_sets` parameter, as sketched below.
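For example, the three models in Table @ref(tab:model-summaries) could be specified explicitly. The following sketch passes the formulas from that table to the `var_sets` parameter:

# Specify the candidate models directly rather than deriving them
# from the missing data patterns
var_sets <- list(
  retained ~ page_views + srl + math + reading + writing + income +
    employment + ell + ed_mother + ed_father + ethnicity + gender +
    military + age,
  retained ~ page_views + srl + income + employment + ell + ed_mother +
    ed_father + ethnicity + gender + military + age,
  retained ~ income + employment + ell + ed_mother + ed_father +
    ethnicity + gender + military + age
)
medley_lr_custom <- medley_train(data = daacs_train,
                                 formula = retained ~ .,
                                 method = glm,
                                 var_sets = var_sets,
                                 family = binomial(link = 'logit'))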
Table @ref(tab:modelresults) provides the model summaries for the three models estimated.
Term | (1) | (SE) | (2) | (SE) | (3) | (SE) |
---|---|---|---|---|---|---|
(Intercept) | 0.574 *** | (0.144) | 0.533 ** | (0.175) | 0.435 *** | (0.085) |
page_views | -0.001 | (0.001) | -0.001 | (0.001) | ||
srl | -0.025 | (0.031) | -0.043 | (0.042) | ||
math | 0.085 | (0.075) | ||||
reading | -0.086 | (0.087) | ||||
writing | 0.031 | (0.083) | ||||
income | -0.000 | (0.006) | 0.010 | (0.008) | 0.014 * | (0.006) |
employment | 0.030 | (0.017) | -0.043 | (0.024) | -0.011 | (0.017) |
ell | 0.004 | (0.053) | 0.103 | (0.074) | 0.069 | (0.047) |
ed_mother | 0.011 | (0.009) | 0.000 | (0.012) | -0.008 | (0.009) |
ed_father | 0.002 | (0.008) | 0.021 | (0.012) | 0.006 | (0.008) |
ethnicityHispanic | -0.049 | (0.050) | 0.102 | (0.061) | 0.026 | (0.043) |
ethnicityOther | -0.003 | (0.051) | 0.052 | (0.067) | 0.057 | (0.046) |
ethnicityWhite | 0.009 | (0.039) | 0.082 | (0.045) | 0.015 | (0.030) |
genderMALE | 0.037 | (0.029) | -0.010 | (0.040) | 0.064 * | (0.027) |
militaryTRUE | 0.166 *** | (0.030) | 0.150 *** | (0.041) | 0.174 *** | (0.029) |
age | 0.001 | (0.002) | -0.001 | (0.002) | -0.006 *** | (0.001) |
N | 1268 | 788 | 1551 | |||
logLik | -759.912 | -538.476 | -1052.473 | |||
AIC | 1555.824 | 1106.953 | 2130.946 | |||
*** p < 0.001; ** p < 0.01; * p < 0.05.
The S3 generic function `predict` has been implemented for medley objects. Specifying the `newdata` parameter will give predictions for the validation data set.
medley_lr_predictions <- predict(medley_lr_out,
newdata = daacs_valid,
type = 'response')
The confusion matrix is provided below. The overall accuracy for the medley model is 64.51%, which is an 8.34 percentage point improvement over the baseline, or null, model.
confusion_matrix(observed = daacs_valid$retained,
predicted = medley_lr_predictions > 0.5)
#> predicted
#> observed FALSE TRUE
#> FALSE 306 (19.78%) 375 (24.24%)
#> TRUE 174 (11.25%) 692 (44.73%)
#> Accuracy: 64.51%
Random Forests
The core functionality of the `medley_train` algorithm is to select the most appropriate model given the available data; the specific predictive model is up to the user. Fernandez-Delgado et al. (2014) evaluated the performance of 179 classifiers across 121 data sets. Their results showed that, in general, random forest was the best performing model. To begin, we load the randomForest package and convert our dependent variable to a factor to ensure a classification (versus regression) model is estimated.
daacs_train$retained <- as.factor(daacs_train$retained)
daacs_valid$retained <- as.factor(daacs_valid$retained)
Training and predicting are the same as above except we set `method = randomForest`.
medley_rf_out <- medley_train(data = daacs_train,
formula = retained ~ .,
method = randomForest)
medley_rf_predictions <- predict(medley_rf_out,
newdata = daacs_valid,
type = "response")
Lastly, the confusion matrix gives the overall accuracy. In this example, though, we see that random forest performs slightly worse than logistic regression.
confusion_matrix(observed = daacs_valid$retained,
predicted = medley_rf_predictions)
#> predicted
#> observed 1 2
#> FALSE 320 (20.69%) 361 (23.34%)
#> TRUE 194 (12.54%) 672 (43.44%)
#> Accuracy: 64.12%
Using observations in multiple models
The default behavior of the `medley_train` algorithm is for each observation to be used in training only one model. However, in this particular example, we have complete demographic data for all students, so we could potentially use all observations to train that model. Setting the `exclusive_membership` parameter to `FALSE` allows each observation to be used in training every model for which it has complete data.
medley_rf_out2 <- medley_train(data = daacs_train,
formula = retained ~ .,
exclusive_membership = FALSE,
method = randomForest)
medley_rf_predictions2 <- predict(medley_rf_out2,
newdata = daacs_valid,
type = "response")
confusion_matrix(observed = daacs_valid$retained,
predicted = medley_rf_predictions2)
#> predicted
#> observed 1 2
#> FALSE 234 (15.13%) 447 (28.89%)
#> TRUE 110 (7.11%) 756 (48.87%)
#> Accuracy: 63.99%
As the results above show, the overall accuracy is essentially unchanged (63.99% versus 64.12%). It should be noted that predictions are estimated from the model that uses the most variables available for each observation.
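To see how training observations are shared across models once exclusive membership is disabled, we can tally the `model_observations` element (a sketch, assuming its columns are logical indicators as described above):

# Number of training observations used by each model; with
# exclusive_membership = FALSE an observation may appear in more than
# one column, so the counts sum to more than nrow(daacs_train)
colSums(medley_rf_out2$model_observations)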
Discussion
Method | Accuracy (%) | Improvement |
---|---|---|
Observed data only logistic regression | 61.09 | 4.92 |
Observed data only random forest | 57.79 | 1.62 |
Imputed data set logistic regression | 61.41 | 5.24 |
Imputed data set random forest | 60.57 | 4.40 |
Medley with logistic regression | 64.51 | 8.34 |
Medley with random forest | 64.12 | 7.95 |
Note: Improvement is the difference from the overall retention rate of 56.17%.
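The improvement column can be reproduced directly from the accuracies reported above (a trivial check; the null model predicts every student is retained, so its accuracy equals the 56.17% overall retention rate):

# Validation accuracy minus the null model accuracy, in percentage points
baseline <- 56.17
accuracies <- c(lr = 61.09, rf = 57.79, mice_lr = 61.41,
                mice_rf = 60.57, medley_lr = 64.51, medley_rf = 64.12)
round(accuracies - baseline, 2)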
Model | Observed | Predicted FALSE | Predicted TRUE | Accuracy |
---|---|---|---|---|
Observed data only logistic regression | FALSE | 295 (19.07%) | 386 (24.95%) | 61.09% |
 | TRUE | 216 (13.96%) | 650 (42.02%) | |
Observed data only random forest | FALSE | 279 (18.03%) | 402 (25.99%) | 57.79% |
 | TRUE | 251 (16.22%) | 615 (39.75%) | |
Imputed data set logistic regression | FALSE | 297 (19.20%) | 384 (24.82%) | 61.41% |
 | TRUE | 213 (13.77%) | 653 (42.21%) | |
Imputed data set random forest | FALSE | 278 (17.97%) | 403 (26.05%) | 60.57% |
 | TRUE | 207 (13.38%) | 659 (42.60%) | |
Medley with logistic regression | FALSE | 306 (19.78%) | 375 (24.24%) | 64.51% |
 | TRUE | 174 (11.25%) | 692 (44.73%) | |
Medley with random forest | FALSE | 320 (20.69%) | 361 (23.34%) | 64.12% |
 | TRUE | 194 (12.54%) | 672 (43.44%) | |