Skip to contents

Introduction

Most predictive modeling strategies require complete data to train and predict outcomes. A common strategy is to impute the missing data before training the models. However, in many situations, missing data occurs in a systematic way therefore violating the missing completely at random (MCAR) assumption. For instance, data may be collected over multiple time periods or the available data may vary depending on varying collection protocols or selection bias at the individual level. The goal of this package is to provide a framework for conducting predictive model that takes into account the patterns of missing data.

The motivating example for this approach predictive modeling comes from predicting student success in college. Institutions have become increasingly interested in using predictive models to identify students at risk for attrition as early as possible in order to provided targeted interventions and supports. Information about students is collected over several months beginning at the college application phase, commitment to enroll, orientation, and finally behavior data as the student begins coursework. In addition to missing data being related to where the student is in the process, students may opt to not provide certain data elements. This results in a complex system where a single predictive model is appropriate.

Data Source

This paper will explore data from the Diagnostic Assessment and Achievement of College Skills (DAACS; https://daacs.net). DAACS is a suite of technological and social supports designed to optimize student learning. Students complete assessments in self-regulated learning, writing, mathematics, and reading and upon completion receive immediate feedback in terms of developing, emerging, and mastering. The feedback is tailored to their results. The data for this paper was part of a larger randomized control trial where DAACS was embedded within orientation for the treatment students. Although students were instructed that orientation was required, there were no consequences for not completing orientation and therefore approximately 43% of students did not attempt orientation, and therefore did not complete DAACS. The goal is to predicted term-to-term retention (the retained variable) for each student based upon the available data. Table @ref(tab:descriptives) for descriptive statistics which reveals that the DAACS results are not complete for some students. The overall retention rate is 56.17%.

Descriptive Statistics
Variable N Mean Std Dev Median
retained 5154
… No 2259 44%
… Yes 2895 56%
page_views 39 24 34
srl 2.8 0.43 2.8
math 0.58 0.2 0.61
reading 0.85 0.17 0.89
writing 0.78 0.16 0.83
income 4.4 2.4 4
employment 2.6 0.75 3
ell 0.93 0.26 1
ed_mother 3.3 2 3
ed_father 3.2 2 3
ethnicity 5154
… Black or African American 1049 20%
… Hispanic 699 14%
… Other 548 11%
… White 2858 55%
gender 5154
… FEMALE 2123 41%
… MALE 3031 59%
military 5154
… No 2881 56%
… Yes 2273 44%
age 35 9.3 34

Missing Data Patterns

To begin we first explore the patterns of missing data. Figure @ref(fig:upset) is an UpSet plot created from a shadow matrix1. Each vertical line corresponds to a set, or combination, of variables. The dots indicate that that variable is included in the set. The bars on the top correspond to the number of observations in that set and the bars to the right correspond to the total number of observed values. The largest set includes only the demographics variables, the second largest included demographics and all DAACS variables. There is a third set that includes self-regulated learning along with demographics that is worth considering since it contains more than 10% of the observations. It should be noted that there is a potential fourth set which included demographic variables along with three of DAACS variables (SRL, reading, and math). However, since this set has fewer than 10% of the observations we will use a three set/model solution.

shadow_matrix <- as.data.frame(!is.na(daacs))
ComplexHeatmap::make_comb_mat(shadow_matrix) |> ComplexHeatmap::UpSet()
UpSet plot of the missing data

UpSet plot of the missing data

Data Preparation

To perform the predictive modeling we will split the data into training (70%) and validation (30%) data sets.

set.seed(2112); train_rows <- sample(nrow(daacs), nrow(daacs) * 0.7)
daacs_train <- daacs[train_rows,]
daacs_valid <- daacs[-train_rows,]

Baseline Models

There are generally two choices when estimating models when there is missing data: 1) Model using only the available data or 2) Impute the missing data before modeling.

Using available data

In the DAACS data set as depicted in Figure @ref(fig:upset) the demographics variables were observed for all students. To start we train a logistic regression model on the training data.

lr_out <- glm(data = daacs_train,
              formula = retained ~ income + employment + ell + ed_mother + ed_father +
                ethnicity + gender + military + age,
              family = binomial(link = 'logit'))
rf_out <- randomForest(formula = factor(retained) ~ income + employment + ell + ed_mother + ed_father +
                         ethnicity + gender + military + age,
                       data = daacs_train)

We can get predicted values from the validation data set and print the confusion matrix.

lr_predictions <- predict(lr_out, newdata = daacs_valid, type = 'response')
confusion_matrix(observed = daacs_valid$retained,
                 predicted = lr_predictions > 0.5)
#>               predicted             
#>   observed        FALSE         TRUE
#>      FALSE 295 (19.07%) 386 (24.95%)
#>       TRUE 216 (13.96%) 650 (42.02%)
#> Accuracy: 61.09%

rf_predictions <- predict(rf_out, newdata = daacs_valid, type = 'response')
confusion_matrix(observed = daacs_valid$retained,
                 predicted = rf_predictions)
#>               predicted             
#>   observed        FALSE         TRUE
#>      FALSE 279 (18.03%) 402 (25.99%)
#>       TRUE 251 (16.22%) 615 (39.75%)
#> Accuracy: 57.79%

The overall accuracy using only the demographic variables is 61.09%

Multiple imputation

Another common approach to modeling with missing data is multiple imputation. The mice package (Multivariate Imputations by Chained Equations) is a robust and popular approach to imputing missing data. For simplicity we will use the final imputed data set for comparison2.

mice_out <- mice::mice(daacs[,-1], M = 5, seed = 2112, printFlag = FALSE)
daacs_complete <- cbind(retained = daacs$retained, mice::complete(mice_out))

daacs_train_complete <- daacs_complete[train_rows,]
daacs_valid_complete <- daacs_complete[-train_rows,]

With the missing DAACS data imputed we can train a logistic regression model using the full data set.

mice_lr_out <- glm(formula = retained ~ .,
                   data = daacs_train_complete,
                   family = binomial(link = logit))
mice_lr_predictions <- predict(mice_lr_out, newdata = daacs_valid_complete, type = 'response')
confusion_matrix(observed = daacs_valid_complete$retained,
                 predicted = mice_lr_predictions > 0.5)
#>               predicted             
#>   observed        FALSE         TRUE
#>      FALSE 297 (19.20%) 384 (24.82%)
#>       TRUE 213 (13.77%) 653 (42.21%)
#> Accuracy: 61.41%

mice_rf_out <- randomForest(formula = factor(retained) ~ .,
                            data = daacs_train_complete)
mice_rf_predictions <- predict(mice_rf_out, newdata = daacs_valid_complete, type = 'response')
confusion_matrix(observed = daacs_valid_complete$retained,
                 predicted = mice_rf_predictions)
#>               predicted             
#>   observed        FALSE         TRUE
#>      FALSE 278 (17.97%) 403 (26.05%)
#>       TRUE 207 (13.38%) 659 (42.60%)
#> Accuracy: 60.57%

The overall accuracy using only imputed data set is 61.41%

Medley models

The medley_train function implements a step wise approach to training models. The data and formula parameters specify the data set and full model (i.e. all possible predictor variables to be considered), similar to other modeling functions in R. The method parameter indicates what model procedure should be used. In this example we will estimate logistic regression models. The medley_train can take any additional parameters that need to be passed to the method function.

medley_lr_out <- medley_train(data = daacs_train,
                              formula = retained ~ .,
                              method = glm,
                              family = binomial(link = logit))

Table @ref(tab:model-summaries) provides the baseline retention rate by model along with the number of observations and formula for each of the models. Before exploring the specific of the modeling this reveals that the pattern of missing data is predictive of success. Students who complete all four DAACS assessments are % more likely to be retained then students who did not complete any of the assessments.

Baseline retention rate by model
Model n Success Formula
1 1268 70.98 retained ~ page_views + srl + math + reading + writing + income + employment + ell + ed_mother + ed_father + ethnicity + gender + military + age
2 788 58.25 retained ~ page_views + srl + income + employment + ell + ed_mother + ed_father + ethnicity + gender + military + age
3 1551 43.20 retained ~ income + employment + ell + ed_mother + ed_father + ethnicity + gender + military + age

The object returned by medley_train contains the following elements:

  • n_models - The number of models estimated.
  • formulas - A list of the formulas used for each model.
  • models - A list containing the model output for each model. In this example this would contain the results of the glm function call.
  • data - The full data set used to train the models.
  • model_observations - A data frame indicating which models each observation was used in. The rows correspond to the rows in data and the columns correspond to the model.

By default the algorithm will use all sets that have at least 10% of the total observations (see @ref(tab:model-summaries)). This can be adjusted using the min_set_size parameter (see the get_variable_sets() function). Optionally you can specify the models directly by passing a list of formulas with the var_sets parameter.

Table @ref(tab:modelresults) provides the model summaries for the 3 models estimated.

(#tab:modelresults) Logistic regression results
(1) (2) (3)
(Intercept) 0.574 *** (0.144) 0.533 **  (0.175) 0.435 *** (0.085)
page_views -0.001     (0.001) -0.001     (0.001)               
srl -0.025     (0.031) -0.043     (0.042)               
math 0.085     (0.075)                              
reading -0.086     (0.087)                              
writing 0.031     (0.083)                              
income -0.000     (0.006) 0.010     (0.008) 0.014 *   (0.006)
employment 0.030     (0.017) -0.043     (0.024) -0.011     (0.017)
ell 0.004     (0.053) 0.103     (0.074) 0.069     (0.047)
ed_mother 0.011     (0.009) 0.000     (0.012) -0.008     (0.009)
ed_father 0.002     (0.008) 0.021     (0.012) 0.006     (0.008)
ethnicityHispanic -0.049     (0.050) 0.102     (0.061) 0.026     (0.043)
ethnicityOther -0.003     (0.051) 0.052     (0.067) 0.057     (0.046)
ethnicityWhite 0.009     (0.039) 0.082     (0.045) 0.015     (0.030)
genderMALE 0.037     (0.029) -0.010     (0.040) 0.064 *   (0.027)
militaryTRUE 0.166 *** (0.030) 0.150 *** (0.041) 0.174 *** (0.029)
age 0.001     (0.002) -0.001     (0.002) -0.006 *** (0.001)
N 1268               788               1551              
logLik -759.912           -538.476           -1052.473          
AIC 1555.824           1106.953           2130.946          
*** p < 0.001; ** p < 0.01; * p < 0.05.

The S3 generic function predict has been implemented. Specifying the newdata parameter will give predictions for the validation data set.

medley_lr_predictions <- predict(medley_lr_out,
                                 newdata = daacs_valid,
                                 type = 'response')

The confusion matrix is provided below. The overall accuracy for the medley model is 61.086% which is a 5.11% over the baseline, or null, model.

confusion_matrix(observed = daacs_valid$retained,
                 predicted = medley_lr_predictions > 0.5)
#>               predicted             
#>   observed        FALSE         TRUE
#>      FALSE 306 (19.78%) 375 (24.24%)
#>       TRUE 174 (11.25%) 692 (44.73%)
#> Accuracy: 64.51%

Random Forests

The core functionality of the medley_train algorithm is to select the most appropriate model given the available data. The specific predictive model is up to the user. Fernandez-Delgado et al (2014) evaluated the performance of 179 classifiers across 121 data sets. Their results showed that, in general, random forest was the best performing model. To begin, we load the randomForest pacakge and convert our dependent variable to a factor to ensure a classification (versus regression) model is estimated.

daacs_train$retained <- as.factor(daacs_train$retained)
daacs_valid$retained <- as.factor(daacs_valid$retained)

Training and predicting are the same as above except we set method = randomForest.

medley_rf_out <- medley_train(data = daacs_train,
                              formula = retained ~ .,
                              method = randomForest)
medley_rf_predictions <- predict(medley_rf_out, 
                                 newdata = daacs_valid,
                                 type = "response")

Lastly, the confusion matrix gives the overall accuracy. In this example though we see the random forest performs sligthly worse than logistic regression.

confusion_matrix(observed = daacs_valid$retained,
                 predicted = medley_rf_predictions )
#>               predicted             
#>   observed            1            2
#>      FALSE 320 (20.69%) 361 (23.34%)
#>       TRUE 194 (12.54%) 672 (43.44%)
#> Accuracy: 64.12%

Using observations in multiple models

The default behavior of the medley_train algorithm is for each observation to be used in only one model. However, in this particular example, we have complete demographic data for all students so we could potentially use all observations to train that model. The exclusive_membership parameter will allow observations to be used in training models for which there is complete data.

medley_rf_out2 <- medley_train(data = daacs_train,
                              formula = retained ~ .,
                              exclusive_membership = FALSE,
                              method = randomForest)
medley_rf_predictions2 <- predict(medley_rf_out2, 
                          newdata = daacs_valid,
                          type = "response")
confusion_matrix(observed = daacs_valid$retained,
                 predicted = medley_rf_predictions2 )
#>               predicted             
#>   observed            1            2
#>      FALSE 234 (15.13%) 447 (28.89%)
#>       TRUE  110 (7.11%) 756 (48.87%)
#> Accuracy: 63.99%

As the results above show we get a very modest increase in the overall accuracy. It should be noted that predictions are estimated from the model that uses most variables for each observation.

Discussion

Model performance summary
Method Accuracy Improvement
Observed data only logistic regression 61.09 4.92
Observed data only random forest 57.79 1.62
Imputed data set logistic regression 61.41 5.24
Imputed data set random forest 60.57 4.40
Medley with logistic regression 64.51 8.34
Medley with random forest 64.12 7.95

Note: Improvement is the difference with the overall retention rate of 56.17%.

                                                    Predicted             
                                  Model Observed        FALSE         TRUE
 Observed data only logistic regression    FALSE 295 (19.07%) 386 (24.95%)
                                            TRUE 216 (13.96%) 650 (42.02%)
                                                                          
       Observed data only random forest    FALSE 279 (18.03%) 402 (25.99%)
                                            TRUE 251 (16.22%) 615 (39.75%)
                                                                          
   Imputed data set logistic regression    FALSE 297 (19.20%) 384 (24.82%)
                                            TRUE 213 (13.77%) 653 (42.21%)
                                                                          
         Imputed data set random forest    FALSE 278 (17.97%) 403 (26.05%)
                                            TRUE 207 (13.38%) 659 (42.60%)
                                                                          
        Medley with logistic regression    FALSE 306 (19.78%) 375 (24.24%)
                                            TRUE 174 (11.25%) 692 (44.73%)
                                                                          
              Medley with random forest    FALSE 320 (20.69%) 361 (23.34%)
                                            TRUE 194 (12.54%) 672 (43.44%)
                                                                          
         
 Accuracy
         
         
   61.09%
         
         
   57.79%
         
         
   61.41%
         
         
   60.57%
         
         
   64.51%
         
         
   64.12%