Medical Student Enrollment Forecasting.nb


Random Forest + Boosted Linear Ensemble
Student enrollment forecasting graphic


Business Challenge

Can we forecast medical student enrollments using advertising data from web-based platforms?

This is the question I tried to answer using Lone Star Medical Career College’s digital advertising campaign, spending, and student enrollment data to forecast enrollments for the coming year. As I started to develop KPI charts for Lone Star Medical Career College, it became apparent to me that this might be a big opportunity to apply the predictive modeling workflow I had just learned in my ML class.

Why student enrollments? For a career college, this is the main return on investment (ROI) figure. Technically, the return on all the time and resources that go into digital marketing can be measured in many different ways: demographic exposure, increases in online followers, etc. In this case, I wanted any work I did to potentially have a direct impact on the company’s bottom line. This is a common tactic, researchers at Gartner tell me, for any company piloting a data mining initiative. The following notebook highlights my first attempt at this business challenge.

Note: Due to delays in the big data approach, I decided to go with the data supplied on their KPI spreadsheets for preliminary analysis. Future development will be a big data project with actual hourly social media advertising data. Also, the name of the college was changed to ensure company privacy.

# Libraries
library(caret)
library(skimr)
library(corrplot)

Data Challenge

To forecast student enrollments, I will use the medical school’s monthly historical data and their digital marketing company’s monthly KPI reports for campaign spending. At first, I incorporated other marketing-based predictors such as lead counts and advertising platform (Google Ads vs. Facebook Ads), but initial model performance was rather low. After some refinement and visual exploration using Tableau, I settled on two predictors: campaign type and spending.

Imported Data

# Import and clean data
data <- read.csv("C:/Users/george/Documents/Digital Marketing Data/Triggers/spendingAndEnrollments.csv")
dataRGV <- data
dataRGV

Skimmed Data

Using skimr allows an at-a-glance view of the imported data. Campaign is a factor variable and Spending is numeric. Before visualizing or processing the Campaign variable, I will have to convert it to a numeric dummy variable for better performance. I also note that there are only 51 observations in this data set, so how I decide to partition the data will have a bigger impact on performance than usual (a quick sensitivity check is sketched after the partition code below).

# Split data
x <- subset(dataRGV, select = -Enrollments)
y <- subset(dataRGV, select = Enrollments)

# Data summary
skimX <- skim(x)
skimY <- skim(y)
skimX
-- Data Summary ------------------------
                           Values
Name                       x     
Number of rows             51    
Number of columns          2     
_______________________          
Column type frequency:           
  factor                   1     
  numeric                  1     
________________________         
Group variables            None  

-- Variable type: factor ----------------------------------------------------------------------------
# A tibble: 1 x 6
  skim_variable n_missing complete_rate ordered n_unique top_counts                        
* <chr>             <int>         <dbl> <lgl>      <int> <chr>                             
1 Campaign              0             1 FALSE          5 LVN: 11, Med: 11, Nur: 11, Pha: 11

-- Variable type: numeric ---------------------------------------------------------------------------
# A tibble: 1 x 11
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Spending              0             1 1377.  505.  362.  958. 1232. 1778. 2270. ▁▇▅▅▅
skimY
-- Data Summary ------------------------
                           Values
Name                       y     
Number of rows             51    
Number of columns          1     
_______________________          
Column type frequency:           
  numeric                  1     
________________________         
Group variables            None  

-- Variable type: numeric ---------------------------------------------------------------------------
# A tibble: 1 x 11
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Enrollments           0             1  9.20  8.03     0     3     8    13    31 ▇▅▂▁▂
# Data partition
seed <- 123
set.seed(seed)
dataPart <- createDataPartition(y$Enrollments,
                              p = .70,
                              list = FALSE)
xTrain <- x[dataPart,]
yTrain <- y[dataPart,]

xTest <- x[-dataPart,]
yTest <- y[-dataPart,]
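
Since the split itself becomes a bigger source of variance with so few rows, the following is a small illustrative check (not part of the original workflow) that re-draws the 70/30 partition under a few different seeds and compares the resulting training sets.

# Illustrative check only: with 51 observations, the choice of split matters.
# Re-drawing the 70/30 partition under a few seeds shows how much the size and
# target mean of the training set can shift.
sapply(c(123, 456, 789), function(s) {
  set.seed(s)
  idx <- createDataPartition(y$Enrollments, p = .70, list = FALSE)
  c(nTrain = nrow(idx), meanTrainEnroll = mean(y$Enrollments[idx]))
})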

Training Data Visualization and Correlation

To visualize the training data, I use standard boxplots and histograms, and I also plot the correlation between variables. At this point I note that no correlation screening seems necessary for this data, and the only outliers are in the dependent variable (Enrollments).

# Create numeric set of x variables
xTrainNum <- as.data.frame(sapply(xTrain, as.numeric))

# Boxplot of predictors
boxplot(xTrainNum)

boxplot(yTrain)


# Histograms
hist(xTrainNum$Campaign)

hist(xTrainNum$Spending)

hist(yTrain)


# Correlation plots and correlation ceiling
correlations <- cor(xTrainNum)
corrplot(correlations, method = "number", order = "hclust")
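
As an additional, purely illustrative screening step (not in the original workflow), caret’s findCorrelation() flags predictor columns whose pairwise correlation exceeds a cutoff; with only two predictors it should come back empty as long as their correlation stays below the threshold.

# Illustrative correlation screening: returns the column indices that would be
# dropped at a 0.75 cutoff (expected to be empty for this data).
findCorrelation(correlations, cutoff = 0.75)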

ML Algorithm Training

Pre-processing

I preprocessed using standard centering/scaling and a Yeo-Johnson transformation to normalize the data, plus a spatial-sign transformation for outliers (which I don’t think is necessary, but I usually add it anyway as it might boost performance a bit). I also used repeated 10-fold cross-validation (10 repeats) for resampling. In addition, I used dummy variables to encode the Campaign factor, and tuned all models.
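
To make these choices concrete, here is a minimal, standalone sketch (not part of the original training chunk) that dummy-codes the Campaign factor and applies the same preprocessing recipe outside of train(), so the transformed predictors can be inspected directly.

# Illustrative only: dummy-code Campaign and apply the preprocessing recipe
# outside of train() to see what the transformed predictors look like.
dmyIll <- dummyVars(" ~ .", data = xTrain, fullRank = TRUE)
xIll <- data.frame(predict(dmyIll, newdata = xTrain))
ppIll <- preProcess(xIll, method = c("center", "scale", "YeoJohnson", "spatialSign"))
head(predict(ppIll, xIll))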

Models and Tuned Performance

Using the caret workflow, I trained the data on the following models:

  • Boosted Linear Regression (BstLm)
  • Elastic Net Regression (enet)
  • Regularized Generalized Linear Regression (glmnet)
  • Least Angle Regression (lars)
  • Random Forest (rf)

Of the models I trained, ‘rf’ random forest seemed to have the best performance with an RMSE = 4.99 and R2 = 0.74.

# Model set up
dmy <- dummyVars(" ~ .", data = xTrain, fullRank = T)
xTrainDummy <- data.frame(predict(dmy, newdata = xTrain))

set.seed(seed)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)


# Predictive modeling w/ built-in feature selection
gridBstLm <- expand.grid(mstop = seq(8, 12, by = 1), nu = 0.5)
set.seed(seed)
modelBstLm <- train(x = xTrainDummy, y = yTrain, 
                method = "BstLm", 
                preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
                tuneGrid = gridBstLm,
                trControl = ctrl)
modelBstLm # RMSE = 5.36
Boosted Linear Model 

37 samples
 5 predictor

Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial
 sign transformation (5) 
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ... 
Resampling results across tuning parameters:

  mstop  RMSE      Rsquared   MAE     
   8     5.473832  0.7172550  4.423317
   9     5.423629  0.7146034  4.332640
  10     5.361737  0.7182865  4.291986
  11     5.421260  0.7097770  4.344338
  12     5.441640  0.7093973  4.332588

Tuning parameter 'nu' was held constant at a value of 0.5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mstop = 10 and nu = 0.5.
gridEnet <- expand.grid(fraction = seq(0.6, 0.8, by = 0.1), lambda = 0)
set.seed(seed)
modelEnet <- train(x = xTrainDummy, y = yTrain, 
                method = "enet", 
                preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
                tuneGrid = gridEnet,
                trControl = ctrl)
modelEnet # RMSE = 5.15
Elasticnet 

37 samples
 5 predictor

Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial
 sign transformation (5) 
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ... 
Resampling results across tuning parameters:

  fraction  RMSE      Rsquared   MAE     
  0.6       5.145523  0.7253282  4.015712
  0.7       5.145162  0.7193330  3.940605
  0.8       5.159086  0.7181921  3.933155

Tuning parameter 'lambda' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.7 and lambda = 0.
gridGlmNet <- expand.grid(alpha = c(0.9, 1, 1.1), lambda = c(0.6, 0.7, 0.8))
set.seed(seed)
modelGlmNet <- train(x = xTrainDummy, y = yTrain, 
                method = "glmnet",
                preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
                tuneGrid = gridGlmNet,
                trControl = ctrl)
alpha >1; set to 1    (warning repeated for every resample fit with the alpha = 1.1 grid rows)
modelGlmNet # RMSE = 5.17
glmnet 

37 samples
 5 predictor

Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial sign transformation (5) 
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ... 
Resampling results across tuning parameters:

  alpha  lambda  RMSE      Rsquared   MAE     
  0.9    0.6     5.179595  0.7158963  3.969675
  0.9    0.7     5.178803  0.7172605  3.986190
  0.9    0.8     5.178822  0.7197332  4.018424
  1.0    0.6     5.175057  0.7169748  3.971990
  1.0    0.7     5.172897  0.7193271  3.997127
  1.0    0.8     5.172940  0.7228559  4.039927
  1.1    0.6     5.175057  0.7169748  3.971990
  1.1    0.7     5.172897  0.7193271  3.997127
  1.1    0.8     5.172940  0.7228559  4.039927

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.7.
gridLars <- expand.grid(fraction = c(0.65, 0.7, 0.75))
set.seed(seed)
modelLars <- train(x = xTrainDummy, y = yTrain, 
                method = "lars", 
                preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
                tuneGrid = gridLars,
                trControl = ctrl)
modelLars # RMSE = 5.14
Least Angle Regression 

37 samples
 5 predictor

Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial sign transformation (5) 
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ... 
Resampling results across tuning parameters:

  fraction  RMSE      Rsquared   MAE     
  0.65      5.143184  0.7214069  3.962712
  0.70      5.145162  0.7193330  3.940605
  0.75      5.150836  0.7185754  3.926442

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was fraction = 0.65.
gridRF <- expand.grid(mtry = c(1.5, 2, 2.5))
set.seed(seed)
modelRF <- train(x = xTrainDummy, y = yTrain, 
                method = "rf", 
                preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
                tuneGrid = gridRF,
                trControl = ctrl)
modelRF # RMSE = 4.99, R^2 = 0.74
Random Forest 

37 samples
 5 predictor

Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial sign transformation (5) 
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
  1.5   4.999638  0.7365701  3.699015
  2.0   4.993346  0.7359774  3.713026
  2.5   4.994010  0.7383909  3.702070

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.

Model Performance Comparison Visualization

The tables and graphs below show that, though performance is similar across the models, the ‘rf’ random forest performs a little better than the others. In addition, the largest statistical difference lies between the ‘rf’ random forest and ‘BstLm’ boosted linear regression, so I may want to use them together if I decide to train a model ensemble.

# Compare model performances
modelResults <- resamples(list(BstLm = modelBstLm, Enet = modelEnet, GlmNet = modelGlmNet,
                               Lars = modelLars, RF = modelRF))
summary(modelResults)

Call:
summary.resamples(object = modelResults)

Models: BstLm, Enet, GlmNet, Lars, RF 
Number of resamples: 100 

MAE 
            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
BstLm  1.0604282 2.850993 4.017808 4.291986 5.782943 9.492177    0
Enet   0.3811782 2.405743 3.534180 3.940605 5.562529 8.729665    0
GlmNet 0.5022129 2.424687 3.721210 3.997127 5.564189 8.642130    0
Lars   0.4126387 2.473788 3.680178 3.962712 5.571044 8.506586    0
RF     0.5466667 1.821698 3.620551 3.713026 5.237585 9.404392    0

RMSE 
            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
BstLm  1.0940970 3.326571 4.654847 5.361737 7.719512 11.44774    0
Enet   0.3853941 2.882951 4.467947 5.145162 7.422270 11.68322    0
GlmNet 0.5091294 3.038352 4.510618 5.172897 7.321011 11.65059    0
Lars   0.4411069 3.004878 4.494555 5.143184 7.108701 11.72801    0
RF     0.5676295 2.278917 4.603319 4.993346 7.753244 11.67767    0

Rsquared 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
BstLm  1.863982e-04 0.5673405 0.8226835 0.7182865 0.9379406 0.9999987    0
Enet   1.488194e-04 0.5670039 0.8265974 0.7193330 0.9248793 0.9999990    0
GlmNet 1.214535e-04 0.5735208 0.8275697 0.7193271 0.9297692 0.9999997    0
Lars   5.876646e-05 0.5780867 0.8314518 0.7214069 0.9291269 0.9999995    0
RF     6.060585e-03 0.5901719 0.8272488 0.7359774 0.9576658 0.9999867    0
dotplot(modelResults)


# Check differences
modelDiff <- diff(modelResults)
summary(modelDiff)

Call:
summary.diff.resamples(object = modelDiff)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

MAE 
       BstLm     Enet      GlmNet    Lars      RF      
BstLm             0.35138   0.29486   0.32927   0.57896
Enet   4.171e-11           -0.05652  -0.02211   0.22758
GlmNet 5.041e-08 0.0028759            0.03442   0.28410
Lars   4.624e-09 1.0000000 0.0001817            0.24969
RF     0.0006684 1.0000000 0.4586034 0.8073614         

RMSE 
       BstLm     Enet      GlmNet    Lars      RF       
BstLm             0.216575  0.188840  0.218553  0.368391
Enet   0.0005908           -0.027735  0.001978  0.151816
GlmNet 0.0095493 0.3060463            0.029713  0.179551
Lars   0.0030252 1.0000000 0.0079327            0.149838
RF     0.2641020 1.0000000 1.0000000 1.0000000          

Rsquared 
       BstLm   Enet       GlmNet     Lars       RF        
BstLm          -1.047e-03 -1.041e-03 -3.120e-03 -1.769e-02
Enet   1.00000             5.877e-06 -2.074e-03 -1.664e-02
GlmNet 1.00000 1.00000               -2.080e-03 -1.665e-02
Lars   1.00000 1.00000    0.06348               -1.457e-02
RF     1.00000 1.00000    1.00000    1.00000              
dotplot(modelDiff)

Finalized Model Performance on Test Data

The final RMSE and R2 for the ‘rf’ random forest model on the test data:

# Transform test data
dmyTest <- dummyVars(" ~ .", data = xTest, fullRank = T)
xTestDummy <- data.frame(predict(dmyTest, newdata = xTest))

# Evaluate on test data
predictRF <- predict(modelRF, newdata = xTestDummy)
RMSE(predictRF, yTest)
[1] 6.612456
R2(predictRF, yTest)
[1] 0.4893805

Final Model Test Details

The final ‘rf’ model returned disappointing results for the test data. Only 49% of the variance in the test set is explained by the model, which returned an RMSE of 6.61, a much higher error than on the training resamples. Final model details are given below:

# Finalize Model
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin
set.seed(seed)
preProcParams <- preProcess(xTrainDummy, method = c("center", "scale", "YeoJohnson", "spatialSign"))
xTrainTrans <- predict(preProcParams, xTrainDummy)
finalModel <- randomForest(x = xTrainTrans, y = yTrain, mstop = 10)
finalModel

Call:
 randomForest(x = xTrainTrans, y = yTrain, mstop = 10) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 31.06351
                    % Var explained: 46.7
summary(finalModel)
                Length Class  Mode     
call              4    -none- call     
type              1    -none- character
predicted        37    -none- numeric  
mse             500    -none- numeric  
rsq             500    -none- numeric  
oob.times        37    -none- numeric  
importance        5    -none- numeric  
importanceSD      0    -none- NULL     
localImportance   0    -none- NULL     
proximity         0    -none- NULL     
ntree             1    -none- numeric  
mtry              1    -none- numeric  
forest           11    -none- list     
coefs             0    -none- NULL     
y                37    -none- numeric  
test              0    -none- NULL     
inbag             0    -none- NULL     
# Evaluate on test data
xTestNum <- as.data.frame(sapply(xTest, as.numeric))
set.seed(seed)
xTestTrans <- predict(preProcParams, xTestDummy)
finalModelPredict <- predict(finalModel, newdata = xTestTrans, mstop = 10)

# Calculate performance
finalModelRMSE <- RMSE(finalModelPredict, yTest)
finalModelR2 <- R2(finalModelPredict, yTest)
finalModelRMSE
[1] 6.625432
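
As a quick sanity check that was not part of the original run, plotting predicted versus actual enrollments on the test set and inspecting the forest’s variable importance helps show where the resampled and held-out performance diverge.

# Illustrative diagnostics only (not in the original run).
plot(yTest, finalModelPredict,
     xlab = "Actual enrollments", ylab = "Predicted enrollments",
     main = "Final RF model: test-set fit")
abline(0, 1, lty = 2)    # points on this line would be perfect predictions
importance(finalModel)   # IncNodePurity for each dummy-coded predictor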

At this point, it seems the model may be overfit. For now, let’s try an ensemble of the ‘rf’ random forest and ‘BstLm’ boosted linear models.

Train and Evaluate ‘rf’ + ‘BstLm’ Model Ensemble

The ‘rf’ + ‘BstLm’ model ensemble did not clearly improve on the random forest’s resampled training performance, but it did produce a slightly lower RMSE and a slightly higher R2 on the test data than the random forest alone.

library(caretEnsemble)

Attaching package: ‘caretEnsemble’

The following object is masked from ‘package:ggplot2’:

    autoplot
# Train a list of models with caretList()
listOfModels <- c('rf', 'BstLm')
set.seed(seed)
modelList <- caretList(x = xTrainDummy, y = yTrain, 
                    trControl = ctrl,
                    preProc = c("center", "scale", "YeoJohnson", "spatialSign"), 
                    methodList = listOfModels)
trControl$savePredictions not 'all' or 'final'.  Setting to 'final' so we can ensemble the models.
indexes not defined in trControl.  Attempting to set them ourselves, so each model in the ensemble will have the same resampling indexes.
# Model results and correlation
results <- resamples(modelList)
summary(results)

Call:
summary.resamples(object = results)

Models: rf, BstLm 
Number of resamples: 100 

MAE 
           Min.  1st Qu.   Median    Mean  3rd Qu.     Max. NA's
rf    0.6956250 1.714279 3.508950 3.75202 5.066671 10.76811    0
BstLm 0.6308284 2.613128 3.969049 4.33590 5.610788 11.57693    0

RMSE 
           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
rf    0.7554055 2.176715 4.471739 4.980845 7.477950 12.63715    0
BstLm 0.6886622 3.335793 4.804094 5.316516 7.318563 12.37133    0

Rsquared 
              Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
rf    1.398705e-03 0.6367162 0.8773240 0.7373097 0.9651239 0.999880    0
BstLm 8.916097e-05 0.5744188 0.7831661 0.7132427 0.9424384 0.998526    0
dotplot(results)

modelCor(results)
             rf     BstLm
rf    1.0000000 0.8122617
BstLm 0.8122617 1.0000000
# Combine the models with caretEnsemble()
set.seed(seed)
modelEnsemble <- caretEnsemble(modelList, metric = "RMSE",
                              trControl = trainControl(number = 2))

summary(modelEnsemble) # ensemble RMSE = 5.69
The following models were ensembled: rf, BstLm 
They were weighted: 
0.4024 0.5448 0.3882
The resulting RMSE is: 5.691
The fit for each individual model on the RMSE is: 
# Evaluate on test data
predictEnsemble <- predict(modelEnsemble, newdata = xTestDummy)
RMSE(predictEnsemble, yTest)
[1] 6.465112
R2(predictEnsemble, yTest)
[1] 0.5127668

Since overall performance is comparable, I’ll go ahead and use the ensemble for the enrollment forecast.

Data Solution

Final Model Prediction: Forecasted Enrollments

To forecast future student enrollments, I used the following December 2019 spending figures:

  • LVN: $1,271.86
  • Med Asst: $2,568.00
  • Pharmacy: $1,015.64
  • Nurse Aide: $939.75
  • Med Asst: $1,688.54

Let’s assume the following spending increases by Lone Star Medical Career College:

  • Dec + 10%
  • Dec + 15%
  • Dec + 20%

Based on the different campaign categories and these increases in spending, the third column of the table below lists the forecasted student enrollments. (A sketch of how this scenario table could be built programmatically follows the prediction code.)

# Import and clean data
predData <- read.csv("C:/Users/george/Documents/Digital Marketing Data/Triggers/predictedSpendingAndEnrollments.csv")

# Transform forecast data
dmyPredData <- dummyVars(" ~ .", data = predData, fullRank = T)
predDummy <- data.frame(predict(dmyPredData, newdata = predData))

# Predict forecast data
predEnrollments <- data.frame(predict(modelEnsemble, newdata = predDummy))
predExport <- data.frame(c(predData,predEnrollments))
predExport
#write.csv(predExport, "C:/Users/george/Documents/Digital Marketing Data/Triggers/predictedExport.csv", row.names = FALSE)
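
For reference, the scenario table read in above could also be built programmatically from the December 2019 figures. The sketch below is illustrative only; it assumes the forecast file uses the same Campaign and Spending column names as the training data, and the repeated “Med Asst” row simply mirrors the list above.

# Sketch only: rebuild the +10% / +15% / +20% spending scenarios from the
# December 2019 figures (column names are assumed to match the training data).
decSpending <- data.frame(
  Campaign = c("LVN", "Med Asst", "Pharmacy", "Nurse Aide", "Med Asst"),
  Spending = c(1271.86, 2568.00, 1015.64, 939.75, 1688.54)
)
increases <- c(1.10, 1.15, 1.20)
scenarioData <- do.call(rbind, lapply(increases, function(m)
  data.frame(Campaign = decSpending$Campaign,
             Spending = round(decSpending$Spending * m, 2))))
scenarioData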

Business Solution

Overall, the ‘rf’+‘BstLm’ ensemble model indicates that incremental increases in spending should have a positive effect on the number of overall medical student enrollments. However, due to low model performance, these predictions are not very conclusive. Future projects should provide higher quality data and better modeling parameters.

Since this is the first applied data mining project of its kind for both the medical school and their digital marketing company, the transaction channel between company ad spending and student enrollment counts is not yet automated into a database, and it is not clear how the two companies communicate this type of ROI with each other. In other words, we cannot fully determine how many student enrollments are attributable solely to ad-generated leads and spending. Unfortunately, this is a major issue when attempting to forecast future enrollments, and it is part of why the model developed appears to be overfit and does not handle new data as well as hoped.

Future developments may include more granular, hourly data sourced from the online platforms themselves and a less overfit model. In addition, a new strategy for tracking ad-generated leads through to enrollment could be developed for much better marketing-funnel business intelligence. This project demonstrates the importance of streamlining business processes for optimized intelligence.
