Medical Student Enrollment Forecasting.nb
Random Forest + Boosted Linear Ensemble
2020-01-16
Business Challenge
Can we forecast medical student enrollments using advertising data from web-based platforms?
This is the question I tried to answer using Lone Star Medical Career College’s digital advertising campaign, spending, and student enrollment data to forecast enrollments for the coming year. As I started to develop KPI charts for Lone Star Medical Career College, it became apparent to me that this might be a big opportunity to apply the predictive modeling workflow I had just learned in my ML class.
Why student enrollments? For a career college, this is the main return on investment (ROI) figure. Technically, the return on all the time and resources that go into digital marketing can be measured in many different ways: demographic exposure, increases in online followers, etc. In this case, I wanted any work I did to have a direct impact on the company’s bottom line. According to the researchers at Gartner.com, this is a common tactic for any company piloting a data mining initiative. The following notebook highlights my first attempt at this business challenge.
Note: Due to delays in the big data approach, I decided to go with the data supplied on their KPI spreadsheets for preliminary analysis. Future development will be a big data project with actual hourly social media advertising data. Also, the name of the college was changed to ensure company privacy.
# Libraries
library(caret)
library(skimr)
library(corrplot)
Data Challenge
To forecast student enrollments, I will use the medical school’s monthly historical data and their digital marketing company’s monthly KPI reports for campaign spending. At first, I incorporated other marketing-based predictor variables into the data, such as lead counts and social media platform (Google Ads vs. Facebook Ads), but initial model performance seemed rather low. After some refinement and visual exploration using Tableau, I settled on the following predictor variables: campaign type and spending.
Imported Data
# Import and clean data
data <- read.csv("C:/Users/george/Documents/Digital Marketing Data/Triggers/spendingAndEnrollments.csv")
dataRGV <- data
dataRGV
Skimmed Data
Using skimr allows an at-a-glance view of the imported data. Campaign is a factor variable and Spending is numeric. Before visualizing or processing the Campaign variable, I will have to convert it to a numeric dummy variable for better performance. I also note that there are only 51 observations in this data set, so how I decide to partition the data will have a bigger impact on performance than usual.
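As a quick illustration of that encoding, here is a minimal sketch using caret’s dummyVars() on a toy campaign factor; the values below are made up, and the same function is applied to the real training data in the modeling section.
# Minimal dummy-variable sketch on a toy factor (made-up values)
toy <- data.frame(Campaign = factor(c("LVN", "Pharmacy", "LVN")),
                  Spending = c(1200, 950, 1400))
toyDmy <- dummyVars(" ~ .", data = toy, fullRank = TRUE)
predict(toyDmy, newdata = toy) # Campaign becomes 0/1 indicator columns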
# Split data
x <- subset(dataRGV, select = -Enrollments)
y <- subset(dataRGV, select = Enrollments)
# Data summary
skimX <- skim(x)
skimY <- skim(y)
skimX
-- Data Summary ------------------------
Values
Name x
Number of rows 51
Number of columns 2
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables None
-- Variable type: factor ----------------------------------------------------------------------------
# A tibble: 1 x 6
skim_variable n_missing complete_rate ordered n_unique top_counts
* <chr> <int> <dbl> <lgl> <int> <chr>
1 Campaign 0 1 FALSE 5 LVN: 11, Med: 11, Nur: 11, Pha: 11
-- Variable type: numeric ---------------------------------------------------------------------------
# A tibble: 1 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Spending 0 1 1377. 505. 362. 958. 1232. 1778. 2270. ▁▇▅▅▅
skimY
-- Data Summary ------------------------
Values
Name y
Number of rows 51
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
-- Variable type: numeric ---------------------------------------------------------------------------
# A tibble: 1 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Enrollments 0 1 9.20 8.03 0 3 8 13 31 ▇▅▂▁▂
# Data partition
seed <- 123
set.seed(seed)
dataPart <- createDataPartition(y$Enrollments,
p = .70,
list = FALSE)
xTrain <- x[dataPart,]
yTrain <- y[dataPart,]
xTest <- x[-dataPart,]
yTest <- y[-dataPart,]
Training Data Visualization and Correlation
To visualize the training data, I use standard boxplots and histograms. I also plot the correlation between variables. At this point I note that no correlation screening seems necessary for this data, and the only outliers are in the dependent variable (Enrollments).
# Create numeric set of x variables
xTrainNum <- as.data.frame(sapply(xTrain, as.numeric))
# Boxplot of predictors
boxplot(xTrainNum)
boxplot(yTrain)
# Histograms
hist(xTrainNum$Campaign)
hist(xTrainNum$Spending)
hist(yTrain)
# Correlation plots and correlation ceiling
correlations <- cor(xTrainNum)
corrplot(correlations, method = "number", order = "hclust")
ML Algorithm Training
Pre-processing
I pre-processed using standard centering and scaling, a Yeo-Johnson transformation to normalize the data, and spatialSign for outliers (which I don’t think is strictly necessary, but I usually add it anyway since it can boost performance a bit). I also used 10-fold repeated cross-validation for resampling. In addition, I used dummy variables to account for the Campaign factor variable, and tuned all models.
Models and Tuned Performance
Using the caret workflow, I trained the data on the following models:
- Boosted Linear Regression (BstLm)
- Elastic Net Regression (enet)
- Generalized Linear Regression (glmnet)
- Least Angle Regression (lars)
- Random Forest (rf)
Of the models I trained, ‘rf’ random forest seemed to have the best performance with an RMSE = 4.99 and R2 = 0.74.
# Model set up
dmy <- dummyVars(" ~ .", data = xTrain, fullRank = T)
xTrainDummy <- data.frame(predict(dmy, newdata = xTrain))
set.seed(seed)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
# Predictive modeling w/ built-in feature selection
gridBstLm <- expand.grid(mstop = seq(8, 12, by = 1), nu = 0.5)
set.seed(seed)
modelBstLm <- train(x = xTrainDummy, y = yTrain,
method = "BstLm",
preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
tuneGrid = gridBstLm,
trControl = ctrl)
modelBstLm # RMSE = 5.36
Boosted Linear Model
37 samples
5 predictor
Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial
sign transformation (5)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ...
Resampling results across tuning parameters:
mstop RMSE Rsquared MAE
8 5.473832 0.7172550 4.423317
9 5.423629 0.7146034 4.332640
10 5.361737 0.7182865 4.291986
11 5.421260 0.7097770 4.344338
12 5.441640 0.7093973 4.332588
Tuning parameter 'nu' was held constant at a value of 0.5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mstop = 10 and nu = 0.5.
gridEnet <- expand.grid(fraction = seq(0.6, 0.8, by = 0.1), lambda = 0)
set.seed(seed)
modelEnet <- train(x = xTrainDummy, y = yTrain,
method = "enet",
preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
tuneGrid = gridEnet,
trControl = ctrl)
modelEnet # RMSE = 5.15
Elasticnet
37 samples
5 predictor
Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial
sign transformation (5)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ...
Resampling results across tuning parameters:
fraction RMSE Rsquared MAE
0.6 5.145523 0.7253282 4.015712
0.7 5.145162 0.7193330 3.940605
0.8 5.159086 0.7181921 3.933155
Tuning parameter 'lambda' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.7 and lambda = 0.
gridGlmNet <- expand.grid(alpha = c(0.9, 1, 1.1), lambda = c(0.6, 0.7, 0.8))
set.seed(seed)
modelGlmNet <- train(x = xTrainDummy, y = yTrain,
method = "glmnet",
preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
tuneGrid = gridGlmNet,
trControl = ctrl)
alpha >1; set to 1 (warning repeated for every resample fit, since the tuning grid includes alpha = 1.1)
modelGlmNet # RMSE = 5.17
glmnet
37 samples
5 predictor
Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial sign transformation (5)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.9 0.6 5.179595 0.7158963 3.969675
0.9 0.7 5.178803 0.7172605 3.986190
0.9 0.8 5.178822 0.7197332 4.018424
1.0 0.6 5.175057 0.7169748 3.971990
1.0 0.7 5.172897 0.7193271 3.997127
1.0 0.8 5.172940 0.7228559 4.039927
1.1 0.6 5.175057 0.7169748 3.971990
1.1 0.7 5.172897 0.7193271 3.997127
1.1 0.8 5.172940 0.7228559 4.039927
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.7.
gridLars <- expand.grid(fraction = c(0.65, 0.7, 0.75))
set.seed(seed)
modelLars <- train(x = xTrainDummy, y = yTrain,
method = "lars",
preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
tuneGrid = gridLars,
trControl = ctrl)
modelLars # RMSE = 5.14
Least Angle Regression
37 samples
5 predictor
Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial sign transformation (5)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ...
Resampling results across tuning parameters:
fraction RMSE Rsquared MAE
0.65 5.143184 0.7214069 3.962712
0.70 5.145162 0.7193330 3.940605
0.75 5.150836 0.7185754 3.926442
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was fraction = 0.65.
gridRF <- expand.grid(mtry = c(1.5, 2, 2.5))
set.seed(seed)
modelRF <- train(x = xTrainDummy, y = yTrain,
method = "rf",
preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
tuneGrid = gridRF,
trControl = ctrl)
modelRF # RMSE = 4.99, R^2 = 0.74
Random Forest
37 samples
5 predictor
Pre-processing: centered (5), scaled (5), Yeo-Johnson transformation (1), spatial sign transformation (5)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 33, 33, 34, 33, 33, 34, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
1.5 4.999638 0.7365701 3.699015
2.0 4.993346 0.7359774 3.713026
2.5 4.994010 0.7383909 3.702070
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
Model Performance Comparison Visualization
The tables and graphs below show that, although performance is similar across the models, the ‘rf’ random forest performs a little better on this data than the others. In addition, the largest statistical difference lies between the ‘rf’ random forest and ‘BstLm’ boosted linear regression, so I may want to use those two together if I decide to train a model ensemble.
# Compare model performances
modelResults <- resamples(list(BstLm = modelBstLm, Enet = modelEnet, GlmNet = modelGlmNet,
Lars = modelLars, RF = modelRF))
summary(modelResults)
Call:
summary.resamples(object = modelResults)
Models: BstLm, Enet, GlmNet, Lars, RF
Number of resamples: 100
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
BstLm 1.0604282 2.850993 4.017808 4.291986 5.782943 9.492177 0
Enet 0.3811782 2.405743 3.534180 3.940605 5.562529 8.729665 0
GlmNet 0.5022129 2.424687 3.721210 3.997127 5.564189 8.642130 0
Lars 0.4126387 2.473788 3.680178 3.962712 5.571044 8.506586 0
RF 0.5466667 1.821698 3.620551 3.713026 5.237585 9.404392 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
BstLm 1.0940970 3.326571 4.654847 5.361737 7.719512 11.44774 0
Enet 0.3853941 2.882951 4.467947 5.145162 7.422270 11.68322 0
GlmNet 0.5091294 3.038352 4.510618 5.172897 7.321011 11.65059 0
Lars 0.4411069 3.004878 4.494555 5.143184 7.108701 11.72801 0
RF 0.5676295 2.278917 4.603319 4.993346 7.753244 11.67767 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
BstLm 1.863982e-04 0.5673405 0.8226835 0.7182865 0.9379406 0.9999987 0
Enet 1.488194e-04 0.5670039 0.8265974 0.7193330 0.9248793 0.9999990 0
GlmNet 1.214535e-04 0.5735208 0.8275697 0.7193271 0.9297692 0.9999997 0
Lars 5.876646e-05 0.5780867 0.8314518 0.7214069 0.9291269 0.9999995 0
RF 6.060585e-03 0.5901719 0.8272488 0.7359774 0.9576658 0.9999867 0
dotplot(modelResults)
# Check differences
modelDiff <- diff(modelResults)
summary(modelDiff)
Call:
summary.diff.resamples(object = modelDiff)
p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0
MAE
BstLm Enet GlmNet Lars RF
BstLm 0.35138 0.29486 0.32927 0.57896
Enet 4.171e-11 -0.05652 -0.02211 0.22758
GlmNet 5.041e-08 0.0028759 0.03442 0.28410
Lars 4.624e-09 1.0000000 0.0001817 0.24969
RF 0.0006684 1.0000000 0.4586034 0.8073614
RMSE
BstLm Enet GlmNet Lars RF
BstLm 0.216575 0.188840 0.218553 0.368391
Enet 0.0005908 -0.027735 0.001978 0.151816
GlmNet 0.0095493 0.3060463 0.029713 0.179551
Lars 0.0030252 1.0000000 0.0079327 0.149838
RF 0.2641020 1.0000000 1.0000000 1.0000000
Rsquared
BstLm Enet GlmNet Lars RF
BstLm -1.047e-03 -1.041e-03 -3.120e-03 -1.769e-02
Enet 1.00000 5.877e-06 -2.074e-03 -1.664e-02
GlmNet 1.00000 1.00000 -2.080e-03 -1.665e-02
Lars 1.00000 1.00000 0.06348 -1.457e-02
RF 1.00000 1.00000 1.00000 1.00000
dotplot(modelDiff)
Finalized Model Performance on Test Data
The final RMSE and R2 for the ‘rf’ random forest model on the test data:
# Transform test data
# Reuse the dummy-variable encoder fit on the training data
xTestDummy <- data.frame(predict(dmy, newdata = xTest))
# Evaluate on test data
predictRF <- predict(modelRF, newdata = xTestDummy)
RMSE(predictRF, yTest)
[1] 6.612456
R2(predictRF, yTest)
[1] 0.4893805
Final Model Test Details
The final ‘rf’ model returned disappointing results on the test data: it explains only about 49% of the variance, and its RMSE of 6.61 is considerably higher than the cross-validated training RMSE of 4.99. Final model details are given below:
# Finalize Model
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
set.seed(seed)
preProcParams <- preProcess(xTrainDummy, method = c("center", "scale", "YeoJohnson", "spatialSign"))
xTrainTrans <- predict(preProcParams, xTrainDummy)
# Note: mstop is a boosting parameter and is not used by randomForest(); it is ignored here
finalModel <- randomForest(x = xTrainTrans, y = yTrain, mstop = 10)
finalModel
Call:
randomForest(x = xTrainTrans, y = yTrain, mstop = 10)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 31.06351
% Var explained: 46.7
summary(finalModel)
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 37 -none- numeric
mse 500 -none- numeric
rsq 500 -none- numeric
oob.times 37 -none- numeric
importance 5 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 11 -none- list
coefs 0 -none- NULL
y 37 -none- numeric
test 0 -none- NULL
inbag 0 -none- NULL
# Evaluate on test data
xTestNum <- as.data.frame(sapply(xTest, as.numeric))
set.seed(seed)
xTestTrans <- predict(preProcParams, xTestDummy)
finalModelPredict <- predict(finalModel, newdata = xTestTrans, mstop = 10)
# Calculate performance
finalModelRMSE <- RMSE(finalModelPredict, yTest)
finalModelR2 <- R2(finalModelPredict, yTest)
finalModelRMSE
[1] 6.625432
At this point, it seems the model may be overfit. For now, let’s proceed to trying an ensemble of the ‘rf’ random forest and ‘BstLm’ boosted linear models.
Train and Evaluate ‘rf’ + ‘BstLm’ Model Ensemble
The ‘rf’ + ‘BstLm’ ensemble performed comparably to the individual models on the training resamples. On the test data, however, it did slightly better than the standalone random forest: RMSE decreased from 6.61 to 6.47 and R2 increased from 0.49 to 0.51.
library(caretEnsemble)
Attaching package: 'caretEnsemble'
The following object is masked from 'package:ggplot2':
autoplot
# Train a list of models with caretList()
listOfModels <- c('rf', 'BstLm')
set.seed(seed)
modelList <- caretList(x = xTrainDummy, y = yTrain,
trControl = ctrl,
preProc = c("center", "scale", "YeoJohnson", "spatialSign"),
methodList = listOfModels)
trControl$savePredictions not 'all' or 'final'. Setting to 'final' so we can ensemble the models.
indexes not defined in trControl. Attempting to set them ourselves, so each model in the ensemble will have the same resampling indexes.
# Model results and correlation
results <- resamples(modelList)
summary(results)
Call:
summary.resamples(object = results)
Models: rf, BstLm
Number of resamples: 100
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf 0.6956250 1.714279 3.508950 3.75202 5.066671 10.76811 0
BstLm 0.6308284 2.613128 3.969049 4.33590 5.610788 11.57693 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf 0.7554055 2.176715 4.471739 4.980845 7.477950 12.63715 0
BstLm 0.6886622 3.335793 4.804094 5.316516 7.318563 12.37133 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rf 1.398705e-03 0.6367162 0.8773240 0.7373097 0.9651239 0.999880 0
BstLm 8.916097e-05 0.5744188 0.7831661 0.7132427 0.9424384 0.998526 0
dotplot(results)
modelCor(results)
rf BstLm
rf 1.0000000 0.8122617
BstLm 0.8122617 1.0000000
# Combine the models with caretEnsemble()
set.seed(seed)
modelEnsemble <- caretEnsemble(modelList, metric = "RMSE",
trControl = trainControl(number = 2))
summary(modelEnsemble) # ensemble RMSE = 5.691
The following models were ensembled: rf, BstLm
They were weighted:
0.4024 0.5448 0.3882
The resulting RMSE is: 5.691
The fit for each individual model on the RMSE is:
# Evaluate on test data
predictEnsemble <- predict(modelEnsemble, newdata = xTestDummy)
RMSE(predictEnsemble, yTest)
[1] 6.465112
R2(predictEnsemble, yTest)
[1] 0.5127668
Since overall performance is comparable, I’ll go ahead and use the ensemble for the enrollment forecasting.
Data Solution
Final Model Prediction: Forecasted Enrollments
To forecast future student enrollments, I used the following December 2019 spending figures:
- LVN: $1,271.86
- Med Asst: $2,568.00
- Pharmacy: $1,015.64
- Nurse Aide: $939.75
- Med Asst: $1,688.54
Let’s assume the following spending increases by Lone Star Medical Career College:
- Dec + 10%
- Dec + 15%
- Dec + 20%
Based on the different campaign categories and these increases in spending, the third column lists the forecasted student enrollments.
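For illustration, here is a minimal sketch of how such a scenario table could be built directly in R instead of being imported from a CSV; the column names (Campaign, Spending) and the campaign labels are assumptions and would need to match the factor levels used in the training data.
# Minimal sketch: build the spending scenarios in R (column names and labels assumed to match the training data)
decSpending <- data.frame(
  Campaign = factor(c("LVN", "Med Asst", "Pharmacy", "Nurse Aide", "Med Asst")),
  Spending = c(1271.86, 2568.00, 1015.64, 939.75, 1688.54))
increases <- c(1.10, 1.15, 1.20)
scenarioData <- do.call(rbind, lapply(increases, function(pct)
  transform(decSpending, Spending = round(Spending * pct, 2))))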
# Import and clean data
predData <- read.csv("C:/Users/george/Documents/Digital Marketing Data/Triggers/predictedSpendingAndEnrollments.csv")
# Transform forecast data
dmyPredData <- dummyVars(" ~ .", data = predData, fullRank = T)
predDummy <- data.frame(predict(dmyPredData, newdata = predData))
# Predict forecast data
predEnrollments <- data.frame(predict(modelEnsemble, newdata = predDummy))
predExport <- data.frame(c(predData,predEnrollments))
predExport
#write.csv(predExport, "C:/Users/george/Documents/Digital Marketing Data/Triggers/predictedExport.csv", row.names = FALSE)
Business Solution
Overall, the ‘rf’+‘BstLm’ ensemble model indicates that incremental increases in spending should have a positive effect on the number of overall medical student enrollments. However, due to low model performance, these predictions are not very conclusive. Future projects should provide higher quality data and better modeling parameters.
Since this is the first applied data mining project of its kind for both the medical school and their digital marketing company, the transaction channel between company ad spending and student enrollment counts is not yet automated into a database, and it is not clear how the two companies communicate this type of ROI to each other. In other words, we cannot fully account for how many student enrollments are attributable solely to ad-generated leads and spending. Unfortunately, this is a major issue when attempting to forecast future enrollments. Due to this, the model developed appears to be overfit and does not handle new data as well as hoped.
Future developments may include more granular, hourly data sourced from the online platforms themselves and a less overfit model. In addition, a new strategy for tracking ad-generated leads through to enrollment could be developed for much better marketing-funnel business intelligence. This project demonstrates the importance of streamlining business processes for optimized intelligence.