Student Forecast Model - Kieran Shah's Data Science Examples

This post walks through an example, using simulated data, of how to forecast student enrollment at a four-year college or university with Markov chains. The complete code to reproduce this post can be found here.

Describing the model:

At most North American four-year universities, students progress through a four-year curriculum and then graduate. However, some students leave after their first year, and others after their third year. A discrete Markov chain model is a simple tool for forecasting the number of students a given university will have the following semester or the following year.

The markovchain package in R works well for forecasting students in this scenario. The complete vignette is here. The vignette provides the theoretical background and practical examples.

Data

Below is a table showing an example of the student-level data. There is a student identifier, their current semester, and their following semester. The current semester and the following semester are the key variables that will make up the transition matrix.

Student-level data example

Id	Semester	NumberSemester	ProbOfMovingOn	MovedOn	CurrentSemester	FutureSemester
1	F	1	0.114	1	F1	W1
1	W	1	0.622	1	W1	F2
1	F	2	0.609	1	F2	W2
1	W	2	0.623	1	W2	F3
1	F	3	0.861	1	F3	W3
1	W	3	0.640	1	W3	Leave

To create the transition matrix, we need to calculate the probability that a student moves from one state to another. For instance, if a student is in their first semester, Fall 1 (F1), then we need to calculate the probability of them moving to each possible next state. In our example, a student in F1 either continues to Winter 1 (W1) or leaves the institution.
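The calculation described above can be sketched in a few lines. This is a minimal plain-Python illustration, not the markovchain package's own fitting routine: the function name transition_matrix and the toy data are hypothetical, and the only assumption is that each observation is a (CurrentSemester, FutureSemester) pair like the ones in the student-level table.

```python
from collections import Counter, defaultdict

def transition_matrix(pairs):
    """Estimate transition probabilities from (current, future) pairs.

    Each row is normalized so that its probabilities sum to 1.
    """
    counts = defaultdict(Counter)
    for current, future in pairs:
        counts[current][future] += 1

    matrix = {}
    for current, futures in counts.items():
        total = sum(futures.values())
        matrix[current] = {future: n / total for future, n in futures.items()}
    return matrix

# Hypothetical toy data: 10 students observed in F1; 9 continue to W1, 1 leaves.
pairs = [("F1", "W1")] * 9 + [("F1", "Leave")] + [("W1", "F2")] * 9
P = transition_matrix(pairs)
print(P["F1"])
```

Each row of the result is the observed share of students moving from one state to each following state, which is what each row of the transition matrix in this post represents.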

Transition Matrix

Rows: CurrentSemester; columns: Following Semester
CurrentSemester	F2	F3	F4	Graduated	Leave	Not graduated	W1	W2	W3	W4
F1	0.000	0.000	0.000	0.000	0.098	0.000	0.902	0.000	0.000	0.000
F2	0.000	0.000	0.000	0.000	0.053	0.000	0.000	0.947	0.000	0.000
F3	0.000	0.000	0.000	0.000	0.172	0.000	0.000	0.000	0.828	0.000
F4	0.000	0.000	0.000	0.000	0.099	0.000	0.000	0.000	0.000	0.901
Leave	0.127	0.042	0.197	0.099	0.085	0.056	0.113	0.085	0.099	0.099
W1	0.934	0.000	0.000	0.000	0.066	0.000	0.000	0.000	0.000	0.000
W2	0.000	0.947	0.000	0.000	0.053	0.000	0.000	0.000	0.000	0.000
W3	0.000	0.000	0.917	0.000	0.083	0.000	0.000	0.000	0.000	0.000
W4	0.000	0.000	0.000	0.820	0.000	0.180	0.000	0.000	0.000	0.000

Making a forecast

To make a forecast, we need to specify how many students start in a particular state. For our example, we place 100 students in F1. Given the transition matrix, the number of enrolled students declines from semester to semester until the process is complete.
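As a rough illustration of this step, the sketch below pushes a starting cohort through a transition matrix one semester at a time. It is plain Python rather than the markovchain package, the function name step is hypothetical, and it uses only the F1, W1 and F2 rows of the transition matrix above. It also assumes that Leave is an absorbing state, which differs from the Leave row in the table, so the numbers are not meant to reproduce the final forecast in this post.

```python
def step(counts, P):
    """Advance expected student counts by one semester.

    new[j] = sum over i of counts[i] * P[i][j]
    """
    nxt = {}
    for state, n in counts.items():
        # Assumption: a state with no row in P (here "Leave") is absorbing.
        row = P.get(state, {state: 1.0})
        for future, p in row.items():
            nxt[future] = nxt.get(future, 0.0) + n * p
    return nxt

# Probabilities copied from the F1, W1 and F2 rows of the transition matrix.
P = {
    "F1": {"W1": 0.902, "Leave": 0.098},
    "W1": {"F2": 0.934, "Leave": 0.066},
    "F2": {"W2": 0.947, "Leave": 0.053},
}

counts = {"F1": 100.0}  # 100 students enter at F1
for _ in range(3):
    counts = step(counts, P)
    print({state: round(n, 2) for state, n in counts.items()})
```

The total across all states stays at 100 at every step; only the split between enrolled states and Leave changes.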

The table below shows the final forecast. Although we start with one hundred students, the count decreases over time.

Final Forecast

Semester	Student Count
Fall-1	100.00000
Fall-2	98.22689
Fall-3	94.05890
Fall-4	85.15074
Winter-1	97.08060
Winter-2	97.74008
Winter-3	82.36622
Winter-4	80.17097

Bootstrapping

The first process provides point estimates. It uses the underlying probabilities, based on historical student transitions, to forecast future enrollment. However, it cannot give us a range of outcomes. To get a range of outcomes, we bootstrap the underlying data. The bootstrap is simply sampling with replacement. We resample students, not individual rows of the dataset, so each sampled student brings all of their semester records with them. Some students will be included multiple times, and others will not be included at all.

In this example, we use the bootstrap to generate a range of possible forecasts.

We draw 500 bootstrap samples in this example, which yields 500 transition matrices.
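The student-level resampling can be sketched as follows. This is a hedged plain-Python illustration, not the code behind this post: the names bootstrap_matrices and records, the simulated cohort, and its 90% retention rate are all hypothetical. The point it shows is that student ids are resampled with replacement, and each sampled student contributes all of their rows.

```python
import random
from collections import Counter, defaultdict

def transition_matrix(pairs):
    """Row-normalized transition probabilities from (current, future) pairs."""
    counts = defaultdict(Counter)
    for current, future in pairs:
        counts[current][future] += 1
    return {
        current: {f: n / sum(futures.values()) for f, n in futures.items()}
        for current, futures in counts.items()
    }

def bootstrap_matrices(records, n_boot=500, seed=0):
    """records maps a student id to that student's list of (current, future) pairs."""
    rng = random.Random(seed)
    ids = list(records)
    matrices = []
    for _ in range(n_boot):
        # Resample students (not rows) with replacement.
        sample_ids = rng.choices(ids, k=len(ids))
        pairs = [pair for sid in sample_ids for pair in records[sid]]
        matrices.append(transition_matrix(pairs))
    return matrices

# Hypothetical simulated cohort: roughly 90% of students continue from F1 to W1.
sim = random.Random(1)
records = {}
for sid in range(200):
    if sim.random() < 0.9:
        records[sid] = [("F1", "W1"), ("W1", "F2")]
    else:
        records[sid] = [("F1", "Leave")]

matrices = bootstrap_matrices(records)

# Expected W1 headcount per 100 students entering F1, under each bootstrap matrix.
w1_counts = sorted(100 * m["F1"].get("W1", 0.0) for m in matrices)
low = w1_counts[int(0.025 * len(w1_counts))]
high = w1_counts[int(0.975 * len(w1_counts)) - 1]
print(len(matrices), round(low, 1), round(high, 1))
```

Taking the 2.5th and 97.5th percentiles of the bootstrap forecasts is one common way to turn them into an interval; the post itself shows the full distribution in a plot instead.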

The final plot shows the distribution of forecasts for the Fall-2 and Fall-3 semesters.

Lessons learned:

Overall, the discrete Markov chain method is very useful for forecasting student enrollment. However, it is important to note that any forecast is only as good as the underlying data it relies on. As the demographics of colleges and universities change, historical transition rates may no longer reflect future behavior, and this method may not prove to be the best one going forward.