This post walks through an example, using simulated data, of how to forecast student enrollment at a four-year college or university with Markov chains. The complete code to reproduce this post can be found here.
Describing the model:
In most North American four-year universities, students progress through a four-year curriculum and then graduate. However, some students leave after their first year, others after their third. A discrete Markov chain model is a simple tool for forecasting how many students a given university will have the following semester or the following year.
The markovchain package in R works well for forecasting students in this scenario. The complete vignette is here. The vignette provides the theoretical background and practical examples.
Data
Below is a table showing an example of the student-level data. There is a student identifier, their current semester, and their following semester. The current semester and the following semester are the key variables that will make up the transition matrix.
Student-level data example

Id | Semester | NumberSemester | ProbOfMovingOn | MovedOn | CurrentSemester | FutureSemester |
---|---|---|---|---|---|---|
1 | F | 1 | 0.114 | 1 | F1 | W1 |
1 | W | 1 | 0.622 | 1 | W1 | F2 |
1 | F | 2 | 0.609 | 1 | F2 | W2 |
1 | W | 2 | 0.623 | 1 | W2 | F3 |
1 | F | 3 | 0.861 | 1 | F3 | W3 |
1 | W | 3 | 0.640 | 1 | W3 | Leave |
To create the transition matrix, we need to calculate the probability that a student moves from one state to another. For instance, if a student enters in their first semester, Fall 1 (F1), then we need to calculate the probability of them moving to each of the other states. In our example, from F1 a student can either move to Winter 1 (W1) or leave the institution.
Transition matrix (rows: current semester; columns: following semester)

CurrentSemester | F2 | F3 | F4 | Graduated | Leave | Not graduated | W1 | W2 | W3 | W4 |
---|---|---|---|---|---|---|---|---|---|---|
F1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.098 | 0.000 | 0.902 | 0.000 | 0.000 | 0.000 |
F2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.000 | 0.000 | 0.947 | 0.000 | 0.000 |
F3 | 0.000 | 0.000 | 0.000 | 0.000 | 0.172 | 0.000 | 0.000 | 0.000 | 0.828 | 0.000 |
F4 | 0.000 | 0.000 | 0.000 | 0.000 | 0.099 | 0.000 | 0.000 | 0.000 | 0.000 | 0.901 |
Leave | 0.127 | 0.042 | 0.197 | 0.099 | 0.085 | 0.056 | 0.113 | 0.085 | 0.099 | 0.099 |
W1 | 0.934 | 0.000 | 0.000 | 0.000 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
W2 | 0.000 | 0.947 | 0.000 | 0.000 | 0.053 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
W3 | 0.000 | 0.000 | 0.917 | 0.000 | 0.083 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
W4 | 0.000 | 0.000 | 0.000 | 0.820 | 0.000 | 0.180 | 0.000 | 0.000 | 0.000 | 0.000 |
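One way to picture how such a matrix comes out of the student-level data: count every observed (CurrentSemester, FutureSemester) pair, then divide each count by its row total. The sketch below is only an illustration in plain Python, not the R code behind this post; the function name `estimate_transition_matrix` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transition_matrix(pairs):
    """Row-normalise observed (current, future) counts into probabilities."""
    counts = defaultdict(Counter)
    for current, future in pairs:
        counts[current][future] += 1

    matrix = {}
    for current, futures in counts.items():
        total = sum(futures.values())
        # Each row sums to 1: the share of students in `current`
        # who were next observed in each `future` state.
        matrix[current] = {future: n / total for future, n in futures.items()}
    return matrix


# Hypothetical toy records, NOT the simulated data behind the table above.
pairs = [
    ("F1", "W1"), ("F1", "W1"), ("F1", "W1"), ("F1", "Leave"),
    ("W1", "F2"), ("W1", "F2"), ("W1", "Leave"),
]

P = estimate_transition_matrix(pairs)
print(P["F1"])  # {'W1': 0.75, 'Leave': 0.25}
```

Only transitions that were actually observed get an entry; the zeros in the table above correspond to pairs that never occur in the data.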
Making a forecast
To make a forecast, we need to input a number of students in a particular state. For our example, we enter 100 students at F1. Given the transition matrix, the number of students will continue to decline until the process is complete.
The table below shows the final outcome.
Although we start with one hundred students, the count decreases over time.
Final forecast

Semester | Student Count |
---|---|
Fall-1 | 100.00000 |
Fall-2 | 98.22689 |
Fall-3 | 94.05890 |
Fall-4 | 85.15074 |
Winter-1 | 97.08060 |
Winter-2 | 97.74008 |
Winter-3 | 82.36622 |
Winter-4 | 80.17097 |
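The mechanics of the forecast can be sketched as repeatedly multiplying the current head counts by the transition matrix. The plain-Python illustration below borrows the F1, W1 and F2 rows from the transition matrix shown earlier and, as a simplifying assumption of this sketch, treats any state without a row (such as Leave) as absorbing. The function name `step` is hypothetical, and the printed numbers are not expected to reproduce the forecast table, which comes from the post's own run.

```python
def step(counts, P):
    """Advance expected head counts by one semester."""
    new_counts = {}
    for state, n in counts.items():
        # Assumption of this sketch: a state with no row in P
        # (for example "Leave") keeps its students forever.
        row = P.get(state, {state: 1.0})
        for nxt, p in row.items():
            new_counts[nxt] = new_counts.get(nxt, 0.0) + n * p
    return new_counts


# First three rows of the transition matrix shown earlier in the post.
P = {
    "F1": {"W1": 0.902, "Leave": 0.098},
    "W1": {"F2": 0.934, "Leave": 0.066},
    "F2": {"W2": 0.947, "Leave": 0.053},
}

counts = {"F1": 100.0}  # 100 students entering at Fall 1
history = []
for _ in range(3):
    counts = step(counts, P)
    history.append(counts)

for semester, snapshot in enumerate(history, start=1):
    enrolled = {s: round(n, 1) for s, n in snapshot.items() if s != "Leave"}
    print(semester, enrolled)
```

Because every row of the matrix sums to one, the total across all states (including Leave) stays at 100; only the enrolled share shrinks.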
Bootstrapping
The first process provides point estimates. It uses the underlying probabilities, based on historical student transitions, to forecast future enrollment. However, it cannot give us a range of outcomes. To get one, we bootstrap the underlying data. The bootstrap is simply a sample with replacement. We resample students, not individual rows of the dataset, so each sampled student brings all of their semester records along. Some students will be included multiple times, and others will not be included at all.
In this example, we use the bootstrap to calculate a number of possibilities for a forecast.
We use 500 bootstrap samples in this example, which gives us 500 transition matrices.
The final plot shows the distribution for the fall-2 and fall-3 semesters.
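The student-level resampling described in this section can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the post's R code: the function names, the toy `records` dictionary, and the 95% percentile interval at the end are all hypothetical choices made for the example.

```python
import random
from collections import Counter, defaultdict


def estimate_transition_matrix(pairs):
    """Row-normalise observed (current, future) counts into probabilities."""
    counts = defaultdict(Counter)
    for current, future in pairs:
        counts[current][future] += 1
    return {
        current: {f: n / sum(futures.values()) for f, n in futures.items()}
        for current, futures in counts.items()
    }


def bootstrap_matrices(records, n_boot, seed=0):
    """records maps a student id to that student's list of (current, future) pairs."""
    rng = random.Random(seed)
    ids = list(records)
    matrices = []
    for _ in range(n_boot):
        # Resample whole students with replacement, so each draw
        # carries all of that student's semester records.
        sampled_ids = rng.choices(ids, k=len(ids))
        pairs = [pair for sid in sampled_ids for pair in records[sid]]
        matrices.append(estimate_transition_matrix(pairs))
    return matrices


# Hypothetical toy data: four students and their observed transitions.
records = {
    1: [("F1", "W1"), ("W1", "F2"), ("F2", "Leave")],
    2: [("F1", "W1"), ("W1", "Leave")],
    3: [("F1", "Leave")],
    4: [("F1", "W1"), ("W1", "F2"), ("F2", "W2")],
}

matrices = bootstrap_matrices(records, n_boot=500)

# Spread of the estimated F1 -> W1 probability across the 500 matrices.
probs = sorted(m["F1"].get("W1", 0.0) for m in matrices)
lower = probs[int(0.025 * len(probs))]
upper = probs[int(0.975 * len(probs))]
print(f"F1 -> W1: roughly {lower:.2f} to {upper:.2f}")
```

Feeding each of the 500 matrices through the forecast step would then give 500 forecasts per semester, which is the kind of distribution the plot summarises.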
Lessons learned:
Overall, the discrete Markov chain method is very useful for forecasting student enrollment. However, any forecast is only as good as the underlying data it relies on. As the demographics of colleges and universities change, historical transition rates may stop reflecting current students, and this method may not remain the best one going forward.