Predicting Undergrad Student Outcomes

Project Overview

This data set contains records for fictitious undergraduate students, including demographic information, academic performance, and outcomes. I used R to analyze and visualize which features are significantly correlated with, and potentially predictive of, two undergraduate outcomes: GPA and staying in school (“Persistence”).

There were many questions I would have asked if I could have spoken to someone who knew the data. For example, did high school GPA always exist on a six-point scale? Or is it possible that one person with a 3.9 (out of 4) was actually a better performer than someone with a 5.25 (out of 6)?

Data Exploration

The file contains almost 41,000 records. A preview:
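A minimal setup sketch, assuming the data arrives as a single CSV (the file name below is a placeholder, not the actual source):

library(tidyverse)
# read the raw file and preview its structure
dat <- read_csv("undergrad_outcomes.csv")
glimpse(dat)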

## Observations: 40,871
## Variables: 22
## $ StudentID              <chr> "0x20963268D1921549B421E95910F2A6A7BACB...
## $ cohort                 <dbl> 2015, 2013, 2013, 2013, 2013, 2013, 201...
## $ Student_level          <chr> "UN", "UN", "UN", "UN", "UN", "UN", "UN...
## $ Gender                 <chr> "F", "F", "F", "M", "F", "F", "F", "M",...
## $ ethnicity              <chr> "White", "White", "White", "White", "Wh...
## $ citizenship            <chr> "Domestic", "Domestic", "Domestic", "Do...
## $ feebased               <chr> "IN-STATE", "IN-STATE", "IN-STATE", "IN...
## $ First_Gen_Adms         <chr> "Y", "N", "N", "N", "N", "N", "N", "N",...
## $ First_Gen_FAFSA        <chr> "Y", "U", "N", "N", "U", "N", "N", "U",...
## $ Entrant_Summer_Fall    <chr> "N", "N", "N", "N", "N", "N", "N", "N",...
## $ Pell_1st_Yr            <chr> "Y", "N", "N", "N", "N", "N", "N", "N",...
## $ STEM_first             <chr> "N", "Y", "Y", "N", "N", "Y", "N", "N",...
## $ begin_trans_credits    <dbl> 0, 0, 7, 16, 0, 34, 12, 11, 7, 0, 0, 8,...
## $ FirstTermGPA           <dbl> 3.0000, 3.2500, 3.2307, 3.7000, 2.8214,...
## $ First_cum_passed_hours <dbl> 14, 10, 20, 31, 14, 47, 26, 21, 20, 14,...
## $ Persist1               <dbl> 100, 100, 100, 100, 100, 100, 0, 100, 1...
## $ HS_GPA                 <dbl> 3.750, 3.660, 3.890, 3.958, 4.089, 3.96...
## $ Atmpt_Hours            <dbl> 14, 10, 13, 15, 14, 13, 14, 15, 13, 14,...
## $ Passed_Hours           <dbl> 14, 10, 13, 15, 14, 13, 14, 10, 13, 14,...
## $ ACT_COMP               <dbl> 24, 23, 28, NA, NA, 33, 24, 30, 30, 22,...
## $ SAT_New                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ SAT_Old                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

I noticed that there were missing values in the SAT columns and decided to check the number of NAs in every column.
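The original counting code isn't shown; one way to produce the table below is a sketch like this:

# count NAs in each column and present them as a one-column data frame
na_counts <- data.frame(na_count = colSums(is.na(dat)))
na_counts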

##                        na_count
## StudentID                     0
## cohort                        0
## Student_level                 0
## Gender                        0
## ethnicity                     0
## citizenship                   0
## feebased                      1
## First_Gen_Adms                0
## First_Gen_FAFSA               0
## Entrant_Summer_Fall           0
## Pell_1st_Yr                   0
## STEM_first                    0
## begin_trans_credits           0
## FirstTermGPA               8935
## First_cum_passed_hours        0
## Persist1                    129
## HS_GPA                     6693
## Atmpt_Hours                 308
## Passed_Hours                308
## ACT_COMP                  16961
## SAT_New                   35862
## SAT_Old                   36117

Because of the high number of missing values in the entrance exam columns, I decided to ignore those columns. There are also many GPA values missing, both high school GPA (an important explanatory feature) and college GPA (one of the outcomes I want to evaluate). I dropped all rows that had any NA, leaving 26,550 of the 40,871 records.
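A sketch of that cleaning step, assuming the tidyverse is loaded as above:

# drop the entrance-exam columns, then drop any rows that still contain an NA
dat <- dat %>%
  select(-ACT_COMP, -SAT_New, -SAT_Old) %>%
  drop_na()
nrow(dat)  # 26,550 rows remain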

I also re-coded some of the columns based on some initial decision tree analysis that I did (not shown). I renamed some columns to be more self-explanatory, changed some column types to factors, and dropped some columns that I did not want to include in my final trees. This included (but was not limited to):

  • Ethnicity: the initial trees generally grouped certain categories together, so I collapsed ethnicity into a single indicator (WhiteAsianIntl). A complete list of the raw ethnicity values is below, followed by a sketch of the recoding.
  • Transfer credits: the decision trees were generally splitting this at 2 or 2.5 hours, so I converted it to an indicator (TferCredits2.5).
  • First semester attempted hours: changed from numeric to a binary indicator (FirstSemHrs12) of whether or not the student attempted what I assumed to be a full load (12 hours).
  • FirstSemPassHrs: derived the number of hours passed in the first semester as the difference between the student’s cumulative passed hours after the first semester and their transfer hours.
## [1] "White"           "BLANK"           "Black"           "Hispanic"       
## [5] "Multiple"        "Asian/Hawaii/PI" "Amer Ind"        "INTERNATIONAL"
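A sketch of the recoding described above, assuming dat2 is the cleaned data frame. The variable names match those used in the models later, but the exact groupings and cut points (especially which ethnicities fall into WhiteAsianIntl) are my assumptions, and derived columns whose construction isn't described (e.g., FirstCollege) are omitted:

dat2 <- dat %>%
  mutate(
    # assumed grouping: White, Asian/Hawaii/PI and INTERNATIONAL vs. everyone else
    WhiteAsianIntl  = factor(if_else(ethnicity %in% c("White", "Asian/Hawaii/PI", "INTERNATIONAL"), "Y", "N")),
    InState         = factor(if_else(feebased == "IN-STATE", "Y", "N")),
    # cut point of 2.5 taken from the tree splits noted above
    TferCredits2.5  = factor(if_else(begin_trans_credits >= 2.5, "Y", "N")),
    # full load assumed to be 12 attempted hours
    FirstSemHrs12   = factor(if_else(Atmpt_Hours >= 12, "Y", "N")),
    # hours passed in the first semester = cumulative passed hours minus transfer credits
    FirstSemPassHrs = First_cum_passed_hours - begin_trans_credits,
    # persistence recoded from 0/100 to 0/1
    Persists        = if_else(Persist1 == 100, 1, 0),
    PellGrant       = factor(Pell_1st_Yr),
    gender          = factor(Gender),
    entry.year      = factor(cohort)
  )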

The plot below shows correlations between the numeric variables:
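The original plotting code isn't shown; a sketch that would produce a similar view (corrplot is my choice of package, and any correlation-plot package would do):

library(corrplot)
# correlation matrix of the numeric columns only
num_vars <- dat2 %>% select(where(is.numeric))
corrplot(cor(num_vars, use = "pairwise.complete.obs"), method = "circle")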

Factor reduction using decision trees

I then constructed decision trees for my two primary outcomes of interest: first semester GPA, and persistence (did the student come back to school for their sophomore year?). Smaller values of the “complexity parameter” (cp) in rpart lead to more complex trees (more splitting). I arrived at the cp value I used for each tree through trial and error. My initial observations:

GPA

  • High school performance (GPA primarily, but also transfer credits) appears to be highly associated with first semester GPA.
  • Demographics also play a role, though a statistical model should be used to control for other factors associated with one’s demographics.

Persistence

  • First year college performance is the most influential factor in a student’s decision to return to school
  • In-state students are more likely to return to school
  • The 2015 cohort had lower persistence, for reasons that are not clear from the data

# GPA tree: which features best explain first-term GPA?
library(rpart)
library(rpart.plot)
tree <- rpart(FirstTermGPA ~ entry.year + gender + WhiteAsianIntl + InState +
                First_Gen_Adms + FirstCollege + Entrant_Summer_Fall + PellGrant +
                STEM_first + TferCredits2.5 + HS_GPA + FirstSemHrs12,
              cp = 0.003, data = dat2)
# plot with rpart.plot
prp(tree, type = 5, branch = 1, fallen.leaves = TRUE, digits = 3, varlen = 0,
    box.palette = "auto", pal.thresh = 3, leaf.round = 1, tweak = 1.2)

## main splits in the GPA tree: HS GPA, Pell grant status, ethnicity, transfer credits

# Persistence tree: did the student return for the sophomore year?
tree <- rpart(Persists ~ gender + WhiteAsianIntl + InState +
                FirstCollege + Entrant_Summer_Fall + PellGrant +
                STEM_first + TferCredits2.5 + HS_GPA + FirstSemHrs12 +
                FirstSemPassHrs + FirstTermGPA,
              cp = 0.001, data = dat2)
prp(tree, type = 5, branch = 1, fallen.leaves = TRUE, digits = 3, varlen = 0,
    box.palette = "auto", pal.thresh = 3, leaf.round = 1, tweak = 1.2)

Statistical modeling of the relationship between the important features and student outcomes

I created a statistical model for each outcome to better estimate both the significance and impact of the performance and demographic features.

GPA
I used Tobit, which is a censored regression. The data is both left-censored (zero is the lowest possible GPA, so two students who both earn a 0.0 are indistinguishable even if one averaged a 50 in their courses and the other a 30) and right-censored (four is the max, but one 4.0 student may have taken tougher courses than another). Basic OLS assumes the outcome is unbounded and uncensored, which is not the case here, so Tobit regression is an appropriate model. I had not done Tobit in R before and used this site as a guide to implementing it with the vglm() function from the VGAM package.
All of the included variables were significant. The demographic variables had smaller impacts on GPA than the performance variables (e.g., males had a first-term GPA about 0.1 points lower than females, while a one-point increase in high school GPA is associated with an increase of about 0.85 points in first-term GPA).

# Tobit regression for first-term GPA, censored below at 0 and above at 4
library(VGAM)
summary(m <- vglm(FirstTermGPA ~ gender + WhiteAsianIntl + InState +
                    FirstCollege + PellGrant + TferCredits2.5 + HS_GPA + FirstSemHrs12,
                  tobit(Lower = 0, Upper = 4), data = dat2))
## 
## Call:
## vglm(formula = FirstTermGPA ~ gender + WhiteAsianIntl + InState + 
##     FirstCollege + PellGrant + TferCredits2.5 + HS_GPA + FirstSemHrs12, 
##     family = tobit(Lower = 0, Upper = 4), data = dat2)
## 
## 
## Pearson residuals:
##             Min      1Q   Median     3Q    Max
## mu       -6.817 -0.6069 -0.02792 0.6062  3.201
## loge(sd) -1.120 -0.7804 -0.56003 0.1815 28.841
## 
## Coefficients: 
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   -0.397691   0.055272  -7.195 6.24e-13 ***
## (Intercept):2   -0.338908   0.005165 -65.617  < 2e-16 ***
## genderM         -0.113078   0.009111 -12.411  < 2e-16 ***
## WhiteAsianIntlY  0.147717   0.012944  11.412  < 2e-16 ***
## InStateY         0.102720   0.012819   8.013 1.12e-15 ***
## FirstCollegeO   -0.266055   0.028263  -9.413  < 2e-16 ***
## FirstCollegeU    0.099352   0.011889   8.357  < 2e-16 ***
## FirstCollegeY   -0.143508   0.014656  -9.792  < 2e-16 ***
## PellGrantY      -0.157731   0.012291 -12.833  < 2e-16 ***
## TferCredits2.5Y  0.173627   0.009749  17.810  < 2e-16 ***
## HS_GPA           0.848412   0.014385  58.980  < 2e-16 ***
## FirstSemHrs12Y   0.340961   0.025924  13.152  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  2 
## 
## Names of linear predictors: mu, loge(sd)
## 
## Log-likelihood: -28332.96 on 53092 degrees of freedom
## 
## Number of iterations: 8 
## 
## No Hauck-Donner effect found in any of the estimates

Persistence
Since persistence is binary, I used logistic regression. The logit results are first displayed as log odds, then converted to odds ratios, and then plotted using the effects package. Each effect plot holds all other variables constant and estimates how the predicted probability of persisting changes across the observed range of a single feature. For binary features, like gender, this is essentially a straight line between the two levels and shows that women are slightly less likely to persist than men. For numeric variables like first term GPA the line can curve, showing how strongly persistence responds to GPA.
Not all variables are statistically significant; gender, in-state status, and the performance variables are the most significant. The plots show that prior performance explains far more of the variation in the probability of returning to school than demographics do. For example, the FirstTermGPA coefficient of about 0.86 corresponds to an odds ratio of exp(0.86) ≈ 2.4, so each additional point of first-term GPA multiplies the odds of persisting by roughly 2.4, holding the other variables fixed.

# logistic regression on the persistence indicator using the selected variables
mylogit <- glm(persists ~ gender + WhiteAsianIntl + InState +
                 FirstCollege + PellGrant +
                 TferCredits2.5 + FirstSemHrs12 + FirstSemPassHrs +
                 FirstTermGPA,
               data = dat2, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = persists ~ gender + WhiteAsianIntl + InState + 
##     FirstCollege + PellGrant + TferCredits2.5 + FirstSemHrs12 + 
##     FirstSemPassHrs + FirstTermGPA, family = "binomial", data = dat2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9756   0.2392   0.2912   0.3735   1.9036  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.11320    0.14676 -14.399  < 2e-16 ***
## genderM          0.20517    0.05069   4.048 5.17e-05 ***
## WhiteAsianIntlY -0.06883    0.06502  -1.059 0.289798    
## InStateY         0.64632    0.05986  10.797  < 2e-16 ***
## FirstCollegeO   -0.20942    0.12362  -1.694 0.090257 .  
## FirstCollegeU    0.25391    0.07225   3.514 0.000441 ***
## FirstCollegeY   -0.17409    0.07239  -2.405 0.016177 *  
## PellGrantY      -0.08643    0.06538  -1.322 0.186216    
## TferCredits2.5Y  0.08767    0.05363   1.635 0.102121    
## FirstSemHrs12Y   0.19817    0.11296   1.754 0.079363 .  
## FirstSemPassHrs  0.10188    0.01116   9.129  < 2e-16 ***
## FirstTermGPA     0.86249    0.03824  22.555  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14813  on 26551  degrees of freedom
## Residual deviance: 12278  on 26540  degrees of freedom
## AIC: 12302
## 
## Number of Fisher Scoring iterations: 6
# convert log-odds coefficients to odds ratios with profile-likelihood confidence intervals
oddsratios <- exp(cbind(OR = coef(mylogit), confint(mylogit)))
## Waiting for profiling to be done...
# effect plots: predicted persistence across each feature, holding the others constant
library(effects)
plot(allEffects(mylogit))

Summary

Using data on roughly 26,500 (fictitious) students, I found that both demographic characteristics and recent performance can help predict student outcomes. Decision trees helped determine which features were most associated with variation in the two outcomes, first semester GPA and persistence. Statistical models demonstrated that recent performance has a larger nominal impact than demographics.
Given these results, my recommendation would be to target primary interventions at students with the worst recent performance (for incoming freshmen, their high school performance; for other students, whatever happened in the last year or semester). Secondary interventions could target out-of-state and international students, who are less likely to return, perhaps for financial, cultural, or social reasons.