Student Enrollment#

In this example, we show how to reproduce the student enrollment model from Bergman et al. (2020) with Gurobi Machine Learning.

This model was developed in the context of Janos, a toolkit similar to Gurobi Machine Learning for integrating ML models and mathematical optimization.

This example illustrates in particular how to use logistic regression.

We also show how to deal with fixed features in the optimization model using pandas data frames.

In this model, data on student admissions at a college is used to predict the probability that a student enrolls at the college.

The data has 3 features: the SAT and GPA scores of each student, and the scholarship (or merit) that was offered to each student. It is also known whether each student decided to join the college or not.

Based on this data, a logistic regression is trained to predict the probability that a student joins the college.

Using this regression model, Bergman et al. (2020) propose the following student enrollment problem. The admission office has the SAT and GPA scores of the admitted students for the incoming class, and wants to offer scholarships with the goal of maximizing the expected number of students that enroll in the college. There is a total of \(n\) admitted students. The maximal budget for the sum of all scholarships offered is \(0.2 n \, \text{K\$}\), and each student can be offered a scholarship of at most \(2.5 \, \text{K\$}\).

This problem can be expressed as a mathematical optimization problem as follows. Two vectors of decision variables \(x\) and \(y\) of dimension \(n\) model, respectively, the scholarship offered to each student (in \(\text{K\$}\)) and the probability that they join. Denoting by \(g\) the prediction function of the logistic regression (the predicted probability of enrollment), we then have for each student \(i\):

\[y_i = g(x_i, SAT_i, GPA_i),\]

with \(SAT_i\) and \(GPA_i\) the (known) SAT and GPA score of each student.
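
Concretely, since \(g\) comes from a logistic regression, it is the sigmoid of an affine function of the (scaled) features. Writing \(z_i\) for the scaled feature vector \((x_i, SAT_i, GPA_i)\) of student \(i\), and \(w\) and \(b\) for the coefficients and intercept learned by the regression, this means

\[g(x_i, SAT_i, GPA_i) = \frac{1}{1 + e^{-(w^\top z_i + b)}}.\]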

The objective is to maximize the sum of the \(y\) variables, and the budget constraint imposes that the sum of the variables \(x\) is less than or equal to \(0.2 n\). Also, each variable \(x_i\) is between 0 and 2.5.

The full model then reads:

\[\begin{split} \begin{aligned} &\max \sum_{i=1}^n y_i \\ &\text{subject to:}\\ &\sum_{i=1}^n x_i \le 0.2 \, n,\\ &y_i = g(x_i, SAT_i, GPA_i), & & i = 1, \ldots, n,\\ & 0 \le x \le 2.5. \end{aligned}\end{split}\]

Note that in this example, differently from Bergman et al. (2020), we scale the features for the regression. Also, to fit within the size limit of Gurobi's restricted license, we only consider the problem where \(n=25\).

Note also that this model may not reflect the actual objectives of admission offices, and we do not encourage its use in real life. The example is for illustration purposes only.

Importing packages and retrieving the data#

We import the necessary packages. Besides the usual ones (numpy, pandas, gurobipy and gurobipy-pandas), we use Scikit-learn's make_pipeline, StandardScaler and LogisticRegression.

import gurobipy as gp
import gurobipy_pandas as gppd
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from gurobi_ml import add_predictor_constr

We now retrieve the historical data used to build the regression from the Janos repository.

The features we use for the regression are "merit" (scholarship), "SAT" and "GPA" and the target is "enroll". We store those values.

# Base URL for retrieving data
janos_data_url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
historical_data = pd.read_csv(
    janos_data_url + "college_student_enroll-s1-1.csv", index_col=0
)

# The features used for the regression and the target column
features = ["merit", "SAT", "GPA"]
target = "enroll"

Fit the logistic regression#

For the regression, we use a pipeline with a standard scaler and a logistic regression. We build it using make_pipeline from scikit-learn.

# Run our regression
scaler = StandardScaler()
regression = LogisticRegression(random_state=1)
pipe = make_pipeline(scaler, regression)
pipe.fit(X=historical_data.loc[:, features], y=historical_data.loc[:, target])
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(random_state=1))])


Optimization Model#

We now turn to building the mathematical optimization model for Gurobi.

First, retrieve the data for the new students. We won't use all of it; we randomly pick 25 students from it.

# Retrieve new data used to build the optimization problem
studentsdata = pd.read_csv(janos_data_url + "college_applications6000.csv", index_col=0)

nstudents = 25

# Select randomly nstudents in the data
studentsdata = studentsdata.sample(nstudents, random_state=1)

We can now create our model.

Since our data is in pandas data frames, we use the package gurobipy-pandas to help create the variables directly using the index of the data frame.

# Start with classical part of the model
m = gp.Model()

# The y variables model the probability of enrollment of each student. They are indexed by the students data frame.
y = gppd.add_vars(m, studentsdata, name="enroll_probability")


# We want to complete studentsdata with a column of decision variables to model the "merit" feature.
# Those variables are between 0 and 2.5.
# They are added using the gppd extension and the resulting dataframe is stored in
# students_opt_data.
students_opt_data = studentsdata.gppd.add_vars(m, lb=0.0, ub=2.5, name="merit")

# We denote by x the (variable) "merit" feature
x = students_opt_data.loc[:, "merit"]

# Make sure that students_opt_data contains only the feature columns, in the right order
students_opt_data = students_opt_data.loc[:, features]

m.update()

# Let's look at our features dataframe for the optimization
students_opt_data[:10]
                              merit   SAT   GPA
StudentID
1484       <gurobi.Var merit[1484]>  1512  3.61
2186       <gurobi.Var merit[2186]>  1148  3.06
2521       <gurobi.Var merit[2521]>  1090  2.76
3722       <gurobi.Var merit[3722]>  1044  2.55
3728       <gurobi.Var merit[3728]>  1424  3.64
4525       <gurobi.Var merit[4525]>  1040  2.44
235         <gurobi.Var merit[235]>  1030  2.61
4736       <gurobi.Var merit[4736]>  1399  3.42
5840       <gurobi.Var merit[5840]>  1090  2.54
2940       <gurobi.Var merit[2940]>  1417  3.69


We add the objective and the budget constraint:

m.setObjective(y.sum(), gp.GRB.MAXIMIZE)

m.addConstr(x.sum() <= 0.2 * nstudents)
m.update()

Finally, we insert the constraints from the regression. In this model, we want to use the probability estimate of a student joining the college, so we set the parameter output_type to "probability_1". Note that due to the shapes of the studentsdata data frame and y, this inserts one regression constraint for each student.

With the print_stats function we display what was added to the model.
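
The corresponding call is sketched below from the description above: add_predictor_constr (imported earlier) receives the model, the fitted pipeline, the input data frame and the output variables, and returns a modeling object that we store in pred_constr (it is reused further down to query the error).

# Insert one regression constraint per student, linking the features in
# students_opt_data to the enrollment probabilities y
pred_constr = add_predictor_constr(
    m, pipe, students_opt_data, y, output_type="probability_1"
)

# Display statistics on what was added to the model
pred_constr.print_stats()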

Model for pipe:
150 variables
100 constraints
25 general constraints
Input has shape (25, 3)
Output has shape (25, 1)

Pipeline has 2 steps:

--------------------------------------------------------------------------------
Step            Output Shape    Variables              Constraints
                                                Linear    Quadratic      General
================================================================================
std_scaler           (25, 3)          125           75            0            0

log_reg              (25, 1)           25           25            0           25

--------------------------------------------------------------------------------

We can now optimize the problem. With Gurobi ≥ 11.0, the attribute FuncNonlinear is automatically set to 1 by Gurobi Machine Learning on the nonlinear constraints it adds, in order to handle the logistic function algorithmically.

Older versions of Gurobi would instead build a piecewise-linear approximation of the logistic function; refer to older versions of this documentation for dealing with those approximations.
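
Not needed here, but as an illustration: with Gurobi ≥ 11.0 the same exact treatment of function constraints can also be requested globally through the FuncNonlinear parameter (the parameter name is assumed from Gurobi 11's interface), instead of relying on the per-constraint attribute that Gurobi Machine Learning sets for us.

# Optional (Gurobi >= 11.0): treat all function constraints as true nonlinear
# constraints instead of piecewise-linear approximations. Gurobi Machine
# Learning already sets the equivalent attribute on the constraints it adds,
# so this line is shown only for illustration.
m.Params.FuncNonlinear = 1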

m.optimize()
Gurobi Optimizer version 11.0.1 build v11.0.1rc0 (linux64 - "Ubuntu 20.04.6 LTS")

CPU model: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz, instruction set [SSE2|AVX|AVX2|AVX512]
Thread count: 1 physical cores, 2 logical processors, using up to 2 threads

Optimize a model with 101 rows, 200 columns and 275 nonzeros
Model fingerprint: 0x95d199bd
Model has 25 general constraints
Variable types: 200 continuous, 0 integer (0 binary)
Coefficient statistics:
  Matrix range     [4e-01, 2e+02]
  Objective range  [1e+00, 1e+00]
  Bounds range     [2e+00, 2e+03]
  RHS range        [7e-01, 1e+03]
Presolve removed 100 rows and 162 columns
Presolve time: 0.00s
Presolved: 132 rows, 39 columns, 291 nonzeros
Presolved model has 19 nonlinear constraint(s)

Solving non-convex MINLP

Variable types: 39 continuous, 0 integer (0 binary)
Found heuristic solution: objective 13.7903088

Root relaxation: objective 1.385440e+01, 12 iterations, 0.00 seconds (0.00 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0   13.85440    0    7   13.79031   13.85440  0.46%     -    0s
     0     0   13.81174    0    7   13.79031   13.81174  0.16%     -    0s
     0     0   13.80064    0    7   13.79031   13.80064  0.07%     -    0s
     0     0   13.79856    0    7   13.79031   13.79856  0.06%     -    0s
     0     0   13.79785    0    7   13.79031   13.79785  0.05%     -    0s
     0     2   13.79785    0    7   13.79031   13.79785  0.05%     -    0s
H   99    37                      13.7903090   13.79349  0.02%   5.3    0s

Explored 145 nodes (752 simplex iterations) in 0.06 seconds (0.02 work units)
Thread count was 2 (of 2 available processors)

Solution count 2: 13.7903 13.7903

Optimal solution found (tolerance 1.00e-04)
Best objective 1.379030897877e+01, best bound 1.379166027170e+01, gap 0.0098%

We print the error using get_error (note that we take the maximal error over all input vectors).

print(
    "Maximum error in approximating the regression {:.6}".format(
        np.max(pred_constr.get_error())
    )
)
Maximum error in approximating the regression 8.50528e-07

Finally, note that we can directly get the input values for the regression in a solution as a pandas dataframe using input_values.
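
A sketch of that call, assuming the input_values property on the pred_constr object created above:

# Solution values of the regression inputs (including the optimized "merit" variables)
pred_constr.input_values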

              merit     SAT   GPA
StudentID
1484       0.000000  1512.0  3.61
2186       0.000000  1148.0  3.06
2521       0.000000  1090.0  2.76
3722       0.000000  1044.0  2.55
3728       0.000000  1424.0  3.64
4525       0.000000  1040.0  2.44
235        0.000000  1030.0  2.61
4736       0.000000  1399.0  3.42
5840       0.000000  1090.0  2.54
2940       0.000000  1417.0  3.69
3054       1.100390  1303.0  3.26
868        0.000000  1062.0  2.72
277        0.894914  1305.0  3.14
5799       0.000000  1187.0  2.92
3513       0.000000  1383.0  3.28
5790       0.000000  1434.0  3.64
3199       0.000000  1429.0  3.54
5909       1.237965  1288.0  3.39
5719       0.000000  1488.0  3.87
2688       0.000000  1480.0  3.80
251        0.000000  1512.0  3.59
5462       0.112486  1218.0  3.02
3053       0.000000  1428.0  3.42
2712       0.688197  1248.0  3.23
3772       0.966048  1299.0  3.20


Copyright © 2023 Gurobi Optimization, LLC

Total running time of the script: (0 minutes 0.554 seconds)

Gallery generated by Sphinx-Gallery