.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/example2_student_admission.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_example2_student_admission.py:


Student Enrollment
==================

In this example, we show how to reproduce the model of student
enrollment from Bergman et al. (2020) with Gurobi Machine Learning.

This model was developed as part of `Janos `__, a toolkit similar to
Gurobi Machine Learning that integrates ML models and mathematical
optimization.

This example illustrates in particular how to use a logistic regression.
It also shows how to handle fixed features in the optimization model
using pandas data frames.

In this model, student admissions data from a college is used to predict
the probability that a student enrolls at the college.

The data has three features: the SAT and GPA scores of each student, and
the scholarship (or merit) offered to each student. It also records
whether each student decided to join the college.

Based on this data, a logistic regression is trained to predict the
probability that a student joins the college.

Using this regression model, Bergman et al. (2020) propose the following
student enrollment problem. The Admission Office has the SAT and GPA
scores of the students admitted to the incoming class, and wants to
offer scholarships so as to maximize the expected number of students who
enroll in the college. A total of :math:`n` students are admitted. The
total budget for all scholarships offered is
:math:`0.2 n \, \text{K\$}` and each student can be offered a
scholarship of at most :math:`2.5 \, \text{K\$}`.

This problem can be expressed as a mathematical optimization problem as
follows. Two vectors of decision variables :math:`x` and :math:`y` of
dimension :math:`n` model, respectively, the scholarship offered to each
student in :math:`\text{K\$}` and the probability that the student
joins. Denoting by :math:`g` the function with which the logistic
regression predicts this probability, we have for each student
:math:`i`:

.. math::

    y_i = g(x_i, SAT_i, GPA_i),

with :math:`SAT_i` and :math:`GPA_i` the (known) SAT and GPA scores of
student :math:`i`. The objective is to maximize the sum of the :math:`y`
variables, and the budget constraint requires that the sum of the
:math:`x` variables is less than or equal to :math:`0.2 n`. Also, each
variable :math:`x_i` is between 0 and 2.5.

The full model then reads:

.. math::

    \begin{aligned}
    &\max \sum_{i=1}^n y_i \\
    &\text{subject to:}\\
    &\sum_{i=1}^n x_i \le 0.2 n,\\
    &y_i = g(x_i, SAT_i, GPA_i) & & i = 1, \ldots, n,\\
    & 0 \le x_i \le 2.5 & & i = 1, \ldots, n.
    \end{aligned}

Note that, unlike Bergman et al. (2020), this example scales the
features for the regression. Also, to fit within Gurobi’s size-limited
license, we only consider the problem where :math:`n=25`.

We also note that the model may differ from the objectives of Admission
Offices, and we don’t encourage its use in real life. The example is for
illustration purposes only.

Importing packages and retrieving the data
------------------------------------------

We import the necessary packages. Besides the usual ones (``numpy``,
``gurobipy``, ``pandas``), we use Scikit-learn’s Pipeline,
StandardScaler and LogisticRegression.

.. GENERATED FROM PYTHON SOURCE LINES 83-98

.. code-block:: Python


    import sys

    import gurobipy as gp
    import gurobipy_pandas as gppd
    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeRegressor

    from gurobi_ml import add_predictor_constr


.. GENERATED FROM PYTHON SOURCE LINES 99-106

We now retrieve the historical data used to build the regression from
the Janos repository.

The features we use for the regression are ``"merit"`` (scholarship),
``"SAT"`` and ``"GPA"``, and the target is ``"enroll"``. We store those
names.

.. GENERATED FROM PYTHON SOURCE LINES 106-119

.. code-block:: Python


    # Base URL for retrieving data
    janos_data_url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
    historical_data = pd.read_csv(
        janos_data_url + "college_student_enroll-s1-1.csv", index_col=0
    )

    # Features used by the regression ("merit" will be a decision variable,
    # "SAT" and "GPA" are fixed) and the target it predicts
    features = ["merit", "SAT", "GPA"]
    target = "enroll"


.. GENERATED FROM PYTHON SOURCE LINES 120-127

Fit the logistic regression
---------------------------

For the regression, we use a pipeline with a standard scaler and a
logistic regression. We build it using ``make_pipeline`` from
``scikit-learn``.

.. GENERATED FROM PYTHON SOURCE LINES 127-135

.. code-block:: Python


    # Run our regression
    scaler = StandardScaler()
    regression = LogisticRegression(random_state=1)
    pipe = make_pipeline(scaler, regression)
    pipe.fit(X=historical_data.loc[:, features], y=historical_data.loc[:, target])


.. raw:: html

Pipeline(steps=[('standardscaler', StandardScaler()),
                    ('logisticregression', LogisticRegression(random_state=1))])
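
The prediction function :math:`g` of such a pipeline can be written
explicitly: the scaler standardizes each feature, and the logistic
regression applies the logistic function to an affine combination of the
scaled features. The following sketch checks this on synthetic data. The
synthetic arrays and the names ``X_demo``, ``y_demo`` and ``demo_pipe``
are assumptions made for illustration; they are not the Janos data or
the ``pipe`` object fitted above.

.. code-block:: Python

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the historical data (assumption): the three
    # columns play the roles of merit, SAT and GPA.
    rng = np.random.default_rng(0)
    X_demo = np.column_stack(
        [
            rng.uniform(0.0, 2.5, size=200),
            rng.uniform(1000.0, 1600.0, size=200),
            rng.uniform(2.0, 4.0, size=200),
        ]
    )
    # Synthetic binary target, loosely driven by the first column
    y_demo = (X_demo[:, 0] + rng.normal(0.0, 1.0, size=200) > 1.25).astype(int)

    demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
    demo_pipe.fit(X_demo, y_demo)

    scaler_step = demo_pipe[0]
    logreg_step = demo_pipe[1]

    # Step 1: standardize each feature with the fitted mean and scale
    z = (X_demo - scaler_step.mean_) / scaler_step.scale_
    # Step 2: logistic function of an affine combination of scaled features
    affine = z @ logreg_step.coef_[0] + logreg_step.intercept_[0]
    manual_proba = 1.0 / (1.0 + np.exp(-affine))

    # Compare with the probability of class 1 returned by the pipeline
    pipeline_proba = demo_pipe.predict_proba(X_demo)[:, 1]
    max_gap = np.max(np.abs(manual_proba - pipeline_proba))
    print(max_gap)

The scaling step is affine, so only the logistic function itself is
nonlinear in the scholarship variable.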


.. GENERATED FROM PYTHON SOURCE LINES 136-144

Optimization Model
~~~~~~~~~~~~~~~~~~

We now turn to building the mathematical optimization model for Gurobi.

First, we retrieve the data for the new students. We don’t use all of
the data; we randomly pick 25 students from it.

.. GENERATED FROM PYTHON SOURCE LINES 144-154

.. code-block:: Python


    # Retrieve new data used to build the optimization problem
    studentsdata = pd.read_csv(janos_data_url + "college_applications6000.csv", index_col=0)

    nstudents = 25

    # Randomly select nstudents rows from the data
    studentsdata = studentsdata.sample(nstudents, random_state=1)


.. GENERATED FROM PYTHON SOURCE LINES 155-161

We can now create our model.

Since our data is in pandas data frames, we use the package
gurobipy-pandas to help create the variables directly using the index of
the data frame.

.. GENERATED FROM PYTHON SOURCE LINES 161-187

.. code-block:: Python


    # Start with classical part of the model
    m = gp.Model()

    # The y variables model the probability of enrollment of each student.
    # They are indexed by the index of studentsdata.
    y = gppd.add_vars(m, studentsdata, name="enroll_probability")

    # We want to complete studentsdata with a column of decision variables to model the "merit" feature.
    # Those variables are between 0 and 2.5.
    # They are added using the gppd extension and the resulting dataframe is stored in
    # students_opt_data.
    students_opt_data = studentsdata.gppd.add_vars(m, lb=0.0, ub=2.5, name="merit")

    # We denote by x the (variable) "merit" feature
    x = students_opt_data.loc[:, "merit"]

    # Make sure that students_opt_data contains only the feature columns, in the right order
    students_opt_data = students_opt_data.loc[:, features]

    m.update()

    # Let's look at our features dataframe for the optimization
    students_opt_data[:10]


.. raw:: html

                              merit   SAT   GPA
StudentID
1484       <gurobi.Var merit[1484]>  1512  3.61
2186       <gurobi.Var merit[2186]>  1148  3.06
2521       <gurobi.Var merit[2521]>  1090  2.76
3722       <gurobi.Var merit[3722]>  1044  2.55
3728       <gurobi.Var merit[3728]>  1424  3.64
4525       <gurobi.Var merit[4525]>  1040  2.44
235         <gurobi.Var merit[235]>  1030  2.61
4736       <gurobi.Var merit[4736]>  1399  3.42
5840       <gurobi.Var merit[5840]>  1090  2.54
2940       <gurobi.Var merit[2940]>  1417  3.69
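
The column selection ``students_opt_data.loc[:, features]`` matters
because the fitted pipeline expects its inputs in the order given by
``features``. Below is a minimal pandas-only sketch of that reordering.
It assumes a small hypothetical frame ``demo_students`` in which the
``merit`` column comes last and plain floats stand in for the Gurobi
decision variables.

.. code-block:: Python

    import pandas as pd

    features = ["merit", "SAT", "GPA"]

    # Hypothetical frame (assumption): two students from the table above,
    # with "merit" as the last column and floats instead of gurobi.Var objects.
    demo_students = pd.DataFrame(
        {"SAT": [1512, 1148], "GPA": [3.61, 3.06], "merit": [0.0, 0.0]},
        index=pd.Index([1484, 2186], name="StudentID"),
    )

    # Keep only the regression features, in the order used for training
    demo_opt_data = demo_students.loc[:, features]
    print(list(demo_opt_data.columns))
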


.. GENERATED FROM PYTHON SOURCE LINES 188-190

We add the objective and the budget constraint:

.. GENERATED FROM PYTHON SOURCE LINES 190-197

.. code-block:: Python


    m.setObjective(y.sum(), gp.GRB.MAXIMIZE)

    m.addConstr(x.sum() <= 0.2 * nstudents)
    m.update()


.. GENERATED FROM PYTHON SOURCE LINES 198-208

Finally, we insert the constraints from the regression. In this model we
want to use the probability estimate of a student joining the college,
so we set the parameter ``output_type`` to ``"probability_1"``. Note
that, given the shapes of the ``students_opt_data`` data frame and
``y``, this inserts one regression constraint for each student.

With the ``print_stats`` function we display what was added to the
model.

.. GENERATED FROM PYTHON SOURCE LINES 208-216

.. code-block:: Python


    pred_constr = add_predictor_constr(
        m, pipe, students_opt_data, y, output_type="probability_1"
    )

    pred_constr.print_stats()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model for pipe:
    150 variables
    100 constraints
    25 general constraints
    Input has shape (25, 3)
    Output has shape (25, 1)

    Pipeline has 2 steps:

    --------------------------------------------------------------------------------
    Step            Output Shape    Variables              Constraints
                                                    Linear    Quadratic      General
    ================================================================================
    std_scaler           (25, 3)          125           75            0            0

    log_reg              (25, 1)           25           25            0           25

    --------------------------------------------------------------------------------


.. GENERATED FROM PYTHON SOURCE LINES 217-227

We can now optimize the problem. With Gurobi ≥ 11.0, the attribute
``FuncNonLinear`` is automatically set to 1 by Gurobi Machine Learning
on the nonlinear constraints it adds, so that the logistic function is
handled algorithmically rather than approximated.

Older versions of Gurobi would make a piecewise-linear approximation of
the logistic function. You can refer to `older versions of this
documentation `__ for dealing with those approximations.

.. GENERATED FROM PYTHON SOURCE LINES 227-231

.. code-block:: Python


    m.optimize()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Gurobi Optimizer version 11.0.3 build v11.0.3rc0 (linux64 - "Ubuntu 20.04.6 LTS")

    CPU model: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz, instruction set [SSE2|AVX|AVX2|AVX512]
    Thread count: 1 physical cores, 2 logical processors, using up to 2 threads

    Optimize a model with 101 rows, 200 columns and 275 nonzeros
    Model fingerprint: 0x95d199bd
    Model has 25 general constraints
    Variable types: 200 continuous, 0 integer (0 binary)
    Coefficient statistics:
      Matrix range     [4e-01, 2e+02]
      Objective range  [1e+00, 1e+00]
      Bounds range     [2e+00, 2e+03]
      RHS range        [7e-01, 1e+03]
    Presolve removed 100 rows and 162 columns
    Presolve time: 0.00s
    Presolved: 132 rows, 39 columns, 291 nonzeros
    Presolved model has 19 nonlinear constraint(s)

    Solving non-convex MINLP

    Variable types: 39 continuous, 0 integer (0 binary)
    Found heuristic solution: objective 13.7903088

    Root relaxation: objective 1.385440e+01, 12 iterations, 0.00 seconds (0.00 work units)

        Nodes    |    Current Node    |     Objective Bounds      |     Work
     Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

         0     0   13.85440    0    7   13.79031   13.85440  0.46%     -    0s
         0     0   13.81174    0    7   13.79031   13.81174  0.16%     -    0s
         0     0   13.80064    0    7   13.79031   13.80064  0.07%     -    0s
         0     0   13.79856    0    7   13.79031   13.79856  0.06%     -    0s
         0     0   13.79785    0    7   13.79031   13.79785  0.05%     -    0s
         0     2   13.79785    0    7   13.79031   13.79785  0.05%     -    0s

    Explored 89 nodes (508 simplex iterations) in 0.04 seconds (0.01 work units)
    Thread count was 2 (of 2 available processors)

    Solution count 1: 13.7903

    Optimal solution found (tolerance 1.00e-04)
    Best objective 1.379030883056e+01, best bound 1.379166027171e+01, gap 0.0098%


.. GENERATED FROM PYTHON SOURCE LINES 232-236

We print the error using :func:`get_error` (note that we take the
maximal error over all input vectors).

.. GENERATED FROM PYTHON SOURCE LINES 236-244

.. code-block:: Python


    print(
        "Maximum error in approximating the regression {:.6}".format(
            np.max(pred_constr.get_error())
        )
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Maximum error in approximating the regression 3.74643e-09


.. GENERATED FROM PYTHON SOURCE LINES 245-248

Finally, note that we can directly get the input values for the
regression in a solution as a pandas dataframe using ``input_values``.

.. GENERATED FROM PYTHON SOURCE LINES 248-252

.. code-block:: Python


    pred_constr.input_values


.. raw:: html

                  merit     SAT   GPA
StudentID
1484       3.597305e-07  1512.0  3.61
2186       7.387056e-07  1148.0  3.06
2521       0.000000e+00  1090.0  2.76
3722       0.000000e+00  1044.0  2.55
3728       3.597305e-07  1424.0  3.64
4525       0.000000e+00  1040.0  2.44
235        0.000000e+00  1030.0  2.61
4736       3.597469e-07  1399.0  3.42
5840       0.000000e+00  1090.0  2.54
2940       3.597305e-07  1417.0  3.69
3054       1.100257e+00  1303.0  3.26
868        0.000000e+00  1062.0  2.72
277        8.959954e-01  1305.0  3.14
5799       7.388164e-07  1187.0  2.92
3513       3.605119e-07  1383.0  3.28
5790       3.597305e-07  1434.0  3.64
3199       3.597306e-07  1429.0  3.54
5909       1.237991e+00  1288.0  3.39
5719       3.597305e-07  1488.0  3.87
2688       3.597305e-07  1480.0  3.80
251        3.597305e-07  1512.0  3.59
5462       1.123099e-01  1218.0  3.02
3053       3.597317e-07  1428.0  3.42
2712       6.878658e-01  1248.0  3.23
3772       9.655745e-01  1299.0  3.20
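
As a quick sanity check on the solution above, the ``merit`` values can
be summed and compared with the budget :math:`0.2 n`. The sketch below
copies the six non-negligible values from the table by hand; the
remaining entries are zero or of order ``1e-07`` and are ignored. The
names ``merit_offers``, ``budget`` and ``total_offered`` are
illustrative and do not appear in the example script.

.. code-block:: Python

    nstudents = 25
    budget = 0.2 * nstudents

    # Non-negligible "merit" values copied from the solution table above,
    # keyed by StudentID. All other entries are at most about 7.4e-07.
    merit_offers = {
        3054: 1.100257,
        277: 0.8959954,
        5909: 1.237991,
        5462: 0.1123099,
        2712: 0.6878658,
        3772: 0.9655745,
    }

    total_offered = sum(merit_offers.values())
    within_cap = all(0.0 <= value <= 2.5 for value in merit_offers.values())

    print(f"offered {total_offered:.6f} K$ of a budget of {budget:.1f} K$")
    print("every offer within the 2.5 K$ cap:", within_cap)

The total is within rounding of the budget, which suggests that the
budget constraint is active in this solution, while no single offer
reaches the per-student cap.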


.. GENERATED FROM PYTHON SOURCE LINES 253-255

Copyright © 2023 Gurobi Optimization, LLC


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.178 seconds)


.. _sphx_glr_download_auto_examples_example2_student_admission.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: example2_student_admission.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: example2_student_admission.py `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_