{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Student Enrollment\n\nIn this example, we show how to reproduce the model of student\nenrollment from Bergman et.al. (2020) with Gurobi Machine Learning.\n\nThis model was developed in the context of the development of\n[Janos](https://github.com/INFORMSJoC/2020.1023)_, a toolkit similar\nto Gurobi Machine Learning to integrate ML models and Mathematical\nOptimization.\n\nThis example illustrates in particular how to use the logistic\nregression.\n\nWe also show how to deal with fixed features in the optimization model\nusing pandas data frames.\n\nIn this model, data of students admissions in a college is used to\npredict the probability that a student enrolls to the college.\n\nThe data has 3 features: the SAT and GPA scores of each student, and the\nscholarship (or merit) that was offered to each student. Finally, it is\nknown if each student decided to join the college or not.\n\nBased on this data a logistic regression is trained to predict the\nprobability that a student joins the college.\n\nUsing this regression model, Bergman et.al. (2020) proposes the\nfollowing student enrollment problem. The Admission Office has data for\nSAT and GPA scores of the admitted students for the incoming class, and\nthey would want to offer scholarships to students with the goal of\nmaximizing the expected number of students that enroll in the college.\nThere is a total of $n$ students that are admitted. The maximal\nbudget for the sum of all scholarships offered is\n$0.2 n \\, \\text{K\\$}$ and each student can be offered a\nscholarship of at most $2.5 \\, \\text{K\\$}$.\n\nThis problem can be expressed as a mathematical optimization problem as\nfollows. Two vectors of decision variables $x$ and $y$ of\ndimension $n$ are used to model respectively the scholarship\noffered to each student in $\\text{K\\$}$ and the probability that\nthey join. 
Denoting by $g$ the prediction function for the\nprobability of the logistic regression we then have for each student\n$i$:\n\n\\begin{align}y_i = g(x_i, SAT_i, GPA_i),\\end{align}\n\nwith $SAT_i$ and $GPA_i$ the (known) SAT and GPA score of\neach student.\n\nThe objective is to maximize the sum of the $y$ variables and the\nbudget constraint imposes that the sum of the variables $x$ is\nless or equal to $0.2n$. Also, each variable $x_i$ is\nbetween 0 and 2.5.\n\nThe full model then reads:\n\n\\begin{align}\\begin{aligned} &\\max \\sum_{i=1}^n y_i \\\\\n &\\text{subject to:}\\\\\n &\\sum_{i=1}^n x_i \\le 0.2*n,\\\\\n &y_i = g(x_i, SAT_i, GPA_i) & & i = 1, \\ldots, n,\\\\\n & 0 \\le x \\le 2.5. \\end{aligned}\\end{align}\n\nNote that in this example differently to Bergman et.al. (2020) we scale\nthe features for the regression. Also, to fit in Gurobi\u2019s limited size\nlicense we only consider the problem where $n=250$.\n\nWe note also that the model may differ from the objectives of Admission\nOffices and don\u2019t encourage its use in real life. The example is for\nillustration purposes only.\n\n## Importing packages and retrieving the data\n\nWe import the necessary packages. Besides the usual (``numpy``,\n``gurobipy``, ``pandas``), for this we will use Scikit-learn\u2019s Pipeline,\nStandardScaler and LogisticRegression.\n"
]
},
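{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before diving in, it may help to recall what the logistic prediction function $g$ computes: a sigmoid $\\sigma(z) = 1/(1+e^{-z})$ of an affine score of the features. The short sketch below is our illustration only (not part of the original example), with made-up coefficients; the real coefficients are learned when the regression is fitted later.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\n\ndef sigmoid(z):\n    \"\"\"Logistic function mapping a real score to a probability in (0, 1).\"\"\"\n    return 1.0 / (1.0 + np.exp(-z))\n\n\n# Made-up coefficients (intercept, merit, SAT, GPA) for illustration only\nbeta = np.array([-1.0, 0.8, 0.3, 0.5])\n# A hypothetical student: leading 1 for the intercept, then scaled features\nstudent = np.array([1.0, 2.5, 0.2, 1.1])\n\n# Probability of enrollment predicted by this toy model\nprint(sigmoid(student @ beta))"
]
},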
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys\n\nimport gurobipy as gp\nimport gurobipy_pandas as gppd\nimport numpy as np\nimport pandas as pd\nfrom matplotlib import pyplot as plt\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.tree import DecisionTreeRegressor\n\nfrom gurobi_ml import add_predictor_constr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now retrieve the historical data used to build the regression from\nJanos repository.\n\nThe features we use for the regression are ``\"merit\"`` (scholarship),\n``\"SAT\"`` and ``\"GPA\"`` and the target is ``\"enroll\"``. We store those\nvalues.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Base URL for retrieving data\njanos_data_url = \"https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/\"\nhistorical_data = pd.read_csv(\n janos_data_url + \"college_student_enroll-s1-1.csv\", index_col=0\n)\n\n# classify our features between the ones that are fixed and the ones that will be\n# part of the optimization problem\nfeatures = [\"merit\", \"SAT\", \"GPA\"]\ntarget = \"enroll\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fit the logistic regression\n\nFor the regression, we use a pipeline with a standard scaler and a\nlogistic regression. We build it using the ``make_pipeline`` from\n``scikit-learn``.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Run our regression\nscaler = StandardScaler()\nregression = LogisticRegression(random_state=1)\npipe = make_pipeline(scaler, regression)\npipe.fit(X=historical_data.loc[:, features], y=historical_data.loc[:, target])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optimization Model\n\nWe now turn to building the mathematical optimization model for Gurobi.\n\nFirst, retrieve the data for the new students. We won\u2019t use all the data\nthere, we randomly pick 250 students from it.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Retrieve new data used to build the optimization problem\nstudentsdata = pd.read_csv(janos_data_url + \"college_applications6000.csv\", index_col=0)\n\nnstudents = 25\n\n# Select randomly nstudents in the data\nstudentsdata = studentsdata.sample(nstudents, random_state=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now create the our model.\n\nSince our data is in pandas data frames, we use the package\ngurobipy-pandas to help create the variables directly using the index of\nthe data frame.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Start with classical part of the model\nm = gp.Model()\n\n# The y variables are modeling the probability of enrollment of each student. They are indexed by students data\ny = gppd.add_vars(m, studentsdata, name=\"enroll_probability\")\n\n\n# We want to complete studentsdata with a column of decision variables to model the \"merit\" feature.\n# Those variable are between 0 and 2.5.\n# They are added using the gppd extension and the resulting dataframe is stored in\n# students_opt_data.\nstudents_opt_data = studentsdata.gppd.add_vars(m, lb=0.0, ub=2.5, name=\"merit\")\n\n# We denote by x the (variable) \"merit\" feature\nx = students_opt_data.loc[:, \"merit\"]\n\n# Make sure that studentsdata contains only the features column and in the right order\nstudents_opt_data = students_opt_data.loc[:, features]\n\nm.update()\n\n# Let's look at our features dataframe for the optimization\nstudents_opt_data[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We add the objective and the budget constraint:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"m.setObjective(y.sum(), gp.GRB.MAXIMIZE)\n\nm.addConstr(x.sum() <= 0.2 * nstudents)\nm.update()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we insert the constraints from the regression. In this model we\nwant to have use the probability estimate of a student joining the\ncollege, so we choose the parameter ``output_type`` to be\n``\"probability_1\"``. Note that due to the shapes of the ``studentsdata``\ndata frame and ``y``, this will insert one regression constraint for\neach student.\n\nWith the ``print_stats`` function we display what was added to the\nmodel.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pred_constr = add_predictor_constr(\n m, pipe, students_opt_data, y, output_type=\"probability_1\"\n)\n\npred_constr.print_stats()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now optimize the problem. With Gurobi \u2265 11.0, the attribute\n``FuncNonLinear`` is automatically set to 1 by Gurobi machine learning\non the nonlinear constraints it adds in order to deal algorithmically\nwith the logistic function.\n\nOlder versions of Gurobi would make a piece-wise linear approximation of\nthe logistic function. You can refer to [older versions of this\ndocumentation](https://gurobi-machinelearning.readthedocs.io/en/v1.3.0/mlm-examples/student_admission.html)_\nfor dealing with those approximations.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"m.optimize()"
]
},
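{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also inspect the solution directly. The sketch below is our addition (it assumes the model solved successfully); it uses the gurobipy-pandas ``gppd`` series accessor, whose ``X`` attribute returns the solution values of a series of variables, to check the budget use and the expected enrollment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Solution values of the scholarship and probability variables\nscholarships = x.gppd.X\nprobabilities = y.gppd.X\n\nprint(\n    \"Scholarship budget used: {:.2f} K$ (limit {:.2f} K$)\".format(\n        scholarships.sum(), 0.2 * nstudents\n    )\n)\nprint(\"Expected number of enrolling students: {:.2f}\".format(probabilities.sum()))"
]
},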
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We print the error using\n:func:`get_error`\n(note that we take the maximal error over all input vectors).\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\n \"Maximum error in approximating the regression {:.6}\".format(\n np.max(pred_constr.get_error())\n )\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, note that we can directly get the input values for the\nregression in a solution as a pandas dataframe using input_values.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pred_constr.input_values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright \u00a9 2023 Gurobi Optimization, LLC\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 0
}