Mixed Integer Formulations

In this page, we give a quick overview of the mixed-integer formulations used to represent the various regression models supported by the package.

Our goal is in particular to highlight the cases where the formulations are not exact and how to deal with potential errors in the solution. This applies in particular to our models for logistic regression and decision trees (also random forest and gradient boosting that are based on decision trees).

Throughout, we denote by \(x\) the input of the regression (i.e. the independent variables) and \(y\) the output of the regression model (i.e. the dependent variables).

Linear Regression

Denoting by \(\beta \in \mathbb R^{p+1}\) the computed weights of linear regression, its model takes the form

\[y = \sum_{i=1}^p \beta_i x_i + \beta_0.\]

Since this is linear, it can be represented directly in Gurobi using linear constraints. Note that the model fits other techniques than ordinary linear regression such as Ridge or Lasso.

Logistic Regression

Denoting by \(f(x) = \frac{1}{1 + e^{-x}}\) the standard logistic function and with the same notations as above, the model for logistic regression reads

\[y = f(\sum_{i=1}^p \beta_i x_i + \beta_0) = \frac{1}{1 + e^{- \sum_{i=1}^p \beta_i x_i - \beta_0}}\]

This model is formulated in Gurobi by using the logistic general function constraint. First an intermediate free variable \(\omega = \sum_{i=1}^p \beta_i x_i + \beta_0\) is created, and then we can express \(y = f(\omega)\) using the general constraint.

With version 11, Gurobi introduced direct algorithmic support of nonlinear functions. We enable it by setting the attribute FuncNonLinear to 1 for the logistic functions created by Gurobi Machine Learning.

Older versions of Gurobi make a piecewise linear approximation of the logistic function. By default, the approximation guarantees a maximal error of \(10^{-2}\). Those parameters can be tuned by setting the pwl_attributes keyword argument when the constraint is added.

Neural Networks

The package currently models dense neural network with ReLU activations. For a given neuron the relation between its inputs and outputs is given by:

\[y = \max(\sum_{i=1}^p \beta_i x_i + \beta_0, 0).\]

The relationship is formulated in the optimization model by using Gurobi \(max\) general constraint with:

\[ \begin{align}\begin{aligned}& \omega = \sum_{i=1}^p \beta_i x_i + \beta_0\\& y = \max(\omega, 0)\end{aligned}\end{align} \]

with \(\omega\) an auxiliary free variable. The neurons are then connected according to the topology of the network.

Decision Tree Regression

In a decision tree, each leaf \(l\) is defined by a set of constraints on the input features of the tree that correspond to the branches taken in the path leading to \(l\). For a node \(v\), we denote by \(i_v\) the feature used for splitting and by \(\theta_v\) the value at which the split is made. At a leaf \(l\) of the tree, we have a set \(\mathcal L_l\) of inequalities of the form \(x_{i_v} \le \theta_v\) corresponding to the left branches leading to \(l\) and a set \(\mathcal R_l\) of inequalities of the form \(x_{i_v} > \theta_v\) corresponding to the right branches.

We formulate decision trees by introducing one binary decision variable \(\delta_l\) for each leaf of the tree (and each input vector).

We introduce the constraint

\[\sum_{l} \delta_l = 1,\]

imposing that exactly one leaf is chosen.

Then for each leaf, the inequalities describing \(\mathcal L_l\) and \(\mathcal R_l\) are imposed using indicator constraints:

\begin{align*} & \delta_l = 1 \rightarrow x_{i_v} \le \theta_v, & & \text{for } x_{i_v} \le \theta_v \in \mathcal L_l,\\ & \delta_l = 1 \rightarrow x_{i_v} \ge \theta_v + \epsilon, & & \text{for } x_{i_v} > \theta_v \in \mathcal R_l. \end{align*}

Two numerical parameters control the accuracy of this formulation, both exposed as keyword arguments of add_decision_tree_regressor_constr.

epsilon (\(\epsilon\), default 0) approximates the strict inequality in \(\mathcal R_l\). For \(\epsilon\) to correctly discriminate left and right branches it must exceed Gurobi’s FeasibilityTol (default \(10^{-6}\)); below that threshold the solver treats \(x_{i_v} \ge \theta_v + \epsilon\) and \(x_{i_v} \ge \theta_v\) as equivalent. Setting \(\epsilon\) above the feasibility tolerance does enforce the correct branch, but creates a gap \([\theta_v,\,\theta_v + \epsilon]\) with no feasible solution, which may make the model infeasible when the inputs are tightly constrained. The default of 0 avoids this, at the cost of ambiguity exactly at a split boundary.

safety_floor (default 0, i.e. disabled) addresses a different issue: when \(|\theta_v|\) itself is smaller than FeasibilityTol, the solver treats 0 and \(\theta_v\) as equal and the indicator constraints become ineffective. Setting safety_floor clamps those thresholds to \(\pm\,\text{safety\_floor}\), which fixes the issue provided safety_floor \(\ge\) FeasibilityTol. Because the clamping shifts decision boundaries it can distort models whose legitimate thresholds are genuinely near zero, so the parameter is opt-in.

Random Forest Regression

The regression model of Random Forests is a linear combination of decision trees. Each decision tree is represented using the model above. The same difficulties with the choice of \(\epsilon\) and safety_floor apply to this case.

We note additionally that the random forests are often very large and generating their representation in Gurobi may take a significant amount of time.

Gradient Boosting Regression

The gradient boosting regressor is a linear combination of decision trees. Each decision tree is represented using the model above. The same difficulties with the choice of \(\epsilon\) and safety_floor apply to this case.

We note additionally that the gradient boosting regressors are often very large and generating their representation in Gurobi may take a significant amount of time.