{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "LRHTRZG42RbR" }, "source": [ "```{index} single: application; regression\n", "```\n", "\n", "# LAD Regression\n", "\n", "Linear regression is a supervised machine learning technique that produces a linear model predicting values of a dependent variable from known values of one or more independent variables. Linear regression has a long history dating back to at least the 19th century and is a mainstay of modern data analysis. \n", "\n", "This notebook demonstrates a technique for linear regression based on LP that use the Least Absolute Deviation (LAD) as the metric to quantify the goodness of the model prediction. The sum of absolute values of errors is the $L_1$ norm which is known to have favorable robustness characteristics in practical use. We follow closely this [paper](https://www.jstor.org/stable/1402501)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# install dependencies and select solver\n", "%pip install -q amplpy numpy matplotlib scikit-learn ipywidgets\n", "\n", "SOLVER = \"highs\"\n", "\n", "from amplpy import AMPL, ampl_notebook\n", "\n", "ampl = ampl_notebook(\n", " modules=[\"highs\"], # modules to install\n", " license_uuid=\"default\", # license to use\n", ") # instantiate AMPL object and register magics" ] }, { "cell_type": "markdown", "metadata": { "id": "VKxb7hUB2pDa" }, "source": [ "## Generate data\n", "\n", "The Python [scikit learn](https://scikit-learn.org/stable/) library for machine learning provides a full-featured collection of tools for regression. The following cell uses `make_regression` from scikit learn to generate a synthetic data set for use in subsequent cells. The data consists of a numpy array `y` containing `n_samples` of one dependent variable $y$, and an array `X` containing `n_samples` observations of `n_features` independent explanatory variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 265 }, "id": "u58KqWC5M_FR", "outputId": "1ba3863d-09e4-4631-a945-75042f26bf88" }, "outputs": [], "source": [ "from sklearn.datasets import make_regression\n", "import numpy as np\n", "\n", "n_features = 1\n", "n_samples = 1000\n", "noise = 30\n", "\n", "# generate regression dataset\n", "np.random.seed(2020)\n", "X, y = make_regression(n_samples=n_samples, n_features=n_features, noise=noise)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Visualization\n", "\n", "Before going further, it is generally useful to prepare an initial visualization of the data. The following cell presents a scatter plot of $y$ versus $x$ for the special case of one explanatory variable, and a histogram of the difference between $y$ and the mean value $\\bar{y}$. This histogram will provide a reference against which to compare the residual error in $y$ after regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "if n_features == 1:\n", " plt.scatter(X, y, alpha=0.3)\n", " plt.xlabel(\"X\")\n", " plt.ylabel(\"y\")\n", " plt.grid(True)\n", "\n", "plt.figure()\n", "plt.hist(y - np.mean(y), bins=int(np.sqrt(len(y))))\n", "plt.title(\"histogram y - mean(y)\")\n", "plt.ylabel(\"counts\")" ] }, { "cell_type": "markdown", "metadata": { "id": "MiHG-Csw2vbM" }, "source": [ "## Model\n", "\n", "Suppose we have a finite dataset consisting of $n$ points $\\{({X}^{(i)}, y^{(i)})\\}_{i=1,\\dots,n}$ with ${X}^{(i)} \\in \\mathbb{R}^k$ and $y^{(i)} \\in \\mathbb{R}$. A linear regression model assumes the relationship between the vector of $k$ regressors ${X}$ and the dependent variable $y$ is linear. This relationship is modeled through an error or deviation term $e_i$, which quantifies how much each of the data points diverge from the model prediction and is defined as follows:\n", "\n", "$$\n", "\\begin{equation}\\label{eq:regression}\n", " e_i:= y^{(i)} - {m}^\\top {X}^{(i)} - b = y^{(i)} - \\sum_{j=1}^k X^{(i)}_j m_j - b,\n", "\\end{equation}\n", "$$\n", "\n", "for some real numbers $m_1,\\dots,m_k$ and $b$.\n", "\n", "The Least Absolute Deviation (LAD) is a possible statistical optimality criterion for such a linear regression. Similar to the well-known least-squares technique, it attempts to find a vector of linear coefficients ${m}=(m_1,\\dots,m_k)$ and intercept $b$ so that the model closely approximates the given set of data. The method minimizes the sum of absolute errors, that is $\\sum_{i=1}^n \\left |e_i \\right|$.\n", "\n", "The LAD regression is formulated as an optimization problem with the intercept $b$, the coefficients $m_i$'s, and the errors $e_i$'s as decision variables, namely\n", "\n", "$$\n", "\\begin{align}\n", " \\min \\quad & \\sum_{i=1}^n |e_i| \\\\\n", " \\text{s.t.} \\quad & e_i = y^{(i)} - {m}^\\top {X}^{(i)} - b & \\forall\\, i=1,\\dots,n.\n", "\\end{align}\n", "$$\n", "\n", "In general, the appearance of an absolute value term indicates the problem is nonlinear and, worse, that the objective function is not differentiable when any $e_i = 0$. However, for this case where the objective is to minimize a sum of absolute errors, one can reformulate the decision variables to transform this into a linear problem. More specifically, introducing for every term $e_i$ two new variables $e_i^-, e_i^+ \\geq 0$, we can rewrite the model as\n", "\n", "$$\n", "\\begin{align}\n", " \\min \\quad & \\sum_{i=1}^n ( e_i^+ + e_i^-) \\\\\n", " \\text{s.t.} \\quad & e_i^+ - e_i^- = y^{(i)} - {m}^\\top {X}^{(i)}-b & \\forall\\, i=1, \\dots, n \\\\\n", " & e_i^+, e_i^- \\geq 0 & \\forall\\, i=1, \\dots, n\n", "\\end{align}\n", "$$\n", "\n", "The following cell provides a direct implementation of LAD regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile lad_regression.mod\n", "\n", "# indexing sets\n", "set I;\n", "set J;\n", "\n", "# parameters\n", "param y{I};\n", "param X{I, J};\n", "\n", "# variables\n", "var ep{I} >= 0;\n", "var em{I} >= 0;\n", "var m{J} >= 0;\n", "var b;\n", "\n", "# constraints\n", "s.t. residuals {i in I}:\n", " ep[i] - em[i] == y[i] - sum{j in J}(X[i, j] * m[j]) - b;\n", "\n", "# objective\n", "minimize sum_of_abs_errors: sum{i in I}(ep[i] + em[i]);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SKIqjt5CPSJf" }, "outputs": [], "source": [ "def lad_regression(X, y):\n", " ampl = AMPL()\n", " ampl.read(\"lad_regression.mod\")\n", "\n", " n, k = X.shape\n", "\n", " # note use of Python style zero based indexing\n", " ampl.set[\"I\"] = range(n)\n", " ampl.set[\"J\"] = range(k)\n", "\n", " ampl.param[\"y\"] = y\n", " ampl.param[\"X\"] = X\n", "\n", " ampl.option[\"solver\"] = SOLVER\n", " ampl.solve()\n", "\n", " return ampl\n", "\n", "\n", "m = lad_regression(X, y)\n", "m.display(\"m\")\n", "m.display(\"b\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing the Results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_m = list(m.var[\"m\"].to_dict().values())\n", "m_b = m.var[\"b\"].value()\n", "\n", "y_fit = np.array([sum(x[j] * v for j, v in enumerate(m_m)) + m_b for x in X])\n", "\n", "if n_features == 1:\n", " plt.scatter(X, y, alpha=0.3, label=\"data\")\n", " plt.plot(X, y_fit, \"r\", label=\"y_fit\")\n", " plt.xlabel(\"X\")\n", " plt.ylabel(\"y\")\n", " plt.grid(True)\n", " plt.legend()\n", "\n", "plt.figure()\n", "plt.hist(y - np.mean(y), bins=int(np.sqrt(len(y))), alpha=0.5, label=\"y - mean(y)\")\n", "plt.hist(y - y_fit, bins=int(np.sqrt(len(y))), color=\"r\", alpha=0.8, label=\"y - y_fit\")\n", "plt.title(\"histogram of residuals\")\n", "plt.ylabel(\"counts\")\n", "plt.legend()" ] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "fit absolute", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }