Predicting and Optimizing Avocado Sales with Python + Amplpy#

Description: In this notebook, we explore a real-world example of demand estimation and supply optimization using a Kaggle dataset on avocado sales. We start by training a machine learning model to estimate demand and then formulate and solve an optimization model in AMPL to maximize revenue while minimizing waste and transportation costs.

Tags: scikit-learn, machine-learning, price-prediction, forecast, gurobi, amplpy, kaggle

Notebook author: Marcos Dominguez Velad <marcos@ampl.com>

Model author: N/A

# Install dependencies
%pip install -q amplpy numpy pandas scikit-learn

# Google Colab & Kaggle integration
from amplpy import AMPL, ampl_notebook

ampl = ampl_notebook(
    modules=["gurobi"],  # modules to install
    license_uuid="default",  # license to use
)  # instantiate AMPL object and register magics

🔍 Step 1: Clean, Load and Prepare the Data#

We begin by loading a dataset from Kaggle, which contains historical data on avocado prices and sales volumes across various U.S. regions. This dataset is widely used for time-series analysis and demand forecasting.

Our focus is on conventional avocados (as opposed to organic) sold in three selected regions:

Albany
Northeast
SouthCentral

To prepare the data for modeling, we:

Clean column names to avoid formatting issues
Filter the dataset to include only conventional avocados in the selected regions
Convert the Date column to datetime format and extract the month
Add a binary peak indicator for peak demand months (November, December, January)
Create a year index (e.g., 2015 → 0, 2016 → 1, etc.)
Encode region as numeric codes for use in modeling

This preprocessed dataset is the foundation for training our demand prediction model and setting up the optimization problem.

import pandas as pd
import numpy as np

df = pd.read_csv(
    "https://raw.githubusercontent.com/ampl/colab.ampl.com/master/authors/marcos-dv/miscellaneous/avocado.csv"
)

# Clean column names
df.columns = df.columns.str.strip()

# Filter: only 'conventional' avocados and selected regions
df = df[df["type"] == "conventional"]
selected_regions = ["Albany", "Northeast", "SouthCentral"]
df = df[df["region"].isin(selected_regions)]

# Extract month from Date
df["Date"] = pd.to_datetime(df["Date"])
df["month"] = df["Date"].dt.month
df["peak"] = df["month"].isin([11, 12, 1]).astype(int)
df["year_index"] = df["year"] - 2015

# Encode region as categorical code
df["region_code"] = df["region"].astype("category").cat.codes

🤖 Step 2: Train a Linear Regression Model#

We use a simple linear regression model from scikit-learn to predict avocado demand (i.e., Total Volume) based on the following features:

AveragePrice of avocados
Year index (how many years since 2015)
Peak season indicator
Region code (encoded as categorical numeric values)

This model helps us estimate demand patterns across time and regions.

from sklearn.linear_model import LinearRegression

# Set input features and target
X = df[["AveragePrice", "year_index", "peak", "region_code"]]
y = df["Total Volume"]  # This is the demand

# Map region codes back to names
region_map = dict(zip(df["region"], df["region_code"]))
region_codes = dict((v, k) for k, v in region_map.items())

# === Step 2: Train regression model ===
model = LinearRegression()
model.fit(X, y)

# === Step 3: Extract coefficients for AMPL ===
intercept = model.intercept_
coef_price, coef_year_index, coef_peak, _ = model.coef_

# Estimate per-region demand adjustment using residuals
df["residual"] = y - model.predict(X)
region_effects = df.groupby("region_code")["residual"].mean().to_dict()

🧮 Step 3: Define and Load the AMPL Model#

We define an optimization model in AMPL using the parameters from our regression step. The model includes:

A demand expression based on the trained ML model
Decision variables for price and quantity per region
Calculations for sales, revenue, waste, and costs
A constraint to ensure total supply is fixed
An objective function to maximize net profit

Once defined, we use the amplpy package to load the model and inject all parameters into the AMPL environment.

%%writefile avocados.mod
reset;
set R;

param cost_waste;
param cost_transport {R};
param price_min;
param price_max;
param quantity_min {R};
param quantity_max {R};
param total_amount_of_supply;
param coefficients_intercept;
param coefficients_region {R};
param coefficients_price;
param coefficients_year_index;
param coefficients_peak;
param data_year;
param data_peak;

var price {r in R} >= price_min, <= price_max;
var quantity {r in R} >= quantity_min[r], <= quantity_max[r];

var demand_expr {r in R} =
    coefficients_intercept +
    coefficients_region[r] +
    coefficients_price * price[r] +
    coefficients_year_index * (data_year - 2015) +
    coefficients_peak * data_peak;

var sales {r in R} = min(demand_expr[r], quantity[r]);
var revenue {r in R} = sales[r] * price[r];
var waste {r in R} = quantity[r] - demand_expr[r];
var costs {r in R} = cost_waste * waste[r] + cost_transport[r] * quantity[r];

maximize obj: sum {r in R} (revenue[r] - costs[r]);

subject to supply: sum {r in R} quantity[r] = total_amount_of_supply;

Overwriting avocados.mod

📉 Step 4: Extract Model Coefficients and Load Data into Ampl#

After training the model, we extract the regression coefficients to quantify the impact of each variable on avocado demand. We also compute region-specific residual averages to capture localized adjustments to the global model.

These coefficients will be used directly in our optimization model.

We construct a set of parameters to be passed to AMPL. These include:

Cost parameters for waste and transport
Price and quantity bounds per region
Estimated demand function coefficients
Region-specific demand adjustments
Total supply capacity for distribution

The goal is to find the optimal price and quantity allocation across regions that maximizes net revenue.

regions = list(region_codes.values())
data_year = 2020
data_peak = 1
total_supply = 500000

parameters = {
    "R": regions,
    "cost_waste": 0.2,
    "cost_transport": {r: 0.1 for r in regions},
    "price_min": 0.5,
    "price_max": 2.5,
    "quantity_min": {r: 10000 for r in regions},
    "quantity_max": {r: 300000 for r in regions},
    "total_amount_of_supply": total_supply,
    "coefficients_intercept": intercept,
    "coefficients_price": coef_price,
    "coefficients_year_index": coef_year_index,
    "coefficients_peak": coef_peak,
    "coefficients_region": {
        region_codes[i]: region_effects.get(i, 0) for i in region_codes
    },
    "data_year": data_year,
    "data_peak": data_peak,
}

ampl.read("avocados.mod")

# Load data into AMPL
ampl.set["R"] = parameters["R"]
ampl.param["cost_waste"] = parameters["cost_waste"]
ampl.param["price_min"] = parameters["price_min"]
ampl.param["price_max"] = parameters["price_max"]
ampl.param["coefficients_intercept"] = parameters["coefficients_intercept"]
ampl.param["coefficients_price"] = parameters["coefficients_price"]
ampl.param["coefficients_year_index"] = parameters["coefficients_year_index"]
ampl.param["coefficients_peak"] = parameters["coefficients_peak"]
ampl.param["data_year"] = parameters["data_year"]
ampl.param["data_peak"] = parameters["data_peak"]
ampl.param["total_amount_of_supply"] = parameters["total_amount_of_supply"]

ampl.param["cost_transport"] = parameters["cost_transport"]
ampl.param["quantity_min"] = parameters["quantity_min"]
ampl.param["quantity_max"] = parameters["quantity_max"]
ampl.param["coefficients_region"] = parameters["coefficients_region"]

# === Step 6: Solve and show results ===
ampl.solve(solver="gurobi")
assert ampl.solve_result == "solved", ampl.solve_result

Gurobi 12.0.1:Gurobi 12.0.1: optimal solution; objective 1712697.331
3 simplex iterations
1 branching node

🚀 Step 6: Solve and Analyze Results#

We use the Gurobi solver to solve the optimization problem. The solution provides us with:

Optimal price for avocados in each region
Optimal quantity of avocados to allocate per region

Finally, we print out the results, offering actionable insights into how pricing and distribution should be managed across the selected markets.

# Output results
print("\nOptimal Prices and Quantities:")
price = ampl.var["price"].to_dict()
quantity = ampl.var["quantity"].to_dict()
for r in parameters["R"]:
    print(f"Region {r}: Price = ${price[r]:.2f}, Quantity = {quantity[r]:.0f}")

Optimal Prices and Quantities:
Region Albany: Price = $0.50, Quantity = 10000
Region Northeast: Price = $2.50, Quantity = 300000
Region SouthCentral: Price = $2.50, Quantity = 190000

✅ Summary#

This notebook combines:

Machine learning for demand forecasting
Mathematical programming for decision optimization

Together, they form a powerful workflow for data-driven supply chain planning.

Want to learn more? 📘 Check out mobook.ampl.com for 50+ optimization case studies using AMPL and Python.