{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-success\">\n",
    "    <b>ARTIFICIAL INTELLIGENCE (E016350A)</b> <br>\n",
    "ALEKSANDRA PIZURICA <br>\n",
    "GHENT UNIVERSITY <br>\n",
    "AY 2024/2025 <br>\n",
    "Assistant: Nicolas Vercheval\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multivariable linear regression: Predict fuel efficiency\n",
    "\n",
    "It is time to see what linear regression can do with inputs that are vectors of multiple attributes.\n",
    "\n",
    "This notebook uses the classic [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) to build a model that predicts the fuel efficiency of automobiles from the late 1970s and early 1980s. To do this, we provide the model with descriptions of many automobiles from that period. Each description includes attributes such as cylinders, displacement, horsepower, and weight.\n",
    "\n",
    "In this notebook, we will predict the \"miles per gallon\" attribute `mpg` by using **linear regression**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import r2_score, mean_squared_error\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "import requests\n",
    "from io import StringIO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our dataset has the following attributes:\n",
    "- mpg\n",
    "- cylinders\n",
    "- displacement\n",
    "- horsepower\n",
    "- weight\n",
    "- acceleration\n",
    "- model-year\n",
    "- origin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download the data using the requests module\n",
    "request = requests.get(\"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data\")\n",
    "\n",
    "# wrap the downloaded text in a file-like object with StringIO\n",
    "if request.status_code == 200:  # downloaded without errors\n",
    "    file_str_io = StringIO(request.text)\n",
    "else:\n",
    "    print(\"Download the file manually and replace file_str_io with its path\")\n",
    "    \n",
    "column_names = ['mpg', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin']\n",
    "raw_dataset = pd.read_csv(file_str_io, names=column_names, na_values = \"?\", comment='\\t', sep=\" \", skipinitialspace=True)\n",
    "\n",
    "df = raw_dataset.copy()\n",
    "\n",
    "# show a small subset of the data to give you a feel for what we're working with.\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove rows with unknown values ('?' was already read as NaN through na_values, so the replace is only a safeguard)\n",
    "df = df.replace('?', np.nan)\n",
    "df = df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# separate the target variable mpg\n",
    "X = df.drop('mpg', axis=1)\n",
    "y = df[['mpg']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `train_test_split` function divides the data set into training and test depending on the passed `test_size` relationship. In our case, 25% of the data will be taken as a test set, and the rest (75%) as the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# divide the data into training and test set\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature normalization:\n",
    "Normalizing the features is often a good idea in machine learning. In multivariable regression, normalization helps us relate the importance of an attribute to the magnitude of the relative weight (but be careful with possible correlation between attributes). Sklearn offers classes that make this operation easy and safe from *data leakage*. We will see talk more about these functionalities in future exercises."
   ]
  },
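  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder of what standardization does (a short sketch based on the documented behaviour of `StandardScaler`; the symbols $x_j$, $\\mu_j$, $\\sigma_j$ and $z_j$ are notation introduced here), each attribute $x_j$ is shifted and rescaled as\n",
    "\n",
    "$$z_j = \\frac{x_j - \\mu_j}{\\sigma_j},$$\n",
    "\n",
    "where $\\mu_j$ and $\\sigma_j$ are the mean and standard deviation of attribute $j$ estimated when the scaler is fitted. After this transformation, every attribute has zero mean and unit variance on the data used for fitting, so the regression weights are expressed on a comparable scale."
   ]
  },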
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise: Perform feature normalization on the training and test set\n",
    "\n",
    "Normalize the features using the `StandardScaler` of scikit-learn. \n",
    "\n",
    "On which dataset(s) do you fit the `StandardScaler`? Why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler = # Your code here...\n",
    "X_train.update(\n",
    "    # Your code here...\n",
    ")\n",
    "X_test.update(\n",
    "    # Your code here...\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating the model"
   ]
  },
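  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the model can be sketched as follows (the symbols $w_0$, $w_j$, $x_j$, $d$ and $n$ are notation introduced here, not taken from the course material). For an input with $d$ attributes, the prediction is\n",
    "\n",
    "$$\\hat{y} = w_0 + \\sum_{j=1}^{d} w_j x_j,$$\n",
    "\n",
    "and `LinearRegression` fits the parameters by ordinary least squares, i.e. it minimizes the residual sum of squares over the $n$ training samples:\n",
    "\n",
    "$$\\min_{w_0, \\dots, w_d} \\sum_{i=1}^{n} \\left(y_i - \\hat{y}_i\\right)^2.$$\n",
    "\n",
    "Once the model is fitted, $w_0$ is available as `intercept_` and $w_1, \\dots, w_d$ as `coef_`."
   ]
  },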
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise: Train the linear regression model\n",
    "\n",
    "Train a `LinearRegression` model on the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reg = # Your code here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Show the coefficients"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Intercept is {:.3f}\\n'.format(reg.intercept_[0]))\n",
    "for idx, col_name in enumerate(X_train.columns):\n",
    "    print('Coefficient for {} equals {:.3f}'.format(col_name, reg.coef_[0][idx]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate the model\n",
    "\n",
    "Now we evaluate the linear model using the R², MSE and RMSE functions"
   ]
  },
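  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three metrics can be written as follows (a reminder under the usual definitions; $y_i$ is the true target, $\\hat{y}_i$ the prediction, $\\bar{y}$ the mean of the true targets and $n$ the number of samples in the evaluated set, all notation introduced here):\n",
    "\n",
    "$$\\mathrm{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2, \\qquad \\mathrm{RMSE} = \\sqrt{\\mathrm{MSE}}, \\qquad R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2}{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}.$$\n",
    "\n",
    "An $R^2$ of 1 means perfect predictions, while 0 corresponds to always predicting $\\bar{y}$. MSE is expressed in squared units of the target, whereas RMSE is in the same units as `mpg`."
   ]
  },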
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# evaluate the R^2 metric.\n",
    "r2_test = reg.score(X_test, y_test)\n",
    "r2_train = reg.score(X_train, y_train)\n",
    "print('R^2 test = {:.3f}'.format(r2_test))\n",
    "print('R^2 train = {:.3f}'.format(r2_train))\n",
    "\n",
    "# alternatively\n",
    "# r2_test = r2_score(y_test, reg.predict(X_test))\n",
    "# r2_train = r2_score(y_train, reg.predict(X_train))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# calculate the mean squared error\n",
    "mse_test = mean_squared_error(y_test, reg.predict(X_test))\n",
    "mse_train = mean_squared_error(y_train, reg.predict(X_train))\n",
    "print('\\nMSE test = {:.3f}'.format(mse_test))\n",
    "print('MSE train = {:.3f}'.format(mse_train))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# calculate the root mean squared error (RMSE) to return to the order of magnitude of the target variable.\n",
    "rmse_test = np.sqrt(mse_test)\n",
    "rmse_train = np.sqrt(mse_train)\n",
    "print('\\nRMSE test = {:.3f}'.format(rmse_test))\n",
    "print('RMSE train = {:.3f}'.format(rmse_train))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}