<div class="alert alert-block alert-success">
    <b>ARTIFICIAL INTELLIGENCE (E016350A)</b> <br>
ALEKSANDRA PIZURICA <br>
GHENT UNIVERSITY <br>
AY 2024/2025 <br>
Assistant: Nicolas Vercheval
</div>

# Multivariable linear regression: Predict fuel efficiency

It is time to see what linear regression can do with inputs that are vectors of multiple attributes.

This notebook uses the classic [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) to build a model that predicts the fuel efficiency of automobiles from the late 1970s and early 1980s. To do this, we provide the model with descriptions of many automobiles from that period. Each description includes attributes such as cylinders, displacement, horsepower, and weight.

In this notebook, we will predict the "miles per gallon" attribute `mpg` by using **linear regression**.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import requests
from io import StringIO

Our dataset has the following attributes:
- mpg
- cylinders
- displacement
- horsepower
- weight
- acceleration
- model year
- origin

In [None]:
# download the data using the requests module
request = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")

# wrap the downloaded text in a file-like object with StringIO
if request.status_code == 200:  # downloaded without errors
    file_str_io = StringIO(request.text)
else:
    print("Download the file manually and replace file_str_io with its path")
    
column_names = ['mpg', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(file_str_io, names=column_names, na_values = "?", comment='\t', sep=" ", skipinitialspace=True)

df = raw_dataset.copy()

# show a small subset of the data to give you a feel for what we're working with.
df.head()

In [None]:
# remove rows with unknown values ('?' was already read as NaN via na_values; the replace is a safeguard)
df = df.replace('?', np.nan)
df = df.dropna()

In [None]:
# separate the target variable mpg
X = df.drop('mpg', axis=1)          
y = df[['mpg']]

The `train_test_split` function divides the dataset into a training set and a test set according to the `test_size` ratio. In our case, 25% of the data is held out as the test set and the remaining 75% forms the training set.

In [None]:
# divide the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
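
To make the effect of `test_size` concrete, here is a minimal sketch on a toy array. The names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical and are not part of the exercise. It only illustrates how the rows are divided and that each input row stays paired with its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (arbitrary values): 8 samples with 2 features each.
# Row i is [2*i, 2*i + 1] and its target is i, so pairs are easy to check.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# With 8 samples and test_size=0.25, 2 rows go to the test set and 6 to the training set.
print(X_tr.shape, X_te.shape)

# The rows are shuffled, but each input row keeps its own target.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.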

### Feature normalization:
Normalizing the features is often a good idea in machine learning. In multivariable regression, normalization lets us relate the importance of an attribute to the magnitude of its weight (but beware of correlations between attributes, which can distort this interpretation). Sklearn offers classes that make this operation easy and safe from *data leakage*. We will talk more about these functionalities in future exercises.
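
As a minimal sketch of what standardization does, the snippet below rescales each column by hand with NumPy, using $z = (x - \mu) / \sigma$. The toy values and the names `X_toy`, `mean`, `std` and `X_scaled` are assumptions made for illustration only. The snippet deliberately does not use the scikit-learn classes, which are left for the exercise.

```python
import numpy as np

# Toy feature matrix (hypothetical values): column 0 is on the scale of a
# vehicle weight, column 1 on the scale of a cylinder count.
X_toy = np.array([
    [3500.0, 8.0],
    [2200.0, 4.0],
    [2900.0, 6.0],
    [1800.0, 4.0],
])

# Per-column statistics. np.std defaults to the population standard
# deviation (ddof=0), the same convention StandardScaler uses.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)

# Standardize: subtract the column mean and divide by the column std.
X_scaled = (X_toy - mean) / std

# Each column now has mean 0 and standard deviation 1 (up to rounding).
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After this rescaling both columns live on the same scale, which is what makes the magnitudes of the learned weights comparable.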

#### Exercise: Perform feature normalization on the training and test set

Normalize the features using the `StandardScaler` of scikit-learn. 

On which dataset(s) do you fit the `StandardScaler`? Why?

In [None]:
scaler = # Your code here...
X_train.update(
    # Your code here...
)
X_test.update(
    # Your code here...
)

### Creating the model

#### Exercise: Train the linear regression model

Train a `LinearRegression` model on the training data.

In [None]:
reg = # Your code here...

### Show the coefficients

In [None]:
print('Intercept is {:.3f}\n'.format(reg.intercept_[0]))
for idx, col_name in enumerate(X_train.columns):
    print('Coefficient for {} equals {:.3f}'.format(col_name, reg.coef_[0][idx]))

### Evaluate the model

Now we evaluate the linear model using the R², MSE, and RMSE metrics.
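
For reference, the three metrics can be written out by hand. The sketch below uses made-up targets and predictions (`y_true` and `y_pred` are hypothetical, not taken from the dataset) and follows the usual definitions: MSE is the mean of the squared residuals, RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred  # [0.5, -0.5, 0.0, 1.0]

# Mean squared error: (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
mse = np.mean(residuals ** 2)

# Root mean squared error: same units as the target variable.
rmse = np.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)                  # 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 1.5 / 20 = 0.925

print('MSE = {:.3f}'.format(mse))
print('RMSE = {:.3f}'.format(rmse))
print('R^2 = {:.3f}'.format(r2))
```

An R² of 1 means perfect predictions, while 0 corresponds to always predicting the mean of the targets.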

In [None]:
# evaluate the R^2 metric.
r2_test = reg.score(X_test, y_test)
r2_train = reg.score(X_train, y_train)
print('R^2 test = {:.3f}'.format(r2_test))
print('R^2 train = {:.3f}'.format(r2_train))

# alternatively
# r2_test = r2_score(y_test, reg.predict(X_test))
# r2_train = r2_score(y_train, reg.predict(X_train))

In [None]:
# calculate the mean squared error
mse_test = mean_squared_error(y_test, reg.predict(X_test))
mse_train = mean_squared_error(y_train, reg.predict(X_train))
print('\nMSE test = {:.3f}'.format(mse_test))
print('MSE train = {:.3f}'.format(mse_train))

In [None]:
# calculate the root mean squared error (RMSE) to return to the order of magnitude of the target variable.
rmse_test = np.sqrt(mse_test)
rmse_train = np.sqrt(mse_train)
print('\nRMSE test = {:.3f}'.format(rmse_test))
print('RMSE train = {:.3f}'.format(rmse_train))