# Project 2: Classification

This project asks you to perform various experiments with classification. The dataset we are using is a toy dataset for credit card fraud detection:

https://www.kaggle.com/datasets/shubhamjoshi2130of/abstract-data-set-for-credit-card-fraud-detection

Write your code and discussion in the code and text cells of this notebook.

A block that starts with TODO: marks a place where you need to write something.

Some code has been written for you to guide the project. Don't change the already written code.

## Grading
The points add up to 40: 30 regular points plus 10 bonus points. Regular and bonus points count the same, but I recommend that you solve the problems labeled "BONUS" after you have finished the other ones.


In [None]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk

## Setup for the project

Here we load the dataset, and create the training and test datasets as numpy arrays.
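
As a minimal sketch of the shuffle-and-split step, the cell below uses a small made-up table (`toy_df`, with hypothetical columns `amount` and `label`) rather than the real dataset. The `random_state` argument is an addition that is not part of the project code; it only makes the shuffle reproducible between runs.

```python
import numpy as np
import pandas as pd

# Made-up toy table standing in for the real dataset (illustration only)
toy_df = pd.DataFrame({
    "amount": [10.0, 250.0, 35.0, 900.0, 42.0, 15.0],
    "label": [0, 1, 0, 1, 0, 0],
})

# random_state fixes the shuffle so the same split is produced on every run
toy_shuffled = toy_df.sample(frac=1, random_state=0)
toy_x = toy_shuffled[["amount"]].to_numpy(dtype=np.float64)
toy_y = toy_shuffled["label"].to_numpy(dtype=np.float64)

# The first n_train shuffled rows are for training, the rest for testing
n_train = 4
toy_train_x, toy_train_y = toy_x[:n_train], toy_y[:n_train]
toy_test_x, toy_test_y = toy_x[n_train:], toy_y[n_train:]
print(toy_train_x.shape, toy_test_x.shape)
```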

In [None]:
df = pd.read_csv("creditcard.csv",  true_values=["Y"], false_values=["N"])
pd.show_versions()
print(f"Number of rows {len(df.index)}")
print(f"The columns of the database {df.columns}")
df.value_counts("isFradulent")

In [None]:
xfields = [
    'Average Amount/transaction/day',
       'Transaction_amount', 'Is declined', 'Total Number of declines/day',
       'isForeignTransaction', 'isHighRiskCountry', 'Daily_chargeback_avg_amt',
       '6_month_avg_chbk_amt', '6-month_chbk_freq']

df_shuffled = df.sample(frac=1) # shuffle the rows
x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled["isFradulent"].to_numpy(dtype=np.float64)
# the training data is the first 2000 rows after shuffling
training_data_x = x[:2000]
training_data_y = y[:2000]
# the test data is the remaining rows
test_data_x = x[2000:]
test_data_y = y[2000:]

In [None]:
print("Run this to help you with what number goes with what field:")
# use a separate loop variable so that the data array x is not overwritten
for i, field in enumerate(xfields):
    print(f"{i} = {field}")

## P1: Create an accuracy metric (7 pts)
Create a simple accuracy metric function which, for a pair of ground truth values $y$ and estimates $\hat{y}$ (both of them arrays), calculates the accuracy of the estimate $\hat{y}$, that is, the fraction of positions where the two arrays agree. For instance, if you pass y = [1, 0, 1] and
yhat = [1, 1, 0], the accuracy function should return 0.33... (one match out of three).
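
Written as a formula, one common definition of accuracy for arrays of length $n$ is the following (the indicator notation $\mathbb{1}[\cdot]$ is introduced here for illustration; it is 1 when the condition holds and 0 otherwise):

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i = \hat{y}_i]$$

For y = [1, 0, 1] and yhat = [1, 1, 0], only the first position matches, so the sum is 1 and the accuracy is $1/3 \approx 0.33$.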

In [None]:
def accuracy(y, yhat):
    ## implement here
    return 0.0


In [None]:
# test your function here
acc = accuracy([1, 0, 1], [1, 1, 0])
print(f"Accuracy is {acc}") # should print 0.33...

## P2: Implement a majority classifier (7 pts)
This classifier always returns the most likely value, regardless of what input you pass to it. Training the classifier means determining which value is the most likely. For instance, if more than half of the transactions are fraudulent, then you always return fraudulent.
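
One possible way to find the most frequent label is to count label occurrences with `np.unique`. The sketch below runs on a made-up label array (`toy_y`), not on the project data, and is only one of several ways to do the counting.

```python
import numpy as np

# Made-up toy labels (illustration only): 1 = fraudulent, 0 = legitimate
toy_y = np.array([0, 0, 1, 0, 1, 0, 0], dtype=np.float64)

# np.unique returns the distinct labels and how often each one occurs
labels, counts = np.unique(toy_y, return_counts=True)
majority_label = labels[np.argmax(counts)]

# Share of rows that carry the majority label
majority_share = counts.max() / counts.sum()
print(majority_label, majority_share)
```

The share of the majority label is also the accuracy that an always-majority prediction would reach on the same array.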

In [None]:
def classify_majority(x, theta):
    # whatever the value of x, we return the theta
    return theta

# TODO: implement the train majority function
def train_majority(training_x, training_y):
    # this function will have to determine which is more likely to 
    # be the value of y, one (true) or zero (false)
    return 0

In [None]:
# TODO: use the train_majority function to find the theta value for the training dataset
theta = 0

# TODO: now use the theta value to create the test_data_yhat array which contains the classification for each test value 

# TODO: now calculate the accuracy of the classifier using the function implemented in P1, and print it out



TODO: Discuss here the performance of the majority classifier. Would this beat a classifier that just returns random values? 

## P3: Implement a hand-engineered classifier (8 pts)

Engineer by hand a classifier function that predicts whether a transaction is fraudulent or not. Your function should have a $\theta$ parameter which allows you to tweak it.
The problem requires you to design a function that performs this classification, tweak its parameters, and measure its accuracy for the best parametrization you found. You should aim for a function that, at minimum, performs better than the majority classifier.
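
Tweaking a parameter can be as simple as trying a handful of candidate values and keeping the one with the highest accuracy. The sketch below assumes a hypothetical one-feature rule (`classify_threshold`) and made-up toy data; neither is part of the project code.

```python
import numpy as np

def classify_threshold(x, theta):
    """Hypothetical one-feature rule: flag the row when x[0] exceeds theta[0]."""
    return 1 if x[0] > theta[0] else 0

# Made-up toy data (illustration only): one amount feature and a 0/1 label
toy_x = np.array([[10.0], [25.0], [40.0], [55.0], [70.0], [85.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

# Try each candidate threshold and remember the most accurate one
best_theta, best_acc = None, -1.0
for t in [20.0, 50.0, 80.0]:
    yhat = np.array([classify_threshold(row, [t]) for row in toy_x])
    acc = float(np.mean(yhat == toy_y))
    if acc > best_acc:
        best_theta, best_acc = t, acc
print(best_theta, best_acc)
```

A reasonable design choice is to pick the parameter on the training data and report accuracy on the test data only once, so that the reported number is not tuned to the test set.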

In [None]:
# TODO: implement here your hand-engineered classifier
# The example below is just a very bad example, but it gives you an idea of how you can reason about the classification problem.
# In your implementation, you should try to actually find some kind of clever algorithm. You can also use more complex parametrizations

def classify_handwritten(x, theta):
    """Fraudulent transaction classifier.

    In this test example, we classify every transaction from a high-risk
    country and every foreign transaction as fraudulent. Transactions
    larger than theta[0] are also fraudulent.
    """
    # x[5] is isHighRiskCountry
    if x[5] == 1:
        return 1
    else:
        # x[4] is isForeignTransaction
        if x[4] == 1:
            return 1
    # x[1] is Transaction_amount
    if x[1] > theta[0]:
        return 1
    return 0


In [None]:
# TODO: Now, run some experiments with your function. Experiment with different values of the parameter theta.  

In [None]:
# TODO: calculate the accuracy of the classifier on the test data with the best
# theta found above and print it.

TODO: Describe in one paragraph your experiments and evaluation. Discuss the overall accuracy of your classifier. Did you manage to beat the "majority" classifier? Comment on how easy or hard it was to do this.

## P4: Implement a logistic regression classifier using sklearn (8 pts)
Implement a logistic regression function using the sklearn library. 
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
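
As a rough sketch of how regularization strength can be varied, the cell below fits `LogisticRegression` on synthetic data from `make_classification` (not the project dataset). In scikit-learn, `C` is the inverse of the regularization strength, so a very large `C` is used here as an approximation of "no regularization"; this choice, the data sizes, and `max_iter=2000` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (illustration only), split into train and test
X, y = make_classification(n_samples=600, n_features=9, n_informative=5,
                           random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

# Smaller C means stronger L2 regularization; a huge C makes it negligible
results = {}
for C in [0.01, 1.0, 1e6]:
    clf = LogisticRegression(C=C, max_iter=2000)
    clf.fit(X_train, y_train)
    results[C] = float(np.mean(clf.predict(X_test) == y_test))
    print(f"C={C}: test accuracy {results[C]:.3f}")
```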


In [None]:
# TODO: implement the logistic regression here in a function 

In [None]:
# TODO: now, run some experiments with it, and measure the accuracy with various parametrizations. In particular, you should run it with and without regularization. 
# In the last line, print the accuracy with the best parameters.


TODO: Describe in one paragraph your experiments and evaluation of the logistic regression classifier. Consider things such as accuracy, training time, and ease of tuning the parameters. Compare its accuracy with that of the hand-engineered classifier.

## P5 Bonus: Implement a random forest classifier using sklearn (5 pts)
Implement a random forest classifier using sklearn 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
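
The sketch below shows one way to compare a few random forest settings, including a simple training-time measurement. It runs on synthetic data from `make_classification` (not the project dataset), and the particular `n_estimators` and `max_depth` values are arbitrary choices for illustration.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (illustration only), split into train and test
X, y = make_classification(n_samples=600, n_features=9, n_informative=5,
                           random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

# Compare a small, shallow forest with a larger, unrestricted one
rf_results = {}
for n_estimators, max_depth in [(10, 3), (100, None)]:
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth, random_state=0)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    acc = float(np.mean(clf.predict(X_test) == y_test))
    rf_results[(n_estimators, max_depth)] = acc
    print(f"trees={n_estimators}, depth={max_depth}: "
          f"accuracy {acc:.3f}, fit time {elapsed:.3f}s")
```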

In [None]:
# TODO: Implement the random forest classifier here

In [None]:
# TODO: Perform some experiments here with different parameters of the random forest classifier. In the last line, print the accuracy with the best parameters.

TODO: Describe in one paragraph your experiments and evaluation of the random forest classifier. Consider things such as accuracy, training time, and ease of tuning the parameters.

## P6 Bonus: Implement an AdaBoost classifier using sklearn (5 pts)

Implement an AdaBoost classifier using sklearn https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [None]:
# TODO: Implement the AdaBoost classifier here

In [None]:
# TODO: Perform some experiments here with different parametrizations of the AdaBoost classifier. In the last line, print the accuracy with the best parameters.

TODO: Describe in one paragraph your experiments and evaluation of the AdaBoost classifier. Consider things such as accuracy, training time, and ease of tuning the parameters.

In [None]:
# SOLUTION
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
clf.fit(training_data_x, training_data_y)
yhat = clf.predict(test_data_x)
acc = accuracy(test_data_y, yhat)
print(f"Accuracy of the AdaBoost classifier {acc}")