Final Project

UBC Key Capabilities in Data Science

Introduction to Machine Learning - April 09 - May 27, 2024

Noriko Kono

Introduction

I am working on a project to develop machine learning models that predict the fat content level of cheese products. The data were obtained from Kaggle and are distributed under an Open Government Licence (Canada). The main question I seek to address through this project is: "Can the model accurately predict whether a cheese product has a higher or lower fat content based on the given data?" This question falls within the realm of supervised machine learning, specifically classification, as our target variable "FatLevel" is categorical, with two values representing "higher fat" and "lower fat". I am adhering to the dataset description provided per my superiors' guidelines. The dataset description is attached below for reference.

data_instructions_and_suggestions_resized.png

Exploratory Data Analysis (EDA)

Split the data into train and test sets

From Module 3:

Train/validation/test split

  • Train: Used to fit our models.
  • Validation: Used to assess our model during model tuning.
  • Test: Unseen data used for a final assessment.
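The split above can be sketched as follows. This is a minimal sketch on a tiny stand-in frame; the column names (other than `FatLevel`, which follows the dataset description) and values are assumptions, not the real cheese data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in frame; in the project the data come from the Kaggle cheese CSV.
df = pd.DataFrame({
    "MoisturePercent": [40.0, 55.0, 47.0, 60.0, 38.0, 52.0],
    "FatLevel": ["higher fat", "lower fat", "higher fat",
                 "lower fat", "higher fat", "lower fat"],
})

# Lock away a test set for the final assessment; validation scores
# will come from cross-validation on the training portion.
train_df, test_df = train_test_split(df, test_size=0.3, random_state=77)
X_train, y_train = train_df.drop(columns=["FatLevel"]), train_df["FatLevel"]
X_test, y_test = test_df.drop(columns=["FatLevel"]), test_df["FatLevel"]
```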

Most of the columns are categorical, and four columns contain missing values.
I need to transform the categorical features into numeric ones and handle missing values.

There are two kinds of binary classification problems:

  1. Distinguishing between two classes
  2. Spotting a class (fraud transaction, spam, disease)

The above description is from the Module 7 lecture video. The project involves solving a binary classification problem. I will proceed to distinguish between two classes: high and low fat contents.

From Module 7:

Addressing class imbalance

A very important question to ask yourself: “Why do I have a class imbalance?”

  • Is it because one class is much rarer than the other?

  • Is it because of my data collection methods?

But, if you answer “no” to both of these, it may be fine to just ignore the class imbalance.

Although I observe an imbalance in the class distribution, I will disregard it under the assumption that the answers to both questions are "no." I will nevertheless set the hyperparameter class_weight="balanced" for the models that support it, as is recommended when dealing with imbalanced classes.
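Checking the class distribution is one `value_counts` call. A sketch with made-up counts (the real proportions come from the cheese data); the dictionary at the end shows the reweighting rule that class_weight="balanced" applies internally.

```python
import pandas as pd

# Assumed 70/30 split for illustration only.
y_train = pd.Series(["higher fat"] * 70 + ["lower fat"] * 30, name="FatLevel")

# Inspect the class distribution as proportions.
counts = y_train.value_counts(normalize=True)
print(counts)

# class_weight="balanced" weights each class inversely to its frequency:
# weight_c = n_samples / (n_classes * n_samples_c)
weights = {c: len(y_train) / (2 * n) for c, n in y_train.value_counts().items()}
```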

Visualizations

I will assess the models using metrics such as accuracy, precision, recall, F1 score, and the confusion matrix. These metrics were introduced in this program, and I understand that they are widely accepted standards in the field.


Identify different types of features

Numeric Features

Categorical Features

I am examining this column further because it significantly influences the outcome.

By Copilot

Feature 0 (0.514878): This feature has the highest importance score, which means the model heavily relies on this feature for making predictions.

Binary Features

Start with Baseline Without Preprocessing

I was so confused. As a novice, I didn't understand why DummyClassifier accepted the categorical features without preprocessing; I had thought models could only use numeric values, and I spent many hours trying to figure it out. AI provided me with the following explanation:

DummyClassifier_answer_from_bing.png

Since the official scikit-learn website states, "Do not use it for real problems," I assume this is not a concern. Besides, it only returned poor results.

DummyClassifier is a classifier that makes predictions using simple rules.

This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.
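A minimal baseline sketch with `DummyClassifier` (synthetic labels for illustration; the real project fits it on the training split). This also resolves the earlier confusion: the dummy ignores the features entirely, which is why unencoded categorical columns never caused an error.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((10, 1))  # the dummy never looks at the features
y = np.array(["higher fat"] * 7 + ["lower fat"] * 3)

# "most_frequent" always predicts the majority class, so its accuracy
# equals the majority-class proportion -- the floor a real model must beat.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
print(dummy.score(X, y))  # 0.7
```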

Eventually, I found the answer.

instruction_by_Ela.png

Preprocessing

Preprocessing: Transforming input data into a format a machine learning model can use and understand.

From Module 6

Do we need to preprocess categorical values in the target column? Generally, there is no need for this when doing classification. sklearn is fine with categorical labels (y-values) for classification problems.

Briefly justify my choices

I concluded as below. I used several methods to investigate each column. I wondered whether there were any ordinal features, but I did not find any.
  • Split the numeric, categorical and binary features
  • make_pipeline
  • make_column_transformer
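The steps above can be wired together roughly like this. The column names here are illustrative placeholders, not the exact dataset columns; `SimpleImputer` handles the missing values noted in the EDA.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_feats = ["MoisturePercent"]      # placeholder column names
categorical_feats = ["MilkType"]
binary_feats = ["Organic"]

preprocessor = make_column_transformer(
    # Impute missing numbers with the median, then scale.
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     numeric_feats),
    # One-hot encode categoricals; ignore unseen categories at predict time.
    (OneHotEncoder(handle_unknown="ignore"), categorical_feats),
    # drop="if_binary" keeps a single 0/1 column for two-valued features.
    (OneHotEncoder(drop="if_binary"), binary_feats),
)

X = pd.DataFrame({
    "MoisturePercent": [40.0, None, 60.0],
    "MilkType": ["cow", "goat", "cow"],
    "Organic": ["yes", "no", "yes"],
})
Xt = preprocessor.fit_transform(X)  # 1 scaled + 2 one-hot + 1 binary column
```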

Methods & Results

3 types of scores:

  • Training score: The score that our model gets on the same data that it was trained on. (seen data - training data)
  • Validation score: The mean validation score from cross-validation).
  • Test score: This is the score from the data that we locked away.
Start with Baseline With Preprocessing

Preprocessing does not improve the DummyClassifier's performance.


Automated Hyperparameter Optimization

KNeighborsClassifier


We have imbalanced data; therefore, I set the hyperparameter class_weight="balanced" for the models that support it. (Note that KNeighborsClassifier does not accept class_weight; SVC and RandomForestClassifier do.)

SVC

RandomForestClassifier

pipe = make_pipeline(preprocessor, RandomForestClassifier(class_weight="balanced", random_state=77))

Randomized hyperparameter optimization


pipe = make_pipeline(preprocessor, RandomForestClassifier(class_weight="balanced", random_state=77, max_depth=141))
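A sketch of the randomized search. The parameter ranges and synthetic data are illustrative; in the project the estimator is the full pipeline shown above, in which case the parameter names get the `randomforestclassifier__` prefix (e.g. `"randomforestclassifier__max_depth"`).

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(77)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)

# Sample n_iter random combinations instead of trying an exhaustive grid.
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=77),
    param_distributions={
        "n_estimators": randint(10, 100),
        "max_depth": randint(2, 150),
    },
    n_iter=5, cv=3, random_state=77,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```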


The best search result

Ultimately, my model attained a best validation score of 0.86, a training accuracy of 0.90 and a test accuracy of 0.83. In my humble assessment, these results are reasonably good.

Confusion Matrix

confusion_matrix returns a NumPy array.
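A toy sketch of that array (made-up labels, not the project's actual predictions):

```python
from sklearn.metrics import confusion_matrix

y_true = ["higher fat", "higher fat", "lower fat", "lower fat", "lower fat"]
y_pred = ["higher fat", "lower fat", "lower fat", "lower fat", "higher fat"]

# Rows are true labels, columns are predicted labels,
# in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["higher fat", "lower fat"])
print(cm)
# [[1 1]
#  [1 2]]
```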

Precision, recall and f1-score

Three commonly used metrics are recall, precision and f1 score, all of which are based on the confusion matrix.

$precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

$recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

$f1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
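These formulas can be checked directly against scikit-learn. With the toy labels below (treating "higher fat" as the positive class), TP = 1, FP = 1 and FN = 1, so all three metrics come out to 0.5:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["higher fat", "higher fat", "lower fat", "lower fat", "lower fat"]
y_pred = ["higher fat", "lower fat", "lower fat", "lower fat", "higher fat"]

p = precision_score(y_true, y_pred, pos_label="higher fat")  # 1 / (1 + 1) = 0.5
r = recall_score(y_true, y_pred, pos_label="higher fat")     # 1 / (1 + 1) = 0.5
f1 = f1_score(y_true, y_pred, pos_label="higher fat")        # 2*0.25 / 1.0 = 0.5
```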

Feature importances

Linear Regression

Since regression requires a numeric target, I needed to convert the target values to binary (0/1).

Regression measurements


Writing ✏️

Further Attempt

I am not sure of the right approach, but I wanted to use CountVectorizer, so I built a model to predict cheese names from their characteristics. 🧀

bing_cheese_platter_resized.jpg

Content credentials: generated with AI ∙ May 20, 2024 at 10:18 PM

CountVectorizer converts a collection of text documents to a matrix of word counts.

Discussion

References

I want to clarify that my main reference while working on this project was the learning material provided in this program. My goal was to meet the requirements and also to learn how to properly document and present my work; as a result, there may be similarities in presentation.

I need to clarify that some of the content in this notebook is not original. I have borrowed certain parts from online resources. I have utilized Microsoft Bing AI as a valuable tool for brainstorming and gathering resources. I have also used Grammarly to improve my writing.

Main Resources I used