How I Raised My ML Skills Using This Simple Project

Struggles in Learning ML

Do you spend hours of your day in front of large textbooks trying to learn the mathematical concepts behind machine learning, such as linear algebra, calculus, or statistics? Learning machine learning concepts is challenging and time-consuming. But there is no need to worry. I will explain an easy way to overcome the fear of those large, tough concepts of machine learning in less than an hour. Get ready to raise your ML skills.

Discovering an Easy and Relevant Path

I spent a long time browsing the web to figure out the easiest way to enhance my machine learning skills. I found lots of articles and YouTube videos, but most of them were too technical and challenging to understand without a solid foundation in ML-related mathematics. 

I was looking for hands-on projects so I could experiment and learn theoretical concepts simultaneously. Finally, I decided to build a spam-detection machine learning model. It doesn't require deep ML knowledge, yet it is a real-world problem that walks you through each step of training a model. Our model will predict whether an email is spam or not.

During this journey, we uncover the secrets behind each stage of training and testing a machine learning model, including visualizing the results using a confusion matrix.

How to Train a Simple ML Model

Training a machine learning model involves multiple steps, such as:

  • collecting the appropriate data

  • preprocessing or cleaning the data

  • choosing the best algorithm for our model

  • evaluating and visualizing the results. 

After completing the fourth step, we will get a solid idea of how machine learning models are trained to predict information.

Collecting Data for our ML Model

Getting the proper dataset is the first and most important step in building any machine learning model. For machines to learn a new topic or subject, they require data. For our model, we need to collect as many spam and non-spam emails as possible. Collecting such a large quantity of emails is pretty difficult and takes an enormous amount of time.

But we don't need to worry. The University of California, Irvine maintains a big collection of datasets known as the UC Irvine Machine Learning Repository. It includes a vast number of emails, both spam and non-spam.

The data inside the repository is available to the public for experimentation and research on machine learning.

To download the dataset, go to the spambase repository and download the zip file, or you can directly import the dataset into your codebase.

First, install ucimlrepo using the command:

pip install ucimlrepo

Then import and use it:

from ucimlrepo import fetch_ucirepo  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 

# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets  

# metadata 
print(spambase.metadata)

# variable information 
print(spambase.variables) 

Preprocessing Data for Better Results

As you can see, the dataset is just a CSV file of numerical data: 4601 rows, 57 feature columns, and a final column that labels each email as spam or not.

Wait, why just numbers instead of actual emails? The machine learning model we are going to build only accepts numerical data as input and outputs its predictions as numbers; we can't directly feed it emails containing text or images.

In the context of our project, data preprocessing is the process of cleaning, organizing, and transforming the raw email data before it is used to train a machine learning model.

Researchers from UC Irvine have already transformed the raw email data into numerical values, so the majority of the problems in our data preprocessing stage are already solved.

In our dataset, as I said before, we have 4601 rows and 57 feature columns. Each row represents an email, and the columns represent various features extracted from that email. In machine learning, 'features' are individual measurable properties of the data that we use as inputs to our model. Based on these features, our machine learning model will predict whether the email is spam or not.

Feature Analysis

Our features consist of the frequency of specific words and characters and the length of consecutive sequences of capital letters. Here are some features from our dataset:

  • capital_run_length_average

  • capital_run_length_longest

  • word_freq_free

  • word_freq_business

  • word_freq_credit

  • char_freq_$

The word_freq_business column contains the normalized count of the word 'business' in the email. Most spam emails contain common words such as money, credit, business, offer, the dollar symbol, etc.

So the 57 features in our dataset are mostly normalized frequency values of specific words and characters in the emails, along with statistics about runs of capital letters.
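As a rough illustration of how raw text becomes such a feature, here is a minimal sketch of computing a word-frequency value for one email. The 100 * count / total-words formula follows the Spambase documentation, but the exact preprocessing used by the dataset's authors may differ, and the helper function below is my own, not part of the article's code.

import re

def word_freq(email_text: str, word: str) -> float:
    # Illustrative helper: percentage of words in the email that match `word`,
    # i.e. 100 * (occurrences of the word) / (total words in the email).
    words = re.findall(r"[A-Za-z]+", email_text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)

email = "Exclusive business offer! Reply now to grow your small business."
print(word_freq(email, "business"))  # 20.0 (2 out of 10 words)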

Preprocessing Using Pandas and NumPy

Pandas and NumPy are two popular Python libraries for data manipulation and analysis. In this project, we use them for some of the preprocessing tasks.

Our first step is to read the dataset using the Pandas library.

import pandas as pd
df = pd.read_csv('spambase.csv', header=None)

Since we are working with a fairly large dataset of 4601 rows, there may be duplicate rows, and duplicate data will affect the performance of our trained model.

To figure out the duplicate rows: 

duplicated_rows = df.duplicated()
# count the number of duplicated rows
num_duplicate_rows = sum(duplicated_rows)
print(num_duplicate_rows)

Our dataset contains 391 duplicate rows. To drop them, call the drop_duplicates method on the Pandas DataFrame.

df = df.drop_duplicates()

Now our dataset is free of repeated or duplicate values. There are other steps we can take to clean up the dataset and make it more stable and accurate.
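One such step (a common extra check I'm adding here, not part of the original walkthrough) is to look for missing values before moving on. Spambase normally has none, so this is only a sanity check:

# Count missing values per column and in total
missing_per_column = df.isnull().sum()
print(missing_per_column[missing_per_column > 0])
print(f"Total missing values: {df.isnull().sum().sum()}")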

Splitting into Input Features and Target Variables

Input features are the values given to the algorithm, and the target variable is what the model aims to predict. In our spam email example, the word and character frequencies are the input features, and the target variable is the label in the last column: 1 if the email is spam, 0 if it is not. The first 57 columns are the input features, and the 58th column is the target variable.

To split into input features and target variables:

X = df.iloc[:, 0:57]
y = df.iloc[:, 57]

Feature Scaling

Imagine you are baking a cake. The ingredients in the recipe are listed in different units: some in cups, some in grams, and some in tablespoons. It could get confusing, right? Feature scaling is like converting all those measurements to the same unit, say grams, making it easier to understand and follow the recipe accurately.

In our context, feature scaling involves adjusting the scale of the input features to a standard range. This is crucial because some machine learning algorithms are sensitive to the scale of input features. The goal of feature scaling is to prevent some features from dominating others due to differences in scale. For our dataset, we use StandardScaler().

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_scaled = sc_X.fit_transform(X)

The StandardScaler standardizes each feature by removing the mean and scaling to unit variance, so all features contribute equally to the model's training process.
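If you want to confirm the effect of scaling, a quick check (my own addition, not from the original code) is to compare column means and standard deviations before and after standardization:

import numpy as np

# After standardization, each column should have mean ~0 and std ~1.
print("means before:", np.round(X.values.mean(axis=0)[:3], 2))
print("means after: ", np.round(X_scaled.mean(axis=0)[:3], 2))
print("stds after:  ", np.round(X_scaled.std(axis=0)[:3], 2))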

Principal Component Analysis

We currently have 57 features (columns) in our dataset. Reducing this number improves the model's efficiency and cuts computation time, because too many features can lead to issues like lower accuracy and poor model performance. To achieve this, we use a dimensionality reduction technique called principal component analysis (PCA).

from sklearn.decomposition import PCA

pca = PCA(0.95)

# call pca fit_transform method
X_pca = pca.fit_transform(X_scaled)

print(X.shape, X_pca.shape)  # compare original vs reduced feature counts

PCA(0.95) keeps just enough principal components to explain 95% of the variance in the data. Applying it reduced our features from 57 to 49, giving a more streamlined and effective dataset for our machine learning model.
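To see what PCA actually kept, you can inspect the fitted object. This is an optional check I'm adding; n_components_ and explained_variance_ratio_ are standard attributes of scikit-learn's PCA:

import numpy as np

print("components kept:", pca.n_components_)
# Total share of variance explained by the retained components (should be >= 0.95)
print("variance explained:", np.cumsum(pca.explained_variance_ratio_)[-1])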

Split the Dataset

Our dataset is ready for model training. Before we proceed, we split it into two parts: a training set and a testing set. Typically, we divide the larger dataset into training and testing sets using an 8:2 ratio. The majority of the data (80%) is used to train the model, while the remaining 20% is reserved for testing the model's accuracy and correctness.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, random_state=0, test_size=0.2)

We created four subsets: X_train and y_train for training the machine learning model, and X_test and y_test for evaluating the trained model's performance.

The parameters used for the split are as follows:

  • test_size=0.2: This splits the dataset into an 8:2 ratio.

  • random_state=0: Setting the random state ensures a consistent split every time the code is run. Without it, the train-test split shuffles the data randomly, so rerunning the code could produce a different partition into training and testing sets.
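A quick way to confirm the 80/20 split worked (my own check, not from the original article) is to print the shapes of the four subsets:

# Roughly 80% of the ~4210 de-duplicated rows should land in the training set.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)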

Choosing the Best Algorithm

Selecting the best machine learning algorithm for training our model depends on various factors, such as:

  • characteristics of the dataset 

  • the nature of the problem

  • the quantity of data available, etc

For this project, we chose the decision tree algorithm, one of the most popular supervised machine learning classification algorithms. Our spam detection project is a binary classification problem; we predict whether our email is spam (positive class) or not spam (negative class).

What is a decision tree?

Imagine that today you are going outside for a walk. You look at the sky, and these questions come up:

  • 'Is it raining today?' If yes, you will take an umbrella. If not, the next question is,

  • 'Is it cloudy today?' If yes, it's better to take an umbrella. If not, the next question is,

  • 'Is it sunny today?' If yes, then there is no need for an umbrella.

This is how a decision tree works: you reach a conclusion after a series of questions and decisions.

A decision tree is a hierarchical tree structure that consists of nodes, where each node represents a decision. Nodes are connected through edges, and the final node, the leaf node, represents the final decision.
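To make the analogy concrete, here is a tiny sketch (my own illustration, not from the article) of the umbrella decision written as nested conditions, which is essentially what one path through a decision tree encodes:

def need_umbrella(raining: bool, cloudy: bool) -> bool:
    # Each `if` plays the role of an internal node; each return is a leaf.
    if raining:
        return True
    if cloudy:
        return True
    return False  # no rain and no clouds, so no umbrella needed

print(need_umbrella(raining=False, cloudy=True))  # True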

How Decision Tree Works for Our Model

For our spam detection model, at each node, the decision tree will check for specific conditions, and based on the condition, it goes to the next node.

The conditions are based on the features in our dataset. For example, consider two features: word_freq_make and char_freq_!. Each node in the decision tree checks whether the corresponding value in the row is below or above a threshold value.

This flow chart gives you a basic idea of what is happening inside the tree, but in real cases, thousands or millions of comparisons are happening inside the algorithm to predict the accurate output.

With dozens of input features, the number of comparisons can be substantial. At prediction time, each internal node makes a single comparison between one feature and its threshold value. During training, however, the algorithm evaluates candidate splits across all of the features (49 after PCA) at every node in order to pick the best one, so the total number of comparisons grows quickly with the number of nodes.

For our model, once the classifier has been trained (see the next section), we can check the total number of nodes using scikit-learn:

total_nodes = dTree.tree_.node_count
print(f"The total number of nodes in the decision tree is: {total_nodes}")

The total number of nodes in our decision tree is 405.
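If you're curious about what those nodes actually test, scikit-learn can print the learned tree as text. This is an optional inspection step I'm adding; export_text is a standard sklearn.tree utility, and it assumes the fitted dTree from the training step below. After PCA the features are unnamed components, so generic feature names appear in the output.

from sklearn.tree import export_text

# Print only the first few levels of feature/threshold rules learned by the tree.
rules = export_text(dTree, max_depth=2)
print(rules)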

Training the Model

Now that we have a basic idea of the decision tree algorithm, the next step is to train the model using a decision tree classifier.

First, we need to create an instance of the decision tree classifier, which is available in the sklearn library.

from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier()  # Creating an instance

dTree.fit(X_train, y_train)  # Training the model

To train the model, we call the fit method and pass X_train and y_train, the training subsets we generated from our dataset. The execution time of training depends on the size of the dataset and the complexity of the algorithm.
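We used the default settings above. If you want more control, DecisionTreeClassifier accepts hyperparameters such as max_depth and min_samples_leaf (standard scikit-learn parameters; the specific values below are only illustrative, not tuned for this dataset):

# A shallower, more regularized tree; random_state makes training reproducible.
dTree_small = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5, random_state=0)
dTree_small.fit(X_train, y_train)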

Testing the Model

After successful training, our next step is to provide testing data for calculating the accuracy and precision of our model.

predictions = dTree.predict(X_test)
print(predictions)

The output is a large array of binary values, so it's difficult to tell at a glance whether our predictions are any good. There are different ways to visualize the results and evaluate the predictions.

Confusion Matrix

The confusion matrix is a simple and powerful tool that provides a clear picture of how well the classification happens.

Sklearn provides a function called confusion_matrix to visualize the classification.

# Importing the necessary libraries
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Creating the confusion matrix
cm = confusion_matrix(y_test, predictions)

# Plotting the confusion matrix as a heatmap
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)

# Adding title and labels
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")

# Displaying the heatmap
plt.show()

For plotting the confusion matrix, we used Matplotlib and the Seaborn library in Python.

  • True Negative (TN): 443 - The model correctly identified 443 non-spam emails.

  • False Positive (FP): 60 - The model incorrectly identified 60 non-spam emails as spam.

  • True Positive (TP): 294 - The model correctly identified 294 spam emails.

  • False Negative (FN): 45 - The model incorrectly identified 45 spam emails as non-spam.

Precision and Accuracy of the Model

Of all the instances predicted as positive, how many were actually positive? The answer to this question is in the precision of our model.

Precision is calculated using the formula

Precision = True Positives / (True Positives + False Positives)

which gives 294 / (294 + 60) ≈ 0.83 = 83%. That's not too bad.

The accuracy of a model answers the question, 'Of all the instances, how many were correctly predicted by the model?'

Accuracy is calculated using the formula

Accuracy = Number of correct predictions / Total number of predictions

which gives (443 + 294) / 842 = 737 / 842 ≈ 0.875 = 87.5%.

Recall and F1 score are the other two metrics we use for evaluating the model.

Here is a detailed graph for precision, recall, and F1 score for spam and non-spam classes. 0 indicates non-spam and 1 indicates spam emails.
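If you don't want to compute these metrics by hand, scikit-learn can produce the same per-class numbers. This snippet is my addition and uses the standard accuracy_score and classification_report functions from sklearn.metrics:

from sklearn.metrics import accuracy_score, classification_report

print("accuracy:", accuracy_score(y_test, predictions))
# Per-class precision, recall, and F1 score (0 = non-spam, 1 = spam)
print(classification_report(y_test, predictions))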

The entire code for this project is available in this Jupyter notebook.

Conclusion

We can see how effective our model is; with just around 4000 rows, it achieved more than 80% accuracy and precision. There are further fine-tuning techniques to push these numbers higher. You can also try replacing the decision tree with other machine learning algorithms, such as

  • KNN - K-Nearest Neighbors

  • SVM - Support Vector Machine

  • Naive Bayes, etc., and compare the results (a sketch of swapping in one of these classifiers follows below).
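Swapping in another classifier only changes the line that creates the model; everything else (the preprocessed X_train/X_test split and the evaluation code) stays the same. Here is a minimal sketch using scikit-learn's Gaussian Naive Bayes, purely as an illustration:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Same fit/predict pattern as the decision tree, different algorithm.
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb_predictions))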

There are many other similar projects with large datasets available for learning more machine learning concepts. Just keep exploring new projects, running experiments, and enhancing your machine learning skills every day.
