Hello everyone. In this tutorial, we will see how to use the Gaussian Naive Bayes algorithm for a bank note authentication problem and see how well it classifies bank notes as original or fake.
Naive Bayes itself is a simple and useful algorithm, but it has variants that sometimes perform much better than the basic version. The widely used variants of Naive Bayes are:
Gaussian Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes
In this tutorial, I'll be using Gaussian Naive Bayes but feel free to use this code with other classifiers and see the difference in performance.
We saw the working of the Naive Bayes algorithm from scratch in the last post. We saw that to obtain the probability of a classification being true or false, we need to calculate three things:
Prior Probability [P(H)]
Likelihood Probability [P(E|H)]
Evidence Probability [P(E)]
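The three quantities above combine via Bayes' theorem into the posterior probability we actually want:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```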
In the basic version, we calculated the Likelihood Probability using the formula:
Gaussian Naive Bayes assumes that the feature values are continuous and normally distributed, i.e. they follow a Gaussian distribution. The likelihood for each feature is then given by the formula:
from IPython.display import Image
%matplotlib inline
Image(filename= 'C:/PythonProjects/NaiveBayes/gaussian.png', width='600',height='800')
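For readers who cannot load the image, the Gaussian likelihood it shows (with μ_y and σ²_y the mean and variance of the feature within class y) is the same expression we implement in code later:

```latex
P(x \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}}\exp\!\left(-\frac{(x-\mu_y)^{2}}{2\sigma_y^{2}}\right)
```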
Let's take an example dataset and see how this formula fits in. This dataset is from Wikipedia.
Image(filename= 'C:/PythonProjects/NaiveBayes/wiki-example.png',width='500',height='500')
So, if we were to apply the above formula for each feature to this dataset, the equation for one of the features would look like this:
Image(filename= 'C:/PythonProjects/NaiveBayes/gaussianlikelihood.png',width='',height='1200')
and similarly for the rest of the features. This means that, to use the Gaussian NB algorithm, we need to calculate the mean and the variance of each feature for each class in the data.
So, let's get back to work. In this tutorial, we'll perform bank note authentication using the Gaussian Naive Bayes algorithm and find the probability that a given bank note is original or fake. Let's get started.
You can download the dataset for this tutorial from here [https://archive.ics.uci.edu/ml/datasets/banknote+authentication]
The first step as usual is to import the dependencies. For this code, we require two libraries:
Numpy: for numerical computations
Pandas: for data analysis
# Import Dependencies
import numpy as np
import pandas as pd
As I mentioned earlier, for this tutorial, we'll be using the Bank Note Authentication dataset. So, let's import it.
# Import Dataset
df = pd.read_csv('dataset/data_banknote_authentication.csv', header=None)
# Have a look at the dataset
df.head()
Currently the columns have no names, so let's add them. This dataset has the following features:
variance of the wavelet-transformed image
skewness of the wavelet-transformed image
curtosis of the wavelet-transformed image
entropy of the image
class (the target label)
# Adding names to Columns
df.columns = ['variance','skewness','curtosis','entropy','class']
df.head()
# Describe the Data
df.describe()
So, the dataset has a total of 1372 data points and no missing values.
# Count the number of Original and Fake Notes
# Original: Class = 1
# Fake: Class = 0
num_classOriginal = df['class'][df['class'] == 1].count()
num_classFake = df['class'][df['class'] == 0].count()
total = len(df)
print('Number of Original Bank Notes: ',num_classOriginal)
print('Number of Fake Bank Notes: ',num_classFake)
print('Total number of Notes: ',total)
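As a side note, the same counts can be obtained in one call with pandas' value_counts. A minimal sketch on a made-up toy column (not the real dataset):

```python
import pandas as pd

# Toy stand-in for the 'class' column (not the real banknote data)
toy = pd.DataFrame({'class': [1, 0, 1, 1, 0]})

# value_counts tallies how often each label occurs
counts = toy['class'].value_counts()
print(counts[1], counts[0])  # 3 2
```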
Now that we know the total number of original and fake notes, we can find the prior probability of each class.
# Calculating the Prior Probabilities
# Probability(Original Note)
Probb_Original = num_classOriginal/total
print('Probability of Original Notes in Dataset: ',Probb_Original)
# Probability(Fake Note)
Probb_Fake = num_classFake/total
print('Probability of Fake Notes in Dataset: ',Probb_Fake)
As discussed above, we need to calculate the mean and the variance of each feature for each class. So, let's do that using pandas.
# Data Mean
data_mean = df.groupby('class').mean()
print('Mean: \n',data_mean)
print('\n')
# Data Variance
data_variance = df.groupby('class').var()
print('Variance: \n',data_variance)
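To see what groupby is doing here, consider a tiny made-up frame with a single feature (the values are illustrative, not from the banknote data):

```python
import pandas as pd

# Made-up two-class data with one feature
toy = pd.DataFrame({'class':    [0, 0, 1, 1],
                    'variance': [1.0, 3.0, -1.0, -3.0]})

# groupby('class').mean() yields one mean per class per feature
means = toy.groupby('class').mean()
print(means['variance'][0])  # 2.0 (average of 1.0 and 3.0)
print(means['variance'][1])  # -2.0
```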
So, doing this, we get the mean and the variance of each feature for each class in the dataset. Now that we have all the values we need, let's define the function we discussed in the beginning, which takes the mean and the variance and returns the likelihood probability.
# Function to Calculate Likelihood Probability
def p_x_given_y(x, mean_y, variance_y):
    # Gaussian density of x given the class-conditional mean and variance
    probb = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    return probb
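As a quick sanity check (not part of the original tutorial), the function should reproduce the peak density of a standard normal distribution, 1/√(2π) ≈ 0.3989:

```python
import numpy as np

def p_x_given_y(x, mean_y, variance_y):
    # Gaussian density of x given the class-conditional mean and variance
    return 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))

# Standard normal (mean 0, variance 1) evaluated at its mean
print(p_x_given_y(0.0, 0.0, 1.0))  # ≈ 0.3989
```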
So, now that we have defined all our functions, it's time to plug in the values and calculate the probability of our note being original or fake. But wait, we haven't defined our test data yet. Let's do that first.
# Testing Data
# This data point is actually a fake bank note (class = 0)
a = [3.2032, 5.7588, -0.75345, -0.61251]
So, now we can finally calculate the probability that the note with the features in "a" is an original or a fake note.
# Probability the Notes are Original
# Probability the Note is Original
prob_orig = (Probb_Original
             * p_x_given_y(a[0], data_mean['variance'][1], data_variance['variance'][1])
             * p_x_given_y(a[1], data_mean['skewness'][1], data_variance['skewness'][1])
             * p_x_given_y(a[2], data_mean['curtosis'][1], data_variance['curtosis'][1])
             * p_x_given_y(a[3], data_mean['entropy'][1], data_variance['entropy'][1]))
# Probability the Note is Fake
prob_fake = (Probb_Fake
             * p_x_given_y(a[0], data_mean['variance'][0], data_variance['variance'][0])
             * p_x_given_y(a[1], data_mean['skewness'][0], data_variance['skewness'][0])
             * p_x_given_y(a[2], data_mean['curtosis'][0], data_variance['curtosis'][0])
             * p_x_given_y(a[3], data_mean['entropy'][0], data_variance['entropy'][0]))
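Note that these two quantities are unnormalized scores of the form P(H)·P(E|H). The evidence P(E) we listed earlier is the same for both classes, so it cancels when comparing; dividing by the sum of the scores recovers posteriors that add up to 1. A sketch with hypothetical score values:

```python
# Hypothetical unnormalized scores P(H) * P(E|H) for the two classes
score_orig, score_fake = 2e-6, 6e-6

# The evidence P(E) is the sum over classes; dividing by it normalizes
evidence = score_orig + score_fake
post_orig = score_orig / evidence
post_fake = score_fake / evidence
print(post_orig, post_fake)  # 0.25 0.75
```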
So, now that we have defined the probabilities, let's see the result.
# Testing the Classifier
if (prob_orig > prob_fake):
    print('Congratulations !! Your Bank Note is Original...')
else:
    print('Sorry !! Your Bank Note is a Fake !!')
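One practical caveat worth knowing: multiplying many small likelihoods can underflow to zero in floating point, which is why production implementations (scikit-learn's GaussianNB among them) compare sums of log-probabilities instead of raw products. A minimal illustration with hypothetical likelihood values:

```python
import numpy as np

# Fifty hypothetical per-feature likelihoods, each very small
likelihoods = np.full(50, 1e-10)

# A direct product underflows to exactly 0.0 in float64
direct = np.prod(likelihoods)

# Summing log-likelihoods keeps the comparison well defined
log_score = np.sum(np.log(likelihoods))

print(direct)     # 0.0
print(log_score)  # about -1151.3
```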