In this tutorial, we will write the code for Linear Regression from scratch using the second approach that we studied, i.e. Gradient Descent, and then move on to plotting the "Best Fit Line". So, let's get started.
We will need the following libraries:
Numpy: for numerical calculations
Pandas: to load and modify the data
Matplotlib: to plot the data
# Import Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
We discussed in the earlier tutorials that a straight line is represented by the equation:
y = m*x + b
where,
m: the slope of the line
b: the bias or y-intercept
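As a quick, purely illustrative example (the predict helper below is our own, not part of the tutorial's code), this is all the equation says in Python:
# Illustrative example: compute y for a given x using slope m and bias b
def predict(x, m, b):
    return m*x + b
print(predict(5, 2.0, 1.0))   # prints 11.0, i.e. 2.0*5 + 1.0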
Let's first load the dataset. We would like to see whether the dataset is balanced or not, how many elements it has in each column, and its min, max and mean values.
Here, we will be using the same "Swedish Insurance Dataset" so that we can compare the results with the previous code.
So, let's get started.
# Load the data using Pandas
df = pd.read_csv('dataset/Insurance-dataset.csv')
# Let's have a look at the data, what it looks like, how many data points are there in the data.
print(df.head())
The data is in the form of two columns, X and Y. X is the number of claims and Y is the total payment for those claims in thousands of Swedish Kronor.
Now, let's describe our data.
# Describe the Data
df.describe()
So, both columns have an equal number of data points and no values are missing; hence, the dataset is balanced and there is no need to modify it. We also get the mean, min and max values of both columns.
Now, let's put the data into the form expected by the functions we will define below.
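If you want to double-check this yourself, a small sketch like the following (assuming the columns are named 'X' and 'Y' as above) confirms the shape of the data and that there are no missing values:
# Optional sanity check (illustrative): dataset shape and missing values per column
print(df.shape)           # (number of rows, number of columns)
print(df.isnull().sum())  # count of missing values in each column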
# Load the data in the form to be input to the function for Best Fit Line
X = np.array(df['X'], dtype=np.float64)
y = np.array(df['Y'], dtype=np.float64)
Before going any further, let's first plot our data and see whether it's linear or not. Remember, we require roughly linear data to fit a best fit line, and we can ignore a few outliers.
So, let's plot the data.
# Scatter Plot of the Input Data
fig,ax = plt.subplots()
ax.scatter(X,y)
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Input Data Scatter Plot')
From the above plot, we can see that the data is pretty linear except for 1 or 2 points. But that's ok. We can work with that.
Let's now define the Cost Function for Linear Regression. As we saw in the last post on Gradient Descent for Linear Regression, the Cost Function is defined as:
J(m, b) = (1/2n) * sum((y_hat - y)**2)
where,
y_hat = m*x + b is the predicted value for each data point
n: the total number of data points
So, let's define this function.
# Cost Function
# Cost Function [J] = (1/2n) * (sum((y_hat-y)**2))
# where,
# n: total number of items in a column of dataset. i.e. 63 in this case.
# len(X): Taking the length of column X gives us the value of "n"
def cost_Function(m,b,X,y):
    return sum(((m*X + b) - y)**2)/(2*float(len(X)))
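As a quick sanity check (illustrative only), calling the function with m = 0 and b = 0 should simply give sum(y**2)/(2*n), i.e. the cost of predicting zero for every point:
# Illustrative sanity check for the cost function
print(cost_Function(0, 0, X, y))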
Now, since we have the function to calculate the cost, the next step is of course to get the values for "m" and "b". But here, unlike the previous code, we'll be using Gradient Descent to iteratively refine the values and find the best "m" and "b" for the "Best Fit Line".
So, let's get started.
What are the equations for Gradient Descent?
Well, we saw in the post that to find the values of "m" and "b" using Gradient Descent, we first need to find the gradient or the derivative of the cost function w.r.t m and b.
So, the equations that we get are:
dJ/dm = (1/n) * sum(x * (y_hat - y))
dJ/db = (1/n) * sum(y_hat - y)
After this, we update the values of "m" and "b" simultaneously, i.e. we first compute both derivative values and then update "m" and "b" at once as:
m = m - alpha * (dJ/dm)
b = b - alpha * (dJ/db)
where,
alpha: the learning rate, which controls how big a step we take at each update
Finding these values is fine, but we are still missing one small part of Gradient Descent: we cannot find the perfect values for "m" and "b" in a single iteration. It is a gradual process. Remember the plot of Gradient Descent, where we try to move down the slope of the convex cost function.
To reach the bottom/minimum, we require more than one iteration. Hence, we apply the update equations defined above multiple times until the error cannot be reduced any further.
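To make the update rule concrete, here is a minimal vectorized sketch of a single batch update (illustrative only; the gradient_step name is our own, and the actual implementation below applies the same rule once per data point rather than once per pass):
# Illustrative sketch: one vectorized batch gradient descent update
def gradient_step(m, b, X, y, alpha):
    n = float(len(X))
    error = (m*X + b) - y                 # y_hat - y for every data point
    grad_m = (1/n) * np.sum(X * error)    # dJ/dm
    grad_b = (1/n) * np.sum(error)        # dJ/db
    return m - alpha*grad_m, b - alpha*grad_b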
Now, after all these equations, it's time to implement them.
# Gradient Descent
# X,y: Input Data Points
# m,b: Initial Slope and Bias
# alpha: Learning Rate
# iters: Number of Iterations for which we need to run Gradient Descent.
def gradientDescent(X,y,m,b,alpha,iters):
    # Initialize values of the gradients
    gradient_m = 0
    gradient_b = 0
    # n: Number of data points
    n = float(len(X))
    a = 0
    # List to store the cost after every iteration, for analysis
    hist = []
    # Perform Gradient Descent for iters iterations
    for _ in range(iters):
        # Update "m" and "b" once per data point
        for i in range(len(X)):
            gradient_m = (1/n) * X[i] * ((m*X[i] + b) - y[i])
            gradient_b = (1/n) * ((m*X[i] + b) - y[i])
            m = m - (alpha*gradient_m)
            b = b - (alpha*gradient_b)
        # Calculate the cost with the new values of "m" and "b"
        a = cost_Function(m,b,X,y)
        hist.append(a)
    return [m,b,hist]
So, now that we have written the Gradient Descent function, let's run it and provide the initial values.
# Learning Rate
lr = 0.0001
# Initial Values of "m" and "b"
initial_m = 0
initial_b = 0
# Number of Iterations
iterations = 1000
print("Starting gradient descent...")
# Check error with initial Values of m and b
print("Initial Error at m = {0} and b = {1} is error = {2}".format(initial_m, initial_b, cost_Function(initial_m, initial_b, X, y)))
# Run Gradient Descent to get new values for "m" and "b"
[m,b,hist] = gradientDescent(X, y, initial_m, initial_b, lr, iterations)
# New Values of "m" and "b" after Gradient Descent
print('Values obtained after {0} iterations are m = {1} and b = {2}'.format(iterations,m,b))
Now that we have obtained the new values for "m" and "b", it's time to plot the "Best Fit Line" and see how well it fits the data. To do that, we first need to compute the values of y_hat:
y_hat = m*X + b
where "m" and "b" are the new values obtained after performing Gradient Descent.
# Calculating y_hat
y_hat = (m*X + b)
print('y_hat: ',y_hat)
# Scatter Plot of the Input Data and Plot for Best Fit Line
fig,ax = plt.subplots()
ax.scatter(X,y,c='r')
ax.plot(X,y_hat,c='y')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Best Fit Line Plot')
As we can clearly see, this line passes close to most of the data points and is indeed a Best Fit Line. Now, let's do our second test: let's take an arbitrary input and see if we are able to get a prediction for it.
# Testing using arbitrary Input Value
predict_X = 76
predict_y = (m*predict_X + b)
print('predict_y: ',predict_y)
# Scatter Plot, Best Fit Line and Prediction Plot
fig,ax = plt.subplots()
ax.scatter(X,y)
ax.scatter(predict_X,predict_y,c='r',s=100)
ax.plot(X,y_hat,c='y')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Prediction Plot')
Well, we can see that the line goes through the predicted point, or in other words, the point lies on the line, which is exactly what we aimed for with the Best Fit Line.
Great work so far. But let's also see how the error goes down! Remember, we stored the cost_Function values in a list after every iteration. These values show how Gradient Descent moves downhill, reducing the Cost Function.
So, let's plot it.
# Error Plot
fig,ax = plt.subplots()
ax.plot(hist)
ax.set_title('Cost Function Over Time')
As we can see, the Cost Function values decrease over time and start to saturate at around 700 iterations. Hence, even if we run the Gradient Descent optimizer for only about 700 iterations, we will get nearly the same result.
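If you want to check this numerically rather than by eye, a small sketch like the following (the tolerance value is an arbitrary choice, not from the original post) finds the iteration after which the relative improvement in the cost becomes negligible:
# Illustrative check: first iteration where the relative improvement in cost is tiny
tol = 1e-6   # arbitrary threshold for "no meaningful improvement"
for i in range(1, len(hist)):
    if abs(hist[i-1] - hist[i]) < tol * hist[i-1]:
        print('Cost has essentially flattened out by iteration', i)
        break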
Well, this ends our coding section for this post.