So far, we have implemented Linear Regression from scratch, using both the R-Squared method and Gradient Descent. In production, or for fast prototyping, however, we want a library that provides these algorithms ready to use.
So, here we will see how to implement Linear Regression using Scikit-Learn.
Let's get started.
We will use the following dependencies:
numpy: for numerical calculations
pandas: for loading and handling the data
sklearn.linear_model.LinearRegression: Linear Regression using Scikit-Learn
model_selection: provides train_test_split to divide the data into training and test sets.
r2_score: To get the R-Squared score for both Training and Testing.
mean_squared_error: To get the Training and Testing Mean Squared Error.
matplotlib: To plot the data
# Import Dependencies
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline
With the dependencies in place, the first step is to load the data. For this post, we will use the "Swedish Insurance Dataset". This is a very simple dataset to start with and involves predicting the total payment for all the claims in thousands of Swedish Kronor (y) given the total number of claims (X). This means that for a new number of claims (X) we will be able to predict the total payment of claims (y).
Let's load our data and have a look at it.
# Load Data
df = pd.read_csv('dataset/Insurance-dataset.csv')
# Have a look at the first few rows to see what the data looks like.
print(df.head())
The data is in the form of two columns, X and Y. X is the total number of claims and Y is the total payment for those claims in thousands of Swedish Kronor.
Now, let's describe our data.
df.describe()
So, both columns have the same count of data points, which suggests that no values are missing and the data needs no cleaning. describe() also gives us the mean, min, max and other summary statistics for each column.
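If you want to check for missing values directly rather than reading them off the counts, a minimal sketch looks like this. The small DataFrame here is made up for illustration and is not the insurance data; with the real data you would call the same methods on df.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data (NOT the actual insurance values).
toy = pd.DataFrame({
    "X": [10.0, 20.0, np.nan, 40.0],
    "Y": [35.0, 70.0, 105.0, 140.0],
})

# Number of missing values in each column; anything above zero needs attention.
missing = toy.isnull().sum()
print(missing)

# One simple remedy: drop the rows that contain a missing value.
clean = toy.dropna()
print(clean.shape)
```

Dropping rows is only one option; whether it is appropriate depends on how much data would be lost.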
Now, let's put the data into the form that Scikit-Learn expects as input.
# Convert each column to a 2-D float array of shape (n_samples, 1)
X = np.array(df['X'], dtype=np.float64).reshape(-1,1)
y = np.array(df['Y'], dtype=np.float64).reshape(-1,1)
# Shape of Input Data Points
print(X.shape)
print(y.shape)
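Why the reshape(-1,1)? Scikit-Learn expects the input X as a 2-D array with one row per sample and one column per feature, while a single DataFrame column converts to a 1-D array. A small sketch with made-up numbers shows the difference:

```python
import numpy as np

# A single column of made-up values comes out as a 1-D array.
values = np.array([10.0, 20.0, 30.0])
print(values.shape)   # (3,)

# reshape(-1, 1) turns it into 3 rows and 1 column; -1 lets NumPy infer the row count.
column = values.reshape(-1, 1)
print(column.shape)   # (3, 1)
```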
So, now to the main part of this tutorial. Let's define the model, use it to make predictions, and then measure how well it fits the data. (The variable is called clf out of habit, although Linear Regression is a regressor and not a classifier.) Let's get started.
# Linear Regression
# Define a Regression Model
clf = LinearRegression()
# Train the Model
clf.fit(X,y)
# Predictions for "X"
y_predict = clf.predict(X)
print('Predicted "y" values: ',y_predict)
So these are the predicted values of "y" w.r.t input "X" using Linear Regression. As simple as that.
Let's see the plot of this data.
# Plot the data points and the line fitted by the Model created above
fig,ax = plt.subplots(figsize=(10,8))
ax.scatter(X,y,c='b')
ax.plot(X,clf.predict(X),c='r')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Linear Regression')
Now that we have defined the Model, we can use it for things like making predictions and measuring its accuracy. But to test the accuracy of the Model, we need some data to train the Model on and some data, different from the training data, to test the Model on.
But why should the Testing Set contain different data points from the Training Set?
Because, if we test on the same data on which we trained the Model, the Model will score deceptively well, as it has already seen those points before. So, we need data points that are different from the training points to estimate the actual accuracy of the Model.
So, can we divide the data as it is, in some ratio like 80:20? Well, that is possible, but if the rows are stored in some particular order, the first 80% and the last 20% may not look alike, and the score will depend heavily on which points happen to land in each part. Also, if we keep tuning the Model against the same test points, information leaks from the Testing data into our choices, and the score stops being an honest estimate.
Well, you would ask "What is the solution then?"
Here, we introduce a technique called "Cross-Validation". In its simplest form, the data is shuffled and split at random into a Training set and a Testing set, which is what train_test_split does in the code below. We can also hold out a third set, the Validation set, so that we first check the trained Model on the Validation set and only then check its predictions and accuracy on the Test data points. Full k-fold cross-validation repeats the split several times and averages the scores.
So, let's do that.
# Train/Test Split (hold-out validation)
# X_train, X_test: Training and Testing Points from "X"
# y_train, y_test: Training and Testing Points from "y"
# test_size: Fraction of the data held out for testing. Here, I have taken 10% of data as Test Data.
# Try to change this and see the difference in the score of the Model.
# The split is random; pass random_state=<some integer> to get the same split on every run.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
# Training Model on Training Data
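# n_jobs: Number of jobs to run in parallel. For LinearRegression this only speeds up
# sufficiently large multi-target problems, so it makes no practical difference here.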
clf_LR = LinearRegression(n_jobs=10)
clf_LR.fit(X_train,y_train)
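The code above makes a single random split. For comparison, here is a minimal sketch of k-fold cross-validation, which repeats the split several times and reports one score per fold. The data is synthetic (a noisy straight line made up for this example), and the choice of 5 folds is an assumption for illustration, not something the insurance data dictates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for this sketch): y = 3x + 20 plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(60, 1))
y_demo = 3.0 * X_demo[:, 0] + 20.0 + rng.normal(0, 10, size=60)

# 5 folds: each fold is used once as the test set, the other 4 as the training set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R2 per fold:", scores)
print("Mean R2:", scores.mean())
```

Averaging over folds gives a steadier estimate than one split, which matters on a small dataset where a 10% test set holds only a handful of points.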
So, now that we have trained our Model, let's find out the predicted values.
# Predict the target values for the test data i.e. X_test
y_predict = clf_LR.predict(X_test)
print('Predicted "y" values: ',y_predict)
Let's also find out the values of "m" and "b" for our "Best Fit Line" for this Model.
# Slope(m) and Bias(b); the Bias is also called the y-intercept.
m = clf_LR.coef_[0]
b = clf_LR.intercept_
print('Slope(m): ',m)
print('Bias(b): ',b)
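With "m" and "b" in hand, a prediction is just y = m*x + b, which is what predict() computes for a single feature. The sketch below checks this on synthetic data; the names X_demo, y_demo, m_demo and b_demo are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for this sketch): y = 3x + 20 plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 1))
y_demo = 3.0 * X_demo + 20.0 + rng.normal(0, 5, size=(50, 1))

model = LinearRegression().fit(X_demo, y_demo)

# With a 2-D y, coef_ has shape (1, 1) and intercept_ has shape (1,).
m_demo = model.coef_[0][0]
b_demo = model.intercept_[0]

# Predict by hand for a new input and compare with predict().
x_new = np.array([[42.0]])
manual = m_demo * x_new + b_demo
print(manual, model.predict(x_new))
```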
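Well, we have seen the predicted values of "y". But how close are they to the actual values? In regression nothing gets "misclassified"; instead, score() returns the R-Squared value (the coefficient of determination) on the given data, which we will treat as the confidence of the Model.
# Confidence (R-Squared on the Test Data)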
conf = clf_LR.score(X_test,y_test)
print('Confidence: ',conf)
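So, we can see that the Model has an R-Squared of about 0.7242 on the Test Data, i.e. it explains about "72.42%" of the variance in the test values of "y". As the split is random, this number changes from run to run.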
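To make the meaning of this score concrete, R-Squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on synthetic data (made up for this example) and compares it with score() and r2_score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption for this sketch): y = 3x + 20 plus noise.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(40, 1))
y_demo = 3.0 * X_demo[:, 0] + 20.0 + rng.normal(0, 15, size=40)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - y_hat) ** 2)          # error left after fitting
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # error of always predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo), r2_score(y_demo, y_hat))
```

All three numbers agree, which is why the score on X_test and the test R2 Score report the same quantity.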
# R-Squared Score
# Use clf_LR (trained on the training split only), not clf (trained on all the data)
y_train_pred = clf_LR.predict(X_train)
y_test_pred = clf_LR.predict(X_test)
print('R2 Score: Train = %.3f ; Test = %.3f' % (r2_score(y_train,y_train_pred),r2_score(y_test,y_test_pred)))
As we know, the R-Squared value tells us how much of the variance in "y" the Model explains. In the run reported here, it is "84%" on the Training Data and about "72.4%" on the Testing Data, the same value that score() gave us.
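We also imported mean_squared_error at the start. A minimal sketch of how it could be used for the Training and Testing error is given below, again on synthetic data made up for this example; the variable names and random_state=0 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): y = 3x + 20 plus noise.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 100, size=(60, 1))
y_demo = 3.0 * X_demo[:, 0] + 20.0 + rng.normal(0, 10, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Mean Squared Error: average of the squared differences between actual and predicted y.
mse_train = mean_squared_error(y_tr, model.predict(X_tr))
mse_test = mean_squared_error(y_te, model.predict(X_te))

# The square root brings the error back to the units of y.
rmse_test = np.sqrt(mse_test)
print('MSE: Train = %.3f ; Test = %.3f ; Test RMSE = %.3f' % (mse_train, mse_test, rmse_test))
```

Unlike R-Squared, the MSE is in squared units of "y", so a lower value is better and the number is only meaningful relative to the scale of the data.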
So, that ends our post for now.