Linear Regression using Scikit-Learn

So far, we have implemented Linear Regression from scratch using the R-Squared method as well as Gradient Descent. But in production, or for fast prototyping, we need a library that lets us use these algorithms easily.

So, here we will look at how to implement Linear Regression using Scikit-Learn.

So, let's get started.

Step-1: Import Dependencies

  • numpy: for numerical calculations

  • pandas: for loading and working with the data

  • sklearn.linear_model.LinearRegression: the Linear Regression model from Scikit-Learn

  • model_selection: to divide the data into Training and Testing sets (cross-validation)

  • r2_score: to get the R-Squared value for both Training and Testing data

  • mean_squared_error: to get the Training and Testing Mean Squared Error

  • matplotlib: to plot the data

In [1]:
# Import Dependencies

import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline

Step-2: Load Dataset

So, after getting all the dependencies, the first step is to load the data. For this code, we will use the "Swedish Insurance Dataset". This is a very simple dataset to start with: the task is to predict the total payment for all the claims, in thousands of Swedish Kronor (y), given the total number of claims (X). This means that for a new number of claims (X) we will be able to predict the total payment for those claims (y).

Let's load our data and have a look at it.

In [2]:
# Load Data

df = pd.read_csv('dataset/Insurance-dataset.csv')
In [3]:
# Let's have a look at the data, what it looks like, how many data points are there in the data.

print(df.head())
     X      Y
0  108  392.5
1   19   46.2
2   13   15.7
3  124  422.2
4   40  119.4

The data is in the form of two columns, X and Y. X is the total number of claims and Y is the total payment for those claims in thousands of Swedish Kronor.

Now, let's describe our data.

In [4]:
df.describe()
Out[4]:
                X           Y
count   63.000000   63.000000
mean    22.904762   98.187302
std     23.351946   87.327553
min      0.000000    0.000000
25%      7.500000   38.850000
50%     14.000000   73.400000
75%     29.000000  140.000000
max    124.000000  422.200000

So, both columns have the same number of data points, which means there are no missing values and no need to modify the data. We also get the mean, standard deviation, quartiles and min/max values for both columns.
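If you want to verify that explicitly, here is a minimal sketch using a standard pandas call:

In [ ]:
# Sanity check: count missing entries per column (should be 0 for both X and Y)
print(df.isnull().sum())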

Now, let's put the data into the shape Scikit-Learn expects: a 2D array of inputs and the corresponding target values.

In [11]:
# Load the data in the form to be input to the function for Best Fit Line

X = np.array(df['X'], dtype=np.float64).reshape(-1,1)
y = np.array(df['Y'], dtype=np.float64).reshape(-1,1)
In [12]:
# Shape of Input Data Points

print(X.shape)
print(y.shape)
(63, 1)
(63, 1)

Step-3: Defining the Model

So, now to the main part of this tutorial. Let's define the model and use it to make predictions and to calculate the accuracy and confidence of the Model. (The variable is named clf out of habit, even though Linear Regression is a regressor, not a classifier.) Let's get started.

In [14]:
# Linear Regression

# Define a Regression Model
clf = LinearRegression()

# Train the Model
clf.fit(X,y)

# Predictions for "X"
y_predict = clf.predict(X)
print('Predicted "y" values: ',y_predict)
Predicted "y" values:  [[ 388.68743025]
 [  84.8571334 ]
 [  64.37419204]
 [ 443.30860721]
 [ 156.54742816]
 [ 214.58242868]
 [  98.51242764]
 [  67.7880156 ]
 [ 173.61654596]
 [  54.13272136]
 [  37.06360356]
 [ 183.85801664]
 [  57.54654492]
 [  98.51242764]
 [  43.89125068]
 [  26.82213288]
 [ 101.9262512 ]
 [  40.47742712]
 [  30.23595644]
 [  98.51242764]
 [  40.47742712]
 [  50.7188978 ]
 [  50.7188978 ]
 [  30.23595644]
 [ 118.995369  ]
 [  43.89125068]
 [  33.64978   ]
 [  88.27095696]
 [  43.89125068]
 [  33.64978   ]
 [  19.99448576]
 [ 105.34007476]
 [  40.47742712]
 [  37.06360356]
 [  95.09860408]
 [  57.54654492]
 [ 228.23772292]
 [  60.96036848]
 [  33.64978   ]
 [  74.61566272]
 [  64.37419204]
 [ 224.82389936]
 [ 159.96125172]
 [ 146.30595748]
 [ 207.75478156]
 [ 159.96125172]
 [  57.54654492]
 [ 112.16772188]
 [  47.30507424]
 [  30.23595644]
 [  78.02948628]
 [  64.37419204]
 [  64.37419204]
 [  71.20183916]
 [  47.30507424]
 [ 118.995369  ]
 [ 122.40919256]
 [ 101.9262512 ]
 [  50.7188978 ]
 [ 125.82301612]
 [  67.7880156 ]
 [ 200.92713444]
 [ 108.75389832]]

So these are the predicted values of "y" for the input "X" using Linear Regression. As simple as that.
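Since the Model is trained, we can just as easily predict the payment for a single new number of claims; a minimal sketch (the input of 50 claims is purely illustrative):

In [ ]:
# Predict the total payment (in thousands of Kronor) for a hypothetical
# 50 claims. Note that predict() expects a 2D array, hence [[50.0]].
print(clf.predict(np.array([[50.0]])))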

Let's see the plot of this data.

In [22]:
# Plot the data and the Best Fit Line using the Model created above

fig,ax = plt.subplots(figsize=(10,8))
ax.scatter(X,y,c='b')
ax.plot(X,clf.predict(X),c='r')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Linear Regression')
Out[22]:
[Plot: scatter of the data points in blue with the fitted Best Fit Line in red; X on the x-axis, y on the y-axis, titled "Linear Regression"]

Step-4: Cross-Validation

Now, once we have defined the Model, we can use it to make predictions, measure accuracy, and so on. But to test the accuracy of the Model fairly, we need some test data, i.e. some data to train the Model on, and some data, different from the training data, to test the Model on.

But why do we need different data points in the Testing set than in the Training set?

Because if we test on the same data we trained the Model on, it will appear very accurate, since it has already seen those points before. So we need data points that are different from the training points to measure the actual accuracy of the model.

So, can we just divide the data as it is, in some ratio like 80:20? That is possible, but a single fixed split has its problems: the measured accuracy depends heavily on which points happen to land in the test set, and if we repeatedly tune the model against the same test set, information about the test data leaks into the model and the reported accuracy becomes misleading.

Well, you might ask: "What is the solution, then?"

Here we introduce a technique called "Cross-Validation". The idea is to work with three sets of data: Training, Validation and Testing sets. We evaluate the trained Model on the Validation set while developing and tuning it, and only then check its predictions and accuracy on the Test data points, so no information about the test set leaks into training.
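To make the idea concrete, here is a minimal sketch of such a three-way split using two calls to train_test_split (the 80/10/10 ratio and the random_state are illustrative choices; the rest of this post keeps things simple and uses only a Train/Test split):

In [ ]:
# A minimal sketch of a Train/Validation/Test split (roughly 80/10/10).
# First hold out the Test set, then split the remainder into Training
# and Validation sets.
X_rest, X_test_cv, y_rest, y_test_cv = model_selection.train_test_split(
    X, y, test_size=0.1, random_state=42)
X_train_cv, X_val, y_train_cv, y_val = model_selection.train_test_split(
    X_rest, y_rest, test_size=0.111, random_state=42)  # ~10% of the full data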

So, let's do the simple Train/Test split.

In [35]:
# Cross Validation

# X_train, X_test: Training and Testing Points from "X"
# y_train, y_test: Training and Testing Points from "y"
# test_size: Size of the Testing Dataset. Here, I have taken 10% of data as Test Data. 
# Try to change this and see the difference in the Accuracy of the Model.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
In [36]:
# Training Model on Training Data
# n_jobs: Number of jobs you want to run at once in parallel.

clf_LR = LinearRegression(n_jobs=10)
clf_LR.fit(X_train,y_train)
Out[36]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=10, normalize=False)

So, now that we have trained our Model, let's find out the predicted values.

In [37]:
# We need to predict the labels for the test data i.e. X_test

y_predict = clf_LR.predict(X_test)
print('Predicted "y" values: ',y_predict)
Predicted "y" values:  [[ 100.27093055]
 [  34.85546687]
 [ 227.65893876]
 [ 103.71384969]
 [  27.96962859]
 [  65.84173914]
 [ 158.80055594]]

Step-5: Slope and Bias

Let's also find out the values of "m" and "b" of the "Best Fit Line" for this Model.

In [38]:
# Slope(m) and Bias(b), which is also called the y-intercept.

m = clf_LR.coef_[0]
b = clf_LR.intercept_
print('Slope(m): ',m)
print('Bias(b): ',b)
Slope(m):  [ 3.44291914]
Bias(b):  [ 21.0837903]
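With "m" and "b" in hand, a prediction is just y = m*x + b. Here is a small sketch verifying this against the Model's predict() (the input of 50 claims is purely illustrative):

In [ ]:
# The Best Fit Line is y = m*x + b, so a manual prediction should match
# the Model's predict(). The input of 50 claims is purely illustrative.
x_new = 50.0
print('Manual prediction:', m[0] * x_new + b[0])
print('Model prediction: ', clf_LR.predict(np.array([[x_new]]))[0][0])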

Step-6: Confidence of the Model

Well, we have seen the predicted values of "y". But how accurate are those predictions? Since this is regression, there is no count of "misclassified" values; instead, the Model's score() method returns its R-Squared value on the given data, which we can treat as the confidence of the Model that it predicted the values accurately.

In [40]:
# Confidence

conf = clf_LR.score(X_test,y_test)
print('Confidence: ',conf)
Confidence:  0.724224051616

So, we can see that the model is about "72.42%" confident that the values are predicted accurately; this is simply its R-Squared score on the Test data.
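As a quick check, for regressors score() is exactly the R-Squared value, so we can reproduce conf by hand:

In [ ]:
# For a regression model, score() returns the R-Squared value, so computing
# r2_score() on the same predictions gives the same number as "conf" above.
print(r2_score(y_test, clf_LR.predict(X_test)))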

Step-7: R-Squared Value

In [43]:
# R-Squared Value

# Use the cross-validated Model (clf_LR) for both Training and Testing predictions
y_train_pred = clf_LR.predict(X_train)
y_test_pred = clf_LR.predict(X_test)

print('R2 Score: Train = %.3f ; Test = %.3f' % (r2_score(y_train, y_train_pred), r2_score(y_test, y_test_pred)))
R2 Score: Train = 0.840 ; Test = 0.724

As we know, the R-Squared value reflects the accuracy of the Model. So it has an accuracy of about "84%" on the Training data and "72.4%" on the Testing data.
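We imported mean_squared_error at the start but haven't used it yet; here is a minimal sketch that computes it for the same predictions:

In [ ]:
# Mean Squared Error of the cross-validated Model on Training and Testing data.
print('MSE: Train = %.3f ; Test = %.3f' % (
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)))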

So, this ends our post for now.