Linear Regression using Scikit-Learn

So far, we have implemented Linear Regression from scratch using the R-Squared method as well as Gradient Descent. But in production, or for fast prototyping, we need a library that lets us use these algorithms easily.

So, here we will look at how to implement Linear Regression using Scikit-Learn.

So, let's get started.

Step-1: Import Dependencies

  • numpy: for numerical calculations

  • pandas: for loading and working with the data

  • sklearn.linear_model.LinearRegression: the Linear Regression model from Scikit-Learn

  • model_selection: to divide the data into Training and Testing sets (cross-validation)

  • r2_score: to get the R-Squared value for both Training and Testing data

  • mean_squared_error: to get the Training and Testing Mean Squared Error

  • matplotlib: to plot the data

In [1]:
# Import Dependencies

import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline

Step-2: Load Dataset

So, after getting all the dependencies, the first step is to load the data. For this code, we will use the "Swedish Insurance Dataset". This is a very simple dataset to start with: the task is to predict the total payment for all the claims, in thousands of Swedish Kronor (y), given the total number of claims (X). This means that for a new number of claims (X) we will be able to predict the total payment for those claims (y).

Let's load our data and have a look at it.

In [2]:
# Load Data

df = pd.read_csv('dataset/Insurance-dataset.csv')
In [3]:
# Let's have a look at the data, what it looks like, how many data points are there in the data.

print(df.head())
     X      Y
0  108  392.5
1   19   46.2
2   13   15.7
3  124  422.2
4   40  119.4

The data is in the form of two columns, X and Y. X is the total number of claims and Y is the total payment for those claims in thousands of Swedish Kronor.

Now, let's describe our data.

In [4]:
df.describe()
Out[4]:
                X           Y
count   63.000000   63.000000
mean    22.904762   98.187302
std     23.351946   87.327553
min      0.000000    0.000000
25%      7.500000   38.850000
50%     14.000000   73.400000
75%     29.000000  140.000000
max    124.000000  422.200000

So, both columns have the same number of data points, which means there are no missing values and no need to modify the data. We also get the mean, standard deviation, quartiles and min/max values for both columns.
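If you want to verify that explicitly, here is a minimal sketch using a standard pandas call:

In [ ]:
# Sanity check: count missing entries per column (should be 0 for both X and Y)
print(df.isnull().sum())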

Now, let's put the data into the shape Scikit-Learn expects: a 2D array of inputs and the corresponding target values.

In [11]:
# Load the data in the form to be input to the function for Best Fit Line

X = np.array(df['X'], dtype=np.float64).reshape(-1,1)
y = np.array(df['Y'], dtype=np.float64).reshape(-1,1)
In [12]:
# Shape of Input Data Points

print(X.shape)
print(y.shape)
(63, 1)
(63, 1)

Step-3: Defining the Model

So, now to the main part of this tutorial. Let's define the model and use it to make predictions and to calculate the accuracy and confidence of the Model. (The variable is named clf out of habit, even though Linear Regression is a regressor, not a classifier.) Let's get started.

In [14]:
# Linear Regression

# Define a Regression Model
clf = LinearRegression()

# Train the Model
clf.fit(X,y)

# Predictions for "X"
y_predict = clf.predict(X)
print('Predicted "y" values: ',y_predict)
Predicted "y" values:  [[ 388.68743025]
 [  84.8571334 ]
 [  64.37419204]
 [ 443.30860721]
 [ 156.54742816]
 [ 214.58242868]
 [  98.51242764]
 [  67.7880156 ]
 [ 173.61654596]
 [  54.13272136]
 [  37.06360356]
 [ 183.85801664]
 [  57.54654492]
 [  98.51242764]
 [  43.89125068]
 [  26.82213288]
 [ 101.9262512 ]
 [  40.47742712]
 [  30.23595644]
 [  98.51242764]
 [  40.47742712]
 [  50.7188978 ]
 [  50.7188978 ]
 [  30.23595644]
 [ 118.995369  ]
 [  43.89125068]
 [  33.64978   ]
 [  88.27095696]
 [  43.89125068]
 [  33.64978   ]
 [  19.99448576]
 [ 105.34007476]
 [  40.47742712]
 [  37.06360356]
 [  95.09860408]
 [  57.54654492]
 [ 228.23772292]
 [  60.96036848]
 [  33.64978   ]
 [  74.61566272]
 [  64.37419204]
 [ 224.82389936]
 [ 159.96125172]
 [ 146.30595748]
 [ 207.75478156]
 [ 159.96125172]
 [  57.54654492]
 [ 112.16772188]
 [  47.30507424]
 [  30.23595644]
 [  78.02948628]
 [  64.37419204]
 [  64.37419204]
 [  71.20183916]
 [  47.30507424]
 [ 118.995369  ]
 [ 122.40919256]
 [ 101.9262512 ]
 [  50.7188978 ]
 [ 125.82301612]
 [  67.7880156 ]
 [ 200.92713444]
 [ 108.75389832]]

So these are the predicted values of "y" for the input "X" using Linear Regression. As simple as that.
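Since the Model is trained, we can just as easily predict the payment for a single new number of claims; a minimal sketch (the input of 50 claims is purely illustrative):

In [ ]:
# Predict the total payment (in thousands of Kronor) for a hypothetical
# 50 claims. Note that predict() expects a 2D array, hence [[50.0]].
print(clf.predict(np.array([[50.0]])))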

Let's see the plot of this data.

In [22]:
# Plot the data and the Best Fit Line using the Model created above

fig,ax = plt.subplots(figsize=(10,8))
ax.scatter(X,y,c='b')
ax.plot(X,clf.predict(X),c='r')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Linear Regression')
Out[22]:
[Plot: scatter of the data points in blue with the fitted Best Fit Line in red; X on the x-axis, y on the y-axis, titled "Linear Regression"]

Step-4: Cross-Validation

Now, once we have defined the Model, we can use it to make predictions, measure accuracy, and so on. But to test the accuracy of the Model fairly, we need some test data, i.e. some data to train the Model on, and some data, different from the training data, to test the Model on.

But why do we need different data points in the Testing set than in the Training set?

Because if we test on the same data we trained the Model on, it will appear very accurate, since it has already seen those points before. So we need data points that are different from the training points to measure the actual accuracy of the model.

So, can we just divide the data as it is, in some ratio like 80:20? That is possible, but a single fixed split has its problems: the measured accuracy depends heavily on which points happen to land in the test set, and if we repeatedly tune the model against the same test set, information about the test data leaks into the model and the reported accuracy becomes misleading.

Well, you might ask: "What is the solution, then?"

Here we introduce a technique called "Cross-Validation". The idea is to work with three sets of data: Training, Validation and Testing sets. We evaluate the trained Model on the Validation set while developing and tuning it, and only then check its predictions and accuracy on the Test data points, so no information about the test set leaks into training.
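To make the idea concrete, here is a minimal sketch of such a three-way split using two calls to train_test_split (the 80/10/10 ratio and the random_state are illustrative choices; the rest of this post keeps things simple and uses only a Train/Test split):

In [ ]:
# A minimal sketch of a Train/Validation/Test split (roughly 80/10/10).
# First hold out the Test set, then split the remainder into Training
# and Validation sets.
X_rest, X_test_cv, y_rest, y_test_cv = model_selection.train_test_split(
    X, y, test_size=0.1, random_state=42)
X_train_cv, X_val, y_train_cv, y_val = model_selection.train_test_split(
    X_rest, y_rest, test_size=0.111, random_state=42)  # ~10% of the full data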

So, let's do the simple Train/Test split.

In [35]:
# Cross Validation

# X_train, X_test: Training and Testing Points from "X"
# y_train, y_test: Training and Testing Points from "y"
# test_size: Size of the Testing Dataset. Here, I have taken 10% of data as Test Data. 
# Try to change this and see the difference in the Accuracy of the Model.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
In [36]:
# Training Model on Training Data
# n_jobs: Number of jobs you want to run at once in parallel.

clf_LR = LinearRegression(n_jobs=10)
clf_LR.fit(X_train,y_train)
Out[36]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=10, normalize=False)

So, now that we have trained our Model, let's find out the predicted values.

In [37]:
# We need to predict the labels for the test data i.e. X_test

y_predict = clf_LR.predict(X_test)
print('Predicted "y" values: ',y_predict)
Predicted "y" values:  [[ 100.27093055]
 [  34.85546687]
 [ 227.65893876]
 [ 103.71384969]
 [  27.96962859]
 [  65.84173914]
 [ 158.80055594]]

Step-5: Slope and Bias

Let's also find out the values of "m" and "b" of the "Best Fit Line" for this Model.

In [38]:
# Slope(m) and Bias(b), which is also called the y-intercept.

m = clf_LR.coef_[0]
b = clf_LR.intercept_
print('Slope(m): ',m)
print('Bias(b): ',b)
Slope(m):  [ 3.44291914]
Bias(b):  [ 21.0837903]
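With "m" and "b" in hand, a prediction is just y = m*x + b. Here is a small sketch verifying this against the Model's predict() (the input of 50 claims is purely illustrative):

In [ ]:
# The Best Fit Line is y = m*x + b, so a manual prediction should match
# the Model's predict(). The input of 50 claims is purely illustrative.
x_new = 50.0
print('Manual prediction:', m[0] * x_new + b[0])
print('Model prediction: ', clf_LR.predict(np.array([[x_new]]))[0][0])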

Step-6: Confidence of the Model

Well, we have seen the predicted values of "y". But how accurate are those predictions? Since this is regression, there is no count of "misclassified" values; instead, the Model's score() method returns its R-Squared value on the given data, which we can treat as the confidence of the Model that it predicted the values accurately.

In [40]:
# Confidence

conf = clf_LR.score(X_test,y_test)
print('Confidence: ',conf)
Confidence:  0.724224051616

So, we can see that the model is about "72.42%" confident that the values are predicted accurately; this is simply its R-Squared score on the Test data.
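As a quick check, for regressors score() is exactly the R-Squared value, so we can reproduce conf by hand:

In [ ]:
# For a regression model, score() returns the R-Squared value, so computing
# r2_score() on the same predictions gives the same number as "conf" above.
print(r2_score(y_test, clf_LR.predict(X_test)))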

Step-7: R-Squared Value

In [43]:
# R-Squared Value

# Use the cross-validated Model (clf_LR) for both Training and Testing predictions
y_train_pred = clf_LR.predict(X_train)
y_test_pred = clf_LR.predict(X_test)

print('R2 Score: Train = %.3f ; Test = %.3f' % (r2_score(y_train, y_train_pred), r2_score(y_test, y_test_pred)))
R2 Score: Train = 0.840 ; Test = 0.724

As we know, the R-Squared value reflects the accuracy of the Model. So it has an accuracy of about "84%" on the Training data and "72.4%" on the Testing data.
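We imported mean_squared_error at the start but haven't used it yet; here is a minimal sketch that computes it for the same predictions:

In [ ]:
# Mean Squared Error of the cross-validated Model on Training and Testing data.
print('MSE: Train = %.3f ; Test = %.3f' % (
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)))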

So, this ends our post for now.