# Import Dependencies
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2, so this requires an older version
# Load the Dataset
boston = load_boston()
# Print out the Dataset
print(boston)
# Separate Data into Features and Labels and load them as pandas DataFrames
# Features
features_df = pd.DataFrame(boston.data, columns=boston.feature_names)
features_df.head()
# Labels
labels_df = pd.DataFrame(boston.target, columns=['labels'])
labels_df.head()
For this tutorial, we'll do the train-test split first, since the normalization is applied only to the train and test features (not the labels) and the scaler should be fitted on the training data alone.
# Train Test Split
from sklearn.model_selection import train_test_split
We'll split the data into a training set and a test set. The test set will comprise only 20% of the dataset, and the rest will be the training data.
# Train Test Split
# Training Data = 80% of Dataset
# Test Data = 20% of Dataset
X_train, X_test, y_train, y_test = train_test_split(features_df, labels_df, test_size=0.2, random_state=101)
Now, let's have a look at the shape and type of the split data. This tells us whether the split was done correctly and the number of labels equals the number of feature rows. A mismatch between feature rows and label rows can lead to errors later, so check it now.
print(X_train.shape)
print(type(X_train))
print(y_train.shape)
print(type(y_train))
print(X_test.shape)
print(type(X_test))
print(y_test.shape)
print(type(y_test))
So, as we can see from the lines above, the train features have 404 rows, which is the same as the number of training labels. Similarly, the number of rows in the test features equals the number in the test labels. Also note that all the train and test data are of the same type, i.e. they are all pandas DataFrames, as expected.
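If you'd rather make this check automatic than visual, a minimal sketch (using the variable names from the split above) is to assert that the row counts line up:
# Hypothetical sanity check: fail fast if features and labels are misaligned
assert X_train.shape[0] == y_train.shape[0], 'train features/labels row mismatch'
assert X_test.shape[0] == y_test.shape[0], 'test features/labels row mismatch'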
So, now that we have our separated training and test features, we can apply the normalization onto our data.
# Normalize Data
from sklearn.preprocessing import StandardScaler
The StandardScaler algorithm standardizes the features by removing the mean and scaling to unit variance. The centering and scaling happen independently on each feature, by computing the relevant statistics on the samples in the training set. The mean and standard deviation are then stored, to be applied to later data using the transform method.
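As a quick illustration of what fit and transform actually compute, here is a minimal sketch on a made-up single-column array (not part of the Boston data); StandardScaler should reproduce the hand-computed z = (x - mean) / std:
# Toy data: one feature column with mean 2.0
demo = np.array([[1.0], [2.0], [3.0]])
demo_scaler = StandardScaler().fit(demo)
print(demo_scaler.mean_)                              # [2.]
print(demo_scaler.transform(demo))                    # approx [[-1.2247], [0.], [1.2247]]
print((demo - demo.mean(axis=0)) / demo.std(axis=0))  # same values computed by hand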
# Define the Preprocessing Method and Fit Training Data to it
scaler = StandardScaler()
scaler.fit(X_train)
# Replace X_train with the Scaled Version of the Data
# This process scales all the values in all 13 feature columns and replaces them with the new values
X_train = pd.DataFrame(data=scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
# Visualize the Normalized Data
print(X_train)
Note that in the line above, when we print the features, they no longer have values on different scales; they are all on the same scale now.
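To confirm this numerically rather than by eye, you can check that every standardized column now has (approximately) zero mean and unit variance; a minimal sketch, run before the NumPy conversion below:
# Each scaled training column should have mean ~0 and standard deviation ~1
print(X_train.mean(axis=0).round(6))
print(X_train.std(axis=0, ddof=0).round(6))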
For this tutorial, we'll be feeding the data to the model as NumPy arrays, so we'll convert it as follows.
# Convert from pandas DataFrames to NumPy Arrays
X_train = np.array(X_train)
y_train = np.array(y_train)
# Get the Type of Training Data
type(X_train), type(y_train)
Next, we'll apply the same normalization to the test features so that they are on the same scale as the training features. Note that we reuse the scaler that was already fitted on the training data instead of fitting a new one on the test set; fitting on the test set would scale it with different statistics and leak test-set information into the preprocessing.
# Apply the same Normalization to the Test Features
# Reuse the scaler fitted on the Training Data; do not fit a new one on the Test Set
# This process scales all the values in all columns and replaces them with the new values
X_test = pd.DataFrame(data=scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
print(X_test)
# Convert Test Features and Labels to NumPy Arrays
X_test = np.array(X_test)
y_test = np.array(y_test)
# Get the Type of Test Data
type(X_test), type(y_test)
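Because the scaler was fitted on the training data, the standardized test columns will have means near, but not exactly, zero; that small offset is expected and confirms that no test-set statistics were used. A minimal check on the arrays above:
# Test columns: means near 0 and stds near 1 (not exact, since the stats come from the training set)
print(X_test.mean(axis=0).round(3))
print(X_test.std(axis=0).round(3))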