# Import Dependencies
import numpy as np
import pandas as pd
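# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older release
from sklearn.datasets import load_boston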
# Load Data
# Load Dataset
boston = load_boston()
# Print out the Dataset
print(boston)
As we can see from above, the dataset contains a few things:
1. target: These are the labels for the data, i.e. the house prices to be predicted.
2. data: A 2-D array in which each row is one sample and each column holds the values of one feature.
3. DESCR: Description of the Dataset.
4. feature_names: These are the names corresponding to all the features in the dataset.
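The way these four pieces fit together can be sketched with a small stand-in. This is only an illustration: `toy` is a hypothetical plain dict with made-up values, not the real dataset object, and the names `toy_features` and `toy_labels` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same four keys as the loaded dataset.
# All values are made up for illustration.
toy = {
    "data": np.array([[0.1, 6.5], [0.3, 5.9], [0.2, 7.1]]),  # rows = samples, columns = features
    "target": np.array([24.0, 21.6, 34.7]),                  # one label per row of data
    "feature_names": np.array(["CRIM", "RM"]),               # one name per column of data
    "DESCR": "Toy dataset used only for illustration.",
}

# feature_names lines up with the columns of data, target with its rows.
toy_features = pd.DataFrame(toy["data"], columns=toy["feature_names"])
toy_labels = pd.Series(toy["target"], name="labels")

print(toy_features)
print(toy_labels)
```

The key point is that `data` has one column per entry in `feature_names` and one row per entry in `target`.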
# Separate Data into Features and Labels and load them as a Pandas Dataframe
# Features
# Pass feature_names directly; wrapping it in a list would create MultiIndex columns
features_df = pd.DataFrame(boston.data, columns=boston.feature_names)
features_df.head()
The above dataframe shows the first 5 rows from the dataset along with the column names.
# Get the shape of the features
features_df.shape
This shows that the features form a pandas DataFrame with 506 rows and 13 columns, i.e. 506 samples with 13 features each.
# Describe the Dataset
features_df.describe()
The first thing to check in the dataset is whether it is complete. As the description above shows, every feature has the same count, and that count matches the number of rows, so no feature has missing values.
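The count row reported by `describe()` counts the non-null entries in each column, so it can be cross-checked with an explicit missing-value count. Below is a minimal sketch on a made-up frame (`toy_df` is hypothetical, not the Boston data) with one value deliberately left out.

```python
import numpy as np
import pandas as pd

# Made-up toy frame (not the Boston data) with one deliberately missing value.
toy_df = pd.DataFrame({
    "CRIM": [0.1, 0.3, np.nan, 0.2],
    "RM": [6.5, 5.9, 7.1, 6.0],
})

# describe() reports the number of non-null entries per column in its "count" row.
counts = toy_df.describe().loc["count"]

# isna().sum() counts the missing entries per column directly.
missing_per_column = toy_df.isna().sum()

print(counts)
print(missing_per_column)
```

On the real data, the same check would be `features_df.isna().sum()`, which should report zero for every column if the counts are all equal to the number of rows.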
# Get Feature Correlation to each other
features_df.corr()
The above line gives the pairwise correlation between the features. This step is not required, but it is useful to see how the features relate to each other. A correlation greater than 0 means the two features tend to move in the same direction, i.e. if one increases, the other tends to increase as well. A correlation less than 0 means they have an inverse relationship, i.e. if one increases, the other tends to decrease. The strength of the relationship is given by the magnitude: values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.
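How the sign and size of a correlation coefficient behave can be seen on a tiny made-up example. The series below are hypothetical and chosen so the results are easy to reason about. They use the default Pearson correlation, which measures linear relationships only.

```python
import pandas as pd

# Made-up values, symmetric around zero, used only for illustration.
x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])

same_direction = 2 * x + 1      # rises whenever x rises
opposite_direction = -3 * x     # falls whenever x rises
curved = x ** 2                 # depends on x, but not linearly

r_pos = x.corr(same_direction)       # close to +1: perfect positive linear relationship
r_neg = x.corr(opposite_direction)   # close to -1: perfect negative linear relationship
r_zero = x.corr(curved)              # close to 0: no linear relationship

print(r_pos, r_neg, r_zero)
```

The last case shows that a correlation near 0 rules out a linear relationship only. `curved` is fully determined by `x`, yet the coefficient is zero.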
# Labels
labels_df = pd.DataFrame(np.array(boston.target), columns=['labels'])
labels_df.head()
The above line prints the first five labels for the dataset.
labels_df.shape
The labels have a shape of (506, 1), i.e. 506 rows and 1 column, one value per row of the features.
# Combined Data
combined_data = pd.concat([features_df,labels_df], axis=1)
combined_data.head()
# Find Correlation of each feature w.r.t Labels
# CRIM
combined_data['CRIM'].corr(combined_data['labels'])
# ZN
combined_data['ZN'].corr(combined_data['labels'])
# INDUS
combined_data['INDUS'].corr(combined_data['labels'])
# CHAS
combined_data['CHAS'].corr(combined_data['labels'])
# NOX
combined_data['NOX'].corr(combined_data['labels'])
# RM
combined_data['RM'].corr(combined_data['labels'])
# AGE
combined_data['AGE'].corr(combined_data['labels'])
# DIS
combined_data['DIS'].corr(combined_data['labels'])
# RAD
combined_data['RAD'].corr(combined_data['labels'])
# TAX
combined_data['TAX'].corr(combined_data['labels'])
# PTRATIO
combined_data['PTRATIO'].corr(combined_data['labels'])
# B
combined_data['B'].corr(combined_data['labels'])
# LSTAT
combined_data['LSTAT'].corr(combined_data['labels'])
The above lines show the correlation of each feature with the labels. Positive values mean the house price tends to rise as the feature increases, whereas negative values mean the price tends to fall as the feature increases. Correlation alone does not show that a feature causes the price to change.
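The per-feature calls above can also be collapsed into a single expression. The sketch below uses `DataFrame.corrwith` on a made-up frame. `toy` and its values are hypothetical, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up values for illustration; not taken from the Boston data.
toy = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "LSTAT": [20.0, 15.0, 9.0, 4.0],
    "labels": [15.0, 21.0, 30.0, 41.0],
})

# Correlate every feature column with the labels in one call,
# then sort from most negative to most positive.
feature_corr = toy.drop(columns="labels").corrwith(toy["labels"]).sort_values()

# The same numbers can be read from the "labels" column of the full correlation matrix.
from_matrix = toy.corr()["labels"].drop("labels")

print(feature_corr)
```

Applied to the notebook's data, this would be `combined_data.drop(columns='labels').corrwith(combined_data['labels'])`, assuming `combined_data` has plain (non-MultiIndex) column names.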
# Combined Correlation
combined_data.corr()