Project Title: Personal Bank Loan Modelling

YHills Internship Project

By: Aashish Bansal
Email: aashish22bansal@gmail.com
The dataset contains data on 5,000 customers.
The aim is to build a model that identifies customers with a higher probability of purchasing a personal loan. The output column is Personal Loan.

1. Connecting to Google Drive¶

Import the dataset and libraries, then check datatypes, the statistical summary, shape, null values, etc.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

2. Importing the required Libraries¶

In [2]:
#Importing libraries
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

3. Importing Dataset¶

I have used the Bank Personal Loan Modelling dataset.

The dataset is a CSV file.

Pandas' read_csv function is used to read the data.

In [3]:
#Importing the dataset
df = pd.read_csv("/content/drive/MyDrive/Project - YHills - Personal Loan Modelling/dataset/Bank_Personal_Loan_Modelling.csv")

3.1 Dataset Description¶

There are 12 input features plus the target column Personal Loan, detailed below:

Feature            | Description
-------------------|-------------------------------------------------------------
Age                | Customer's age
Experience         | Number of years of professional experience
Income             | Annual income of the customer
ZIP Code           | Home address ZIP code
Family             | Family size of the customer
CCAvg              | Average spending on credit cards per month
Education          | Education level (1: Undergrad, 2: Graduate, 3: Advanced/Professional)
Mortgage           | Value of house mortgage, if any
Securities Account | Does the customer have a securities account with the bank?
CD Account         | Does the customer have a certificate of deposit (CD) account with the bank?
Online             | Does the customer use internet banking facilities?
CreditCard         | Does the customer use a credit card issued by UniversalBank?
Personal Loan      | Did this customer accept the personal loan offered in the last campaign? (target)

4. Basic Data Analysis¶

4.1 Checking some rows in the dataset¶

In [4]:
#To display top 5 rows
df.head()
Out[4]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [5]:
#To display bottom 5 rows
df.tail()
Out[5]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

4.2 Highlighting the highest and lowest values with a colour gradient¶

In [6]:
df.head(10).style.background_gradient(cmap="PuBuGn")
Out[6]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 91107 4 1.600000 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.500000 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.000000 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.700000 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.000000 2 0 0 0 0 0 1
5 6 37 13 29 92121 4 0.400000 2 155 0 0 0 1 0
6 7 53 27 72 91711 2 1.500000 2 0 0 0 0 1 0
7 8 50 24 22 93943 1 0.300000 3 0 0 0 0 0 1
8 9 35 10 81 90089 3 0.600000 2 104 0 0 0 1 0
9 10 34 9 180 93023 1 8.900000 3 0 1 0 0 0 0

4.3 Check the types of the Data¶

In [7]:
# To find the dtypes in the DataFrame of each columns
df.dtypes
Out[7]:
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal Loan           int64
Securities Account      int64
CD Account              int64
Online                  int64
CreditCard              int64
dtype: object

5. Statistical Analysis¶

5.1 Viewing some basic Statistical details¶

In [8]:
df.describe()
Out[8]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.104600 73.774200 93152.503000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 1443.520003 11.463166 11.467954 46.033729 2121.852197 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 1.000000 23.000000 -3.000000 8.000000 9307.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000
In [9]:
# Transpose of df.describe()
df.describe().T
Out[9]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIP Code 5000.0 93152.503000 2121.852197 9307.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

We can observe that Experience has some negative values; we will handle these during data cleaning.

5.2 Checking the shape of DataFrame¶

In [10]:
# To check the Dimensionality of the DataFrame
df.shape
Out[10]:
(5000, 14)

5.3 Checking for Null values¶

In [11]:
# To check the total null values
df.isnull().sum()
Out[11]:
ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64

6. Exploratory Data Analysis¶

6.1 Checking the number of Unique Elements in each column¶

In [12]:
df.nunique()
Out[12]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIP Code               467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal Loan            2
Securities Account       2
CD Account               2
Online                   2
CreditCard               2
dtype: int64

ZIP Code has 467 distinct values. It is a nominal variable with no ordinal meaning, so it should not help the prediction; we will drop the ZIP Code column.

In [13]:
# Drop the ZIP Code column (note: without inplace=True or reassignment, this only returns a preview and df itself is unchanged)
df.drop(['ZIP Code'], axis = 1) 
Out[13]:
ID Age Experience Income Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 4 1.0 2 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 4996 29 3 40 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 3 0.8 1 0 0 0 0 1 1

5000 rows × 13 columns
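Note that df.drop without inplace=True (or reassignment) returns a new DataFrame and leaves df itself unchanged, which is why ZIP Code still appears in the outputs below. If the drop were meant to persist, a minimal sketch (not run here, since the rest of this notebook keeps the column) would be:

# Reassign the result so the drop persists
df = df.drop(['ZIP Code'], axis=1)
# or, equivalently, drop in place:
# df.drop(['ZIP Code'], axis=1, inplace=True)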

6.2 Checking Specific Details¶

6.2.1 Checking the Number of people with zero mortgage¶

In [14]:
df[df['Mortgage'] == 0]['Mortgage'].value_counts()
Out[14]:
0    3462
Name: Mortgage, dtype: int64

6.2.2 Checking the Number of people with zero Credit Card spending per month¶

In [15]:
df[df['CCAvg'] == 0]['CCAvg'].value_counts()
Out[15]:
0.0    106
Name: CCAvg, dtype: int64

6.2.3 Obtaining Value counts of Family column¶

In [16]:
df['Family'].value_counts()
Out[16]:
1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64

6.2.4 Obtaining Value counts of Securities Account column¶

In [17]:
df['Securities Account'].value_counts()
Out[17]:
0    4478
1     522
Name: Securities Account, dtype: int64

6.2.5 Obtaining Value counts of CD Account column¶

In [18]:
df['CD Account'].value_counts()
Out[18]:
0    4698
1     302
Name: CD Account, dtype: int64

6.2.6 Obtaining Value counts of Credit Card column¶

In [19]:
df['CreditCard'].value_counts()
Out[19]:
0    3530
1    1470
Name: CreditCard, dtype: int64

6.2.7 Obtaining Value counts of Education column¶

In [20]:
df['Education'].value_counts()
Out[20]:
1    2096
3    1501
2    1403
Name: Education, dtype: int64

6.2.8 Obtaining Value counts of Online column¶

In [21]:
# Value counts of the Online column
df['Online'].value_counts()
Out[21]:
1    2984
0    2016
Name: Online, dtype: int64

6.3 Plotting different Attributes¶

6.3.1 Counting the Age Attribute¶

In [22]:
age = df['Age'].value_counts().head(25)
ax = age.plot.bar(width=.9,color="Green") 
plt.title("Age",size=20)
plt.xlabel("Age")
plt.ylabel("Count")
for i, v in age.reset_index().iterrows():
    ax.text(i, v.Age + 1.5, v.Age, color='green',rotation=90)

6.3.2 Checking the Experience Attribute¶

In [23]:
experience = df['Experience'].value_counts().head(25)
ax = experience.plot.bar(width=.9,color="Purple") 
plt.title("Experience",size=20)
plt.xlabel("Experience")
plt.ylabel("Count")
for i, v in experience.reset_index().iterrows():
    ax.text(i, v.Experience + 1.5, v.Experience, color='Purple',rotation=90)

6.3.3 Checking the Income Attribute¶

In [24]:
Income = df['Income'].value_counts().head(25)
ax = Income.plot.bar(width=.9,color="Indigo") 
plt.title("Income",size=20)
plt.xlabel("Income in Dollar")
plt.ylabel("Count")
for i, v in Income.reset_index().iterrows():
    ax.text(i, v.Income + 1.5, v.Income, color='Indigo',rotation=90)

6.3.4 Checking the Income range of granted Loans¶

In [25]:
maxIncome = df.loc[(df['Personal Loan']==1),'Income'].max()
minIncome = df.loc[(df['Personal Loan']==1),'Income'].min()
print(maxIncome)
print(minIncome)
203
60
In [26]:
sns.scatterplot(x="Personal Loan", y="Income", data=df, hue="Personal Loan")
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f032775a190>

6.3.5 Comparing Experience with respect to Age Attribute¶

In [27]:
sns.barplot(x="Age", y="Experience", data=df, ci=None)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f03278d0090>

6.3.6 Comparing Income with respect to Experience¶

In [28]:
sns.scatterplot(x="Experience", y="Income", data=df, color='black')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f032712d950>

6.3.7 Comparing Mortgage with respect to Income¶

In [29]:
sns.scatterplot(x="Income", y="Mortgage", hue="Mortgage", data=df)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0326f9cf50>

6.3.8 Comparing Mortgage with respect to Personal Loan¶

In [30]:
sns.stripplot(y="Mortgage", x="Personal Loan", data=df)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0328eb1550>

6.3.9 Comparing Mortgage with respect to Education¶

In [31]:
sns.relplot(x="Education", y="Mortgage", data=df, hue="Education")
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x7f032562f7d0>

6.3.10 Comparing Family Size with respect to Mortgage¶

In [32]:
sns.relplot(x="Mortgage", y="Family", data=df, hue="Family")
Out[32]:
<seaborn.axisgrid.FacetGrid at 0x7f0325657e50>

6.3.11 Checking the granted loans with respect to Family¶

In [33]:
family = df.Family[df['Personal Loan']==1].value_counts().sort_index()
ax = family.plot.bar(width=.9,color="Gray") 
plt.title("Family VS Personal Loan",size=20)
plt.xlabel("Family")
plt.ylabel("Count of taken Personal Loan")
for i, v in family.reset_index().iterrows():
    ax.text(i, v.Family + 1.5, v.Family, color='Indigo')

7. Data Cleaning¶

7.1 Checking if any Column has Null Values¶

In [34]:
df.isnull().sum()
Out[34]:
ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64

7.2 Checking if any Column has Negative values¶

In [35]:
df.lt(0).any()
Out[35]:
ID                    False
Age                   False
Experience             True
Income                False
ZIP Code              False
Family                False
CCAvg                 False
Education             False
Mortgage              False
Personal Loan         False
Securities Account    False
CD Account            False
Online                False
CreditCard            False
dtype: bool

7.3 Dropping Irrelevant Columns¶

The ID variable does not add any useful information: there is no association between a customer's ID and loan uptake, and it offers no generalisable signal about future potential loan customers, so we can exclude it from the model.

In [36]:
# To check the counts of negative values in experience column
df[df['Experience'] < 0]['Experience'].count()
Out[36]:
52
In [37]:
#To check the distribution of negative values
df[df['Experience'] < 0]['Experience'].value_counts()
Out[37]:
-1    33
-2    15
-3     4
Name: Experience, dtype: int64

Since the Experience column is highly correlated with Age (and contains the negative values found above), we drop the Experience column.
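Before dropping, a quick hedged check of the pairwise correlation confirms this claim:

# Pearson correlation between Age and Experience (expected to be close to 1)
print(df['Age'].corr(df['Experience']))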

In [38]:
# Dropping the ID and Experience column
df.drop(['ID','Experience'],axis=1,inplace=True)
In [39]:
#To display top 5 rows
df.head()
Out[39]:
Age Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 25 49 91107 4 1.6 1 0 0 1 0 0 0
1 45 34 90089 3 1.5 1 0 0 1 0 0 0
2 39 11 94720 1 1.0 1 0 0 0 0 0 0
3 35 100 94112 1 2.7 2 0 0 0 0 0 0
4 35 45 91330 4 1.0 2 0 0 0 0 0 1
In [40]:
#To display bottom 5 rows
df.tail()
Out[40]:
Age Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
4995 29 40 92697 1 1.9 3 0 0 0 0 1 0
4996 30 15 92037 4 0.4 1 85 0 0 0 1 0
4997 63 24 93023 2 0.3 3 0 0 0 0 0 0
4998 65 49 90034 3 0.5 2 0 0 0 0 1 0
4999 28 83 92612 3 0.8 1 0 0 0 0 1 1
In [41]:
#To check the names of each column
df.columns
Out[41]:
Index(['Age', 'Income', 'ZIP Code', 'Family', 'CCAvg', 'Education', 'Mortgage',
       'Personal Loan', 'Securities Account', 'CD Account', 'Online',
       'CreditCard'],
      dtype='object')

8. Univariate Analysis¶

8.1 Checking Age Distribution¶

In [42]:
sns.distplot(df["Age"])
plt.show()

So, Age is approximately normally distributed.

8.2 Checking Income Distribution¶

In [43]:
sns.distplot(df["Income"])
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f03255e8110>

So, Income has a right-skewed distribution (mean above median, with a long tail toward high incomes).

8.3 Checking Mortgage Distribution¶

In [44]:
sns.distplot(df["Mortgage"])
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f03255e8250>

So, Mortgage is highly skewed.

8.4 Checking Credit Card Average Distribution¶

In [45]:
sns.distplot(df["CCAvg"])
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031cc5fd50>

So, Credit Card Average also has a right-skewed distribution.

We will apply transformations to the Income, CCAvg, and Mortgage variables, since heavily skewed features can hurt logistic regression.
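As a quick hedged check, pandas' skew() quantifies the asymmetry of these columns (positive values indicate a right tail):

# Skewness of the three variables flagged above
print(df[['Income', 'CCAvg', 'Mortgage']].skew())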

8.5 Checking Family Distribution¶

In [46]:
# Count Plot to show Family Distributions
sns.countplot(x='Family',data=df)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031cbc79d0>

8.6 Checking Education Distribution¶

In [47]:
# Count Plot to show Education Distributions
sns.countplot(x='Education',data=df)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f03291176d0>

8.7 Checking Credit Card Distribution¶

In [48]:
# Count Plot to show Credit Card Distribution
sns.countplot(x='CreditCard',data=df)
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031cb1c3d0>

8.8 Checking Online Distributions¶

In [49]:
#  Count Plot to show Online Distributions
sns.countplot(x='Online',data=df)
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031ca8cc90>

9. Multivariate Analysis¶

9.1 Checking Influence of income and education on personal loan¶

In [50]:
sns.boxplot(x='Education',y='Income',hue='Personal Loan',data=df)
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031c9ec4d0>

Observation: Customers with education level 1 appear to have higher incomes overall. However, among customers who took the personal loan, income levels are similar across education levels.

9.2 Obtaining the count of Securities Account¶

In [51]:
sns.countplot(x="Securities Account", data=df,hue="Personal Loan")
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031c9557d0>

Observation: The majority of customers, whether or not they took the loan, do not hold a securities account.

9.3 Checking the count of Loan granted with respect to Family Size¶

In [52]:
sns.countplot(x='Family',data=df,hue='Personal Loan')
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031c89aa50>

Observation: Family size does not have a strong impact on personal loan uptake on its own, but families of size 3 appear somewhat more likely to take the loan; this could be a useful association for a future campaign.

9.4 Obtaining the count of CD Account¶

In [53]:
sns.countplot(x='CD Account',data=df,hue='Personal Loan')
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f031c8405d0>

Observation: Most customers have neither a CD account nor a loan. However, almost all customers who do have a CD account also have a loan.

9.5 Checking Correlation between Credit Card Average and Income¶

In [54]:
# Correlation heatmap; CCAvg (credit card average) and Income are highly correlated
fig, ax = plt.subplots(figsize=(12,10))
sns.heatmap(df.corr(), cmap='afmhot' , annot = True)
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f03271bd310>
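To pull the single coefficient named in the heading out of the full matrix, a hedged one-liner:

# Pearson correlation between monthly credit card spend and income
print(df['Income'].corr(df['CCAvg']))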

10. Overall Attributes Distributions¶

In [55]:
sns.pairplot(df)
Out[55]:
<seaborn.axisgrid.PairGrid at 0x7f031cc6db90>

11. Apply necessary transformations for the feature variables¶

In [56]:
data_X = df.loc[:, df.columns  != 'Personal Loan']
data_Y = df[['Personal Loan']]
In [57]:
data_X
Out[57]:
Age Income ZIP Code Family CCAvg Education Mortgage Securities Account CD Account Online CreditCard
0 25 49 91107 4 1.6 1 0 1 0 0 0
1 45 34 90089 3 1.5 1 0 1 0 0 0
2 39 11 94720 1 1.0 1 0 0 0 0 0
3 35 100 94112 1 2.7 2 0 0 0 0 0
4 35 45 91330 4 1.0 2 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ...
4995 29 40 92697 1 1.9 3 0 0 0 1 0
4996 30 15 92037 4 0.4 1 85 0 0 1 0
4997 63 24 93023 2 0.3 3 0 0 0 0 0
4998 65 49 90034 3 0.5 2 0 0 0 1 0
4999 28 83 92612 3 0.8 1 0 0 0 1 1

5000 rows × 11 columns

In [58]:
data_Y
Out[58]:
Personal Loan
0 0
1 0
2 0
3 0
4 0
... ...
4995 0
4996 0
4997 0
4998 0
4999 0

5000 rows × 1 columns

11.1 Applying the Yeo-Johnson Transformation to the Income variable¶

In [59]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson',standardize=False)
pt.fit(data_X['Income'].values.reshape(-1,1))
temp = pt.transform(data_X['Income'].values.reshape(-1,1))
data_X['Income'] = pd.Series(temp.flatten())
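As a side check, the fitted transformer exposes the lambda it estimated by maximum likelihood; a hedged sketch using the pt object above:

# Estimated Yeo-Johnson lambda for the Income column
print(pt.lambdas_)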
In [60]:
# Distplot to show transformed Income variable
sns.distplot(data_X['Income'])
plt.show()

11.2 Applying the Yeo-Johnson Transformation to the Credit Card Average variable¶

In [61]:
pt = PowerTransformer(method='yeo-johnson',standardize=False)
pt.fit(data_X['CCAvg'].values.reshape(-1,1))
temp = pt.transform(data_X['CCAvg'].values.reshape(-1,1))
data_X['CCAvg'] = pd.Series(temp.flatten())
In [62]:
# Distplot to show transformed CCAvg variable
sns.distplot(data_X['CCAvg'])
plt.show()

11.3 Binning the Mortgage variable¶

In [63]:
data_X['Mortgage_Int'] = pd.cut(data_X['Mortgage'],
                               bins=[0,100,200,300,400,500,600,700],
                               labels= [0,1,2,3,4,5,6],
                               include_lowest =True)
data_X.drop('Mortgage', axis = 1, inplace= True)
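A hedged sanity check on the binning, confirming how observations fall into the new bins:

# Customers per mortgage bin (bin 0 covers 0-100, bin 1 covers 100-200, ...)
print(data_X['Mortgage_Int'].value_counts().sort_index())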

11.4 Target Class Distribution¶

In [64]:
# 9.6% of all applicants accepted the personal loan
tempDF = pd.DataFrame(df['Personal Loan'].value_counts()).reset_index()
tempDF.columns = ['Labels', 'Personal Loan']
fig1, ax1 = plt.subplots(figsize=(10,8))
explode = (0, 0.15)
# Label the slices with the class values (0 = no loan, 1 = loan)
ax1.pie(tempDF['Personal Loan'], labels=tempDF['Labels'], explode=explode, autopct='%1.1f%%',
        shadow=True, startangle=70)
ax1.axis('equal')
plt.title('Personal Loan Percentage')
plt.show()
In [65]:
# To display top 5 rows
data_X.head()
Out[65]:
Age Income ZIP Code Family CCAvg Education Securities Account CD Account Online CreditCard Mortgage_Int
0 25 6.827583 91107 4 0.845160 1 1 0 0 0 0
1 45 5.876952 90089 3 0.814478 1 1 0 0 0 0
2 39 3.504287 94720 1 0.633777 1 0 0 0 0 0
3 35 8.983393 94112 1 1.107427 2 0 0 0 0 0
4 35 6.597314 91330 4 0.633777 2 0 0 0 1 0

12. Splitting the Dataset¶

In [66]:
from sklearn.model_selection import train_test_split

The dataset will be split into training and testing sets in a 70:30 ratio.
We use the stratify parameter of the train_test_split function to get the same class distribution across the train and test sets.

In [67]:
# Splitting the data into train and test. 
X_train,X_test,Y_train,Y_test = train_test_split(data_X,data_Y,test_size = 0.3, random_state = 0,stratify = data_Y)
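A hedged check that stratification preserved the roughly 9.6% positive rate in both splits:

# Class proportions should match across the full data, train, and test sets
print(data_Y['Personal Loan'].value_counts(normalize=True))
print(Y_train['Personal Loan'].value_counts(normalize=True))
print(Y_test['Personal Loan'].value_counts(normalize=True))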
In [68]:
X_train
Out[68]:
Age Income ZIP Code Family CCAvg Education Securities Account CD Account Online CreditCard Mortgage_Int
3789 51 5.058173 94301 3 0.322049 1 0 0 1 1 0
758 64 5.948841 90266 1 0.814478 2 1 0 0 0 0
2868 52 5.651776 94923 4 0.902279 1 0 0 1 1 0
2550 32 4.661500 93106 1 0.384645 3 0 0 1 0 1
2150 62 7.097040 91320 1 0.544710 1 1 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ...
3597 56 6.937650 92028 3 0.954467 3 0 0 1 0 0
4670 52 11.394571 94305 1 0.874387 1 0 0 1 0 0
988 63 5.728502 94998 1 0.928941 2 0 0 0 0 0
2037 35 6.991517 95616 2 0.633777 2 0 0 0 1 0
2174 30 9.691160 95605 2 1.179285 1 0 0 1 0 0

3500 rows × 11 columns

In [69]:
X_test
Out[69]:
Age Income ZIP Code Family CCAvg Education Securities Account CD Account Online CreditCard Mortgage_Int
9 34 11.100150 93023 1 1.722825 3 0 0 0 0 0
461 55 8.302424 92123 2 1.271937 1 1 0 0 0 0
3700 48 9.831967 94608 1 1.497521 1 1 0 0 0 0
1559 59 9.049404 92677 4 1.162177 2 0 0 1 0 1
4558 44 8.341020 95521 2 0.322049 1 0 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ...
2180 58 6.414718 91380 2 0.845160 3 0 0 1 0 0
3484 45 7.044639 92104 3 1.067713 2 0 0 0 0 1
2965 53 5.651776 91605 2 0.322049 3 0 0 0 1 1
2493 34 6.827583 94025 1 1.067713 3 0 0 0 0 0
3224 45 7.299875 94025 3 0.253539 3 1 0 1 0 0

1500 rows × 11 columns

In [70]:
Y_train
Out[70]:
Personal Loan
3789 0
758 0
2868 0
2550 0
2150 0
... ...
3597 0
4670 0
988 0
2037 0
2174 0

3500 rows × 1 columns

In [71]:
Y_test
Out[71]:
Personal Loan
9 1
461 0
3700 0
1559 1
4558 0
... ...
2180 0
3484 0
2965 0
2493 0
3224 0

1500 rows × 1 columns

In [72]:
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
Y_train.reset_index(drop=True, inplace=True)
Y_test.reset_index(drop=True, inplace=True)
In [73]:
X_train
Out[73]:
Age Income ZIP Code Family CCAvg Education Securities Account CD Account Online CreditCard Mortgage_Int
0 51 5.058173 94301 3 0.322049 1 0 0 1 1 0
1 64 5.948841 90266 1 0.814478 2 1 0 0 0 0
2 52 5.651776 94923 4 0.902279 1 0 0 1 1 0
3 32 4.661500 93106 1 0.384645 3 0 0 1 0 1
4 62 7.097040 91320 1 0.544710 1 1 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ...
3495 56 6.937650 92028 3 0.954467 3 0 0 1 0 0
3496 52 11.394571 94305 1 0.874387 1 0 0 1 0 0
3497 63 5.728502 94998 1 0.928941 2 0 0 0 0 0
3498 35 6.991517 95616 2 0.633777 2 0 0 0 1 0
3499 30 9.691160 95605 2 1.179285 1 0 0 1 0 0

3500 rows × 11 columns

In [74]:
X_test
Out[74]:
Age Income ZIP Code Family CCAvg Education Securities Account CD Account Online CreditCard Mortgage_Int
0 34 11.100150 93023 1 1.722825 3 0 0 0 0 0
1 55 8.302424 92123 2 1.271937 1 1 0 0 0 0
2 48 9.831967 94608 1 1.497521 1 1 0 0 0 0
3 59 9.049404 92677 4 1.162177 2 0 0 1 0 1
4 44 8.341020 95521 2 0.322049 1 0 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ...
1495 58 6.414718 91380 2 0.845160 3 0 0 1 0 0
1496 45 7.044639 92104 3 1.067713 2 0 0 0 0 1
1497 53 5.651776 91605 2 0.322049 3 0 0 0 1 1
1498 34 6.827583 94025 1 1.067713 3 0 0 0 0 0
1499 45 7.299875 94025 3 0.253539 3 1 0 1 0 0

1500 rows × 11 columns

In [75]:
Y_train
Out[75]:
Personal Loan
0 0
1 0
2 0
3 0
4 0
... ...
3495 0
3496 0
3497 0
3498 0
3499 0

3500 rows × 1 columns

In [76]:
Y_test
Out[76]:
Personal Loan
0 1
1 0
2 0
3 1
4 0
... ...
1495 0
1496 0
1497 0
1498 0
1499 0

1500 rows × 1 columns

13. Standardizing the Dataset¶

In [77]:
from sklearn.preprocessing import StandardScaler

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

We will apply the StandardScaler to standardize the input variables, fitting it on the training data only so that no information from the test set leaks into the scaling.

In [78]:
for ind, column in enumerate(X_train.columns):
    scaler = StandardScaler()
    
    #fit to train data
    scaler.fit(X_train[[column]])
    
    #transform train data
    np_array = scaler.transform(X_train[[column]])
    X_train.loc[: , column] = pd.Series(np_array.flatten())
    
    #transform test data
    np_array = scaler.transform(X_test[[column]])
    X_test.loc[: , column] = pd.Series(np_array.flatten())
    
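An equivalent, more idiomatic sketch fits a single scaler on all training columns at once; since pd.cut produced a categorical Mortgage_Int column, it is cast to int first (an assumption about the dtype at this point):

# Cast the binned column so every feature is numeric
X_train['Mortgage_Int'] = X_train['Mortgage_Int'].astype(int)
X_test['Mortgage_Int'] = X_test['Mortgage_Int'].astype(int)

# Fit on the training data only, then apply the same statistics to both splits
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)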

14. Logistic Regression¶

In [79]:
from sklearn.linear_model import LogisticRegression

14.1 Creating the Logistic Regression Model¶

In [80]:
model_LR = LogisticRegression(random_state = 0)

14.2 Training the Model¶

In [81]:
model_LR.fit(X_train, Y_train)
Out[81]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

14.3 Evaluating the Model¶

In [82]:
from sklearn.metrics import confusion_matrix, recall_score , precision_score , f1_score , accuracy_score, roc_auc_score

14.3.1 Predicting from the Model¶

In [84]:
X_test_predictions_model_LR = model_LR.predict(X_test)
X_test_predictions_model_LR
Out[84]:
array([1, 0, 0, ..., 0, 0, 0])

14.3.2 Obtaining the Accuracy of the Model on Training Data¶

In [86]:
# Accuracy of train data
model_LR.score(X_train, Y_train)
Out[86]:
0.9568571428571429

14.3.3 Obtaining the Accuracy of the Model on Testing Data¶

In [87]:
# Accuracy of test data
model_LR.score(X_test,Y_test)
Out[87]:
0.9546666666666667

14.3.4 Obtaining the Confusion Matrix¶

In [88]:
# Helper to plot a labelled confusion matrix heatmap
def Confusion_Matrix(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    fig, ax = plt.subplots(figsize=(8,6))
    # fmt='d' renders the integer counts; the stray ax.set_ylim([0,5]) that
    # mis-sized the 2x2 heatmap has been removed
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=[0,1], yticklabels=[0,1])
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
In [89]:
Y_test.shape
Out[89]:
(1500, 1)
In [90]:
# The helper draws the plot itself and returns nothing, so call it directly
Confusion_Matrix(Y_test, X_test_predictions_model_LR)
In [91]:
from sklearn.metrics import confusion_matrix
confusion_matrix_model_LR = confusion_matrix(Y_test,X_test_predictions_model_LR.reshape(-1,1))
confusion_matrix_model_LR
Out[91]:
array([[1338,   18],
       [  50,   94]])

14.4 Printing all metrics for evaluating the model performance¶

In [92]:
from sklearn.metrics import classification_report
print(classification_report(Y_test,X_test_predictions_model_LR))
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      1356
           1       0.84      0.65      0.73       144

    accuracy                           0.95      1500
   macro avg       0.90      0.82      0.85      1500
weighted avg       0.95      0.95      0.95      1500

14.5 Obtaining the ROC-AUC Score¶

In [93]:
print("Roc Auc Score: ", roc_auc_score(Y_test,X_test_predictions_model_LR))
Roc Auc Score:  0.819751720747296

For Logistic Regression we get 95% accuracy on the test data and an F1 score of 0.73 for the positive class. Now let's compare these values with other models.

15. Random Forest Classifier¶

Random forest is an ensemble machine learning algorithm.

It is among the most popular and widely used machine learning algorithms, given its strong performance across a wide range of classification and regression problems.

It works in four steps:

1) Select random samples from the given dataset.

2) Construct a decision tree for each sample and get a prediction from each tree.

3) Perform a vote over the predicted results.

4) Select the prediction with the most votes as the final prediction.

15.1 Creating the Random Forest Model¶

In [94]:
from sklearn.ensemble import RandomForestClassifier
model_RF = RandomForestClassifier(n_estimators=500, max_depth=8)

15.2 Training the Model¶

In [95]:
model_RF.fit(X_train, Y_train)
Out[95]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=8, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

15.3 Evaluating the Model¶

In [96]:
X_test_predictions_model_RF = model_RF.predict(X_test)
X_test_predictions_model_RF
Out[96]:
array([1, 0, 0, ..., 0, 0, 0])

15.3.1 Obtaining the Accuracy of the Model on Training Data¶

In [97]:
# Accuracy of train data
model_RF.score(X_train, Y_train)
Out[97]:
0.9945714285714286

15.3.2 Obtaining the Accuracy of the Model on Testing Data¶

In [98]:
# Accuracy of test data
model_RF.score(X_test,Y_test)
Out[98]:
0.9873333333333333

15.3.3 Obtaining the Confusion Matrix¶

In [99]:
# Plot the confusion matrix for the Random Forest predictions
Confusion_Matrix(Y_test, X_test_predictions_model_RF)
In [100]:
from sklearn.metrics import confusion_matrix
confusion_matrix_RF = confusion_matrix(Y_test,X_test_predictions_model_RF.reshape(-1,1))
confusion_matrix_RF
Out[100]:
array([[1354,    2],
       [  17,  127]])

15.4 Printing all metrics for evaluating the model performance¶

In [101]:
from sklearn.metrics import classification_report
print(classification_report(Y_test,X_test_predictions_model_RF))
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1356
           1       0.98      0.88      0.93       144

    accuracy                           0.99      1500
   macro avg       0.99      0.94      0.96      1500
weighted avg       0.99      0.99      0.99      1500

15.5 Obtaining the ROC-AUC Score¶

In [103]:
print("ROC AUC Score: ", roc_auc_score(Y_test,X_test_predictions_model_RF))
ROC AUC Score:  0.9402347590953786

The ROC AUC score and F1 score are higher than those of the Logistic Regression model.
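As a hedged follow-up, the fitted forest's feature importances show which variables drive these predictions:

# Mean decrease in impurity per feature, highest first
importances = pd.Series(model_RF.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))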

16. Decision Tree Classifier¶

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

In [104]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

16.1 Creating the Model¶

In [105]:
model_DT = DecisionTreeClassifier(random_state=0, max_depth=8)

16.2 Training the Model¶

In [106]:
model_DT.fit(X_train, Y_train)
Out[106]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=8, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

16.3 Evaluating the Model¶

In [107]:
X_test_predictions_model_DT = model_DT.predict(X_test)
X_test_predictions_model_DT
Out[107]:
array([1, 0, 0, ..., 0, 0, 0])

16.3.1 Obtaining the Accuracy of the Model on Training Data¶

In [108]:
# Accuracy of train data
model_DT.score(X_train, Y_train)
Out[108]:
0.996

16.3.2 Obtaining the Accuracy of the Model on Testing Data¶

In [109]:
# Accuracy of test data
model_DT.score(X_test,Y_test)
Out[109]:
0.98

16.3.3 Obtaining the Confusion Matrix¶

In [110]:
# Plot the confusion matrix for the Decision Tree predictions
Confusion_Matrix(Y_test, X_test_predictions_model_DT)
In [111]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test,X_test_predictions_model_DT.reshape(-1,1))
cm
Out[111]:
array([[1344,   12],
       [  18,  126]])

16.4 Printing all metrics for evaluating the model performance¶

In [112]:
from sklearn.metrics import classification_report
print(classification_report(Y_test,X_test_predictions_model_DT))
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1356
           1       0.91      0.88      0.89       144

    accuracy                           0.98      1500
   macro avg       0.95      0.93      0.94      1500
weighted avg       0.98      0.98      0.98      1500

16.5 Obtaining the ROC-AUC Score¶

In [113]:
print("Roc Auc Score: ", roc_auc_score(Y_test,X_test_predictions_model_DT))
Roc Auc Score:  0.933075221238938

17. Naive Bayes¶

Bayes' Theorem provides a way to calculate the probability of a piece of data belonging to a given class, given our prior knowledge. It is stated as:

P(class|data) = (P(data|class) * P(class)) / P(data)
where P(class|data) is the probability of the class given the observed data.
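A tiny numeric illustration of the formula, using this dataset's prior P(class) = 0.096 from the statistical summary above; the likelihood and evidence values are hypothetical, chosen only to show the arithmetic:

p_class = 0.096            # P(class): prior probability of accepting the loan
p_data_given_class = 0.40  # P(data|class): hypothetical likelihood of the evidence
p_data = 0.10              # P(data): hypothetical marginal probability of the evidence

# P(class|data) = P(data|class) * P(class) / P(data)
print(p_data_given_class * p_class / p_data)  # 0.384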

17.1 Creating the Model¶

In [114]:
from sklearn.naive_bayes import GaussianNB
model_NB = GaussianNB()

17.2 Training the Model¶

In [115]:
model_NB.fit(X_train,Y_train)
Out[115]:
GaussianNB(priors=None, var_smoothing=1e-09)

17.3 Evaluating the Model¶

In [116]:
X_test_predictions_model_NB = model_NB.predict(X_test)
X_test_predictions_model_NB
Out[116]:
array([1, 0, 0, ..., 0, 0, 0])

17.3.1 Obtaining the Accuracy of the Model on Training Data¶

In [117]:
# Accuracy of train data
model_NB.score(X_train, Y_train)
Out[117]:
0.9105714285714286

17.3.2 Obtaining the Accuracy of the Model on Testing Data¶

In [118]:
# Accuracy of test data
model_NB.score(X_test,Y_test)
Out[118]:
0.9153333333333333

17.3.3 Obtaining the Confusion Matrix¶

In [119]:
# Plot the confusion matrix for the Naive Bayes predictions
Confusion_Matrix(Y_test, X_test_predictions_model_NB)
In [120]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test,X_test_predictions_model_NB.reshape(-1,1))
cm
Out[120]:
array([[1294,   62],
       [  65,   79]])

17.4 Printing all metrics for evaluating the Model Performance¶

In [121]:
from sklearn.metrics import classification_report
print(classification_report(Y_test,X_test_predictions_model_NB))
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      1356
           1       0.56      0.55      0.55       144

    accuracy                           0.92      1500
   macro avg       0.76      0.75      0.75      1500
weighted avg       0.91      0.92      0.91      1500

17.5 Obtaining the ROC-AUC Score¶

In [122]:
print("Roc Auc Score: ", roc_auc_score(Y_test,X_test_predictions_model_NB))
Roc Auc Score:  0.7514441986234022

18. Results obtained in the Model¶

In the first step of this project we imported the libraries and the data. Then we explored the dataset and found the following:

1) The task is to build a model that predicts whether a person will take a personal loan or not.
2) Age and Experience are highly correlated, so we dropped the Experience column.
3) ID and ZIP Code are not contributing factors for taking a loan; ID was dropped (note the ZIP Code drop cell returned a copy without reassignment, so that column actually remained in the features).
4) The Income and CCAvg columns were right-skewed, so we applied a power transformation to normalise them.
5) The Mortgage column was also skewed, but since it is dominated by zeros and discrete in nature, we used binning rather than a power transformation.

After this we used several models to make predictions (a consolidated comparison sketch follows the list):

  1. Logistic Regression
  2. Random Forest Classifier
  3. Decision Tree Classifier
  4. Naive Bayes
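A hedged sketch that assembles the test-set metrics from the four fitted models into one table (it assumes the prediction arrays defined earlier are still in scope):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

predictions = {
    'Logistic Regression': X_test_predictions_model_LR,
    'Random Forest': X_test_predictions_model_RF,
    'Decision Tree': X_test_predictions_model_DT,
    'Naive Bayes': X_test_predictions_model_NB,
}
# One row per model: test accuracy, F1 for the positive class, and ROC AUC
comparison = pd.DataFrame({
    name: {'Accuracy': accuracy_score(Y_test, pred),
           'F1 (class 1)': f1_score(Y_test, pred),
           'ROC AUC': roc_auc_score(Y_test, pred)}
    for name, pred in predictions.items()
}).T
print(comparison)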

1. Logistic Regression


ACCURACY SCORE (test): 95.47%

CONFUSION MATRIX: [[1338, 18], [ 50, 94]]

CLASSIFICATION REPORT: precision recall f1-score support

                0       0.96      0.99      0.98      1356
                1       0.84      0.65      0.73       144

         accuracy                           0.95      1500
        macro avg       0.90      0.82      0.85      1500
     weighted avg       0.95      0.95      0.95      1500



2. Random Forest Classifier


ACCURACY SCORE (test): 98.73%

CONFUSION MATRIX: [[1354, 2], [ 17, 127]]

CLASSIFICATION REPORT: precision recall f1-score support

                 0       0.99      1.00      0.99      1356
                 1       0.98      0.88      0.93       144

          accuracy                           0.99      1500
          macro avg       0.99      0.94      0.96      1500
      weighted avg       0.99      0.99      0.99      1500


3. Decision Tree Classifier


ACCURACY SCORE (test): 98.00%

CONFUSION MATRIX: [[1344, 12], [ 18, 126]]

CLASSIFICATION REPORT: precision recall f1-score support

                  0       0.99      0.99      0.99      1356
                  1       0.91      0.88      0.89       144

           accuracy                           0.98      1500
          macro avg       0.95      0.93      0.94      1500
       weighted avg       0.98      0.98      0.98      1500


4. Naive Bayes


ACCURACY SCORE (test): 91.53%

CONFUSION MATRIX: [[1294, 62], [ 65, 79]]

CLASSIFICATION REPORT: precision recall f1-score support

                   0       0.95      0.95      0.95      1356
                   1       0.56      0.55      0.55       144

            accuracy                           0.92      1500
           macro avg       0.76      0.75      0.75      1500
        weighted avg       0.91      0.92      0.91      1500


Conclusion

Four classification algorithms were used in this project. The Random Forest classifier achieved the highest test accuracy (98.73%) and the best F1 and ROC AUC scores, so we choose it as our final model.