Titanic Case Analysis
Titanic: Machine Learning from Disaster
Predict survival on the Titanic
- Defining the problem statement
- Collecting the data
- Exploratory data analysis
- Feature engineering
- Modelling
- Testing
1. Defining the problem statement:
We conduct an analysis of what sorts of people were likely to survive, applying the tools of machine learning to predict which passengers survived the tragedy.
from IPython.display import Image
Image(url= "https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/5095eabce4b06cb305058603/5095eabce4b02d37bef4c24c/1352002236895/100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8.jpg")
2. Collecting the Data:
The data comes from Kaggle, which provides a training set and a testing set. A link to the data can be found on my GitHub: https://github.com/Haalibrahim/Titanic-case-study
Load libraries, train and test data
#Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
3. Exploratory data analysis:
Data Dictionary:
- Survived: 0 = No, 1 = Yes
- Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
- SibSp: Number of siblings/spouses aboard the Titanic
- Parch: Number of parents/children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
train.head(10)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
train.shape
(891, 12)
test.shape
(418, 11)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train.isna().sum() # = train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
test.isna().sum()
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
#Let us explore the data: (pay attention to the age column)
train.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
test.describe()
PassengerId | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|
count | 418.000000 | 418.000000 | 332.000000 | 418.000000 | 418.000000 | 417.000000 |
mean | 1100.500000 | 2.265550 | 30.272590 | 0.447368 | 0.392344 | 35.627188 |
std | 120.810458 | 0.841838 | 14.181209 | 0.896760 | 0.981429 | 55.907576 |
min | 892.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
25% | 996.250000 | 1.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
50% | 1100.500000 | 3.000000 | 27.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1204.750000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.500000 |
max | 1309.000000 | 3.000000 | 76.000000 | 8.000000 | 9.000000 | 512.329200 |
Bar Chart for Categorical Features
- Pclass
- Sex
- SibSp (# of siblings and spouses)
- Parch (# of parents and children)
- Embarked
- Cabin
# Count of passengers who survived:
train['Survived'].value_counts()
0 549
1 342
Name: Survived, dtype: int64
def bar_chart(feature):
    # Stacked bar chart of a feature's value counts, split by survival outcome
    survived = train[train['Survived']==1][feature].value_counts()
    dead = train[train['Survived']==0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived','Dead']
    df.plot(kind='bar', stacked=True, figsize=(10,5))
bar_chart('Sex')
Women were more likely to survive than men.
bar_chart('Pclass')
The chart confirms that 1st-class passengers were more likely to survive than the other classes, and that 3rd-class passengers were more likely to die than the other classes.
bar_chart('SibSp')
The chart suggests that a passenger who boarded with more than two siblings or a spouse was more likely to survive, while a passenger who boarded without siblings or a spouse was more likely to die.
bar_chart('Parch')
The chart suggests that a passenger who boarded with more than two parents or children was more likely to survive, while a passenger who boarded alone was more likely to die.
bar_chart('Embarked')
- The chart suggests that a passenger who embarked at C was slightly more likely to survive
- The chart suggests that a passenger who embarked at Q was more likely to die
- The chart suggests that a passenger who embarked at S was more likely to die
4. Feature engineering
- The purpose of feature engineering is to prepare the data for machine learning algorithms by creating feature vectors.
- A feature vector is an n-dimensional vector that represents some object; a small sketch follows below.
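As a small illustration (a sketch, not part of the pipeline), the first passenger of the engineered training set becomes an 8-dimensional numeric vector once every attribute has been encoded; the values here are taken from row 0 of train_data.head() further below:
import numpy as np
# [Pclass, Sex, Age, Fare, Cabin, Embarked, Title, FamilySize] after encoding
passenger = np.array([3, 0, 1.0, 0.0, 2.0, 0, 0, 0.4])
passenger.shape  # (8,) -- one 8-dimensional feature vector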
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
4.1 How the Titanic sank
- The ship sank from the bow, where the third-class cabins were located
- Conclusion: Pclass is a key feature for the classifier
Image(url= "https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/t/5090b249e4b047ba54dfd258/1351660113175/TItanic-Survival-Infographic.jpg?format=1500w")
4.2 Name
train_test_data = [train, test] # process the train and test datasets together
for dataset in train_test_data:
    dataset['Title'] = dataset.Name.str.extract(r'([A-Za-z]+)\.') # extract the title ('initial') from each name
train['Title'].value_counts()
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Major 2
Col 2
Jonkheer 1
Don 1
Lady 1
Sir 1
Mme 1
Ms 1
Capt 1
Countess 1
Name: Title, dtype: int64
train['Title'].value_counts().sum()
891
test['Title'].value_counts()
Mr 240
Miss 78
Mrs 72
Master 21
Rev 2
Col 2
Dr 1
Dona 1
Ms 1
Name: Title, dtype: int64
test['Title'].value_counts().sum()
418
pd.crosstab(train.Title,train.Sex).T.style.background_gradient(cmap='summer_r')
Title | Capt | Col | Countess | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sex | |||||||||||||||||
female | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 182 | 2 | 1 | 0 | 125 | 1 | 0 | 0 |
male | 1 | 2 | 0 | 1 | 6 | 1 | 0 | 2 | 40 | 0 | 0 | 0 | 517 | 0 | 0 | 6 | 1 |
pd.crosstab(test.Title,test.Sex).T.style.background_gradient(cmap='summer_r')
Title | Col | Dona | Dr | Master | Miss | Mr | Mrs | Ms | Rev |
---|---|---|---|---|---|---|---|---|---|
Sex | |||||||||
female | 0 | 1 | 0 | 0 | 78 | 0 | 72 | 1 | 0 |
male | 2 | 0 | 1 | 21 | 0 | 240 | 0 | 0 | 2 |
train['Title'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess',
'Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss',
'Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
test['Title'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess',
'Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss',
'Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
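Note: the test set also contains the title 'Dona' (see the crosstab above), which is missing from this replacement list, so it survives here and becomes NaN after the numeric title mapping below (test.info() later shows only 417 non-null Title values). A one-line fix, assuming 'Dona' should be grouped with 'Mrs'; the outputs below reflect the run without it:
test['Title'].replace('Dona', 'Mrs', inplace=True)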
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Mr |
train['Title'].value_counts()
Mr 529
Miss 186
Mrs 127
Master 40
Other 9
Name: Title, dtype: int64
train['Title'].value_counts().sum()
891
test['Title'].value_counts()
Mr 241
Miss 79
Mrs 72
Master 21
Other 4
Dona 1
Name: Title, dtype: int64
test['Title'].value_counts().sum()
418
# delete unnecessary feature from dataset
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr |
1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs |
2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Miss |
3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Mrs |
4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Mr |
test.head()
PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | Mr |
1 | 893 | 3 | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | Mrs |
2 | 894 | 2 | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | Mr |
3 | 895 | 3 | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | Mr |
4 | 896 | 3 | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | Mrs |
pd.crosstab(train.Title,train.Sex).T.style.background_gradient(cmap='summer_r')
Title | Master | Miss | Mr | Mrs | Other |
---|---|---|---|---|---|
Sex | |||||
female | 0 | 186 | 1 | 127 | 0 |
male | 40 | 0 | 528 | 0 | 9 |
bar_chart('Title')
4.3 Age
print('Oldest passenger age:', train['Age'].max())
print('Youngest passenger age:', train['Age'].min())
print('Average passenger age:', train['Age'].mean())
Oldest passenger age: 80.0
Youngest passenger age: 0.42
Average passenger age: 29.69911764705882
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot('Pclass','Age',hue='Survived',data=train,split=True,ax=ax[0])
ax[0].set_title('PClass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot("Sex","Age", hue="Survived", data=train,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()
train.groupby('Title')['Age'].mean()
Title
Master 4.574167
Miss 21.860000
Mr 32.739609
Mrs 35.981818
Other 45.888889
Name: Age, dtype: float64
train.loc[(train.Age.isnull()) & (train.Title=='Mr'),'Age']=33
train.loc[(train.Age.isnull()) & (train.Title=='Mrs'),'Age']=36
train.loc[(train.Age.isnull()) & (train.Title=='Master'),'Age']=5
train.loc[(train.Age.isnull()) & (train.Title=='Miss'),'Age']=22
train.loc[(train.Age.isnull()) & (train.Title=='Other'),'Age']=46
train.Age.isnull().any()
False
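The imputation above is only applied to the training set; test.info() further down still shows 332 non-null Age values. A sketch of the matching fill for the test set, reusing the same title-wise means (the outputs below reflect the run without it):
test.loc[(test.Age.isnull()) & (test.Title=='Mr'),'Age']=33
test.loc[(test.Age.isnull()) & (test.Title=='Mrs'),'Age']=36
test.loc[(test.Age.isnull()) & (test.Title=='Master'),'Age']=5
test.loc[(test.Age.isnull()) & (test.Title=='Miss'),'Age']=22
test.loc[(test.Age.isnull()) & (test.Title=='Other'),'Age']=46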
Title map
- Mr : 0
- Miss : 1
- Mrs: 2
- Master: 3
- Other: 4
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2,
"Master": 3, "Other":4 }
for dataset in train_test_data:
dataset['Title'] = dataset['Title'].map(title_mapping)
bar_chart('Title')
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2 |
4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
test.head()
PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0.0 |
1 | 893 | 3 | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 2.0 |
2 | 894 | 2 | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0.0 |
3 | 895 | 3 | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0.0 |
4 | 896 | 3 | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 2.0 |
f,ax=plt.subplots(1,2,figsize=(20,20))
train[train['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived = 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
train[train['Survived']==1].Age.plot.hist(ax=ax[1],bins=20,edgecolor='black',color='green')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
ax[1].set_title('Survived = 1')
plt.show()
facet =sns.FacetGrid(train, hue='Survived',aspect=4)
facet.map(sns.kdeplot,'Age',shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.show()
grid = sns.FacetGrid(train, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
for dataset in train_test_data:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 26), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 36), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 36) & (dataset['Age'] <= 62), 'Age'] = 3
    dataset.loc[dataset['Age'] > 62, 'Age'] = 4
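An equivalent, more compact way to express the same binning (a sketch; the loop above is what was actually run) uses pd.cut with the same edges:
for dataset in train_test_data:
    # bins (-1,16], (16,26], (26,36], (36,62], (62,120] match the loop above
    dataset['Age'] = pd.cut(dataset['Age'], bins=[-1, 16, 26, 36, 62, 120], labels=[0, 1, 2, 3, 4]).astype(float)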
bar_chart("Age")
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | male | 1.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 2 | 1 | 1 | female | 3.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
2 | 3 | 1 | 3 | female | 1.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
3 | 4 | 1 | 1 | female | 2.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2 |
4 | 5 | 0 | 3 | male | 2.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
4.4 Sex
train.groupby(['Sex', 'Survived'])['Survived'].count()
Sex Survived
female 0 81
1 233
male 0 468
1 109
Name: Survived, dtype: int64
train.groupby('Sex')[['Survived']].mean()
Survived | |
---|---|
Sex | |
female | 0.742038 |
male | 0.188908 |
train[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()
sns.countplot('Sex',hue='Survived',data=train,)
plt.show()
sex_mapping = {"male": 0, "female": 1}
for dataset in train_test_data:
dataset['Sex'] = dataset['Sex'].map(sex_mapping)
bar_chart('Sex')
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 2 | 1 | 1 | 1 | 3.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
2 | 3 | 1 | 3 | 1 | 1.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
3 | 4 | 1 | 1 | 1 | 2.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2 |
4 | 5 | 0 | 3 | 0 | 2.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null int64
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
Title 891 non-null int64
dtypes: float64(2), int64(7), object(3)
memory usage: 83.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Sex 418 non-null int64
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
Title 417 non-null float64
dtypes: float64(3), int64(5), object(3)
memory usage: 36.0+ KB
4.5 Pclass
sns.countplot('Pclass', hue='Survived', data=train)
plt.title('Pclass: Survived vs Dead')
plt.show()
pd.crosstab([train.Sex,train.Survived],train.Pclass,margins=True).style.background_gradient(cmap='summer_r')
Pclass | 1 | 2 | 3 | All | |
---|---|---|---|---|---|
Sex | Survived | ||||
0 | 0 | 77 | 91 | 300 | 468 |
1 | 45 | 17 | 47 | 109 | |
1 | 0 | 3 | 6 | 72 | 81 |
1 | 91 | 70 | 72 | 233 | |
All | 216 | 184 | 491 | 891 |
sns.factorplot('Pclass', 'Survived', hue='Sex', data=train)
plt.show()
sns.factorplot('Pclass','Survived',col='Title',data=train)
plt.show()
4.6 SibSp
pd.crosstab([train.SibSp],train.Survived).style.background_gradient('summer_r')
Survived | 0 | 1 |
---|---|---|
SibSp | ||
0 | 398 | 210 |
1 | 97 | 112 |
2 | 15 | 13 |
3 | 12 | 4 |
4 | 15 | 3 |
5 | 5 | 0 |
8 | 7 | 0 |
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('SibSp','Survived', data=train,ax=ax[0])
ax[0].set_title('SibSp vs Survived in BarPlot')
sns.factorplot('SibSp','Survived', data=train,ax=ax[1])
ax[1].set_title('SibSp vs Survived in FactorPlot')
plt.close(2)
plt.show()
pd.crosstab(train.SibSp,train.Pclass).style.background_gradient('summer_r')
Pclass | 1 | 2 | 3 |
---|---|---|---|
SibSp | |||
0 | 137 | 120 | 351 |
1 | 71 | 55 | 83 |
2 | 5 | 8 | 15 |
3 | 3 | 1 | 12 |
4 | 0 | 0 | 18 |
5 | 0 | 0 | 5 |
8 | 0 | 0 | 7 |
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('Pclass','Survived', data=train,ax=ax[0])
ax[0].set_title('Pclass vs Survived in BarPlot')
sns.factorplot('Pclass','Survived', data=train,ax=ax[1])
ax[1].set_title('Pclass vs Survived in FactorPlot')
plt.close(2)
plt.show()
sns.countplot(train['Survived'],label="Count")
<matplotlib.axes._subplots.AxesSubplot at 0x1be652cdc48>
4.7 Embarked
FacetGrid = sns.FacetGrid(train, row='Embarked', size=4.5, aspect=1.6)
FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None, order=None, hue_order=None )
FacetGrid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1be65530cc8>
Pclass1 = train[train['Pclass']==1]['Embarked'].value_counts()
Pclass2 = train[train['Pclass']==2]['Embarked'].value_counts()
Pclass3 = train[train['Pclass']==3]['Embarked'].value_counts()
df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class','2nd class', '3rd class']
df.plot(kind='bar',stacked=True, figsize=(10,5))
<matplotlib.axes._subplots.AxesSubplot at 0x1be654a8388>
Fill missing Embarked values with 'S', the most common port of embarkation
for dataset in train_test_data:
dataset['Embarked'] = dataset['Embarked'].fillna('S')
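'S' is by far the most common port (as the stacked bar chart above shows), which is why it is hard-coded as the fill value. A sketch that derives the same fill value from the data instead:
for dataset in train_test_data:
    # mode()[0] evaluates to 'S' on this training set
    dataset['Embarked'] = dataset['Embarked'].fillna(train['Embarked'].mode()[0])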
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 2 | 1 | 1 | 1 | 3.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
2 | 3 | 1 | 3 | 1 | 1.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
3 | 4 | 1 | 1 | 1 | 2.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2 |
4 | 5 | 0 | 3 | 0 | 2.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
for dataset in train_test_data:
dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)
train.Embarked.isnull().any()
False
#Look at survival rate by sex, age and class (note: Age was already binned to 0-4 above, so only the (0, 18] interval appears below)
age = pd.cut(train['Age'], [0, 18, 80])
train.pivot_table('Survived', ['Sex', age], 'Pclass')
Pclass | 1 | 2 | 3 | |
---|---|---|---|---|
Sex | Age | |||
0 | (0, 18] | 0.352941 | 0.082474 | 0.114379 |
1 | (0, 18] | 0.977273 | 0.909091 | 0.486486 |
4.8 Fare price
# Fill missing Fare values with the median fare for each Pclass:
train['Fare'].fillna(train.groupby('Pclass')['Fare'].transform('median'), inplace= True)
test['Fare'].fillna(test.groupby('Pclass')['Fare'].transform('median'), inplace= True)
#Plot the prices paid in each class
plt.scatter(train['Fare'], train['Pclass'], color = 'purple', label='Passenger Paid')
plt.ylabel('Class')
plt.xlabel('Price / Fare')
plt.title('Price Of Each Class')
plt.legend()
plt.show()
facet =sns.FacetGrid(train, hue='Survived',aspect=4)
facet.map(sns.kdeplot,'Fare',shade=True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
plt.show()
facet =sns.FacetGrid(train, hue='Survived',aspect=4)
facet.map(sns.kdeplot,'Fare',shade=True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
plt.xlim(0,200)
(0, 200)
for dataset in train_test_data:
    dataset.loc[dataset['Fare'] <= 17, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 17) & (dataset['Fare'] <= 30), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 30) & (dataset['Fare'] <= 100), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 100, 'Fare'] = 3
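The cut points (17, 30, 100) roughly track the fare distribution in train.describe() above (median ≈ 14.45, 75th percentile = 31). A quantile-based alternative (a sketch, not what was run here) lets pandas pick the edges:
for dataset in train_test_data:
    dataset['Fare'] = pd.qcut(dataset['Fare'], q=4, labels=[0, 1, 2, 3]).astype(float)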
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1.0 | 1 | 0 | A/5 21171 | 0.0 | NaN | 0 | 0 |
1 | 2 | 1 | 1 | 1 | 3.0 | 1 | 0 | PC 17599 | 2.0 | C85 | 1 | 2 |
2 | 3 | 1 | 3 | 1 | 1.0 | 0 | 0 | STON/O2. 3101282 | 0.0 | NaN | 0 | 1 |
3 | 4 | 1 | 1 | 1 | 2.0 | 1 | 0 | 113803 | 2.0 | C123 | 0 | 2 |
4 | 5 | 0 | 3 | 0 | 2.0 | 0 | 0 | 373450 | 0.0 | NaN | 0 | 0 |
train.Fare.isnull().any()
False
4.9 Cabins
train.Cabin.value_counts()
B96 B98 4
C23 C25 C27 4
G6 4
E101 3
C22 C26 3
..
A26 1
B86 1
B4 1
A10 1
C110 1
Name: Cabin, Length: 147, dtype: int64
for dataset in train_test_data:
dataset['Cabin']=dataset['Cabin'].str[:1]
Pclass1 = train[train['Pclass']==1]['Cabin'].value_counts()
Pclass2 = train[train['Pclass']==2]['Cabin'].value_counts()
Pclass3 = train[train['Pclass']==3]['Cabin'].value_counts()
df=pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index= ['1st class','2nd class','3rd class']
df.plot(kind = 'bar', stacked = True, figsize =(10,5))
<matplotlib.axes._subplots.AxesSubplot at 0x1be673cf4c8>
cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2, "E": 1.6, "F": 2, "G": 2.4, "T": 2.8}
for dataset in train_test_data:
dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)
# fill missing Cabin values with the median cabin code for each Pclass
train["Cabin"].fillna(train.groupby("Pclass")["Cabin"].transform("median"), inplace=True)
test["Cabin"].fillna(test.groupby("Pclass")["Cabin"].transform("median"), inplace=True)
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1.0 | 1 | 0 | A/5 21171 | 0.0 | 2.0 | 0 | 0 |
1 | 2 | 1 | 1 | 1 | 3.0 | 1 | 0 | PC 17599 | 2.0 | 0.8 | 1 | 2 |
2 | 3 | 1 | 3 | 1 | 1.0 | 0 | 0 | STON/O2. 3101282 | 0.0 | 2.0 | 0 | 1 |
3 | 4 | 1 | 1 | 1 | 2.0 | 1 | 0 | 113803 | 2.0 | 0.8 | 0 | 2 |
4 | 5 | 0 | 3 | 0 | 2.0 | 0 | 0 | 373450 | 0.0 | 2.0 | 0 | 0 |
train.Embarked.isnull().any()
False
4.10 FamilySize
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
test["FamilySize"] = test["SibSp"] + test["Parch"] + 1
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'FamilySize',shade= True)
facet.set(xlim=(0, train['FamilySize'].max()))
facet.add_legend()
plt.xlim(0)
(0, 11.0)
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
for dataset in train_test_data:
dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | FamilySize | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1.0 | 1 | 0 | A/5 21171 | 0.0 | 2.0 | 0 | 0 | 0.4 |
1 | 2 | 1 | 1 | 1 | 3.0 | 1 | 0 | PC 17599 | 2.0 | 0.8 | 1 | 2 | 0.4 |
2 | 3 | 1 | 3 | 1 | 1.0 | 0 | 0 | STON/O2. 3101282 | 0.0 | 2.0 | 0 | 1 | 0.0 |
3 | 4 | 1 | 1 | 1 | 2.0 | 1 | 0 | 113803 | 2.0 | 0.8 | 0 | 2 | 0.4 |
4 | 5 | 0 | 3 | 0 | 2.0 | 0 | 0 | 373450 | 0.0 | 2.0 | 0 | 0 | 0.0 |
#Print the unique values in the columns
print(train['Sex'].unique())
print(train['Embarked'].unique())
[0 1]
[0 1 2]
5. Modeling
5.1 Supervised Learning
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import numpy as np
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null int64
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 891 non-null float64
Embarked 891 non-null int64
Title 891 non-null int64
FamilySize 891 non-null float64
dtypes: float64(4), int64(8), object(1)
memory usage: 90.6+ KB
features_drop = ['Ticket', 'SibSp', 'Parch']
train = train.drop(features_drop, axis=1)
test = test.drop(features_drop, axis=1)
train = train.drop(['PassengerId'], axis=1)
train_data = train.drop('Survived', axis=1)
target = train['Survived']
train_data.shape, target.shape
((891, 8), (891,))
train_data.head()
Pclass | Sex | Age | Fare | Cabin | Embarked | Title | FamilySize | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 0 | 1.0 | 0.0 | 2.0 | 0 | 0 | 0.4 |
1 | 1 | 1 | 3.0 | 2.0 | 0.8 | 1 | 2 | 0.4 |
2 | 3 | 1 | 1.0 | 0.0 | 2.0 | 0 | 1 | 0.0 |
3 | 1 | 1 | 2.0 | 2.0 | 0.8 | 0 | 2 | 0.4 |
4 | 3 | 0 | 2.0 | 0.0 | 2.0 | 0 | 0 | 0.0 |
target.head()
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int64
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1be6750f088>
# Split the dataset into 80% Training set and 20% Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(train_data, target, test_size = 0.2, random_state = 0)
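Since the classes are imbalanced (549 dead vs. 342 survived), stratifying the split keeps that ratio in both partitions; a sketch, not what was run here:
X_train, X_test, Y_train, Y_test = train_test_split(train_data, target, test_size=0.2, random_state=0, stratify=target)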
# Scale the data so all features are on a comparable magnitude
# StandardScaler standardizes each feature to zero mean and unit variance
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
def models(X_train,Y_train):
    #Fit Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(random_state = 0)
log.fit(X_train, Y_train)
    #Fit K-Nearest Neighbors using KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, Y_train)
    #Fit a linear-kernel Support Vector Machine using SVC
from sklearn.svm import SVC
svc_lin = SVC(kernel = 'linear', random_state = 0)
svc_lin.fit(X_train, Y_train)
    #Fit an RBF-kernel Support Vector Machine using SVC
from sklearn.svm import SVC
svc_rbf = SVC(kernel = 'rbf', random_state = 0)
svc_rbf.fit(X_train, Y_train)
    #Fit Gaussian Naive Bayes using GaussianNB
from sklearn.naive_bayes import GaussianNB
gauss = GaussianNB()
gauss.fit(X_train, Y_train)
    #Fit a Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
tree.fit(X_train, Y_train)
    #Fit a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
forest.fit(X_train, Y_train)
    #Print each model's accuracy on the training data
print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
print('[1]K Nearest Neighbor Training Accuracy:', knn.score(X_train, Y_train))
print('[2]Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, Y_train))
print('[3]Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, Y_train))
print('[4]Gaussian Naive Bayes Training Accuracy:', gauss.score(X_train, Y_train))
print('[5]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
print('[6]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))
return log, knn, svc_lin, svc_rbf, gauss, tree, forest
#Get and train all of the models
model = models(X_train,Y_train)
[0]Logistic Regression Training Accuracy: 0.8314606741573034
[1]K Nearest Neighbor Training Accuracy: 0.8553370786516854
[2]Support Vector Machine (Linear Classifier) Training Accuracy: 0.8286516853932584
[3]Support Vector Machine (RBF Classifier) Training Accuracy: 0.848314606741573
[4]Gaussian Naive Bayes Training Accuracy: 0.7823033707865169
[5]Decision Tree Classifier Training Accuracy: 0.901685393258427
[6]Random Forest Classifier Training Accuracy: 0.8932584269662921
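Training accuracy flatters flexible models: the decision tree scores 0.90 here but only about 0.79 on the held-out data below. A k-fold cross-validation sketch gives a less biased estimate for the classifiers returned by models():
from sklearn.model_selection import cross_val_score
for clf in model:
    # 10-fold cross-validated accuracy on the training partition
    scores = cross_val_score(clf, X_train, Y_train, cv=10, scoring='accuracy')
    print(type(clf).__name__, round(scores.mean(), 3))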
#Show the confusion matrix and accuracy for all of the models on the test data
#Classification accuracy is the ratio of correct predictions to total predictions made.
from sklearn.metrics import confusion_matrix
for i in range(len(model)):
cm = confusion_matrix(Y_test, model[i].predict(X_test))
#extracting true_positives, false_positives, true_negatives, false_negatives
TN, FP, FN, TP = confusion_matrix(Y_test, model[i].predict(X_test)).ravel()
print(cm)
print('Model[{}] Testing Accuracy = "{} !"'.format(i, (TP + TN) / (TP + TN + FN + FP)))
print()# Print a new line
[[92 18]
[17 52]]
Model[0] Testing Accuracy = "0.8044692737430168 !"
[[97 13]
[20 49]]
Model[1] Testing Accuracy = "0.8156424581005587 !"
[[91 19]
[17 52]]
Model[2] Testing Accuracy = "0.7988826815642458 !"
[[100 10]
[ 22 47]]
Model[3] Testing Accuracy = "0.8212290502793296 !"
[[82 28]
[ 7 62]]
Model[4] Testing Accuracy = "0.8044692737430168 !"
[[97 13]
[25 44]]
Model[5] Testing Accuracy = "0.7877094972067039 !"
[[100 10]
[ 23 46]]
Model[6] Testing Accuracy = "0.8156424581005587 !"
#Get Feature importance
forest=model[6]
importances = pd.DataFrame({'feature':train_data.iloc[:,0:8].columns, 'importance': np.round(forest.feature_importances_, 3)})
importances= importances.sort_values('importance', ascending=False).set_index('feature')
importances
importance | |
---|---|
feature | |
Title | 0.257 |
Sex | 0.135 |
FamilySize | 0.129 |
Age | 0.127 |
Cabin | 0.107 |
Fare | 0.097 |
Pclass | 0.084 |
Embarked | 0.064 |
#Visualize the importance
importances.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1be649b7b48>
#print the prediction of the random forest classifier
pred = model[6].predict(X_test)
print(pred)
print()
#print the actual value
print(Y_test)
[0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0
0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 1 0
1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1
1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0]
495 0
648 0
278 0
31 1
255 1
..
780 1
837 0
215 1
833 0
372 0
Name: Survived, Length: 179, dtype: int64
#my survival
#Pclass Sex Age Fare Cabin Embarked Title FamilySize
my_survival = [[2, 0, 3, 2, 0.8, 0, 0, 0.8]]
#scale with the scaler already fitted on the training data
#(fitting a fresh StandardScaler on a single row would standardize it to all zeros)
my_survival_scaled = sc.transform(my_survival)
#print prediction of my survival using Random forest calssifier:
pred= model[6].predict(my_survival_scaled)
print(pred)
if pred == 0:
    print('Oh no, you did not survive')
else:
print('nice! you survived!')
[1]
nice! you survived!
ROC and AUC using logistic regression as an example
#Fit Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(random_state = 0)
log.fit(X_train, Y_train)
Y_predict=log.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test,Y_predict))
pd.crosstab(Y_test, Y_predict)
0.8044692737430168
col_0 | 0 | 1 |
---|---|---|
Survived | ||
0 | 92 | 18 |
1 | 17 | 52 |
log.predict_proba(X_test)
array([[0.90562449, 0.09437551],
[0.92146315, 0.07853685],
[0.66984458, 0.33015542],
[0.03898146, 0.96101854],
[0.39961516, 0.60038484],
[0.56001427, 0.43998573],
[0.09451128, 0.90548872],
[0.13543067, 0.86456933],
[0.52243664, 0.47756336],
[0.21745097, 0.78254903],
[0.91920705, 0.08079295],
[0.31239561, 0.68760439],
[0.8869834 , 0.1130166 ],
[0.176777 , 0.823223 ],
[0.04009628, 0.95990372],
[0.22778632, 0.77221368],
[0.88025683, 0.11974317],
[0.81108283, 0.18891717],
[0.92146315, 0.07853685],
[0.3501963 , 0.6498037 ],
[0.73578675, 0.26421325],
[0.03936127, 0.96063873],
[0.8869834 , 0.1130166 ],
[0.56001427, 0.43998573],
[0.32009766, 0.67990234],
[0.07297558, 0.92702442],
[0.92146315, 0.07853685],
[0.32009766, 0.67990234],
[0.09101505, 0.90898495],
[0.67756724, 0.32243276],
[0.90562449, 0.09437551],
[0.31239561, 0.68760439],
[0.92146315, 0.07853685],
[0.56001427, 0.43998573],
[0.9478045 , 0.0521955 ],
[0.53010852, 0.46989148],
[0.94930607, 0.05069393],
[0.8163354 , 0.1836646 ],
[0.73578675, 0.26421325],
[0.87645115, 0.12354885],
[0.81178102, 0.18821898],
[0.85739412, 0.14260588],
[0.92146315, 0.07853685],
[0.96340222, 0.03659778],
[0.05396473, 0.94603527],
[0.92146315, 0.07853685],
[0.92146315, 0.07853685],
[0.17244067, 0.82755933],
[0.87645115, 0.12354885],
[0.80556426, 0.19443574],
[0.5941448 , 0.4058552 ],
[0.48225335, 0.51774665],
[0.176777 , 0.823223 ],
[0.88025683, 0.11974317],
[0.56475591, 0.43524409],
[0.81108283, 0.18891717],
[0.71688627, 0.28311373],
[0.72293216, 0.27706784],
[0.75205183, 0.24794817],
[0.93781784, 0.06218216],
[0.85739412, 0.14260588],
[0.60996825, 0.39003175],
[0.13191367, 0.86808633],
[0.51004073, 0.48995927],
[0.44867709, 0.55132291],
[0.8869834 , 0.1130166 ],
[0.24865623, 0.75134377],
[0.15615491, 0.84384509],
[0.176777 , 0.823223 ],
[0.06453911, 0.93546089],
[0.09101505, 0.90898495],
[0.75868937, 0.24131063],
[0.61702161, 0.38297839],
[0.92146315, 0.07853685],
[0.88025683, 0.11974317],
[0.27682631, 0.72317369],
[0.85032561, 0.14967439],
[0.67436994, 0.32563006],
[0.92146315, 0.07853685],
[0.78425929, 0.21574071],
[0.90693144, 0.09306856],
[0.50490792, 0.49509208],
[0.13417399, 0.86582601],
[0.82161595, 0.17838405],
[0.8505506 , 0.1494494 ],
[0.04500599, 0.95499401],
[0.07865182, 0.92134818],
[0.46394494, 0.53605506],
[0.11003162, 0.88996838],
[0.44746386, 0.55253614],
[0.68184598, 0.31815402],
[0.88025683, 0.11974317],
[0.22942292, 0.77057708],
[0.0746709 , 0.9253291 ],
[0.67756724, 0.32243276],
[0.90562449, 0.09437551],
[0.16189054, 0.83810946],
[0.96340222, 0.03659778],
[0.09570488, 0.90429512],
[0.48994053, 0.51005947],
[0.98438827, 0.01561173],
[0.26908461, 0.73091539],
[0.88025683, 0.11974317],
[0.94626097, 0.05373903],
[0.42887836, 0.57112164],
[0.39500171, 0.60499829],
[0.11403093, 0.88596907],
[0.75605068, 0.24394932],
[0.77567238, 0.22432762],
[0.32768611, 0.67231389],
[0.94930607, 0.05069393],
[0.06305814, 0.93694186],
[0.88306269, 0.11693731],
[0.32009766, 0.67990234],
[0.80401578, 0.19598422],
[0.22158155, 0.77841845],
[0.2478141 , 0.7521859 ],
[0.02291349, 0.97708651],
[0.94930607, 0.05069393],
[0.283029 , 0.716971 ],
[0.88306269, 0.11693731],
[0.8869834 , 0.1130166 ],
[0.88025683, 0.11974317],
[0.11654445, 0.88345555],
[0.92146315, 0.07853685],
[0.69177453, 0.30822547],
[0.90562449, 0.09437551],
[0.91920705, 0.08079295],
[0.8163354 , 0.1836646 ],
[0.22869217, 0.77130783],
[0.23949934, 0.76050066],
[0.88025683, 0.11974317],
[0.88025683, 0.11974317],
[0.42151289, 0.57848711],
[0.80164361, 0.19835639],
[0.88025683, 0.11974317],
[0.92393466, 0.07606534],
[0.45630075, 0.54369925],
[0.8980445 , 0.1019555 ],
[0.73578675, 0.26421325],
[0.8163354 , 0.1836646 ],
[0.07782203, 0.92217797],
[0.92146315, 0.07853685],
[0.23949934, 0.76050066],
[0.10297104, 0.89702896],
[0.32009766, 0.67990234],
[0.8163354 , 0.1836646 ],
[0.20322749, 0.79677251],
[0.06080351, 0.93919649],
[0.92146315, 0.07853685],
[0.75868937, 0.24131063],
[0.42903417, 0.57096583],
[0.54530973, 0.45469027],
[0.92146315, 0.07853685],
[0.12729888, 0.87270112],
[0.73578675, 0.26421325],
[0.40702062, 0.59297938],
[0.87645115, 0.12354885],
[0.23949934, 0.76050066],
[0.4177658 , 0.5822342 ],
[0.92146315, 0.07853685],
[0.85739412, 0.14260588],
[0.06930139, 0.93069861],
[0.5574304 , 0.4425696 ],
[0.90023128, 0.09976872],
[0.88025683, 0.11974317],
[0.94930607, 0.05069393],
[0.88025683, 0.11974317],
[0.88025683, 0.11974317],
[0.88025683, 0.11974317],
[0.94930607, 0.05069393],
[0.06930139, 0.93069861],
[0.92146315, 0.07853685],
[0.92146315, 0.07853685],
[0.19436312, 0.80563688],
[0.92146315, 0.07853685],
[0.07096286, 0.92903714],
[0.88025683, 0.11974317],
[0.88025683, 0.11974317]])
log.predict_proba(X_test)[:,1]
array([0.09437551, 0.07853685, 0.33015542, 0.96101854, 0.60038484,
0.43998573, 0.90548872, 0.86456933, 0.47756336, 0.78254903,
0.08079295, 0.68760439, 0.1130166 , 0.823223 , 0.95990372,
0.77221368, 0.11974317, 0.18891717, 0.07853685, 0.6498037 ,
0.26421325, 0.96063873, 0.1130166 , 0.43998573, 0.67990234,
0.92702442, 0.07853685, 0.67990234, 0.90898495, 0.32243276,
0.09437551, 0.68760439, 0.07853685, 0.43998573, 0.0521955 ,
0.46989148, 0.05069393, 0.1836646 , 0.26421325, 0.12354885,
0.18821898, 0.14260588, 0.07853685, 0.03659778, 0.94603527,
0.07853685, 0.07853685, 0.82755933, 0.12354885, 0.19443574,
0.4058552 , 0.51774665, 0.823223 , 0.11974317, 0.43524409,
0.18891717, 0.28311373, 0.27706784, 0.24794817, 0.06218216,
0.14260588, 0.39003175, 0.86808633, 0.48995927, 0.55132291,
0.1130166 , 0.75134377, 0.84384509, 0.823223 , 0.93546089,
0.90898495, 0.24131063, 0.38297839, 0.07853685, 0.11974317,
0.72317369, 0.14967439, 0.32563006, 0.07853685, 0.21574071,
0.09306856, 0.49509208, 0.86582601, 0.17838405, 0.1494494 ,
0.95499401, 0.92134818, 0.53605506, 0.88996838, 0.55253614,
0.31815402, 0.11974317, 0.77057708, 0.9253291 , 0.32243276,
0.09437551, 0.83810946, 0.03659778, 0.90429512, 0.51005947,
0.01561173, 0.73091539, 0.11974317, 0.05373903, 0.57112164,
0.60499829, 0.88596907, 0.24394932, 0.22432762, 0.67231389,
0.05069393, 0.93694186, 0.11693731, 0.67990234, 0.19598422,
0.77841845, 0.7521859 , 0.97708651, 0.05069393, 0.716971 ,
0.11693731, 0.1130166 , 0.11974317, 0.88345555, 0.07853685,
0.30822547, 0.09437551, 0.08079295, 0.1836646 , 0.77130783,
0.76050066, 0.11974317, 0.11974317, 0.57848711, 0.19835639,
0.11974317, 0.07606534, 0.54369925, 0.1019555 , 0.26421325,
0.1836646 , 0.92217797, 0.07853685, 0.76050066, 0.89702896,
0.67990234, 0.1836646 , 0.79677251, 0.93919649, 0.07853685,
0.24131063, 0.57096583, 0.45469027, 0.07853685, 0.87270112,
0.26421325, 0.59297938, 0.12354885, 0.76050066, 0.5822342 ,
0.07853685, 0.14260588, 0.93069861, 0.4425696 , 0.09976872,
0.11974317, 0.05069393, 0.11974317, 0.11974317, 0.11974317,
0.05069393, 0.93069861, 0.07853685, 0.07853685, 0.80563688,
0.07853685, 0.92903714, 0.11974317, 0.11974317])
np.where(log.predict_proba(X_test)[:,1] >0.3,1,0) #threshold 0.3
array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1,
0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,
1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
1, 0, 0])
th3= np.where(log.predict_proba(X_test)[:,1] >0.3,1,0) #threshold 0.3
th4=np.where(log.predict_proba(X_test)[:,1] >0.4,1,0) #threshold 0.4
th5=np.where(log.predict_proba(X_test)[:,1] >0.5,1,0) #threshold 0.5
th6=np.where(log.predict_proba(X_test)[:,1] >0.6,1,0) #threshold 0.6
th8=np.where(log.predict_proba(X_test)[:,1] >0.8,1,0) #threshold 0.8
pd.crosstab(Y_test, th3)
col_0 | 0 | 1 |
---|---|---|
Survived | ||
0 | 81 | 29 |
1 | 9 | 60 |
pd.crosstab(Y_test, th4)
col_0 | 0 | 1 |
---|---|---|
Survived | ||
0 | 89 | 21 |
1 | 9 | 60 |
pd.crosstab(Y_test, th5)
col_0 | 0 | 1 |
---|---|---|
Survived | ||
0 | 92 | 18 |
1 | 17 | 52 |
pd.crosstab(Y_test, th6)
col_0 | 0 | 1 |
---|---|---|
Survived | ||
0 | 97 | 13 |
1 | 23 | 46 |
def predict_thrs(log, X_test, thrs):
    # Predict class 1 when the predicted probability of survival exceeds the threshold
    Y_predict = np.where(log.predict_proba(X_test)[:,1] > thrs, 1, 0)
    return Y_predict
import numpy as np
from sklearn.metrics import confusion_matrix
for thr in np.arange(0,1.1,0.1):
Y_predict= predict_thrs(log, X_test,thr)
print("Threshold :", thr)
print(confusion_matrix(Y_test, Y_predict))
Threshold : 0.0
[[ 0 110]
[ 0 69]]
Threshold : 0.1
[[35 75]
[ 2 67]]
Threshold : 0.2
[[70 40]
[ 8 61]]
Threshold : 0.30000000000000004
[[81 29]
[ 9 60]]
Threshold : 0.4
[[89 21]
[ 9 60]]
Threshold : 0.5
[[92 18]
[17 52]]
Threshold : 0.6000000000000001
[[97 13]
[23 46]]
Threshold : 0.7000000000000001
[[100 10]
[ 30 39]]
Threshold : 0.8
[[106 4]
[ 38 31]]
Threshold : 0.9
[[109 1]
[ 50 19]]
Threshold : 1.0
[[110 0]
[ 69 0]]
from sklearn.metrics import roc_curve, roc_auc_score
fpr,tpr,thrs= roc_curve(Y_test, log.predict_proba(X_test)[:,1])
fpr
array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.00909091, 0.00909091, 0.01818182,
0.01818182, 0.02727273, 0.02727273, 0.03636364, 0.03636364,
0.03636364, 0.03636364, 0.05454545, 0.05454545, 0.08181818,
0.08181818, 0.09090909, 0.09090909, 0.09090909, 0.11818182,
0.11818182, 0.12727273, 0.12727273, 0.14545455, 0.14545455,
0.15454545, 0.15454545, 0.16363636, 0.16363636, 0.17272727,
0.17272727, 0.18181818, 0.18181818, 0.19090909, 0.19090909,
0.22727273, 0.24545455, 0.28181818, 0.31818182, 0.33636364,
0.34545455, 0.37272727, 0.37272727, 0.38181818, 0.4 ,
0.4 , 0.43636364, 0.46363636, 0.5 , 0.63636364,
0.63636364, 0.67272727, 0.69090909, 0.72727273, 0.73636364,
0.74545455, 0.9 , 0.90909091, 0.90909091, 0.92727273,
0.97272727, 0.99090909, 1. ])
tpr
array([0. , 0.01449275, 0.13043478, 0.15942029, 0.23188406,
0.26086957, 0.27536232, 0.27536232, 0.31884058, 0.31884058,
0.34782609, 0.34782609, 0.36231884, 0.36231884, 0.39130435,
0.43478261, 0.49275362, 0.49275362, 0.50724638, 0.50724638,
0.53623188, 0.53623188, 0.56521739, 0.5942029 , 0.60869565,
0.66666667, 0.66666667, 0.68115942, 0.68115942, 0.69565217,
0.69565217, 0.71014493, 0.71014493, 0.76811594, 0.76811594,
0.79710145, 0.79710145, 0.8115942 , 0.84057971, 0.86956522,
0.86956522, 0.86956522, 0.86956522, 0.86956522, 0.86956522,
0.88405797, 0.88405797, 0.89855072, 0.89855072, 0.89855072,
0.91304348, 0.91304348, 0.91304348, 0.94202899, 0.94202899,
0.97101449, 0.97101449, 0.97101449, 0.97101449, 0.97101449,
0.98550725, 0.98550725, 0.98550725, 1. , 1. ,
1. , 1. , 1. ])
thrs
array([1.97708651, 0.97708651, 0.93546089, 0.93069861, 0.92134818,
0.90898495, 0.90548872, 0.90429512, 0.88596907, 0.88345555,
0.86808633, 0.86582601, 0.86456933, 0.84384509, 0.82755933,
0.823223 , 0.77841845, 0.77130783, 0.77057708, 0.76050066,
0.75134377, 0.73091539, 0.716971 , 0.68760439, 0.67990234,
0.60038484, 0.59297938, 0.5822342 , 0.57112164, 0.57096583,
0.55253614, 0.55132291, 0.54369925, 0.49509208, 0.48995927,
0.46989148, 0.45469027, 0.4425696 , 0.43998573, 0.4058552 ,
0.32563006, 0.32243276, 0.27706784, 0.26421325, 0.24394932,
0.24131063, 0.19835639, 0.19598422, 0.19443574, 0.18891717,
0.18821898, 0.1836646 , 0.1494494 , 0.12354885, 0.11974317,
0.11693731, 0.1130166 , 0.09976872, 0.09437551, 0.09306856,
0.08079295, 0.07853685, 0.07606534, 0.06218216, 0.0521955 ,
0.05069393, 0.03659778, 0.01561173])
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,roc_auc_score
%matplotlib inline
fpr,tpr,thrs= roc_curve(Y_test, log.predict_proba(X_test)[:,1])
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.plot(fpr,tpr,color='red')
plt.show()
roc_auc_score(Y_test, log.predict_proba(X_test)[:,1])
0.8704216073781291
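One common way to pick an operating threshold from the ROC curve is Youden's J statistic, which maximizes tpr - fpr; a sketch using the fpr, tpr and thrs arrays computed above:
j = tpr - fpr
best = j.argmax()  # index of the threshold with the largest TPR-FPR gap
print('Best threshold:', thrs[best], 'TPR:', tpr[best], 'FPR:', fpr[best])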
5.2 Unsupervised Learning
5.2.1 K-Means
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline
df = pd.read_csv('Unsupervised.csv')
df.head()
Name | Age | Fare | |
---|---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 | 7.2500 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 |
2 | Heikkinen, Miss. Laina | 26.0 | 7.9250 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1000 |
4 | Allen, Mr. William Henry | 35.0 | 8.0500 |
df.isnull().sum()
Name 0
Age 177
Fare 0
dtype: int64
df['Initial'] = df.Name.str.extract(r'([A-Za-z]+)\.') # extract the title ('initial') from each name
pd.crosstab(df.Initial,train_data.Sex).T.style.background_gradient(cmap='summer_r')
Initial | Capt | Col | Countess | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sex | |||||||||||||||||
0 | 1 | 2 | 0 | 1 | 6 | 1 | 0 | 2 | 40 | 0 | 0 | 0 | 517 | 0 | 0 | 6 | 1 |
1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 182 | 2 | 1 | 0 | 125 | 1 | 0 | 0 |
df['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess',
'Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss',
'Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
df.groupby('Initial')['Age'].mean()
Initial
Master 4.574167
Miss 21.860000
Mr 32.739609
Mrs 35.981818
Other 45.888889
Name: Age, dtype: float64
df.loc[(df.Age.isnull()) & (df.Initial=='Mr'),'Age']=33
df.loc[(df.Age.isnull()) & (df.Initial=='Mrs'),'Age']=36
df.loc[(df.Age.isnull()) & (df.Initial=='Master'),'Age']=5
df.loc[(df.Age.isnull()) & (df.Initial=='Miss'),'Age']=22
df.loc[(df.Age.isnull()) & (df.Initial=='Other'),'Age']=46
df.Age.isnull().any()
False
plt.scatter(df.Age,df.Fare)
<matplotlib.collections.PathCollection at 0x1be6786f688>
km = KMeans(n_clusters=3)
km
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
y_predicted= km.fit_predict(df[['Age','Fare']])
y_predicted
array([1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1,
1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0,
1, 0, 0, 2, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,
1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
1, 0, 1, 2, 1, 1, 2, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 2, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 1, 1, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0,
1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1,
1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0, 2, 1, 1, 0, 1, 1,
0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
df['cluster']=y_predicted
df.head()
Name | Age | Fare | Initial | cluster | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 | 7.2500 | Mr | 1 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 | Mrs | 0 |
2 | Heikkinen, Miss. Laina | 26.0 | 7.9250 | Miss | 1 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1000 | Mrs | 0 |
4 | Allen, Mr. William Henry | 35.0 | 8.0500 | Mr | 1 |
%pylab inline
Populating the interactive namespace from numpy and matplotlib
df1=df[df.cluster==0]
df2=df[df.cluster==1]
df3=df[df.cluster==2]
plt.scatter(df1.Age,df1['Fare'],color = 'green', label='cluster 1')
plt.scatter(df2.Age,df2['Fare'],color = 'red', label='cluster 2')
plt.scatter(df3.Age,df3['Fare'],color = 'black', label='cluster 3')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.legend()
<matplotlib.legend.Legend at 0x1be67b69188>
df1=df[df.cluster==0]
df2=df[df.cluster==1]
df3=df[df.cluster==2]
plt.scatter(df1['Fare'],df1.Age,color = 'green', label='cluster 1')
plt.scatter(df2['Fare'],df2.Age,color = 'red', label='cluster 2')
plt.scatter(df3['Fare'],df3.Age,color = 'black', label='cluster 3')
plt.ylabel('Age')
plt.xlabel('Fare')
plt.legend()
<matplotlib.legend.Legend at 0x1be677f7fc8>
df
Name | Age | Fare | Initial | cluster | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 | 7.2500 | Mr | 2 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 | Mrs | 0 |
2 | Heikkinen, Miss. Laina | 26.0 | 7.9250 | Miss | 2 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1000 | Mrs | 0 |
4 | Allen, Mr. William Henry | 35.0 | 8.0500 | Mr | 2 |
... | ... | ... | ... | ... | ... |
886 | Montvila, Rev. Juozas | 27.0 | 13.0000 | Other | 2 |
887 | Graham, Miss. Margaret Edith | 19.0 | 30.0000 | Miss | 2 |
888 | Johnston, Miss. Catherine Helen "Carrie" | 22.0 | 23.4500 | Miss | 2 |
889 | Behr, Mr. Karl Howell | 26.0 | 30.0000 | Mr | 2 |
890 | Dooley, Mr. Patrick | 32.0 | 7.7500 | Mr | 2 |
891 rows × 5 columns
sc=MinMaxScaler()
sc.fit(df[['Age']])
df['Age']=sc.transform(df[['Age']])
df
Name | Age | Fare | Initial | cluster | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 0.271174 | 7.2500 | Mr | 1 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0.472229 | 71.2833 | Mrs | 0 |
2 | Heikkinen, Miss. Laina | 0.321438 | 7.9250 | Miss | 1 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0.434531 | 53.1000 | Mrs | 0 |
4 | Allen, Mr. William Henry | 0.434531 | 8.0500 | Mr | 1 |
... | ... | ... | ... | ... | ... |
886 | Montvila, Rev. Juozas | 0.334004 | 13.0000 | Other | 1 |
887 | Graham, Miss. Margaret Edith | 0.233476 | 30.0000 | Miss | 1 |
888 | Johnston, Miss. Catherine Helen "Carrie" | 0.271174 | 23.4500 | Miss | 1 |
889 | Behr, Mr. Karl Howell | 0.321438 | 30.0000 | Mr | 1 |
890 | Dooley, Mr. Patrick | 0.396833 | 7.7500 | Mr | 1 |
891 rows × 5 columns
sc=MinMaxScaler()
sc.fit(df[['Fare']])
df['Fare']=sc.transform(df[['Fare']])
df
Name | Age | Fare | Initial | cluster | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 0.271174 | 0.014151 | Mr | 1 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0.472229 | 0.139136 | Mrs | 0 |
2 | Heikkinen, Miss. Laina | 0.321438 | 0.015469 | Miss | 1 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0.434531 | 0.103644 | Mrs | 0 |
4 | Allen, Mr. William Henry | 0.434531 | 0.015713 | Mr | 1 |
... | ... | ... | ... | ... | ... |
886 | Montvila, Rev. Juozas | 0.334004 | 0.025374 | Other | 1 |
887 | Graham, Miss. Margaret Edith | 0.233476 | 0.058556 | Miss | 1 |
888 | Johnston, Miss. Catherine Helen "Carrie" | 0.271174 | 0.045771 | Miss | 1 |
889 | Behr, Mr. Karl Howell | 0.321438 | 0.058556 | Mr | 1 |
890 | Dooley, Mr. Patrick | 0.396833 | 0.015127 | Mr | 1 |
891 rows × 5 columns
km= KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Fare']])
y_predicted
array([2, 0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 1, 2, 0, 2, 1, 2, 0, 0, 0, 0, 0,
2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 2, 1, 0, 1, 0, 2, 2, 2, 0, 0, 0, 2,
2, 0, 0, 2, 0, 2, 2, 2, 1, 0, 1, 0, 2, 0, 2, 2, 2, 0, 1, 2, 0, 2,
0, 2, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0,
2, 2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2,
1, 2, 2, 2, 2, 2, 1, 0, 2, 2, 2, 0, 0, 0, 1, 2, 0, 2, 2, 1, 0, 2,
1, 0, 0, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 0, 1, 2, 1, 0,
0, 1, 2, 0, 0, 2, 1, 0, 0, 2, 2, 2, 0, 1, 0, 0, 1, 2, 2, 2, 1, 2,
2, 1, 0, 0, 2, 0, 2, 2, 2, 0, 0, 1, 0, 0, 0, 2, 2, 2, 1, 1, 0, 0,
2, 2, 0, 0, 0, 1, 2, 2, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0,
2, 0, 1, 0, 0, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 1, 2, 2, 0, 2, 2,
0, 2, 0, 1, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 1, 0,
2, 0, 2, 0, 1, 0, 0, 0, 0, 0, 2, 1, 1, 0, 2, 0, 1, 0, 2, 2, 0, 0,
0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 1, 2, 0, 2, 2, 0, 2, 2, 2,
0, 0, 2, 2, 0, 0, 1, 0, 2, 1, 0, 1, 2, 0, 0, 2, 0, 0, 1, 0, 0, 2,
2, 1, 1, 2, 0, 0, 0, 1, 1, 1, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0, 2, 0,
2, 0, 0, 0, 2, 0, 2, 2, 0, 0, 1, 0, 0, 0, 1, 0, 2, 2, 0, 2, 2, 2,
2, 0, 2, 0, 2, 2, 1, 2, 0, 0, 0, 2, 2, 0, 0, 2, 0, 2, 0, 2, 2, 2,
0, 1, 2, 0, 0, 0, 2, 0, 2, 0, 1, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2,
0, 2, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 2, 1, 2, 2, 2, 1, 0,
1, 2, 0, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,
1, 1, 0, 0, 0, 1, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 2, 2, 2, 0, 1, 1,
2, 2, 0, 1, 0, 2, 0, 2, 1, 1, 2, 0, 1, 0, 2, 2, 2, 2, 2, 0, 2, 2,
0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0,
0, 2, 2, 0, 2, 0, 0, 2, 1, 0, 0, 2, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
2, 0, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 1, 1,
0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 1, 2, 0, 0, 1, 1, 2,
0, 0, 2, 1, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 2, 1, 0, 0, 2, 0, 0, 2,
0, 0, 2, 0, 0, 1, 2, 2, 2, 1, 1, 2, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0,
0, 0, 2, 2, 2, 0, 2, 1, 2, 1, 0, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2, 1,
1, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2, 1, 1, 2, 0,
2, 2, 1, 0, 2, 2, 2, 2, 0, 2, 0, 0, 1, 1, 1, 2, 1, 0, 2, 0, 2, 0,
0, 0, 1, 0, 2, 2, 2, 0, 1, 0, 1, 2, 1, 0, 0, 0, 2, 2, 0, 1, 0, 2,
0, 2, 0, 0, 0, 2, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 2, 2, 0, 1, 2, 0,
2, 0, 2, 2, 0, 2, 1, 2, 0, 2, 0, 0, 0, 0, 2, 0, 2, 1, 0, 0, 0, 0,
2, 1, 1, 0, 1, 2, 0, 2, 0, 1, 2, 2, 0, 0, 0, 0, 2, 2, 2, 1, 0, 2,
2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2,
0, 0, 2, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 2, 0, 1, 2, 2, 0, 2, 2, 0,
2, 0, 0, 0, 2, 2, 0, 0, 2, 0, 0, 0, 0, 0, 2, 1, 2, 2, 1, 2, 1, 1,
2, 0, 0, 2, 1, 2, 2, 0, 0, 0, 0, 2, 0, 1, 0, 1, 0, 2, 2, 2, 0, 1,
0, 0, 2, 0, 0, 0, 0, 2, 2, 0, 0])
df['cluster']=y_predicted
df
Name | Age | Fare | Initial | cluster | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 0.271174 | 0.014151 | Mr | 2 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0.472229 | 0.139136 | Mrs | 0 |
2 | Heikkinen, Miss. Laina | 0.321438 | 0.015469 | Miss | 0 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0.434531 | 0.103644 | Mrs | 0 |
4 | Allen, Mr. William Henry | 0.434531 | 0.015713 | Mr | 0 |
... | ... | ... | ... | ... | ... |
886 | Montvila, Rev. Juozas | 0.334004 | 0.025374 | Other | 0 |
887 | Graham, Miss. Margaret Edith | 0.233476 | 0.058556 | Miss | 2 |
888 | Johnston, Miss. Catherine Helen "Carrie" | 0.271174 | 0.045771 | Miss | 2 |
889 | Behr, Mr. Karl Howell | 0.321438 | 0.058556 | Mr | 0 |
890 | Dooley, Mr. Patrick | 0.396833 | 0.015127 | Mr | 0 |
891 rows × 5 columns
km.cluster_centers_
array([[0.40370991, 0.04801379],
[0.64396415, 0.11667009],
[0.20482494, 0.05977556]])
df1=df[df.cluster==0]
df2=df[df.cluster==1]
df3=df[df.cluster==2]
plt.scatter(df1.Age,df1['Fare'],color = 'green', label='cluster 1')
plt.scatter(df2.Age,df2['Fare'],color = 'red', label='cluster 2')
plt.scatter(df3.Age,df3['Fare'],color = 'black', label='cluster 3')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.legend()
<matplotlib.legend.Legend at 0x1be66036e88>
df1=df[df.cluster==0]
df2=df[df.cluster==1]
df3=df[df.cluster==2]
plt.scatter(df1['Fare'],df1.Age,color = 'green', label='cluster 1')
plt.scatter(df2['Fare'],df2.Age,color = 'red', label='cluster 2')
plt.scatter(df3['Fare'],df3.Age,color = 'black', label='cluster 3')
plt.ylabel('Age')
plt.xlabel('Fare')
plt.legend()
<matplotlib.legend.Legend at 0x1be65296a08>
k_rng=range(1,10)
sse=[]
for k in k_rng:
km=KMeans(n_clusters=k)
km.fit(df[['Age','Fare']])
sse.append(km.inertia_)
sse
[33.163249933113455,
18.859245724888236,
13.110890535996486,
8.754991374324431,
6.378184765822988,
5.303403333391413,
4.352438977975344,
3.759949542306482,
3.227046378111474]
plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng,sse)
[<matplotlib.lines.Line2D at 0x1be651dcd08>]
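Besides the elbow plot, the silhouette score gives another check on the choice of k; a sketch over the same scaled Age and Fare columns:
from sklearn.metrics import silhouette_score
for k in range(2, 10):
    labels = KMeans(n_clusters=k).fit_predict(df[['Age', 'Fare']])
    # silhouette near 1 means tight, well-separated clusters
    print(k, round(silhouette_score(df[['Age', 'Fare']], labels), 3))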
5.2.2 Hierarchical Clustering
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
Importing the data:
df=pd.read_csv('Unsupervised.csv')
df
Name | Age | Fare | |
---|---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 | 7.2500 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 |
2 | Heikkinen, Miss. Laina | 26.0 | 7.9250 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1000 |
4 | Allen, Mr. William Henry | 35.0 | 8.0500 |
... | ... | ... | ... |
886 | Montvila, Rev. Juozas | 27.0 | 13.0000 |
887 | Graham, Miss. Margaret Edith | 19.0 | 30.0000 |
888 | Johnston, Miss. Catherine Helen "Carrie" | NaN | 23.4500 |
889 | Behr, Mr. Karl Howell | 26.0 | 30.0000 |
890 | Dooley, Mr. Patrick | 32.0 | 7.7500 |
891 rows × 3 columns
df.isnull().sum()
Name 0
Age 177
Fare 0
dtype: int64
df['Initial'] = df.Name.str.extract(r'([A-Za-z]+)\.') # extract the title ('initial') from each name
pd.crosstab(df.Initial,train_data.Sex).T.style.background_gradient(cmap='summer_r')
df['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess',
'Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss',
'Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
df.groupby('Initial')['Age'].mean()
df.loc[(df.Age.isnull()) & (df.Initial=='Mr'),'Age']=33
df.loc[(df.Age.isnull()) & (df.Initial=='Mrs'),'Age']=36
df.loc[(df.Age.isnull()) & (df.Initial=='Master'),'Age']=5
df.loc[(df.Age.isnull()) & (df.Initial=='Miss'),'Age']=22
df.loc[(df.Age.isnull()) & (df.Initial=='Other'),'Age']=46
df.Age.isnull().any()
False
df.head()
Name | Age | Fare | Initial | |
---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 | 7.2500 | Mr |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 | Mrs |
2 | Heikkinen, Miss. Laina | 26.0 | 7.9250 | Miss |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1000 | Mrs |
4 | Allen, Mr. William Henry | 35.0 | 8.0500 | Mr |
X= df.iloc[:,[1,2]].values
X
array([[22. , 7.25 ],
[38. , 71.2833],
[26. , 7.925 ],
...,
[22. , 23.45 ],
[26. , 30. ],
[32. , 7.75 ]])
Using a dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
Fitting hierarchical clustering to the dataset
from sklearn.cluster import AgglomerativeClustering
# n_clusters defaults to 2; set it to 3 so the three clusters plotted below are all populated
hc = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
y_hc = hc.fit_predict(X)
plt.scatter(X[y_hc==0,0], X[y_hc==0,1], s = 100, c= 'red', label='cluster 1')
plt.scatter(X[y_hc==1,0], X[y_hc==1,1], s = 100, c= 'blue', label='cluster 2')
plt.scatter(X[y_hc==2,0], X[y_hc==2,1], s = 100, c= 'green', label='cluster 3')
plt.title('Clusters of Passengers')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.legend()
plt.show()
The End
from IPython.display import Image
Image(url="https://d13ezvd6yrslxm.cloudfront.net/wp/wp-content/images/titanic-jack-rose-door.jpg")