Predicting passenger survival using classification

In this tutorial, we will use machine learning in R to predict passenger survival. At a high level, we will take the following steps in the prediction analysis.

Table of Contents

  • Environment Setup
  • Reading the Titanic dataset from a CSV file
  • Data Cleansing and Transformation
  • Basic Data Exploration and Visualization
  • Predicting passenger survival by creating a prediction model using a decision tree
  • Testing the prediction model with a confusion matrix

Environment Setup

  • Install R
  • Install RStudio
  • Create a Kaggle account to download the dataset.

Reading the Titanic dataset from a CSV file



Create an account on the Kaggle website and click on the following link to download the CSV file - https://www.kaggle.com/c/titanic/download/train.csv . Move the file to the project's data directory - e.g. C:/slabs/ml-r-projects/Predicting-passenger-survival/data

Understand the data

The image below shows sample data from the Titanic dataset.

Sample titanic dataset

Here is the description Kaggle gives for the data.

VARIABLE DESCRIPTIONS:
survival        Survival (0 = No; 1 = Yes)
pclass          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Load the file

Run the following commands in RStudio.

#####  Set working directory
setwd('C:/slabs/ml-r-projects/Predicting-passenger-survival/data')

#####  Validate current working directory
getwd()

#####  Load the csv file
train.data = read.csv("train.csv", na.strings=c("NA", ""))

Evaluate the data

Once the data is loaded, evaluate it using the str function.

#####  View data type of training dataset
class(train.data)

#####  Check data with str function
str(train.data)

'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Notice the different attributes above, their data types, and the first few values of each attribute.
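Optionally, summary gives a quick statistical overview of each attribute; this is a convenience step and not required for what follows.

#####  Optional: quick statistical overview of each attribute
summary(train.data)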

Data Cleansing and Transformation



From the str function output, you can see that the Survived and Pclass attributes have the data type int. Ideally, they should be categorical variables, so we will transform them from int to factor type via the factor function.

What is a factor type? Often your data needs to be grouped by category. R has a special collection type called a factor to track these categorized values.
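As a small standalone example (illustrative only, not part of the analysis), a factor stores categorical values along with their set of levels:

#####  A factor stores categorical values with a fixed set of levels
f = factor(c("male", "female", "female"))
levels(f)   # returns "female" "male"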

Execute the code below to transform the variables from the int numeric type to the factor categorical type.

train.data$Survived = factor(train.data$Survived)

train.data$Pclass = factor(train.data$Pclass)
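You can verify that the conversion worked as expected:

#####  Verify the conversion
class(train.data$Survived)    # "factor"
levels(train.data$Survived)   # "0" "1"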

Detecting missing values

Input sample data often has missing values, which can impact the overall prediction.

#####  Use the Amelia package to find missing values
install.packages("Amelia")
require(Amelia)
missmap(train.data, main="Titanic Data - Missings Map", col=c("red", "green"),
 legend=FALSE)

Missing values map for the Titanic dataset

Let’s find the proportion of missing values in the dataset.


#####  Find the proportion of missing values in the Age column
sum(is.na(train.data$Age) == TRUE) /  length(train.data$Age)

0.1986532

#####  Generalize and find the missing-value proportion for the entire dataset
sapply(train.data, function(attribute) {
  sum(is.na(attribute) == TRUE) / length(attribute)
})

PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket        Fare 
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.198653199 0.000000000 0.000000000 0.000000000 0.000000000 
      Cabin    Embarked 
0.771043771 0.002244669 

Notice that we have missing values for 3 attributes:

  • Cabin - 77.1%
  • Embarked - 0.22%
  • Age - 19.8%

Handle Cabin attribute missing values

Notice that the Cabin attribute has most of its values missing, so it is not of much help for our analysis. Hence, it’s safe to ignore this attribute.
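If you prefer to drop the column explicitly rather than just leaving it out, a one-liner does it (optional; the model formula later never references Cabin):

#####  Optional: drop the Cabin column entirely
train.data$Cabin = NULL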

Handle Embarked attribute missing values

The Embarked attribute has only 0.22% of its values missing. Let’s analyze it further. We will use the table function to count the values by category.

#####  Find the distribution of the values
table(train.data$Embarked, useNA = "always")

   C    Q    S <NA> 
 168   77  644    2 
 

The table function simply tabulates the counts of a categorical variable.

Since only 2 values are NA, we can assign these 2 records to the most common port, which is the most probable value.
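As a small sketch, the most common port can also be picked programmatically instead of read off the table by eye:

#####  Sketch: find the most common port programmatically
names(which.max(table(train.data$Embarked)))   # "S"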

#####  Assign the 2 missing values to the most common port (S)
train.data$Embarked[which(is.na(train.data$Embarked))] = 'S'

#####  Verify the distribution of the values
table(train.data$Embarked, useNA = "always")

   C    Q    S <NA> 
 168   77  646    0 
 

Handle Age attribute missing values

For the Age attribute, about 20 percent of the values are missing. There are various strategies to handle this. One simple way is to infer the missing value from the title in each passenger's name: we can compute the mean age for each title and assign it in place of the missing values.

First, let’s find the total breakdown by titles such as Mr., Mrs., Miss., and Master. To discover the types of titles contained in the names of train.data, we will:

  • tokenize train.data$Name on whitespace (the regular expression pattern "\s+"), then
  • count the frequency of each token with the table function, and
  • finally, since a name title ends with a period, use a regular expression to grep the tokens containing a period.

Execute the commands below to get a high-level breakdown of the data by title.


train.data$Name = as.character(train.data$Name)
table_words = table(unlist(strsplit(train.data$Name, "\\s+")))
sort(table_words[grep('\\.', names(table_words))], decreasing=TRUE)


      Mr.     Miss.      Mrs.   Master. 
      517       182       125        40 
      Dr.      Rev.      Col.    Major. 
        7         6         2         2 
    Mlle.     Capt. Countess.      Don. 
        2         1         1         1 
Jonkheer.        L.     Lady.      Mme. 
        1         1         1         1 
      Ms.      Sir. 
        1         1 
 

Now, let’s find a high-level breakdown of the missing Age data by title. We will use the str_match function from the stringr package to get the substring containing a period, and then bind the column with cbind.

library(stringr) 
tb_data = cbind(train.data$Age, str_match(train.data$Name, " [a-zA-Z]+\\."))
table(tb_data[is.na(tb_data[,1]),2])

    Dr. Master.   Miss.     Mr.    Mrs. 
      1       4      36     119      17 
 

The final step is to calculate the mean age for each title and assign it to the missing values under that title.

mean.mr = mean(train.data$Age[grepl(" Mr\\.", train.data$Name) 
				& !is.na(train.data$Age)])
mean.mrs = mean(train.data$Age[grepl(" Mrs\\.", train.data$Name) 
				& !is.na(train.data$Age)])
mean.dr = mean(train.data$Age[grepl(" Dr\\.", train.data$Name) 
				& !is.na(train.data$Age)])
mean.miss = mean(train.data$Age[grepl(" Miss\\.", train.data$Name) 
				& !is.na(train.data$Age)])
mean.master =  mean(train.data$Age[grepl(" Master\\.", train.data$Name) 
				& !is.na(train.data$Age)])

train.data$Age[grepl(" Mr\\.", train.data$Name) 
				& is.na(train.data$Age)] = mean.mr
train.data$Age[grepl(" Mrs\\.", train.data$Name) 
				& is.na(train.data$Age)] = mean.mrs
train.data$Age[grepl(" Dr\\.", train.data$Name) 
				& is.na(train.data$Age)] = mean.dr
train.data$Age[grepl(" Miss\\.", train.data$Name) 
				& is.na(train.data$Age)] = mean.miss
train.data$Age[grepl(" Master\\.", train.data$Name) 
				& is.na(train.data$Age)] = mean.master

#####  Check that no missing Age values remain
sum(is.na(train.data$Age) == TRUE) /  length(train.data$Age)
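For reference, the five imputations above can be written more compactly as a loop; this sketch is equivalent to the block above, not an additional step:

#####  Equivalent sketch: impute the mean age per title in a loop
titles = c(" Mr\\.", " Mrs\\.", " Dr\\.", " Miss\\.", " Master\\.")
for (title in titles) {
  has.title = grepl(title, train.data$Name)
  title.mean = mean(train.data$Age[has.title & !is.na(train.data$Age)])
  train.data$Age[has.title & is.na(train.data$Age)] = title.mean
}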

Basic Data Exploration and Visualization



Breakdown by Gender

Run the commands below to find the breakdown by gender.

#####  Find count by sex
table(train.data$Sex)

female   male 
   314    577 
   
#####  Plot it as a bar chart
barplot(table(train.data$Sex), col=c("orange","lightgreen"), names= c("Female", "Male"), main="Breakdown by Gender")

Observation: There were more male passengers than female passengers.

Breakdown by Gender bar chart

Breakdown by port of embarkation

Run the commands below to find the breakdown by port of embarkation.

#####  Plot the breakdown by port of embarkation
barplot(table(train.data$Embarked), col=c("green","blue", "orange"), names= c("Cherbourg", "Queenstown", "Southampton"), main="Port of Embarkation")

Observation: Southampton far outnumbered the other ports of embarkation.

Port of Embarkation bar chart

Breakdown by Class

Run the commands below to find the breakdown by class.

#####  Find count by class
table(train.data$Pclass)

  1   2   3 
216 184 491

#####  Plot the class counts as a bar chart
barplot(table(train.data$Pclass), col=c("green","blue", "orange"), names= c("First", "Second", "Third"), main="Breakdown by Class")

Observation: There were more passengers in third class than in first and second classes.

Breakdown by Class bar chart

Breakdown by sex for each Class

Run the commands below to find the breakdown by sex for each class.

#####  show breakdown by gender for each class
table( train.data$Sex, train.data$Pclass)
        
           1   2   3
  female  94  76 144
  male   122 108 347
 
#####  show barplot
countsTable = table( train.data$Sex, train.data$Pclass)
barplot(countsTable,  col=c("orange","lightgreen"), legend = c("Female", "Male"), names= c("First", "Second", "Third"), main= "Passenger breakdown by sex for each Class")

Observation: Males outnumbered females in each class.

Breakdown by sex for each class bar chart

Passenger age distribution

Run the commands below to see the distribution of passenger age.

#####  Find the distribution of the values
hist(train.data$Age, main="Passenger age distribution", xlab = "Age")

Observation: Most of the passengers were aged 20-40. It will be interesting to find out later whether the survival rate also correlates with the age distribution.

Passenger age distribution histogram

Passenger fate breakdown

Run the commands below to find how many passengers survived.

#####  Count survived vs. perished and plot it
countsTable = table(train.data$Survived)
barplot(countsTable, col=c("red","green"), names= c("Perished", "Survived"), 
legend = c("Perished", "Survived"), main="Survived/Perished breakdown")

Observation: It’s clear that more people perished than survived. It will be interesting to examine passenger fate by age, sex, and class.

Survived/Perished breakdown bar chart

Passenger fate by age

Run the commands below to find passenger fate by age.

#####  Breakdown of passenger fate by age
countsTable = table( train.data$Survived, train.data$Age)
barplot(countsTable,  col=c("red","green"), legend = c("Perished", "Survived"), main = "Passenger fate by age")

Observation: Proportionally, many of the people who perished were aged 20-40.

Passenger fate by age bar chart

Passenger fate by Sex

Run the commands below to find passenger fate by sex.

#####  show survival count by Sex
 table( train.data$Survived, train.data$Sex)
   
    female male
  0     81  468
  1    233  109
  
#####  Bar chart
countsTable = table( train.data$Survived, train.data$Sex)
barplot(countsTable,  col=c("red","green"), legend = c("Perished", "Survived"), names= c("Female", "Male"), main = "Passenger fate by Sex")

Passenger fate by Sex bar chart
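To quantify the rates rather than eyeball the bar heights, prop.table converts the counts into row proportions (a quick sketch):

#####  Sketch: survival rate within each sex (row proportions)
prop.table(table(train.data$Sex, train.data$Survived), margin = 1)

                 0         1
  female 0.2579618 0.7420382
  male   0.8110919 0.1889081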

A mosaic chart makes it even easier to visualize the breakdown.

#####  Mosaic Charts
mosaicplot(train.data$Sex ~ train.data$Survived, 
	main="Passenger fate by Sex", shade=FALSE, color=c("red","green"), 
         xlab="Sex", ylab="Survived")

Observation: Proportionally, the female survival rate (233 of 314, about 74%) was nearly four times the male rate (109 of 577, about 19%).

Passenger fate by Sex mosaic plot

Passenger fate by travelling class

Run the commands below to find passenger fate by travelling class.

#####  Find survived and perished breakdown
table( train.data$Survived)

  0   1 
549 342

#####  Find survived and perished breakdown by class
table( train.data$Survived, train.data$Pclass)
   
      1   2   3
  0  80  97 372
  1 136  87 119

#####  Plot the graph
countsTable = table( train.data$Survived, train.data$Pclass)
barplot(countsTable,  col=c("red","green"), legend = c("Perished", "Survived"), names= c("First", "Second", "Third"), main= "Passenger fate by Class")

Passenger fate by Class bar chart

Let’s see what the mosaic chart looks like.

#####  Mosaic Charts
mosaicplot(train.data$Pclass ~ train.data$Survived, 
	main="Passenger fate by Pclass", shade=FALSE, color=c("red","green"), 
         xlab="Pclass", ylab="Survived")

Observation: There were higher chances of survival in first class than in second or third class.

Passenger fate by Pclass mosaic plot

Observations

So far, we have made the following observations:

  • There were more male passengers than female passengers, and males outnumbered females in each class.
  • Southampton far outnumbered the other ports of embarkation.
  • Most of the passengers were aged 20-40.
  • There were more passengers in third class than in first and second classes.
  • It’s clear that more people perished than survived.
  • Travelling class did influence the odds of a passenger’s survival: there were higher chances of survival in first class than in second or third class.
  • Female passengers had a higher rate of survival than males.

We can also use the corrgram package to confirm some of the observations from the section above.

#####  Install
install.packages("corrgram")

require(corrgram)
corrgram.data <- train.data
#####  change from factor type to numeric type to include data in correlogram
corrgram.data$Survived <- as.numeric(corrgram.data$Survived)
corrgram.data$Pclass <- as.numeric(corrgram.data$Pclass)

#####  generate correlogram
corrgram.vars <- c("Survived", "Pclass", "Sex", "Age", 
                  "Fare", "Embarked")
corrgram(corrgram.data[,corrgram.vars], order=FALSE, 
         lower.panel=panel.ellipse, upper.panel=panel.pie, 
         text.panel=panel.txt, main="Titanic Data")

Observation: The correlogram below confirms a couple of the observations, namely that survival odds drop with class, and that age may not prove to be a significant predictor.

Titanic data correlogram

Predicting passenger survival by creating a prediction model using a decision tree



There are various algorithms we can use to predict passenger survival, such as ctree and svm. For this tutorial, we will use the ctree (conditional inference tree) algorithm from the party package.

You can create a regression or classification tree via the function ctree(formula, data=). The type of tree created will depend on the outcome variable (nominal factor, ordered factor, numeric, etc.).

We will adopt the following approach for predicting passenger survival:

  • Split the data into a training and a testing set.
  • Use the training set to generate a prediction model using a decision tree.
  • From the decision tree, identify combinations of variables which may be helpful in predicting the survival rate.
  • Run the prediction model on the testing dataset.

Execute the commands below to generate the ctree.

#####  Split the data into training and testing sets
#####  Note: when dim(data)[1] * p is not an integer, the single row between
#####  floor(dim(data)[1] * p) and ceiling(dim(data)[1] * p) + 1 is dropped from both sets
split.data = function(data, p = 0.7, s = 666){
     set.seed(s)
     index = sample(1:dim(data)[1])
     trainingdataset = data[index[1:floor(dim(data)[1] * p)], ]
     testingdataset = data[index[((ceiling(dim(data)[1] * p)) + 1):dim(data)[1]], ]
     return(list(trainingdataset = trainingdataset, testingdataset = testingdataset))
}

allset = split.data(train.data, p = 0.7)
titanic_trainingdataset = allset$trainingdataset 
titanic_testingdataset = allset$testingdataset

#####  Install and Use party package
install.packages('party')
require('party')

#####  Train and create ctree
train.ctree = ctree(Survived ~ Pclass + Sex + Age + SibSp + Fare + Parch + Embarked, data=titanic_trainingdataset)
train.ctree

		Conditional inference tree with 7 terminal nodes
		
		Response:  Survived 
		Inputs:  Pclass, Sex, Age, SibSp, Fare, Parch, Embarked 
		Number of observations:  623 
		
		1) Sex == {male}; criterion = 1, statistic = 173.672
		  2) Pclass == {2, 3}; criterion = 1, statistic = 30.951
		    3) Age <= 9; criterion = 0.993, statistic = 10.923
		      4) SibSp <= 1; criterion = 0.999, statistic = 14.856
		        5)*  weights = 10 
		      4) SibSp > 1
		        6)*  weights = 13 
		    3) Age > 9
		      7)*  weights = 280 
		  2) Pclass == {1}
		    8)*  weights = 87 
		1) Sex == {female}
		  9) Pclass == {1, 2}; criterion = 1, statistic = 59.504
		    10)*  weights = 125 
		  9) Pclass == {3}
		    11) Fare <= 23.25; criterion = 0.997, statistic = 12.456
		      12)*  weights = 85 
		    11) Fare > 23.25
		      13)*  weights = 23 
      

#####  Plot the tree (Green - Survived, Red - Perished)
plot(train.ctree, main="Applying cTree on Titanic Dataset", tp_args = list(fill = c("green","red")))


ctree plot for the Titanic dataset

Observation: Notice that node 9 acts as a good decision boundary for predicting survival rates: female passengers in first and second class mostly survived, while male passengers in second and third class who were older than nine mostly perished.
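As a quick sketch, you can also feed new data to the model. The passenger below is hypothetical (the values are made up for illustration); the factor columns must carry the same levels as the training data:

#####  Sketch: predict survival for a single hypothetical passenger
new.passenger = data.frame(
  Pclass   = factor("1", levels = levels(titanic_trainingdataset$Pclass)),
  Sex      = factor("female", levels = levels(titanic_trainingdataset$Sex)),
  Age      = 30,
  SibSp    = 0,
  Parch    = 0,
  Fare     = 80,
  Embarked = factor("C", levels = levels(titanic_trainingdataset$Embarked)))
predict(train.ctree, newdata = new.passenger)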

Testing the prediction model with a confusion matrix



We built the prediction model in the last step. In this step, we will measure the performance of the prediction model: whether the prediction results match the original survival records in the testing dataset.

We will generate a confusion matrix with the caret package, which helps us measure the accuracy of the predictions. The caret package also makes iterating over and comparing different predictive models very convenient.

Test the prediction model with ctree

Execute the commands below.

#####  Use the predict function to run the prediction model
ctree.predict = predict(train.ctree, titanic_testingdataset)

#####  Install caret and its helper package e1071
install.packages("e1071")
install.packages("caret")
require(caret)

#####  Use a confusion matrix to generate statistics on the prediction results
confusionMatrix(ctree.predict, titanic_testingdataset$Survived)

	Confusion Matrix and Statistics
		
		          Reference
		Prediction   0   1
		         0 160  23
		         1  16  68
		                                         
		               Accuracy : 0.8539         
		                 95% CI : (0.8058, 0.894)
		    No Information Rate : 0.6592         
		    P-Value [Acc > NIR] : 5.347e-13      
		                                         
		                  Kappa : 0.6688         
		 Mcnemar's Test P-Value : 0.3367         
		                                         
		            Sensitivity : 0.9091         
		            Specificity : 0.7473         
		         Pos Pred Value : 0.8743         
		         Neg Pred Value : 0.8095         
		             Prevalence : 0.6592         
		         Detection Rate : 0.5993         
		   Detection Prevalence : 0.6854         
		      Balanced Accuracy : 0.8282         
		                                         
		       'Positive' Class : 0 		     
			       

Observation: From the confusion matrix, the accuracy is about 85%, which suggests that the ctree model is powerful enough to make survival predictions.
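As a sanity check, the accuracy can be recomputed by hand from the matrix: correct predictions divided by all predictions.

#####  Sanity check: accuracy from the confusion matrix above
(160 + 68) / (160 + 23 + 16 + 68)   # 0.8539326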

Test the prediction model with an ROC curve

Another quick way to test the prediction model is with an ROC curve, which plots the true positive rate against the false positive rate.

#####  Use the predict function to run the prediction model
train.ctree.predict = predict(train.ctree, titanic_testingdataset)

#####  treeresponse returns c(P(Survived=0), P(Survived=1)) for each observation;
#####  the seq() picks out every P(Survived=0), so 1 minus it is the survival probability
train.ctree.probability =  1- unlist(treeresponse(train.ctree, 
titanic_testingdataset), use.names=F)[seq(1,nrow(titanic_testingdataset)*2,2)]

#####  Install ROCR
install.packages("ROCR")
require(ROCR)

#####  Create an ROCR prediction object from probabilities:
train.ctree.probability.rocr = prediction(train.ctree.probability, titanic_testingdataset$Survived)

#####  Prepare the ROCR performance object for the ROC curve (tpr=true positive rate, 
#####  fpr=false positive rate) and the area under curve (AUC)	     
train.ctree.performance = performance(train.ctree.probability.rocr, "tpr","fpr")
train.ctree.auc.performance =  performance(train.ctree.probability.rocr, measure = "auc", 
x.measure = "cutoff")

#####  Plot the ROC curve
plot(train.ctree.performance, col=2,colorize=T, main=paste("AUC:", train.ctree.auc.performance@y.values))
		       

Our model returns an AUC value of about 0.87, which suggests that the conditional inference tree model is powerful enough to make survival predictions.

ROC curve with AUC for the ctree model
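If you want the AUC as a plain number rather than reading it off the plot title, it can be pulled from the performance object:

#####  Read the AUC value directly from the performance object
unlist(train.ctree.auc.performance@y.values)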


Version History


Date          Description
2016-09-01    Initial Version