Initial hypothesis

We hypothesized that survival was related to sex, age group, passenger class, embarkment point and family size. We expected passenger class to have the most significant effect.

Relationship between age group and survival

First, we labeled the outcomes 0 and 1 as ‘died’ and ‘survived,’ then converted every blank entry in the ‘embarked’ column to NA so that we could omit those rows from our data set. Note that na.omit removes rows with a missing value in any column, not only ‘embarked.’ Next, to make sure ‘embarked’ was a factor with just three levels, we defined them as ‘C,’ ‘S,’ and ‘Q.’ Finally, we dropped the name, cabin, and ticket number columns, since they were not relevant to our analysis.

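library(dplyr)  # provides %>%, mutate() and filter(), used below

titanic <- read.csv('https://cs.brown.edu/courses/cs100/studios/data/4/titanic.csv')
# Label the outcome: 0 = died, 1 = survived
titanic$survived <- factor(titanic$survived, levels = c(0, 1), labels = c("died", "survived"))
# Treat blank embarkment points as missing, then drop incomplete rows
titanic$embarked[titanic$embarked == ""] <- NA
titanic <- na.omit(titanic)
titanic$embarked <- factor(titanic$embarked, levels = c("C", "S", "Q"))
# Drop the name, cabin and ticket number columns, selected by position
titanic <- titanic[-c(3, 8, 10)]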
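Dropping columns by position depends on the column order of the file. As a minimal sketch of the same cleaning steps, the example below drops columns by name instead. It uses a small made-up data frame, and the column names `name`, `ticket` and `cabin` are assumptions for illustration rather than names confirmed from the real file.

```r
# Toy stand-in for the Titanic data; values and column names are assumptions
toy <- data.frame(
  survived = c(0, 1, 1, 0),
  name     = c("A", "B", "C", "D"),
  age      = c(22, NA, 35, 4),
  ticket   = c("t1", "t2", "t3", "t4"),
  cabin    = c("", "C85", "", ""),
  embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Label the outcome and mark blank embarkment points as missing
toy$survived <- factor(toy$survived, levels = c(0, 1), labels = c("died", "survived"))
toy$embarked[toy$embarked == ""] <- NA

# Drop unused columns by name rather than by position
drop_cols <- c("name", "ticket", "cabin")
toy <- toy[, !(names(toy) %in% drop_cols)]

# na.omit() removes rows with a missing value in ANY remaining column,
# so the row with a missing age goes along with the blank 'embarked' row
toy_clean <- na.omit(toy)
toy_clean$embarked <- factor(toy_clean$embarked, levels = c("C", "S", "Q"))

print(toy_clean)
```

In this toy data two of the four rows are removed: one for a blank embarkment point and one for a missing age.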

We were interested in exploring how age group affected survival rates on the Titanic, working from the hypothesis that children survived at higher rates than adults. We created a new variable called ‘age_group,’ so that anyone aged 15 or under would be considered a child, and those over 15 would be considered adults. Then, we added age group to the Titanic data set and created a table showing the proportion of children and adults who survived or died within their respective age group, before illustrating these findings in a barplot.

As we can see, only 38% of adults survived whereas 57% of children survived. On balance, children were much more likely to survive than adults. We then tested the relationship between age group and survival with a Chi-squared test (X-squared = 14.064) and found a p-value of 0.0001767, which provides strong evidence of a statistically significant association between age group and survival.

# Passengers over 15 are adults; everyone else is a child
age_group <- ifelse(titanic$age > 15, "Adult", "Child")
titanic <- titanic %>% mutate(age_group)
survived_vs_age <- table(titanic$survived, titanic$age_group)
# margin = 2 gives proportions within each age group (columns sum to 1)
proportion_survived_vs_age <- prop.table(survived_vs_age, 2)
barplot(proportion_survived_vs_age, main = "Survival rates by age group", xlab = "Age group",
        col = c("darkblue", "red"), legend = rownames(proportion_survived_vs_age), beside = TRUE)

chisq.test(survived_vs_age)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survived_vs_age
## X-squared = 14.064, df = 1, p-value = 0.0001767
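The percentages quoted above come from the second argument of prop.table, which controls the direction in which proportions are computed. The sketch below illustrates the difference on hypothetical counts (not the Titanic figures): with margin 2 each column sums to 1, giving survival rates within each group, while margin 1 normalizes each row instead.

```r
# Hypothetical counts for illustration only
counts <- matrix(c(60, 40, 10, 30), nrow = 2,
                 dimnames = list(outcome = c("died", "survived"),
                                 age_group = c("Adult", "Child")))

# margin = 2: proportions within each age group (columns sum to 1)
by_group <- prop.table(counts, margin = 2)

# margin = 1: proportions within each outcome (rows sum to 1)
by_outcome <- prop.table(counts, margin = 1)

print(round(by_group, 2))
print(round(by_outcome, 2))
```

Reading `by_group["survived", "Child"]` gives the share of children who survived, which is the quantity the bar charts in this report display.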

Relationship between embarkment point and survival

Next, we wanted to test for a relationship between embarkment point and survival; our hypothesis was that one existed. Using the same process as for age group, we created a proportion table showing the proportion of passengers who survived at each embarkment point and again illustrated our findings in a bar chart.

We found that 62% of those embarking from Cherbourg survived, whereas only 36% of those who embarked at Southampton survived, and only 26% of those who embarked at Queenstown survived. This shows that passengers who embarked at Cherbourg were more likely to survive than those who embarked at either of the other two locations. Next, we tested this relationship for statistical significance using the Chi-squared test once again (finding an X-squared statistic of 52.91). With a p-value of 3.242e-12, this evidence strongly suggests that the relationship between embarkment point and survival is statistically significant.

survived_vs_embarked <- table(titanic$survived, titanic$embarked)
proportion_survived_vs_embarked <- prop.table(survived_vs_embarked, 2)
barplot(proportion_survived_vs_embarked, main = "Distribution of survival at each embarkment point",
        xlab = "Embarkment point", col = c("darkblue", "red"),
        legend = rownames(proportion_survived_vs_embarked), beside = TRUE)

chisq.test(survived_vs_embarked)
## 
##  Pearson's Chi-squared test
## 
## data:  survived_vs_embarked
## X-squared = 52.91, df = 2, p-value = 3.242e-12
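The printed test output can also be read programmatically, since chisq.test returns an object whose components hold the statistic, degrees of freedom, p-value and expected counts. The sketch below uses hypothetical counts for three ports and recomputes the statistic by hand from the expected counts. By default R applies Yates' continuity correction only to 2x2 tables, which is why the 2x3 outputs in this report are labeled plain "Pearson's Chi-squared test".

```r
# Hypothetical 2x3 table for illustration only
counts <- matrix(c(40, 60, 70, 30, 50, 50), nrow = 2,
                 dimnames = list(outcome = c("died", "survived"),
                                 port = c("C", "S", "Q")))

result <- chisq.test(counts)

# Components of the returned test object
result$statistic  # X-squared
result$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
result$p.value
result$expected   # counts expected if outcome and port were independent

# The statistic is the sum of (observed - expected)^2 / expected
manual_statistic <- sum((counts - result$expected)^2 / result$expected)
print(manual_statistic)
```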

Relationship between sex and survival

We then wanted to test the impact of sex on survivability, first comparing only male vs. female adults and excluding children. Our hypothesis was that women would survive at higher rates than men. We filtered the data set to include only male and female adults for our analysis. We created a proportion table showing the proportion of those who survived within each sex and again illustrated our findings in a bar chart.

Strikingly, 76.7% of adult females survived whereas only 17.7% of adult males survived, suggesting that adult females were much more likely to survive. We tested this relationship for statistical significance using the Chi-squared test once again (finding an X-squared statistic of 308.97). With a p-value below 2.2e-16, this evidence strongly suggests that the relationship between adult sex and survival is statistically significant.

adults <- filter(titanic, age_group == "Adult")
survived_vs_sex <- table(adults$survived, adults$sex)
proportion_survived_vs_sex <- prop.table(survived_vs_sex, 2)
barplot(proportion_survived_vs_sex, main = "Distribution of survival by sex", xlab = "Sex",
        col = c("darkblue", "red"), legend = rownames(proportion_survived_vs_sex), beside = TRUE)

chisq.test(survived_vs_sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survived_vs_sex
## X-squared = 308.97, df = 1, p-value < 2.2e-16
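The output reports "p-value < 2.2e-16" rather than an exact number because, as far as we understand R's printing, p-values smaller than machine epsilon are displayed as falling below that threshold. The sketch below, on hypothetical counts, shows that the stored p-value is still a positive number far smaller than the printed bound.

```r
# Machine epsilon for double precision, roughly 2.2e-16
eps <- .Machine$double.eps
print(eps)

# Hypothetical 2x2 table with a very strong association
counts <- matrix(c(400, 100, 100, 400), nrow = 2,
                 dimnames = list(outcome = c("died", "survived"),
                                 sex = c("male", "female")))

result <- chisq.test(counts)

# The stored value is tiny but not zero
print(result$p.value)
print(result$p.value < eps)
```

For this reason such results are best reported as "below 2.2e-16" rather than "equal to 2.2e-16".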

Relationship between passenger class and survival

Next, we wanted to test for a relationship between passenger class and survival. Our hypothesis was that the lower the passenger class, the lower the survival rate. As before, we created a proportion table showing the proportion of those who survived within each passenger class and illustrated our findings in a bar chart.

According to our findings, 63% of those in first class survived, versus 44% and 26% in second and third classes, respectively. Running the Chi-squared test, we found an X-squared value of 105.35 and a p-value below 2.2e-16, again providing strong evidence of an association between passenger class and survival.

survived_vs_pclass <- table(titanic$survived, titanic$pclass)
proportion_survived_vs_pclass <- prop.table(survived_vs_pclass, 2)
barplot(proportion_survived_vs_pclass, main = "Distribution of survival by passenger class",
        xlab = "Passenger Class", col = c("darkblue", "red"),
        legend = rownames(proportion_survived_vs_pclass), beside = TRUE)

chisq.test(survived_vs_pclass)
## 
##  Pearson's Chi-squared test
## 
## data:  survived_vs_pclass
## X-squared = 105.35, df = 2, p-value < 2.2e-16

Relationship between traveling with family and survival

Having tested each of the original variables for an association with survival, we next looked for a relationship between whether a person was traveling with family and whether they survived. Our hypothesis was that those traveling with family would survive at higher rates. To create the new variable, we added together the values in the ‘siblings on board’ (sibsp) and ‘parents on board’ (parch) columns. We then created a new variable called family_size_category, so that anyone traveling with no other family members was put in the category “Without family” and those who traveled with at least one family member were put in the “With family” category.

family_size <- titanic$sibsp + titanic$parch
titanic <- titanic %>% mutate(family_size)
family_size_category <- ifelse(titanic$family_size > 0, "With family", "Without family")
titanic <- titanic %>% mutate(family_size_category)

Using a barplot, we found that 52.4% of those traveling with family survived whereas 31.7% of those traveling individually survived. Running a Chi-squared test again suggests there is a statistically significant association between traveling with family and surviving, with an X-squared value of 44.805 and a p-value of 2.177e-11. Those traveling with family survived at higher rates than those traveling without.

survived_vs_family <- table(titanic$survived, titanic$family_size_category)
proportion_survived_vs_family <- prop.table(survived_vs_family, 2)
barplot(proportion_survived_vs_family, main = "Distribution of survival when traveling with family",
        xlab = "Traveling with family?", col = c("darkblue", "red"),
        legend = rownames(proportion_survived_vs_family), beside = TRUE)

chisq.test(survived_vs_family)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survived_vs_family
## X-squared = 44.805, df = 1, p-value = 2.177e-11

Does passenger class have as strong an effect on survival as we once thought?

The logical next step was to test the relationship between passenger class and survival for individual passenger groups (men, women, and children) in order to check whether passenger class had as strong an effect as we initially suspected.

To execute this, we first filtered the Titanic data set to create three new data sets (men, women, and children), defining them using both age and sex.

# Adults are passengers over 15, split by sex
men <- filter(titanic, sex == "male", age > 15)
women <- filter(titanic, sex == "female", age > 15)
# Note: age < 15 leaves out passengers aged exactly 15, whom age_group counts as children
children <- filter(titanic, age < 15)
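As written, the three filters leave passengers aged exactly 15 in none of the groups. One way to avoid such gaps is to assign every passenger to a single group variable and then tabulate it. The following is a minimal sketch on made-up data using base R; the column name `passenger_group` and the `<= 15` cutoff (chosen to match the earlier age_group definition) are assumptions, and applying it to the real data could shift the percentages reported in this section.

```r
# Made-up passengers for illustration only
toy <- data.frame(
  sex = c("male", "female", "male", "female", "male", "female"),
  age = c(40, 15, 15, 8, 14.5, 30),
  stringsAsFactors = FALSE
)

# Every passenger falls into exactly one group
toy$passenger_group <- ifelse(toy$age <= 15, "child",
                              ifelse(toy$sex == "male", "man", "woman"))
group_counts <- table(toy$passenger_group)
print(group_counts)

# For comparison, count how many rows the three separate filters would cover
n_men      <- sum(toy$sex == "male" & toy$age > 15)
n_women    <- sum(toy$sex == "female" & toy$age > 15)
n_children <- sum(toy$age < 15)
unassigned <- nrow(toy) - (n_men + n_women + n_children)
print(unassigned)  # passengers aged exactly 15 are missed by the filters
```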

First, we looked at survival rates within passenger class for men specifically. Based on our barplot of survival rates for men within each passenger class, 32% in first class survived. However, only 8% of those in second class survived and 15% of those in third class survived. This was surprising, because we had initially thought that those in higher passenger classes would survive at higher rates, but men in second class survived at lower rates than men in third class.

men_vs_survived <- table(men$survived, men$pclass)
proportion_men_vs_survived <- prop.table(men_vs_survived, 2)
barplot(proportion_men_vs_survived, main = "Survival rate of men by passenger class",
        xlab = "Passenger class", col = c("darkblue", "red"),
        legend = rownames(proportion_men_vs_survived), beside = TRUE)

Now, looking at the bar chart of women’s survival rates within passenger class, we find that 96.9% of women in first class survived, 87% of women in second class survived, and 46% of women in third class survived. In every class, women survived at much higher rates than men. For women, the pattern matched our initial hypothesis that the lower the passenger class, the lower the survival rate.

women_vs_survived <- table(women$survived, women$pclass)
proportion_women_vs_survived <- prop.table(women_vs_survived, 2)
barplot(proportion_women_vs_survived, main = "Survival rate of women by passenger class",
        xlab = "Passenger class", col = c("darkblue", "red"),
        legend = rownames(proportion_women_vs_survived), beside = TRUE)

Our bar chart for the survival rate of children by passenger class tells a broadly similar story, with one exception. 85.7% of children in first class survived, whereas 96.3% of children in second class survived, and 38.7% of children in third class survived. This was interesting because this time, a higher proportion of children in second class survived than in first class.

children_vs_survived <- table(children$survived, children$pclass)
proportion_children_vs_survived <- prop.table(children_vs_survived, 2)
barplot(proportion_children_vs_survived, main = "Survival rate of children by passenger class",
        xlab = "Passenger class", col = c("darkblue", "red"),
        legend = rownames(proportion_children_vs_survived), beside = TRUE)

chisq.test(children_vs_survived)
## Warning in chisq.test(children_vs_survived): Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  children_vs_survived
## X-squared = 29.441, df = 2, p-value = 4.045e-07
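The warning above typically appears when some expected cell counts are small (a common rule of thumb is below about 5), so the chi-squared approximation may be unreliable. The sketch below, on hypothetical counts rather than the real children's table, shows one way to inspect the expected counts and two alternatives that do not rely on the large-sample approximation: Fisher's exact test and a simulated p-value. It is an illustration under those assumptions, not the analysis performed in this report.

```r
# Hypothetical small-sample table for illustration only
counts <- matrix(c(1, 6, 2, 20, 25, 18), nrow = 2,
                 dimnames = list(outcome = c("died", "survived"),
                                 pclass = c("1", "2", "3")))

# Inspect expected counts; the warning is suppressed here on purpose
expected <- suppressWarnings(chisq.test(counts))$expected
print(round(expected, 2))
print(any(expected < 5))

# Alternative 1: Fisher's exact test
fisher_result <- fisher.test(counts)
print(fisher_result$p.value)

# Alternative 2: Monte Carlo p-value for the chi-squared statistic
set.seed(1)
sim_result <- chisq.test(counts, simulate.p.value = TRUE, B = 2000)
print(sim_result$p.value)
```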

Since the relationship between passenger class and survival (the lower the passenger class, the lower the survival rate) does not hold consistently once passengers are split into men, women and children (it fails for men and for children), sex and age appear to have mattered more than passenger class in determining who survived.

Does traveling with family have as strong an effect on survival as we once thought?

Then, we moved on to analyzing survival rates within each major passenger group (men, women, and children), comparing those traveling with family against those traveling without. Interestingly, traveling with family hardly made a difference for men: our bar chart shows that 19% of men traveling with family survived versus 17% of men traveling without.

men_vs_survived2 <- table(men$survived, men$family_size_category)
proportion_men_vs_survived2 <- prop.table(men_vs_survived2, 2)
barplot(proportion_men_vs_survived2, main = "Survival rate of men given family size",
        xlab = "Family size", col = c("darkblue", "red"),
        legend = rownames(proportion_men_vs_survived2), beside = TRUE)

Likewise, traveling with family made only a marginal difference for women, slightly in favor of those traveling with family. Our bar chart shows that 78.6% of women traveling with family survived versus 74% of women traveling individually.

women_vs_survived2 <- table(women$survived, women$family_size_category)
proportion_women_vs_survived2 <- prop.table(women_vs_survived2, 2)
barplot(proportion_women_vs_survived2, main = "Survival rate of women given family size",
        xlab = "Family size", col = c("darkblue", "red"),
        legend = rownames(proportion_women_vs_survived2), beside = TRUE)

Curiously, the bar chart for children’s survival rates slightly favored those who were on their own, who survived at a rate of 57% versus 55.9% for those traveling with family. Granted, there’s hardly a difference between these two survival rates.

children_vs_survived2 <- table(children$survived, children$family_size_category)
proportion_children_vs_survived2 <- prop.table(children_vs_survived2, 2)
barplot(proportion_children_vs_survived2, main = "Survival rate of children given family size",
        xlab = "Family size", col = c("darkblue", "red"),
        legend = rownames(proportion_children_vs_survived2), beside = TRUE)

This shows that, when you break the passengers down into the categories of men, women, and children, whether or not they were traveling with family barely had an impact. Thus, this variable does not have as strong an effect on survival as we once hypothesized.

Does embarkment point have as strong an effect on survival as we once thought?

Finally, we compared passenger group survival rates (men, women, and children) by embarkment point. First, taking a look at men’s survival rate by embarkment point, we find that men were most likely to survive if they boarded at Cherbourg (33.3% survival rate) versus the two other embarkment points, Southampton (14.6%) and Queenstown (9.5%).

men_vs_survived3 <- table(men$survived, men$embarked)
proportion_men_vs_survived3 <- prop.table(men_vs_survived3, 2)
barplot(proportion_men_vs_survived3, main = "Survival rate of men by embarkment point",
        xlab = "Embarkment point", col = c("darkblue", "red"),
        legend = rownames(proportion_men_vs_survived3), beside = TRUE)

Next, we took a look at women’s survival rate by embarkment point. Nearly 94% of women who boarded at Cherbourg survived, whereas 73.7% who boarded at Southampton and 43.5% who boarded at Queenstown survived.

women_vs_survived3 <- table(women$survived, women$embarked)
proportion_women_vs_survived3 <- prop.table(women_vs_survived3, 2)
barplot(proportion_women_vs_survived3, main = "Survival rate of women by embarkment point",
        xlab = "Embarkment point", col = c("darkblue", "red"),
        legend = rownames(proportion_women_vs_survived3), beside = TRUE)

Finally, exploring the relationship between children’s survival rate and embarkment point shows that 81% of children who embarked at Cherbourg survived versus 53% of children who embarked at Southampton. We have no data for children who embarked at Queenstown, which may indicate an error or may simply reflect rows removed during data cleaning. For the groups we can compare, survival rates by passenger group follow the same ordering as the overall rates by embarkment point, with Cherbourg highest.

children_vs_survived3 <- table(children$survived, children$embarked)
proportion_children_vs_survived3 <- prop.table(children_vs_survived3, 2)
barplot(proportion_children_vs_survived3, main = "Survival rate of children given embarkment point",
        xlab = "Embarkment point", col = c("darkblue", "red"),
        legend = rownames(proportion_children_vs_survived3), beside = TRUE)
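An empty group is easy to miss in a bar chart, so it can help to check the column totals before computing proportions. The sketch below uses hypothetical counts in which no children embarked at one port. Dividing by a zero column total yields NaN, which the plot shows as a missing bar.

```r
# Hypothetical counts with an empty 'Q' column, for illustration only
counts <- matrix(c(3, 13, 40, 45, 0, 0), nrow = 2,
                 dimnames = list(outcome = c("died", "survived"),
                                 embarked = c("C", "S", "Q")))

# Column totals reveal groups with no observations
group_sizes <- colSums(counts)
print(group_sizes)

# Proportions within an empty group are undefined (0 / 0 gives NaN)
props <- prop.table(counts, margin = 2)
print(round(props, 3))
print(is.nan(props[, "Q"]))
```

Checking the group sizes first makes it clear whether a missing bar reflects a coding error or simply an absence of passengers in that group.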

Conclusion

We found that women and children survived at much higher rates than men, probably because they were let off the ship first. Although we initially thought that passenger class would have the strongest effect on survival overall, we found that the relationship didn’t hold consistently when we broke down the population into men, women, and children. We found that embarkment point was also associated with survival, since those embarking at Cherbourg survived at much higher rates than those embarking at the other two points.