Final Project

Introduction

Today’s society tends to admire beauty, intelligence, and wealth. The media focuses on those who are considered beautiful, intelligent people are consistently praised and respected, and the wealthy are envied because of their luxurious lifestyles. Many people set their ultimate goals to be proficient in one or more of these areas, and in the search for a romantic partner, men and women tend to put importance on these attributes.

To investigate this claim, data from a speed dating experiment was used. The above dataset was complied by two professors from Columbia Business School, Ray Fisman and Sheena Iyengar.

Subjects were gathered for experimental speed dating events hosted in enclosed room settings in which each participant would engage in four-minute conversations equivalent to a “first date” with each participant of the opposite gender. Using the speed dating design provided the experimenters with both experimental control and resulting data on decisions that were identical to real-life decisions in regards to what a person truly looks for in a romantic partner.

All subjects were students from Columbia University’s graduate and professional schools who responded to email and flier requests to participate in the study. Before partaking in the speed dating events, each subject was asked to complete a pre-event survey in which they were asked about attributes such as their name, email, the undergraduate institution they attended, and the zip code of where they grew up. They were also asked to provide scores to the six attributes of attractiveness, sincerity, intelligence, fun, ambition, and shared interests in regards to which attributes are most important when it comes to what they look for in potential dates. They were asked to provide scores from their own perspective and also from someone of the opposite gender’s perspective.

A random number of subjects were gathered for each speed dating event, so every event did not have the same number of male and female participants. After each participant engaged in a four-minute “first date” with their partner, they were asked to indicate on a scorecard whether or not they would like to meet their partner again for future dates. They were also asked to rate their partner on the six different attributes previously mentioned (attractiveness, sincerity, intelligence, fun, ambition, and shared interests). The dataset includes the results of fourteen speed dating events hosted by the experimenters.

Each subject was also asked to complete follow up surveys. One was provided one day after each subject’s respective event and the other was provided three to four weeks after. Both surveys asked the same questions as the pre-event survey in giving scores of importance to the six attributes, and the later survey also asekd if the participant went on more dates with partners they met from the event.

Hypotheses

The overarching claim made previously was that men and women tend to put importance on beauty, intelligence, and wealth in the search for a romantic partner.

Branching from this initial claim, the first hypothesis that will be tested in this investigation is that men tend to consider attractiveness to be the most important attribute in finding a romantic partner.

Because the dataset does not include direct information about the subjects’ wealth, it will be assumed that intelligence and wealth are highly correlated; intelligent people tend to also be wealthy. The second hypothesis that will be tested is that women tend to consider intelligence to be the most important attribute in finding a romantic partner.

Plan of Analysis

First, the dataset will be explored using simple analyses and visualizations. Attributes about the subjects, such as intended career field, age, and race will be examined to understand the demographic of the sample is like. Then, an initial exploration of what the participants considered important factors in finding a romantic partner will be performed by generating radar plots based upon the information the subjects provided in the pre-event survey. Regression analyses and decision trees classification analyses will be performed on the data to determine which attributes most influenced a subject’s partner in wanting a second meeting with the subject. Finally, the results will be compared with the intial hypotheses.

Data Preparation

The initial dataset contains 8378 observations and 195 variables. Some of the variables in the dataset also converted to factors, as follows.

raw_data$gender <- as.factor(raw_data$gender)
raw_data$career_c <- as.factor(raw_data$career_c)
raw_data$samerace <- as.factor(raw_data$samerace)
raw_data$race <- as.factor(raw_data$race)
raw_data$dec <- as.factor(raw_data$dec)
raw_data$date <- as.factor(raw_data$date)

Data cleaning was not performed on the entire dataset. This was because the data is mainly from surveys and scorecards, for which a subject could have left a response empty. Omitting a participant entirely because of a single empty response could have led to skewed results or a large amount of data loss.

Instead, subsets of appropriate data were taken for the individual analyses performed in this investigation, and data cleaning methods were used on these subsets.

Data Overview

To begin, exploratory data analyses were performed to gain a better understanding of the data itself.

data_career <- raw_data %>% 
  filter(!is.na(career_c)) %>% 
  select(iid,gender, career_c)
data_career <- unique(data_career, by = iid)

career_label <- c("Lawyer", "Academic/Research", "Psychologist",  
                  "Doctor/Medicine", "Engineer", "Creative Arts/Entertainment", 
                  "Banking/Business", "Real Estate", "International Affairs", 
                  "Undecided", "Social Work", "Speech Pathology", "Politics", 
                  "Sports/Athletics", "Other", "Journalism", "Architecture")

ggplot(data = data_career) +
  geom_bar(aes(career_c, fill=gender)) + 
  scale_x_discrete(label = career_label) + coord_flip() + 
  labs(title = "Distribution of Intended Career Fields", x = "Career Field", y = "Count") + 
  scale_fill_discrete("Gender", labels = c("Female", "Male"))

The above stacked bar plot shows the distribution of intended career fields of the subjects. Each career field is graphed with its corresponding participant count. The part of the bar that is colored blue corresponds to the male count, and the part of the bar that is colored red corresponds to the female count. Based on the plot, it is clear that most of the participants intended to go into the Banking/Business or Academic/Research fields.

data_age <- raw_data %>% filter(!is.na(age)) %>% select(iid, gender, age)
data_age <- filter(data_age, age < max(age))
data_age <- unique(data_age, by = iid)

ggplot(data = data_age, aes(x = age,fill = gender)) + coord_flip() + 
  geom_histogram(data = subset(data_age, gender == "0"), binwidth = 2, color = "white") +  
  geom_histogram(data = subset(data_age, gender == "1"), 
                 aes(y = ..count.. * (-1)), binwidth = 2, color = "white") + 
  scale_y_continuous(breaks = seq(-70, 70, 10), labels = abs(seq(-70, 70, 10)))+ 
  scale_x_continuous(breaks = seq(10, 45, 5), labels = seq(10, 45,5)) + 
  labs(title = "Distribution of Age", x = "Age", y = "Count") + 
  scale_fill_discrete("Gender", labels = c("Female", "Male"))

The above two-sided bar plot shows the distribution of age of the subjects. The left side of the bar plot corresponds to the male participants, and the right side of the bar plot corresponds to the female participants. Initially, a female participant aged 55 years was found but was removed as she was deemed an outlier. The plot shows a rather symmetric distribution for both genders. Most of the male and female participants were in their early to late twenties, which is reasonable as the subjects were all students at Columbia University’s graduate and professional schools.

data_race <- raw_data %>% filter(!is.na(race)) %>% select(iid, gender, race)
data_race <- unique(data_race, by = iid)

race_label <- c("Black/African American", "European/Caucasian American", 
                "Latino/Hispanic American", "Asian/Asian American", 
                "Naitive American", "Other")

ggplot(data = data_race) + 
  geom_bar(aes(x = gender,fill = race), position = "fill") + 
  labs(title = "Distribution of Race", x = "Gender", y = "Relative Frequency") +
  scale_fill_discrete("Race", labels = race_label) + scale_y_continuous(labels = percent) +
  scale_x_discrete(labels=c("0" = "Male", "1" = "Female"))

The above bar plot shows the distribution of race of the subjects. The first bar corresponds to the male participants, and the second bar corresponds to the female participants. The race distribution of the subjects seems to be quite similar between the males and the females. According to the experimenters who complied the dataset, the sample closely mirrors the overall race distribution of the graduate and professional students of Columbia University.

Radar Plots

Using the data from the participants’ pre-event surveys, radar plots were generated indicating what men and women looked for in romantic partners. As described previously, this survey asked each participant to distribute scores to the six attributes of attractiveness, sincerity, intelligence, fun, ambition, and shared interests in regards to which attributes are most important when it comes to what they look for in potential dates. They were asked to do so from both their own perspective and the perspective of someone of the opposite gender.

The scoring was performed differently by different subjects. For example, some subjects used a 100-point scale while many others did not. All the scores were normalized by taking each participants’ scores for individual attributes as a percentage of the sum of the scores for all the attributes.

raw_data <- raw_data %>% 
  mutate(sum1_1 = attr1_1 + sinc1_1 + intel1_1 + amb1_1 + shar1_1) %>%
  mutate(attr1_1 = (attr1_1/sum1_1)*100) %>% 
  mutate(sinc1_1 = (sinc1_1/sum1_1)*100) %>% 
  mutate(intel1_1 = (intel1_1/sum1_1)*100) %>% 
  mutate(amb1_1 = (amb1_1/sum1_1)*100) %>% 
  mutate(shar1_1 = (shar1_1/sum1_1)*100) %>% 
  mutate(sum2_1 = attr2_1 + sinc2_1 + intel2_1 + amb2_1 + shar2_1) %>%
  mutate(attr2_1 = (attr2_1/sum2_1)*100) %>% 
  mutate(sinc2_1 = (sinc2_1/sum2_1)*100) %>% 
  mutate(intel2_1 = (intel2_1/sum2_1)*100) %>% 
  mutate(amb2_1 = (amb2_1/sum2_1)*100) %>% 
  mutate(shar2_1 = (shar2_1/sum2_1)*100) 

data_features <- raw_data %>% filter(!is.na(sum1_1)) %>% 
  filter(!is.na(sum2_1)) %>% select(iid, gender, attr1_1:shar2_1)
data_features <- unique(data_features, by = idd) 

men <- filter(data_features, gender =="1")
women <-filter(data_features, gender =="0") 

row_label <- c("Self", "Majority")
column_label <- c("Attractive", "Sincere", "Intelligent", 
                  "Fun", "Ambitious", "Shared Interests")

Men

radar_men <- as.data.frame(matrix(0, nrow = 2, ncol = 6))
colnames(radar_men) <- column_label
rownames(radar_men) <- row_label

for (i in (1:nrow(radar_men))) {
  for(j in c(1:ncol(radar_men))) {
    if(i == 1) {
      radar_men[i, j] <- mean(men[ , 2 + j])
    }
     if(i == 2){
      radar_men[i,j] <- mean(women[ , 14 + j])
    }  
  }
}

radar_men = rbind(rep(40, 5) , rep(0, 5) , radar_men)
radarchart(radar_men, pcol= c( rgb(0.2, 0.5, 0.5, 0.9), rgb(0.7, 0.5, 0.1, 0.9)), 
           pfcol = c(rgb(0.2, 0.5, 0.5, 0.4), rgb(0.7, 0.5, 0.1, 0.4)),  
           plwd = 3 , plty = 1, vlcex = 0.8, 
           title = "Attributes males find most important in their female partner")
legend(x = 1, y = 1.2, legend = c("Male perspective", "Female perspective"), 
       bty = "n", pch = 20 , col = c(rgb(0.2, 0.5, 0.5, 0.4), rgb(0.7, 0.5, 0.1, 0.4)), 
       text.col = "black", cex = 0.8, pt.cex = 2)

This radar plot shows the attributes males find most important in their female partner from two perspectives. The blue area is from the male perspective, so it shows what males find most important in their female partner. The yellow area is from the female perspective, so it shows what females think males find most important in their female partner. While the attributes of shared interests, ambition, and fun seem to have similar weights of importance from both the male perspective and the female perspective, the other three attributes do not. Men tend put more importance on the attributes of sincerity and intelligence and less importance on the attractiveness attribute than women thought. Women also thought men would put the most importance on attractiveness, and while men do in fact put the most importance in this attribute out of the six possibilities, it is not to the extent that women thought.

Women

radar_women <- as.data.frame(matrix(0, nrow = 2, ncol = 6))
colnames(radar_women) <- column_label
rownames(radar_women) <- row_label

for (i in (1:nrow(radar_women))) {
  for(j in c(1:ncol(radar_women))) {
    if(i == 1) {
      radar_women[i,j] <- mean(women[ , 2 + j])
    }
     if( i == 2) {
      radar_women[i,j] <- mean(men[ , 14 + j])
    }  
  }
}

radar_women = rbind(rep(40, 5) , rep(0, 5) , radar_women)
radarchart(radar_women, pcol = c( rgb(0.8, 0.2, 0.5, 0.9), rgb(0.7, 0.5, 0.1, 0.9)), 
           pfcol = c(rgb(0.8, 0.2, 0.5, 0.4), rgb(0.7, 0.5, 0.1, 0.4)),  
           plwd = 3 , plty = 1, vlcex = 0.8,
           title = "Attributes females find most important in their male partner")
legend(x = 1, y = 1.2, legend = c("Female perspective", "Male perspective"), 
       bty = "n", pch = 20 , col = c(rgb(0.8, 0.2, 0.5, 0.4) , rgb(0.7, 0.5, 0.1, 0.4)), 
       text.col = "black", cex = 0.8, pt.cex = 2)

This radar plot shows the attributes females find most important in their male partner from two perspectives. The pink area is from the female perspective, so it shows what females find most important in their male partner. The yellow area is from the male perspective, so it shows what males think female find most important in their male partner. Similar to the first radar plot, the attributes of shared interests, ambition, and fun seem to have similar weights of importance from both the female perspective and the male perspective, and the other three, again, do not. Women seem to put more importance on the attributes of sincerity and intelligence and less importance on the attractiveness attribute than men thought. Men also thought women would put the most importance on the attractiveness attribute, just as women did to men.

It is interesting that both men and women overestimated the weight of importance of attractiveness and underestimated the weight of importance of sincerity and intelligence for the other gender. Perhaps this shows the skewed perspective of love and romantic attraction in today’s society.

The results of the radar plots seem to support the original hypotheses. However, it is important to note that while the partipants tended to respond in this way, their actual actions may not have been reflective of what they had initially said. Thus, further analysis had to be performed on the actual data from the speed dating events themselves.

Modeling I: Linear Regression

To begin the regression analysis, the data was manipulated so that the averages each subject’s ratings for the six attributes of attractiveness, sincerity, intelligence, fun, ambition, and shared interests were calculated. In each of the linear regression analyses, the independent variable was one of these average ratings. In the multiple regression analyses, the average ratings for all six attributes were considered.

The dependent variable for each of the analyses was the percent of times that a subject’s partner said they wanted to meet the subject for a second date. Because each speed dating event had a different numbers of participants, using the number of times a subject’s partner voted that they wanted to see the subject again seemed unfair. A percentage was taken so that the values could be compared relative to one another.

regression <- raw_data %>% group_by(iid) %>% 
  summarize(gender = mean(as.numeric(gender)), 
            attr = mean(attr_o, na.rm = TRUE), 
            sinc = mean(sinc_o, na.rm = TRUE), 
            intel = mean(intel_o, na.rm = TRUE), 
            fun = mean(fun_o, na.rm = TRUE), 
            amb = mean(amb_o, na.rm = TRUE), 
            shar = mean(shar_o, na.rm = TRUE), 
            selected = sum(dec_o, na.rm = TRUE), n = n())  %>% 
  mutate(selected_perc = selected/n*100) %>% 
  filter(selected_perc > 0)

Men

A pairs matrix was generated to visualize an overview of the relationships between each male subject’s average ratings for the six attributes and the percent of selections of the subject. Based on the matrix, the attributes with the greatest correlation with the percent of selections were attractiveness and fun. Linear regressions were performed with these attributes. Because the original hypothesis stated that women tend to find intelligence to be the most important attribute in their male partners, a linear regression for the intelligence attribute was also performed. As the pairs matrix initially showed, attractiveness has the strongest positive correlation with percent of selections with a correlation coefficient of 0.79. Intelligence has the lowest correlation with percent of selections among the three attributes selected with a correlation coefficient of 0.31.

regression_men <- regression %>% filter(gender == 1)

pairs.panels(regression_men[ , c(3:8, 11)], 
             method = "pearson", scale = TRUE, ellipses = FALSE, 
             labels = c("Attractive", "Sincere", "Intelligent", "Fun", 
                        "Ambitious", "Interests", "Selected"), 
             cex.labels = 1.3, hist.col = "lightblue")

reg_intel_men <- lm(selected_perc ~ intel, data = regression_men)
summary(reg_intel_men)

## 
## Call:
## lm(formula = selected_perc ~ intel, data = regression_men)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.707 -18.532  -0.084  17.682  58.149 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -46.039     17.879  -2.575   0.0106 *  
## intel         12.925      2.434   5.311  2.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.25 on 268 degrees of freedom
## Multiple R-squared:  0.09521,    Adjusted R-squared:  0.09184 
## F-statistic:  28.2 on 1 and 268 DF,  p-value: 2.298e-07

plot(data = regression_men, selected_perc ~ intel, main = "'Intelligence' Score vs Percent of selection", xlab = "'Intelligence' Score", ylab = "Percent of selection") 
abline(reg_intel_men, col = "blue")

reg_attr_men <- lm(selected_perc ~ attr, data = regression_men)
summary(reg_attr_men)

## 
## Call:
## lm(formula = selected_perc ~ attr, data = regression_men)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.027  -9.887  -0.923   9.275  48.056 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -62.8786     5.3639  -11.72   <2e-16 ***
## attr         17.1649     0.8146   21.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.35 on 268 degrees of freedom
## Multiple R-squared:  0.6236, Adjusted R-squared:  0.6222 
## F-statistic:   444 on 1 and 268 DF,  p-value: < 2.2e-16

plot(data = regression_men, selected_perc ~ attr, main = "'Attractiveness' Score vs Percent of selection", xlab = "'Attractiveness' Score", ylab = "Percent of selection") 
abline(reg_attr_men, col = "blue")

reg_fun_men <- lm(selected_perc ~ fun, data = regression_men)
summary(reg_fun_men)

## 
## Call:
## lm(formula = selected_perc ~ fun, data = regression_men)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.397 -12.058   0.263  11.974  43.703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -64.477      8.518  -7.569 6.06e-13 ***
## fun           17.253      1.288  13.391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.1 on 268 degrees of freedom
## Multiple R-squared:  0.4009, Adjusted R-squared:  0.3986 
## F-statistic: 179.3 on 1 and 268 DF,  p-value: < 2.2e-16

plot(data=  regression_men, selected_perc ~ fun, main = "'Fun' Score vs Percent of selection", xlab = "'Fun' Score", ylab = "Percent of selection") 
abline(reg_fun_men, col = "blue")

reg_mult_men<-lm(selected_perc ~ attr + sinc + intel + fun + amb + shar, 
                 data = regression_men)
summary(reg_mult_men)

## 
## Call:
## lm(formula = selected_perc ~ attr + sinc + intel + fun + amb + 
##     shar, data = regression_men)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.349  -9.549  -0.860   9.289  48.347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -66.4721    12.6617  -5.250 3.14e-07 ***
## attr         13.8817     1.0900  12.735  < 2e-16 ***
## sinc         -4.8869     2.0821  -2.347 0.019662 *  
## intel         2.4104     2.4407   0.988 0.324261    
## fun           5.7296     1.5845   3.616 0.000358 ***
## amb          -0.1195     1.8422  -0.065 0.948308    
## shar          1.0978     1.5512   0.708 0.479775    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.86 on 263 degrees of freedom
## Multiple R-squared:  0.6552, Adjusted R-squared:  0.6474 
## F-statistic:  83.3 on 6 and 263 DF,  p-value: < 2.2e-16

The multiple regression analysis shows that using a combination of all the attributes produced a moderately strong positive relationship with percent of selections. The adjusted R-squared value was 0.6474, meaning that 64.74% of the variation in the relationship between all the attributes and the percent of selections can be accounted for by the best-fit line from this regression analysis. The most statistically significant variables were attractiveness with the lowest p-value and fun with p-value equal to 0.000358. Sincerity also showed some degree of significance, but the attributes of sincerity, intelligence, and ambition had little importance in the model with p-values higher than 0.05.

Women

The same procedure was repeated for females. A pairs matrix was generated to visualize an overview of the relationships between each female subject’s average ratings for the six attributes and the percent of selections of the subject. Based on the matrix, the attributes with the greatest correlation with the percent of selections were, again, attractiveness and fun, just as it was for the males, and also shared interests. Linear regressions were performed with these attributes. Attractiveness has the strongest positive correlation with percent of selections with a correlation coefficient of 0.74. This supports the original hypothesis that men find attractiveness to be the most important attribute in their female partners. Fun and shared interests both have moderately strong positive linear relationships with percent of selections as the correlation coefficients are 0.65 and 0.64 respectively.

regression_women <- regression %>% filter(gender != 1)

pairs.panels(regression_women[ , c(3:8, 11)], 
             method = "pearson", scale = TRUE, ellipses = FALSE, 
             labels = c("Attractive", "Sincere", "Intelligent", "Fun", 
                        "Ambitious", "Interests", "Selected"), 
             cex.labels = 1.3, hist.col = "pink")

reg_attr_women <- lm(selected_perc ~ attr, data = regression_women)
summary(reg_attr_women)

## 
## Call:
## lm(formula = selected_perc ~ attr, data = regression_women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.851  -8.587  -1.115   9.742  43.469 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.4859     4.8501  -8.966   <2e-16 ***
## attr         13.7285     0.7862  17.461   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.45 on 257 degrees of freedom
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5408 
## F-statistic: 304.9 on 1 and 257 DF,  p-value: < 2.2e-16

plot(data = regression_women, selected_perc ~ attr, main = "'Attractiveness' Score vs Percent of selection", xlab = "'Attractiveness' Score", ylab = "Percent of selection") 
abline(reg_attr_women, col = "red")

reg_fun_women <- lm(selected_perc ~ fun, data = regression_women)
summary(reg_fun_women)

## 
## Call:
## lm(formula = selected_perc ~ fun, data = regression_women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.701 -10.365  -0.113   9.409  48.061 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -44.2514     6.2438  -7.087 1.32e-11 ***
## fun          13.0532     0.9576  13.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.28 on 257 degrees of freedom
## Multiple R-squared:  0.4196, Adjusted R-squared:  0.4174 
## F-statistic: 185.8 on 1 and 257 DF,  p-value: < 2.2e-16

plot(data = regression_women, selected_perc ~ fun, main = "'Fun' Score vs Percent of selection", xlab = "'Fun' Score", ylab = "Percent of selection") 
abline(reg_fun_women, col = "red")

reg_shar_women <- lm(selected_perc ~ shar, data = regression_women)
summary(reg_shar_women)

## 
## Call:
## lm(formula = selected_perc ~ shar, data = regression_women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.205 -10.867   0.355  11.107  41.914 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -35.166      5.679  -6.192 2.34e-09 ***
## shar          13.601      1.015  13.407  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.39 on 257 degrees of freedom
## Multiple R-squared:  0.4115, Adjusted R-squared:  0.4093 
## F-statistic: 179.7 on 1 and 257 DF,  p-value: < 2.2e-16

plot(data = regression_women, selected_perc ~ shar, main = "'Shared Interests' Score vs Percent of selection", xlab = "'Shared Interests' Score", ylab = "Percent of selection") 
abline(reg_shar_women, col = "red")

reg_mult_women <- lm(selected_perc ~ attr + sinc + intel + fun + amb + shar, 
                     data = regression_women)
summary(reg_mult_women)

## 
## Call:
## lm(formula = selected_perc ~ attr + sinc + intel + fun + amb + 
##     shar, data = regression_women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.821  -9.186   0.151   7.667  40.262 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -55.413      9.989  -5.547 7.32e-08 ***
## attr           9.662      1.068   9.045  < 2e-16 ***
## sinc          -1.732      1.526  -1.135  0.25732    
## intel          3.100      2.170   1.428  0.15445    
## fun            3.145      1.451   2.168  0.03111 *  
## amb           -2.819      1.570  -1.795  0.07378 .  
## shar           4.593      1.436   3.198  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.63 on 252 degrees of freedom
## Multiple R-squared:  0.6009, Adjusted R-squared:  0.5914 
## F-statistic: 63.24 on 6 and 252 DF,  p-value: < 2.2e-16

The multiple regression analysis shows that using a combination of all the attributes produced a moderately strong positive relationship with percent of selections. The adjusted R-squared value was 0.5914, meaning that 59.14% of the variation in the relationship between all the attributes and the percent of selections can be accounted for by the best-fit line from this regression analysis. The most statistically significant variable was attractiveness with the lowest p-value. Shared interests and fun also showed moderate levels of significance, while other variables of sincerity, intelligence, and ambition had p-values higher than 0.05.

Modeling II: Decision Tree

A second analysis was conducted using decision trees to find which attributes were most important to men and women in selecting their partners. In the first set of decision trees, the six attributes of attractiveness, sincerity, intelligence, fun, ambition, and shared interests were used. In the second, different attributes that were not previously analyzed were used, such as career field, race, and age.

Using the Original Six Attributes

Men

For men, after training the classifier, the decision tree deemed that attractiveness and shared interests were the attributes that should be split on. For example, if a male participant’s average attractiveness rating was above 6.8 and their average shared interest rating was above 6.5, he had a 73% chance of being selected by his partner to have a second date. The accuracy of the decision tree was calculated to be above 70%, which is a moderately high accuracy.

men_tree <- raw_data %>% filter(gender == 1)

men_fit <- rpart(dec_o ~ attr_o + sinc_o + intel_o + fun_o + amb_o + shar_o, 
                 data = men_tree, method = "class")
rpart.plot(men_fit)

shuffled_men <- sample_n(men_tree, nrow(men_tree))
split_men <- 0.8*nrow(shuffled_men)
train_men <- shuffled_men[1:split_men, ]
test_men <- shuffled_men[(split_men + 1) : nrow(shuffled_men), ]

men_fit_test <- rpart(dec_o ~ attr_o + sinc_o + intel_o + fun_o + amb_o + shar_o, 
                      data = train_men, method = "class")
tree_predictions_men <- predict(men_fit_test, test_men, type = "class")
mean(tree_predictions_men == test_men$dec_o)

## [1] 0.7315036

Women

For women, after training the classifier, the decision tree deemed that attractiveness, fun, and shared interests were the three attributes that should have been split on. For example, if a female participant’s average attractiveness rating was above 6.2 and their average fun rating was above 6.8, she had a 79% chance of being selected by her partner to have a second date. If a female participant’s average attractiveness rating was above 6.2 but their average fun rating was less that 6.8, she had a 62% change of being selected by her partner to have a second date if her average shared interest rating was above 4.5. The accuracy of the decision tree was calculated to be approximately 70%, which is a moderately high accuracy.

women_tree <- raw_data %>% filter(gender != 1)

women_fit <- rpart(dec_o ~ attr_o + sinc_o + intel_o + fun_o + amb_o + shar_o, 
                   data = women_tree, method = "class")
rpart.plot(women_fit)

shuffled_women <- sample_n(women_tree, nrow(women_tree))
split_women <- 0.8*nrow(shuffled_women)
train_women <- shuffled_women[1:split_women, ]
test_women <- shuffled_women[(split_women + 1) : nrow(shuffled_women), ]

women_fit_test <-rpart(dec_o ~ attr_o + sinc_o + intel_o + fun_o + amb_o + shar_o, 
                       data = train_women, method = "class")
tree_predictions_women <- predict(women_fit_test, test_women, type = "class")
mean(tree_predictions_women == test_women$dec_o)

## [1] 0.7631579

Using New Attributes

The second set of decision trees used attributes not previously analyzed. These attributes include the number of date that the participant was for their partner, age, race, field of study, primary goal in participating in the speed dating event, how frequently the participant went on dates, and intended career field.

Men

For men, the decision tree deemed that the best attributes to split on were frequency of dates a male participant went on in general, race, intended career field, and age. Based on the decision tree, men who went on dates less than twice a month, were of the Black/African American or Asian/Pacific Islander/Asian-American race, were intending to be in the career field of Academia/Research, and were more than 30 years old were more unfavorable by the female participants.

men_fit_2 <-rpart(dec_o ~ order + age + race + field_cd + goal + date + career_c, 
                  data = men_tree, method = "class")
rpart.plot(men_fit_2)

shuffled_men_2 <- sample_n(men_tree, nrow(men_tree))
split_men_2 <- 0.8*nrow(shuffled_men_2)
train_men_2 <- shuffled_men_2[1:split_men_2, ]
test_men_2 <- shuffled_men_2[(split_men_2 + 1) : nrow(shuffled_men_2), ]

men_fit_test_2 <-rpart(dec_o ~ order + age + race + field_cd + goal + date + career_c, 
                       data = train_men_2, method = "class")
tree_predictions_men_2 <- predict(men_fit_test_2, test_men_2, type = "class")
mean(tree_predictions_men_2 == test_men_2$dec_o)

## [1] 0.6300716

Women

For women, the decision tree deemed that the best attributes to split on were career field, age, race, and field of study. Based on the decision tree, women who were intending to be in the career fields of Psychology, Engineering, Creative Arts/Entertainment, Social Work, Speech Pathology, Politics, Journalism, or Undecided were more unfavorable by the male participants. In addition, if the female participant was more than 32 years old, was of the Asian/Pacific Islander/Asian-American race, and studied Education, Biological Sciences/Chemistry/Physics, Social Work, Political Science/International Affairs, Film, Fine Arts/Arts Administration, Languages, Architecture, or were Undecided were also unfavorable candidates for a second date.

women_fit_2 <- rpart(dec_o ~ order + age + race + field_cd + goal + date + career_c,
                     data = women_tree, method = "class")
rpart.plot(women_fit_2)

shuffled_women_2 <- sample_n(women_tree, nrow(women_tree))
split_women_2 <- 0.8*nrow(shuffled_women_2)
train_women_2 <- shuffled_women_2[1:split_women_2, ]
test_women_2 <- shuffled_women_2[(split_women_2 + 1) : nrow(shuffled_women_2), ]

women_fit_test_2 <- rpart(dec_o ~ order + age + race + field_cd + goal + date + career_c,
                          data = train_women_2, method = "class")
tree_predictions_women_2 <- predict(women_fit_test_2, test_women_2, type = "class")
mean(tree_predictions_women_2 == test_women_2$dec_o)

## [1] 0.5586124

For both the men and the women, the accuracy of the decision trees for these attributes were lower than the decision trees for the six attributes of attractiveness, sincerity, intelligence, fun, ambition, and shared interests. The decision tree for men using the new attributes had an accuracy in the sixties, and the decision tree for women using the new attributes had an accuracy of fifties. This may indicate that while men and women do consider these attributes when looking for a romantic partner, they may not be as important as the six attributes analyzed previously.

Conclusion

According to the radar plots previously produced, men seemed to put the most importance in the attributes of attractiveness, sincerity, and intelligence while women seemed to put the most important in the attributes of intelligence and sincerity. It was interesting that both men and women overestimated for the other gender the importance of attractiveness – women thought men put a very large weight in attractiveness, and men also thought women put a large weight on attractiveness. When asked directly, both men and women said other attributes are just as or more important than attractiveness. However, based on this investigation into the data from the speed dating events, it appears that attractiveness is in fact the most important attribute in finding a romantic partner from both the male and female perspectives. This conclusion is based on the results of the regression analyses and the decision tree classification analyses, which both resulted in attractiveness being an important attribute for both men and women.

Perhaps attraction to a person is heavily dependent on appearance and beauty, but people are not as honest with themselves in saying so directly. Yet the data shows that beauty and physical attraction tends to be quite important in finding a romantic partner.

The original hypothesis for men holds true. Men do tend to consider attractiveness to be the most important factor in finding a romantic partner. The original hypothesis for women appears to be false. Women do not tend to consider intelligence to be the most important factor in finding a romantic partner, and rather, they also tend to consider attractiveness to be the most important factor in finding a romantic partner.

Final Project

Min Jeong Kang and Christie Gahm

11 December 2017

Introduction

Hypotheses

Plan of Analysis

Data Preparation

Data Overview

Radar Plots

Men

Women

Modeling I: Linear Regression

Men

Women

Modeling II: Decision Tree

Using the Original Six Attributes

Men

Women

Using New Attributes

Men

Women

Conclusion