We chose to explore the General Social Survey from the year 2016, which documents the perspective of those living in the U.S. on a range of issues. The survey itself, conducted every two years, contains 960 questions. For the year 2016, 2867 people were surveyed. We chose to focus on opinion on gun control–specifically, we wanted to find out which factors, both demographic and geographic, influence whether a person is more likely to be for or against a law which would require a person to obtain a police permit before they could buy a gun.
The premise and guiding question behind our analysis project is ‘which characteristics are likely to be found in respondents who stated that they oppose gun permits?’
We hypothesize that age, education level, geographic region, class, political party affiliation and whether or not the respondent already owns a gun will predict whether or not a respondent is more likely to oppose gun permits.
We found that only 18% of those living in the Northeast opposed gun permits, whereas 34% of those living in the South did. In the Midwest and the West, the proportion of people who opposed gun permits was 27% and 29% respectively. This suggests that if you live in the South, you are more likely to oppose gun permits.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 26.455. The associated p value was 7.658e-06. This provides strong evidence that there is a statistically significant relationship between region and opinion on gun permits.
The data set we’re first attaching (“guncontrol.csv”) includes the complete results of the General Social Survey for the year 2016, which includes 2867 observations and 960 variables. We found that removing NA on all the variables that we intended to test removed too many observations–thereby reducing the sample size of some variables and skewing the results of the some of the tests we ran. In each of the tests we ran, we reloaded the original data set and constructed a smaller data frame for the variables we intended to test (in this case the opinion on gun permits and region) and then removed NAs, ensuring the we include the maximum number of observations. The data set had two different types of missing valued that were labeled as “.i” and “.d”, so we turned these into ‘NA’ before ommitting.
Upon first running a chi-squared test for region and gunlaw (favoring or opposing gun permits), the test did not indicate that the relationship was significant–however, consolidating the 9 regions into 4 did yield a statistically significant relationship.
Here, we created a new column that we called region2, coded each of the 9 regions into a simplified set of 4: West, South, Midwest, and Northeast. Then, we created a proportion table for the proportion of respondents within each of the 4 regions who responded that they favoed or opposed gun permits.
barplot(proportion_gun_vs_region, main="Opinions on Gun Permits by Region", xlab="Region", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_region), beside= TRUE)
chisq.test(gun_vs_region)
##
## Pearson's Chi-squared test
##
## data: gun_vs_region
## X-squared = 26.455, df = 3, p-value = 7.658e-06
Finally, to better visualize the results of our proportion table, we created a barplot broken down by region and proportion of respondents who favor or oppose gun permits. This process serves as the foundation for each test we run between ‘gunlaw’ (whether or not they respondent favors or opposes gun permits) and the variable we want to test. As previously mentioned, running a chi-squared test on gunlaw vs. region shows that the two have a significant relationship, with a p value of 7.658e-06.
To create a visualization of this data, we first loaded the ‘states’ data set found in the maps library and once again added another column for region (region2) converted into four main regions (West, South, Midwest, and Northeast). In order to construct a color heatmap based on proportion of respondents who oppose gun permits, we filtered our proportion table by ‘oppose’and merged it with our ’states’ data set by region2 to create a data frame that would be easily translated into a heatmap with a color scale coded on the proportion of respondents who oppose gun permits.
ggplot(map.df, aes(x=long,y=lat,group=group,label=region2))+
geom_polygon(aes(fill=Proportion), color = "white")+
#white border not working?
geom_path()+
scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")+
ggtitle("Regions with the Highest Proportion of Respondents Opposing Gun Permits")+
coord_fixed(1.3)+
theme(panel.background = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank())
coord_map()
## <ggproto object: Class CoordMap, Coord>
## aspect: function
## distance: function
## is_linear: function
## labels: function
## limits: list
## orientation: NULL
## params: list
## projection: mercator
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## train: function
## transform: function
## super: <ggproto object: Class CoordMap, Coord>
As illustrated by the heatmap, those in the South oppose gun permits more so than other major regions.
We found that the age group with the highest proportion of those opposing gun permits was 18-34. 32.5% of 18-34 year olds oppose gun permits, whereas only 23.9% of those over the age of 65+ did, making those above the age of 65 least likely to oppose gun permits. For those in the age groups 35-49 and 50-64, the proportion of people who opposed gun permits was 28.8% and 28.3% respectively.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 7.8065. The associated p value was 0.05019. This value is slightly higher than 0.05, so there isn’t quite strong enough evidence at the 5% significance level to suggest that there is a statistically significant relationship between age and opinion on gun control.
Once again, we created a smaller data frame (containing only the variables gunlaw and age) from the large, uncleaned data set and then omitted NAs, resulting in 1850 responses.The data set had two different types of missing valued that were labeled as “.i” and “.d”, so we turned these into ‘NA’ before ommitting. Because chi-squared tests are suited for categorical data, we created a new column (age group) and classified the numeric age data into four categorical groups (18-34, 35-49,50-64,65+) before using the data to create a proportion table.
proportion_gun_vs_region
##
## Midwest Northeast South West
## favor 0.7312775 0.8164557 0.6608187 0.7128713
## oppose 0.2687225 0.1835443 0.3391813 0.2871287
barplot(proportion_gun_vs_age, main="Opinions on Gun Permits by Age Group", xlab="Age group", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_region), beside= TRUE)
chisq.test(gun_vs_age)
##
## Pearson's Chi-squared test
##
## data: gun_vs_age
## X-squared = 7.8065, df = 3, p-value = 0.05019
Again, to better visualize the results of our proportion table, we created a barplot broken down by age group and proportion of respondents who favor or oppose gun permits. Since our chi-squared test resulted in a p value of 0.05019, there isn’t quite enough evidence to point to a statistically significant relationship between age group and opinion on gun permits.
We found that the class group with the highest proportion of people opposing gun permits was working-class. 31.1% of the working-class group opposed gun permits, whereas only 14% of those who identified as upper-class opposed gun permits. For those who identified as lower class and middle class, the proportion of people who opposed gun permits was 27.7% and 25.9% respectively.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 9.9909. The associated p value was 0.01864. This value is lower than 0.05, so there is strong evidence at the 5% significance level to suggest that there is a statistically significant relationship between class and opinion on gun control.
Following the same protocol, we created a new data set containing only the varaibles gunlaw and class, removed NAs, then converted the data into a proportion table.The data set had two different types of missing valued that were labeled as “.i” and “.d”, so we turned these into ‘NA’ before ommitting.
proportion_gun_vs_class
##
## lower class middle class upper class working class
## favor 0.7225434 0.7409880 0.8666667 0.6887115
## oppose 0.2774566 0.2590120 0.1333333 0.3112885
barplot(proportion_gun_vs_class, main="Opinions on Gun Permits by Class", xlab="Class", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_class), beside= TRUE)
chisq.test(gun_vs_class)
##
## Pearson's Chi-squared test
##
## data: gun_vs_class
## X-squared = 10.585, df = 3, p-value = 0.01419
Here, we imported a new data set that contains data on the percentage of respondents within each class who oppose gun permits over time, from the time period 1972 to 2016. The uncleaned data set was not in a form that we could work with, and contained a lot of extra details that we did not need. We therefore selected only the cells A12:AC16. We then transposed the data set so that the year and percentage opposed for each class were columns instead of rows. We added column titles to this data frame. The uncleaned data set contained cell data with percent of a class opposed, with what we assume to be margin or error in the same cell. To convert this data set into a workable one, we had to convert each of the columns containing the percentage of those opposed to gun permits into numeric data (we created new columns called LowerClass2, WorkingClass2, MiddleClass2 and UpperClass2 to store this numeric data). This yielded a set of numbers that did not match with the original true value, so we manually replaced each number with the true ‘percent opposed’ values. Then, we contructed a line graph using ggplot to illustrate the change in percent opposed broken down by class over time.
ggplot(final_df1, aes(x=Year)) +
geom_line(aes(y=LowerClass2), colour = "blue", group=1, show.legend = TRUE) +
geom_line(aes(y=WorkingClass2), colour = "red", group=1, show.legend = TRUE) +
geom_line(aes(y=MiddleClass2), colour = "green", group=1, show.legend = TRUE) +
geom_line(aes(y=UpperClass2), colour = "purple", group=1, show.legend = TRUE) +
geom_point(aes(y=LowerClass2), colour = "blue") +
geom_point(aes(y=WorkingClass2), colour = "red") +
geom_point(aes(y=MiddleClass2), colour = "green") +
geom_point(aes(y=UpperClass2), colour = "purple") +
ggtitle("Percent Opposed to Gun Permits by Class Identification from 1972 to 2016", subtitle = "Lower Class = blue, Working Class = red, Middle Class = green, Upper Class = purple") + xlab("Year") + ylab("Percent Opposed") +
theme(text = element_text(size=12),
axis.text.x = element_text(angle=90, hjust=1))
The line graph of percent opposed to gun permits broken down by class over time illustrates how the percentage of those who oppose gun permits within each class has varied over time. Most notably, the fluctuations in percent opposed amongst members of the upper class seem to fluctuate the most dramatically from time period to time period (particularly between 1976 and 1977). Furthermore, in 1972, there was a bigger difference in the percentage opposed for who were lower class, working class, and middle class, whereas in 2016 the difference between these groups is much smaller, suggesting that their opinions are now more aligned.
We found that 40% of Republicans oppose gun permits, which was the highest proportion of all three political groups. Only 19.1% of democrats opposed gun permits and 29.8% of those who identified as independent or another party opposed gun permits. This suggests that if you are a Republican, you are more likely to oppose gun permits.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 56.311. The associated p value was 5.919e-13. This provides strong evidence that there is a statistically significant relationship between political party identification and opinion on gun permits.
As we did with the previous variable, we created a smaller data set from the original uncleaned one that contained only the variables gunlaw and partyid. To simplify analysis on political party, we consolidated the 8 response categories in partyid into a new variable (polparty) that only contained 3 categories: Democrat, Republican, and Independent
barplot(proportion_gun_vs_party, main="Opinions on Gun Permits by Political Party", xlab="Political Party", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_party), beside= TRUE)
chisq.test(gun_vs_party)
##
## Pearson's Chi-squared test
##
## data: gun_vs_party
## X-squared = 56.311, df = 2, p-value = 5.919e-13
To analyze party opinions on gun permits over time, we imported a new data set that contains data on the percentage of respondents within each party who oppose gun permits over time, from the time period 1972 to 2016. The uncleaned data set was not in a form that we could work with, and contained a lot of extra details that we did not need. We therefore selected only the cells A12:AC16. We then transposed the data set so that the year and percentage opposed for each party were columns instead of rows. We added column titles to this data frame. The uncleaned data set contained cell data with percent of a party opposed, with margin or error in the same cell. To convert this data set into a workable one, we had to convert each of the columns containing the percentage and margin of error of those opposed to gun permits into numeric data (we created new columns called Democrat2, Republican2, Independent2 to store this numeric data). This yielded a set of numbers that did not match with the original true value, so we manually replaced each number with the true ‘percent opposed’ values. Then, we contructed a line graph using ggplot to illustrate the change in percent opposed broken down by class over time.
ggplot(final_df, aes(x=Year)) + geom_line(aes(y=Democrat2), colour = "blue", group=1, show.legend = TRUE) + geom_line(aes(y=Republican2), colour = "red", group=1, show.legend = TRUE) + geom_line(aes(y=Independent2), colour = "green", group=1, show.legend = TRUE) + geom_point(aes(y=Democrat2), colour = "blue") + geom_point(aes(y=Republican2), colour = "red") + geom_point(aes(y=Independent2), colour = "green") + scale_colour_manual(values=c("Republican" = "red", "Independent" = "green", "Democrat" = "blue"), guide = "legend") + ggtitle("Percent Opposed to Gun Permits by Party Affiliation from 1972 to 2016", subtitle = "Republican = red, Democrat = blue, Independent = green") + xlab("Year") + ylab("Percentage") + theme(text = element_text(size=12),
axis.text.x = element_text(angle=90, hjust=1))
Notably, we found that the gap between the percentage of Democrats and Republicans who oppose gun permits has become significant wider over time. This supports evidence that political parties in the US have become increasingly polarized over time.
We found that 32.8% of those who completed high school but did not attend college opposed gun permits, which the was the highest proportion across the three education levels. 25.8% of those who did not finish high school and 23.2% of those who went on to college opposed gun permits. This suggests that high school graduates who did not attend college are more likely to oppose gun permits.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 18.991. The associated p value was 7.52e-05. This provides strong evidence that there is a statistically significant relationship between education level and opinion on gun permits.
Here, we created a data frame to contain only the variables gunlaw and degree. We created a new column that we called ‘educlevel’ which simplified the variable ‘degree’ from 5 categories to 3 categories : Less Than High School, High School, and College+. As we did with previous tests, we converted “.i” and “.d” into NA before ommitting these missing values.
barplot(proportion_gun_vs_educ, main="Opinions on Gun Permits by Education Level", xlab="Education Level", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_educ), beside= TRUE)
chisq.test(gun_vs_educ)
##
## Pearson's Chi-squared test
##
## data: gun_vs_educ
## X-squared = 18.991, df = 2, p-value = 7.52e-05
qplot(gunmodel$educlevel, gunmodel$owngun, color = gunlaw, data = gunmodel, geom = "jitter", main = "Education Level & Gun Ownership", xlab = "Education Level", ylab="Owns Gun")
From our jitter qplot for education level vs. owning a gun, it’s clearbased on the density of points that the majority of respondents surveyed don’t own a gun. However, we thought it was interesting that very few of those with a college education oppose gun permits, whereas many more people with only a high school education oppose gun permits. This matches was out earlier analysis found.
We found that 33.8% of men opposed gun permits, whereas only 24.1% of women did. This suggests that men are more likely to oppose gun permits.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 20.925. The associated p value was 4.776e-06. This provides strong evidence that there is a statistically significant relationship between sex and opinion on gun permits.
Here, we created a data frame to contain only the variables gunlaw and sex. As we did with previous tests, we converted “.i” and “.d” into NA before ommitting these missing values.
proportion_gun_vs_sex
##
## female male
## favor 0.7592233 0.6618357
## oppose 0.2407767 0.3381643
barplot(proportion_gun_vs_sex, main="Opinions on Gun Permits by Sex", xlab="Sex", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_sex), beside= TRUE)
chisq.test(gun_vs_sex)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gun_vs_sex
## X-squared = 20.925, df = 1, p-value = 4.776e-06
We found that 30.1% of white people opposed gun permits, which the was the highest proportion across race categories. 24.2% of black people and 23.1% of those with another race opposed gun permits. This suggests that white people may be slightly more likely to oppose gun permits.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 7.1444. The associated p value was 0.02809. This value is lower than 0.05, so there is strong evidence at the 5% significance level to suggest that there is a statistically significant relationship between race and opinion on gun control.
Here, we created a data frame to contain only the variables gunlaw and race. As we did with previous tests, we converted “.i” and “.d” into NA before ommitting these missing values.
barplot(proportion_gun_vs_race, main="Opinions on Gun Permits by Race", xlab="Race", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_race), beside= TRUE)
chisq.test(gun_vs_race)
##
## Pearson's Chi-squared test
##
## data: gun_vs_race
## X-squared = 7.1444, df = 2, p-value = 0.02809
We found that 39.8% of those who had a gun at home opposed gun permits, whereas only 22.1% of those who did not have a gun at home opposed gun permits. This suggests that if you have a gun at home, you are more likely to oppose gun permits.
We tested this relationship for significance using the chi-squared test and found a chi-squared statistic of 60.845. The associated p value was 6.175e-15. This provides strong evidence that there is a statistically significant relationship between having a gun at home and opinion on gun permits.
Here, we created a data frame to contain only the variables gunlaw and owngun As we did with previous tests, we converted “.i” and “.d” into NA before ommitting these missing values. There were also three other types of missing values (“DK”, “IAP”, and “refused”) which we also converted into NA before ommitting.
proportion_gun_vs_ownership
##
## no yes
## favor 0.7794955 0.6024096
## oppose 0.2205045 0.3975904
barplot(proportion_gun_vs_ownership, main="Opinions on Gun Permits by Gun Ownership", xlab="Owns Gun", ylab="Proportion", col=c("darkblue","red"), legend = rownames(proportion_gun_vs_ownership), beside= TRUE)
chisq.test(gun_vs_ownership)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gun_vs_ownership
## X-squared = 60.845, df = 1, p-value = 6.175e-15
The first step in building classification models was to create a new data set that merged all the variables we explored and tested. We then recreated the new variables that we used to simplify and classify many of the response variables (region2, agegroup, educlevel and polparty).
shuffled <- sample_n(gunmodel, nrow(gunmodel))
split <- 0.8 * nrow(shuffled)
train <- shuffled[1 : split, ]
test <- shuffled[(split + 1) : nrow(shuffled), ]
nb_class <- naiveBayes(gunlaw ~ owngun + educlevel + sex + region2 + polparty + race, data = train)
nb_predictions <- predict(nb_class, test, method="class")
mean(nb_predictions == test$gunlaw)
## [1] 0.7046784
feature1test <- jitter(as.numeric(test$region2))
feature1train <- jitter(as.numeric(train$region2))
feature2test <- jitter(as.numeric(test$polparty))
feature2train <- jitter(as.numeric(train$polparty))
feature3test <- jitter(as.numeric(test$owngun))
feature3train <- jitter(as.numeric(train$owngun))
feature4test <- jitter(as.numeric(test$educlevel))
feature4train <- jitter(as.numeric(train$educlevel))
feature5test <- jitter(as.numeric(test$sex))
feature5train <- jitter(as.numeric(train$sex))
feature6test <- jitter(as.numeric(test$race))
feature6train <- jitter(as.numeric(train$race))
knn_train_x <- data.frame(feature1train, feature2train, feature3train, feature4train, feature5train, feature6train)
knn_test_x <- data.frame(feature1test, feature2test, feature3test, feature4test, feature5test, feature6test)
knn_train_y <- train$gunlaw
knn_class <- knn(knn_train_x, knn_test_x, knn_train_y, k = 5)
knn_predictions <- knn_class
mean(knn_predictions == test$gunlaw)
## [1] 0.6403509
Our naive Bayes classification model, which attempts to accurately classify observations either as for or against gun permits based on the varaibeles that we found to have a statistically signinficant relationship with opinion (region, political party, gun ownership, education level, sex and race). The accuracy changes every time we run it, but it was around 70% every time. In order to construct a knn classification model, we first had to jitter and convert each of the train and test data for each of the six features we tested (region, political party, sex, race, education level and gun ownership) into a numeric before testing the model. The knn model was slightly less accurate than the naive Bayes model, achieving an accuracy of more or less 65%. These models do not have as high of an accuracy level as we hoped, but we think this indicates how complex and nuanced opinions about gun control are. Although we were able to detect relationships between opinions and the variables we selected, opinion on gun control is difficult to predict.