In this project, we examined the correlates of smoking in the Slovak Youth Survey. We wanted to know which variables correlated with smoking and, with that knowledge in hand, to build a decision tree that would predict smoking well.
Data is messy, and we ran into trouble. The correlations between some variables were much weaker than expected. As a result, many of our decision trees span only a single branch: the tree classifier did not deem our variables significant enough to separate the data. The same fact may explain our low accuracy. For both the KNN and tree classifiers, accuracy was questionably low (below 40% in some cases). Our models may not have worked with the right variables.
Nonetheless, we came away with some interesting findings. Number of friends can predict smoking; adrenaline-seeking activities and weight cannot. In the end, this ruling out of variables, of finding variables that couldn't make the cut, was an important part of our data analysis.
responses$Smoking <- as.factor(responses$Smoking)
levels(responses$Smoking)[levels(responses$Smoking)== "never smoked"] <- "Would never smoke"
levels(responses$Smoking)[levels(responses$Smoking)== "tried smoking"] <- "Tried Smoking"
levels(responses$Smoking)[levels(responses$Smoking)== "former smoker"] <- "Former Smoker"
levels(responses$Smoking)[levels(responses$Smoking)== "current smoker"] <- "Current Smoker"
# Caution: sorting the Smoking column reorders it independently of the other
# columns, so it no longer lines up row-for-row with the rest of responses
responses$Smoking <- sort(responses$Smoking)
Smoking1 <- responses$Smoking[9:1010]
Smoking_vs_friends <- ddply(responses, .(Smoking), summarise,
mean_friends = mean(Number.of.friends, na.rm = TRUE))
gf_point(mean_friends ~ Smoking, data = Smoking_vs_friends, color = ~ Smoking, size = ~ mean_friends) %>%
gf_labs(title = "", caption = "")
Here we observe a negative association between smoking and number of friends: less smoking correlates with more friends. Current smokers are particular outliers, dipping far below former smokers and never-smokers.
# These level labels appear to be copied from the Alcohol column; Weight is a
# numeric survey answer, so it likely needs as.numeric() rather than recoding
levels(responses$Weight)[levels(responses$Weight)=="drink a lot"] <- 3
levels(responses$Weight)[levels(responses$Weight)=="social drinker"] <- 2
levels(responses$Weight)[levels(responses$Weight)=="never"] <- 1
# as.character first: as.numeric on a factor returns level codes, not values
Smoking_vs_weight <- ddply(responses, .(Smoking), summarise,
mean_weight = mean(as.numeric(as.character(Weight)), na.rm = TRUE))
gf_jitter(mean_weight ~ Smoking, data = Smoking_vs_weight, color = ~ mean_weight, size = ~ mean_weight) %>%
gf_labs(title = "", caption = "")
Although there appears to be little correlation between smoking and weight, there are some interesting trends. Those who currently smoke weigh more than those who have never smoked, and former smokers weigh the least.
KNN
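Before running `class::knn`, it may help to see what the classifier does: for each test row it finds the k training rows closest in Euclidean distance and takes a majority vote of their labels. A minimal base-R sketch (the helper name `knn_predict` and the toy data are illustrative, not from the survey):

```r
# Minimal k-nearest-neighbours sketch: for each test row, vote among the
# k training rows closest in Euclidean distance
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))         # distance to every training row
    votes <- train_y[order(d)[1:k]]                  # labels of the k nearest
    names(sort(table(votes), decreasing = TRUE))[1]  # majority vote
  })
}

# Toy data: two well-separated groups
train_x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
train_y <- c("never", "never", "smoker", "smoker")
knn_predict(train_x, train_y, rbind(c(1, 1), c(9, 9)), k = 3)
# "never" "smoker"
```

This is only the idea; `class::knn` does the same job in compiled code and handles ties and large data far better.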
responses <- na.omit(responses)
feature_1 <- responses$Number.of.friends
feature_2 <- responses$Weight
knn_train <- data.frame(feature_1, feature_2)
knn_test <- knn_train  # classifying the training rows themselves, so accuracy will be optimistic
knn_label <- responses$Smoking
knn_class6 <- knn(knn_train, knn_test, knn_label, k = 6)
knn_class12 <- knn(knn_train, knn_test, knn_label, k = 12)
knn_class24 <- knn(knn_train, knn_test, knn_label, k = 24)
summary(knn_class6)
## Current Smoker Former Smoker Would never smoke
## 0 70 54 108
## Tried Smoking
## 454
summary(knn_class12)
## Current Smoker Former Smoker Would never smoke
## 0 56 50 92
## Tried Smoking
## 488
summary(knn_class24)
## Current Smoker Former Smoker Would never smoke
## 0 21 34 15
## Tried Smoking
## 616
Our KNN classifiers have varying success. Since our dataset contains 137 current smokers, 123 former smokers, 142 who have never smoked, and 277 who have tried smoking, all of our models drastically overpredicted those who tried smoking. This error may stem from the heterogeneity of the tried-smoking category: its members may resemble every other group across the predictor variables.
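One way to see the overprediction directly is a confusion matrix: `table()` crosses predicted against actual labels, so each cell shows where a predicted class's cases really came from. A sketch with small illustrative vectors (in practice `knn_class6` and `responses$Smoking` would be substituted):

```r
# Confusion matrix: rows = predicted class, columns = actual class.
# Illustrative labels standing in for knn_class6 and responses$Smoking.
smoking_levels <- c("Current Smoker", "Former Smoker",
                    "Tried Smoking", "Would never smoke")
predicted <- factor(c("Tried Smoking", "Tried Smoking", "Tried Smoking",
                      "Former Smoker", "Would never smoke"),
                    levels = smoking_levels)
actual    <- factor(c("Current Smoker", "Tried Smoking", "Would never smoke",
                      "Former Smoker", "Would never smoke"),
                    levels = smoking_levels)
table(Predicted = predicted, Actual = actual)
# Large off-diagonal counts in the "Tried Smoking" row signal overprediction
```

The diagonal counts the correct classifications, so `sum(diag(cm)) / sum(cm)` reproduces the accuracy figure.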
tree1 <- rpart(responses$Smoking ~ feature_1 + feature_2, data = responses, method = 'class')
rpart.plot(tree1)
Our tree is a single line with no splits, which means the tree builder could not find a worthwhile split point. I suspect this means there is only a weak relationship between number of friends, weight, and smoking.
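Why would rpart refuse to split? It keeps a split only when the relative improvement in fit exceeds the complexity parameter `cp` (0.01 by default). The mechanics can be sketched in base R using the Gini impurity that rpart uses for classification (the function names and toy labels below are illustrative, not the survey data):

```r
# Gini impurity of a label vector (what rpart minimises for method = 'class')
gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Impurity reduction achieved by splitting y on a logical condition
split_gain <- function(y, left) {
  p <- mean(left)
  gini(y) - (p * gini(y[left]) + (1 - p) * gini(y[!left]))
}

y      <- rep(c("Tried Smoking", "Current Smoker"), times = c(40, 10))
x_good <- c(rep(0, 40), rep(1, 10))   # feature that tracks the label
x_bad  <- rep(c(0, 1), 25)            # feature unrelated to the label

split_gain(y, x_good < 0.5)   # approx. 0.32: rpart would split here
split_gain(y, x_bad < 0.5)    # 0: rpart leaves the node as a stump
```

When every candidate split looks like `x_bad`, the gain never clears the `cp` threshold and the tree stays a single leaf, exactly the behaviour we observed.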
accuracy <- function(predictions, ground_truth) {
  mean(predictions == ground_truth)
}
# Bug: tree1 is an rpart model object, not a vector of predictions, so the
# comparison recycles mismatched lengths and returns 0. The intended call is
# accuracy(predict(tree1, responses, type = 'class'), responses$Smoking).
accuracy(tree1, responses$Smoking)
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(predictions, ground_truth): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(predictions, ground_truth): longer object length is
## not a multiple of shorter object length
## [1] 0
accuracy(knn_class24, responses$Smoking)
## [1] 0.4183673
shuffled <- sample_n(responses, nrow(responses))
split <- floor(0.8 * nrow(shuffled))  # floor() keeps the split index a whole number
train <- shuffled[1 : split, ]
test <- shuffled[(split + 1) : nrow(shuffled), ]
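Note that `sample_n` shuffles differently on every knit; seeding the random number generator makes the split reproducible. A base-R sketch of the same 80/20 split (the toy data frame is illustrative):

```r
set.seed(42)  # reproducible shuffle across knits
df <- data.frame(id = 1:10, y = rep(c("a", "b"), 5))

shuffled <- df[sample(nrow(df)), ]        # permute the rows
split <- floor(0.8 * nrow(shuffled))      # integer cut point
train <- shuffled[1:split, ]
test  <- shuffled[(split + 1):nrow(shuffled), ]
c(nrow(train), nrow(test))
# 8 2
```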
train$Number.of.friends <- as.numeric(train$Number.of.friends)
train$Weight <- as.numeric(train$Weight)
test$Number.of.friends <- as.numeric(test$Number.of.friends)
test$Weight <- as.numeric(test$Weight)
feature_1 <- train$Number.of.friends
feature_2 <- train$Weight
knn_train <- data.frame(feature_1, feature_2)
# Classify the held-out rows, not the training rows, so the accuracy estimate is honest
knn_test <- data.frame(feature_1 = test$Number.of.friends, feature_2 = test$Weight)
knn_label <- train$Smoking
knn_class1 <- knn(knn_train, knn_test, knn_label, k = 1)
accuracy(knn_class1, test$Smoking)
## [1] 0.310219
Our KNN predictor with k = 24 classifies its own training data with 41.8% accuracy, while the held-out k = 1 predictor reaches only 31.0%. Those are fairly low numbers that suggest a weak relationship between the variables.
tree1 <- rpart(Smoking ~ feature_1 + feature_2, data = train, method = 'class',
               control = rpart.control(maxdepth = 2))
# NA Omit is not eliminating blank spaces
Here the KNN predictors overpredict those who have tried smoking to an even greater degree, reinforcing the narrative that there is little connection between adrenaline sports, musical instruments, and smoking.
# Decision tree
responses$Musical.instruments <- as.factor(responses$Musical.instruments)
responses$Adrenaline.sports <- as.factor(responses$Adrenaline.sports)
feature_1 <- responses$Adrenaline.sports
feature_2 <- responses$Musical.instruments
tree_classifier <- rpart(Smoking ~ feature_1 + feature_2, data = responses, method = 'class')
rpart.plot(tree_classifier)
# Decision tree added variable
# Calling rpart.control() on its own just returns (and prints) the settings; to
# take effect it must be passed to rpart() via the control argument
rpart.control(maxdepth = 5)
## $minsplit
## [1] 20
##
## $minbucket
## [1] 7
##
## $cp
## [1] 0.01
##
## $maxcompete
## [1] 4
##
## $maxsurrogate
## [1] 5
##
## $usesurrogate
## [1] 2
##
## $surrogatestyle
## [1] 0
##
## $maxdepth
## [1] 5
##
## $xval
## [1] 10
feature_1 <- responses$Adrenaline.sports
feature_2 <- responses$Musical.instruments
feature_3 <- responses$Number.of.friends
# Note: this model uses feature_1 (adrenaline sports) and feature_3 (friends); feature_2 is unused
new_plot <- rpart(Smoking ~ feature_1 + feature_3, data = responses, method = 'class')
rpart.plot(new_plot)
Here we add another variable and suddenly our tree changes. Number of friends by itself predicts "Tried Smoking" well, and having few friends (especially a response below 1.5 on the survey scale) leans toward current smoking.
# Decision tree added variable
feature_1 <- responses$Adrenaline.sports
feature_2 <- responses$Musical.instruments
feature_3 <- responses$Number.of.friends
feature_4 <- responses$Weight
tree1 <- rpart(Smoking ~ feature_1 + feature_2 + feature_3 + feature_4, data = responses, method = 'class')
rpart.plot(tree1)
By adding a fourth variable, our decision tree again has no splits. Even with this many variables, no significant distinctions are found.
tree_predictions <- predict(tree1, responses, type = 'class')
# knn_class2 is never defined above; this likely refers to knn_class24
knn_predictions <- knn_class24
The accuracy of both our tree and KNN predictions is quite low (around 40%). Smoking is more difficult to predict than we initially thought.
# Decision tree train
train <- na.omit(train)
feature_1 <- train$Adrenaline.sports
feature_2 <- train$Musical.instruments
feature_3 <- train$Number.of.friends
feature_4 <- train$Weight
tree3 <- rpart(Smoking ~ feature_1 + feature_2 + feature_3 + feature_4, data = train, method = 'class')
rpart.plot(tree3)
# na.omit() did not remove the blank responses: empty strings ("") are not NA
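The likely culprit: `na.omit()` drops rows containing `NA`, but empty survey answers often arrive as the empty string `""`, which is an ordinary value, not missing. Converting blanks to `NA` first makes `na.omit()` behave as expected. A base-R sketch (the data frame is illustrative):

```r
# Blank strings survive na.omit() because they are values, not NA
df <- data.frame(Smoking = c("Tried Smoking", "", "Current Smoker"),
                 Weight  = c("60", "70", ""),
                 stringsAsFactors = FALSE)

df[df == ""] <- NA   # turn blanks into real missing values
df <- na.omit(df)    # now the partially blank rows are dropped
nrow(df)
# 1
```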
Our final tree looks much like our first one: no splits and only one branch. Predicting smoking from adrenaline sports, musical instruments, friends, and weight has been much more difficult than we expected.