In this project, we examined the correlates of smoking in the Slovak Youth Survey. We wanted to know which variables correlated with smoking and, with that knowledge in hand, to build a decision tree that would predict smoking well.

Data is messy, and we ran into trouble. The correlations between some variables were much weaker than expected. As a result, many of our decision trees span only a single branch: the tree classifier did not deem our variables significant enough to separate the data. The same fact may explain our low accuracy. For both the KNN and tree classifiers, accuracy was questionably low, below 40% in some cases. Our models may not have worked with the right variables.

Nonetheless, we came away with some interesting findings. Number of friends can predict smoking; adrenaline-seeking activities and weight cannot. In the end, this ruling out of variables, finding the ones that could not make the cut, was an important part of our data analysis.

# Load the packages used throughout (assumed: plyr, dplyr, ggformula, class, rpart, rpart.plot)
library(plyr); library(dplyr); library(ggformula)
library(class); library(rpart); library(rpart.plot)

# Recode the smoking levels to more readable labels
responses$Smoking <- as.factor(responses$Smoking)
levels(responses$Smoking)[levels(responses$Smoking) == "never smoked"] <- "Would never smoke"
levels(responses$Smoking)[levels(responses$Smoking) == "tried smoking"] <- "Tried Smoking"
levels(responses$Smoking)[levels(responses$Smoking) == "former smoker"] <- "Former Smoker"
levels(responses$Smoking)[levels(responses$Smoking) == "current smoker"] <- "Current Smoker"
# Note: sorting responses$Smoking in place would misalign it with every other
# column, so that step (and the unused Smoking1 slice) is omitted.

# Mean number of friends within each smoking category
Smoking_vs_friends <- ddply(responses, .(Smoking), summarise,
  mean_friends = mean(Number.of.friends, na.rm = TRUE))
gf_point(mean_friends ~ Smoking, data = Smoking_vs_friends, color = ~ Smoking, size = ~ mean_friends) %>%
  gf_labs(title = "Mean number of friends by smoking status")

Here we observe a negative association between smoking and number of friends: less smoking goes with more friends. Current smokers are notable outliers, dipping far below former smokers and those who have never smoked.
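
To put a number on this, we can compute a rank correlation between an ordinal smoking code and number of friends. This is a quick sketch; the ordering of the four levels (in particular where Former Smoker belongs) is our assumption:

# Code smoking as ordinal (assumed ordering) and test a Spearman correlation
smoking_rank <- as.numeric(factor(responses$Smoking,
  levels = c("Would never smoke", "Tried Smoking", "Former Smoker", "Current Smoker")))
cor.test(smoking_rank, responses$Number.of.friends, method = "spearman", exact = FALSE)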

# NB: these labels look like the Alcohol column's; Weight is numeric (kg), so coerce it back below
levels(responses$Weight)[levels(responses$Weight) == "drink a lot"] <- 3
levels(responses$Weight)[levels(responses$Weight) == "social drinker"] <- 2
levels(responses$Weight)[levels(responses$Weight) == "never"] <- 1
responses$Weight <- as.numeric(as.character(responses$Weight))

# Mean weight within each smoking category
Smoking_vs_weight <- ddply(responses, .(Smoking), summarise,
  mean_weight = mean(Weight, na.rm = TRUE))
gf_jitter(mean_weight ~ Smoking, data = Smoking_vs_weight, color = ~ mean_weight, size = ~ mean_weight) %>%
  gf_labs(title = "Mean weight by smoking status")

Although there appears to be little correlation between smoking and weight, there are some interesting trends. Those who currently smoke weigh more than those who have never smoked, while former smokers weigh the least.
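
A formal check that these group differences are more than noise is a rank-based test on the raw, unaggregated data; a minimal sketch:

# Kruskal-Wallis test of Weight across the four smoking categories
kruskal.test(Weight ~ Smoking, data = responses)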

KNN

responses <- na.omit(responses)
feature_1 <- responses$Number.of.friends
feature_2 <- responses$Weight
knn_train <- data.frame(feature_1, feature_2)
knn_test <- knn_train   # no held-out set yet: we predict back on the training rows
knn_label <- responses$Smoking
knn_class6 <- knn(knn_train, knn_test, knn_label, k = 6)
knn_class12 <- knn(knn_train, knn_test, knn_label, k = 12)
knn_class24 <- knn(knn_train, knn_test, knn_label, k = 24)
summary(knn_class6)
##                      Current Smoker     Former Smoker Would never smoke 
##                 0                70                54               108 
##     Tried Smoking 
##               454
summary(knn_class12)
##                      Current Smoker     Former Smoker Would never smoke 
##                 0                56                50                92 
##     Tried Smoking 
##               488
summary(knn_class24)
##                      Current Smoker     Former Smoker Would never smoke 
##                 0                21                34                15 
##     Tried Smoking 
##               616

Our KNN classifiers have varying successes. Since our dataset contains a total of 137 current smokers, 123 former smokers, 142 who have never smoked, and 277 who have tried smoking, all of our models drastically overpredicted those who tried smoking. This error may have to do with the variety within the tried-smoking category: there may be strong ties across variables.
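
Cross-tabulating predictions against the true labels makes this overprediction explicit. And because KNN is distance-based, unscaled features let Weight (in kilograms) swamp the friends scale, so standardizing both is worth trying; both steps below are sketches using the objects defined above:

# Rows are the true labels, columns the k = 24 predictions
table(actual = knn_label, predicted = knn_class24)
# Refit with standardized features (knn_test equals knn_train here, so scaling
# both the same way is safe)
knn_scaled24 <- knn(scale(knn_train), scale(knn_test), knn_label, k = 24)
summary(knn_scaled24)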

# Fit on the data-frame columns directly so the plot shows the real variable names
tree1 <- rpart(Smoking ~ Number.of.friends + Weight, data = responses, method = 'class')
rpart.plot(tree1)

Our tree is a single root node with no splits, which means rpart could not find a worthwhile split; I suspect this reflects only weak correlations between number of friends, weight, and smoking.
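
One way to check whether candidate splits are simply failing rpart's default complexity threshold (cp = 0.01) is to relax it; a sketch for diagnosis, not a recommendation for the final model:

# With a looser complexity penalty, weak splits are allowed through
tree_loose <- rpart(Smoking ~ Number.of.friends + Weight, data = responses,
                    method = 'class', control = rpart.control(cp = 0.001))
rpart.plot(tree_loose)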

accuracy <- function(predictions, ground_truth)
  mean(predictions == ground_truth)
# Bug: tree1 is the fitted model, not a prediction vector, hence the recycling
# warnings and the 0 below; the right call passes predict(tree1, type = 'class').
accuracy(tree1, responses$Smoking)
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(predictions, ground_truth): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(predictions, ground_truth): longer object length is
## not a multiple of shorter object length
## [1] 0
accuracy(knn_class24, responses$Smoking)
## [1] 0.4183673
shuffled <- sample_n(responses, nrow(responses))   # shuffle the rows
split <- floor(0.8 * nrow(shuffled))               # 80/20 train/test split
train <- shuffled[1 : split, ]
test <- shuffled[(split + 1) : nrow(shuffled), ]
train$Number.of.friends <- as.numeric(train$Number.of.friends)
train$Weight <- as.numeric(train$Weight)
feature_1 <- train$Number.of.friends
feature_2 <- train$Weight
knn_train <- data.frame(feature_1, feature_2)
# Bug: the test features should come from the held-out rows; as written, we
# predict on the training rows and compare those to the test labels.
knn_test <- knn_train
knn_label <- train$Smoking
knn_class1 <- knn(knn_train, knn_test, knn_label, k = 1)
accuracy(knn_class1, test$Smoking)
## [1] 0.310219
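
For reference, a corrected held-out evaluation would build the test features from the held-out rows and compare predictions to the matching labels; a minimal sketch using the objects above:

knn_test2 <- data.frame(feature_1 = test$Number.of.friends,
                        feature_2 = test$Weight)
knn_heldout <- knn(knn_train, knn_test2, knn_label, k = 1)
accuracy(knn_heldout, test$Smoking)   # lengths now match row for row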

Our KNN classifier with k = 24, scored on the same data it was fit to, has an accuracy of 41.8%, while the k = 1 classifier scores 31.0% against the held-out labels; the tree's accuracy call returned 0 because of the bug noted above. Those are fairly low numbers that suggest weak associations between the variables.
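
For context, a baseline that always predicts the most common class ("Tried Smoking") would already score roughly 277 / 679, about 41%, on the counts quoted earlier, so neither classifier clearly beats it:

# Majority-class baseline accuracy
max(table(responses$Smoking)) / nrow(responses)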

# Passing maxdepth through the control argument (the fitted object is not used below)
tree1 <- rpart(Smoking ~ Number.of.friends + Weight, data = train, method = 'class',
               control = rpart.control(maxdepth = 2))

# na.omit() removes NA rows but not blank answers, which are empty strings ("")
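
Recoding the blanks to NA before filtering would fix this; a sketch:

responses[responses == ""] <- NA   # blank answers become missing values
responses <- droplevels(na.omit(responses))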

Ben's Code

Here the KNN predictors overpredict those who have tried smoking to an even greater degree, reinforcing the narrative that there is little connection between adrenaline sports, musical instruments, and smoking.

# Decision tree
responses$Musical.instruments <- as.factor(responses$Musical.instruments)
responses$Adrenaline.sports <- as.factor(responses$Adrenaline.sports)
tree_classifier <- rpart(Smoking ~ Adrenaline.sports + Musical.instruments,
                         data = responses, method = 'class')

rpart.plot(tree_classifier)

# Decision tree with an added variable
# Note: a bare rpart.control(maxdepth = 5) call only prints the default control
# list; it has no effect unless passed to rpart() through the control argument.
new_plot <- rpart(Smoking ~ Adrenaline.sports + Number.of.friends, data = responses,
                  method = 'class', control = rpart.control(maxdepth = 5))
rpart.plot(new_plot)

Here we add another variable and suddenly our tree changes. Number of friends by itself predicts having tried smoking well, and a lack of friends (especially answers below 1.5 on the survey scale) points towards current smoking.
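
The threshold the tree picked can be sanity-checked against the raw answers; in the sketch below, rows are the survey's friends answer and entries are within-row proportions:

round(prop.table(table(responses$Number.of.friends, responses$Smoking), margin = 1), 2)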

# Decision tree with a fourth variable
tree1 <- rpart(Smoking ~ Adrenaline.sports + Musical.instruments + Number.of.friends + Weight,
               data = responses, method = 'class', control = rpart.control(maxdepth = 5))
rpart.plot(tree1)

By adding a fourth variable, our decision tree again has no splits: among so many variables, rpart finds no distinction significant enough to act on.
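
rpart's complexity table shows how far the tree is from splitting; when no candidate split improves the fit by at least the default cp = 0.01, only the root row appears:

printcp(tree1)   # a lone root row means no split cleared the cp threshold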

tree_predictions <- predict(tree1, responses, type = 'class')
# knn_class2 is not defined above; it presumably comes from a KNN fit omitted here
knn_predictions <- knn_class2

The accuracy of both the tree and the KNN predictions is quite low, around 40%; smoking is more difficult to predict than we initially thought.
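
Those figures come from the accuracy helper defined earlier, applied to the two prediction vectors (assuming knn_predictions points at a real KNN fit):

accuracy(tree_predictions, responses$Smoking)
accuracy(knn_predictions, responses$Smoking)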

# Decision tree on the training split (data must be train here, not responses)
train <- na.omit(train)
tree3 <- rpart(Smoking ~ Adrenaline.sports + Musical.instruments + Number.of.friends + Weight,
               data = train, method = 'class', control = rpart.control(maxdepth = 5))
rpart.plot(tree3)

# As above, na.omit() leaves blank ("") answers in place; they must be recoded to NA first

Our final tree looks much like our first one: no splits and a single root. Predicting smoking from adrenaline sports, musical instruments, number of friends, and weight has been much more difficult than we expected.