\documentclass{article}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{fancyhdr}
\usepackage{color}
\usepackage{hyperref}
\usepackage{mathtools}
\usepackage{bbm}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{enumerate}
\usepackage[top=2cm, bottom=4.5cm, left=2.5cm, right=2.5cm]{geometry}
\usepackage{graphicx} % required for \includegraphics below
\usepackage{afterpage}
\newcommand\blankpage{%
\null
\thispagestyle{empty}%
\addtocounter{page}{-1}%
\newpage}
\DeclareMathOperator*{\argmin}{argmin}
% Edit these as appropriate
\newcommand\course{CS 1420}
\newcommand\semester{Spring 2022} % <-- current semester
%\newcommand\hwnumber{3} % <-- homework number
\newcommand\duedate{N/A}
\begin{document}
\begin{center}
\huge\textbf{Final Exam}
\end{center}
\begin{center}
\huge\textbf{CSCI 1420---Spring 2022}
\end{center}
\subsection*{Instructions}
\vspace{.1in}
\noindent {\bf Timeline:} The final will be posted on the course homepage no later than noon Eastern U.S. time on Thursday, May 12. It must be submitted on Gradescope by 11:59 PM Eastern U.S. time on Friday, May 13. No late days may be used on the final. One percentage point will be deducted from the grade for every minute the submission is late.
\vspace{.25in}
\noindent {\bf Exam Format:} There are eight problems. Seven count toward your grade, each worth 1/7 of the exam; you may choose one problem to count as extra credit instead, worth 1/2 of a regular problem.
To indicate your choice, mark the blank line where indicated for that problem.
{\bf If you do not make a choice, or your choice is otherwise unclear, the last problem will be treated as extra credit. We will not adjust a student's selection to optimize their score.}
\vspace{.25in}
\noindent {\bf Submission Format:} You may submit a PDF produced from the provided \LaTeX{} source, or you may submit handwritten answers. You are encouraged to use \LaTeX{} if possible, as any illegible parts of answers will be marked incorrect.
\vspace{.25in}
\noindent {\bf Academic Integrity:} \emph{The course collaboration policy does not apply to this exam.}
Instead, you may not communicate with anyone other than course staff about the exam in any way. You may consult the course textbook, course notes, slides, homeworks, recorded lectures, or existing messages on Ed. You may not post anything new publicly on Ed during the exam period. (See below.) Violating these instructions will be considered academic dishonesty.
\vspace{.25in}
\noindent {\bf Getting Help:} If you have any questions about the content of the exam or technical difficulties, please first consult the ``Official FAQ'' that will be pinned on Ed. If your question is not answered there, please email the HTA list: \texttt{cs1420headtas@lists.brown.edu}. We will respond as soon as possible, but please keep in mind that responses may take up to 20 minutes. In addition, the mailing list is only monitored from 9 AM to 10 PM Eastern U.S. time, so please plan accordingly. \\
\\
\noindent If you have any other issues or concerns, such as challenging or unexpected circumstances, please contact Steve directly: \texttt{stephen\_bach@brown.edu}.
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 1 (The Bias-Complexity Tradeoff)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
\noindent Consider a binary ($\mathcal{Y} = \{-1, 1\}$) classification task with inputs in $\mathcal{X} = \{0, 1\}^d$, where $d$ is large ($> 1000$).
Rank the following four hypothesis classes in this setting according to the bias-complexity tradeoff, starting with the most bias (i.e., the least complexity).
Indicate your ranking with a number beside each hypothesis class (1 = most bias, 4 = least bias).
\\ \\
After each choice, briefly (1--2 sentences) justify your answer.
You may cite without proof any facts contained in lectures, homeworks, or the textbook.
\\ \\
Reminder: the difference between homogeneous and non-homogeneous halfspaces is that non-homogeneous halfspaces have decision boundaries that are \emph{not} constrained to pass through the origin. In other words, they contain ``bias'' parameters.
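The distinction in the reminder can be sketched in a few lines of Python (an illustrative helper, not part of any problem; the tie at a value of exactly zero is broken toward $+1$ by convention here):

```python
import numpy as np

def halfspace_predict(w, x, b=0.0):
    """Non-homogeneous halfspace: sign(<w, x> + b).
    The homogeneous case is b = 0, i.e., a decision boundary
    constrained to pass through the origin."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```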
\vspace{0.75in} \\
\noindent \_\_\_\_\_ Decision trees
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
\vspace{0.75in}
\\ \noindent \_\_\_\_\_ Non-homogeneous halfspaces
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
\vspace{0.75in}
\\ \noindent \_\_\_\_\_ Homogeneous halfspaces
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
\vspace{0.75in}
\\ \noindent \_\_\_\_\_ Boosted, non-homogeneous halfspaces with an ensemble size of 2
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 2 (Model Selection and PCA)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
\begin{center}
\includegraphics{2022/final/fig/Question 2.pdf}
\end{center}
\noindent Suppose that we want to learn a halfspace classifier from $\mathcal{X} = \mathbb{R}^d$ to $\mathcal{Y} = \{-1, 1\}$.
Instead of learning the halfspace over $\mathbb{R}^d$, we first apply principal component analysis (PCA) and learn the halfspace on the lower-dimensional representation of the features in $\mathbb{R}^n$.
Suppose that $d$ is large (say, $1000$) and that we experiment with two different values of $n$: $5$ and $500$.
In other words, in one version we approximately represent each example with the top $5$ principal components, and in the other we use the top $500$ principal components.
For each version, we make a training curve plotting two errors against the number of training examples, $m$, as pictured above.
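For concreteness, the PCA step described above can be sketched as follows (an illustrative helper via SVD, not part of the problem; `n` plays the role of the number of retained components):

```python
import numpy as np

def pca_project(X, n):
    """Project the rows of X (m examples, d features) onto the
    top-n principal components, computed via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n].T                          # m x n representation
```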
\\ \\
Which curve (A or B) most likely came from the model with $n=5$ and which from the model with $n=500$?
Justify your answer in terms of the bias-complexity tradeoff, approximation error, and estimation error.
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 3 (Support Vector Machines)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
We saw in the question portion of Lecture 16 that all training examples \emph{other than} the support vectors can be removed from the calculation of predictions at test time because they correspond to $\alpha_i=0$ in the prediction rule
\[
h({\bf x}) = \text{sign}\left[ \sum_{i = 1}^m \alpha_i K({\bf x}_i, {\bf x}) \right] \; .
\]
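In code, this prediction rule reads as follows (an illustrative sketch, not part of the problem; the tie at a sum of exactly zero is broken toward $+1$ by convention):

```python
import numpy as np

def kernel_predict(x, train_X, alphas, kernel):
    """h(x) = sign(sum_i alpha_i * K(x_i, x)).
    Examples with alpha_i == 0 contribute nothing to the sum,
    so only the support vectors matter at test time."""
    s = sum(a * kernel(xi, x) for a, xi in zip(alphas, train_X))
    return 1 if s >= 0 else -1
```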
\\ \\
Suppose that somehow we knew what the support vectors were before \emph{training} as well. Could we remove the \emph{non-support vectors} from the training data and still obtain the same classifier?
\\ \\
Justify your answer in terms of the hinge-loss soft SVM objective without kernels
\[
\argmin_{{\bf w} \in \mathbb{R}^d}~~\lambda \|{\bf w}\|^2 + \frac{1}{m} \sum_{i=1}^m \max\left(0, 1 - y_i \langle {\bf w}, {\bf x}_i \rangle \right) \; .
\]
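This objective can be evaluated directly for any candidate ${\bf w}$ (a small illustrative helper, not part of the problem):

```python
import numpy as np

def soft_svm_objective(w, X, y, lam):
    """lam * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i * <w, x_i>)."""
    margins = y * (X @ w)                    # y_i <w, x_i> for each example
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge loss
    return lam * float(np.dot(w, w)) + float(hinge.mean())
```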
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 4 (PAC Learning)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
\noindent Recall the following sample complexity bounds for learning finite hypothesis classes with the 0--1 loss.
In the realizable setting we proved
\[
m_\mathcal{H}(\epsilon, \delta) \leq \left\lceil \frac{\log ( | \mathcal{H} | / \delta)}{\epsilon} \right\rceil
\]
and in the agnostic setting we proved
\[
m_\mathcal{H}(\epsilon, \delta) \leq \left\lceil \frac{ 2 \log ( 2 | \mathcal{H} | / \delta)}{\epsilon^2} \right\rceil \; .
\]
\\ \\
\noindent a. Suppose that for a given hypothesis class, you have sufficient training data to guarantee that the empirical risk minimizer has expected risk less than $0.05$ with probability at least $95\%$ in the realizable setting.
Suppose you want to strengthen this guarantee, but you must pay $\$0.10$ for every additional training example needed to meet it.
Which would be less expensive: guaranteeing (1) expected risk less than $0.025$ with probability at least $95\%$, or (2) expected risk less than $0.05$ with probability at least $97.5\%$? Justify your answer in 1--2 sentences in terms of the bound above.
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
\vspace{2.5in}
\noindent b. What do these two bounds imply about the relative difficulty of learning in the realizable versus the agnostic setting? For example, if you wanted to select a hypothesis from a class $\mathcal{H}$ such that the empirical risk minimizer has expected risk within $0.05$ of that of the best hypothesis in the class with probability at least $95\%$, but you had to pay $\$0.10$ for every training example you used, would you rather be in the realizable or the agnostic setting? Justify your answer in 1--2 sentences.
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 5 (Feature Engineering)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
\noindent Consider the following training data set for learning a classifier on foods represented by features in $\mathcal{X}=\{-1, 1\}^2$ and labeled as either tasty ($y=1$) or not tasty ($y = -1$).
Suppose that we train a halfspace classifier with the perceptron algorithm and the 0--1 loss, and we get an empirical risk of $\frac{1}{4}$.
Suppose that we also get a validation risk of $\frac{1}{4}$ on some held-out data.
\\ \\ \\
\begin{center}
\begin{tabular}{c | c | c || c}
Food & Sweet & Spicy & Tasty \\
\hline
Ice Cream & 1 & -1 & 1 \\
Celery & -1 & -1 & -1 \\
Curry & -1 & 1 & 1 \\
Ice Cream with Hot Sauce & 1 & 1 & -1
\end{tabular}
\end{center}
\vspace{.5in}
\noindent a. Suppose that we added a third feature to the training and validation datasets that was the product of the first two.
In other words, we created a feature $x_3 = x_1 \cdot x_2$ for all examples.
If we retrained the halfspace, would you expect empirical risk to go up or down? Why?
What about validation risk? In other words, would you expect this new model to generalize well? Why or why not?
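Mechanically, the feature augmentation in part (a) looks like this (an illustrative helper, not part of the problem):

```python
import numpy as np

def add_product_feature(X):
    """Append x3 = x1 * x2 as a new column to a feature matrix
    whose first two columns hold x1 and x2."""
    x3 = X[:, 0] * X[:, 1]
    return np.column_stack([X, x3])
```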
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
\vspace{2.5in}
\noindent b. Suppose that we added a third feature to the training and validation datasets that was randomly generated, with $p(x_3 = 1) = 0.5$ and $p(x_3 = -1) = 0.5$.
If we retrained the halfspace, would you expect empirical risk to go up or down? Why?
What about validation risk? In other words, would you expect this new model to generalize well? Why or why not?
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 6 (Clustering)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
\begin{center}
\includegraphics{2022/final/fig/Question 6.pdf}
\end{center}
\noindent a. Suppose we clustered a dataset using two algorithms: K-Means and K-Gaussians, each as defined in class. Each algorithm was run to convergence. Further suppose that for the probabilities output by K-Gaussians, we simply labeled each point with the most probable cluster identifier, so that no probabilistic information was saved and the results could not be distinguished at a glance from those of K-Means. The two clusterings are pictured above. Which clustering (A or B) is from K-Means and which is from K-Gaussians? Justify your answer in terms of either the optimization or the loss function for each algorithm.
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
\vspace{1.75in}
\\ \noindent b. Suppose we used K-Means to create a classifier, as done in homework 12. In other words, the unlabeled training data are first clustered and then each centroid is labeled with the most common label among examples in the training data in that cluster. Predictions are made by assigning the label of the nearest centroid to each test point.
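The prediction step described above can be sketched as follows (an illustrative helper, not part of the problem; the centroid labels are assumed to have been precomputed as described):

```python
import numpy as np

def nearest_centroid_label(x, centroids, centroid_labels):
    """Predict by the label of the nearest centroid, using Euclidean
    distance with all features weighted equally."""
    dists = np.linalg.norm(centroids - x, axis=1)
    return centroid_labels[int(np.argmin(dists))]
```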
\\ \\
Further suppose the original training data contains a sensitive feature $s$.
Can we assume that the K-Means classifier is fair with respect to that feature, given that the classifier does not treat that feature any differently from the others (i.e., all features are weighted equally in the distance calculation used to find the nearest centroid)? Why or why not?
\\ \\
For the purposes of this question, you can assume that fairness is measured using the equalized odds criterion.
\\ \\
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 7 (Empirical Risk Minimization)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
\noindent Consider the hypothesis class of 1-dimensional segmentations into four sections, $\mathcal{H} = \{h_{a,b,c,{\bf s}} : a, b, c \in \mathbb{R} \text{ and } {\bf s} \in \{-1, 1\}^4\}$, where:
\[
h_{a,b,c, {\bf s}}(x) =
\begin{cases}
s_1 & \text{if} \; x \leq a \\
s_2 & \text{if} \; x > a \text{ and } x \leq b \\
s_3 & \text{if} \; x > b \text{ and } x \leq c \\
s_4 & \text{otherwise}
\end{cases}
\]
\noindent and $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \{-1, 1\}$, and $a < b < c$.
\\ \\
Reminder: ${\bf s} \in \{-1, 1\}^4$ means that ${\bf s}$ is a vector of four values, each of which is $-1$ or $1$.
In other words, $s_1$, $s_2$, $s_3$, $s_4$ (along with $a$, $b$, and $c$) are the parameters that define a hypothesis in $\mathcal{H}$.
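Evaluating a single hypothesis from this class is straightforward (an illustrative helper, not part of the problem; it does not address the ERM question itself):

```python
def segment_hypothesis(a, b, c, s, x):
    """h_{a,b,c,s}(x) for thresholds a < b < c and labels
    s = (s1, s2, s3, s4), each in {-1, 1}."""
    if x <= a:
        return s[0]
    elif x <= b:
        return s[1]
    elif x <= c:
        return s[2]
    else:
        return s[3]
```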
\vspace{.25in}
\noindent Describe an algorithm for computing an empirical risk minimizer (ERM) for this class in the realizable case.
(You can assume the 0--1 loss, although the solution will be the same for any reasonable loss function.)
In the context of a training data set of size $m$, your algorithm should run in at most $\mathcal{O}(m \log m)$ time.
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
%\afterpage{\blankpage}
\newpage
\subsection*{Problem 8 (VC Dimension)}
\footnotesize{\_\_\_\_\_ Mark here to count this problem as extra credit.} \\ \\
\noindent Consider (again; see Problem 7) the hypothesis class of 1-dimensional segmentations into four sections, $\mathcal{H} = \{h_{a,b,c,{\bf s}} : a, b, c \in \mathbb{R} \text{ and } {\bf s} \in \{-1, 1\}^4\}$, where:
\[
h_{a,b,c, {\bf s}}(x) =
\begin{cases}
s_1 & \text{if} \; x \leq a \\
s_2 & \text{if} \; x > a \text{ and } x \leq b \\
s_3 & \text{if} \; x > b \text{ and } x \leq c \\
s_4 & \text{otherwise}
\end{cases}
\]
\noindent and $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \{-1, 1\}$, and $a < b < c$.
\\ \\
Reminder: ${\bf s} \in \{-1, 1\}^4$ means that ${\bf s}$ is a vector of four values, each of which is $-1$ or $1$.
In other words, $s_1$, $s_2$, $s_3$, $s_4$ (along with $a$, $b$, and $c$) are the parameters that define a hypothesis in $\mathcal{H}$.
\vspace{.25in}
\noindent What is the VC dimension of this hypothesis class $\mathcal{H}$? Provide a complete proof.
%%%%%%%%
% Replace the below line with your answer
%%%%%%%%
\end{document}