
Machine Learning 2020-2021
Home Assignment 3
Yevgeny Seldin, Christian Igel
Department of Computer Science
University of Copenhagen
The deadline for this assignment is 8 December 2020, 22:00. You must submit
your individual solution electronically via the Absalon home page.
A solution consists of:
• A PDF file with detailed answers to the questions, which may include graphs
and tables if needed. Do not include your source code in the PDF file.
• A .zip file with all your solution source code with comments about the major
steps involved in each question (see below). Source code must be submitted
in the original file format, not as PDF. The programming language of the
course is Python.
• IMPORTANT: Do NOT zip the PDF file, since zipped files cannot be
opened in the speed grader. Zipped PDF submissions will not be graded.
• Your PDF report should be self-sufficient, i.e., it should be possible to
grade it without opening the .zip file. We do not guarantee to open the
.zip file when grading.
• Your code should be structured such that there is one main file (or one main
file per question) that we can run to reproduce all the results presented in
your report. This main file can, if you like, call other files with functions,
classes, etc.
• Handwritten solutions will not be accepted; please use the provided LaTeX
template to write your report.
1 The Role of Independence (5 points)
Design an example of identically distributed, but dependent Bernoulli random variables $X_1, \dots, X_n$ (i.e., $X_i \in \{0, 1\}$), such that
$$\mathbb{P}\left( \left| \mu - \frac{1}{n} \sum_{i=1}^{n} X_i \right| \geq \frac{1}{2} \right) = 1,$$
where $\mu = \mathbb{E}[X_i]$.
Note that in this case $\frac{1}{n} \sum_{i=1}^{n} X_i$ does not converge to $\mu$ as $n$ goes to infinity. The
example shows that independence is crucial for convergence of empirical means
to the expected values.
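As an optional sanity check (not part of the submission), here is a minimal simulation sketch of one possible dependent construction, where all variables are copies of a single fair coin; this is only an illustrative assumption, and the example you design in your solution may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100          # number of Bernoulli variables X_1, ..., X_n
trials = 10_000  # independent repetitions of the whole experiment

# One possible dependent construction (illustrative assumption): draw a single
# fair coin B and set X_1 = ... = X_n = B, so that mu = E[X_i] = 1/2.
B = rng.integers(0, 2, size=trials)      # one Bernoulli(1/2) draw per trial
X = np.repeat(B[:, None], n, axis=1)     # X_1 = ... = X_n = B in each trial
empirical_mean = X.mean(axis=1)          # equals B itself, i.e. 0 or 1
mu = 0.5

# The deviation |empirical mean - mu| equals 1/2 in every trial.
print(np.mean(np.abs(empirical_mean - mu) >= 0.5))  # prints 1.0
```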
2 How to Split a Sample into Training and Test Set (25 points)
In this question you will analyze one possible approach to the question of how to
split a dataset S into training and test sets, Strain and Stest. As we have already
discussed, overly small test sets lead to unreliable loss estimates, whereas overly
large test sets leave too little data for training, thus producing poor prediction
models. The optimal trade-off depends on the data and the prediction model. So
can we let the data speak for itself? We will give it a try.
1. To warm up: assume that you have a fixed split of $S$ into $S_{\mathrm{train}}$ and $S_{\mathrm{test}}$, where the size of $S_{\mathrm{test}}$ is $n_{\mathrm{test}}$. You train a model $\hat{h}^*_{S_{\mathrm{train}}}$ on $S_{\mathrm{train}}$ using whatever procedure you want. Then you compute the test loss $\hat{L}(\hat{h}^*_{S_{\mathrm{train}}}, S_{\mathrm{test}})$. Derive a bound on $L(\hat{h}^*_{S_{\mathrm{train}}})$ in terms of $\hat{L}(\hat{h}^*_{S_{\mathrm{train}}}, S_{\mathrm{test}})$ and $n_{\mathrm{test}}$ that holds with probability at least $1 - \delta$.
2. Now we want to find a good balance between the sizes of $S_{\mathrm{train}}$ and $S_{\mathrm{test}}$. We consider $m$ possible splits $\{(S^1_{\mathrm{train}}, S^1_{\mathrm{test}}), \dots, (S^m_{\mathrm{train}}, S^m_{\mathrm{test}})\}$, where the sizes of the test sets are $n_1, \dots, n_m$, correspondingly. For example, it could be $(10\%, 90\%), (20\%, 80\%), \dots, (90\%, 10\%)$ splits or anything else with a reasonable coverage of the possible options. We train $m$ prediction models $\hat{h}^*_1, \dots, \hat{h}^*_m$, where $\hat{h}^*_i$ is trained on $S^i_{\mathrm{train}}$. We calculate the test loss of the $i$-th model on the $i$-th test set $\hat{L}(\hat{h}^*_i, S^i_{\mathrm{test}})$. Derive a bound on $L(\hat{h}^*_i)$ in terms of $\hat{L}(\hat{h}^*_i, S^i_{\mathrm{test}})$ and $n_i$ that holds for all $\hat{h}^*_i$ simultaneously with probability at least $1 - \delta$.
Comment: No theorem from the lecture notes applies directly to this setting, because they all have a fixed sample size $n$, whereas here the sample sizes vary, $n_1, \dots, n_m$. You have to provide a complete derivation.
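As an optional numerical aside (not a substitute for the derivation): if the bound you obtain has, say, a Hoeffding-style shape $\hat{L}(\hat{h}^*_i, S^i_{\mathrm{test}}) + \sqrt{\ln(m/\delta)/(2 n_i)}$ (this particular form is an assumption used only for illustration), you can quickly see how the width of the confidence term trades off against the amount of data left for training:

```python
import numpy as np

# Hypothetical setup: N points, m candidate splits, confidence level 1 - delta.
# The bound width below (a Hoeffding term with a union bound over the m splits)
# is an illustrative assumption; derive and use your own expression instead.
N = 10_000
delta = 0.05
test_fractions = np.linspace(0.1, 0.9, 9)        # 10%, 20%, ..., 90% test data
n_test = (test_fractions * N).astype(int)
m = len(n_test)

width = np.sqrt(np.log(m / delta) / (2 * n_test))
for frac, n_i, w in zip(test_fractions.tolist(), n_test.tolist(), width.tolist()):
    print(f"test fraction {frac:.0%}: n_i = {n_i}, confidence width ~ {w:.4f}")
```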
3 Occam’s Razor (20 points)
We want to design an application for bilingual users. The application should
detect the language in which the person is typing based on the first few letters
typed. In other words, we want to design a classifier that takes a short string (that
may be less than a full word) as input and predicts one of two languages, say,
Danish or English. For simplicity we will assume that the alphabet is restricted
to a set Σ of 26 letters of the Latin alphabet plus the white space symbol (so
in total jΣj = 27). Let Σd be the space of strings of length d. Let Hd be the
space of functions from Σd to f0; 1g, where Σd is the input string and f0; 1g is
the prediction (Danish or English). Let H = S1 d=0 Hd be the union of Hd-s.
1. Derive a high-probability bound (that is, a bound that holds with probability at least $1 - \delta$) for $L(h)$ that holds for all $h \in H_d$.
2. Derive a high-probability bound for $L(h)$ that holds for all $h \in H$.
3. Explain the trade-off between picking short strings (small d) and long
strings (large d). Which terms in the bound favor small d (i.e., they increase
with d) and which terms in the bound favor large d (i.e., they decrease with
d)?
4. We have presented a lower bound, where we constructed an example of a
problem with a large hypothesis class (of size $2^{2n}$), where the empirical loss
of the empirically best hypothesis was always zero, but the expected loss of
the empirically best hypothesis was at least 1/4. The hypothesis class H
in this question is obviously infinite. Explain why there is no contradiction
between the bound in Point 2 and the lower bound.
Optional, not for submission: You are very welcome to experiment with
the influence of the string length $d$ on the performance. You can find a lot of
texts in different languages at http://www.gutenberg.org/catalog/. Do you
observe the effect of initial improvement followed by overfitting as you increase
d?
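If you do try this, here is a minimal sketch of one possible setup, assuming a simple lookup table over length-$d$ substrings and a plain train/test split; the file names are placeholders for texts you download yourself, and the whole design is only illustrative.

```python
from collections import Counter

def substrings(path, d):
    """Return all length-d substrings of a plain-text file, lower-cased,
    with non-letter characters mapped to spaces (a rough approximation
    of the 27-symbol alphabet)."""
    with open(path, encoding="utf-8") as f:
        text = "".join(c if c.isalpha() else " " for c in f.read().lower())
    return [text[i:i + d] for i in range(len(text) - d + 1)]

def fit(danish_path, english_path, d):
    """Lookup table: for each observed string, predict the language that
    produced it more often (0 = Danish, 1 = English)."""
    da = Counter(substrings(danish_path, d))
    en = Counter(substrings(english_path, d))
    return {s: int(en[s] > da[s]) for s in set(da) | set(en)}

def error(table, path, label, d):
    """Fraction of misclassified substrings of a held-out text.
    Unseen strings default to prediction 0 (an arbitrary choice)."""
    samples = substrings(path, d)
    return sum(table.get(s, 0) != label for s in samples) / len(samples)

# Placeholder file names; use your own downloads from gutenberg.org.
for d in range(1, 8):
    table = fit("danish_train.txt", "english_train.txt", d)
    err_da = error(table, "danish_test.txt", 0, d)
    err_en = error(table, "english_test.txt", 1, d)
    print(f"d = {d}: Danish error {err_da:.3f}, English error {err_en:.3f}")
```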
4 Kernels (50 points)
The first question should improve the understanding of the geometry of the kernel-induced feature space. You can directly use the result to implement a kernel nearest-neighbor algorithm.
The second question should make you more familiar with the basic definition of
the important concept of positive definiteness.
The third question is important for understanding the real dimensionality of learning problems using a linear kernel, which is one reason why linear kernels are often treated differently in efficient implementations.
The fourth question derives the general Cauchy-Schwarz inequality, which is of general importance. In particular, it is needed to close a gap in the lecture. When we defined the scalar product $\langle \cdot, \cdot \rangle : F \times F \to \mathbb{R}$ in the RKHS $F$, we showed that $\forall x \in F : \langle x, x \rangle \geq 0$. But we also have to show that $\langle x, x \rangle = 0$ implies $x = 0$, and this requires the general Cauchy-Schwarz inequality.
4.1 Distance in feature space
Given a kernel $k$ on input space $X$ defining RKHS $H$. Let $\Phi : X \to H$ denote the corresponding feature map (think of $\Phi(x) = k(x, \cdot)$). Let $x, z \in X$. Show that the distance of $\Phi(x)$ and $\Phi(z)$ in $H$ is given by
$$\|\Phi(x) - \Phi(z)\| = \sqrt{k(x, x) - 2k(x, z) + k(z, z)}$$
(if distance is measured by the canonical metric induced by $k$).
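Once this identity is established, a kernel nearest-neighbor classifier needs only kernel evaluations. A minimal sketch, assuming a Gaussian (RBF) kernel and a 1-nearest-neighbor rule purely as illustrative choices:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) (example choice)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def feature_space_distance(x, z, k):
    """||Phi(x) - Phi(z)|| computed from kernel evaluations only."""
    return np.sqrt(k(x, x) - 2 * k(x, z) + k(z, z))

def kernel_1nn_predict(x, X_train, y_train, k):
    """Predict the label of the training point closest to x in feature space."""
    dists = [feature_space_distance(x, xi, k) for xi in X_train]
    return y_train[int(np.argmin(dists))]

# Tiny usage example with made-up data.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(kernel_1nn_predict(np.array([0.2, 0.1]), X_train, y_train, rbf_kernel))
```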
4.2 Sum of kernels
Let $k_1, k_2 : X \times X \to \mathbb{R}$ be positive-definite kernels.
Prove that $k(x, z) = k_1(x, z) + k_2(x, z)$ is also positive-definite.
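This is a proof exercise, but a quick numerical sanity check can be reassuring. The sketch below, which assumes a linear and an RBF kernel merely as example kernels, checks that the smallest eigenvalue of the summed Gram matrix on random data is non-negative (up to numerical precision):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 random points in R^3

# Gram matrices of two example positive-definite kernels (illustrative choices).
K1 = X @ X.T                                       # linear kernel
sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K2 = np.exp(-sq_dists)                             # RBF kernel

K = K1 + K2
print("smallest eigenvalue of K1 + K2:", np.linalg.eigvalsh(K).min())
```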
4.3 Rank of Gram matrix
Let the input space be $X = \mathbb{R}^d$. Assume a linear kernel, $k(x, z) = x^T z$ for $x, z \in \mathbb{R}^d$ (i.e., the feature map $\Phi$ is the identity) and $m$ input patterns $x_1, \dots, x_m \in \mathbb{R}^d$. Prove an upper bound on the rank of the Gram matrix from the $m$ input patterns in terms of $d$ and $m$.
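Again, only a proof is asked for, but here is a numerical illustration of what to expect, assuming for example $m = 10$ random patterns in $d = 3$ dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 10                         # more patterns than input dimensions
X = rng.normal(size=(m, d))

K = X @ X.T                          # Gram matrix of the linear kernel
print(f"rank of the {m} x {m} Gram matrix:", np.linalg.matrix_rank(K))
```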
4.4 Cauchy-Schwarz
Solve exercise (4.3) from the textbook by Steinwart and Christmann (2008).
Given a vector space $F$ and a positive semi-definite symmetric bilinear form $\langle \cdot, \cdot \rangle : F \times F \to \mathbb{R}$, that is,
1. $\langle x, x \rangle \geq 0$
2. $\langle x, y \rangle = \langle y, x \rangle$
3. $\langle \alpha x + y, z \rangle = \alpha \langle x, z \rangle + \langle y, z \rangle$
for $x, y, z \in F$, show the Cauchy-Schwarz inequality
$$|\langle x, y \rangle|^2 \leq \langle x, x \rangle \cdot \langle y, y \rangle$$
for all $x, y \in F$.
Hint: Start with $0 \leq \langle x + \alpha y, x + \alpha y \rangle$ and consider the cases $\alpha = 1$ and $\alpha = -1$ if $\langle x, x \rangle = \langle y, y \rangle = 0$. Otherwise, if, e.g., $\langle y, y \rangle \neq 0$, use $\alpha = -\frac{\langle x, y \rangle}{\langle y, y \rangle}$.
Why is this relevant? The Cauchy-Schwarz inequality is of general importance. For example, it is needed for proving the following property, which completes the argument in the lecture.
Recall from the lecture slides that a dot product on a vector space $F$ is a symmetric bilinear form $\langle \cdot, \cdot \rangle : F \times F \to \mathbb{R}$ that is strictly positive definite, i.e., $\forall x \in F : \langle x, x \rangle \geq 0$ with equality only for $x = 0$.
Given a kernel $k$ and
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j)$$
(see lecture slides for details and context) we defined the scalar product
$$\langle f, g \rangle := \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j) \, .$$
In the lecture, we stated
$$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \geq 0 \, ,$$
which follows from the kernel being positive semi-definite. However, we did not prove that $\langle f, f \rangle = 0$ implies $f = 0$, but this is necessary for having a scalar product. A proof can be found in Schölkopf and Smola (2002), which is, however, not fully correct. It argues with a Cauchy-Schwarz inequality for kernels, but a Cauchy-Schwarz inequality for a positive symmetric bilinear form would be needed, and this is what you are supposed to prove in this assignment.
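As a side note, the quantity $\langle f, f \rangle$ above is just the quadratic form $\alpha^T K \alpha$, where $K$ is the Gram matrix, so its non-negativity is easy to check numerically. A minimal sketch, assuming an RBF kernel and random coefficients as illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))                 # points x_1, ..., x_m
alpha = rng.normal(size=15)                  # coefficients alpha_i

# Gram matrix of an example (RBF) kernel.
sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

# <f, f> = sum_ij alpha_i alpha_j k(x_i, x_j) = alpha^T K alpha >= 0.
print("<f, f> =", alpha @ K @ alpha)
```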
References
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
I. Steinwart and A. Christmann. Support Vector Machines. Information Science
and Statistics. Springer-Verlag, 2008.
