Machine Learning (Unsupervised Learning, Tree Based Methods, Support Vector Machines, Classification, Linear and non-linear Regression, and resampling methods)

The following is an example of the coursework that will be expected to be delivered within 12 hours. This coursework contains four questions. Answer ALL FOUR. All questions will be given equal weight (25%).

Time allowed – Expected Writing Time: 2 hours (you would have 12 hours to answer)

1 (a) Suppose that $y_i \sim N(\mu, 1)$ for $i = 1, \dots, n$ and that the $y_i$'s are independent.

i. Show that the sample mean estimator $\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^{n} y_i$ is obtained from minimising the least squares criterion $\hat{\mu}_1 = \arg\min_{\mu} \sum_{i=1}^{n} (y_i - \mu)^2$, and that $\hat{\mu}_1$ is an unbiased estimator of $\mu$. Also find the variance of $\hat{\mu}_1$. [7 marks]

ii. Consider adding a penalty term to the least squares criterion, and therefore using the estimator $\hat{\mu}_2 = \arg\min_{\mu} \left\{ \sum_{i=1}^{n} (y_i - \mu)^2 + \lambda \mu^2 \right\}$ for the mean, where $\lambda$ is a non-negative tuning parameter. Derive $\hat{\mu}_2$, find its bias and show that its variance is lower than that of $\hat{\mu}_1$.
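The two estimators in part (a) can be checked numerically. The sketch below assumes the closed form obtained by setting the derivative of the penalised criterion to zero, namely sum(y) / (n + λ). The function names and the values of µ, n and λ are illustrative choices, not part of the coursework.

```python
import random
import statistics


def sample_mean(y):
    """Least squares estimator: minimises sum((y_i - mu)^2)."""
    return sum(y) / len(y)


def penalised_mean(y, lam):
    """Penalised estimator: minimises sum((y_i - mu)^2) + lam * mu^2.

    Setting the derivative -2 * sum(y_i - mu) + 2 * lam * mu to zero
    gives mu = sum(y) / (n + lam).
    """
    return sum(y) / (len(y) + lam)


def theoretical_moments(mu, n, lam):
    """Bias and variance of the penalised estimator when y_i ~ N(mu, 1)."""
    bias = -lam * mu / (n + lam)      # E[estimator] - mu
    variance = n / (n + lam) ** 2     # never larger than 1/n when lam >= 0
    return bias, variance


if __name__ == "__main__":
    rng = random.Random(0)
    mu, n, lam = 2.0, 20, 5.0  # illustrative values only
    est1, est2 = [], []
    for _ in range(5000):
        y = [rng.gauss(mu, 1.0) for _ in range(n)]
        est1.append(sample_mean(y))
        est2.append(penalised_mean(y, lam))
    print("simulated variance, sample mean:   ", statistics.pvariance(est1))
    print("simulated variance, penalised mean:", statistics.pvariance(est2))
    print("theoretical (bias, variance):      ", theoretical_moments(mu, n, lam))
```

With λ = 0 the penalised estimator reduces to the sample mean. For any λ > 0 it shrinks the estimate towards zero, trading some bias for a smaller variance.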

(b) Consider the multiple linear regression model $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + e_i$, $i = 1, \dots, n$, where $\beta = (\beta_1, \dots, \beta_p)^T$ and the error term $e = (e_1, \dots, e_n)^T \sim N(0, \sigma^2 I_n)$.

i. When p is comparable to n, multicollinearity becomes an issue. Describe the effects of multicollinearity on the estimated coefficients, the associated standard errors and the significance of the coefficients using the ordinary maximum likelihood method.

ii. The ridge regression estimate of β can be obtained by minimising a particular expression with respect to β. Write down this expression as well as an alternative formulation of it.

iii. Explain why ridge regression can potentially correct the problems of multicollinearity. [2 marks]

iv. Provide an advantage and a disadvantage of ridge regression over the standard linear regression.
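For reference, the ridge criterion in part (b) is usually written in the two equivalent forms sketched below. The bound $s$ and the matrix notation ($X$ for the centred predictors, $y$ for the centred response) are symbols introduced here for illustration; each $\lambda \ge 0$ corresponds to some $s \ge 0$.

```latex
\begin{align*}
\text{Penalised form:}\quad
  \hat{\beta}^{\text{ridge}}
  &= \arg\min_{\beta_0,\,\beta}
     \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
     + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}, \qquad \lambda \ge 0, \\[4pt]
\text{Constrained form:}\quad
  \hat{\beta}^{\text{ridge}}
  &= \arg\min_{\beta_0,\,\beta}
     \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
     \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s, \\[4pt]
\text{Closed form (centred data):}\quad
  \hat{\beta}^{\text{ridge}}
  &= (X^T X + \lambda I_p)^{-1} X^T y .
\end{align*}
```

The closed form suggests why the penalty helps with multicollinearity: adding $\lambda I_p$ with $\lambda > 0$ makes the matrix being inverted non-singular even when $X^T X$ is singular or nearly so.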

2 Let $x = (x_1, \dots, x_{100})$, with $\sum_{i=1}^{100} x_i = 20$, be a random sample from the Exponential(λ) distribution with probability density function given by

$f(x_i \mid \lambda) = \frac{1}{\lambda} \exp(-x_i / \lambda)$, $x_i > 0$, $\lambda > 0$. Note that $E(x_i) = \lambda$.

(a) Assign the IGamma(0.1, 0.1) prior to λ and find the corresponding posterior distribution.

(b) Find the Jeffreys' prior for λ. What is the corresponding posterior distribution?

(c) Find a Bayes estimator for λ based on the priors of parts (a) and (b).

(d) Let y represent a future observation from the same model. Find the predictive distribution of y based either on the prior of part (a) or (b).
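The conjugate update behind parts (a) to (d) can be sketched in a few lines. This is a minimal illustration, assuming IGamma(a, b) denotes the shape-scale parametrisation with density proportional to λ^(-(a+1)) exp(-b/λ), and assuming squared-error loss so that the Bayes estimator is the posterior mean. All function names are hypothetical. The Monte Carlo routine draws λ from the posterior, then y given λ, and averages the draws of y.

```python
import random


def inverse_gamma_posterior(n, total, shape, scale):
    """Posterior IGamma(shape + n, scale + total) for the exponential mean.

    Likelihood: lambda^(-n) * exp(-total / lambda).
    Prior:      lambda^(-(shape + 1)) * exp(-scale / lambda).
    Jeffreys' prior (proportional to 1 / lambda) is the limiting case
    shape = scale = 0.
    """
    return shape + n, scale + total


def inverse_gamma_mean(shape, scale):
    """Mean of IGamma(shape, scale); it exists only for shape > 1."""
    return scale / (shape - 1)


def predictive_mean_mc(shape, scale, draws=50_000, seed=1):
    """Monte Carlo estimate of E[y | x] under an IGamma(shape, scale) posterior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        # If G ~ Gamma(shape, rate=scale) then 1 / G ~ IGamma(shape, scale).
        # random.gammavariate takes (shape, scale-of-gamma), hence 1 / scale.
        lam = 1.0 / rng.gammavariate(shape, 1.0 / scale)
        # random.expovariate takes the rate, which is 1 / mean.
        total += rng.expovariate(1.0 / lam)
    return total / draws


if __name__ == "__main__":
    a_post, b_post = inverse_gamma_posterior(100, 20.0, 0.1, 0.1)  # part (a)
    j_shape, j_scale = inverse_gamma_posterior(100, 20.0, 0.0, 0.0)  # part (b)
    print("posterior (a):", a_post, b_post, "mean:", inverse_gamma_mean(a_post, b_post))
    print("posterior (b):", j_shape, j_scale, "mean:", inverse_gamma_mean(j_shape, j_scale))
    print("Monte Carlo predictive mean under (a):", predictive_mean_mc(a_post, b_post))
```

Because E(y | λ) = λ, the predictive mean equals the posterior mean of λ, so the simulated value should sit close to the analytic one.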

(e) Describe how you can calculate the mean of the predictive distribution in software such as R.

3 (a) i. Suppose a non-linear model that can be written as $Y = f(X) + e$, where $e$ has zero mean and variance $\sigma^2$, and is independent of $X$. Show that the expected test error, conditional on $X$, can be decomposed into the following three parts:

$E[(Y - \hat{f}(X))^2] = \sigma^2 + \text{Bias}[\hat{f}(X)]^2 + \text{Var}[\hat{f}(X)]$, where $\hat{f}(\cdot)$ is estimated from the training data.

ii. To estimate the test error rate, one can use the 10-fold Cross Validation (CV) approach or the information criterion approach, e.g. AIC, BIC. What are the main advantage and disadvantage of using the 5-fold CV approach in comparison with AIC or BIC?

iii. State which one of AIC and BIC tends to select a smaller model and explain the reason.
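The decomposition in part (a) i. can be sketched as follows. All expectations are conditional on $X$ and taken over the training data and the new error $e$; the step that drops the cross term assumes the test error $e$ is independent of the fitted $\hat{f}$, which holds when the test observation is not part of the training data.

```latex
\begin{align*}
E\big[(Y - \hat{f}(X))^2\big]
  &= E\big[(f(X) + e - \hat{f}(X))^2\big] \\
  &= E[e^2] + 2\,E[e]\,E\big[f(X) - \hat{f}(X)\big]
     + E\big[(f(X) - \hat{f}(X))^2\big] \\
  &= \sigma^2 + E\big[(f(X) - \hat{f}(X))^2\big], \\[4pt]
E\big[(f(X) - \hat{f}(X))^2\big]
  &= \big(f(X) - E[\hat{f}(X)]\big)^2
     + E\big[(\hat{f}(X) - E[\hat{f}(X)])^2\big] \\
  &= \text{Bias}[\hat{f}(X)]^2 + \text{Var}[\hat{f}(X)].
\end{align*}
```

In the second block the cross term $2\,(f(X) - E[\hat{f}(X)])\,E\big[E[\hat{f}(X)] - \hat{f}(X)\big]$ is zero, which leaves the squared bias and the variance.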

(b) i. The tree in Figure 1 provides a regression tree based on a dataset of patient visits for upper respiratory infection. The aim is to identify factors associated with a physician's rate of prescribing, which is a continuous variable. The variables appearing in the regression tree are private: the percent of privately insured patients a physician has, black: the percent of black patients a physician has, and fam: whether or not the physician specialises in family medicine. Provide an interpretation of this tree.

ii. Consider the regression tree of Figure 2 where the response variable is the log salary of a baseball player, based on the number of years that he has played in the major leagues (Years) and the number of hits that he made in the previous year (Hits). Create a diagram that represents the partition of the predictor space according to this tree.

4 (a) i. Consider the following data: 10, 20, 40, 80, 85, 121, 160, 168, 195. Use the k-means algorithm with k = 3 to cluster the data set. Use the Euclidean distance to measure the distance between the data points. Suppose that the points 160, 168, and 195 were selected as the initial cluster means. Work from these initial values to determine the final clustering for the data. Provide results from each iteration.

ii. What are the main disadvantages of k-means clustering? Why might one want to consider hierarchical clustering as an alternative?
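The iterations asked for in part (a) i. can be traced with a short script. This is a minimal sketch of the standard assign-then-update loop for one-dimensional data, where the Euclidean distance is just the absolute difference. The function name `kmeans_1d` and the choice to keep an old mean when a cluster becomes empty are assumptions made for illustration.

```python
def kmeans_1d(points, means, max_iter=100):
    """Run k-means on 1-D data and return (clusters, means) for each iteration."""
    history = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for x in points:
            nearest = min(range(len(means)), key=lambda k: abs(x - means[k]))
            clusters[nearest].append(x)
        # Update step: recompute each mean (keep the old one if a cluster is empty).
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        history.append((clusters, new_means))
        if new_means == means:  # no change in the means, so the algorithm has converged
            break
        means = new_means
    return history


if __name__ == "__main__":
    data = [10, 20, 40, 80, 85, 121, 160, 168, 195]
    for step, (clusters, means) in enumerate(kmeans_1d(data, [160, 168, 195]), start=1):
        print(f"iteration {step}: clusters={clusters} means={[round(m, 3) for m in means]}")
```

Starting from the means 160, 168 and 195, the point 121 moves from the first cluster to the second at the second iteration, and the third iteration leaves the assignment unchanged.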

(b) i. Data are available for students taking a BSc degree in Data Science, in particular the variables X1: average mark on project coursework, X2: average hours studied per course, and Y: get a degree with distinction. The estimated coefficients of a logistic regression model were β0 = −5, β1 = 0.02, β2 = 0.1. Estimate the probability that a student who scores on average 50% on project coursework and studies 30 hours on average for each course gets a degree with distinction. How many hours would this student need to study on average to have a 50% chance of getting a degree with distinction?

ii. Suppose that we wish to predict whether a high quality chip produced in a factory will pass the quality control ('Pass' or 'Fail') based on x, the measurement of its diameter. Diameter measurements are available for a large number of chips. After examining them it turns out that the mean value of x for chips that passed the quality control was 5mm, while the mean for those that didn't was 7mm. Moreover, the variance of x for these two sets of chips was σ^2 = 1. Finally, 70% of the produced chips passed the quality control. Assuming that x follows the normal distribution, predict the probability that a chip with x = 5.8 will pass the quality control.
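Both parts of (b) reduce to short calculations, sketched below. The logistic part assumes the garbled intercept in the source is β0 = −5. The chip part assumes Bayes' theorem with normal class densities sharing the variance σ^2 = 1, as in linear discriminant analysis with one predictor. The function names are hypothetical.

```python
import math


def logistic_probability(b0, b1, b2, x1, x2):
    """P(Y = 1 | x1, x2) under a logistic regression with two predictors."""
    eta = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-eta))


def hours_for_probability(b0, b1, b2, x1, p=0.5):
    """Solve b0 + b1 * x1 + b2 * x2 = logit(p) for x2."""
    logit = math.log(p / (1.0 - p))
    return (logit - b0 - b1 * x1) / b2


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_pass(x, mean_pass, mean_fail, var, prior_pass):
    """P(Pass | x) from Bayes' theorem with normal class densities."""
    numerator = prior_pass * normal_pdf(x, mean_pass, var)
    denominator = numerator + (1.0 - prior_pass) * normal_pdf(x, mean_fail, var)
    return numerator / denominator


if __name__ == "__main__":
    # Part i: intercept assumed to be -5 (the source shows a garbled sign).
    print("P(distinction):", logistic_probability(-5.0, 0.02, 0.1, 50.0, 30.0))
    print("hours for 50%: ", hours_for_probability(-5.0, 0.02, 0.1, 50.0))
    # Part ii: class means 5 and 7, common variance 1, prior P(Pass) = 0.7.
    print("P(Pass | x=5.8):", posterior_pass(5.8, 5.0, 7.0, 1.0, 0.7))
```

Under these assumptions the linear predictor in part i. equals −1 for the given student, and it reaches zero (a 50% chance) at 40 hours of study.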
