ETC5250: Tutorial 11 Help Sheet

Exercises

Question 1

Part A

Fit model-based clustering to the transformed data, with the number of clusters ranging from 1 to 15 and all possible parametrisations. Summarise the best models, and plot the BIC values for the models. You can also simplify the plot and show just the 10 best models.

Check lecture 10 slides 3 to 15 for the details on model-based clustering. Slide 13 has an example where the penguins data is fitted with 2 to 8 clusters and all possible parametrisations, along with a plot of the BIC values for all the models.

Part B

Why are some variance-covariance parametrisations fitted to fewer than 15 clusters?

Check lecture 10 slides 4 and 5 to see the models that are being fit to make the clustering. Can you work out how many parameters are being fit in each model?

If the number of parameters is greater than the number of observations, a model cannot be fit. How many observations do you have?

Part C

Make a 2D sketch that would illustrate what the best variance-covariance parametrisation looks like conceptually for cluster shapes.

Check lecture 10 slide 6 for some examples of these kinds of sketches.

Part D

How many parameters need to be estimated for the VVE model with 7 and 8 clusters? Compare this to the number of observations, and explain why the model is not estimated for 8 clusters.

To work out the number of parameters in each model, you need to know: the number of observations ($n$), the number of variables ($p$), and the total number of clusters ($G$). In this case, $n = 507$, $p = 11$, and $G = 7, 8$.

The model you are estimating is: $f(x_i) = \sum_{k=1}^{G} \pi_k f_k(x_i; \mu_k, \Sigma_k)$. Therefore, for each cluster you need to estimate $\pi_k$, a scalar that represents the proportion of total observations that belong to that cluster, $\mu_k$, a $p$-dimensional vector, and $\Sigma_k$, a $p \times p$ matrix. Since the $\pi_k$ must sum to 1, only $G - 1$ of them are free parameters, and $f_k$ is an assumed functional form (here a multivariate normal density), so it does not need to be estimated. Depending on the model form, $\Sigma_k$ may not need to be completely re-estimated for each cluster. This is where most of the variation in the number of parameters will come from.

We need to work out the form of the variance-covariance matrix to know how many parameters we need to estimate. The variance-covariance matrix for each cluster is decomposed as $\Sigma_k = \lambda_k D_k A_k D_k^T$, where $\lambda_k$ is a scalar that captures the volume of the cluster, $A_k$ is a diagonal matrix that captures its shape (the relative variance along each axis), and $D_k$ is an orthogonal matrix that captures its orientation.

The model with the most parameters is VVV, as none of these components can be shared between clusters (i.e. they change for each cluster). In this case the number of parameters each cluster's $\Sigma_k$ contributes is: 1 for $\lambda_k$; $p - 1$ for $A_k$ (all the off-diagonals are 0, and the diagonal is scaled so that its determinant is 1, which removes one free value); and $\binom{p}{2} = \frac{1}{2}p(p-1)$ for $D_k$ (the number of free rotation angles in $p$ dimensions). Together these give $\frac{1}{2}p(p+1)$, the number of distinct entries in a symmetric $p \times p$ matrix.

We are looking at the VVE model, which has variable volume ($\lambda_k$), variable shape ($A_k$), and equal orientation ($D$). Therefore each cluster has its own $\pi_k$, $\mu_k$, $\lambda_k$, and $A_k$ that need to be estimated, but all the clusters share the same $D$, so it only needs to be estimated once for all the clusters.
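
The bookkeeping above can be sketched in a few lines of Python. This is an illustration only, not code from the lecture: the function name `n_mixture_params` is made up for this example, and it assumes the constraints used by common implementations (proportions sum to 1, the shape matrix is scaled to determinant 1, the orientation matrix is orthogonal). A different convention would shift the totals slightly.

```python
def n_mixture_params(model: str, p: int, G: int) -> int:
    """Count free parameters of a G-cluster mixture in p dimensions.

    Hypothetical helper for illustration. Assumes: proportions sum to 1,
    det(A_k) = 1, and D_k is orthogonal. Only VVV and VVE are handled.
    """
    if model not in ("VVV", "VVE"):
        raise ValueError("this sketch only handles 'VVV' and 'VVE'")

    proportions = G - 1          # the last pi_k is fixed by the sum-to-1 rule
    means = G * p                # one p-vector mu_k per cluster
    volume = G                   # one lambda_k per cluster (first letter V)
    shape = G * (p - 1)          # one A_k per cluster (second letter V)
    per_orientation = p * (p - 1) // 2
    # Third letter: V = one D_k per cluster, E = a single shared D.
    orientation = G * per_orientation if model == "VVV" else per_orientation

    return proportions + means + volume + shape + orientation


n, p = 507, 11  # values given in Part D
for G in (7, 8):
    for model in ("VVE", "VVV"):
        k = n_mixture_params(model, p, G)
        print(f"{model} with G={G}: {k} parameters (n = {n})")
```

Running this prints the totals for both models at 7 and 8 clusters, which can then be compared with the number of observations.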

Part E

Fit just the best model, and extract the parameter estimates. Write a few sentences describing what can be learned about the way the clusters subset the data.

Since the variables have been standardised, you can directly compare the differences in means and variances across clusters.

Question 2

Part A

In the week 10 tutorial you clustered c1 from the mulgar package, after also examining this data using the tour in week 3. We know that there are 6 clusters, but with different sizes. For a model-based clustering, what would you expect to be the best variance-covariance parametrisation, based on what you know about the data thus far?

You might have to look back at the SPLOM and tour you used to investigate the data. Look at the size, shape and orientation of each cluster and decide which model is best suited.

Part B

Fit a range of models for a choice of clusters that you believe will cover the range needed to select the best model for this data. Make your plot of the BIC values, and summarise what you learn. Be sure to explain whether this matches what you expected or not.

Harriet Comment

By “range” Di means check several different values of G that are suitable for this question. You don’t have to worry about checking different model types since the function will do it for you. This question is the same as Q1(a), an example of which can be found at Lecture 10 slide 13.

Part C

Fit the best model, and examine the model in the data space using a tour. How well does it fit? Does it capture the clusters that we know about?

Question 3

Part A

How many observations are in the data? Explain how this should determine the maximum grid size for an SOM.

Check lecture 10 slides 16 to 21 for information on self organising maps.

To answer this question, think about grid size relative to the number of observations.
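
One way to make this comparison concrete is to compute the average number of observations per node for a few candidate grids. The sketch below is in Python for illustration only; `obs_per_node` is a made-up helper, and `n_obs = 200` is a hypothetical placeholder, so substitute the number of rows in your own data.

```python
def obs_per_node(n_obs: int, xdim: int, ydim: int) -> float:
    """Average number of observations per node on an xdim-by-ydim grid."""
    return n_obs / (xdim * ydim)


n_obs = 200  # hypothetical placeholder, NOT the tutorial data
for side in (2, 4, 6, 10, 15):
    avg = obs_per_node(n_obs, side, side)
    print(f"{side}x{side} grid: {side * side} nodes, {avg:.1f} observations per node")
```

Once the average drops to around one observation per node or below, many nodes will have nothing to summarise, which is worth keeping in mind when choosing the largest sensible grid.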

Part B

Fit a SOM model to the data using a 4x4 grid, using a large rlen value. Be sure to standardise your data prior to fitting the model. Make a map of the results, and show the map in both 2D and 5D (using a tour).

Check lecture 10 slide 18 for an example of fitting a SOM.

Part C

Let’s take a look at how it has divided the data into clusters. Set up linked brushing between detourr and map view using the code provided.