ETC3250: Tutorial 4 Help Sheet
Exercises
Question 1
In the lecture, we used the bootstrap to examine the significance of the coefficients for the second principal component from the women's track PCA. Do this computation for PC1. The question for you to answer is: Can we consider all of the coefficients to be equal?
Check lecture 3 slides 20 to 24. The code for this question is very similar to the code on slide 24.
What does the red dashed line on the plot mean?
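If you want to experiment before finding the slide code, here is a minimal sketch of the bootstrap idea. It assumes the numeric track times are in a matrix called track_num; that name, the seed, and the number of resamples are illustrative assumptions, not the slide 24 code:
set.seed(201)                      # arbitrary seed for reproducibility
pc1 <- prcomp(track_num, scale. = TRUE)$rotation[, 1]
boot_pc1 <- t(replicate(1000, {
  idx <- sample(nrow(track_num), replace = TRUE)   # resample rows with replacement
  b <- prcomp(track_num[idx, ], scale. = TRUE)$rotation[, 1]
  if (sum(b * pc1) < 0) b <- -b    # PCA signs are arbitrary; align with the original PC1
  b
}))
apply(boot_pc1, 2, quantile, probs = c(0.025, 0.975))  # interval for each coefficient
One useful reference value when interpreting the intervals: if all p coefficients of a unit vector were equal, each would be 1/sqrt(p).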
Question 2
The ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when the covariance between variables is 0.
Part A
What is the mean and covariance matrix of a multivariate standard normal distribution?
This is a pretty basic question about the multivariate normal distribution. I can only suggest checking the Wikipedia pages for the standard normal distribution and the multivariate normal distribution.
Part B
Simulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)
To simulate the data you will need to use set.seed and rmvnorm. The variables should be independent, so you might find the function diag useful in generating the variance-covariance matrix.
There are many ways to calculate the mean and covariance. I suggest using apply to calculate the mean, where x is your data, MARGIN indicates the calculation will be done over columns, and FUN can just be mean (you don't need the parentheses on the function mean). You can use cov for the variance-covariance matrix.
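A minimal sketch of this step, assuming rmvnorm comes from the mvtnorm package (the seed is an arbitrary choice):
library(mvtnorm)
set.seed(515)
x <- rmvnorm(55, mean = rep(0, 7), sigma = diag(7))  # 55 draws from a 7D standard normal
apply(x, MARGIN = 2, FUN = mean)  # sample mean of each of the 7 variables
cov(x)                            # 7 x 7 sample variance-covariance matrix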
We are going to compare this simulation to our results from the track data. Consider the dimensionality of the data (for numeric variables).
Part C
Compute PCA on your sample, and note the variance of the first PC. How does this compare with variance of the first PC of the women’s track data?
Check lecture 2 slide 45 for code that computes a PCA.
Remember, your data is simulated from a standard normal distribution, so you shouldn’t scale or center the data.
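Continuing the sketch above, where x is the simulated sample (the prcomp call follows the no-scaling, no-centring advice; adjust if your workflow differs):
x_pca <- prcomp(x, center = FALSE, scale. = FALSE)
x_pca$sdev[1]^2   # variance of the first PC
summary(x_pca)    # proportion of total variance explained by each PC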
Look at how much of the total variance of the track data is covered by PC1, and then look at how much of the total variance of your simulated data is covered by PC1. Is it a valid check to just compare these two numbers? What would be required to make sure that PC1 of the track data does not just happen to differ from PC1 of our simulated data for this particular draw?
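One way to approach the last question is to repeat the draw many times and look at the spread of PC1's variance under independence. A sketch, where the number of repetitions is an arbitrary choice:
pc1_var <- replicate(500, {
  xs <- rmvnorm(55, sigma = diag(7))
  prcomp(xs, center = FALSE, scale. = FALSE)$sdev[1]^2
})
quantile(pc1_var, probs = c(0.025, 0.5, 0.975))
This is essentially what the ggscree guide is doing for you.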
Question 3
Permutation sampling is used to assess the significance of relationships and the importance of variables. Here we will use it to assess the strength of a non-linear relationship.
Part A
Generate a sample of data that has a strong non-linear relationship but no correlation, as follows:
set.seed(908)
n <- 205
df <- tibble(x1 = runif(n)-0.5, x2 = x1^2 + rnorm(n)*0.01)
and then use permutation to generate another 19 plots where x1 is permuted. You can do this with the nullabor package as follows:
set.seed(912)
df_l <- lineup(null_permute('x1'), df)
and make all 20 plots as follows:
ggplot(df_l, aes(x=x1, y=x2)) +
geom_point() +
facet_wrap(~.sample)
Is the data plot recognisably different from the plots of permuted data?
Check lecture 3 slide 26 for information on permutation.
Consider how permutation works and how the other 19 plots were generated. What kind of relationship should become invisible when variables are permuted? How does that translate to being able to identify your data in this lineup plot? What does that mean about your data?
Part B
Repeat this with a sample simulated with no relationship between the two variables. Can the data be distinguished from the permuted data?
To simulate the data you can basically use the code above, except instead of setting x2 = x1^2 + rnorm(n)*0.01 you can remove the dependence by setting x2 = rnorm(n)*0.1.
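For example, reusing n from Part A (the new seed is an arbitrary choice):
set.seed(853)
df2 <- tibble(x1 = runif(n) - 0.5, x2 = rnorm(n) * 0.1)
df2_l <- lineup(null_permute('x1'), df2)
ggplot(df2_l, aes(x = x1, y = x2)) +
  geom_point() +
  facet_wrap(~.sample)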
Considerations are identical to those provided in the hint for Part A.
Question 4
For the penguins data, compute 5-fold cross-validation sets, stratified by species.
Part A
List the observations in each sample, so that you can see there is no overlap.
Check lecture 3 slides 13 to 16 for information about k-fold cross-validation (in this case, k = 5). You can find very similar code for this section on slide 14.
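If you get stuck, here is a sketch using the rsample package; the object name p_tidy follows the lecture notes, so adjust it to however you loaded the penguins data:
library(rsample)
set.seed(1148)
p_folds <- vfold_cv(p_tidy, v = 5, strata = species)
# Row indices of each fold's test (assessment) set: compare these lists to check for overlap
fold_index <- lapply(p_folds$splits, complement)
fold_index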
Part B
Make a scatterplot matrix for each fold, coloured by species. Do the samples look similar?
Each of the listings in Part A should line up with the row indices of one fold of the data. Look at the code and work out what it is doing.
What is the code required for p_tidy[row_index,] to work? Remember, the training set is the remaining data after the test set is removed, and you can remove data through an index by using p_tidy[-test_index,].
Look at ggscatmat for the scatterplot matrix. You can use columns to specify the numeric variables and color (American spelling) for the colouring of the points.
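A sketch for one fold, reusing fold_index from the Part A sketch; ggscatmat is in the GGally package, and the column positions below are an assumption, so check where the numeric variables sit in your data:
library(GGally)
ggscatmat(p_tidy[fold_index[[1]], ], columns = 2:5, color = "species")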
You can just copy and paste this for each scatterplot matrix you need to make.
Does your data vary between subsets? Is this variation natural?
Question 5
What was the easiest part of this tutorial to understand, and what was the hardest?