ETC3250: Tutorial 6 Help Sheet
Exercises
Question 1
Part A
Check lecture 5 slide 5 to 7 for the formula for the Gini index and an example where it is used. That formula is for
Since we are only working in one bucket, we can ignore
Since we only have two groups we can have a single parameter for
There are two ways you can find the maximum and minimum 1. Algebraically using the first order condition (
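If you want a quick numerical check on your algebra, here is a minimal R sketch. It assumes the standard Gini index (the sum over groups of p_k(1 - p_k)), which for two groups with proportions p and 1 - p reduces to 2p(1 - p). The name gini_two_class is made up for illustration, and the lecture notation may differ.

```r
# Two-class Gini index as a function of a single proportion p.
# Assumes G = p(1 - p) + (1 - p)p = 2p(1 - p).
gini_two_class <- function(p) {
  2 * p * (1 - p)
}

# Evaluate the index on a grid of proportions between 0 and 1.
p_grid <- (0:100) / 100
g <- gini_two_class(p_grid)

p_at_max <- p_grid[which.max(g)]  # proportion where impurity is largest
g_max <- max(g)                   # largest impurity on the grid
g_min <- min(g)                   # smallest impurity on the grid
p_at_min <- p_grid[g == g_min]    # proportion(s) where impurity is smallest
```

Try guessing p_at_max and p_at_min before you print them, then compare with what the first order condition gives you.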
Part B
The example in lecture 5 slide 7 goes through an example with a split. The main difference is that we can no longer ignore
A Gini index can only be computed for each bucket, so for buckets
A single bucket in the previous formula from part A would look like:
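To see how the bucket-level Gini values combine into one number for a split, here is a minimal R sketch. It assumes the impurity of a split is the sum of the bucket Gini indices weighted by the share of observations in each bucket, which is the usual convention. The function names gini_bucket and gini_split and the toy labels are made up for illustration.

```r
# Gini index of one bucket, computed from the class labels that fall in it.
gini_bucket <- function(cl) {
  if (length(cl) == 0) return(0)
  p <- table(cl) / length(cl)   # class proportions in this bucket
  sum(p * (1 - p))
}

# Impurity of a split: bucket Gini values weighted by bucket size.
gini_split <- function(cl_left, cl_right) {
  n_l <- length(cl_left)
  n_r <- length(cl_right)
  n <- n_l + n_r
  (n_l / n) * gini_bucket(cl_left) + (n_r / n) * gini_bucket(cl_right)
}

# Toy labels, made up for illustration only.
left  <- c("A", "A", "A", "B")
right <- c("B", "B")

g_left  <- gini_bucket(left)
g_right <- gini_bucket(right)
g_split <- gini_split(left, right)
```

Notice that a pure bucket contributes nothing, so the split value is driven entirely by the mixed bucket and its weight.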
Part C
I would really suggest taking the time to work through this code line by line and work out what it does. A good way to test if you know what is happening is to guess the output of each line before running it and see if you are right or wrong. You have already written this as a function yourself, so it should be straightforward to understand what it is doing.
Hint 1: Where to go for information
Check lecture 5 slide 10 to see what minsplit usually does.
Part D
The mysplit function takes inputs x (your x variable you are splitting on), spl (a series of values in the domain of x that you will perform a split on) and cl (the class associated with that particular x value). You will set minsplit=1 so it is not of particular importance. You need to be able to write this question in terms of those inputs.
The question, broken down into pseudocode, essentially asks you to:
- Let s=c() be a vector of all possible splits.
- For each i in s, calculate mysplit(x, spl=i, cl, minsplit)
We know that x, cl and minsplit will be exactly the same every time we calculate this function. The only thing that changes is the split.
Remember, if you use a for loop, you need to keep the values you calculate.
You don't actually need to split halfway between any two values on the x variable; you can just split at each unique value to get the same outcome. Try using the function unique().
You can use a simple ggplot with x=split_values and y=mysplit_output.
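The loop described in this part can be sketched as follows. The tutorial's mysplit is not reproduced here, so the sketch uses a hypothetical stand-in called split_impurity that sends x < spl to the left bucket and returns a size-weighted Gini index; swap in mysplit(x, spl=i, cl, minsplit) for the real thing. The toy data is made up for illustration.

```r
# Stand-in for the tutorial's mysplit(); hypothetical, not the tutorial code.
split_impurity <- function(x, spl, cl) {
  gini <- function(labels) {
    if (length(labels) == 0) return(0)
    p <- table(labels) / length(labels)
    sum(p * (1 - p))
  }
  left  <- cl[x < spl]
  right <- cl[x >= spl]
  (length(left) * gini(left) + length(right) * gini(right)) / length(cl)
}

# Toy data, made up for illustration.
x  <- c(1, 2, 2, 3, 4, 5)
cl <- c("A", "A", "A", "B", "B", "B")

# Candidate splits: every unique value of x.
split_values <- sort(unique(x))

# sapply() keeps one result per candidate split, so nothing is lost.
mysplit_output <- sapply(split_values, function(i) split_impurity(x, spl = i, cl))

results <- data.frame(split_values, mysplit_output)
best_split <- results$split_values[which.min(results$mysplit_output)]

# Optional plot, only built if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(results, ggplot2::aes(x = split_values, y = mysplit_output)) +
    ggplot2::geom_point() +
    ggplot2::geom_line()
}
```

Using sapply() instead of a for loop is just one way to keep the values; a for loop that fills a pre-allocated vector works equally well.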
Part E
Quite frankly, this question is very complicated. You are welcome to use the hints and attempt this yourself; however, I will not blame you if you just want to look at the code in the solution. PLEASE try to rewrite the solution code in pseudocode (i.e. plain English) to make sure you understand what it is doing.
Check lecture 5 slides 9 to 13 to see how a tree works with more than one split.
Basically, you are going to need to manually do what a classification tree does (a random forest repeats the same procedure over many trees), which is:
- To find the first split you need to:
  - For each variable (individually), calculate the Gini index for every unique split of the data.
  - The best split is the one that minimises the Gini index. This is the first split.
- Your data is now split into two bins (based on the first split). To get your second split you need to:
  - Within each bin, for each variable, calculate the Gini index for every unique split of the data.
  - The best split is the one that minimises the Gini index. This is the second split.
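The steps above can be sketched in R as follows. This is not the solution code: the helper names (gini, split_gini, best_split), the rule of sending x < spl to the left bin, and the toy data are all assumptions made for illustration. The sketch finds the best split within each bin separately; you would then compare the two to pick the second split.

```r
# Gini index of a set of labels (standard form, assumed).
gini <- function(cl) {
  if (length(cl) == 0) return(0)
  p <- table(cl) / length(cl)
  sum(p * (1 - p))
}

# Size-weighted Gini index for splitting at x < spl.
split_gini <- function(x, spl, cl) {
  left  <- cl[x < spl]
  right <- cl[x >= spl]
  (length(left) * gini(left) + length(right) * gini(right)) / length(cl)
}

# Search every variable and every unique value; return the best split found.
best_split <- function(data, vars, cl_name) {
  best <- list(var = NA_character_, value = NA_real_, gini = Inf)
  for (v in vars) {
    for (s in unique(data[[v]])) {
      g <- split_gini(data[[v]], s, data[[cl_name]])
      if (g < best$gini) best <- list(var = v, value = s, gini = g)
    }
  }
  best
}

# Toy data, made up for illustration.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(5, 1, 6, 2, 7, 3, 8, 4),
  cl = c("A", "A", "A", "A", "B", "B", "C", "C")
)
vars <- c("x1", "x2")

# First split: best over all variables on the full data.
first <- best_split(toy, vars, "cl")

# Split the data into two bins on the first split, then search each bin.
left_bin  <- toy[toy[[first$var]] <  first$value, ]
right_bin <- toy[toy[[first$var]] >= first$value, ]
second_left  <- best_split(left_bin, vars, "cl")
second_right <- best_split(right_bin, vars, "cl")
```

If a bin is already pure (every observation has the same class), every candidate split in it has a Gini index of zero, so there is nothing to gain from splitting it further.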
Question 2
Part A
Check lecture 5 slides 16 to 19 to see how the model works, with an example of how to fit it on slide 19.
DO NOT forget to do a training and test split; an example can be found on slide 9.
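If you want to see the idea of a training and test split without any extra packages, here is a minimal base R sketch. The lecture example may well use a different function, and the built-in iris data, the seed and the two thirds proportion are stand-ins chosen for illustration only.

```r
set.seed(2024)  # make the random split reproducible

data(iris)      # built-in data, used here only as a stand-in
n <- nrow(iris)

# Sample two thirds of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(2 * n / 3))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
```

Fit the model on train only, and keep test aside until you want to measure how well the model predicts new data.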
Part B
Check lecture 5 slide 19 for an example that displays the confusion matrix.
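As a rough illustration of what a confusion matrix holds, here is a base R sketch using table(). The labels are toy values made up for illustration; in the exercise you would use the true classes from your test set and the predictions from your model, and the lecture example may build the matrix with a different function.

```r
# Toy labels, made up for illustration.
truth     <- factor(c("A", "A", "B", "B", "B", "C"), levels = c("A", "B", "C"))
predicted <- factor(c("A", "B", "B", "B", "C", "C"), levels = c("A", "B", "C"))

# Rows are the true class, columns are the predicted class.
conf_mat <- table(truth = truth, predicted = predicted)

# Overall accuracy: correct predictions sit on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

Off-diagonal cells tell you which classes get confused with each other, which is usually more informative than the accuracy alone.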
Question 3
Part A
Check lecture 5 slides 20 to 22 to understand what a vote matrix is.
Whenever you look at a tour you should describe the shape of the data. For this particular question you should discuss:
- The overall shape
- If there is clustering and where it is
- How this shape translates to meaning in terms of your variables and model
- Why this shape might have occurred
Part B
Check lecture 5 slides 23 to 24 to understand what variable importance is (and see an example where it is used).
Try to find meaning in what you find by explaining the data in terms of real world phenomena. You want to make a plot with x=important variable and y=cause and look to see if you can see why that variable is important. What would the plot look like for a variable you think would be important?
Question 4
Check lecture 5 slides 29 to 31 to see an example that uses boosted trees.