ETC3250: Tutorial 6 Help Sheet

Exercises

Question 1

Part A

Check lecture 5 slide 5 to 7 for the formula for the Gini index and an example where it is used. That formula is for K groups and m buckets (i.e. areas after the split). Each bucket (one of m) has pmk^ of the total observations in group k. To answer this question you need to write a function that expresses the Gini index in terms of a single parameter, p, for a single bucket (ignore the concept of a split)

Since we are only working in one bucket, we can ignore m so our Gini index for two groups in a single bucket is: G=p1^(1p1^)+p2^(1p2^)

Since we only have two groups we can have a single parameter for p which we set as: p=p1^=1p2^ Which gives us the Gini index G=p(1p)+(1p)p G=2p(1p)

p is a proporition or a probability, which means it is bound by the same maximum and minimum of a probability.

There are two ways you can find the maximum and minimum 1. Algebraically using the first order condition (dGdp=0) 2. Draw it as a function in R (generate x as a set of values over the domain and y as G) and look at it. You can use whichever method you feel most comfortable with.

Part B

The example in lecture 5 slide 7 goes through an example with a split. The main difference is that we can no longer ignore m and you need to combine the weighted sum of the Gini index for each bucket.

A Gini index can only be computed for each bucket, so for buckets m=L and m=R you will have GL and GR that have pL and pR of the total data in each bucket respectively. We end up with G=pLGL+pRGR

A single bucket in the previous formula from part A would look like: GL=2pL1(1pL1) Where we have only included the group index to differentiate pL from pL1 (they are differnt variables).

Part C

Harriet Note

I would really suggest taking the time to work through this code line by line and work out what it does. A good way to test if you know what is happening is to guess the output of each line before running it and see if you are right or wrong. You have already made this as a function yourself so it should be straight forward to understand what it is doing.

Hint 1: Where to go for information

Check lecture 5 slide 10 to see what minsplit usually does. :::

Part D

The mysplit function takes inputs x (your x variable you are splitting on), spl (a series of values in the domain of x that you will perform a split on) and cl (the class associated with that particular x value). You will set minsplit=1 so it is not of particular importance. You need to be able to write this question in terms of those inputs.

The question broken down into psudo code essentially asks you to:

  1. Let s=c() be a vector of all possible splits.
  2. For each i in s, calculate mysplit(x, spl=i, cl, minsplit)

We know that x, cl and minsplit will be exactly the same every time we calculate this function. The only thing that changes is the split.

Remember, if you use a for loop, you need to keep the values you calculate.

You dont actually need to split halfway between any two values on the x variable, you can just split at each unique value to get the same outcome. Try using the function unique().

You can use a simple ggplot with x=split_values and y=mysplit_output.

Part E

Harriet Note

Quite frankly, this question is very complicated. You are welcome to use the hints and attempt this yourself, however I will not blame you if you just want to look at the code in the solution. PLEASE try and rewrite the solution code in pseudo-code (i.e. plain English) to make sure you understand what it is doing.

Check lecture 5 slide 9 to 13 to see how a random forest works with more than one split.

Basically, you are going to need to manually do what a random forest does, which is:

  1. To find the first split you need to:
  1. For each variable (individually) calculate the Gini index for every unique split of the data.
  2. The best split is the one that minimises the gini index. This is the first split.
  1. Your data is now split into two bins (based off the first split), to get your second split you need to:
  1. Within each bin, for each variable calculate the Gini index for every unique split of the data.
  2. The best split is the one that minimises the gini index. This is the second split.

Question 2

Part A

Check lecture 5 slide 16 to 19 to see how the model works with an example of how to fit it on slide 19.

DO NOT forget to do a training and test split, and example can be found on slide 9

Part B

Check lecture 5 slide 19 for an example that displays the confusion matrix.

Question 3

Part A

Check lecture 5 slides 20 to 22 to understand what a vote matrix is.

Whenever you look at a tour you should describe the shape of the data. For this particular question you should discuss: - the overall shape - if there is clustering and where it is - How this shape translates to meaning in terms of your variables and model - Why this shape might have occured

Part B

Check lecture 5 slides 23 to 24 to understand what variable importance is (and see an example where it is used).

Try and find meaning in what you find by explaining the data in terms of real world phenomena. You want to make a plot x=important variable and y=cause and look to see if you can see why that variable is important. What would the plot look like for a variable you think would be important?

Question 4

Check lecture 5 slide 29 to 31 to see an example that uses boosted trees.