R-Ladies Meet-up

Teaching Computers to See Scatterplots with Scagnostics

Harriet Mason

Package Co-Authors: Di Cook, Ursula Laa, Stuart Lee

2023-11-08

Overview

Big Data - Scagnostics - Cassowaryr - AFLW

  • Hi everyone, I’m Harriet Mason, a PhD student at Monash University
  • Today I’m going to be talking about scagnostics and the package that calculates them, cassowaryr
  • What are scagnostics, you may be thinking? It is pretty likely you have never come across the term before
  • They are a group of measures that evaluate the visual features of a scatter plot
  • Scatterplots are particularly useful for examining all kinds of association between variables
  • and we assess that association by looking at the shape made by the points in a scatter plot, that is, its visual features
  • Unfortunately, big data has too many variables to plot them all.
  • So what if, instead of looking at every pairwise plot, we picked out an interesting subset and only looked at those? That is the main idea behind scagnostics
  • In this presentation I’m first going to explain how scagnostics work, then I’m going to explain the structure of the cassowaryr package that calculates the scagnostics, and finally I’ll show how you can use the package yourself by going through an example using Australian football league statistics

How Do Scagnostics Work?

Take this “Ring” scatter plot…

  • So, let’s see how the scagnostics are calculated by looking at this “ring” shaped scatter plot.

How Do Scagnostics Work?

… and strip away everything but its shape

  • The first thing we do is remove the numbers and just look at the points in relation to each other
  • from here we want to make several objects that represent the scatter plot’s shape,

How Do Scagnostics Work?

… then build the graph based objects

  • the convex hull, which is the shape we would get if we stretched a rubber band around the outside of the ring
  • the alpha hull, which is made by outlining the shape more closely, so it can follow hollows and gaps that the convex hull covers over
  • and the MST (minimum spanning tree), which is made by connecting every point into a single tree using the smallest total edge length possible
  • with these three objects we can define our scagnostics
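To make two of these objects concrete, here is a minimal base R sketch that builds a convex hull and a minimum spanning tree for a synthetic ring. The data and the helper `prim_mst()` are hypothetical illustrations and not part of cassowaryr, which may construct these objects differently. The alpha hull is left out because base R has no function for it.

```r
# Synthetic ring-shaped scatter plot (hypothetical data, not from the talk)
set.seed(1)
n <- 100
theta <- runif(n, 0, 2 * pi)
r <- runif(n, 0.8, 1)
pts <- cbind(x = r * cos(theta), y = r * sin(theta))

# Convex hull: indices of the points the "rubber band" would touch
hull_idx <- chull(pts)

# Minimum spanning tree via Prim's algorithm on Euclidean distances.
# Returns one row per edge: from, to, length.
prim_mst <- function(pts) {
  d <- as.matrix(dist(pts))
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))  # start the tree at point 1
  best_dist <- d[1, ]                    # cheapest known link from the tree to each point
  best_from <- rep(1L, n)                # tree point that provides that link
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "length")))
  for (k in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]  # closest point not yet in the tree
    edges[k, ] <- c(best_from[j], j, best_dist[j])
    in_tree[j] <- TRUE
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

mst <- prim_mst(pts)
```

Every spanning tree on n points has n - 1 edges, so what makes the MST special is that the sum of `mst[, "length"]` is as small as it can be.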

The Scagnostics

  • Convex and Alpha Hull Measures
    • Convex
    • Skinny
  • Association Measures
    • Monotonic
    • Splines
    • Dcor
  • MST Measures
    • Outlying
    • Clumpy*
    • Striated*
    • Sparse
    • Skewed
    • Stringy
  • These are the scagnostics that are in the cassowaryr package, all of which have previously been defined in the scagnostics literature
  • They are sorted into three groups depending on which graph-based object they use
  • Those with an asterisk have two versions: the calculation defined in “Scagnostic Distributions”, a paper by Leland Wilkinson and Graham Wills, and a new adjusted version that we created to solve some issues with binning that I will get to later.
  • To understand how those graph-based objects are converted into scagnostics, it helps to break it down
  • so I will go through three of the scagnostics in a bit more detail

The Scagnostics

Convex uses the alpha hull and convex hull

  • First up we have the scagnostic convex. This is a hull-based measure: it is the ratio of the area of the alpha hull to the area of the convex hull

The Scagnostics

Outlying uses the MST

  • Second is outlying. This is an example of a measure that uses the minimum spanning tree
  • First it identifies the outlying points and the length of their edges, then it calculates how much of the total MST length is due to these outlying edges
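The two steps above can be sketched in base R. This is a simplified illustration under stated assumptions, not the cassowaryr implementation: it assumes an MST edge is "long" when it exceeds the upper boxplot fence of the edge lengths (upper quartile plus 1.5 times the interquartile range), and that a point is outlying when every MST edge touching it is long. The helper names `mst_edges()` and `outlying_sketch()` are hypothetical.

```r
# Minimum spanning tree edges (Prim's algorithm), one row per edge
mst_edges <- function(pts) {
  d <- as.matrix(dist(pts))
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best_dist <- d[1, ]
  best_from <- rep(1L, n)
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "length")))
  for (k in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[k, ] <- c(best_from[j], j, best_dist[j])
    in_tree[j] <- TRUE
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

# Share of the total MST length that sits on edges attached to outlying points
outlying_sketch <- function(x, y) {
  e <- mst_edges(cbind(x, y))
  len <- e[, "length"]
  # Assumed cutoff: upper boxplot fence of the MST edge lengths
  q <- quantile(len, c(0.25, 0.75), names = FALSE)
  long <- len > q[2] + 1.5 * (q[2] - q[1])
  # Assumed rule: a point is outlying when all of its MST edges are long
  n <- length(x)
  n_edges <- tabulate(c(e[, "from"], e[, "to"]), nbins = n)
  n_long <- tabulate(c(e[long, "from"], e[long, "to"]), nbins = n)
  outlying_pt <- n_edges > 0 & n_edges == n_long
  touches <- outlying_pt[e[, "from"]] | outlying_pt[e[, "to"]]
  sum(len[touches]) / sum(len)
}

# A regular grid has no outlying points; adding one far-away point creates one
g <- expand.grid(x = 0:4, y = 0:4)
grid_only <- outlying_sketch(g$x, g$y)
with_outlier <- outlying_sketch(c(g$x, 50), c(g$y, 50))
```

In this toy example the single long edge out to the far point carries most of the MST length, so `with_outlier` is large while `grid_only` is zero.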

The Scagnostics

Splines uses the original data

  • Finally we have splines. This is an example of an association measure, so it takes in the raw data
  • it calculates two spline models, one with x as the dependent variable and one with y as the dependent variable
  • if either of these spline models has very low variance in its residuals, the splines scagnostic will be high
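A rough base R sketch of this idea follows. It is an assumption-laden stand-in rather than the package's code: it uses `smooth.spline()` from the stats package as the spline model, scores each fit as one minus the ratio of residual variance to response variance, and keeps the larger of the two scores. The function name `splines_sketch()` is hypothetical.

```r
# Sketch of a splines-style association measure using smoothing splines
splines_sketch <- function(x, y) {
  # Proportion of the response variance explained by a spline in the predictor
  explained <- function(pred, resp) {
    fit <- smooth.spline(pred, resp)
    res <- resp - predict(fit, pred)$y
    1 - var(res) / var(resp)
  }
  # Fit in both directions and keep the better score, floored at 0
  max(explained(x, y), explained(y, x), 0)
}

# Synthetic examples (hypothetical data): a clean curve and pure noise
set.seed(1)
x <- runif(200, -1, 1)
curve_y <- x^2 + rnorm(200, sd = 0.02)
noise_y <- rnorm(200)

curve_score <- splines_sketch(x, curve_y)  # high: y is a smooth function of x
noise_score <- splines_sketch(x, noise_y)  # lower: no smooth relationship
```

Fitting in both directions matters here: the parabola is well described by y as a function of x but not by x as a function of y, and taking the maximum means the measure still comes out high.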

Assessing the Scagnostics

  • All are on a uniform scale: 0-1, where 0 is low and 1 is high

  • The ordering by scagnostic value hopefully matches how we perceive the structure, but it doesn’t always

  • Scagnostics hopefully identify different features but some are correlated with each other

  • There are a few rules for the scagnostics, they aren’t just a free-for-all
  • so, as well as defining several scagnostics, the “Scagnostic Distributions” paper I mentioned earlier also specifies three main rules these measures must follow.
  • First they should all be on a uniform scale so they are directly comparable
  • Second, each scagnostic should order a set of scatter plots in a way that lines up with human intuition
  • Finally the scagnostics should be mostly uncorrelated. If two measures are highly correlated they are probably identifying a similar visual structure and we could do without one of the scagnostics.
  • Of course these are not all 100% achievable, but these assessments should be kept in mind when adjusting and creating scagnostics.

The CassowaryR Package

  • If you want to calculate these scagnostics yourself, you can do it very easily with the cassowaryr package

Structure

  • Under the Hood Scagnostic Functions
    • Scree object
    • Scagnostic calculations
  • Summary Functions
    • Wide and long data scagnostic summary
    • Top Summary
  • Draw Functions
    • MST, Convex Hull and Alpha hull
  • Data
    • Datasaurus dozen/ Anscombe Quartet
    • Features
  • The package has been written so you can easily incorporate calculating scagnostics into a tidy data workflow
  • The functions that calculate the scagnostics themselves are accessible and can be used in isolation, but the summary functions are how most people will use the cassowaryr package.
  • There are two main scagnostic summary functions, one for long data and one for wide data.
  • There are also two further functions that summarise that summary:
  • one that finds the highest-value scatter plot for each scagnostic and another that finds the highest-scoring scagnostic for each scatter plot.
  • The use of these will all be shown in the example.
  • The draw functions are mostly a debugging tool, they are designed to help you see the graph based objects so you can better understand the outputs of the package.
  • cassowaryr also comes with some data that you can use to test the scagnostics, the most important of which is the features data.

The Features Dataset

  • The features dataset is a set of scatter plots, each with a distinct feature that we want to identify.
  • It is not only important that the scagnostics can identify the features of the scatter plots but also differentiate between them
  • For example, the Ring is a hollow version of the disk scatter plot, and we want the scagnostics to be able to see that

Features Scagnostics

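# Load the packages used in this example
library(cassowaryr)
library(dplyr)

# Calculate every scagnostic for each feature (one scatter plot per group)
features_scagnostics_wide <- features %>%
  group_by(feature) %>%
  summarise(calc_scags(x, y))

# Look at the output
features_scagnostics_wide[1:5, 1:6]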
# A tibble: 5 × 6
  feature  outlying stringy striated striated2 clumpy
  <chr>       <dbl>   <dbl>    <dbl>     <dbl>  <dbl>
1 barrier    0        0.756    0.231    0.0588  0.810
2 clusters   0.0551   0.703    0.272    0.0828  0.802
3 discrete   0        0.796    0.326    0.108   0.949
4 disk       0        0.711    0.26     0.108   0.913
5 gaps       0        0.714    0.287    0.075   0.908
  • Because the features data set is a long data set, we use group_by() and call calc_scags() inside summarise() to get a scagnostic summary of the data set
  • the scagnostic summary can be quite large so this is a glimpse of what you would usually get.

A Scagnostic Visual Table

  • Taking the full scagnostic summary we can make a visual table and have a look at what each scagnostic sees
  • On the x axis is the scagnostic value
  • On the y axis are the scagnostics
  • The points are scatter plots from the features data. Each scagnostic has an example of a low value, a high value, and a moderate value, if it fits.
  • If you are paying close attention you may have noticed that the scagnostics based on the MST are the ones that most frequently only have two plots. This is because their distributions are very condensed
  • This occurs because all previous work in scagnostics had binning as a pre-processing step, and we want binning to be optional in the cassowaryr package
  • when we removed binning and allowed for infinitely small edges in the MST, it warped a few of the scagnostics
  • and so to try and fix this, we have designed some adjusted scagnostics

Clumpy Adjusted

  • Here is the same visual table, but we are only plotting clumpy with clumpy2, where clumpy2 is an adjusted measure that does not require binning.
  • You can see clumpy2 both does a better job of identifying the clusters plot, as it appears relatively higher on the measure, and is also more uniform from 0 to 1.
  • This measure is still being adjusted as it is quite slow, but even in its current state it performs better than the original measure without binning.

AFLW Example: The Data

  • Australian Football League Women’s

  • Data from the 2020 Season

  • 68 variables, 33 of which are numeric

  • 528 Scatter Plots

  • What are we expecting the scagnostics to find?

  • While it is nice to know that the scagnostics work, correctly ordering scatter plots is not what they are used for. The measures need to be able to pick out interesting scatter plots from a large selection of scatter plots
  • To show that they do, in fact, do this, I’m going to give an example using data from the AFLW 2020 season
  • This data set is large, and has more pairs of variables than we could plot ourselves
  • Hopefully the scagnostics will pick out some interesting pairs of variables

AFLW Example: The Data

# Load the packages used in this example
library(cassowaryr)
library(dplyr)

# Get AFLW data
aflw <- fitzroy::fetch_player_stats(2020, comp = "AFLW")

# Scagnostics only work on numeric measures
aflw_num <- aflw %>%
  select(where(is.numeric))

# Average each player's statistics over all games in the season
aflw_num <- aggregate(aflw_num[, 5:37],
  list(aflw$player.player.player.surname),
  mean)

# Calculate scagnostics for every pair of variables
AFLW_scags <- calc_scags_wide(aflw_num[, 2:34])
  • So we fetch the data using the fitzroy package, and we have to make sure we are only using numeric variables since scagnostics only work on numeric variables
  • Then we use the calc_scags_wide() function to calculate every scagnostic on every possible pair of variables. This one can be a bit computationally heavy so you do have to leave it for a bit.
  • Once you have this data the best way to analyse it is to look at a SPLOM (scatter plot matrix) of the scagnostics.

AFLW Example: SPLOM

[Figure: interactive scatter plot matrix (SPLOM) of the outlying, striated2, skewed, splines and dcor scagnostics, with pairwise correlations]
  • Unfortunately a SPLOM of all the scagnostics won’t fit on the slides, so for this example I’ve made one with only a subset
  • The best way to find interesting scatter plots is to find plots that are away from the main mass of scatter plots in this SPLOM.

AFLW Example: Quiz!

Which of these scatter plots was not identified (did not have a high value) by scagnostics?

  • Ok time for a bit of fun
  • So, we run the scagnostics on this data and it tells us how interesting each scatter plot is
  • 5 of these scatter plots had a high value, or a strange combination of values, on one or more of the 11 scagnostics
  • one of them was just two variables I plotted against each other, not really knowing what they looked like
  • I’ll give you 10 seconds to try and guess which plot number you think was the one I picked at random before I change the slide

AFLW Example: Plot 6!

  • Plot 6!
  • It was plot 6! I picked the two variables that were the easiest to spell.
  • Here is plot 6 alongside two other randomly chosen plots
  • These plots don’t have structure that is as interesting as the plots chosen with scagnostics
  • I’ll show you how a couple of them were selected.

AFLW Example: Plot 1

[Figure: Plot 1, disposalEfficiency and hitouts, shown next to the relevant scagnostics plot of outlying and skewed]
  • Plot 1 was high on outlying and skewed
  • This means that even after removing outliers, the data was still really spread out
  • This structure is clearly visible in the scatter plot

AFLW Example: Plot 2

[Figure: Plot 2, totalPossessions and disposals, shown next to the relevant scagnostics plot of splines and dcor]
  • This plot is really high on the association measures
  • Usually a plot that deviates from this big mass in the middle has a non-linear relationship in the scatter plot
  • We don’t have that here, so total possessions and disposals just have a strong linear relationship

AFLW Example: Plot 5

[Figure: Plot 5, bounces and hitouts, shown next to the relevant scagnostics plot of outlying and striated2]
  • This plot is the last we will look at, and it’s my favourite
  • It was identifiable because it was high on striated adjusted and low on outlying
  • This plot clearly shows that almost no players do both bounces and hitouts
  • Hitouts are when you punch the ball when the ref throws it back in, and they are done by your tall players. Bounces have to be done while running, so they are done by your fast players
  • In AFL these two categories seem to have no overlap: tall and fast are mutually exclusive
  • This is a fun example of what we can learn from our data

Top Summaries

  • top_pairs()
AFLW_pairs <- top_pairs(AFLW_scags)
head(AFLW_pairs)
# A tibble: 6 × 4
  Var1      Var2    scag    value
  <fct>     <fct>   <chr>   <dbl>
1 behinds   goals   clumpy  0.833
2 kicks     goals   stringy 0.861
3 kicks     behinds stringy 0.888
4 handballs goals   stringy 0.836
5 handballs behinds stringy 0.842
6 handballs kicks   clumpy  0.922
  • The cassowaryr package also has two functions that summarise the scagnostic information, they are top_scags() and top_pairs().
  • The top_pairs() function gives the top scagnostic for each scatter plot.

Top Summaries

  • top_pairs()
table(AFLW_pairs$scag)

   clumpy    convex      dcor monotonic    skewed    skinny   stringy 
      330         5         2         7        12        50       122 
  • since the scagnostics are supposed to be directly comparable on a 0 to 1 scale, if one scagnostic appears a lot, it is likely identifying an underlying structure through the whole data set

Top Summaries

  • top_scags() and top_pairs()
AFLW_tscag <- top_scags(AFLW_scags)
head(AFLW_tscag)
# A tibble: 6 × 4
  Var1                       Var2                          scag       value
  <fct>                      <fct>                         <chr>      <dbl>
1 goalAccuracy               bounces                       clumpy     0.999
2 metresGained               totalPossessions              clumpy2    0.767
3 intercepts                 rebound50s                    convex    36.5  
4 clearances.totalClearances clearances.stoppageClearances dcor       0.979
5 clearances.totalClearances clearances.stoppageClearances monotonic  0.988
6 disposalEfficiency         hitouts                       outlying   0.840
  • top_scags() gives you the scatter plot that had the highest value on each scagnostic
  • A recurring scatter plot that is high on a lot of measures is likely a scatter plot with an interesting structure

Projection Pursuit Index

[Animation: interactive guided tour over the variables x1, x2, x3, x4 and x5]
  • All the methods I have been through so far have been how you will typically use scagnostics as an exploratory data method
  • This is an example of a guided tour using the convex scagnostic as a projection pursuit index.
  • The data has the L shape from the features data in the x1 and x4 plot, and noise on all other variables
  • Using scagnostics as a projection pursuit index with the tourr package is not part of the cassowaryr package. Some of the measures are also not well suited to being used as projection pursuit indexes, either because they are too slow or too noisy, but this is an area of future development for scagnostics.

Is It on CRAN?

  • Yes
  • but…
  • You may be thinking to yourself, wow what a neat package, is it on CRAN?
  • Yes it is, and it was a nightmare getting there. However, the version on CRAN does not calculate these examples
  • One of the packages we are dependent on updated, which broke all these examples, and then the package underneath THAT updated, which fixed all these examples.
  • This means the current version on CRAN will sometimes have a fit because of complicated reasons. The development version works without issues, so use that one if you can, and hopefully the non-broken version will be on CRAN soon.
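For anyone following along, one common way to install a development version from GitHub is with the remotes package. This is a sketch that assumes the development version lives in the numbats/cassowaryr repository linked at the end of these slides and that remotes is an acceptable installer for your setup.

```r
# Install the development version of cassowaryr from GitHub.
# Assumes the remotes package is available; uncomment the next line if it is not.
# install.packages("remotes")
remotes::install_github("numbats/cassowaryr")

# Load the package as usual
library(cassowaryr)
```

Installing from GitHub replaces any CRAN version already in your library, so reinstall from CRAN later if you want to switch back.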

The Future of the Package

Near

  • Get new version on cran
  • Publish paper

Far

  • Hexagonal binning
  • Projection pursuit indexes
  • What is next for the cassowaryr package?
  • We currently have a paper in the works, and now that our package is no longer completely broken and the examples work again, we might be able to get it published.
  • We also want to make some changes to the package later down the line. I want to try and introduce hexagonal binning so the original scagnostics aren’t essentially useless.
  • I also want to continue to test the scagnostics as projection pursuit indexes and maybe also have those be easy to implement in another package. I didn’t show the code for implementing that projection pursuit in this presentation because it is very long and tedious.
  • Even as the author of the package it took me several weeks to get it working, so something that would be easier to implement would be nice.

Thanks for Listening

  • R package: https://github.com/numbats/cassowaryr
  • Paper: https://github.com/harriet-mason/paper-cassowaryr (private)
  • Slides: https://harrietmason.netlify.app/talks/
  • Thanks for listening,
  • here is the link to the package if you want to have some fun with scagnostics
  • And I’m now happy to take any questions
