Effect sizes: relating Cohen’s d to r

This post focuses mostly on three specific measures of effect size. In fact, we will mostly deal with two, but the third is important. Both r (the correlation coefficient) and d (Cohen’s d) are broadly familiar to people who use statistics in Psychology. The third, chi-squared, is also familiar, but not necessarily in this context.

Cohen’s d first (d mainly because that letter hadn’t been used for anything else yet). This is a way of giving a standard value for the difference in mean scores between two groups. It is useful because it is standard: I have found that smokers are more interesting than non-smokers, and on my own interestingness scale the difference is 7.42. Is that a lot? Well, you don’t know. If the scale runs from 0 to 10, it’s a lot; if the scale runs from 0 to 1000, it’s not worth noticing. However, if I say it has an effect size of d = 0.8, you can go and look up how to interpret this and find out that it is quite a big effect. That’s nice, and Cohen’s d is commonly used because of this. Cohen’s d looks at how different the groups are compared with how different the scores are within the groups.

But suppose I have 3 groups: non-smokers, cigarette smokers and vapers. There isn’t a single-number difference between my groups now. There are 3 group means and therefore 3 pairwise differences. I could have 3 effect sizes, but, well, life is short and we should always try to use single numbers when we can.

Cohen spotted this and invented Cohen’s f (probably short for ffs). A really nice and very old-fashioned way of talking about how different a lot of numbers are from each other is the standard deviation. So instead of the difference between 2 group means (d), we can use the standard deviation of the 3 (or more) group means. The principle is the same as with d, just a slightly different implementation.

A few steps get us somewhere interesting. We do this with the assumption of equal-sized groups. If the groups aren’t of equal sizes, then we need a bit more fancy footwork, but the outcome is much the same:

  1. There’s a simple mathematical link between d and f, if you want to see it. If I have two equal sized groups, then the standard deviation of the group mean scores is exactly half of the difference between those means. So:
    d = 2 x f
  2. Suppose we give each participant a score which is the mean score for their group. We will call this their model score: it’s the part of their score we think we can explain. Each person then has a model score plus their own personal extra variation from that model – this we will call their residual score: it is the part of their score we can’t explain.
  3. The standard deviation of group means is the same as the standard deviation of model scores across the whole sample.
  4. The bit that Cohen calls the pooled standard deviation is just the standard deviation of the residual scores.
  5. So we have two formulae:
    f = sd(model_scores)/sd(residuals)
    d = 2 x sd(model_scores)/sd(residuals)
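The model/residual recipe above is easy to sketch in code under the stated assumption of equal-sized groups (a minimal illustration; the function name and scores are made up for the example):

```python
import statistics as st

def cohens_f(groups):
    """Cohen's f via model/residual scores, assuming equal-sized groups.
    Population SDs (pstdev) are used throughout for simplicity."""
    means = [st.mean(g) for g in groups]
    # model score = the group mean, repeated once for each member of that group
    model = [m for m, g in zip(means, groups) for _ in g]
    # residual score = each person's deviation from their own group mean
    resid = [x - m for m, g in zip(means, groups) for x in g]
    return st.pstdev(model) / st.pstdev(resid)

a = [4, 5, 6, 7]   # made-up scores, group 1 (mean 5.5)
b = [6, 7, 8, 9]   # made-up scores, group 2 (mean 7.5)
f = cohens_f([a, b])
d = 2 * f          # for two equal groups, d = 2 x f
```

With these made-up scores f comes out at about 0.894, so d is about 1.789, which matches the direct calculation: a mean difference of 2 divided by a pooled (population) SD of about 1.118.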

Cohen’s d and his f turn out to be a comparison of model scores and residuals. This diagram shows the idea:

Now we can move very easily on to the older r. This is given by comparing model scores with the DV scores themselves, which we can call the total scores:
r = sd(model_scores)/sd(total_scores)
As you can see, it is really quite similar. One final step finishes this.

We can say that:
total_score = model_score + residual_score
and (since variances add) therefore that:
var(total_score) = var(model_score) + var(residual_score)
var(residual_score) = var(total_score) - var(model_score)
or:
sd(total_score) = sqrt(sd(model_score)^2 + sd(residual_score)^2)
sd(residual_score) = sqrt(sd(total_score)^2 - sd(model_score)^2)
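That variances add here is worth a quick numeric check. It works because, by construction, the residuals within each group sum to zero, so model and residual scores are uncorrelated. A small sketch with made-up scores:

```python
import statistics as st

model = [5.5] * 4 + [7.5] * 4                  # made-up group means, repeated
resid = [-1.5, -0.5, 0.5, 1.5] * 2             # deviations within each group
total = [m + r for m, r in zip(model, resid)]  # total_score = model + residual

# var(total) = var(model) + var(resid), as claimed
check = st.pvariance(model) + st.pvariance(resid)
assert abs(st.pvariance(total) - check) < 1e-9
```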
A bit of playing around with these quantities (for anyone who enjoys algebra) gets us to the point where we can say that:
r^2 = Mvar/(Mvar+Rvar)
f^2 = Mvar/Rvar
so
Rvar = Mvar/f^2
r^2 = Mvar/(Mvar+Mvar/f^2)
we can cancel Mvar completely from this (dividing top and bottom by Mvar):
r^2 = 1/(1+1/f^2)
and simplified:
r^2 = f^2/(f^2+1)
then, remembering that f = d/2
r^2 = d^2/(d^2+4)
and lastly:
r = d/sqrt(d^2+4)
or:
d = 2r/sqrt(1-r^2)
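The two conversion formulae are easy to wrap as functions and round-trip (a minimal sketch; the function names are my own):

```python
import math

def d_to_r(d):
    """Convert Cohen's d to r, assuming two equal-sized groups."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r):
    """Convert r back to Cohen's d under the same assumption."""
    return 2 * r / math.sqrt(1 - r ** 2)

# Cohen's conventional "large" d of 0.8 corresponds to r of about 0.37
print(round(d_to_r(0.8), 3))

# the two formulae are exact inverses of each other
assert abs(r_to_d(d_to_r(0.8)) - 0.8) < 1e-12
```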

 

The Campaign to Abolish Ordinal Variables

In stats classes, we are taught that there are 4 types of variable (and mysteriously given more than 4 names for them):

  • Categorical (aka Nominal): categories, groups or labels (eg. dog, cat, goat, parrot)
  • Ordinal: ordered values (eg single cream, double cream)
  • Interval: numbers where doing addition and subtraction makes sense (eg age – in 5 years time my age will be what it is now plus 5)
  • Ratio: numbers where doing addition/subtraction and also multiplication/division make sense, so zero corresponds to nothing and negative numbers mean something different from positive numbers (eg my bank balance – the sign positive/negative tells you whether I am owed money by the bank or vice versa)

Then, having got that, ratio simply disappears, never to be seen or heard of again. This is a loose end in statistics teaching, and loose ends are highly undesirable.

The difference between ordinal and interval sounds like it matters; it is certainly the case that combining single and double cream doesn’t make triple cream. But…

The difference is a bogus fact, and bogus facts are highly undesirable. It is bogus in two ways, and so we can dispense with Ordinal, meaning we can work with just Categorical or Interval (labels or quantities). I hope you will agree that 2 is simpler than 4. It is admirably lazy. Admirably lazy is highly desirable.

Bogus because:

  1. Reason number 1: no interesting quantity is ever truly interval. Take age and consider two things. First, my experience is that time speeds up as I have got older, so the experience of 5 years when I was a teenager (last century) was much longer than it is now that I am a pensioner. So teenager + 5 isn’t the same interval in age as pensioner + 5. Second, add 5 years to my age right now and my health status will be much reduced (eg COVID-19), but add 5 years to my age as a teenager and nothing has changed: at my end of life, 5 years really matter.
  2. Reason number 2: in statistics, whether you treat some quantity as ordinal or interval doesn’t normally matter (although people will tell you otherwise…).

There are another two important things to be said. First, doing statistics with Interval variables is more precise and more general. Numbers matter: if you were given an exam grade as a number (69), that is more meaningful to you than being told that you came 19th in the class. Second, there is a much more interesting distinction lurking beneath the surface about what might count as a typical value of something: what is the typical value for an exam grade?

In practical terms, the choice of ordinal or interval variables determines whether you do the statistics on medians (ordinal) or means (interval). The mean is an average; the median is a middle value. This difference is interesting because the mean and the median are rarely the same. If I join a group of young people, the median age of the group probably doesn’t change, but my extreme age means that the mean age will go up. So median age and mean age have different meanings. That’s an interesting choice to make.
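The mean/median difference is easy to see with made-up ages (a small sketch):

```python
import statistics as st

young_group = [19, 20, 21, 22, 23]   # made-up ages
print(st.mean(young_group), st.median(young_group))  # both 21

with_me = young_group + [68]         # one pensioner joins
print(st.median(with_me))  # 21.5: the median barely moves
print(st.mean(with_me))    # about 28.8: pulled up by the one extreme age
```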

 

The myth of all those different statistical tests in Psychology

Here’s a reveal – but please keep it secret. This is very important – if everyone knew what I am about to say, statistics wouldn’t be hard enough.

There are only two statistical tests in Psychology. Everything else is just for pretend. Here’s the real list:

  • F test: if the DV is an Interval variable.
  • chi-squared test: if the DV is a Categorical variable, or if either of the variables (IV or DV) is Ordinal.

So what about all the others? Well, in BrawStats (and I suspect SPSS, Jamovi, etc etc) the software does just those two tests, but then reports the results as if it had done the “official” test. It hasn’t; it is just pretending.

A bit of an explanation: Psychology has issues. It collects statistical tests, but can’t bring itself to declutter. It’s exactly like internet shopping: it sees a new test and buys it, without asking (i) whether it needs it and (ii) whether it can get rid of older stuff.

The two tests? Think of tests as being about seeing how much of what is going on in the DV can be explained by the IV.

  • F test: how much
    If the DV is Interval, we are looking at how much participants scored. F compares the scores we got with the scores we expected.
  • Chi-squared test: how many
    If the DV is Categorical, we are looking at how many participants were what. Chi-squared compares how many we got with how many we expected to get.

That’s it.

Getting Started

Imagine you have a set of data like this:

  • a few hundred participants (or more)
  • 50 responses/variables (or more)

This is a nice place to be, but the analysis can be quite daunting. In the posts that follow, we will discuss how to approach this.

We will be using BrawStats. See here for details:
BrawStats