Problems with orthodox statistics
Zoltan Dienes has written an interesting and stimulating paper in support of Bayesian inference over orthodox Neyman-Pearson/Fisherian inference. The paper builds on his 2008 book (reviewed by Baguley & Kaye (2010)). In the book Dienes favoured the likelihood approach, but in this paper he appears to have moved much further towards the Bayesian approach. Dienes writes well, and it is worth reading the book, which fleshes out issues only briefly mentioned in the paper.
Criticism of over-reliance on p values is not new; it underlies the confidence-interval and effect-size approaches (see, for example, Ziliak & McCloskey (2008) and Meehl (1978)). Dienes attempts to show that orthodox inference is inconsistent and flawed, and that we have no choice but to do things differently. The issues run deep and are mired in a long history of controversy.
In his paper Dienes gives three research scenarios, which I quote:
1. Stopping rule
You have run the 20 subjects you planned and have obtained a p value of .08. Despite predicting a difference, you know this won’t be convincing to any editor and run 20 more subjects. SPSS now gives a p of .01. Would you:
a) Submit the study with all 40 participants and report an overall p of .01?
b) Regard the study as nonsignificant at the 5% level and stop pursuing the effect in question, as each individual 20-subject study had a p of .08?
c) Use a method of evaluating evidence that is not sensitive to your intentions concerning when you planned to stop collecting subjects, and base conclusions on all the data?
2. Planned versus post hoc
After collecting data in a three-way design, you find an unexpected partial two-way interaction, specifically you obtain a two-way interaction (p = .03) for just the males and not the females. After talking to some colleagues and reading the literature, you realize there is a neat way of accounting for these results: Certain theories can be used to predict the interaction for the males but they say nothing about females. Would you:
a) Write up the introduction based on the theories leading to a planned contrast for the males, which is then significant?
b) Treat the partial two-way as nonsignificant, as the three-way interaction was not significant, and the partial interaction won’t survive corrections for post hoc testing?
c) Determine how strong the evidence of the partial two-way interaction is for the theory you put together to explain it, with no regard to whether you happen to think of the theory before seeing the data or afterwards, as all sorts of arbitrary factors could influence when you thought of a theory?
3. Multiple testing
You explore five possible ways of inducing subliminal perception as measured with priming. Each method interferes with vision in a different way. The test for each method has a power of 80% for a 5% significance level to detect the size of priming produced by conscious perception. Of these methods, the results for four are nonsignificant and one, Continuous Flash Suppression, is significant, p = .03, with a priming effect numerically similar in size to that found with conscious perception. Would you:
a) Report the test as p = .03 and conclude there is subliminal perception for this method?
b) Note that all tests are nonsignificant when a Bonferroni-corrected significance value of .05/5 is used, and conclude that subliminal perception does not exist by any of these methods?
c) Regard the strength of evidence provided by these data for subliminal perception produced by Continuous Flash Suppression to be the same regardless of whether or not four other rather different methods were tested?
Most active researchers will have personally encountered these scenarios. In each of the three scenarios, choosing option a) is the temptation that many researchers succumb to, and it is wrong. Choosing b) is what you should do if you adopt a disciplined Neyman-Pearson approach (as Dienes says, "maybe your conscience told you so"). Finally, choosing c) appears to be the most desirable and can only be implemented using a Bayesian approach.
The Likelihood
It is claimed that the orthodox answers (b) violate the likelihood principle and hence the axioms of probability. Well, what is the likelihood principle? It is: "All the information relevant to inference contained in data is provided by the likelihood."
The likelihood ratio, which represents the relative evidence for one theory over another, has been elevated to the "law of likelihood", a term coined by Hacking (1965) (a difficult book, not recommended). In essence, this law states that the data obtained most strongly support the theory that most strongly predicted them. The strong likelihood principle also applies to sequential experiments (e.g. Armitage et al. (2002), p. 615).
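As a quick illustration of a likelihood ratio (my own toy numbers, not from the paper), suppose a sample mean of 1.8 is observed with a standard error of 1.0, and we compare a theory predicting a population mean of 2 against the null value of 0:
# Toy example (hypothetical numbers): relative evidence for theta = 2 over theta = 0
obtained <- 1.8   # observed sample mean
se       <- 1.0   # its standard error
lik_theory <- dnorm(obtained, mean = 2, sd = se)   # likelihood under the theory
lik_null   <- dnorm(obtained, mean = 0, sd = se)   # likelihood under the null
lik_theory / lik_null   # about 5: the data favour the theory by roughly 5 to 1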
The likelihood is a relative probability, and it differs from the p value. The p value is the probability of obtaining the data, or more extreme data, given a (null) hypothesis and a decision procedure; it is an area under the probability curve. In contrast, the likelihood is the height of the probability curve at the point representing the obtained data. The strength of evidence (the likelihood) can be separated from the probability of obtaining the evidence (the relative costs of the two types of error, α and β, in the Neyman-Pearson formulation). The p value in a statistical test combines and confuses the strength of evidence with the probability of obtaining that evidence - which is why we run into problems in the three scenarios above.
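To make the distinction concrete, here is a minimal sketch using the same hypothetical numbers as above: the p value is a tail area of the null sampling distribution, whereas the likelihood is the height of that curve at the observed value.
# p value: area under the null curve beyond the observed value (two-sided)
2 * (1 - pnorm(1.8, mean = 0, sd = 1.0))   # about .07: "data this extreme or more"
# likelihood: height of the null curve at exactly the observed value
dnorm(1.8, mean = 0, sd = 1.0)             # about .08: evidence at the data themselves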
Bayesian approach using the Bayes factor
Bayesian methods aim to determine the probability of a theory given the data. The posterior odds equal the likelihood ratio multiplied by the prior odds (the odds of the theory before the data are collected). Rather than calculate the posterior odds, Dienes recommends calculating the Bayes factor. This is the ratio of likelihoods for the theory and the null (how much the data support your theory versus the null). According to Jeffreys (1961), a Bayes factor greater than 3 or less than 1/3 indicates substantial evidence for or against a theory (a Bayes factor near 1 represents no evidence either way and usually indicates low sensitivity - i.e. not enough data). With "moderate numbers" a Bayes factor of 3 corresponds roughly to the conventional 5% significance level. The effect size and the shape of the prior distribution (typically uniform or normal) need to be specified in order to calculate the Bayes factor. Explicitly using effect sizes forces the researcher to think about and engage with the data in their research domain. As Dienes (2011) says: "People will think more carefully about theoretical mechanisms so as to link experiments to past research to predict relevant effect sizes. Results sections will become focused, concise, and more persuasive."
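In odds form the arithmetic is simple; the numbers below are made up purely for illustration:
# Bayes' theorem in odds form: posterior odds = Bayes factor x prior odds
prior_odds   <- 1/4   # hypothetical: theory initially judged 4 times less likely than the null
bayes_factor <- 6     # hypothetical: data favour the theory 6 to 1 (substantial on Jeffreys' scale)
posterior_odds <- bayes_factor * prior_odds   # 1.5: the theory is now mildly favoured
posterior_odds / (1 + posterior_odds)         # 0.6: posterior probability of the theory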
Extra data can be added at any time: as more data are collected, the probability of obtaining a weak Bayes factor (close to 1) falls, and misleading (wrong-direction) Bayes factors become rarer. The same cannot be said for p values or for the probability of making a Type I error. When the null hypothesis is true, p values are not driven in any direction as the sample size increases (the Type I error rate remains at 5%), whereas the Bayes factor is driven towards 0.
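A small simulation sketch of my own (assuming data from a standard normal, so the null is true, and a normal N(0, 1) prior on the effect for the theory, which gives the Bayes factor in closed form) shows the contrast:
set.seed(1)
bf_and_p <- function(n, tau = 1) {
  x    <- rnorm(n)                         # data generated under a true null
  xbar <- mean(x)
  se   <- 1 / sqrt(n)
  p    <- 2 * (1 - pnorm(abs(xbar) / se))  # two-sided p value: stays roughly uniform
  bf   <- dnorm(xbar, 0, sqrt(tau^2 + se^2)) / dnorm(xbar, 0, se)  # theory vs null
  round(c(n = n, p = p, BF = bf), 3)
}
sapply(c(20, 200, 2000), bf_and_p)   # the Bayes factor shrinks towards 0; the p value does not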
Conclusion
Dienes has written a persuasive paper. The recommended Bayesian approach should be explored and tried out by researchers, and it is easier than previous suggestions (e.g. Berger, 2003). It will be interesting to see how successful authors are at getting it past journal reviewers. The Bayesian approach is intuitively appealing: it is much easier to think about a ratio of likelihoods for competing theories than it is to appreciate the probability of obtaining extreme data assuming the null hypothesis is true. One problem is that Bayesian probability and Bayes factor calculations currently appear awkward and complicated. Hopefully this will change in time and encourage researchers to think more about their data and the theories they support.
Addendum
Another interesting approach is suggested by Goodman (1999), in which the minimum Bayes factor for the null hypothesis is calculated directly from a z, t or χ2 statistic. This can be compared with the conventional p value, but, more than that, it tells us the largest decrease in the probability of the null hypothesis that the data can justify.
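For a z statistic Goodman's minimum Bayes factor is a one-liner, exp(-z^2/2); here is a short sketch (the p value of .03 is just an example):
# Goodman's (1999) minimum Bayes factor for the null, from a z statistic
p <- 0.03                         # example two-sided p value
z <- qnorm(1 - p / 2)             # corresponding z, about 2.17
exp(-z^2 / 2)                     # about 0.095: at best ~10-fold evidence against the null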
How to calculate a simple Bayes factor
Matlab code is provided on Dienes' website, and it has been translated into R by Baguley & Kaye (2010); the R code is also available from Kaye's website. I reproduce it here, formatted (and also because copying directly from the Baguley & Kaye (2010) paper didn't work for me). First, here is an Excel file containing three worksheets:
1) A VBA macro with a graphic: enter values into the yellow cells and click the Calculate button. The graphic shows the position of the obtained mean, the null likelihood, the theory probability distribution (which does not take the obtained data into account), and the product of the theory distribution and the data likelihood. This product does not represent a probability density as such (but is plotted anyway and looks like a diminutive posterior distribution). The product values are integrated to give the theory likelihood, p(D | Theory) - thanks to Zoltan Dienes for clarifying these issues (I hope I've got it right this time!)
2) An RExcel worksheet where values for the statistics and parameters can be entered in the yellow cells, with the outputs automatically recalculated by Excel (make sure the calculation option is set to Automatic). Follow the instructions in the green cells.
3) a sheet illustrating Goodman's minimum Bayes Factor.
All sheets are locked but can easily be unlocked without entering a password.
In the RExcel worksheet: after installing R and RExcel (RAndFriends is the easiest way), you need to do the following:
1. Start R in Add-Ins RExcel menu
2. Select the blue cell, remove the ' (apostrophe), then press F2 followed by Enter (this runs the function)
3. Select the 3 blue cells, remove the ' in the formula bar, then press Ctrl + Shift + Enter
Alternatively, the following code can be run directly in R:
Bf <- function(sd, obtained, uniform, lower = 0, upper = 1,
               meanoftheory = 0, sdtheory = 1, tail = 2)
{
  # Adapted from Baguley & Kaye (2010); test data can be found from p. 100 of Dienes (2008)
  # sd, obtained: standard error and value of the sample statistic
  # uniform = 1: uniform prior for the theory on [lower, upper]; otherwise a normal prior
  #              with mean meanoftheory and sd sdtheory (tail = 1 makes it one-tailed)
  area <- 0
  if (identical(uniform, 1)) {
    theta <- lower
    range <- upper - lower
    incr  <- range / 2000
    for (A in -1000:1000) {
      theta      <- theta + incr
      dist_theta <- 1 / range
      height     <- dist_theta * dnorm(obtained, theta, sd)
      area       <- area + height * incr          # numerical integration over theta
    }
  } else {
    theta <- meanoftheory - 5 * sdtheory
    incr  <- sdtheory / 200
    for (A in -1000:1000) {
      theta      <- theta + incr
      dist_theta <- dnorm(theta, meanoftheory, sdtheory)
      if (identical(tail, 1)) {
        if (theta <= 0) {
          dist_theta <- 0                          # one-tailed: no prior mass below zero...
        } else {
          dist_theta <- dist_theta * 2             # ...and double the mass above it
        }
      }
      height <- dist_theta * dnorm(obtained, theta, sd)
      area   <- area + height * incr
    }
  }
  LikelihoodTheory <- area                         # p(D | Theory)
  Likelihoodnull   <- dnorm(obtained, 0, sd)       # p(D | Null)
  BayesFactor      <- LikelihoodTheory / Likelihoodnull
  list("LikelihoodTheory" = LikelihoodTheory,
       "Likelihoodnull"   = Likelihoodnull,
       "BayesFactor"      = BayesFactor)
}
Then the following commands can be issued:
Bf(1.09, -2.8, 1, lower = -600, upper = 0)
giving:
$LikelihoodTheory = 0.00166435
$Likelihoodnull = 0.01350761
$BayesFactor = 0.1232157
Bf(1.09, -2.8, 0, meanoftheory = 0, sdtheory = 2, tail = 1)
giving:
$LikelihoodTheory = 0.001955743
$Likelihoodnull = 0.01350761
$BayesFactor = 0.1447882
Bf(1.09, -2.8, 0, meanoftheory = 0, sdtheory = 2, tail = 2)
giving:
$LikelihoodTheory = 0.08227421
$Likelihoodnull = 0.01350761
$BayesFactor = 6.090951
On Jeffreys' (1961) criterion, the first two calls give Bayes factors below 1/3 (substantial evidence for the null over those theories), while the third gives a Bayes factor above 3 (substantial evidence for the two-tailed normal theory over the null).
References:
Armitage, P., Berry, G. & Matthews, J.N.S. (2002) Statistical Methods in Medical Research. Oxford: Blackwell Science.
Baguley, T. & Kaye, D. (2010) Book review of Dienes' book, British Journal of Mathematical and Statistical Psychology, 63: 695–698. Available here: http://bit.ly/kq7iXh and R code available at: http://www.danny-kaye.co.uk/Docs/Dienes_functions.txt
Berger, J.O. (2003) Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Statistical Science, 18(1): 1-32.
Dienes, Z. (2008) Understanding psychology as a science: An introduction to scientific and statistical inference, Palgrave Macmillan, Paperback, ISBN: 978-0-230-54231-0
Dienes, Z. (2011) Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspectives on Psychological Science, 6(3): 274–290.
Goodman, S.N. (1999) Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. Annals of Internal Medicine, 130(12): 1005-1013.
Hacking, I. (1965) Logic of statistical inference, Cambridge University Press.
Jeffreys, H. (1961) The Theory of Probability, Oxford University Press.
Meehl, P. E. (1978) Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4): 806-834.
Ziliak, S.T., & McCloskey, D.N. (2008) The cult of statistical significance: How the standard error cost us jobs, justice and lives, Ann Arbor: University of Michigan Press.