What is the Positive Predictive Value measure?
The p-value
The p-value is a translation of a statistical test result into a probability. P-values have variability; while the obtained p-value, resulting from a specific statistical test, may signal a statistically significant result, variability in the p-value may cast doubt on the true significance, or at least the stability, of the result. This doubt is amplified when the statistical test result is compared to a pre-set and rigidly applied action standard (e.g., the result must be statistically significant with 95% confidence). Indeed, even though a test result is significant with 95% confidence, and so passes the action standard, the lower bound on a confidence interval drawn around that result may be closer to 80% or 85%.
This notion of uncertainty in statistical test results can also be expressed in terms of whether the same statistical conclusion would be reached if the test were redone (independently replicated). A statistical measure called the Positive Predictive Value (PPV) can be used to quantify this.
The Positive Predictive Value (PPV) measure
Specifically, PPV is the probability that a specific statistical test result is indeed true or correct. PPV reassesses the results of a specific significance test by accounting for the error rates (alpha, the probability of incorrectly calling a result significant when it is not, and beta, the probability of incorrectly calling a result insignificant when it is in fact significant) as well as the historical tendency to find statistical significance when running similar tests (e.g., product tests within the same category performed within the past year). In formula form, PPV is:
(R – R x beta) / ((R – R x beta) + alpha).
R, in the formula, represents the rate of finding statistically significant results in past research. Specifically, it is the number of significant results divided by the number of insignificant results. This rate, R, can be quantified using the Zappi database. For example, with reference to concept tests in the edibles category, 498 statistical tests out of 1,448 had differences large enough to be taken as significant with 95% confidence. R is then 498 / (1,448 – 498) = .52. In a sense, R represents an expectation, or Bayesian prior, regarding the likelihood of obtaining significance for the next test.*
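The calculation of R from historical counts can be sketched in a few lines of Python (the counts are the article's edibles-category example; variable names are mine):

```python
# Prior odds R of significance, from past test results:
# 498 significant outcomes out of 1,448 concept tests.
significant = 498
total = 1448

# R is an odds ratio: significant results divided by
# INsignificant results, not a simple proportion.
r = significant / (total - significant)
print(round(r, 2))  # prints 0.52
```

Note that dividing by the total instead (498 / 1,448 = .34) would give a proportion, not the odds the formula requires.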
As mentioned above, alpha and beta are error rates associated with incorrect statistical inferences. For purposes here, alpha is set to .05, consistent with the typical desire to test with 95% confidence; the researcher runs a 5% risk of making a mistaken judgment by calling the result significant when it's not. Setting the beta risk, or, conversely, the power, is a bit more problematic. Power is not often taken into account when setting sample size needs for the typical product test, especially those required at early stages of concept development. Budget constraints hold sway, and smaller samples, with 30 to 100 respondents, are not unusual. As such, it is not unusual to find power, and the beta error, hovering at around 50% (i.e., statistical tests are as likely as not to detect and call as significant a difference of a specific magnitude).
Returning to the formula for PPV and using as inputs R = .52, beta = .5, and alpha = .05, PPV is equal to .83. In words: within the confines of 50% power and a prior expectation (R = .52) that a test is about as likely to be significant as not, when a statistical test result is significant with at least 95% confidence, there is an 83% chance that the conclusion of significance is correct.
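A minimal Python sketch of this calculation (the function name and structure are mine, not a standard library routine):

```python
def ppv(r, alpha, beta):
    """Positive Predictive Value: the probability that a
    significant result is correct, given prior odds r of a
    real effect, alpha (false-positive risk) and beta
    (false-negative risk).

    PPV = (R - R*beta) / ((R - R*beta) + alpha)
    """
    true_positives = r * (1 - beta)  # same as r - r*beta
    return true_positives / (true_positives + alpha)

# The article's inputs: R=.52, 50% power, 95% confidence.
print(round(ppv(r=0.52, alpha=0.05, beta=0.5), 3))  # prints 0.839
```

The result, .839, rounds to the .83–.84 figure quoted in the text.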
How sensitive is Positive Predictive Value to its parts?
The most dominating effect is held by R. Numerically, this makes sense because of the restrictions placed on the alpha (usually between .2, for 80% confidence, and .05, for 95% confidence) and beta (usually in the vicinity of .2, for 80% power, to .5, for 50% power) levels. Focusing specifically on alpha: a historically lower rate of achieving statistical significance (i.e., smaller values of R) leads to greater influence of alpha levels on PPV. Re-expressed, reducing the alpha risk (e.g., from 95% confidence to 99%) matters most to increasing PPV when there is a weak history of significance. Consider, for example, if historically significance was infrequently attained, e.g., R = .1, and beta errors were contained at around .2 (i.e., 80% power), then with alpha set to .05 the resulting value of PPV is .62. However, increasing the desired level of confidence to 99%, and so reducing alpha to .01, raises PPV to .89, a great improvement in the chance of attaining truth.**
Conversely, strong evidence of past success, a large value of R, increases the probability of future truth and reduces the influence of alpha and beta (statistical testing errors). Insert into the PPV formula a value for R of .8. With minimal power (at 50%) and alpha set to .05, PPV is .89. Reducing alpha to .01 then increases PPV to .98. The range of PPV values shrinks, as does the effect of alpha, when there is a stronger history of significance.
The practical purpose of Positive Predictive Value
Perhaps the greatest practical purpose of PPV is, in the same spirit as reporting confidence bounds around obtained p-values, to alert researchers that having achieved statistical significance for one specific test does not at all guarantee correct inference regarding the results of a second, replicated test. PPV can represent a lower bound on confidence. So, statistical test results are indicative of having met an action standard only when both the level of confidence and the PPV exceed a pre-set value. The calculation of PPV also reminds the researcher that current test results can be strengthened, or put into better context, by reference to tests performed in the past. This is the utility of R. The level of confidence adopted by the researcher should take account of this history. For example, to better prove the point that a new concept is significantly better when previous attempts have failed, use a stricter confidence level, say 99%.
*Use of the term "rate" to label R is a bit of a misnomer. R is actually a measure of the odds of finding a significant result relative to finding an insignificant one.
**A similarly large range of PPV values occurs when beta is set to .5 (i.e., 50% power). With R = .1, PPV extends from .5 to .83 as alpha declines from .05 to .01.