What makes a difference statistically significant?
"Two of my stimuli have the same score, but only one of them is showing a statistically significant difference to a norm or benchmark score. Why is this?"
When testing mean scores for significance against a norm, several factors are taken into account, not just the difference in score. The number of respondents (the sample size) and the spread of responses (the standard deviation) also play a role in the calculation.
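All three factors come together in the standard one-sample t statistic (shown here in its textbook form; Zappi's exact calculation may differ in detail):

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
```

Here x̄ is the observed mean score, μ₀ is the norm, s is the sample standard deviation (the spread), and n is the number of respondents. The same difference between mean and norm produces a larger t, and therefore stronger evidence, when n is larger or s is smaller.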
Score testing
When testing the scores from your surveys against each other, Zappi uses a t-test for both means and proportions. When comparing a survey score with a norm, we use a one-sided t-test of equivalence for means and a z-test for proportions.
By default, our statistical tests only highlight differences at the 95% confidence level; in some cases, we also offer a 90% or 99% confidence level option.
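As a rough illustration of what this looks like in practice, here is a minimal sketch of a one-sample t-test against a norm using Python and SciPy. The respondent scores and the norm value below are invented for illustration; this is not Zappi's internal code.

```python
# Minimal sketch: testing a survey's mean score against a norm
# (hypothetical data; not Zappi's internal implementation).
from scipy import stats

scores = [7, 8, 6, 9, 7, 8, 7, 6, 8, 9, 7, 8]  # hypothetical respondent ratings
norm = 6.4                                      # hypothetical benchmark score

# One-sided test: is the mean significantly ABOVE the norm?
t_stat, p_value = stats.ttest_1samp(scores, popmean=norm, alternative="greater")

print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3f}")
print("Significant at the 95% level?", p_value < 0.05)
```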
My scores are confusing, help me understand
Sometimes your scores can produce confusing sig-test results. For clarity, we'll walk through a specific example (see below): the higher score on the third ad (7.0) is not flagged as significantly above the norm, while the lower score on the first ad (6.9) is flagged as significantly above it.
The impact of distribution on significance tests
The calculation of statistical significance takes account of the spread (or distribution) of the respondents’ answers. When trying to answer the question “is the mean score significantly different from the norm?”, we need to have confidence that the mean is a fair representation of the data. The way the equation accounts for this is through the standard deviation.
Standard deviation is a quantity expressing how much the members of a group differ from the mean; in other words, it captures the “spread” of the data. In the two charts below, you can see how the standard deviation (σ) changes as the spread changes.
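As a quick illustration (with made-up numbers rather than the charted data), two sets of answers can share the same mean while having very different standard deviations:

```python
# Two hypothetical answer sets with the same mean (5.5) but different spread.
import statistics

tight  = [5, 5, 6, 6, 5, 6, 5, 6]    # answers cluster around the mean
spread = [1, 2, 4, 7, 9, 10, 3, 8]   # answers cover the whole scale

print(statistics.mean(tight), statistics.stdev(tight))    # 5.5, sigma ~0.5
print(statistics.mean(spread), statistics.stdev(spread))  # 5.5, sigma ~3.4
```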
To highlight with an example, imagine one of your stimuli had this distribution:
This occurs when an equal number of people choose each answer: as many people answer 1 as answer 2, 3, and so on up to 10 (a uniform distribution). In this example, your mean would be 5.5. Linking back to the Zappi chart, the norm for Brand Feeling is 6.4.
It would be unfair to say that this concept, with uniformly distributed respondent answers, is significantly below the norm. When you look at the distribution of the data, it is not clear what the real answer is: no one felt strongly that the advert was a 1, or a 2, and so on. The equation we use for calculating significance accounts for this “real answer” factor through the variation in the data (the standard deviation).
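To sketch why (with an invented sample size, since none is given here): for the same gap between mean and norm, a wider spread shrinks the t statistic and weakens the evidence.

```python
# Same 0.9-point gap to the norm, two hypothetical spreads (n = 20).
import math

norm, mean, n = 6.4, 5.5, 20

for label, sd in [("uniform-like spread", 2.87),  # ~sigma of answers spread evenly over 1-10
                  ("tight spread", 1.0)]:
    t = (mean - norm) / (sd / math.sqrt(n))
    print(f"{label}: t = {t:.2f}")

# The uniform-like t (~ -1.4) sits inside the usual 95% cutoff, while the
# tight-spread t (~ -4.0) is well beyond it: same gap, very different evidence.
```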
Linking to a specific case
The mean values are very similar: one is 7.0 and the other is 6.9. If we only used mean scores when testing for significance, then 7.0 would be the clear winner.
VIP to See (The Third Example)
The “VIP to See” score, with a mean of 7.0, has a larger variance (more spread in its results) than the 6.9 score. It could be considered closer to the uniform example above, so “the real answer” factor is less clear. That means that when we run it through the significance test, it does not come out as significant: the data is too spread out.
Your Loyalty (The First Example)
For “Your Loyalty” the score is 6.9. The distribution of the associated data is a bit tighter than that of “VIP to See”. In the images below, “Your Loyalty” would be the left image, with “VIP to See” as the right. On the chart, this shows up as differences in how many respondents picked 1 or 2, as well as in the higher numbers around 6, 7, 8, and 9.
Because of this tighter spread, the system is more confident that 6.9 is “the real answer”, and it can therefore be more confident in saying the score is significantly different from the norm.
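Putting rough numbers to this pair (the means match the example, but the standard deviations and sample size are invented for illustration, since the real spreads aren't reproduced here):

```python
# One-sided t-test of each mean against the norm, from summary statistics.
# Means match the example above; the spreads and n are hypothetical.
import math
from scipy import stats

norm, n = 6.4, 50

cases = {
    "VIP to See":   (7.0, 3.0),   # higher mean, wider spread
    "Your Loyalty": (6.9, 1.5),   # lower mean, tighter spread
}

for name, (mean, sd) in cases.items():
    t = (mean - norm) / (sd / math.sqrt(n))
    p = stats.t.sf(t, df=n - 1)   # one-sided p-value for "mean above norm"
    print(f"{name}: t = {t:.2f}, p = {p:.3f}, significant at 95%: {p < 0.05}")
```

With these illustrative spreads, “Your Loyalty” (6.9) comes out significant while “VIP to See” (7.0) does not, mirroring the behaviour described above.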
Summary
“VIP to See” is not significantly above the norm because the standard deviation is too large, despite the mean being high.
“Your Loyalty” is significantly above the norm because its mean is large enough and its standard deviation is small enough (i.e. it has a tighter spread).