Concept testing: making decisions from your data

A primary purpose of a concept test is to evaluate one or more ideas, presented to respondents as concepts, with the goal of taking them to market successfully. The test can be designed using various methodologies. Almost regardless of methodology, though, the bases for extracting insight and ultimately deciding on the next steps for the concept(s) are relatively few.

This article describes a set of research approaches that serve as the basis for decisions about the next steps and actions to take for new concepts. It includes specific guidance on which approaches to use and how to use them, along with guidelines for testing one concept at a time (monadic), two concepts, and three or more. The guidance can be applied to earlier- or later-stage testing.

Making a decision after testing one or more concepts 

After one or more concepts have been tested (referred to below as the “new concept(s)”), a decision needs to be made regarding the actions to take. Actions can be dictated according to a triage:

1. The concept(s) “pass”: they communicate well the virtues of the products or services they represent, and those products or services are:

  • deemed to hold strong enough purchase interest (or engagement interest, for services), as measured by performance metrics such as top-box or top-two-box ratings on the standard 5-point purchase likelihood scale, or a share of interest calculated from a chip allocation question...
  • by a large enough portion of the respondents evaluating them, where the performance metrics referenced above provide an estimate of “market size” (e.g., the top-box rating percentage multiplied by the size of the product’s relevant universe, as defined by category market size)…
  • to generate enough sales to justify the risk (e.g., the cost to the manufacturer / customer) of taking the product(s) to market. Simply put, the concept(s) perform well enough to move to market with little to no modification.

2. The concept(s) generate some consumer interest, but not enough to “pass”. Modifications can be made to see whether marketability can be improved; additional refinement and testing are suggested.

3. The concept(s) fail, generating none of the strengths or market size desired and with performance metrics so low as to suggest that no further modification of the concept(s) would help.
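The “market size” estimate mentioned under item 1 is simple arithmetic, and a small worked example may help. The sketch below is illustrative only: the function name `estimated_buyers`, the 22% top-box score, and the universe of 5,000,000 category buyers are all hypothetical values chosen for the example, not figures from any study.

```python
def estimated_buyers(top_box_pct: float, universe_size: int) -> float:
    """Rough market-size estimate: top-box percentage times the relevant universe.

    top_box_pct is on a 0-100 scale; universe_size is the category market size.
    Both inputs here are hypothetical illustration values.
    """
    if not 0 <= top_box_pct <= 100:
        raise ValueError("top_box_pct must be between 0 and 100")
    if universe_size < 0:
        raise ValueError("universe_size must be non-negative")
    return top_box_pct * universe_size / 100


# Hypothetical example: 22% top box, 5 million buyers in the category.
buyers = estimated_buyers(22.0, 5_000_000)
print(f"Estimated interested buyers: {buyers:,.0f}")
```

An estimate like this is only a starting point for judging whether sales would justify the risk; it does not account for awareness, distribution, or other market inputs.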

Comparing the performance metrics for a new concept

At issue then is judging where the concept(s) fall in the triage defined above. This requires some sort of comparison for judging concept performance. There are several ways to provide a comparison.

The basis for comparison used most frequently is some form of statistical significance. One form is the z-test, referenced below. A second form relies on norms and databases* (referred to as “statistics without probabilities”).

The performance metrics for a new concept can be compared to:

  • A database of previous concept results, where those previous concepts are as similar as possible to the concept being evaluated (e.g., past concepts from the same category, matched via tagging for comparability**, and tested within a reasonably recent time frame, say the past 6 months). The percentile position of the concept being evaluated serves as the measure of statistical significance and also indicates placement in the triage: concepts falling at or above the 85th percentile*** of the database are taken as “pass”; concepts falling between the 70th and 84th percentiles have hope but require modification; concepts falling below the 70th percentile fail, and further development is deemed a waste of resources.
  • A specific benchmark concept of known in-market performance, against which the new concept(s) can be directly compared. The benchmark is tested at the same time as the new concept, with the same sample specifications. It can be tested as a monadic cell by a separate sample of respondents, or the benchmark and the new concept can be tested sequentially by the same respondents. (Evaluation of sequential test results is best when the effects of testing order are taken into account. For example, the “best” concept can be defined as one that (1) performs well when tested in 1st position, (2) does not fall off significantly in performance when seen in 2nd position, and (3) causes the performance evaluation of the concept seen after it to fall significantly.) The levels of significance used, for either a monadic or sequential monadic design, are the same as those referenced above for the triage and database comparisons. If first-6-month or first-year sales are known for the benchmark, then sales for the new concept can be estimated using an exponential growth model. This can provide very useful marketing information if the concept being tested were to replace the benchmark in market. The exponential growth model assumes that all marketing activity would remain the same with the introduction of the new product.
  • A market-based benchmark obtained by using a volume forecasting model “in reverse”. With this approach, the customer is asked how many units of the product depicted must be sold for it to be considered a viable business opportunity (e.g., to achieve a desired level of profitability within the first 6 months on the market). The required top-box purchase likelihood percentage is estimated by working backward through a volume forecast model with pre-set market inputs (e.g., levels of awareness and distribution). The new concept’s top-box purchase likelihood percentage must then exceed this reverse-engineered value to “pass”.
  • When two or more concepts are tested together, statistical significance testing that directly compares performance metrics between or across concepts. (Significance testing can also be used to compare a database result directly with the new concept’s performance.) When only two concepts are being tested, the standard two-sample, one-tailed z-test can be applied to the two percentages. When two or more new concepts are being evaluated, the research goal will often be to identify the “best” of those being tested: the one concept with the greatest purchase likelihood. In this situation, the appropriate statistical approach is not to test the significance between all pairs of concepts but rather to correctly identify the best concept. A separate “correct selection” methodology is used. (Note that when only two concepts are involved, this “correct selection” test is the one-tailed z-test referenced just above.)
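The percentile comparison against a database of past results, described in the first bullet above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the percentile is defined here as the share of database concepts scoring strictly below the new concept (other definitions exist), a concept landing exactly on a cutoff is placed in the higher tier, and the function names and the database of 20 top-box scores are hypothetical.

```python
from bisect import bisect_left


def percentile_rank(score: float, database_scores: list[float]) -> float:
    """Percent of database concepts scoring strictly below `score`.

    Assumption: "strictly below" is one of several percentile conventions.
    """
    if not database_scores:
        raise ValueError("database_scores must not be empty")
    ordered = sorted(database_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


def triage(score: float, database_scores: list[float],
           pass_cut: float = 85.0, modify_cut: float = 70.0) -> str:
    """Place a concept into the three-way triage using percentile cutoffs."""
    pct = percentile_rank(score, database_scores)
    if pct >= pass_cut:
        return "pass"
    if pct >= modify_cut:
        return "modify and retest"
    return "fail"


# Hypothetical database: top-box percentages for 20 comparable past concepts.
database = [float(x) for x in range(10, 50, 2)]  # 10, 12, ..., 48

for new_score in (45.0, 40.0, 30.0):
    print(new_score, percentile_rank(new_score, database),
          triage(new_score, database))
```

The `pass_cut` parameter can be raised to 95 for a more conservative standard; the middle tier then widens accordingly.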

While several options have been presented for evaluating the performance of a new concept, it is quite reasonable to use more than one approach, if available to the customer. Seeing a new concept succeed from two or more of these perspectives can lend greater confidence in its viability.


*The databases and the norms taken from them are an accumulation of evidence and experience to which new results (e.g., levels of purchase likelihood for a concept just tested) can be compared. This is a Bayesian notion that positions these past results as “priors”, short for “prior information”. As more results (pieces of evidence from newly tested concepts) are added to the database, the “priors” change. They should evolve as new information is added. To be sure, norms can and should be fluid, reflecting the continuing development of concepts tested. (A note of caution regarding changes in norms over time is that… norms change! The user of norms should verify the time frame and composition of the database that produced the norms.) This also suggests… requires… that the database be regularly curated. Older concepts (more than 12 to 18 months old, or less for fast-moving consumer goods) should be shorn. Also, tagging helps identify those past concepts most like the one currently being evaluated. The tighter the alignment, the better the quality of the evidence for judging the goodness of the new concept.

**There’s a balance that needs to be considered when defining the most relevant normative database. Larger databases can provide more stable estimates of performance at the desired percentile (e.g., 85th, 70th, etc.) at the risk of having a more heterogeneous and poorly defined set of concepts used for comparison. This is especially true if databases are not well maintained. In contrast, very specifically defined databases that result from the use of tagging are very homogeneous, containing other concepts very similar to the one being tested. However, these tagged databases can get small, to the point that percentiles are not well defined. An alternative to the percentile comparisons used for larger databases is to use statistical significance tests to compare the new concept being evaluated to a specific result within the smaller database. Consider an example of a specifically tagged database of 10 results. Rank order these 10, from best to worst, and compare the new concept’s performance metric (e.g., its top-box purchase likelihood percentage) to the concept ranked third from the top of the database. Traditional significance testing (a two-sample, one-tailed z-test) relies on the levels of significance used with the triage.
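The small-database comparison just described can be sketched as follows. This is an illustration under assumptions rather than a prescribed procedure: the z-test uses the common pooled-proportion form, the 85% level is read as a one-tailed alpha of 0.15, and the function names, sample sizes, and the 10 database results are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_z_test(p_new: float, n_new: int, p_ref: float, n_ref: int):
    """Two-sample, one-tailed z-test of H1: p_new > p_ref (pooled variance).

    Proportions are on a 0-1 scale. Returns (z, one-tailed p-value).
    """
    pooled = (p_new * n_new + p_ref * n_ref) / (n_new + n_ref)
    se = sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_ref))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_new - p_ref) / se
    return z, 1 - NormalDist().cdf(z)


def compare_to_small_database(p_new, n_new, database,
                              rank_from_top=3, alpha=0.15):
    """Test the new concept against the database result at `rank_from_top`.

    `database` is a list of (top_box_proportion, sample_size) pairs.
    Assumption: alpha=0.15 stands in for the 85% level used in the triage.
    """
    ranked = sorted(database, key=lambda r: r[0], reverse=True)  # best first
    p_ref, n_ref = ranked[rank_from_top - 1]
    z, p_value = one_tailed_z_test(p_new, n_new, p_ref, n_ref)
    return p_ref, z, p_value, p_value < alpha


# Hypothetical tagged database of 10 past results, each with n = 200.
database = [(p, 200) for p in
            (0.30, 0.28, 0.25, 0.22, 0.20, 0.18, 0.17, 0.15, 0.12, 0.10)]

p_ref, z, p_value, significant = compare_to_small_database(0.32, 200, database)
print(f"reference={p_ref:.2f}  z={z:.2f}  p={p_value:.3f}  "
      f"beats reference: {significant}")
```

The same `one_tailed_z_test` helper applies when two new concepts are compared directly with each other.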
***The 85th percentile is used to allow more, rather than fewer, concepts to be considered for market. This lower level of significance (for those accustomed to using 95%) is used in the belief that the customer faces greater risk from missing what could be a viable product than from moving forward with a product that may not meet sales expectations. Using the 95th percentile instead will increase the chance of missing an opportunity but will help reduce the risk of going to market with a weaker product. (If 95% is used, the second level of the triage is then defined as falling between the 70th and 94th percentiles.) Which to use, 85th or 95th, is very much a subjective choice that depends on the researcher’s tolerance for risk.