Consistency and Discrimination as Measures of Good Judgment

This post is based on a paper that appeared in Judgment and Decision Making, Vol. 12, No. 4, July 2017, pp. 369–381: “How generalizable is good judgment? A multi-task, multi-benchmark study,” authored by Barbara A. Mellers, Joshua D. Baker, Eva Chen, David R. Mandel, and Philip E. Tetlock. Tetlock is a legend in decision making, and it is likely that he is an author because the paper builds on his past work rather than because he was actively involved. Nevertheless, this paper provides an opportunity to go over some of the ideas in Superforecasting and expand upon them. Whoops! While looking for an image to put on this post, I found the one above. Mellers and Tetlock look married, and they are. I imagine that she deserved more credit in Superforecasting: The Art and Science of Prediction. Even columnist David Brooks, whom I have derided in the past, beat me to that fact.

The authors note that Kenneth Hammond’s correspondence and coherence (Beyond Rationality) are the gold standards by which to evaluate judgment. Correspondence means being empirically correct, while coherence means being logically correct. Human judgment tends to fall short on both, but it has gotten us this far. Hammond often decried psychological experiments as poorly designed measures, but he complimented Tetlock on his use of correspondence to judge political forecasting expertise. Experts were found wanting, although they did better when the forecasting environment provided regular, clear feedback and repeated opportunities to learn. According to the authors, Weiss & Shanteau suggested that, at a minimum, good judges (i.e., domain experts) should demonstrate consistency and discrimination in their judgments. In other words, experts should make similar judgments when cases are alike, and dissimilar judgments when cases are unalike. Mellers et al. suggest that consistency and discrimination are silver standards that could be useful. (As an aside, I would suggest that Ken Hammond would likely have had little use for these: coherence is logical consistency, and correspondence is empirical discrimination.)

In Mellers’ and Tetlock’s previous superforecaster research, individuals were identified over the course of four geopolitical forecasting tournaments sponsored by the Intelligence Advanced Research Projects Activity (IARPA), the research and development wing of the U.S. intelligence community. In these tournaments, thousands of people predicted the outcomes of questions such as “Will the U.N. General Assembly recognize a Palestinian state by September 30, 2011?” Each year, the top 2% of subjects were designated “superforecasters” and were assigned to work together in elite teams.

In this paper, Mellers et al. ask a question that follows naturally from those forecasting tournaments: are superforecasters, who were markedly better on measures of correspondence than their peers, also better on other standards of good judgment, including consistency, discrimination, and coherence? The researchers used data from two online surveys to compare the judgments of superforecasters to those made by a less elite group of forecasters (regular forecasters) and University of Pennsylvania undergraduates.

The researchers administered the two surveys six months apart. To measure consistency and discrimination, they combined five distinct uncertainty phrases with a selection of events drawn from (then) current events. They then asked subjects to provide numerical estimates of the “best,” “lowest,” and “highest” plausible interpretations of these phrase-event pairs. Below is one set of events:

1. China will seize control of the Second Thomas Shoal in the South China Sea before the end of 2014.
2. The kidnapped girls in Nigeria will be brought back alive before the end of 2014.
3. North Korea will conduct a new multistage rocket or missile launch before September 2014.
4. China’s annual GDP growth rate will be less than 7% in the first fiscal quarter of 2015.
5. Russian armed forces will invade or enter East Ukraine before October 2014.

Coherence was tested using various scenarios to look at information bias and congruence bias and various Bayesian reasoning problems. One was the classic epidemiology brain teaser (The Statistics of Health Decision Making–Therapy).

Breast Cancer. The probability of breast cancer is 1% for a woman at age 40 who participates in routine screening. If a woman has breast cancer, the probability is 80% she will get a positive mammography. If a woman does not have breast cancer, the probability is 9.6% she will also get a positive mammography. A woman in this age group gets a positive mammography test result in a routine screening. What is the probability she actually has breast cancer?
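The answer, which most people badly overestimate, is only about 7.8%, because the low base rate means most positive results come from the much larger healthy population. A minimal sketch of the Bayes' rule arithmetic, using only the numbers given in the problem:

```python
# Bayes' rule applied to the breast cancer screening problem.
# All inputs come directly from the problem statement.
p_cancer = 0.01              # prior: P(cancer) for a 40-year-old woman
p_pos_given_cancer = 0.80    # sensitivity: P(positive | cancer)
p_pos_given_healthy = 0.096  # false-positive rate: P(positive | no cancer)

# Total probability of a positive mammography (law of total probability).
p_pos = (p_cancer * p_pos_given_cancer
         + (1 - p_cancer) * p_pos_given_healthy)

# Posterior: P(cancer | positive) = P(cancer) * P(positive | cancer) / P(positive)
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos

print(f"P(cancer | positive) = {p_cancer_given_pos:.3f}")  # prints 0.078
```

Note how the 99% of women without cancer contribute far more positives (0.99 × 0.096 ≈ 0.095) than the 1% with cancer do (0.01 × 0.80 = 0.008), which is exactly the base-rate effect that trips up intuitive answers.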

Superforecasters were either best or tied for best on all tasks assessing performance on the other benchmarks of coherence, consistency, and discrimination. The authors found modest but systematic correlations among the benchmarks, ranging from .08 to .39 with an average of .20. As you can see from Table 4 above, the correlation between correspondence and coherence is comparatively strong.


Individuals are consistent if they assign similar judgments to comparable stimuli, and they discriminate if they assign different judgments to dissimilar stimuli. With this paper, Mellers et al. have shown that there is a positive relationship between Hammond’s correspondence and coherence and the so-called silver standards of consistency and discrimination. One can certainly imagine that consistency and discrimination might help screen for good or bad decision making, but beyond that I remain unconvinced, given the modest positive correlations that were unearthed.

Mellers, B. A., Baker, J. D., Chen, E., Mandel, D. R., & Tetlock, P. E. “How generalizable is good judgment? A multi-task, multi-benchmark study.” Judgment and Decision Making, Vol. 12, No. 4, July 2017, pp. 369–381.