Jonathan J. Koehler**
Note: This article is published in the Journal of the Royal Statistical Society, Series A, Vol. 154, Part I, 1991, pp. 75-81. It is available in this journal's format at JSTOR. For permission to copy it, please contact the Royal Statistical Society.
Keywords: Probability, evidence, cognitive illusions
Abstract: Some courts have been reluctant to admit testimony expressing probabilities because of a concern that jurors will overweight it relative to other evidence. However, empirical studies indicate a tendency to underweight statistical evidence when other sources of evidence are available. This paper reviews recent studies with mock jurors.
For decades, attorneys, statisticians and psychologists have argued over the admissibility of "mathematical evidence" and the desirability of using Bayes's theorem to quantify for the jury the probative value of trace evidence such as partial fingerprints or bloodstains. Finkelstein and Fairley (1970, 1971); Tribe (1971); Fairley (1973); Finkelstein (1978); Ellman & Kaye (1979); Saks and Kidd (1980-81); Fienberg & Schervish (1986); Tillers and Green (1988). Fearing that jurors will be unduly impressed by mathematical testimony, a few courts have held that the population frequencies of incriminating traits and the conditional probability of an innocent defendant's having an incriminating characteristic T are inadmissible. State v. Carlson, 267 N.W.2d 170 (Minn. 1978); State v. Boyd, 331 N.W.2d 480 (Minn. 1983); State v. Kim, 398 N.W.2d 544 (Minn. 1987); State v. Schwartz, 447 N.W.2d 422 (Minn. 1989). Although this is a minority view (Kaye, 1987), even courts allowing such testimony caution that they "generally disfavor admission of statistical evidence." Commonwealth v. Gomes, 403 Mass. 258, 526 N.E.2d 1270 (1988).
Given these views, it is important to know whether jurors can be trusted to evaluate properly "probability evidence," and what decision aids might assist them in this task. For more than two decades, researchers have studied the ways that people process probabilistic and statistical information, but only a small portion of these studies focuses on the capacity of jurors to process explicitly quantitative probabilistic evidence. This paper reviews this research. It concludes that the work has produced several insights into the factors that affect the judgments of mock jurors, and that it is valuable in devising optimal rules for the admission or exclusion of "probability evidence." At the same time, we do not believe that the experiments published to date have been adequate in their design and implementation to demonstrate unequivocally the extent to which jurors attend to trace evidence or to identify what decision aids, if any, would promote an appropriate weighting of the evidence at trial.
2. Major Themes of Research into Decision Making
A great deal of research has been conducted on the ways in which people process and employ statistical and probabilistic information. Although Bayes's theorem provides a normative rule for revising probabilistic beliefs in light of new evidence, many studies indicate that within certain domains people do not extract as much information from new evidence as the data warrant, that they are slow to revise incorrect probabilistic hypotheses, that they attribute probative value to diagnostically worthless information, that they underutilize statistical base rates, and that they confuse likelihoods and posteriors. For general reviews and collections, see Kahneman, Slovic and Tversky (1982), Nisbett and Ross (1980), von Winterfeldt and Edwards (1986), and Slovic and Lichtenstein (1971). Cf. Edwards and von Winterfeldt (1986), Saks and Kidd (1980-81). Recently, some research has appeared on how mock jurors treat probabilities. In general, the results are consistent with the larger body of work on statistical judgment, but some significant issues have yet to be explored fully.
3. Responses of Mock Jurors to Probabilities
The experiments on probability assessments of mock jurors typically ask the subjects to read a transcript of testimony concerning some incriminating trait T (such as fibers or blood types left at the crime scene) that is said to occur in the general population with some relative frequency F(T). The experiments probe the extent to which mock jurors modify their assessments of guilt ("guilt" in the factual sense that the defendant D committed the criminal acts) when they learn that D has the trait T (an event that may be denoted TD).
Before describing the results of such experiments, we should consider how a juror whose partial beliefs behave like probabilities would react. In this Bayesian scheme, a juror begins with some prior personal probability P(G) that the defendant committed the acts as alleged. The juror then revises P(G) by conditioning on TD according to Bayes's theorem:
P(G|TD)/P(-G|TD) = [P(TD|G)/P(TD|-G)][P(G)/P(-G)] ..... (1)
Here, -G denotes "not guilty." A juror who believes that the forensic test is error-free and that the guilty person is sure to have the incriminating trait T, while only a fraction F(T) of innocent people have T, will choose 1/F(T) for the likelihood ratio P(TD|G)/P(TD|-G). Such a juror therefore will conclude that the posterior odds are
P(G|TD)/P(-G|TD) = [1/F(T)][P(G)/P(-G)] ..... (2)
When the sensitivity or specificity of the test is not one, the likelihood ratio is more complicated. (Thompson, Britton and Schumann, undated). Similarly, if there were a suspicion of a frame-up (Tribe 1971) or a selection effect in using the incriminating trait T, then this ratio would be smaller than 1/F(T).
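The updating rule in (2) can be sketched in a few lines of code. This is only an illustration under the idealized assumptions stated above (an error-free test and a guilty party certain to have T); the function name and argument names are ours, not part of any study reviewed here.

```python
def posterior_probability(prior: float, trait_frequency: float) -> float:
    """Posterior P(G|TD) from equation (2), where the likelihood ratio is 1/F(T).

    Assumes an error-free test and P(TD|G) = 1, as in the idealized case.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if not 0.0 < trait_frequency <= 1.0:
        raise ValueError("trait_frequency must lie in (0, 1]")
    prior_odds = prior / (1.0 - prior)             # P(G)/P(-G)
    posterior_odds = prior_odds / trait_frequency  # multiply by 1/F(T)
    return posterior_odds / (1.0 + posterior_odds)


# A prior of .10 combined with F(T) = .01 gives a posterior of about .917.
print(round(posterior_probability(0.10, 0.01), 3))
```

A trait that everyone shares (F(T) = 1) leaves the prior unchanged, which is a quick check that the function behaves as (2) requires.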
3.1. Effect of TD and F(T)
In the first experiment to study whether mock jurors behaved according to (2), Thompson and Schumann (1987) asked 144 university students to read a description of a hypothetical criminal case in which a suspect matching the description of a robber was apprehended near the crime scene and to state their initial probability of the suspect's guilt (P0) in light of these facts. The subjects next read a summary of testimony by a forensic expert linking the suspect to the crime by an examination of hair fibers (TD). The summary submitted to half the subjects also stated, in effect, that F(T) = .02 and that in a city of one million people, 20,000 would have hair like the defendant's. These subjects then gave their posterior probability of the suspect's guilt (P1).
In a paper appearing soon after this study, Faigman and Baglioni (1988) had subjects read a trial transcript from another hypothetical robbery case. Following other evidence loosely linking defendant to the offense, a physician testified that defendant's blood type matched that of the blood left on the window. In some transcripts the physician reported that 40% of the population had the incriminating type, in others he gave a 20% figure, and in still others, 5%. The transcripts also included testimony and a chart from a statistician about Bayes's theorem and the effect of the blood grouping evidence on four prior probabilities ranging from 0.01 to 0.80. Subjects estimated the probability that the blood on the window came from the defendant. One group responded before the physician's testimony (giving a probability P0), after the physician's statistical testimony (giving P1), and again after the statistician's testimony (P2). Another group gave estimates P1 and P2. The final third gave only the posterior probability P2.
In the first of two experiments reported in a doctoral thesis, Goodman (1988) administered a one-page "narrative summary of ... a homicide case based on an actual trial" to psychology undergraduates. The narratives read by one group stated that an attempt to match bloodstains was inconclusive, while four other groups received narratives stating TD and giving values for F(T) in the city where the murder took place. In a lengthy questionnaire, each subject was asked, among many other things, to give a verdict and the probability of guilt.
In all three studies, the mean probability judgment increased with the evidence of TD and F(T). Since TD is diagnostic of guilt, this pattern is reasonable. Unfortunately, the studies do not directly address one of the most influential arguments against the admission of F(T). Suspecting that jurors will overweight trace evidence or underweight other evidence if they are told how infrequent the trace evidence is, Tribe (1971) argues that jurors should be informed of TD but not F(T). Yet, no treatment group was placed in this condition.
The studies do cast doubt on the premises of Tribe's argument, however. Few subjects were so overwhelmed by F(T) that they arrived at absurdly high values for P1, whose means ranged from 0.34 to 0.62. Furthermore, the one study that considered the point P0 in the distribution of P1 found that one subject in six made no adjustment whatever in response to TD. Finally, no more than 5% of the subjects reported P(G|TD) = 1 - F(T), suggesting that few subjects fallaciously equated P(-G|TD) with P(TD|-G) = F(T). On its face, this result responds reassuringly to the concern (expressed in a number of court cases) that the jury will confuse the probability of innocence with the infrequency of the incriminating traits. Where the only quantitative information about the diagnostic value of TD is the population frequency F(T), this "inversion fallacy" appears to be rare.
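To see how far the inversion answer 1 - F(T) can sit from the posterior given by (2), the following sketch compares the two. The function names and the three illustrative priors are assumptions introduced here for the example; only F(T) = .02 is taken from Thompson and Schumann (1987).

```python
def bayes_posterior(prior, trait_frequency):
    """Posterior P(G|TD) under equation (2) (likelihood ratio 1/F(T))."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds / trait_frequency
    return posterior_odds / (1.0 + posterior_odds)


def inversion_answer(trait_frequency):
    """The fallacious answer that sets P(G|TD) equal to 1 - F(T)."""
    return 1.0 - trait_frequency


def prior_where_answers_agree(trait_frequency):
    """The single prior at which (2) happens to return 1 - F(T)."""
    return (1.0 - trait_frequency) / (2.0 - trait_frequency)


frequency = 0.02
for prior in (0.10, 0.30, 0.50):  # hypothetical priors, for illustration only
    print(prior,
          round(bayes_posterior(prior, frequency), 3),
          inversion_answer(frequency))
print("answers agree at prior =", round(prior_where_answers_agree(frequency), 3))
```

Under (2), and for small F(T), the two answers coincide only when the prior is a little under one half. With lower priors the inversion answer overstates the posterior, and with higher priors it understates it.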
3.2. Effect of TD and P(TD|-G)
Thompson and Schumann (1987) found that informing the jurors of P(TD|-G) instead of F(T) apparently produced a greater upward movement in the probability assessments. When the statistic was presented in this format, the proportion of subjects increasing their probabilities rose from 60/72 to 66/72. At the same time, the proportion who thought that P1 was 1 - F(T) = 0.98 rose from 3/72 to 15/72. This suggests that some of the increase in the number of subjects recognizing that the forensic evidence was probative of guilt was due to more instances of the inversion fallacy. As with the pure frequency presentation, the mean P1 does not indicate any overwhelming effect of the probability testimony.
3.3. Effect of arguments about probative value of TD
Thompson and Schumann (1987) devised a second experiment to investigate the impact of competing arguments about the probative value of TD. Students read a description of another hypothetical murder case. A detective collected some suspicious but quite inconclusive information which led him to estimate P0 = 0.10 for a suspect. The detective then learned that the suspect had the same rare blood type as the killer.
The students read two arguments about TD. The "prosecution argument" (PA) invited the students to equate P(G|TD) with 1 - F(T), while the "defense argument" (DA) urged that TD had almost no probative value. After reading each argument, the subjects indicated (1) whether they believed the argument was correct, (2) whether the detective should revise his probability, and (3) what they thought the detective's probability should be. Just over half read PA, then DA, while the remainder read DA, then PA.
Thompson and Schumann conclude that most subjects failed to recognize that "both arguments are incorrect," that probabilities consistent with PA were rare, and that probabilities consistent with DA were common (much more so than in their previous experiment). However, the characterizations of PA and DA as correct or incorrect prove little about the ability of jurors to detect fallacious arguments. To begin with, it is not clear that, as worded in the study, DA is "fallacious." Much of it is consistent with using a uniform prior distribution to represent initial ignorance about the guilt of many people in the city. Under this view and a charitable interpretation of some of the phrasing, it is not unreasonable to concur with the defense argument, as many subjects did.
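The uniform-prior reading of DA can be worked through with (2). The derivation below is a sketch that assumes, hypothetically, a city of N possible perpetrators who are all equally likely a priori; N is our symbol and is not a figure reported in the study.

```latex
P(G) = \frac{1}{N}, \qquad
\frac{P(G \mid TD)}{P(-G \mid TD)}
  = \frac{1}{F(T)} \cdot \frac{1/N}{(N-1)/N}
  = \frac{1}{(N-1)\,F(T)}, \qquad
P(G \mid TD) = \frac{1}{1 + (N-1)\,F(T)} .
```

Since (N-1)F(T) is roughly the expected number of other people in the city who share the trait, the posterior under this prior is small whenever that number is large. Whether a uniform prior is appropriate when other evidence already points to the suspect is a separate question.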
Of more concern are the posteriors consistent with PA (P1 = .99) and those that stuck at the prior level (P1 = P0 = .10). Nonetheless, there were not many of the former, and a prosecutor's summation that so clearly misstates the import of the evidence is legally objectionable anyway. As for the popularity of P1 = .10, it is impossible to know how much this had to do with DA. The experiment did not include a control group that was not exposed to DA or to a clearly fallacious PA. The effect of two competing, reasonable arguments about the probative value of TD therefore remains unknown.
The studies described above all reveal mean posterior probabilities that are less extreme than those given by a Bayesian aggregation of the prior probabilities and likelihoods. To facilitate comparisons among the studies, Table 1 lists the reported mean priors and posteriors, as well as the mean posteriors obtained from (2), setting P(G) to the reported or deduced priors (the "Bayes P1" column). It should be noted that some of the entries in the table involve modifications in the way that the results are presented in the original studies.
| Study | F(T) | Mean P0 | Mean P1 | Bayes P1 |
| Thompson, Schumann #2 | .01 | .10 | .28 | .917 |
| Thompson, Schumann #1 | .02 | .22 | .53 | .949 |
| Thompson, Schumann #1 | .02* | .27 | .72 | .949 |
* Framed as P(TD|-G).
Table 1. Summary of findings on the effect of F(T) on probability judgments of mock jurors.
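One rough way to express the conservatism in Table 1 is to ask what likelihood ratio would carry the mean prior to the mean posterior, and to compare it with the ratio 1/F(T) that (2) prescribes. The sketch below does this for the three rows of the table. The function names are ours, and because the calculation works from group means rather than from individual responses, it should be read as a coarse gauge only.

```python
# Rows of Table 1: (study, F(T), mean P0, mean P1)
TABLE_1 = [
    ("Thompson, Schumann #2", 0.01, 0.10, 0.28),
    ("Thompson, Schumann #1", 0.02, 0.22, 0.53),
    ("Thompson, Schumann #1 (conditional framing)", 0.02, 0.27, 0.72),
]


def odds(probability):
    """Convert a probability into odds."""
    return probability / (1.0 - probability)


def implied_likelihood_ratio(mean_prior, mean_posterior):
    """Likelihood ratio that would move mean_prior to mean_posterior."""
    return odds(mean_posterior) / odds(mean_prior)


for study, frequency, mean_prior, mean_posterior in TABLE_1:
    implied = implied_likelihood_ratio(mean_prior, mean_posterior)
    prescribed = 1.0 / frequency  # likelihood ratio under equation (2)
    print(f"{study}: implied LR {implied:.1f}, prescribed LR {prescribed:.0f}")
```

In each row the implied ratio is in the single digits, while the ratio prescribed by (2) is 50 or 100.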
These findings of conservatism might suggest the need for some decision aids to help jurors reach posterior probabilities closer to the ones given by Bayes's theorem. The only study to explore the effect of such an approach is Faigman and Baglioni (1988), in which one group of subjects received an explanation of Bayes's theorem as applied to the evidence in that case. As shown in Table 2, which pertains to the 60 subjects who gave estimates P0, P1 and P2, there is a modest increase in the mean posterior following the statistician's testimony.
| F(T) | Mean P0 | Mean P1 | Mean P2 | Mean PB |
Table 2. Intuitive versus Bayesian computations of the posterior probability within Faigman and Baglioni subjects. (P0 is the prior probability; P1 and P2 are posteriors elicited from mock jurors; PB is the posterior given by (2) with P(G) = P0.)
4. An Assessment
The body of research on probabilistic inference by mock jurors is notable both for what it uncovers and for what it leaves untouched. The clearest and most consistent finding is conservatism in the expression of numerical judgments. In all the studies, it appears that many subjects did not update their prior probabilities as much as the likelihood ratio warranted. This result accords with other studies in behavioral decision making. (Cf. von Winterfeldt and Edwards 1986; Hogarth 1987, as discussed in sec. 2).
Beyond this, there are intriguing findings that remain to be replicated. The prevalence of the inversion fallacy, in which subjects confuse P(G|TD) with 1 - F(T), is difficult to gauge. One study (Thompson and Schumann, 1987) found it to occur in a substantial minority of subjects exposed to a conditional probability, but it was much less common in other experiments. The sole study to examine its susceptibility to counterargument (Thompson and Schumann, 1987) suggested that the fallacy is easily overcome.
This applied research thus remains in an incunabular phase. Only a handful of studies have been published, and these are limited in at least five ways. First, they tend to omit extremely small proportions F(T). One cannot assume that because laboratory subjects underweight moderately probative statistical evidence, they also will underweight highly probative evidence. Second, the realism of many of the simulations leaves much to be desired. Objectionable arguments are placed before mock jurors, and attacks in the form of cross-examination of an expert witness on the probative value of evidence are missing. Third, the effects of group deliberation have been neglected. Even a substantial number of subjects succumbing to one or another cognitive error may not translate into erroneous verdicts when unanimity is required and deliberations are thorough. Fourth, only one decision aid (a chart illustrating Bayes's theorem) has been studied. Fifth, no study investigates the inferences mock jurors draw from trace evidence accompanied by a strictly qualitative reference to F(T). If jurors undervalue the trace evidence even with accurate statements of population proportions, conditional probabilities, or Bayesian decision aids, then it may be reasonable just to report that T is rare without quantifying how rare it is.
Given the interest that the question of juror understanding of "probability evidence" has provoked, future experiments that surmount these limitations can be expected. Such experiments will not all be easy, but they are worth doing. With a systematic body of research demonstrating the conditions under which jurors overuse, underuse and properly use "probability evidence," the legal system will be in a much better position to govern the presentation of such evidence at trial.
* Regents Professor of Law, Arizona State University, Tempe AZ 85287-0604, and Visiting Research Fellow, University of Chicago School of Law.
**Assistant Professor of Behavioral Sciences, College and Graduate School of Business, University of Texas, Austin TX 78712.
1. Thompson and Schumann (1987) coin the phrase "prosecutor's fallacy" for this reasoning. This phrase can be misleading, since applying (2) to very small F(T) can generate posteriors exceeding 1 - F(T). The fallacy resides, not in a tendency to favor the prosecution, but in the fact that it equates F(T) = P(TD|-G) to the inverse probability P(-G|TD).
2. Some caution is needed here. Goodman (1988), which accounts for most of these findings, had the subjects state the number of people with the incriminating trait in the population. Framing the frequency statistic in this way may prompt jurors to undervalue TD, as discussed in sec. 3.3.
3. Schumann and Thompson (1989) refined this experiment, using four hours of videotaped testimony and instructions to the mock jurors. Unfortunately, the unpublished report on this experiment is limited. Although Schumann and Thompson conclude that this study shows that their work generalizes to more realistic settings, detailed comments on these additional experimental results would seem premature.
Edwards, W. and von Winterfeldt, D. (1986) Cognitive illusions and their implications for the law. So. Calif. Law Rev., 59, 225-276.
Ellman, I.M. and Kaye, D.H. (1979) Probabilities and proof: Can HLA and blood group testing prove paternity? New York Univ. Law Rev., 54, 1131-1162.
Faigman, D.L. and Baglioni, A.J. (1988) Bayes' Theorem in the trial process: instructing jurors on the value of statistical evidence, Law and Human Behavior, 12, 1-17.
Fairley, W.B. (1973) Probabilistic analysis of identification evidence, J. Legal Stud., 2, 493-.
Fienberg, S.E. and Schervish, M.J. (1986) The relevance of Bayesian inference for the presentation of statistical evidence and for legal decisionmaking, Boston Univ. Law Rev., 66, 771-798.
Finkelstein, M.O. (1978) Quantitative Methods in Law, New York: Free Press.
Finkelstein, M.O., and Fairley, W.B. (1970) A Bayesian approach to identification evidence, Harvard Law Rev., 83, 489-.
Finkelstein, M.O., and Fairley, W.B. (1971) A comment on "Trial by Mathematics", Harvard Law Rev., 84, 1801-1809.
Goodman, J. (1988) Probabilistic Scientific Evidence: Jurors' Inferences. Ann Arbor, MI: University Microfilms Int'l.
Hogarth, R.M. (1987) Judgment and Choice, Chichester: Wiley.
Kahneman, D., Slovic, P. and Tversky, A. (1982) Judgment Under Uncertainty: Heuristics and Biases, Cambridge: Cambridge Univ. Press.
Kaye, D.H. (1987) The admissibility of "probability evidence" in criminal trials -- part II, Jurimetrics J., 27, 160-172.
Nisbett, R.E. and Ross, L. (1980) Human Inference: Strategies and Shortcomings of Social Judgment, Englewood Cliffs, New Jersey: Prentice Hall.
Saks, M. and Kidd, R. (1980-81) Human information processing and adjudication: trial by heuristics, Law and Society Rev., 15, 123-
Schumann, E.L. and Thompson, W.C. (1989) Effects of Attorneys' Arguments on Jurors' Use of Statistical Evidence (manuscript).
Thompson, W.C. and Schumann, E.L. (1987) Interpretation of statistical evidence in criminal trials: The prosecutors' fallacy and the defense attorney's fallacy, Law & Human Behavior, 11, 167-187.
Thompson, W.C., Britton, L. and Schumann, E.L. (undated) Jurors' Sensitivity to Variations in Statistical Evidence (manuscript).
Tillers, P. and Green, E. (eds.) (1988), Probability and Inference in the Law of Evidence: The Limits and Uses of Bayesianism. Dordrecht, Netherlands: Kluwer Academic Publishers.
Tribe, L. (1971) Trial by mathematics: precision and ritual in the legal process, Harvard L. Rev., 84, 1329-1393.
von Winterfeldt, D. and Edwards, W. (1986) Decision Analysis and Behavioral Research. Cambridge: Cambridge University Press.
Last updated 2/20/97