Back in July 2014, I wrote a post about the misuse of “statistical significance” by defendants and courts trying to apply the Daubert standard to scientific evidence. As I wrote,
It’s true that researchers typically use statistical formulas to calculate a “95% confidence interval” — or, as they say in the jargon of statistics, “p < 0.05” — but this isn’t really a scientifically-derived standard. There’s no natural law or empirical evidence which tells us that “95%” is the right number to pick to call something “statistically significant.” The number “1 in 20” was pulled out of thin air decades ago by the statistician and biologist Ronald Fisher as part of his “combined probability test.” Fisher was a brilliant scientist, but he was also a eugenicist and an inveterate pipe-smoker who refused to believe that smoking causes cancer. Never underestimate the human factor in the practice of statistics and epidemiology.
(Links omitted; they’re still in the original post.) As expected, defense lawyers criticized my post.
Last week, the American Statistical Association published its very first “policy statement” on “a specific matter of statistical practice,” making clear that tossing around the term “statistical significance” leads to “considerable distortion of the scientific process”:
Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision-making. A conclusion does not immediately become “true” on one side of the divide and “false” on the other. Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis. Pragmatic considerations often require binary, “yes-no” decisions, but this does not mean that p-values alone can ensure that a decision is correct or incorrect. The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.
Indeed, they went even further, pointing out that “Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect.”
In other words, “statistical significance” isn’t a scientific principle; it’s a conclusory term of art. Statistical analyses of the real world don’t give us “bright-line” rules that make something “true” once a certain arbitrary number is passed or “false” if the number isn’t passed. To put it into the language of Daubert, debates over “p-values” might be useful when talking about the weight of an expert’s conclusions, but they say nothing about an expert’s methodology.
This shouldn’t be news to the legal community. The Reference Manual on Scientific Evidence (3d edition) has a whole chapter called “Reference Guide on Statistics” which is entirely consistent with the American Statistical Association’s analysis. As the Manual says,
To begin with, “confidence” is a term of art. The confidence level indicates the percentage of the time that intervals from repeated samples would cover the true value. The confidence level does not express the chance that repeated estimates would fall into the confidence interval.
P. 247. In footnote 92, they point out:
[I]t is misleading to suggest that “[a] 95% confidence interval means that there is a 95% probability that the ‘true’ relative risk falls within the interval” or that “the probability that the true value was . . . within two standard deviations of the mean . . . would be 95 percent.” DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042, 1046 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993); SmithKline Beecham Corp. v. Apotex Corp., 247 F. Supp. 2d 1011, 1037 (N.D. Ill. 2003), aff’d on other grounds, 403 F.3d 1331 (Fed. Cir. 2005).
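The Manual’s definition of the confidence level (the share of intervals from repeated samples that cover the true value) can be illustrated with a short simulation. This is only a sketch under assumptions of my own, none of which come from the Manual: normally distributed data, a known standard deviation, and made-up values for the true mean, sample size, and number of repetitions.

```python
import random
import statistics


def coverage_rate(true_mean=10.0, sigma=2.0, n=30, trials=5000, seed=0):
    """Share of 95% intervals, one per repeated sample, that cover true_mean.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Half-width of a 95% interval when the standard deviation is known.
    half_width = 1.96 * sigma / n ** 0.5
    covered = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            covered += 1
    return covered / trials


rate = coverage_rate()
print(f"share of intervals covering the true value: {rate:.3f}")
```

The printed share should land close to 0.95. That figure describes how the procedure behaves over many repeated samples; it is not the probability that any one reported interval contains the true value, which is the misreading footnote 92 warns against.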
The Reference Manual goes on to specifically call out the fallacy of using “p ≤ 0.05” to find statistical significance: “These levels of 5% and 1% have become icons of science and the legal process. In truth, however, such levels are at best useful conventions.”
As the Manual argues, it doesn’t make sense to use the term “statistically significant”:
Because the term “significant” is merely a label for a certain kind of p-value, significance is subject to the same limitations as the underlying p-value. Thus, significant differences may be evidence that something besides random error is at work. They are not evidence that this something is legally or practically important. Statisticians distinguish between statistical and practical significance to make the point. When practical significance is lacking—when the size of a disparity is negligible—there is no reason to worry about statistical significance. …
The significance level tells us what is likely to happen when the null hypothesis is correct; it does not tell us the probability that the hypothesis is true. Significance comes no closer to expressing the probability that the null hypothesis is true than does the underlying p-value.
P. 252. Take a moment to re-read that line: “The significance level tells us what is likely to happen when the null hypothesis is correct; it does not tell us the probability that the hypothesis is true.” The “null hypothesis” goes to the key issue in every case: it is the assumption that there is no causal association between the plaintiff’s injuries and the drug, chemical, or other agent the plaintiff believes caused them. The Reference Manual said long ago that the “significance level” will not “tell us the probability that the hypothesis is true.” It’s one factor, like many others, that scientists consider when they draw conclusions from scientific data. A high or low p-value says nothing about an expert’s methodology, which is the focus of Daubert. In the real scientific world, scientists don’t obsess over particular p-values, and they certainly don’t dismiss evidence merely because it isn’t “statistically significant.”
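A little arithmetic shows why the significance level cannot double as the probability that the null hypothesis is true. The sketch below applies Bayes’ rule to two hypothetical scenarios that share the same 5% significance level. The prior probabilities and the power figures are assumptions of mine, picked only for illustration, and do not come from the Manual or from any case.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null hypothesis is true | result is 'significant'), by Bayes' rule.

    prior_null: assumed share of tested hypotheses where the null is true
    alpha:      significance level, i.e. P(significant | null true)
    power:      P(significant | null false)
    """
    false_positives = prior_null * alpha
    true_positives = (1 - prior_null) * power
    return false_positives / (false_positives + true_positives)


# Same 5% significance level, two hypothetical settings.
well_powered = prob_null_given_significant(prior_null=0.5, alpha=0.05, power=0.8)
long_shot = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.3)

print(f"well-powered study, even prior odds: {well_powered:.3f}")
print(f"weak study, null likely beforehand:  {long_shot:.3f}")
```

Under these assumed inputs, the chance that the null hypothesis is true after a “significant” result is about 6% in the first setting and about 60% in the second, even though the significance level is 5% in both.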
The problem is that many courts have been led astray by defendants who claim that “statistical significance” is a threshold that scientific evidence must pass before it can be admitted into court.
My original post criticized a decision in the In re Zoloft litigation. The In re Zoloft court ruled the same way again in December 2015 with a new round of experts, and the defense lawyers at Drug and Device Law said that “The best Daubert decision of 2015 was either [In re Zoloft] or a similar decision (both excluding the same expert) in the Lipitor MDL.” But both opinions rest on the very misunderstanding of “statistical significance” that the American Statistical Association has now resoundingly rejected.
Consider this passage from In re: Zoloft:
The Court agrees that Dr. Jewell’s approach to the Zoloft data de-emphasizes the traditional importance of statistical significance. Dr. Jewell notes that a non-significant result “does not tell us that the exposure has no effect—only that ‘no effect’ remains one of the plausible explanations for the data (but not necessarily the most plausible).” Like Dr. Bérard, he cites to Rothman’s Modern Epidemiology textbook for the principle that it is “generally accepted to examine the effect estimates (i.e., Odds Ratio) without exclusion of non-significant results.” Like Dr. Bérard, he points to no other evidence indicating that the fields of epidemiology and teratology have abandoned, or even reduced the importance of, the principle of statistical significance. The New England Journal of Medicine‘s treatment of the correction to the Louik (2007) study provides evidence to the contrary.
In re: Zoloft (Sertraline Hydrochloride) Products Liab. Litig., 2015 WL 7776911, at *9 (E.D. Pa. Dec. 2, 2015). But, as the American Statistical Association and Reference Manual tell us, there is no “traditional importance of statistical significance.” The erroneous belief in an “importance of statistical significance” is exactly what the American Statistical Association was trying to get rid of when it said, “The widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”
The Lipitor court similarly erred by getting into the weeds of the “p-value”:
The problem with Dr. Jewell’s use of the mid-p test is that his use of it was results driven. He only used this test once the Fisher exact test returned a non-significant result. After he used the mid-p test to obtain a statistically significant p-value, he did not even bother to determine a mid-p exact confidence interval but continued to use the prior confidence interval obtained via Stata and reported with the Fisher exact p-value. (Dkt. No. 1247–8 at 214, 217). This indicates he was not actually interested in using the mid-p approach but in obtaining a statistically significant p-value.
In re Lipitor (Atorvastatin Calcium) Mktg., Sales Practices & Products Liab. Litig., 2015 WL 7422613, at *8 (D.S.C. Nov. 20, 2015) order amended on reconsideration sub nom. 2016 WL 827067 (D.S.C. Feb. 29, 2016). But there was no reason for the expert to “bother to determine a mid-p exact confidence interval,” because it would not have told him much. As the American Statistical Association said,
Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself. …
Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise.
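The Association’s point about sample size can be checked with a few lines of arithmetic. The sketch below is my own illustration, not anything from the statement: it computes a two-sided p-value for a difference between two group means using a simple z-test with a known standard deviation of 1, and the effect sizes and group sizes are hypothetical.

```python
import math


def two_sided_p(effect_in_sd, n_per_group):
    """Two-sided z-test p-value for an observed difference in group means.

    effect_in_sd: observed difference, in standard-deviation units
    n_per_group:  observations in each of the two groups
    Assumes a known standard deviation of 1 in both groups (a simplification).
    """
    standard_error = math.sqrt(2.0 / n_per_group)
    z = abs(effect_in_sd) / standard_error
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# A trivial effect measured in an enormous sample.
p_tiny_effect_big_sample = two_sided_p(effect_in_sd=0.01, n_per_group=1_000_000)
# A large effect measured in a handful of people.
p_large_effect_small_sample = two_sided_p(effect_in_sd=0.8, n_per_group=5)

print(f"tiny effect, huge sample:  p = {p_tiny_effect_big_sample:.2e}")
print(f"large effect, tiny sample: p = {p_large_effect_small_sample:.3f}")
```

With these made-up inputs, the negligible effect clears “p < 0.05” by many orders of magnitude, while the large effect comes in at roughly p = 0.2 and would be labeled “not significant.”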
Once a court finds itself in the weeds of assessing “statistical significance” or starts calculating p-values, it has already gone astray. The court is no longer looking at the reliability of the expert’s methodology, as it is supposed to do under Daubert, and is instead entering into a debate about the weight of the evidence using a framework that the leading association of statisticians has said “can lead to erroneous beliefs and poor decision-making.” As the American Statistical Association made clear, a scientist should never make assumptions about the truth or falsity of a hypothesis just because some arbitrary standard of “statistical significance” like a certain p-value was or was not met.
Some courts have already understood that the “lack of statistical significance” isn’t a proper basis for excluding an expert’s opinion under Daubert. See, e.g., Milward v. Acuity Specialty Products Grp., Inc., 639 F.3d 11, 25 (1st Cir. 2011)(“the court erred in holding that ‘Dr. Smith’s attempt to support his conclusion with data that concededly lacks statistical significance’ was ‘a deviation from sound practice of the scientific method’ that provided grounds for exclusion.”).
The Court in the In re Chantix litigation got it exactly right:
While the defendant repeatedly harps on the importance of statistically significant data, the United States Supreme Court recently stated that “[a] lack of statistically significant data does not mean that medical experts have no reliable basis for inferring a causal link between a drug and adverse events …. medical experts rely on other evidence to establish an inference of causation.” Matrixx Initiatives, Inc. v. Siracusano, ––– U.S. ––––, 131 S.Ct. 1309, 1319, 179 L.Ed.2d 398 (2011). The Court further recognized that courts “frequently permit expert testimony on causation based on evidence other than statistical significance.” Id.; citing Wells v. Ortho Pharmaceutical Corp., 788 F.2d 741, 744–745 (11th Cir.1986). Hence, the court does not find the defendant’s argument that Dr. Furberg “cannot establish a valid statistical association between Chantix and serious neuropsychiatric events” to be a persuasive reason to exclude his opinion, even if the court found the same to be true.
In re Chantix (Varenicline) Products Liab. Litig., 889 F. Supp. 2d 1272, 1286 (N.D. Ala. 2012)
Other courts have not done so well. See, e.g., LeBlanc ex rel. Estate of LeBlanc v. Chevron USA, Inc., 396 F. App’x 94, 99 (5th Cir. 2010)(affirming rejection of expert’s testimony in part because “some of the studies do not represent statistically significant results”); Wells v. SmithKline Beecham Corp., 601 F.3d 375, 380 (5th Cir. 2010)(“this court has frowned on causative conclusions bereft of statistically significant epidemiological support.”); Baker v. Chevron U.S.A. Inc., 533 F. App’x 509, 520 (6th Cir. 2013)(“the district court acted within its discretion when it discounted studies that contained statistically insignificant results.”).
It’s time for courts to start seeing the phrase “statistically significant” in a brief the same way they see words like “very,” “clearly,” and “plainly.” It’s an opinion that suggests the speaker has strong feelings about a subject. It’s not a scientific principle.