Daubert Doesn’t Ask Judges To Become Experts On Statistics

Update: It’s worth pointing out that, a year and a half after Dr. Anick Bérard’s testimony was precluded as “unreliable,” she published in the Journal of the American Medical Association, using many of the same methods the court deemed unacceptable.

Back in 2012, I wrote: “Scientific evidence is one of those rare areas of law upon which every lawyer agrees: we are all certain that everyone else is wrong.”

There have been some missteps in the law’s use of scientific proof as evidence in civil litigation. Consider Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999), in which the Supreme Court upheld a trial court’s ruling that an engineer with a master’s degree in mechanical engineering, who had worked in tire design and failure testing at Michelin, was nonetheless incompetent to testify about tire failures. But, by and large, the standard articulated in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993) makes sense: courts review an expert’s methods, rather than their conclusions, to ensure that the expert’s testimony has an appropriate scientific basis.

To go with the baseball metaphors so often (and wrongly) used in the law: when it comes to Daubert, the judge isn’t an umpire calling balls and strikes; they’re more like a league official checking to make sure the players are using regulation equipment. Mere disagreements about the science itself, and about the expert’s conclusions, are for the jury to resolve in the courtroom.

In practice, though, the Daubert standard runs into problems when courts erroneously decide factual disputes about methodology and conclusions, issues better left to cross-examination of the experts at trial. Consider the June 27, 2014 opinion in the Zoloft birth defects multidistrict litigation, which struck the testimony of plaintiffs’ “perinatal pharmacoepidemiologist,” Dr. Anick Bérard. Dr. Bérard holds a Ph.D. in Epidemiology and Biostatistics from McGill University, teaches at the Université de Montréal, and has conducted research on the effects of antidepressants on human fetal development. She was prepared to opine that “Zoloft, when used at therapeutic dose levels during human pregnancy, is capable of causing a range of birth defects (i.e., is a teratogen),” an opinion based upon her review of a variety of studies showing a correlation between SSRI use and birth defects. The court had multiple grounds for striking the opinion, but a key issue relating to statistics jumped out at me.

Before we get to that, however, we need to get on the same page about what it means when an epidemiological study finds a “statistically significant” association between a drug and an injury, and what it means to call such an association “strong” or “weak.” As the Court explains:

Epidemiological studies examining the effects of medication taken during pregnancy on birth defects calculate a relative risk (RR) or odds ratio (OR). Simply speaking, these ratios are calculated by dividing the risk or odds of a particular birth defect in children born to medication users (exposed women) by the risk or odds of finding that birth defect in children born without prenatal exposure. …

Because an RR or OR calculation is only an estimate, the precision of which may be affected by general or study-specific factors (including confounders and biases, sample sizes, study methods, etc.), researchers also use statistical formulas to calculate a 95% confidence interval, which is an estimated range of plausible ratio values. A 95% confidence interval means that there is a 95% chance that the “true” ratio value falls within the confidence interval range. Some confidence intervals are narrow, indicating that the calculated rate ratio is fairly precise, and some are wide, indicating that it is not and that additional research is warranted. If the lower bound of the confidence interval is greater than one, researchers say that the ratio is “statistically significant” (i.e., there is only a 5% chance that the increased risk reflected in the ratio is the result of chance alone), and will report finding a statistically significant correlation or association between the medication exposure and the birth defect at issue.

There’s nothing wrong with this summary, but a big caveat needs to be added. It’s true that researchers typically use statistical formulas to calculate a “95% confidence interval,” and treat a result as significant when that interval excludes 1 (in the jargon of statistics, “p < 0.05”), but this isn’t really a scientifically-derived standard.
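To make the court’s summary concrete, here is a minimal sketch of how a relative risk and its 95% confidence interval are conventionally computed, using the standard Wald interval on the log scale. The counts are invented for illustration and come from no real study:

```python
import math

# Hypothetical counts, for illustration only -- not from any real study.
cases_exposed, total_exposed = 150, 10000      # birth defects among medication users
cases_unexposed, total_unexposed = 100, 10000  # birth defects among non-users

# Relative risk: ratio of the two observed risks.
rr = (cases_exposed / total_exposed) / (cases_unexposed / total_unexposed)

# Conventional (Wald) 95% confidence interval, computed on the log scale.
se = math.sqrt((1 - cases_exposed / total_exposed) / cases_exposed
               + (1 - cases_unexposed / total_unexposed) / cases_unexposed)
lower = math.exp(math.log(rr) - 1.96 * se)
upper = math.exp(math.log(rr) + 1.96 * se)

print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# -> RR = 1.50, 95% CI = (1.17, 1.93)

# In the court's terms: "statistically significant" only if the lower bound exceeds 1.
print("statistically significant:", lower > 1)   # -> True
```

On these made-up numbers the association would be reported as statistically significant (the lower bound, about 1.17, exceeds 1) and yet still “weak” under a relative-risks-below-3 convention.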

There’s no natural law or empirical evidence which tells us that “95%” is the right number to pick to call something “statistically significant.” The number “1 in 20” was pulled out of thin air decades ago by the statistician and biologist Ronald Fisher as part of his “combined probability test.” (Here’s a short paper from professor of biostatistics Jerry Dallal on the number’s origins and problems.) Fisher was a brilliant scientist, but he was also a eugenicist and an inveterate pipe-smoker who refused to believe that smoking causes cancer (PDF). Never underestimate the human factor in the practice of statistics and epidemiology.

In short, the fact that statisticians default to the “95% confidence interval” as a matter of convention doesn’t mean that 95% is the most meaningful number for deciding whether a statistical finding is “significant.” It’s no different from the guidance in the law not to raise more than four issues on appeal: it’s a convention, and one that makes some rough sense, but not one that should be blindly followed without reason.
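To see how much hangs on the conventional choice of 95%, here is a small sketch, again with invented numbers, in which the very same data is “statistically significant” at the 95% level but not at the 99% level. Nothing about the underlying evidence changes; only the threshold does:

```python
import math

# Invented borderline data, for illustration only.
rr = (130 / 10000) / (100 / 10000)   # relative risk = 1.3
se = math.sqrt((1 - 130 / 10000) / 130 + (1 - 100 / 10000) / 100)

# Standard normal multipliers for each confidence level.
for level, z in [("90%", 1.645), ("95%", 1.96), ("99%", 2.576)]:
    lower = math.exp(math.log(rr) - z * se)
    verdict = "significant" if lower > 1 else "not significant"
    print(f"{level} CI lower bound = {lower:.3f} -> {verdict}")
```

On these numbers the 95% lower bound is about 1.003 (barely “significant”), while the 99% lower bound is about 0.925 (not significant at all). The choice of 1-in-20 over 1-in-100 does all the work.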

Once a statistician or epidemiologist says an association between a chemical and an injury is “statistically significant,” they then say whether the association is “weak” or “strong.” A “weak” association doesn’t mean a “non-existent” association; it just means that the relative risk or odds ratio falls beneath some arbitrarily determined point. As with “statistically significant” findings, the difference between “weak” and “strong” associations is in the eye of the beholder.

As epidemiologist Paolo Boffetta wrote not long ago, “there are no predefined values that separate ‘strong’ from ‘moderate’ or ‘weak’ associations.” Boffetta, I should note, has been called a “mercenary” for his claims that pollutants caused by industries from which he has received funding don’t cause cancer (e.g., dioxin and beryllium), and he recently withdrew his candidacy to run France’s top epidemiological institute in response to intense criticism. Again, there’s that human factor in epidemiology.

Unsurprisingly, Boffetta takes the position that a “moderate or weak” association is one with “relative risks below 3.” We could take the cynical view and assume that Boffetta is arbitrarily picking a high number (why 3, and not, say, √2, 2, e, or π?) to protect his funders, but there’s a deeper issue here. These questions are genuinely complicated and the source of genuine dispute, which is why they are studied and debated by intelligent, well-educated people. There is simply no bright line at which a relative risk goes from “weak” to “strong”; it has to be judged in the circumstances, and that’s the expert’s job.

Back to the Zoloft opinion, the problem which caught my eye comes in the Court’s analysis of research papers showing a statistically “weak” association between SSRI use and birth defects. As the Court recounted:

Dr. Bérard opines that, although one cannot assume teratogenicity from one weak association in one study, one can assume teratogenicity based upon multiple weak associations found across many studies.

The word “assume” is a bit odd, and I wonder if it comes from the expert’s report or from the Court, but there’s nothing novel about this approach. Consider second-hand smoke: the Surgeon General’s Report from 2006 lists the various studies considered, and many found no statistically significant increase in risk at all, very few found a relative risk above 2, and none found a relative risk above 3. Yet, the scientific community now generally accepts that second-hand smoke increases the risk of cancer. A “weak” association across multiple studies can still show a causal link.
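Pooling weak associations across studies is, at bottom, what a meta-analysis does. Here is a minimal sketch using invented study results and standard inverse-variance fixed-effect pooling (not Dr. Bérard’s actual calculation), showing how several studies whose confidence intervals each cross 1 can nonetheless combine into a pooled estimate whose interval does not:

```python
import math

# Invented studies, for illustration only: (relative risk, standard error of log RR).
studies = [(1.4, 0.20), (1.2, 0.18), (1.3, 0.22)]

# Each study on its own is "not significant": its 95% CI crosses 1.
for rr, se in studies:
    lower = math.exp(math.log(rr) - 1.96 * se)
    print(f"RR {rr}: 95% CI lower bound = {lower:.2f}")   # each lower bound is below 1

# Inverse-variance fixed-effect pooling: weight each study by 1 / SE^2.
weights = [1 / se ** 2 for _, se in studies]
pooled_log = sum(w * math.log(rr) for w, (rr, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
pooled_lower = math.exp(pooled_log - 1.96 * pooled_se)

print(f"pooled RR = {math.exp(pooled_log):.2f}, "
      f"95% CI lower bound = {pooled_lower:.2f}")         # pooled lower bound exceeds 1
```

On these invented numbers, no single study is significant, yet the pooled relative risk of about 1.29 has a lower bound of about 1.03, above 1. Whether such pooled weak associations suffice to infer causation is precisely the kind of judgment that belongs to epidemiologists.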

But here’s how the Court continues:

Dr. Bérard opines that, although one cannot assume teratogenicity from one weak association in one study, one can assume teratogenicity based upon multiple weak associations found across many studies. However, an equally plausible conclusion from multiple studies finding only weak associations, not greater than one would expect by chance, is that the true association is weak; so weak that one cannot conclude that the risk is greater than that seen in the general population.

This is where the Court has gone too far. Daubert asks a Court to consider only the expert’s methods, and here an undeniably well-qualified epidemiologist who has done research in the field reached her conclusions by using the exact same method — i.e., review of multiple studies showing a weak association — used to prove the link between second-hand smoke and cancer.

Yet, the Court goes to the next step, reaching its own conclusions about what is or should be “equally plausible” to an epidemiologist. That crosses the line from methods to conclusions, as revealed by the Court’s next sentence: “This is, in fact, the conclusion most researchers in Dr. Bérard’s field have reached regarding the association between Zoloft and birth defects, even those cited by Dr. Bérard in support of her contrary opinion.” (Emphasis added.) The very fact that we have stumbled upon the “conclusions” of other researchers shows that we have wandered very far afield from Daubert.

It’s easy to see how the Court would cross that line, but it’s also hard to see how this part of the opinion can stand under Daubert. The Court is in no position to say that one “conclusion” about epidemiological studies is “equally plausible” as compared to another — it takes years of education, training, and experience to make that judgment, which is why we have epidemiologists in the first place.

* For more on all of these issues, consider reading How Not To Be Wrong by Jordan Ellenberg, a mathematician, or The Signal and the Noise by Nate Silver, a statistician.