So Much Science is Bad These Days

It’s a bit sad how the title of this post could easily be a running series if I wanted to devote the time to writing it. There was a recent meta-analysis published in The Lancet called Assessment of adverse effects attributed to statin therapy in product labels: a meta-analysis of double-blind randomised controlled trials. It’s the sort of paper that makes one inclined to assume that all scientific papers are probably false.

First off, meta-analyses can be quite useful, but they have a tendency to sound far more authoritative than they actually are. They are only ever one particular way of looking at the published results that they are analyzing. For this reason, like most of science, they are far more useful in the positive than in the negative. (If you aren’t familiar: a meta-analysis is a paper that looks at multiple previously published papers and presents some kind of analysis of all of the data taken together.) To give an extremely clear, less-silly-than-it-should-be example: if I do a meta-analysis of papers about experiments that gave a test group and a control group vitamins for six months and then measured their waistlines, apply statistical methods to aggregate the results, and conclude “interventions were not shown to reduce hair loss,” this would be technically true. The problem is that—if you understand what the meta-analysis is actually doing—it’s completely uninteresting.

The problem that we run into is that modern Science is all about reputation. It’s about publishing, and prestige, and citation counts—and also about funding, which is based on those things. These are the primary motivations for a great many people in science. Even for the people for whom they are not primary motivations, they’re concerns which no scientist can ignore and survive. The result is exactly what you would expect—there’s an enormous amount of what can probably best be called “salesmanship” in scientific papers.

A good example is the paper which motivated this post. If you look at the way they present the analysis, it sounds quite good:

Statin product labels (eg, Summaries of Product Characteristics [SmPCs]) list certain adverse outcomes as potential treatment-related effects based mainly on non-randomised and non-blinded studies, which might be subject to bias. We aimed to assess the evidence for such undesirable effects more reliably through a meta-analysis of individual participant data from large double-blind trials of statin therapy.

Sounds great. But what did they actually do?

Aye, there’s the rub.

The first and most obvious problem is that they took studies on five different drugs that are all in the class “statin” and then summed up all of their side-effects¹. The problem is that, while different drugs in a class share at least one mechanism of action, they are still different drugs. They work differently in different people and have different side-effects. Aspirin works better for some people, ibuprofen (Motrin) for others, and naproxen (Aleve) for still others. They all reduce pain and inflammation by being COX inhibitors and are, for that reason, classed as NSAIDs. But they’re not the same drug. In the same way, the five different statin drugs they looked at (atorvastatin, fluvastatin, pravastatin, rosuvastatin, and simvastatin) are not the same drug. You wouldn’t expect them to have the same side-effects. In fact, what is extremely common in drugs that are alternatives to each other is that each one has its own side-effects, and patients try them in turn to figure out which—if any—works for them with the least-bad side-effects. By design, this meta-analysis isn’t showing that there exists a statin that one can take without side-effects; it’s only showing, at best, that there is no particular side-effect that a person is guaranteed to get if they’re considering taking a drug from the entire class called “statins”. But you cannot take a generic class of drugs; you must take a particular drug. Just as you are a particular person in a particular place, not an abstraction of the platonic ideal of “a patient,” you must take an actual pill with an actual chemical in it, and cannot take the platonic ideal of “a statin.” And this problem is baked into the approach.

(Worse, the reason that it was baked into the approach is almost certainly that without doing this obviously invalid step, they wouldn’t have had the statistical power necessary to reach their conclusion. Whether they knew that or not, though, I have no guess.)
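
To make that concrete, here is a toy calculation with numbers I made up (nothing from the paper): suppose one drug in the class causes a particular side-effect in 10% of patients and the other four don’t cause it at all. Pooling the five arms averages the real signal down until it is barely distinguishable from the background rate.

```python
# Toy illustration with invented rates (equal-sized trial arms assumed):
# pooling side-effect rates across five hypothetical statin-class drugs
# dilutes a side-effect that only one of the drugs actually causes.
placebo_rate = 0.01                          # background rate of, say, muscle pain
drug_rates = [0.10, 0.01, 0.01, 0.01, 0.01]  # only "drug A" actually causes it

pooled_rate = sum(drug_rates) / len(drug_rates)

print(f"Drug A excess over placebo: {drug_rates[0] - placebo_rate:.1%}")  # 9.0%
print(f"Pooled excess over placebo: {pooled_rate - placebo_rate:.1%}")    # 1.8%
```

A nine-point excess for the one drug shows up as under two points for “statins,” which is much easier to wave away as noise.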

Which brings us to the main problem: their approach was to test all of the possible side-effects ever reported anywhere as a group, using a statistical method meant to prevent people doing post-hoc subgroup analysis from finding spurious results. The statistical method (a modified Bonferroni correction called the Mehrotra and Adewale double false discovery rate (FDR) method) prevents “false discoveries” by raising the amount of evidence required to conclude something based on the total number of things being tested. This is important when you’re doing a single study because it’s very tempting to just measure everything you possibly can in the hopes that something will have a statistically significant correlation with what you were testing. If you measure 100 things at a 5% significance level, you expect 5 statistically significant results by pure chance; corrections like the Bonferroni correction filter this kind of thing out so that you can’t report them as statistically significant findings, only as possible directions for further research. (Every scientific paper says that more research is needed; of course, the dairy council wants you to drink more milk, too.)
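
To see why such corrections exist in the first place, here is a minimal simulation (mine, not anything from the paper) using an ordinary two-sample t-test: a hundred outcomes with no real treatment effect anywhere will yield about five “significant” results at the 0.05 level, and the Bonferroni-corrected threshold wipes them out.

```python
# Simulate 100 outcomes with NO real treatment effect and count how many
# look "significant" at p < 0.05, with and without a Bonferroni correction.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_outcomes, n_per_arm, alpha = 100, 200, 0.05

p_values = []
for _ in range(n_outcomes):
    treatment = rng.normal(size=n_per_arm)  # pure noise: there is no effect
    control = rng.normal(size=n_per_arm)
    p_values.append(ttest_ind(treatment, control).pvalue)

naive_hits = sum(p < alpha for p in p_values)
corrected_hits = sum(p < alpha / n_outcomes for p in p_values)

print(f"'Significant' at 0.05:         {naive_hits}")     # ~5, by chance alone
print(f"After Bonferroni (0.05 / 100): {corrected_hits}")  # almost always 0
```

Used on a single study’s own endpoints, that is exactly the sort of honesty the correction is meant to enforce.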

The problem is that they are taking a method that tries to keep scientists honest by throwing out most of the conclusions that they wish they could keep and using it to throw out all of the conclusions that they want to get rid of. Every scientist knows that the Bonferroni correction (and its relatives, like the one used in this paper) punishes, to varying degrees, the inclusion of junk. It thus discourages the approach of “throw everything against the wall and see what sticks.” This is good; we don’t want scientists generating meaningless results, and we want to encourage them to be careful and only measure and report things that there’s a reason to.

But while this tradeoff makes sense—we punish the individual scientist who finds something useful they didn’t expect in order to keep all scientists honest—the incentives are exactly backwards when it comes to throwing out conclusions we don’t want to believe. I haven’t actually run the numbers, but if you throw out the fantastical p-values generated by the improper use of probability (Fisher’s exact test being applied to retrospective studies), it’s very likely that you could use this approach to conclude that the evidence that smoking causes lung cancer isn’t good enough. All you have to do is include enough other stuff, and the statistical power in the controlled, prospective studies that clearly demonstrate smoking’s link to lung cancer wouldn’t meet the incredibly stringent threshold this method would set².

As I said, I haven’t run the numbers, so that might not quite be true. But if it isn’t, it’s pretty close, and in any event it does serve to illustrate how the approach works. If we place lung cancer on an even footing with every other possible cancer and re-test all the clinical evidence on equal terms, we raise the bar that the evidence has to clear very considerably.
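
Here is a deliberately crude sketch of that arithmetic, using a plain Bonferroni threshold as a stand-in for the paper’s double-FDR procedure (the real method is more forgiving, but the penalty moves in the same direction) and an invented p-value for the finding being tested:

```python
# How bundling in unrelated endpoints raises the bar a real finding must clear.
# The p-value and endpoint counts are invented for illustration.
p_real_finding = 0.001  # a hypothetically strong result on its own
alpha = 0.05

for n_endpoints in (1, 10, 100, 1000):
    threshold = alpha / n_endpoints  # Bonferroni-style per-test threshold
    verdict = "kept" if p_real_finding < threshold else "thrown out"
    print(f"{n_endpoints:>4} endpoints bundled in: threshold {threshold:.5f} -> {verdict}")
```

The finding itself never changes; only the amount of unrelated company it keeps does.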

The only reason this is considered acceptable (by some) is that they like the conclusion.


  1. Full disclosure: adding up the side-effect reports of all five drugs is only what they said they did in the methods section; I didn’t actually dig into the meat of the paper to confirm it. That said, they’d have had to report their findings very differently if that’s not what they did, so I strongly suspect that their methods section was accurate.
  2. The Mehrotra and Adewale double FDR method would require one to be a little more careful in exactly what was added in, since it does filter out adding in pure junk—variables which were known in the original experiment to be unrelated. So it would filter out, for example, adding in genetic diseases. This only means that a little care is required in selecting the data added in order to weaken the power of the evidence one wishes to deny. There are a lot of different kinds of cancer, and a lot of different possible infarctions, and perhaps smoking triggers one of the many kinds of autoimmune diseases…
