Dozens of Major Cancer Studies Can’t Be Replicated

I recently came across an interesting article in Science News on widespread replication failure in cancer studies. It’s interesting, though not particularly shocking, that the Replication Crisis has claimed one more field.

If you’re not familiar with the Replication Crisis, it has to do with how it was widely assumed that scientific experiments described in peer-reviewed journals were reproducible—that is, if someone else performed the experiment, they would get the same result. Reproducibility of experiments is the foundation of trust in the sciences. The theory is that once somebody has done the hard work of designing an experiment which produces a useful result, others can merely follow the experimental method to verify that the result really happens. After an experiment has been widely reproduced, people can be very confident in the result, because so many people have seen it for themselves and we have widespread testimony of it. Or, indeed, people can perform these experiments as they work their way through their scientific education.

That’s the theory.

Practice is a bit different.

The problem is that science became a well-funded profession. The consequence is that experiments became extraordinarily expensive and time-intensive to perform. The most obvious example would be particle-physics experiments in super-colliders. The Large Hadron Collider cost somewhere around $9,000,000,000 to build and requires teams of people to operate. Good luck verifying the experiments it performs for yourself.

Even when you’re working at radically smaller scales and don’t require expensive apparatus—say you want to assess the health effects of people cutting coffee out of their diet—putting together a study is enormously time-intensive. And it costs money to recruit people; you generally have to pay them for their participation, and you need someone skilled in periodically assessing whatever health metrics you want to assess. Blood doesn’t draw itself and run lipid panels, after all.

OK, so amateurs don’t replicate experiments anymore. But what about other professionals?

Here we come to one of the problems introduced by “Publish Or Perish”. Academics only get status and money for achieving new results; for the most part, people don’t get grants to redo experiments that other people have already done just to confirm that they get the same results. This should be a massive monkey wrench in the scientific machine, but for a long time people ignored the problem and papered over it by saying that experiments would get verified when other people tried to build on their results: if the original result were wrong, the follow-up work would fail and expose it.

It turns out that doesn’t work, at least not nearly well enough.

The first field in which people got serious funding to actually replicate results to see if they held up was psychology, and it turned out that most wouldn’t replicate. To be fair, in many cases this was because the experiment was not described well enough that one could even set up the same experiment again, though this is, to some degree, defending oneself against a charge of negligence by claiming incompetence. Of those studies which were described well enough that it was possible to try to replicate them, fewer than half replicated. They tended to fail to replicate in one of two ways:

  1. The effect didn’t happen often enough to be statistically significant
  2. The effect was statistically significant but so small as to be practically insignificant

To give a made-up example of the first, if you deprive people of coffee for a few months and one out of a few hundred sees a positive result, it may well be that you just chanced onto someone who improved for some other reason while you were trying to study coffee. To give an example of the second, you might find that everyone’s systolic blood pressure went down by one tenth of a millimeter of mercury. There’s virtually no way a result that consistent across the group arose by chance, but it’s utterly irrelevant to any reasonable goal a human being can have.
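To make the distinction concrete, here is a minimal sketch in Python (with entirely made-up numbers, in the spirit of the coffee example above) of both failure modes: an effect too rare to distinguish from chance, and an effect that is statistically unmistakable but practically meaningless.

```python
# Minimal sketch of the two failure modes, using made-up numbers.
import numpy as np
from scipy import stats

# Failure mode 1: one "improvement" among 300 coffee-deprived subjects
# vs. zero among 300 controls. Fisher's exact test: easily chance.
odds, p1 = stats.fisher_exact([[1, 299], [0, 300]])
print(f"mode 1 p-value: {p1:.2f}")  # ~1.0, not remotely significant

# Failure mode 2: systolic blood pressure drops by ~0.1 mmHg on average,
# measured across a very large group. Statistically unmistakable, clinically nil.
rng = np.random.default_rng(0)
before = rng.normal(120.0, 10.0, size=100_000)
after = before - 0.1 + rng.normal(0.0, 1.0, size=100_000)
stat, p2 = stats.ttest_rel(before, after)
print(f"mode 2 p-value: {p2:.2e}, mean drop: {(before - after).mean():.2f} mmHg")
```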

Psychology does tend to be a particularly bad field when it comes to experimental design and execution, but other fields took note and wanted to make sure that they really were as much better than the psychologists as they had assumed.

And it turned out that many fields were not.

I find it interesting, though not very surprising, that oncology turns out to be another field in which experiments are failing to replicate. After all, in a field which isn’t completely new, it’s easier to get interesting results that don’t replicate than it is to get interesting results that do.

Awful Scientific Paper: Cognitive Bias in Forensic Pathology Decisions

I came across a rather bad paper recently titled Cognitive Bias in Forensic Pathology Decisions. It’s impressively bad in a number of ways. Here’s the abstract:

Forensic pathologists’ decisions are critical in police investigations and court proceedings as they determine whether an unnatural death of a young child was an accident or homicide. Does cognitive bias affect forensic pathologists’ decision-making? To address this question, we examined all death certificates issued during a 10-year period in the State of Nevada in the United States for children under the age of six. We also conducted an experiment with 133 forensic pathologists in which we tested whether knowledge of irrelevant non-medical information that should have no bearing on forensic pathologists’ decisions influenced their manner of death determinations. The dataset of death certificates indicated that forensic pathologists were more likely to rule “homicide” rather than “accident” for deaths of Black children relative to White children. This may arise because the base-rate expectation creates an a priori cognitive bias to rule that Black children died as a result of homicide, which then perpetuates itself. Corroborating this explanation, the experimental data with the 133 forensic pathologists exhibited biased decisions when given identical medical information but different irrelevant non-medical information about the race of the child and who was the caregiver who brought them to the hospital. These findings together demonstrate how extraneous information can result in cognitive bias in forensic pathology decision-making.

OK, let’s take a look at the actual study. First, it notes that black children’s deaths were more likely to be ruled homicides (instead of accidents) than white children’s deaths, in the state of Nevada, between 2009 and 2019. More accurately, of those deaths of children under 6 which were given some form of unnatural death ruling, the deaths of black children were significantly more likely to be rated a homicide rather than an accident than were the deaths of white children.

It’s worth looking at the actual numbers, though. Of all of the deaths of children under 6 in Nevada between 2009 and 2019, 8.5% of the deaths of black children were ruled a homicide by forensic pathologists while 5.6% of the deaths of white children were ruled a homicide. That’s not a huge difference. They use some statistics to make it look much larger, of course, because they need to justify why they did an experiment on this.
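To illustrate how the same gap can be made to sound bigger or smaller, here is a small sketch using only the two percentages quoted above (the underlying counts aren't given here, so it is purely illustrative): reported as an absolute difference it is about 2.9 percentage points; reported as a relative risk or odds ratio it becomes a rather more impressive-sounding 1.5x or so.

```python
# Sketch: the same gap framed three ways, from the percentages quoted above.
p_black = 0.085  # share of deaths of black children ruled homicide
p_white = 0.056  # share of deaths of white children ruled homicide

absolute_diff = p_black - p_white                                   # ~0.029
relative_risk = p_black / p_white                                   # ~1.52
odds_ratio = (p_black / (1 - p_black)) / (p_white / (1 - p_white))  # ~1.57

print(f"absolute difference: {absolute_diff:.1%}")  # 2.9%
print(f"relative risk:       {relative_risk:.2f}x")
print(f"odds ratio:          {odds_ratio:.2f}")
```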

In fairness to the authors, they do correctly note that these statistics don’t really mean much on their own, since black children might actually have been murdered at a higher rate than white children in Nevada during that period. It doesn’t reveal cognitive bias if the pathologists were simply correct about real discrepancies.

So now we come to the experiment: they got 133 forensic pathologists to participate. They took a medical vignette about a child under six who was discovered motionless on the living room floor by their caretaker, brought to the ER, and died shortly afterwards. “Postmortem examination determined that the toddler had a skull fracture and subarachnoid hemorrhage of the brain.”

The participants were broken up into two groups, which I will call A and B. 65 people were assigned to A and 68 to B. All participants were given the same vignette, except that, to be consistent with typical medical information, the race of the child was specified. Group A’s information stated that the child was black, while group B’s information stated that the child was white. OK, so they then asked the pathologists to give a ruling on the child’s death as they normally would, right?

No. They included information about the caretaker. This is part of the experiment to determine bias, because information about the caretaker is not medically relevant.

OK, so they said that the caretaker had the same race as the child?

Heh. No. Nothing that would make sense like that.

The caretaker of the black child was described as the mother’s boyfriend, while the caretaker of the white child was the child’s grandmother. Their race was not specified, though for the caretaker of the white child it can be (somewhat) inferred from the blood relation, depending on what drop-of-blood rule one assumes the investigators used to determine that the child is white. Someone who is 1/4 black, where the caretaker grandmother was the black grandparent, might well be identified as white; or perhaps a stricter drop-of-blood rule is applied and the grandmother could be at most 1/8 black for her grandchild to qualify, to the racist experimenters, as white. Why do they leave out the race of the caretaker despite clearly wanting to draw conclusions about it? Why, indeed.

More to the point, these are not at all comparable things. It is basic human psychology that people are far less likely to murder their own descendants than they are to murder people not related to them. Moreover, males are more likely to commit violent crimes than females are (with some asterisks: there is some evidence to suggest that women are possibly even more likely to hit children than men are, but get away with it more because people prefer to look away when women are violent; in any event, the general expectation is that a male is more likely to be violent than a female). Finally, young people are significantly more likely to be violent than older people are.

In short, in the vignette given to group A, the dead child is black and the caretaker who brought them in is given three characteristics, each of which, on its own, makes violence statistically more likely. In group B, the dead child is white and the caretaker who brought them in is given three characteristics, each of which, on its own, makes violence statistically less likely. For Pete’s sake, culturally, we use grandmothers as the epitome of non-violence and gentleness! At this point, why didn’t they just give the caretaker of the black child multiple prior convictions for murdering children? Heck, why not have him offer such medically extraneous information as repeatedly saying, “I didn’t hit him with the hammer that hard. I don’t get why he’s not moving.” I suppose that would have been too on-the-nose.

Now, given that we’re comparing a child in the care of mom’s boyfriend to a child in the care of the child’s grandmother, what do they call group A? Boyfriend Condition? Nope. Black Condition. Do they call group B Grandma Condition? Nope. White Condition.

OK, so now that we have a setup clearly designed to achieve a result, what are the results?

None of the pathologists rated the death “natural” or “suicide.” 78 of the 133 pathologists ruled the child’s death “undetermined” (38 from group A, 40 from group B). That is, 58.6% of pathologists ruled it “undetermined.” Of the minority who ruled conclusively, 23 ruled it an accident and 32 ruled it a homicide. (That is, 17.2% of all pathologists ruled it accident and 24% of all pathologists ruled it homicide.)

In group A, 23 pathologists ruled the case homicide, 4 ruled it accident, and 38 ruled it undetermined. In group B, 9 ruled it homicide, 19 ruled it accident, and 40 ruled it undetermined.

This is off from an exactly equal outcome by approximately 15 out of 133 pathologists. That is, if about 7 pathologists in group A had ruled accident instead of homicide, and about 7 pathologists in group B had ruled homicide instead of accident, the results would have been roughly equal between the two groups. As it was, the difference is big enough to reach statistical significance, which merely tells you that chance alone is unlikely, at the conventional 95% confidence level, to explain the results. What it doesn’t do is show a pervasive trend. If about 11% of the participants had reversed their ruling, the experiment would have shown that the 18.6% of forensic pathologists on an email list of board-certified pathologists who responded to the study were paragons of impartiality.
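For the curious, here is a sketch of a significance check on the counts reported above, using an ordinary chi-square test on the table of rulings by group. The paper's exact analysis isn't reproduced here, so this is only meant to illustrate the "statistically significant but not a pervasive trend" point.

```python
# Sketch: chi-square test on the rulings reported above (illustrative only).
import numpy as np
from scipy import stats

#          homicide  accident  undetermined
group_a = [23,       4,        38]  # "Black Condition" (n=65)
group_b = [9,        19,       40]  # "White Condition" (n=68)

chi2, p, dof, expected = stats.chi2_contingency(np.array([group_a, group_b]))
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")  # p well under 0.05

# The disparity rests on the 23-vs-9 and 4-vs-19 cells; the modal response
# in both groups was still "undetermined" (38 and 40 of 133).
```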

There’s an especially interesting aspect to the last paragraph of the conclusion:

Most important is the phenomenon identified in this study, namely demonstrating that biases by medically irrelevant contextual information do affect the conclusions reached by medical examiners. The degree and the detailed nature of these biasing effects require further research, but establishing biases in forensic pathology decision-making—the first study to do so—is not diminished by the potential limitation of not knowing which specific irrelevant information biased them (the race of the child, or/and the nature of the caretaker). Also, one must remember that the experimental study is complemented and corroborated by the data from the death certificates.

The first part makes a fair point, which is that the study does demonstrate that it is possible to bias a forensic pathologist by providing medically irrelevant information, such as a caretaker who is far more likely to have intentionally hurt the child. Why didn’t they make all of the children white and just have half of the vignettes include a caretaker with multiple previous felony convictions who, while inebriated, repeatedly states, “I only hit the little brat with a hammer four times”? If we’re only trying to see whether medically irrelevant information can bias the medical examiner, that would do it too. But what’s up with varying the race of the child?

While it’s probably just sensationalism, because race-based results are currently hot, it may also be a tie-in to that last sentence: “Also, one must remember that the experimental study is complemented and corroborated by the data from the death certificates.” This sentence shows a massive problem with the researchers’ understanding of the nature of research. Two bad data sources which corroborate each other do not improve each other.

To show this, consider a randomly generated data source. Instead of giving a vignette, just have another set of pathologists randomly answer “A,” “B,” or “C.” Then decide that A corresponds to undetermined, B to homicide, and C to accident. There’s a good chance that people won’t pick these evenly, so you’ll get a disparity. If that disparity happens to point the same way, it doesn’t bolster the study to say, “the results, it must be remembered, also agreed with the completely blinded study in which pathologists picked a ruling at random, without knowing what ruling they picked.”
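Here is a quick sketch of that thought experiment (the response weights are arbitrary, chosen only for illustration): two groups of "pathologists" answering at random, with no case information whatsoever, will still typically show a disparity between groups, and agreement with such a disparity corroborates nothing.

```python
# Sketch: random "rulings" with no case information still produce disparities.
import random

random.seed(1)
choices = ["undetermined", "homicide", "accident"]

def random_panel(n):
    # each "pathologist" picks a ruling at random, with an arbitrary lean
    return random.choices(choices, weights=[5, 3, 2], k=n)

group_a = random_panel(65)
group_b = random_panel(68)
for name, rulings in (("group A", group_a), ("group B", group_b)):
    print(name, {c: rulings.count(c) for c in choices})
# The two tallies will generally differ, despite containing zero information.
```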

Meaningless data does not acquire meaning by being combined with other meaningless data.

The conclusion of the study is, curiously, entirely reasonable. It basically amounts to the observation that if you want a medical examiner to make a ruling based strictly on the medical evidence, you should hide all evidence but the medical evidence from them. This, as the British like to say, no fool ever doubted. If you want someone to make a decision based only on some information, it is a wise course of action to present them only that information. Giving them information that you don’t want them to use is merely asking for trouble. It doesn’t require a badly designed and interpreted study to make this point.