I Wonder How Many Scientists Would Get This Wrong

I wonder how many scientists would get this wrong:

You want to know if one basketball team is taller than another so you measure the players. On the one team the average is 6’5″ (15 players, standard deviation 1.3″) while for the other it’s 6’6″ (14 players, standard deviation 1.4″). What is the p value for this 1″ difference?

I suspect that far more than zero would do what Gemini did when I presented it with this question:

The p-value for the 1-inch difference in the average heights of the two basketball teams is 0.0571.


Explanation of Statistical Test

To determine if the 1-inch difference in average height is statistically significant, a two-sample t-test was performed. This test compares the means of two independent groups to see if the difference between them is due to chance or if it’s a genuine difference in the populations from which the samples were drawn.

Here’s a breakdown of the calculation:

  • Hypotheses: The null hypothesis (H₀) assumes there is no true difference in the average height of the two teams (μ₁ = μ₂). The alternative hypothesis (Hₐ) assumes there is a difference (μ₁ ≠ μ₂).
  • T-statistic: This value measures the size of the difference relative to the variation in the sample data. A larger absolute value of the t-statistic suggests a stronger difference. The calculated t-statistic for this data is -1.9894.
  • Degrees of Freedom (df): This value, 26.44, is a measure of the amount of information available to estimate the population variance. It’s used to find the correct p-value from the t-distribution.
  • P-value: The p-value, 0.0571, represents the probability of observing a difference in average heights of at least 1 inch (or more extreme) if the two teams were actually the same height on average. A common significance level is 0.05. Since the calculated p-value of 0.0571 is greater than 0.05, we do not have sufficient statistical evidence to reject the null hypothesis.

This means that while Team 2 is an inch taller on average in this sample, this difference is not statistically significant at a 5% level of significance. It’s plausible that the observed difference is simply due to random chance.
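
For what it’s worth, the arithmetic itself checks out: those numbers come from a Welch’s (unequal-variance) two-sample t-test on the summary statistics. Here’s a minimal sketch, assuming SciPy, that reproduces them (6’5″ and 6’6″ convert to 77″ and 78″):

```python
# A sketch reproducing Gemini's calculation: Welch's two-sample t-test
# computed from summary statistics alone (requires scipy).
from scipy import stats

t, p = stats.ttest_ind_from_stats(
    mean1=77.0, std1=1.3, nobs1=15,  # team 1: 6'5" average, 15 players
    mean2=78.0, std2=1.4, nobs2=14,  # team 2: 6'6" average, 14 players
    equal_var=False,                 # Welch's test (unequal variances)
)
print(f"t = {t:.4f}, p = {p:.4f}")  # t = -1.9894, p = 0.0571
```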

Gemini’s answer is completely wrong, though, because it’s a trick question. There is no p-value for this 1″ difference because the entire population was measured.

The p-value, used to determine whether a result is statistically significant, is, in theory, the computed probability of obtaining the result by random chance if the null hypothesis (usually, that no causative relationship or correlation exists between the two measured things) is true. However, that’s not how it’s actually used in practice.
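
In symbols, for a two-sided test with observed statistic t_obs, the textbook definition is:

$$
p = P\bigl(\lvert T \rvert \ge \lvert t_{\text{obs}} \rvert \;\big|\; H_0\bigr)
$$

That is, the probability, computed as if the null hypothesis were true, of seeing a test statistic at least as extreme as the one actually observed.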

The p-value that is used in practice is actually the probability that the results were generated by random chance purely in how the members of each group being compared were selected from the total population.

Despite how this may sound, it’s actually a really easy concept, so let me illustrate it with an example. Suppose you want to know whether 1 beer is enough to get people drunk enough that they’re bad at playing ping pong, so you select two groups of fifty people each, give one group the beer, and give the other group a non-alcoholic beer. (Let’s make favorable assumptions on all of the practical stuff, like having taken baseline measurements of their ping pong skill beforehand.)

Now, we know ahead of time that there is variation in the population in alcohol tolerance. Some people can safely operate heavy machinery after six shots of vodka and some people are under the table after one, with most people falling somewhere in between those two extremes (even apart from variations in body weight). So here’s the thing: what are the odds that, instead of both groups being exactly representative of the population as a whole, your randomly assigning people from the population to one of the two groups just happened to put more alcohol-tolerant people into the 1-beer group than is representative of the whole population? The laws of probability tell us that if you run this experiment enough times, at least once you’ll randomly have all high-tolerance people in the alcoholic beer group and at least one other time you’ll have all low-tolerance people in the alcoholic beer group.
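
You can watch this skew happen with a quick simulation. The sketch below is purely illustrative: the tolerance scores are made-up numbers drawn from a bell curve, and the “noticeably skewed” threshold is arbitrary.

```python
# Illustrative simulation of assignment skew (all numbers are made up).
# 100 people with varying alcohol tolerance are repeatedly shuffled into
# two groups of 50; we track how lopsided the split can get by chance.
import random
import statistics

random.seed(0)
population = [random.gauss(0, 1) for _ in range(100)]  # tolerance scores

trials = 10_000
worst_gap = 0.0
big_gaps = 0
for _ in range(trials):
    random.shuffle(population)
    beer, control = population[:50], population[50:]
    gap = statistics.mean(beer) - statistics.mean(control)
    worst_gap = max(worst_gap, abs(gap))
    if abs(gap) > 0.4:  # arbitrary "noticeably skewed" threshold
        big_gaps += 1

print(f"worst gap between group means: {worst_gap:.2f}")
print(f"fraction of splits with a gap over 0.4: {big_gaps / trials:.1%}")
```

With fifty people per group, a gap of that size turns up in roughly a few percent of splits purely by chance.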

What people are measuring by p-value, in almost all scientific experiments, is how likely this kind of skew is. They are not measuring the effect of, for example, random errors in the actual measurements taken. (Conveniently for the researchers, those are always assumed to be perfect, or at least to always balance out.)

This is why the question I started this post with is a trick question: it’s got all of the information that you’d find in a random trial, presented in roughly the way you’d get it in a random trial, except there was no random selection. Within the hypothetical of the question, the entire population we care about—the two basketball teams—was measured. If you want to be super nit-picky, you can say that the p-value is zero, since there was no chance of this result being produced by random selection, in the same sense that the probability of a coin set down on the table heads-up turning up tails is zero.

But the thing is, there are scientists—evidence points to it being an awful lot of scientists—who don’t actually know this is what they’re doing when they run a p-value calculation. And, of course, there are contexts where this isn’t awful. If they’re copying a format for experiments where this happens to work, well, then, it happens to work. But because they don’t understand what it’s about, they don’t realize that the p-value is the absolute floor on the probability that the result is meaningless. That is to say, if the scientist does a randomized (controlled) trial and gets a p-value of .045, which is below the industry threshold of .05, this means that there is a 4.5% chance that the results are meaningless even if everything else is perfect.

Imperfections in instruments, fluctuations in the things to be measured, defects in the measuring tools, biases in the researchers, flaws in the study design—these all raise the chance of the results being meaningless, potentially by quite a lot.

Of course, if you want to be cynical about it, it’s in scientists’ best interests to not know any of this.

Programmed to Overeat?

One of the causes that you will see put forward as to why so many people are overweight, fat, or obese is that we evolved for a food-scarce environment and now live in a food-rich environment, so our natural inclination to eat everything available and store fat for the lean times is no longer adaptive. This hypothesis has a natural conclusion about how to not get fat: limit what you eat and always be hungry. To lose weight, limit what you eat even more and always be hungrier until you’re thin, then just limit what you eat and always be hungry.

Like the idea that carbs are more filling than fats because carbs have 4 Calories per gram while fats have 9 Calories per gram, so carbs take up more room in your stomach, this is one of those ideas it’s strange anyone says with a straight face, at least if they’ve spent more than a few days living as a human being. Because if you have any experience of living as a human being, this is just obviously false. And there’s a super-obvious thing which disproves both: dessert.

Observe any normal people eating dinner and they will eat until they are full and don’t want to eat anymore. Then bring out some tasty treats like pie, ice cream, etc. and suddenly they have room in their stomach after all. This simple experiment, which virtually all people have participated in themselves in one form or another, irrefutably disproves both of those hypotheses.

You can also easily see this if you have any experience of animals which actually do eat all food that’s available until they physically can’t, such as the cichlid fish called the oscar.

By Tiia Monto – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=53098090

Feed oscars feeder fish and they will keep eating them until there is no more room left in their stomach, throat, and mouth. They literally only stop eating once their mouths are full and nothing more fits in them. They then swim around with several tails sticking out of their mouths until their stomachs make room and they can move everything down.

That’s what a hunger signal with no feedback mechanism to stop because the creature evolved in a food-scarce environment looks like. (Oscars who are fed a lot grow extremely rapidly and very large.)

But you can also disprove this from the other direction. Yes, lots of people are fat, but they’re not fat-mouse fat.

Fat mouse was created by lesioning the part of the brain responsible for satiety. Fat mouse then kept eating and eating, without stop, rapidly ballooning into nearly being spherical. (Incidentally, are we to believe that normal mice have a satiety limit to their eating because mice evolved in a food-rich environment? When you look at field mice, is “abundant food” really the first thing that comes to mind?)

Now, it’s possible to attempt to save the food-scarce-environment hypothesis by modifying it, saying that we’re genetically predisposed to being fat and unhealthy (but not too fat, for whatever reason) because that worked out in a food-scarce environment. This suffers from being arbitrary, but then it is the prerogative of evolution to be arbitrary (obviously nothing needs to make any sense if you’re an atheist, but for the rest of us the influence of fallen angels on evolution, within the limits God permits them to work, has the same result—that’s one of the things that confuses atheists).

Of course, the problem with even this modified hypothesis is that there are plenty of naturally thin people and if you talk to them they’re not constantly hungry and denying themselves the food needed for satiety at every moment.

There’s also the problem of the timing of the rapid fattening of the population. Yes, it took place at a time when food was abundant, but there have been sections of the population for whom food was abundant as far back as there is recorded history. They were not all obese. More recently, in the 1800s, upper middle class and rich people could easily afford enough food to get fat on, yet they were not all obese. And in much of history, when food was scarce, the preference was for plump women. Just look up paintings of Venus:

Which makes sense in that context—when people mostly don’t have enough food, women who manage to be plump in this environment are healthier, can have more children, survive the rigors of pregnancy, take care of the children, etc. Hence when painting a goddess of beauty, they painted her to the standards of their day and made her plump. But they didn’t make her obese.

To be fair, you can find the Venus of Willendorf:

By Oke – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1152966

But this dates to a time (30,000 years ago) when food was supposed to be scarce and—so the hypothesis goes—no one actually looked like that, because they were living in the very environment their constant food cravings were adapted to.

Ultimately, what I find so odd about the programmed-to-overeat hypothesis of modern obesity is not that it’s obviously false. It’s that it’s obviously false and the people who push it have clearly never considered the evidence against it.

You don’t see this with, for example, Young Earth Creationists. They have explanations for why radio-isotope dating doesn’t work and how geology is all wrong and fossil records are being misinterpreted because the dinosaurs were all animals that didn’t make it onto the Ark, etc. etc. etc. Say what you want about Young Earth Creationists, they at least take their ideas seriously.

As far as I can tell, the people saying that we’re programmed to overeat are just saying things.