How to Scrutinize a Statistic
When encountering a statistic, you should inquire about its provenance, its scope of inference, its practical significance, and the estimation error associated with it.
Table of Contents
- Provenance
- Scope of Inference
- Practical Significance
- Estimation Error and Confidence Intervals
There is a cautionary saying to the effect of, “be wary of those who use statistics like a drunkard uses a lamppost – for support rather than for illumination.”1
The point of this saying is to warn about those who start with a preconceived idea of a conclusion that they want to reach and then search for statistics that appear to validate that preconceived conclusion.
This is a practice well known to statisticians, so much so that How to Lie with Statistics (1993) is still in print more than fifty years after it was originally published in 1954. It is a practice that is effective for at least three reasons.
First, statistics have persuasive power. They bring with them an atmosphere of scholarship, science, and quantitative certainty. Even when no relevant empirical observations have been made, citing statistics makes it appear that a position is supported by empirical evidence.
Second, given enough time and effort, statistics can be manufactured that appear to support any arbitrary position. This can be done intentionally by deceptive actors. It can also be done accidentally by those who do not know how to conduct empirical inquiry well, by those afflicted by the psychology of motivated reasoning, or by researchers who, desperate to further their academic careers in a “publish or perish” world, publish whatever they can.
Third, interpreting and scrutinizing statistics requires knowledge and skills that many laypeople do not possess. Indeed, those who have had education in statistics are sometimes prone to mistakes in using and interpreting statistical methods. Thus, misuses of statistics can go unnoticed, by the general public or by academic peer review. Indeed, as this article discusses, some misuses of statistics can actually become entrenched in academic practice.
This situation might seem so discouraging that you might be inclined to throw out the use of statistics altogether. However, this would be a rash mistake. Those who care about social issues, medical issues, environmental issues, and so on are necessarily concerned with understanding phenomena that occur in large populations of individuals. Therefore, if you care about such issues, you are necessarily engaged in a statistical inquiry, whether you realize it or not.
The solution for those who care about such issues, therefore, is to be competent at scrutinizing statistics. The alternative is to be preyed upon by charlatans or, perhaps worse, to become a charlatan yourself.
Another way to conceive of this is that statistics should be questioned, by every person who encounters them and must make a decision on whether or not to believe what is being reported.
The intent of this article is to assist the reader in the enterprise of questioning the statistics you encounter. It does not assume the reader has had any formal instruction in statistics, though such instruction would be valuable in understanding the concepts used throughout. The only prerequisite for this article is cursory mathematics knowledge at the secondary school level.
This article is not a replacement for courses in statistics. However, the converse is also true: statistics courses are not a replacement for this article. Statistics courses are rarely taught from the standpoint of what questions are relevant to ask when encountering a statistic. Rather, introductory statistics courses usually focus on giving students a bag of statistical tools to use in their research activities.2
There is merit to knowing this bag of tools even if you do not do empirical research yourself. For instance, when the toxicity of something is reported, it will often be reported by way of the median lethal dose (LD50) statistic. The LD50 in turn is usually derived by way of logistic regression, which is one of the tools typically covered in a two-semester introductory statistics sequence at the university level. Therefore, if you want to understand where this statistic comes from, you need to understand logistic regression.
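To make that concrete, here is a minimal sketch of how an LD50 can be read off a fitted logistic regression. The dose-response data and the gradient-ascent fitting routine below are illustrative assumptions, not a real toxicology protocol.

```python
import math

# Hypothetical dose-response data: (log10 dose, number tested, number died).
trials = [(0.0, 10, 1), (0.5, 10, 2), (1.0, 10, 5), (1.5, 10, 8), (2.0, 10, 10)]

def p_death(a, b, x):
    """Logistic model for the probability of death at log10 dose x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Fit intercept a and slope b by gradient ascent on the log-likelihood.
a, b = 0.0, 0.0
learning_rate = 0.01
for _ in range(20_000):
    grad_a = sum(died - n * p_death(a, b, x) for x, n, died in trials)
    grad_b = sum((died - n * p_death(a, b, x)) * x for x, n, died in trials)
    a += learning_rate * grad_a
    b += learning_rate * grad_b

# The LD50 is the dose at which the fitted probability of death is 0.5,
# i.e. where a + b * x = 0.
ld50 = 10 ** (-a / b)
```

With this made-up data, the fitted curve crosses 50% mortality near a log10 dose of 1, so the estimated LD50 is about 10 in the original dose units. The point is only that the headline LD50 number is the output of a fitted model, with all the assumptions that entails.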
This article, however, instead of discussing any one specific statistical technique, discusses questions that should be asked of all statistics regardless of how they are derived. Thus, the content of this article is both different from and complementary to the content of statistics courses.
Provenance
The first thing to scrutinize with regard to a statistic is its provenance. The word “provenance” is borrowed from historians of art who use it to refer to the chronology of place and custody of a work of art traced back to its origin, though the word is now used by a variety of fields.
In the context of scrutinizing a statistic, determining its provenance consists of asking questions such as “Where does this statistic come from?” “Who made the observations from which this statistic was derived?” and “How were those observations made?”
The reason this is the first aspect to scrutinize of any statistic is that its answer is a prerequisite for subsequent scrutiny. If you do not know where a statistic came from or how it was derived, you cannot scrutinize it further.
One advantage of establishing the provenance of a statistic, compared with other kinds of provenance, is that many if not most statistics of any quality are reported in a scholarly paper of some kind. Thus, the work of establishing the provenance of a statistic usually ends with a very definitive result: a document that has a Digital Object Identifier (International DOI Foundation 2021) and that can be succinctly referenced in any one of a number of standardized citation formats.
This task is phrased as “establishing provenance” and not simply “looking it up” because there are many situations where the mere act of determining where a statistic comes from is itself a fair amount of work.
Lack of a Source
The most obvious situation that makes establishing provenance challenging is when a statistic is stated, but no source is referenced.
This can occur in a variety of settings, such as when reading an article or when engaging in face-to-face conversation. If the medium allows for two-way communication, it might be prudent to ask the person stating the statistic where that person encountered the statistic.
If no source is provided, that immediately casts doubt on the standards of the person stating the statistic. Such a person has not done the due diligence of scrutinizing a statistic, but is still passing it along, which is a practice that can spread misinformation.
However, while the lack of a source casts doubt on the veracity of the person asserting a statistic, it does not provide insight on the veracity of the statistic itself. In these cases, the best you can do is withhold any judgment and try to find the origin of the statistic yourself.
Today, with the plethora of Internet search tools available, the task of finding an origin is not impossible. It is simply a value judgment you must make as to whether you want to spend your time in this manner. It is certainly within the realm of reasonable behavior to remain in a state of suspended judgment if your time is better spent on other matters.
When a source is provided, the obvious first step in establishing provenance is to check this source. However, sometimes this immediately becomes a dead end because the cited source does not actually assert the statistic in question. This phenomenon can have a variety of causes.
For one, the person falsely citing a source to be the origin of a statistic might simply be making a mistake in good faith. Perhaps the person remembered something incorrectly or confused this source with another.
Alternatively, some people engage in this practice intentionally as an act of deception. The mere presence of a cited source can lend a statistic an air of credibility it would not have otherwise because many people do not check sources.
Finally, there are whole branches of thought whose adherents do not believe they need empirical evidence. Thus, a cited source might assert something that resembles a statistic, but if the source in question is based on ideology instead of empirical evidence, then this pseudo-statistic is simply made up. A credulous audience might nonetheless treat it as genuine.
In the first case of a simple mistake, you are left in the same state as you would have been if no source was provided.
In the second case of intentional deception or the third case of an ideological origin, you have strong evidence that the statistic in question was simply made up and can be dismissed. There is a subtlety here to be mindful of, nonetheless. Even if a particular made-up statistic is false, that tells you nothing about what the true statistic would be. For instance, if someone makes up a statistic that 80% of the Cagot ethnic group live in poverty, and you find out this figure is made up, you still do not know what the actual poverty rate among the Cagot is.3
Unfortunately, both unsourced statistics and statistics with a misrepresented source leave you having learned no new information. Encountering them is not entirely a waste of your time, however, if you take the view that they are opportunities to develop your skills at scrutinizing statistics.
Chain of References
Sometimes you find that the source cited for a statistic, when checked, itself cites another source for the statistic in question. That source might then cite some other source, and that source might yet again cite another source.
Especially when Internet journalism churns out numerous shallow articles on the same topic, this chain of references can get quite long, making the work of establishing provenance needlessly time consuming and tedious. However, all that matters in this process is identifying the origin of the statistic. The chain of references itself neither adds to nor detracts from the credibility of the statistic. Indeed, these chains could be avoided entirely if everyone contributing to them did their due diligence by establishing provenance and citing the original source.
This highlights yet another fallacy that is sometimes used by those who use statistics for support rather than illumination. You should not be impressed when a single statistic has numerous citations included with it. If anything, seeing multiple sources cited for the same statistic should raise your suspicion.
For one, citing multiple supposed sources is easily accomplished by citing every step in a chain of references to a single origin as if each were an independent source. If they all ultimately lead to the same origin, then this multiplicity of citations adds nothing, and presenting them this way is deceptive.
Secondly, when multiple citations lead to different origins, representing them as if they reached the same conclusion is an indication that the sources are being misrepresented. It is unlikely that several studies would arrive at the exact same statistic, due to, if nothing else, the randomization used in inference and the estimation error that comes with it. Thus, when multiple original sources are cited for the same statistic, it is likely that different results are being treated as equivalent when they are not.
Primary versus Secondary Sources
Encyclopedias such as Wikipedia can be helpful when establishing provenance, but with one major caveat. Establishing the provenance of a statistic, as discussed here, means tracing the statistic back to its origin, which is a primary source. Wikipedia has a different mission and prefers secondary sources, as can be read in its documentation:
Wikipedia articles should be based on reliable, published secondary sources and, to a lesser extent, on tertiary sources and primary sources. . . . A secondary source provides an author’s own thinking based on primary sources, generally at least one step removed from an event. It contains an author’s analysis, evaluation, interpretation, or synthesis of the facts, evidence, concepts, and ideas taken from primary sources. (“Wikipedia:No Original Research” 2021)
In this article’s context, a secondary source does the work of scrutinizing a statistic, which is what you the reader are supposed to be doing. A secondary source might be useful for comparison, but is not a replacement for finding the primary source.
That being understood, Wikipedia does contain a lot of references to primary sources, as well, even though its “no original research” policy discourages their use except under limited circumstances. Furthermore, a good secondary source will reference the primary source, and so might be a useful step along a chain of references in order to establish provenance.
Valid statistics come from empirical observations of the world. Your first task in scrutinizing a statistic is identifying where a statistic comes from, which is the primary source of the statistic. Unfortunately, identification of primary sources can be made more complicated by those who do not cite sources, by those who misrepresent sources they cite, and by those who cite as a source something other than the primary source for the statistic. Thus, establishing the provenance for a statistic sometimes requires work on your part, but is a prerequisite for further scrutiny.
Scope of Inference
Once you have found the origin of a statistic, the next two genres of questions to ask about the statistic are both related to what is termed “scope of inference” in statistics courses. Scope of inference is divided into two concerns: whether or not the results of a study generalize to a larger population, and whether or not conclusions can be drawn about cause and effect. The former is a concern about sampling, and the latter is a concern about experimental assignment. Both of these concerns stem from the possibility of confounding variables, and both concerns are addressed by a form of randomization.
Sometimes a statistic is intended to summarize information about the state of affairs in a large population. For instance, a country might be interested in the poverty rate of its citizens, which can constitute a population of millions of individual people. If the statistic is intended to generalize, you should ask questions such as “What population was sampled?” and “How was the sampling done?”
For large populations, it is usually practically infeasible to examine every single individual in the population of interest, which is the practice of taking a census. In these cases, the main approach used to infer information about a large population is to take a representative sample from the population and use measurements of the sample to infer estimates about the population. This necessarily introduces some sampling error into the estimates.
Even in cases in which taking a census is practically possible, it might not be the best way to proceed. While taking a census does eliminate sampling error, it does not eliminate other forms of error. Thus, the resources that might be spent on taking a census might be better spent on other things, because sampling error can be quantified and kept within necessary tolerances, whereas non-sampling error is often more elusive.
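The claim that sampling error can be quantified can be made concrete with the textbook normal-approximation formula for the margin of error of an estimated proportion. The sample size and proportion below are made up for illustration.

```python
import math

p_hat = 0.35  # hypothetical estimated proportion from a random sample
n = 1000      # hypothetical sample size

# Standard error of a sample proportion under simple random sampling.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% margin of error via the normal approximation: in
# roughly 95% of samples, the true proportion lies within this
# distance of the estimate.
margin_of_error = 1.96 * standard_error  # about 0.03, i.e. 3 percentage points
```

Note that, because of the square root, doubling the sample size shrinks the margin of error only by a factor of about 1.4. This diminishing return is one reason a well-designed sample can be a better use of resources than a census.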
In order to generalize a statistic to a larger population, a representative sample must be taken. A sample is representative if every individual in the population had a chance to be included in the sample and the probability of such inclusion is known. In short, a sample is representative if selection for the sample is randomized.
All the individuals that could have been included in the sample constitute the actual sampled population. If the statistical analysis done with a representative sample is valid, then the results generalize to this sampled population. The results do not generalize to other populations.
This is an all-too-common mistake in interpreting statistics. A study based on a representative sample of women members of an electrical engineering society in the Pacific Northwest of the United States and a study based on a representative sample of women working for financial analysis companies in Queensland, Australia are not results about “women.” The results of each study are about two different populations, and there should be no surprise if the two studies arrive at different results.
When you are scrutinizing statistics and encounter a study based on a representative sample, your main task with regard to sampling is to identify what population the results generalize to. The population to which the results generalize is exactly the sampled population – no more, no less.
For a statistic derived from a telephone survey of area codes in Saint Petersburg, Florida, the results generalize to all those who have access to telephones with Saint Petersburg area codes. The results do not generalize to people living in Papua New Guinea.4 The results do not generalize to people who live in Saint Petersburg but do not have access to a telephone, nor to those who live in Saint Petersburg but only have access to telephones whose numbers do not have Saint Petersburg area codes. The results do generalize to those who do not live in Saint Petersburg but have access to telephones with Saint Petersburg area codes, and so on.
A subtlety arises when researchers intend, either explicitly or implicitly, to sample one population, but wind up sampling a different population. In these cases, the target population for the study and the sampled population are not exactly the same, though they might overlap.
This is a phenomenon that often comes up when election predictions fail spectacularly. The target population for a survey that is intended to predict the outcome of an election is all the people who will vote in the election. However, this is a difficult population to identify and to sample.
For instance, nearly all of the surveys leading up to the 2015 British General Election concluded that the election was going to be a dead heat between the two most popular political parties in the United Kingdom, the Conservative and Labour parties. However, the election turned out to be a clear victory for the Conservative Party. The predictions were so far off that an academic investigation into the widespread mistakes was commissioned. The investigation found that the samples taken were unrepresentative of the actual electorate,5 over-representing Labour voters and under-representing Conservative ones. (“General Election Opinion Poll Inquiry Publishes Report” 2016)
When scrutinizing statistics, you are concerned with identifying the actual population that was sampled. The statistic can only be generalized to this sampled population. You should not be distracted by the target population that should have been sampled or even the population the researchers portray their results as generalizing to. The way to determine this sampled population is by paying close attention to how the sampling was done. The sampled population consists of those individuals that were eligible to be selected for the sample.
Sometimes, upon discovering the origin of a statistic, you will find that the sample used to calculate the statistic is not representative of any larger population. In this case, the population to which the statistic can be generalized is just the sample itself. In other words, the results do not generalize. This occurs when the sampling is not randomized.
One kind of unrepresentative sample is the self-selected sample, such as one composed of volunteers. In these cases, the individuals in the sample were not selected by the researchers with some known probability from a larger population. Instead, the individuals selected themselves for inclusion in the sample.
For instance, many behavioral and social science experiments are conducted with some number of volunteers taken from the student body of the researcher’s university. In these cases, results do not generalize even to the student body of the university, let alone to populations outside of the university.
To see why, suppose that a study wanted to determine the average income of students at some university but used a self-selected sample like those used in many social and behavioral science experiments. The students who volunteer for such a sample differ from the general student body in a couple of ways. Volunteers must have enough free time to participate in a study, but students who work a part-time job, and thus have more income, also have less free time, so they might be less likely to volunteer. Furthermore, some volunteers might be attracted by the small payment given to participants, but students who work a part-time job might be less enticed by this payment since they already have income.
Therefore, there are reasons to believe that the self-selected sample differs from the target population in ways related to the variable of interest. Those who have time to participate in a behavioral or social science study for a small payment might have lower average income than the student body as a whole. If this were the case, the average income in the self-selected sample would be less than that in the population, and the sample would not be representative of the student body. A potential confounding variable, i.e., whether or not a student works a part-time job, has been identified.6
This is avoided in representative samples by using randomized sampling. For instance, suppose 35% of students at the university work part-time. In a simple random sample, in which all of the students at the university have an equal probability of being selected,7 the proportion of students in the sample who work a part-time job will also be near 35%. This will be the case not just for this one potentially confounding variable, but for all confounding variables. In this way randomization ensures that a sample will be representative of the population.
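A small simulation illustrates this balancing effect. The student body below is entirely made up: 20,000 students, 35% of whom work part-time.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical student body: 7,000 of 20,000 students work part-time.
population = [True] * 7_000 + [False] * 13_000

# Simple random sample: every student is equally likely to be chosen.
sample = random.sample(population, 500)

share_part_time = sum(sample) / len(sample)
# share_part_time lands near 0.35, and the same would hold for any
# other trait of the students, measured or not.
```

No one had to know which students work part-time for the sample to come out balanced; the randomization does that work automatically, which is exactly why it also handles confounders nobody thought to measure.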
Randomized, Controlled Experiments
Sometimes statistics are not intended to summarize the state of affairs in a larger population, but are intended to summarize the effect of a specific intervention. For instance, in a clinical trial, volunteers are given an experimental therapy such as a new drug in order to determine whether the therapy works and whether it is safe. The intent of a clinical trial is not to describe properties of a population, such as the prevalence of a disease among citizens of a country. Instead, a clinical trial is concerned with describing the effects of the therapy.
Studies such as those described in the previous section on sampling, in which no intervention is made, are typically called “observational studies,” whereas studies in which an intervention is made by the researchers are typically called “experiments.”
If a statistic is intended to describe the effect caused by a specific intervention, you should ask questions such as “Was the intervention compared with a control?” and “Were individuals randomized between the intervention and the control?”
A control in the context of an experiment is important to determine the effect of the intervention. For example, suppose a group of volunteers suffering from a disease are given a potential new therapy, and afterward variables pertaining to the effects of the disease are measured. Even if some volunteers improved after the therapy, it would not be clear what the effects of the therapy were because, for many diseases, some proportion of the population suffering from the illness will get better of their own accord over time.
You should verify, therefore, that experiments include a control group of individuals given something other than the intervention of interest. For a disease with no current treatment, the control group in a clinical trial might be given a placebo, such as an empty capsule, that is known to have no effect on the disease. In trials for diseases that have existing therapies, the control group will often be given the current best therapy. The intent in these cases is to determine whether or not the new therapy performs better than current best practice.
If you encounter a statistic from an experiment without a control that is claimed to shed light on the effect caused by an intervention, you should reject it immediately as misleading. Even if 68% of individuals given snake oil recover from some disease in 10 days, it does not indicate that the snake oil is an effective intervention. For all that is known, 68% or more of the general population recovers from the disease in 10 days, anyway. Experiments without controls might be useful as exclusively exploratory exercises, but no conclusions about cause and effect can be drawn from them.
Randomization is as important for experiments as it is for observational studies. Again, this is due to the potential effects of confounding variables. In an observational study, a confounding variable can make a sample unrepresentative of the population when the distribution of the confounding variable in the sample differs from its distribution in the population. In an experiment, a confounding variable can alter the measured effects of an intervention when its distribution in the control group differs from its distribution in the intervention group.
For example, suppose you learn of a new flu therapy that underwent a clinical trial, and the result of the clinical trial found that all indicators of health – e.g., hospitalization rate, mortality rate, recovery time – were measurably better in the group given the new flu therapy than in the control group. That might seem promising. However, what if you further learn that the average age in the group given the therapy was 32 and the average age in the control group was 67? In this case, if those who are older are more vulnerable to the effects of the flu, age is a confounding variable. This could easily happen if the experimental assignment was not randomized.
Randomization not only reduces the chances of one confounding variable like age being imbalanced between the intervention and the control groups, but randomization, if done well, balances all confounding variables between the intervention and control groups. Thus, randomized assignment ensures that the effects measured in an experiment are indeed caused by the intervention.
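The same balancing can be sketched for experimental assignment. Here, 1,000 hypothetical volunteers with made-up ages are randomly split into intervention and control groups; their average ages come out close together without anyone checking ages during assignment.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical volunteers, each with a made-up age.
ages = [random.randint(18, 80) for _ in range(1_000)]

# Randomized assignment: shuffle, then split down the middle.
random.shuffle(ages)
treatment, control = ages[:500], ages[500:]

mean_treatment = sum(treatment) / len(treatment)
mean_control = sum(control) / len(control)
# The two group means differ by a small amount, typically a year or so,
# and the same balancing happens for every confounder, known or unknown.
```

Contrast this with the flu-trial example above, where the group averages of 32 and 67 are a strong sign that assignment was not randomized.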
Association versus Causation
Misinterpreting association between two variables as a causal relationship is a classic fallacy in interpreting statistical results. The mantra “correlation is not causation” has been drummed into students of statistics courses for decades.
Those who go to a hospital are more likely to suffer illness and death than those who do not. Therefore, there is an association between going to a hospital and illness and death. However, one should not conclude that going to a hospital causes illness and death. The obvious confounding variable here is the existence of a prior illness. Generally, healthy people do not go to a hospital, whereas people who are ill often do. Confusing association with causation in this way can lead people to avoid medical care and suffer adverse health and premature death unnecessarily.
Furthermore, a lot of variables undergo trends over time and so can be falsely associated if the trends occur over the same time period. This is a phenomenon that Tyler Vigen has used to humorous effect in his illustrations of associations between obviously unrelated variables, such as divorce rate in Maine and per capita consumption of margarine.
Invalidly drawing causal conclusions from associations is such a common fallacy that old-fashioned statistics classes sometimes assert that no causal conclusion can be drawn from observational studies and that conclusions about cause and effect can only be made from randomized, controlled trials. For a variety of reasons, this is now largely seen as overly restrictive, and the field of causal inference from observational studies is an emerging field in statistical methodology research.
There is no single recipe for causal inference from observational studies, but what all the methods currently being developed have in common is that they require a lot of work: formalization of concepts, explicit stating of assumptions, verification that the assumptions are plausible, etc. Because of the newness of these techniques and the amount of effort they require, and because of the abundance of fallacious reasoning that infers causation from mere association, it is prudent to treat any conclusion about cause and effect that is derived from a source other than a randomized, controlled experiment with suspicion.
The scope of inference of a statistic involves two kinds of randomization. Randomization in sampling and randomization in experimental assignment might seem very similar. They are both practices that use randomization to address issues that stem from confounding variables. However, they are necessary for two different kinds of inference, and the consequences of these two kinds of randomization are independent of each other.8 Thus, a study might have randomized sampling, randomized assignment, both, or neither. The ramifications for a study are sometimes summarized in a table such as Table 1.
Table 1:

| Was the sampling randomized? | Not a randomized, controlled experiment | A randomized, controlled experiment |
| --- | --- | --- |
| Yes | Conclusions generalize to the population, but no conclusions about cause should be made. | Conclusions generalize to the population, and conclusions about cause can be made. |
| No | Conclusions do not generalize to the population, and no conclusions about cause should be made. | Conclusions do not generalize to the population, but conclusions about cause can be made. |
The important takeaway for the work of scrutinizing statistics that you encounter is that there are two separate questions to address about any given statistic. One question is whether or not a statistic generalizes to a larger population, which can be answered in the affirmative if it is based on a representative sample. Another question is whether or not a statistic is indicative of the effect of an identifiable cause, which can be answered in the affirmative if the statistic is derived from a randomized, controlled experiment.
Practical Significance
Once you have identified the origin of a statistic and determined its scope of inference, the next genre of questions you should ask includes questions such as “What is the practical significance of this statistic?” and “How large an effect is described by this statistic?”
Whether a statistic is practically significant is less a matter of statistical theory and much more a matter of subject-matter knowledge of the context in which the statistic arises.
For instance, suppose you are developing a new drug intended to be used as an alternative to insulin therapy for those with diabetes mellitus. Those with insulin-dependent diabetes mellitus (IDDM) would, without medical intervention, have levels of daily blood glucose higher than what is typical. However, medical research has found that IDDM patients have long-term health benefits if they keep their daily blood glucose levels lowered to within typical ranges. (Diabetes Control and Complications Trial Research Group 1993) This new drug you are developing could therefore be of some use.
However, if you discover the drug decreases daily mean blood glucose by only 5 milligrams per deciliter (mg/dl) on average, its effects are not practically significant, because prior research observed long-term health benefits when IDDM patients lowered their daily mean blood glucose from levels around 230 mg/dl to levels around 130 mg/dl.
Thus, the knowledge used to evaluate the practical significance of the statistic in this example comes from subject-matter knowledge pertaining to diabetes, not from any particular statistical tool.
This example also illustrates another phenomenon: the practical significance of a statistic comes from its comparison to something else. In the diabetes example, the average change in blood glucose levels caused by the drug is evaluated by comparing it to the therapeutic change in blood glucose levels observed in a previous study. A single statistic in isolation without any contextual knowledge has no practical significance.
For example, suppose you learn that the per capita expenditure on food and beverages for off-premises consumption in Michigan during 2020 was $3,532. (Zemanek and Aversa 2021, Table 4) Unless you have detailed economic knowledge of the United States, this statistic probably has very little practical significance to you. Is $3,532 a lot? Is $3,532 a little?9 This statistic, without any context, is impossible to interpret.
With other knowledge for comparison, you can discover the practical significance of the statistic. The $3,532 per capita expenditure on food and beverages for off-premises consumption in Michigan during 2020 was a 10.6% increase from the previous year, despite there being a decrease of 22.6% in the amount spent on food services and accommodations. (Zemanek and Aversa 2021, Table 2) This is consistent with the hypothesis that people in Michigan were dining in restaurants less and eating at home more during the COVID-19 pandemic.
Because comparison is the basis for establishing practical significance, you will sometimes encounter the maddening practice of asserting that one population has “more” of some quantity than another or that a population has “less” of a quantity than another.
Such assertions are worse than no information at all because they create the impression of practical significance while withholding the very information you need to evaluate practical significance, namely, how much more or how much less.
This is the practice of generic comparisons. Generic comparisons assert that there is a difference in some variable between two populations, but do not quantify how much of a difference there is.
Generic comparisons are not informative because it is extremely rare for a variable to be identically distributed in two different populations. Therefore, one population almost always has “more” of something than another.
Generic comparisons can be used deceptively. You could, for instance, assert that patients given the prospective insulin-alternative drug discussed in the earlier example have a “lower” mean daily blood glucose than control group patients, and leave it at that. However, you know very well that the actual difference in mean blood glucose caused by the drug is not practically significant for IDDM patients. By withholding quantification from the comparison between patients given the drug and patients in the control group, you can pretend there is practical significance.
Therefore, whenever you encounter an assertion that there is “more” or “less” of something in one population than in another – or that levels are “higher” or “lower,” “bigger” or “smaller,” “greater” or “lesser,” etc – the very next question that you should demand is “How much more?” or “How much less?”
A quick answer to such questions as “How much more?” has come to be commonly called “effect size.” This is an unfortunate phrase, because it evokes thinking about cause and effect. This is fine if the scope of inference for a statistic is that of a randomized, controlled trial or if the statistic results from causal inference techniques applied to observational studies. However, remember that in purely observational studies, only an association, not causation, can be inferred.
There are various ways in which two populations can have “more” or “less” of some variable, and comparing their distributions fully is a nontrivial undertaking. However, even without going very deep into comparing the distribution of a variable between two populations, you can achieve a solid first impression by considering an appropriate summary statistic of effect size. In order to understand what sort of summary statistic to look for, you must understand the different types of variables you might encounter.
Types of Variables
In statistical analysis, a variable of interest is often called a “response variable.” A response variable is analyzed in comparison to another variable often called an “explanatory variable.”10 Some amount of variation in a response variable is suspected to be on account of variation in the explanatory variable.
For instance, in the insulin-alternative drug example, the response variable in the clinical trial for the new drug is a patient’s mean daily blood glucose level. The explanatory variable in this case is whether or not a patient is given the new drug.
Broadly, there are two types of variables that you might encounter: categorical variables and numerical variables.11 Categorical variables are the sort in which observations consist of assignment to one of a number of discrete levels, whereas numerical variables are the sort in which observations consist of assignment to a value on the number line.
Whether a response variable is categorical or numerical is a separate question from whether an explanatory variable is categorical or numerical. In the insulin-alternative drug example, the response variable is a numerical variable because numbers in the hundreds of mg/dl are used. However, the explanatory variable is a categorical variable with just two levels: given the drug or not given the drug.
Categorical variables are typically summarized using either counts or proportions, and numerical variables are typically summarized using either totals or arithmetic means.12
When the explanatory variable is categorical, you are essentially comparing the response variable between two or more populations, one population for each level of the explanatory variable.
When comparing between two or more populations, proportions or means are usually preferable to counts or totals.13 This is because for many variables, counts or totals tend to increase along with the number of individuals in the populations. Thus, proportions or means result in better comparisons because they include the population size as their denominator.
This is why statistics per capita are preferable to raw counts or totals for social data. For example, the statistician Thomas Lumley has a series of articles in which he expounds on the fallacy of using raw counts or totals when comparing statistics about Auckland, New Zealand, where he lives, to Wellington, New Zealand, all humorously titled something to the effect of “Auckland is larger than Wellington.”
In one article, he comments on a news story that reported Auckland had 1,553 convictions of teenagers for drunk driving while Wellington only had 728. (Lumley 2012b) In another article, he comments on news stories that reported about three times as many burglaries had occurred in Auckland as in Wellington. (Lumley 2012a) These counts are misleading because the population of Auckland is about three times that of Wellington.
What are more relevant for comparing the two places are the per capita rates of teenage drunk driving convictions and of burglary.14 As it turns out, after dividing by the missing denominator of population size, the rate of teenage drunk driving convictions was much lower in Auckland than in Wellington, but the rate of burglary was still slightly higher in Auckland than in Wellington.
Whenever you are presented with a count or total that you need to compare with another quantity in order to determine effect size, you may be able to locate the missing denominator. For instance, within a few minutes, Thomas Lumley was able to find the census figures for the number of teenagers in Auckland and in Wellington from a government website. (Lumley 2012b) Once the relevant population size is identified, conversion from count or total to mean or proportion is a simple operation of division.
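To make the missing-denominator repair concrete, here is a minimal Python sketch using the conviction counts from the news story; the teenage population figures below are hypothetical stand-ins for illustration, not the actual census numbers Lumley found:

```python
# Converting raw counts to per-capita rates by supplying the missing
# denominator. The conviction counts come from the news story discussed
# above; the teenage population figures are ILLUSTRATIVE assumptions.
convictions = {"Auckland": 1553, "Wellington": 728}
teen_population = {"Auckland": 90_000, "Wellington": 30_000}  # hypothetical

# Dividing by the population size converts each count to a rate.
rates = {city: convictions[city] / teen_population[city] for city in convictions}

for city, rate in rates.items():
    print(f"{city}: {rate * 1000:.1f} convictions per 1,000 teenagers")
```

With these assumed denominators, the city with the larger raw count turns out to have the lower rate, which is the direction of the reversal Lumley reported for teenage drunk driving convictions.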
However, you might also encounter cases in which population size is not easily obtained information. In these cases, it is best to withhold judgment about practical significance.
Categorical Explanatory Variable
As previously mentioned, when the explanatory variable is categorical, you are essentially comparing the response variable between two or more populations, one for each level of the explanatory variable. However, what you look for in these cases differs slightly based on the kind of response variable you are dealing with.
Numerical Response Variable
With a numerical response variable, the most common way to report effect size is in terms of the means of the response variable for each such population. The effect size is summarized as the difference in these means. For example, the mean expenditure on food and beverages for off-premises consumption in Michigan in 2020 was $3,532, whereas it was $2,872 in Arkansas in 2020. The effect size of living in Michigan versus living in Arkansas can be summarized as the difference in means of $3,532 − $2,872 = $660.
In these cases, you should search both for the mean response variable values reported individually and for the difference in means, because the mean values themselves shed light on the relative magnitude of the difference. A difference of $660 when the individual means are $3,532 and $2,872 is relatively smaller than when the individual means are $1,300 and $1,960, though the difference is the same in either case.
If neither the individual means nor the differences in means are reported, then you should withhold judgment regarding practical significance, and you should question the veracity of the source.
Sometimes, other statistics are reported instead of the means of the response variable. For various technical reasons, medians might be reported instead of means.15 However, your scrutiny of medians is similar to that of means: look for both the individual median values of each population and their difference, prefer to have both, and demand at least one of them.
Furthermore, sometimes standardized statistics are reported instead. Mean or median values of the response variable and their differences are unstandardized statistics: they are reported in the original units of measurement of the observations and relate back to the real-world quantities on which they are based. A popular standardized statistic for this case results from dividing the difference in means of the response variable by the standard deviation of the response variable – commonly called “Cohen’s d” after the psychologist who popularized the practice.
Standardized statistics for effect size have their uses.16 However, whenever you can, you are better off consulting unstandardized statistics since they are more readily related to the real world and thus to subject-matter knowledge, which is the ultimate origin of practical significance. Thus, you should welcome standardized statistics of effect size as additional pieces of information, but still look for the unstandardized statistics, such as mean or median values of the response variable and their differences, and be skeptical when the unstandardized statistics are not provided.
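As an illustration of the difference between the two kinds of statistics, the following Python sketch computes both an unstandardized effect size (a difference in means, in mg/dl) and Cohen’s d for the insulin-alternative scenario; the observations are invented for illustration:

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized effect size: the difference in means divided by the
    pooled standard deviation of the two groups (Cohen's d)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical daily mean blood glucose (mg/dl) for drug vs. control groups.
drug = [228, 224, 231, 220, 227]
control = [233, 229, 235, 226, 232]

unstandardized = mean(drug) - mean(control)  # in mg/dl: directly interpretable
standardized = cohens_d(drug, control)       # unitless
```

The unstandardized difference of −5 mg/dl can be compared immediately with the roughly 100 mg/dl therapeutic change from the diabetes research, whereas the unitless Cohen’s d cannot be related to that subject-matter benchmark without extra work.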
Finally, note that both means and medians are measures of just the central tendency of the distribution of a variable. The central tendency of a variable is only one way in which the variable’s distribution can differ. Therefore, the difference in means or medians is only an initial summary of effect size, not a complete comparison of how the response variable differs in different populations.17
Categorical Response Variable
When the response variable is categorical instead of numerical, but the explanatory variable is still categorical, you are still essentially comparing the response variable between two or more populations. However, the response variable is now expressed in terms of proportions instead of means. Therefore, in this case, everything discussed in the previous section should be reiterated, but with proportions instead of means and differences of proportions instead of differences of means.
However, the major change in the case of a categorical response variable is that differences of proportions are not as ubiquitous a statistic of effect size as differences of means are in the numerical response case. This section, therefore, describes some additional summary statistics.
On the one hand, there is nothing wrong with using a difference in proportion to describe an effect size. For example, in September 2021, 66.67% of flights of jetBlue Flight Number 2495 from Newark to Miami were delayed more than half an hour, whereas 56.67% of flights of jetBlue Flight Number 2695, also from Newark to Miami, were delayed more than half an hour. (Bureau of Transportation Statistics 2021) The effect size can be summarized as a difference of 66.67 − 56.67 = 10 percentage points.
However, for effect sizes such as this that compare two different rates, there are two other summary statistics that are quite common: relative risk and odds ratio.
Relative risk18 is defined as the ratio of two proportions. For example, when choosing between Flight 2495 and Flight 2695, you might be curious how much less likely you are to experience a delay in September 2021 taking Flight 2695 instead of Flight 2495. This can be summarized with the relative risk of 56.67% / 66.67% = 85%.
While relative risk is a useful summary statistic for effect size with a categorical response variable, when you encounter a relative risk, you should also look for the original, absolute risks, for reasons similar to why it is important, in the numerical response variable case, to look for the individual means themselves in addition to the differences in means. Relative risk is just that, relative, and can make risks that are small and risks that are large seem equivalent.
For instance, if the proportion of delayed flights for Flight 2495 was only 11.11%, and for Flight 2695 only 9.44%, the relative risk of experiencing a delay on Flight 2695 instead of Flight 2495 would again be 85%. However, the absolute risk of experiencing a delay on either flight is much smaller in this case, so should probably weigh less in your decision making.
The odds of an event is defined as the probability of the event occurring divided by the probability of the event not occurring. A proportion representing a risk can be expressed as an odds by dividing it by the quantity one minus itself. The odds of experiencing a delayed flight on Flight 2495 would be 66.67% / (1 − 66.67%) = 2 to 1, and the odds of experiencing a delayed flight on Flight 2695 would be 56.67% / (1 − 56.67%) = 1.31 to 1.
An odds ratio is the odds of one event divided by the odds of another event. The odds ratio for experiencing a delay on Flight 2695 compared with Flight 2495 in September 2021 is 1.31 / 2 = 0.655.
This value of 0.655 might not seem very enlightening. Indeed, for moderate proportions such as 66.67% and 56.67%, an odds ratio is best interpreted only relative to other odds ratios. The main reason you encounter odds ratios is that in retrospective or “case-control” studies,19 relative risks cannot be directly calculated, but odds ratios can, and when the risks are very small or very large, odds ratios approximate relative risks. For instance, when Flight 2495 has 11.11% of its flights delayed and Flight 2695 has 9.44% of its flights delayed, the odds ratio is 0.834, which is close to the 0.85 relative risk.
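The following Python sketch reproduces the flight-delay calculations, showing how the relative risk and the odds ratio are computed from the same pair of proportions, and how the two converge when the risks are small:

```python
def relative_risk(p1, p2):
    """Ratio of two proportions: the risk of the event under condition 1
    relative to the risk under condition 2."""
    return p1 / p2

def odds(p):
    """Odds of an event: its probability over the probability of non-occurrence."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    """Odds of one event divided by the odds of another."""
    return odds(p1) / odds(p2)

# Proportions of delayed flights from the September 2021 example.
p_2695, p_2495 = 0.5667, 0.6667

rr = relative_risk(p_2695, p_2495)  # about 0.85
or_ = odds_ratio(p_2695, p_2495)    # about 0.65

# With small risks, the odds ratio closely approximates the relative risk.
rr_small = relative_risk(0.0944, 0.1111)  # about 0.85
or_small = odds_ratio(0.0944, 0.1111)     # about 0.83
```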
Numerical Explanatory Variable
When the explanatory variable is numerical instead of categorical, effect size can again be presented in either unstandardized form or standardized form. Unstandardized statistics of effect size are again preferable because they can more easily be related to subject-matter knowledge. However, with numerical explanatory variables, the most direct way to summarize an effect size using an unstandardized statistic involves model-based analysis, which is a more involved topic that is discussed in the next section.
This section describes standardized statistics of effect size for numerical explanatory variables, which are typically correlation coefficients. There are too many kinds of correlation coefficients to summarize them exhaustively here, but there is one family of correlation coefficients that is used very commonly and is summarized in this section.
This family includes the Pearson product-moment correlation coefficient, the Spearman rank correlation coefficient, and related statistics such as the point-biserial correlation coefficient. These correlation coefficients are so common that if a value of r or ρ is presented as a correlation statistic without any other qualification, it is likely that one of these correlation coefficients has been used.20
These correlation coefficients range from a value of -1 to a value of 1. A value of 1 represents an exact correlation, a value of 0 represents no correlation, and a value of -1 represents an exact inverse correlation.21
The Pearson product-moment correlation coefficient measures the linear association of two numerical variables. If for every increase in one variable, there is a proportional increase in the other variable, then the Pearson coefficient would be r = 1. Similarly, values of the Pearson correlation coefficient close to −1 indicate that the larger one variable gets, the more likely the other variable is to be smaller. A Pearson correlation coefficient close to 0 indicates that there is no linear association between two variables. Figure 1 illustrates when Pearson correlation coefficients are close to 1, 0, and -1.
One of the major shortcomings of the Pearson product-moment correlation coefficient is that it only reflects linear correlation between two variables. Another major shortcoming is that the Pearson correlation coefficient is not robust against outliers: even if there is a strong correlation among the majority of the observations, a relatively few observations outside this trend can pull the Pearson correlation coefficient disproportionately toward 0.
The Spearman rank correlation coefficient addresses these shortcomings by first converting values to their ranks and then performing a similar calculation to the Pearson correlation coefficient.22 It is also interpreted on the same scale of -1 to 0 to 1. However, the Spearman rank correlation coefficient only captures monotonic relationships.
In a monotonic association, one variable always increases or always decreases as the other variable increases over the entirety of the variables’ ranges. If a correlation involves a variable increasing as the other variable increases over part of the other variable’s range, but decreasing as the other variable increases over another part, the Spearman correlation coefficient will be disproportionately closer to 0.
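To see the difference between the two coefficients, consider this Python sketch, which implements both from their definitions (the simple rank computation assumes no tied values) and applies them to a monotonic but nonlinear relationship:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson product-moment correlation: measures linear association."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# A monotonic but nonlinear relationship: y = x ** 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

r_pearson = pearson(xs, ys)    # below 1: the relationship is not linear
r_spearman = spearman(xs, ys)  # exactly 1: the relationship is monotonic
```

Here the Spearman coefficient is exactly 1 because one variable always increases with the other, while the Pearson coefficient falls short of 1 because the relationship is not linear.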
In cases in which the response variable is categorical and the explanatory variable is numerical, there are versions of correlation coefficients similar to the Pearson product-moment correlation coefficient. For instance, the point-biserial correlation coefficient can be used when a response variable has two possible levels and the explanatory variable is numerical. It is also interpreted on the same -1 to 0 to 1 scale.
Whichever correlation coefficient you encounter, you should understand what the correlation coefficient does and does not reflect and what the scale used for its values indicates.
Model-Based Analysis
Sometimes you will encounter estimates of effect size that come from fitting a statistical model to the observations. As mentioned in the previous section, this is the most direct way to present an unstandardized estimate of effect size when dealing with a numerical explanatory variable. Commonly, model-based statistics for estimation of effect size are derived from techniques such as analysis of variance (ANOVA), regression analysis, or generalized linear models (GLMs).
Unfortunately, interpreting these models is too involved a topic to quickly summarize here. Indeed, the majority of the content of applied statistics courses at the university level is learning how to deploy and interpret these models. Therefore, this section only discusses a very elementary example and mentions several ways in which interpretation can become more complicated.
A very elementary example involves a numerical response variable analyzed in terms of a single numerical explanatory variable. For this, a simple linear regression model can be used. This model consists of the relationship ŷ = β1x + β0, where y is the response variable, x is the explanatory variable, and β1 and β0 are parameters of the model. The mark above the “y” in “ŷ” indicates that the value calculated via this model is an estimate and is not expected to be exactly the observed value.
Fitting this model to a set of observations consists of estimating optimal values for the parameters β1 and β0. In this case, the estimate of β1 is an estimate of the effect size. The units of measurement of β1 are the ratio of the units of the response variable to units of the explanatory variable, and so this is an unstandardized estimate of the effect size.23
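A minimal Python sketch of fitting this model by ordinary least squares, with invented dose-response data, shows how the estimate of β1 carries the units of the response per unit of the explanatory variable:

```python
from statistics import mean

def fit_simple_linear(xs, ys):
    """Least-squares estimates of the slope b1 and intercept b0 for the
    simple linear regression model y_hat = b1 * x + b0."""
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b0 = my - b1 * mx
    return b1, b0

# Hypothetical data: drug dose (mg) vs. change in blood glucose (mg/dl).
dose = [0, 10, 20, 30, 40]
change = [-1, -3, -4, -7, -9]

b1, b0 = fit_simple_linear(dose, change)
# b1 is in (mg/dl of glucose) per (mg of dose): an unstandardized effect size.
```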
Interpretation of models is not always straightforward, for a variety of reasons.
Sometimes instead of fitting a model to the raw observations, transformations are used to process the raw values into values more amenable to analysis, and the model is fit to the transformed data. This has the result of changing the interpretations of parameters such as β1, depending on the particular transformations used.
Most models rely on assumptions about the observations to which they are fit. The mentioned transformations are often motivated by the need to satisfy the assumptions of the models. This raises another issue with interpretation of statistics of effect size deriving from model-based analysis, namely, that these estimates are only valid inasmuch as the model is validly applied. Certain violations of the assumptions of a model can distort estimates generated by the model so badly that the estimates become worthless. Therefore, whenever interpreting statistics derived from model-based analysis, you should verify that the assumptions of the model have been checked.
While linear models such as ANOVA and linear regression can be fitted to observations with a numerical response variable, a categorical response variable necessitates use of generalized linear models, which include techniques such as logistic regression, log-linear regression, etc. Generalized linear models include a link function that changes the relationship between the left and right sides of a regression equation such as ŷ = β1x + β0, and so changes the way in which parameters such as β1 should be interpreted.
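As a brief illustration of how a link function changes interpretation, the following Python sketch uses the logit link of logistic regression, under which exp(β1), rather than β1 itself, describes the multiplicative change in the odds of the response per unit of x; the parameter values here are arbitrary:

```python
import math

def predict_prob(x, b1, b0):
    """Logistic regression: the logit link maps b1 * x + b0 to a probability."""
    return 1 / (1 + math.exp(-(b1 * x + b0)))

def odds_at(x, b1, b0):
    """Odds of the response at a given value of the explanatory variable."""
    p = predict_prob(x, b1, b0)
    return p / (1 - p)

# Arbitrary parameter values, for illustration only.
b1, b0 = 0.5, -2.0

# With a logit link, a one-unit increase in x multiplies the odds of the
# response by exp(b1), rather than adding b1 to the response itself.
ratio = odds_at(3, b1, b0) / odds_at(2, b1, b0)  # equals exp(0.5)
```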
Thus, discussion of all of the different kinds of interpretation and verification needed for evaluating statistics from model-based analysis requires a textbook-length treatment. This section is intended only to make you aware of model-based approaches to summarizing an effect size.
Hitherto, all of the discussion of the estimation of effect size has focused on cases in which the analysis uses only one explanatory variable. The real power of model-based analysis is the ability to incorporate multiple explanatory variables into the analysis. This allows for the model to handle interactions, in which the effect of one explanatory variable changes depending on another explanatory variable, and to adjust the estimated effect of an explanatory variable for other confounding variables. This latter issue is so important for scrutinizing practical significance that it is the focus of the following section.
Confounding Variables
You saw earlier that randomized, controlled experiments use randomization of assignment between intervention and control groups in order to equalize the effects of confounding variables. However, observational studies do not have this guarantee of handling the effects of confounding variables with the experimental design itself, and so must deal with confounding variables in their analysis.
Indeed, in any observational study, the effect of any single explanatory variable on the response variable is likely to be heavily confounded. Therefore, when scrutinizing any statistic derived from an observational study purported to summarize an effect size, you should check if any potential confounding variables have been adjusted for. If not, the statistic may not summarize much of anything.
For instance, Glenn Kessler of the “Fact Checker” column of the Washington Post once correctly criticized the practice of then President of the United States Barack Obama repeatedly stating between 2013 and 2014 that women in the United States make 77 cents for every dollar made by men. (Kessler 2014) The statistic was calculated by the U.S. Census Bureau and is technically correct. However, President Obama repeatedly cited this statistic while discussing issues in gender-based prejudice in wages.
Gender-based prejudice in wages is a serious issue, but the 77-cents-for-every-dollar statistic was derived simply by taking the median annual income for women in the United States and dividing it by the median annual income for men in the United States. Thus, it compares those just beginning their careers with those who are multiple decades into a career, it compares those working in petroleum engineering with those working in early childhood education, etc.
Thus, even at first blush, there are two potentially confounding variables: how far along people are in their careers and what type of job they work. In short, the use of this statistic as a summary of the effect size of gender differences – let alone gender prejudice – in wages is so heavily confounded as to be entirely uninformative. Using this statistic as a metric by which to measure progress in eliminating gender prejudice in wages would be entirely misguided.
While this sort of rough, back-of-the-envelope statistic might be used in the exploratory work at the beginning of analysis, it should not be used to inform policy analysis. Perhaps more disappointing was the attitude of Betsey Stevenson, a member of the White House Council of Economic Advisers, who said, “There are a lot of things that go into that 77-cents figure, there are a lot of things that contribute and no one’s trying to say that it’s all about discrimination, but I don’t think there’s a better figure.” (Kessler 2014)
A better figure would be any of the statistics reported in quality social science research by scholars who study these issues, and any such analysis of quality would at least adjust for the most obvious confounding variables.24 This is an instance where there appears to be a disconnect between social research and policy making. Thus, by scrutinizing statistics for confounding variables, you can already exceed presidential standards of statistical scrutiny.
Generally, any sound observational study should have at least adjusted for the confounding variables that someone familiar with the subject can think of. If you identify a potentially confounding variable for which an observational study has not adjusted, then you have identified a major flaw with any statistic derived from the study.
However, it is impossible to ensure that an observational study has adjusted for every potentially confounding variable. There is always the possibility of a confounding variable no one has thought of. This is the major limitation of observational studies and the major appeal of randomized, controlled trials. Therefore, the prospect of confounding should always be kept in mind when scrutinizing statistics derived from observational studies.
The practical significance of a statistic is derived from subject-matter knowledge and, more specifically, comparison of a statistic to some other contextual knowledge. Because practical significance is based on comparison, you should reject generic comparisons as not only uninformative, but worse than no information at all because of the potential for them to be used deceptively. In order to evaluate the practical significance of a statistic, you should demand a quantification of effect size.
The way in which an effect size is summarized depends on the type of variables being analyzed. Generally, unstandardized statistics of effect size are preferable to standardized statistics because they relate back to the real world more readily. You should be cautious of common issues with summary statistics of effect size, such as missing denominators or confounding variables. Some summary statistics of effect size, such as two mean values and their difference, are straightforward to interpret, but others, such as correlation coefficients or estimates from model-based analyses, require more statistical acumen to interpret properly.
Estimation Error and Confidence Intervals
Usually, the statistics you encounter are actually estimates of some quantity that is not known exactly. This occurs whenever analysis involves statistical inference.
Recall the two major kinds of inference from the section on scope of inference. If a statistic comes from sampling of a larger population, the statistic derived from observations of the sample estimates something about the population. If a statistic comes from a randomized, controlled experiment, the statistic derived from observations in the experiment estimates the true effect of the intervention.
In both kinds of inference, randomization is used. This randomization introduces error into the estimates. When sampling a population, depending on exactly which individuals are selected for the sample, the exact values of statistics calculated from observations of the sample will be slightly different. For example, consider a small population of people of various heights.
The mean height in this population is 64.51 inches, or about 5’ 4.5”. For illustration purposes, take three different simple random samples of 5 people from this population and measure the heights of everyone in each sample.
All of the sample means (64.42, 65.52, 66.32 inches) are near the true population mean of 64.51 inches, but there is some variation around the population mean. This variation in the values of the sample mean height is the sampling error.
In practice, you will only be taking one sample, and only one of the sample means will be observed. The sample mean is an estimate of the population mean; it will be near, but not exactly, the population mean. Therefore, it is useful to have a way to quantify expectations of the difference between an estimate from the sample and the true value in the population that it estimates.
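You can see sampling error directly by simulation. This Python sketch draws repeated simple random samples from a hypothetical population of heights; each sample mean lands near, but not exactly at, the population mean:

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

# A hypothetical population of 200 heights (inches), roughly normal.
population = [random.gauss(64.5, 3.0) for _ in range(200)]
true_mean = sum(population) / len(population)

# Each simple random sample of 5 yields a slightly different sample mean;
# the variation among these means is the sampling error.
sample_means = [
    sum(sample) / len(sample)
    for sample in (random.sample(population, 5) for _ in range(3))
]
```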
Thus, the randomization used to generalize results to a larger population introduces sampling error. Similarly, the randomization used to identify causation in an experiment introduces error resulting from experimental assignment. Collectively, this article refers to both these kinds of error as “estimation error.”
Estimation error does not make a statistic invalid, because estimation error can be quantified and kept to within acceptable tolerances. However, this does mean whenever you are scrutinizing a statistic that involves inference, you should look for and demand quantification of the estimation error, because it is possible that estimation error is so great that an interpretation of the statistic is actually untrue.
The customary way to describe quantification of estimation error is with a confidence interval.25 A confidence interval is summarized as a range of values and a confidence level between 0 and 1, such as 95%, which has become a common level for confidence intervals. A 95% confidence interval makes the claim that if a randomized sampling or a randomized experimental assignment were repeated many times and a confidence interval were calculated from each of these repetitions, 95% of these confidence intervals would contain the true value being estimated.
Thus, confidence intervals give an indication of the estimation error associated with a statistic in terms of the quantity being estimated.26
A fallacy in interpreting confidence intervals is that they describe the dispersion of a variable in some way. The dispersion of a variable is how spread out the values of the variable are over all the individuals being considered and is usually described by the standard deviation or variance of a variable. Confidence intervals say nothing about the standard deviation or variance of a variable, unless they are specifically confidence intervals for estimates of a standard deviation or variance. Confidence intervals only describe the estimation error associated with an estimate.
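One way to see the distinction: as a sample grows, the sample standard deviation (dispersion) settles near the population value, while the confidence interval for the mean (estimation error) keeps shrinking. A sketch with hypothetical heights:

```python
# Sketch: dispersion vs. estimation error. The sample standard
# deviation stays roughly constant as the sample grows, while the 95%
# CI half-width for the mean shrinks. (Invented population.)
import math
import random

random.seed(1)

TRUE_MEAN, TRUE_SD = 64.51, 2.5
results = {}

for n in (25, 400):
    xs = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    half = 1.96 * sd / math.sqrt(n)   # 95% CI half-width for the mean
    results[n] = (sd, half)
    print(f"n={n:4d}  sample sd={sd:.2f}  CI half-width={half:.2f}")
```

A narrow confidence interval therefore says the mean is precisely estimated; it says nothing about how spread out the individual values are.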
Confidence intervals are particularly useful when evaluating whether an effect size is statistically significant. This is a separate question from whether an effect size is practically significant. To see what statistical significance means in this context, consider a case in which a confidence interval for the difference in two mean values contains 0 or a case in which a confidence interval for a relative risk contains 1.
If a confidence interval for a difference in means contains 0, then it is within the realm of probability that there is no actual difference in means, and any supposed difference is due to estimation error. Similarly, if a confidence interval for a relative risk contains 1, then it is within the realm of probability that the two risks are identical, and any alleged difference is due to estimation error. These are statistically insignificant results.
Statistically significant results are just the opposite. When a confidence interval for the difference in means does not contain 0 or a confidence interval for a relative risk does not contain 1, then it is improbable that the entirety of the effect size is due to estimation error. These are statistically significant results.27
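As a sketch of this check, the invented data below produce a confidence interval for a difference in means that excludes 0 (a large-sample z interval is used for simplicity, which is only an approximation at these small sample sizes):

```python
# Sketch: judge statistical significance from a confidence interval
# for a difference in means -- significant only when the interval
# excludes 0. Data and groups are invented.
import math

def diff_ci(xs, ys, z=1.96):
    """Approximate 95% CI for mean(xs) - mean(ys) (large-sample z interval)."""
    def mean_var(v):
        m = sum(v) / len(v)
        return m, sum((x - m) ** 2 for x in v) / (len(v) - 1)
    mx, vx = mean_var(xs)
    my, vy = mean_var(ys)
    diff = mx - my
    se = math.sqrt(vx / len(xs) + vy / len(ys))
    return diff - z * se, diff + z * se

lo, hi = diff_ci([5.1, 4.9, 5.3, 5.0, 5.2], [4.6, 4.8, 4.5, 4.9, 4.7])
significant = not (lo <= 0 <= hi)   # interval excludes 0 -> significant
print(round(lo, 2), round(hi, 2), significant)
```

Here both ends of the interval are positive, so it is improbable that the entire observed difference is estimation error.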
Note that statistical significance is a relatively weak claim. When dealing with effect sizes, it is merely the claim that the true effect size is not a null value, to use statistical jargon. Statistical significance thus attributes at least some of an effect size to something other than estimation error. This article deliberately discussed practical significance first and in more detail, because confusing statistical significance with practical significance is a fallacy.
To avoid this fallacy, remember that statistical significance does not imply practical significance. The reverse implication, that practical significance implies statistical significance, does hold: there cannot be practical significance without statistical significance. This is because an effect size of nothing, no matter the subject-matter context, is always practically insignificant, and when an effect is statistically insignificant, an effect size of nothing remains probable.
Fortunately, you can evaluate both statistical and practical significance simultaneously when confidence intervals are provided. A rule of thumb: evaluate the practical significance of a statistic using the extreme of the confidence interval that is least favorable to whatever working hypothesis you are considering. When an effect size is practically insignificant, this least favorable extreme will fail to clear your threshold of practical significance.28
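The rule of thumb reduces to a one-line check; the interval endpoints and the threshold below are invented for illustration, assuming a working hypothesis of a positive effect:

```python
# Sketch of the rule of thumb: to claim practical significance, the
# end of the CI least favorable to a positive-effect hypothesis (the
# lower end) must still clear the subject-matter threshold.
def practically_significant(ci, threshold):
    """True only if even the least favorable end of the CI exceeds
    the practical-significance threshold."""
    lo, hi = ci
    return lo > threshold

print(practically_significant((0.8, 2.4), threshold=0.5))   # True
print(practically_significant((0.1, 2.4), threshold=0.5))   # False: effect could be trivially small
```

The second interval is statistically significant (it excludes 0) but fails the practical check, illustrating why the two judgments are separate.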
In cases in which an effect is statistically insignificant, a confidence interval for an effect size leads to ambiguous interpretations. For instance, if the difference in means between two populations is statistically insignificant, then the confidence interval for this difference will have a negative value at one end and a positive value at the other.
Any statistic derived from statistical inference brings with it some estimation error. Statistical inference occurs in any analysis that involves sampling a population or randomizing assignment of individuals in an experiment. When scrutinizing statistics that come from inference, you should look for and demand some quantification of the estimation error. The most common way to quantify estimation error is with a confidence interval. Confidence intervals allow you to judge both statistical significance and practical significance.
The preceding concludes all of this article’s discussion about what kinds of questions to ask when scrutinizing a statistic. This last section discusses some fallacies in empirical research or, more specifically, some common deceptive tactics used in the presentation of empirical research. You should be aware of these in order to spot them and avoid being tricked by them.
Most of the fallacies discussed in this section involve p-values. There is nothing intrinsically wrong with p-values. They are important in the development of the theory of statistical inference and thus important for all statisticians to learn. Furthermore, they are one useful factor in applied statistics where a binary “yes or no” decision must be made, though they should not be used exclusively to make such a decision without also paying attention to other things. It is unfortunate that p-values, for all their worth, have been so abused.
Many statistical hypothesis tests are framed in terms of testing a null hypothesis. Null hypotheses are of the sort that a difference in means is 0 or that a relative risk is 1. In short, they are hypotheses that there is no effect. Rejection of a null hypothesis is synonymous with statistical significance. They are two ways to describe the same phenomenon.
A p-value is the probability of getting the observed statistic (or a statistic more extreme) if the null hypothesis were true. A low p-value is evidence for rejecting the null hypothesis and hence for statistical significance. The mathematics of the confidence intervals discussed in the previous section is derived in statistical inference theory by inverting a null hypothesis test, so confidence intervals and null hypothesis tests are heavily intertwined.
Null hypothesis tests are usually done by defining a size or level of the test. Common hypothesis test sizes include 0.05, 0.10, and 0.01, though 0.05 seems to be used the most. The size of the test defines a rejection region for the test. Under this scheme, if the p-value is less than or equal to the size of the test, you reject the null hypothesis, and you reach a conclusion of statistical significance. If the p-value is greater than the size of the test, you fail to reject the null hypothesis, and nothing can be concluded.
This approach has been used in many academic journals, though there is a movement away from the rejection region approach. Nakagawa and Cuthill (2007) summarize well the reasons for this move in general, though they write specifically for the field of biology.
Perhaps because of the historic prevalence of null hypothesis testing in academic journals, the p-value comes up again and again when discussing common fallacies of empirical research. The remainder of this section summarizes these fallacies.
Not Reporting Effect Size
One of the most deceptive fallacies in empirical research is the practice of generic comparisons, i.e., asserting that there is an effect without reporting an effect size. As previously discussed, you should always demand a quantification of an effect size in order to evaluate the practical significance of an alleged effect.
Sometimes effect sizes are quite blatantly unquantified. Other times the use of p-values in academic journals can be used to mask the lack of an effect size being reported. This is because p-values and related numbers are often listed with statistical results in academic journals, and the presence of these numbers might give the appearance of the quantification of effect size. However, a p-value does not describe an effect size.
For instance, below is an excerpt from a psychology paper published in an academic journal.
Using a questionnaire designed to examine prevailing ethnic and gender stereotypes, we confirmed that the stereotype that Asians are quantitatively gifted prevails more in America than in Canada, t(81) = 2.07, p < .05, r = .22. (Shih, Pittinsky, and Ambady 1999)
You should know by now that the obvious question to ask of this is, “How much more?” A few numbers were given, so you might think that the answer to this question was provided. However, it was not. The value of t is a test statistic that comes from null hypothesis testing and is thus redundant with the p-value. However, at least the actual value of t was given in this case. The p-value was not explicitly given, though the paper asserted it was less than 0.05.
The r value is likely a correlation coefficient, which can be a measure of effect size, albeit a standardized one. However, the paper did not state what correlation coefficient was used, and regardless, it would be more direct to use an unstandardized statistic for effect size. If the correlation coefficient is on the -1 to 1 scale of a Pearson or Spearman correlation coefficient, then 0.22 reflects rather low correlation. Since this was the only indication of practical significance given, you are left to conclude there is not much practical significance to this difference at all.
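As an aside, and only as an assumption since the paper does not say which coefficient it used, the reported r is consistent with the common conversion of a t statistic into an effect-size r:

```python
# Hedged check (an assumption, not stated in the paper): if the
# reported r is the common t-to-r effect-size conversion
# r = t / sqrt(t**2 + df), the quoted numbers are consistent.
import math

t, df = 2.07, 81
r = t / math.sqrt(t**2 + df)
print(round(r, 2))   # 0.22, matching the reported value
```

If that is indeed how r was computed, it is a standardized by-product of the significance test rather than an independently reported effect size, which reinforces the point that no direct effect size was given.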
In short, despite three numbers being listed, the effect size was not directly given, and the only information you have about it suggests it is not practically significant. You were simply not told how much more prevalent the stereotype that Asians are quantitatively gifted is in America than in Canada.
This point was not a minor tangent discussed in the paper. The paper’s thesis turned on this point, since it was a premise in the interpretation of the experimental results. You might, rightly, start to wonder if the effect size was not directly given because it was indeed not practically significant, and the authors did not want to undermine their own thesis.
The intent of this example is not to pick on any one specific academic paper. This example was chosen because it comes from a paper cited thousands of times, authored by researchers at Harvard University, and published in the journal Psychological Science. It is not from the margins of academic research.
This is a reminder to you that just because a statistic comes from a paper that has gone through peer review and been published in a prestigious journal does not mean that your work of scrutinizing the statistic has been done for you. This is especially true because the practice of reporting p-values instead of effect sizes has unfortunately gone unchecked in academic journals for so long.
Using Inferential Statistics Without Statistical Inference
Confidence intervals and p-values are ultimately ways of describing estimation error. Estimation error, as was discussed, comes from the randomization used in statistical inference. In short, confidence intervals and p-values are inferential statistics. Inferential statistics are sometimes contrasted with descriptive statistics, denoting statistics that are not predicated on any statistical inference. It does not make any sense to use inferential statistics when there is no statistical inference.
If a sample is representative of a larger population, then observations of the sample can estimate quantities of the population. The sampling error involved in this is what confidence intervals and p-values are trying to describe. However, if a sample is known not to be representative of a larger population, it is deceptive to use these inferential statistics because there is no sampling error described by them. Thus, if you do not have a representative sample, you do not have any population about which to infer, and you should not be using inferential statistics.
If assignment in an experiment between intervention and control groups is randomized, you can attribute the observed effects in the experiment to the intervention. However, you are estimating these effects, because individuals in the experiment are not all identical, so there is some variation depending on exactly which individuals are assigned to intervention or control groups. Confidence intervals and p-values can be used to describe the error that comes from this randomized assignment. However, if no randomization is used in assignment, these inferential statistics describe nothing. Indeed, if no randomization is used in assignment, then you cannot be sure the observed effects are even caused by the intervention.
However, some academic papers report inferential statistics for results that do not come from statistical inference. This might be due to the requirements of academic journals, a desire to make results seem more quantitatively sophisticated, or a lack of facility with descriptive statistics. Regardless, you should be able to spot these cases and properly judge them as fallacies.
Fortunately, the ability to spot these cases comes directly from what you learned in the sections on scope of inference and on estimation error. Thus, you should dismiss any inferential statistics, such as confidence intervals or p-values, when used with a study whose scope of inference is in the lower left corner of Table 1.
For example, Volk and Atkinson (2013) use descriptive statistics to quantify estimates of infant and child mortality in preindustrial societies. This is an excellent example of how quantitative analysis without a representative sample should be done.
It is impossible to acquire a representative sample of all hunter-gatherer and agriculturalist societies in human pre-history and history. Indeed, much of this information has been forever lost to time. However, neither is it the case that there are no relevant observations whatsoever. The authors reviewed what historical, ethnographic, and archaeological studies exist and found a relatively consistent trend, giving researchers an idea of the magnitude of infant and child death over the centuries.
The authors, in the otherwise excellent paper, did slip into the fallacy of using inferential statistics in a non-inferential context once.
While [average rates of infant and child mortality in agriculturalist societies] are still much higher than modernized values, when compared using a Student’s t-test they are significantly lower than either the historical or hunter-gatherer [mortality rates] (agricultural vs. historical [infant mortality rate] t = −2.22, p = 0.036, [child mortality rate] t = −3.743, p < 0.01; agricultural vs. hunter-gatherers [infant mortality rate] t = −3.928, p < 0.01, [child mortality rate] t = −3.599, p < 0.01). (Volk and Atkinson 2013)
Student’s t-test is a null hypothesis test pertaining to differences in mean values. The null hypothesis of a Student’s t-test is about means in the population. What is the population in this context? The target population would be every pre-industrial society that ever existed. The sampled population consists of whatever societies happened to have quality research published in academic journals about their infant or child mortality rates.
The authors reviewed 20 studies of hunter-gatherer societies, 22 studies of agriculturalist societies, and 43 historical studies. Considering the countless pre-industrial societies that have existed, this is the epitome of an unrepresentative, entirely opportunistic sample. Thus, there is no random sampling, there is no sampling error to quantify, and these p-values are meaningless.
The authors did an excellent job using purely descriptive statistics elsewhere in the paper, and their review is highly informative. However, the gap between the actual sample and the target population, which this misuse of inferential statistics highlights, is striking.
Not Adjusting for Multiple Comparisons
Sometimes an analysis performs not just one null hypothesis test, but several. Indeed, some analyses perform numerous null hypothesis tests and report on whatever results are statistically significant.
The danger in this practice is that null hypothesis tests all have, as previously described, a size. This is typically denoted by the Greek letter alpha, and α = 0.05 is common. In the theory of statistical inference, this α is an upper bound on what is known as the “type 1 error rate,” the rate at which a test rejects the null hypothesis even when the null hypothesis is true. Therefore, with α = 0.05, up to 5% of tests of true null hypotheses will falsely reject.
The probability of incorrectly concluding statistical significance for an individual hypothesis test with α = 0.05 can thus be considered 5%. However, the probability of getting at least one falsely statistically significant result when performing multiple hypothesis tests is greater. If 14 separate, independent hypothesis tests are all performed with α = 0.05, then the probability that at least one of them has a type 1 error, sometimes called the “family-wise” or “group-wise” type 1 error rate, is actually about 51.2%.
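For independent tests, that family-wise rate follows directly from the complement rule, 1 − (1 − α)^m:

```python
# Family-wise type 1 error rate for m independent tests at level
# alpha: the chance of at least one false rejection is the complement
# of all m tests being correct.
alpha, m = 0.05, 14
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))   # 0.512
```

With 14 tests, a false positive somewhere in the family is more likely than not.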
Therefore, performing a multitude of unadjusted hypothesis tests and looking for statistically significant results to report is a fallacy. This has been sometimes termed “p-hacking” because it involves generating numerous p-values and looking for p-values below an arbitrary α value. A researcher who is p-hacking is actually looking for anomalies due to chance, not significant results.
There are two main ways to protect yourself from this fallacy. The first is to always consider practical significance, not just statistical significance. This is a good practice not just when there are multiple comparisons being made, but whenever scrutinizing a statistic.
The second is to look for an adjustment for multiple comparisons and, if one has not been done, perform an adjustment yourself. There are many ways to adjust for the fact that multiple hypothesis tests are being done. Listing all such methods is beyond the scope of this article. Generally, they all attempt to keep the family-wise type 1 error rate bounded by the desired family-wise α size.
The simplest adjustment for multiple comparisons is called a “Bonferroni correction” after Carlo Emilio Bonferroni, a mathematician who worked in probability theory. It is simple enough that you as a reader can make the adjustment as long as the researchers have provided p-values for all their hypothesis tests. To make the Bonferroni correction, instead of comparing the p-values against the desired α level, you instead compare the p-values against the level α / m, where m is the number of hypothesis tests being performed.
If you desire a family-wise α level of 0.05, and there are 14 hypothesis tests performed, then instead of comparing each of the 14 p-values with 0.05, you would compare them with 0.05 / 14 = 0.0036 when using the Bonferroni correction. You would then only count those hypothesis tests with p-values less than 0.0036 as statistically significant. When doing this, the probability of incorrectly rejecting at least one null hypothesis among the 14 is again bounded at 0.05.
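The correction is simple enough to sketch in a few lines; the p-values below are invented for illustration:

```python
# Sketch of a reader's Bonferroni check: compare each reported
# p-value against alpha / m instead of alpha. Invented p-values.
p_values = [0.001, 0.004, 0.020, 0.049]   # hypothetical reported results
alpha = 0.05
m = len(p_values)
cutoff = alpha / m   # 0.0125

naive = [p for p in p_values if p <= alpha]        # all four look "significant"
corrected = [p for p in p_values if p <= cutoff]   # only the strongest survive
print(len(naive), len(corrected))   # 4 2
```

Two results that appear significant under the unadjusted threshold fail the corrected one, which is exactly the kind of back-of-the-envelope check you can perform as a reader.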
The Bonferroni correction tends to overcompensate for multiple comparisons, sacrificing more statistical power than necessary. There are better adjustments for multiple comparisons, and which one is appropriate depends on the specific analysis. However, the Bonferroni correction can be useful for back-of-the-envelope calculations when you are reading reports of empirical research rather than doing the analysis, such as when you are scrutinizing a statistic.
Finally, while p-hacking specifically is a fallacy, using multiple comparisons is not in general fallacious. There are fields such as genomics, which necessarily involve checking very many different possible explanatory variables (e.g., alleles at many different loci) as to whether they have some effect on a measured response variable. When done correctly, these fields not only adjust for multiple comparisons, but often have their own specialized methodologies for doing so.
If you are concerned with a social issue, a medical issue, an environmental issue, or indeed any issue involving large populations or involving cause and effect, you are engaged in statistical reasoning, whether you realize this or not. Therefore, serious inquiry into such an issue necessarily involves encountering statistics. However, because statistics can be abused, the skill of scrutinizing statistics is an important one.
You should scrutinize all statistics for provenance, scope of inference, and practical significance. Additionally, you should scrutinize statistics for estimation error if you determine them to be inferential when scrutinizing their scope of inference.
Scrutinizing provenance involves tracing a statistic back to the original observations from which it is derived. Scope of inference comprises two parts: whether a statistic is derived from a representative sample of a population, and whether a statistic is derived from a randomized, controlled experiment, thus implying causal conclusions can be made. Practical significance is based on subject-matter knowledge and comparison with another quantity. It can be summarized by effect size, but exactly how effect size is summarized depends on the types of variables involved.
For inferential statistics, estimation error stems from the randomization done in sampling or in experimental assignment. It is commonly quantified with confidence intervals, which give a range of probable values for the estimated quantity.
Throughout your work of scrutinizing a statistic, you should be on the lookout for common fallacies, such as lack of adjustment for confounding variables, missing denominators, unreported effect sizes, inferential statistics without inference, or p-hacking.
Scrutinizing statistics, though work, can also be a joy. When you are surrounded by those who want to influence rather than inform you, scrutinizing a statistic is an opportunity to strengthen that most singular implement in the search for truth: your own skeptical mind.
This saying was popularized in 1937, when it was commonly attributed to Scottish folklorist Andrew Lang. Andrew Lang died in 1912 and may or may not have originated it, as it does not appear in any of his extant writings. Regardless of the origin of this specific version of the saying, it was likely inspired by similar figures of speech which came before it. See this article at the Quote Investigator blog for more information about the origin of the saying.↩︎
Statistics courses beyond the introductory level usually have another kind of content. Typically, introductory courses are labeled “applied statistics” and are concerned with how to use various statistical methods. Upper-level courses are usually termed “mathematical statistics” and are concerned with the mathematical foundations of results used in statistics and how the statistical techniques are thus derived.↩︎
The Cagot were a real ethnic group who lived in a region part of modern-day Spain and France. People categorized as “Cagot” were the object of much prejudice, and their origins have remained a mystery to scholarship over the centuries.↩︎
The results could hypothetically generalize to individuals living in Papua New Guinea with mobile phones that have Saint Petersburg area code phone numbers. If this were the case, the results would generalize only to these individuals in Papua New Guinea and only if it was still possible to call them at the time the survey was being done.↩︎
Most of these unrepresentative surveys did not use probability sampling, which is the form of sampling described in this article in which every possible sample taken from the population has a defined probability of selection. Instead, the surveys, as many market research and political research surveys do, used quota sampling. Quota sampling is a form of sampling that tries to intentionally match certain demographic variables in the sample to the known distributions of the demographic variables in the population. Quota sampling is a lot cheaper because samples can be constructed opportunistically based on individuals who are readily available. However, as the 2015 British General Election illustrates, quota sampling is not very reliable.↩︎
Not all randomized sampling consists of simple random samples. Surveys usually use more complex sampling schemes, for various reasons. The sampling is still randomized, but the probabilities of inclusion in the sample are not necessarily all equal. This can be accounted for in the analysis and is a standard practice. (Lohr 2021)↩︎
A point of confusion can arise because randomized experiments are often analyzed using the same mathematical tools as observational studies. Classical statistical inference was developed using the mathematical model of a population and a sample from the population. Rather than reinvent all of the statistical tools developed over the years, analysis of randomized, controlled experiments sometimes simply invokes a population model and reuses the same tools. While this practice is common, inference for randomized, controlled experiments does not necessarily have to be done this way. Rosenberger and Lachin (2015) summarize inference based on the randomization used for assignment in an experiment, instead of invoking a population model.↩︎
Response variables are sometimes called “research variables” or “dependent variables,” and explanatory variables are sometimes called “independent variables.” However, this article recommends avoiding the use of the labels “dependent variable” and “independent variable” because “independence” means something else very important in statistical theory. Indeed, many statistical models and tests are built upon an assumption of independence. While the violation of other assumptions, such as the assumption of normality, can still result in passable estimates, failure to handle the autocorrelation structure in the observations when the assumption of independence is violated will usually make estimates of standard errors drastically mistaken and thus invalidate the results.↩︎
Even within these two broad types of variables, there are more subtypes, but for the purposes of this article, only these two types are used. Also, it should be noted that in the math underlying statistical theory, categorical variables are often taken as just a more limited case of numerical variables, and the proofs and derivations for numerical variables are reused for categorical variables. However, the way the two kinds of variables are thought about and handled in applied statistical analysis can be quite different.↩︎
Real-world example of counts, proportions, totals, and means will be included here in the future.↩︎
Of course, there do exist cases in which a count or total is more relevant for a specific problem than the corresponding proportion or mean. For instance, if you are in charge of allocating resources to the waste disposal facilities in your country, you care very much about what the total waste produced next year in each jurisdiction is projected to be and care only indirectly about what the mean waste produced per household in each jurisdiction is projected to be.↩︎
Rate can be thought of as a proportion with the number of people experiencing an event in the numerator and the number of people in the population as the denominator.↩︎
Medians are more resistant to outliers, so they are often used for variables with heavily skewed distributions, such as personal monetary income. Furthermore, analyses that use logarithm transforms are more easily interpreted in terms of medians than means.↩︎
Ferguson (2009) summarizes many of the standardized statistics for effect size. Valid uses of standardized statistics for effect size include comparing effect sizes when statistics are on different scales or when dealing with inherently abstract and unitless quantities such as those used by psychologists in personality tests.↩︎
Some standardized statistics such as Cohen’s d incorporate information about the dispersion of the distribution of the variable, e.g., Cohen’s d involves division by the standard deviation, which is a measure of dispersion. However, Cohen’s d does not report on other properties such as skewness or kurtosis.↩︎
The use of the term “risk” in “relative risk” comes from the idea of risk of disease or other adverse outcome, because relative risk is frequently used in medical statistics.↩︎
Retrospective or “case-control” studies select individuals for the study based on the response variable, and then look for how the explanatory variable differs between response variable levels. This is opposite of how a randomized, controlled experiment works. For instance, in a medical context, individuals in a randomized, controlled experiment are assigned to a level of the explanatory variable (e.g., therapy or control), and then are measured in terms of the response variable, such as disease outcome. In a case-control study, the disease outcome is used for selection to the study, and then explanatory variables such as risk factors are measured.↩︎
The character “ρ” is the Greek letter rho. Sometimes a convention is used where r refers to the Pearson product-moment correlation coefficient and ρ is used for the Spearman rank correlation coefficient. However, you should not rely on this convention, because there exists another convention in which Roman letters, such as r, are used for statistics pertaining to a sample and Greek letters, such as ρ, are used for parameters pertaining to a population.↩︎
A real-world example of correlation and inverse correlation will be included here in the future.↩︎
Because the Spearman correlation coefficient uses ranks, it is also more appropriate for ordinal variables. Ordinal variables are those that have discrete levels, but also have some sense of order, such as small, medium, and large.↩︎
A real-world example of simple linear regression will be included here in the future.↩︎
This article was originally intended to include a concrete example of an analysis of gender pay differences that adjusted for confounding variables. While Kessler (2014) contains numerous references to better analyses than the 77-cents-for-every-dollar statistic, all of these hypertext links now lead to nothing. (This is perhaps a testament to the transient nature of the World Wide Web.) Therefore, in the spirit of this article’s section on provenance, no confounding-adjusted statistic is cited here in order to avoid citing a statistic whose original source has not been scrutinized.↩︎
Confidence intervals are based on frequentist statistics. In a Bayesian statistics setting, credible intervals are used. Credible intervals have an even simpler interpretation: a 95% credible interval claims that the true value has a 95% probability of falling within the interval. However, this simpler interpretation is a result of different definitions of “probability.” In Bayesian statistics, probability is a measure of subjective weight given to a belief, like when someone says, “I am fifty-fifty on whether that is true or not.” In frequentist statistics, probability is a measure of how often an event occurs in the long run, like when someone says, “The odds are 2 to 1 that Flight 2495 is delayed.”
First courses in statistics are usually based on a frequentist interpretation. Bayesian interpretation of probability can open the door to methods that do not work with the frequentist interpretation, most notably conjugate updating and hierarchical models. While credible intervals are easier to interpret for the layperson, Bayesian methods should not be used just so that credible intervals instead of confidence intervals can be used. Instead, Bayesian methods should be used when the problem makes sense for a Bayesian interpretation, for example, when there is a desire to revise a belief based on new evidence that has been informed by prior results.↩︎
A real-world example of a confidence interval will be included here in the future.↩︎
A real-world example of a statistically insignificant confidence interval will be included here in the future.↩︎
A real-world example of a practically insignificant, but statistically significant confidence interval will be included here in the future.↩︎