“There are 3 kinds of lies: lies, damned lies, and statistics.” ~ American author and humorist Mark Twain (1835–1910)

Death inspired the science of statistics. King Henry VIII’s fear of the Black Plague probably prompted his publishing reports on deaths, beginning in 1532. Mortuary tables started ~1662 with the work of English demographer John Graunt, who saw patterns in the statistics. This was followed by English polymath Edmund Halley, who published an article on life annuities in 1693, thus providing the mathematical grounding for insurance via actuarial science.

Probability tells the likelihood of an event. A graph may be made by collating sample outcomes. This results in a probability curve. Carl Friedrich Gauss derived an equation for the probability curve and analyzed its properties.

In 1749, German philosopher Gottfried Achenwall quantitatively characterized government statistical data and coined the term *statistic*. It had been called *political arithmetic* in England.

A bell-curve figure shows an idealized symmetrical probability curve with a normal distribution: a continuous probability spread. The mean (median) is its apex, which is the peak number of occurrences, and therefore the most likely point (expected value).

Less likely are events to the left and right, which are respectively fewer and more of whatever is being measured in a sample than those at the median.

“A vast body of statistical theory and methods presupposes a normal distribution,” observed American statistician Leroy Folks. Instead, normal distribution is seldom the norm. For example, mortuary statistics are notably skewed, with (in modern industrialized nations) a long left side before the mean and a sharp drop afterwards, as old folks start dropping like flies.

The *central limit theorem* is the term used for the statistical finding that the sum or average of a large number of independent events tends toward a normal distribution. Logically it is a flagrant fallacy to blithely apply to specific instances in the present what has been generally found in the past. Nevertheless, assumed continuity is a central tenet of statistics.
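The theorem concerns sums or averages of many independent draws, which a rough simulation can illustrate. The sketch below assumes an exponential source distribution (heavily skewed), a sample size of 50, and a homemade `skewness` helper, all chosen purely for illustration. It compares the lopsidedness of single draws with that of sample averages.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def skewness(values):
    """Population skewness: near 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Individual draws from a strongly right-skewed (exponential) distribution.
single_draws = [rng.expovariate(1.0) for _ in range(20_000)]

# Averages of 50 independent draws each, from the same distribution.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(50)])
    for _ in range(2_000)
]

print(f"skewness of single draws: {skewness(single_draws):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The averages come out far more symmetric than the raw draws. This says nothing about whether the raw data in any given study are themselves normally distributed.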

The *standard deviation* is the amount of variance from the mean: dispersion from an average. Another view, from a predictive context, is that the standard deviation represents a measure of errors from the expected norm. A low standard deviation betokens data points clustered close to the central tendency (mean). A high standard deviation indicates a greater dispersion.
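As a minimal sketch of low versus high dispersion, the following uses Python's standard `statistics` module on two made-up data sets that share the same mean. The numbers are hypothetical and serve only to show the contrast.

```python
import statistics

# Two hypothetical samples with the same mean (50) but different spread.
clustered = [48, 49, 50, 51, 52]
dispersed = [10, 30, 50, 70, 90]

for name, data in (("clustered", clustered), ("dispersed", dispersed)):
    mean = statistics.mean(data)
    # pstdev treats the data as a whole population;
    # statistics.stdev would treat it as a sample instead.
    sd = statistics.pstdev(data)
    print(f"{name}: mean={mean}, standard deviation={sd:.2f}")
```

Both sets average 50, yet the second has a standard deviation roughly 20 times larger than the first.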

*Descriptive statistics* is the discipline of quantifying a sample, which is a set of collated data. In contrast, inferential statistics draws predictive conclusions from statistical information about the population that is sampled.

*Inferential statistics* is used to test hypotheses, and to forecast via sample data. As such, statistical inference from a random sample has a *confidence interval*: the probability percentage that a certain event will occur, or data not sampled, or found in the future, will correspond to an asserted characterization. In other words, a confidence interval is a range of values indicating the uncertainty surrounding an estimate. Confidence intervals represent how “good” an estimate is.

Confidence intervals are frequently misunderstood, even by scientists who should know better. A confidence interval expresses a statistical sense of reliability in an estimation, not the likelihood that the result (the sought-after population parameter) is within the interval (that is, the probability that the interval covers the population parameter).

Confidence intervals are given in the form of estimate ± margin of error, where estimate is the measure (the center of the interval) of the unknown population parameter being surveyed for, and margin of error is the potentiality of the estimate being erroneous.

Attached to every confidence interval is a *confidence level*, which is the probability (%) indicating how much certainty should be attributed to a confidence interval. In general, the higher the confidence level, the wider the confidence interval.

The confidence level is not a statement about the population or sampling procedure. The confidence level is instead an indication of the success in constructing the confidence interval. For example, confidence intervals with a confidence level of 80% will, over the long run (through repeated sampling), miss the true population parameter 1 out of every 5 times. Confidence intervals were introduced into statistics by Polish mathematician Jerzy Neyman in 1937.
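The long-run reading of a confidence level can be checked by simulation. The sketch below is illustrative only. It assumes a normally distributed population with a known standard deviation, so that the margin of error is simply a z-multiple of the standard error, and every parameter value in it is made up.

```python
import random
import statistics

rng = random.Random(42)            # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 100.0, 15.0   # hypothetical population parameters
SAMPLE_SIZE, TRIALS = 30, 5000

# Two-sided 80% level: 10% in each tail, so use the 90th percentile (about 1.28).
Z_80 = statistics.NormalDist().inv_cdf(0.90)
# Margin of error, assuming the population standard deviation is known.
margin = Z_80 * TRUE_SD / SAMPLE_SIZE ** 0.5

misses = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    estimate = statistics.mean(sample)
    # The interval is estimate ± margin; count the times it fails to cover the truth.
    if not (estimate - margin <= TRUE_MEAN <= estimate + margin):
        misses += 1

miss_rate = misses / TRIALS
print(f"intervals that missed the true mean: {miss_rate:.1%}")
```

Across thousands of repetitions, the share of intervals that miss should hover near 20%. No single interval reveals whether it is among the hits or the misses.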

Probability is relatively easy because it is in the realm of causality: how likely an event is. Even babies detect patterns that form the base of probability.

Statistics is difficult for many because it is an abstruse abstraction. The mind is naturally inclined to think in terms of outcomes having causes. By contrast, correlation is a contrived concept. As American statistician and psychologist Rand Wilcox notes, “correlation doesn’t tell you anything about causation, but it’s a mistake even researchers make.”

Statistics is essentially acausal, which is a paradigmatic shift that has no everyday application. The very concept of statistics is a significant step away from how mind appreciates the world.

The closest statistics comes to everyday application is averaging. The mind can rather readily suss the average of a tangible set of something (e.g., how big the average orange is of 4 sitting there). But how representative a sample is of a larger, unseen population, seriously stretches the imagination; and how confident one should be about the randomness or representativeness of such a sample is a mathematical chimera.

The difficulty is with distributions, which are crucial to understanding statistical reasoning. The idea that individual phenomena can be independently distributed is comprehensible, but that the “distribution” of a large collection of random events can be mathematically characterized with regularity is at best a mystery wrapped in an enigma.

The notion of independence within distributions is at the root of the problem. People expect samples, even small ones, to be representative. Consequently, reconciling independence with an abstract distribution is a mental challenge.

A common misunderstanding of independence and distribution is exemplified by a statement about coin tossing: “if 10 heads have been thrown in a row, the next few tosses have to be tails for the results to represent the distribution.” (There is no statistical population distribution of coin tosses, as each toss is independent. The stated confusion about distribution is with probability distribution (aka odds), which for tossing coins is 0.5.)
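The independence of tosses can be checked with a quick simulation. This sketch assumes a fair coin and, so that enough streaks turn up, looks at runs of 5 heads rather than the 10 in the quoted statement. The seed and toss count are arbitrary.

```python
import random

rng = random.Random(7)   # fixed seed so the run is repeatable
RUN_LENGTH = 5           # streak length to look for (shorter than 10 to get more cases)
TOSSES = 400_000

streak = 0        # current number of consecutive heads
followups = 0     # tosses made right after RUN_LENGTH or more heads in a row
heads_after = 0   # how many of those follow-up tosses came up heads

for _ in range(TOSSES):
    is_heads = rng.random() < 0.5
    if streak >= RUN_LENGTH:
        followups += 1
        heads_after += is_heads
    streak = streak + 1 if is_heads else 0

share = heads_after / followups
print(f"share of heads right after a streak: {share:.3f}")
```

The share stays close to 0.5. The coin does not owe anyone tails.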

**Data Quality**

English statistician David Hand aptly observed that “statistics begins with data.”

Statistics aims at a specific, quantified characterization of a population. The data used in a statistical analysis is almost always a sampling of a population rather than the entire population. Hence, statistics are almost always a rough sketch, not a complete picture, and so inherently are of uncertain quality, though they often serve as good approximations, which is the best a fact can ever be.

The quality of a sample determines the quality of the statistics associated with it. For statistics to be decent, the sample upon which they are based must be representative of the population being examined.

A population is the entire set of objects about which information is wanted. A parameter is a quantitative characteristic of a population; it is a fixed and mysterious number. The goal of a statistical exercise is gaining insight into one or more parameters.

In statistics, characteristics are commonly called variables, with each object having a value of a variable.

◊ ◊ ◊

“Raw data, like raw potatoes, usually require cleaning before use,” cautioned American statistician Ronald Thisted.

Data provides a window to the world. The problem is getting a clear view. That requires good data and unbiased examination.

A sample is data about a subset of the target population. A statistic is a numeric characteristic of a sample.

There are 2 basic types of statistical studies: observational and experimental. In observational situations, data is captured without interference in the process. In contrast, experimental studies consist of manipulating the objects measured. The quality of experimental data is directly related to the design of the experiment.

A sampling frame is the source material or method by which a sample is selected. Sampling frames must be designed to collect representative data, and, once amassed, cleaned as necessary to reflect that goal. Sample size is a critical aspect of data quality.

The law of large numbers is a theorem relating to sample quality. The theorem states that the average result should come closer to the expected value with larger sample size, or greater number of repetitions in experimental results.
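A minimal sketch of the theorem, assuming a fair 6-sided die (expected value 3.5) and an illustrative helper named `average_of_rolls`:

```python
import random

rng = random.Random(1)   # fixed seed so the run is repeatable
EXPECTED = 3.5           # expected value of one roll of a fair 6-sided die

def average_of_rolls(n):
    """Average of n simulated die rolls."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    avg = average_of_rolls(n)
    print(f"{n:>7} rolls: average={avg:.3f}, gap from expected={abs(avg - EXPECTED):.3f}")
```

The gap tends to shrink as the number of rolls grows, though any single small run can still land close to 3.5 by luck.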

The term random sample is used to describe the technique of randomly picking sample objects for examination. The happy thought and fond hope is that random selection will result in population representativeness. Many times, sampling, though intended as random, is no such thing. This is because certain members of a population are more accessible than others, and so more likely to be chosen.

Market research long used landline phones to survey consumers. The problems of obtaining a representative sample, once mainly limited to the demographics of geography and income/wealth, have been compounded in recent decades by the facts that many people now exclusively use cell phones, and that phone books are no longer the population compendium they once were.

Data is evidence. In scientific experiments phenomena are characterized via data. Data quality is a problem in every sort of analysis.

“Too many cooks spoil the broth” is a hoary proverb. The larger the data set, the more hands involved in its compilation, and the more processing stages involved, the more likely errors creep in. The law of large numbers may be a mirage.

**Statistics in Science**

“There are large numbers of experts – not just laypeople – who have no training in statistical thinking,” said German psychologist Gerd Gigerenzer.

The bane of empirical science is uncertainty. For a modern scientist, a pattern of anecdotes may provide fodder for a hypothesis but is far too flimsy a foundation to float a theory. So scientists invariably rely upon statistics to muster support. Therein damning problems lurk.

The overriding issue in employing statistics is confusing correlation with causality. Canadian econometrician James Ramsey reminds that “statistics per se is acausal.” The best any statistical result can show is a coincidence between 2 factors. Statistics can never prove one event causes another.

“The shift from disciplines with an all-pervading causal interpretation to one that is inherently acausal represents a major fundamental shift in viewpoint, and one that cannot merely be dismissed as an alternative ‘explanation,’” notes Ramsey.

For statistics to bolster any claim the sample size must be large and the result unambiguous. Excepting physics and chemistry – where experimental reproducibility is relatively easily had – both criteria are rarely met. Failure to compile unassailable data of sufficient sample size is particularly true in the life sciences, notably the medical field.

Then there is the scale of potential effect. Whereas large effects may rather readily be determined, and therefore the use of statistics rather superfluous, small effects are tough to suss.

That smoking cigarettes hurts health – damaging lungs, causing cardiovascular disease and cancer – is practically a no-brainer. Statistically, it helps greatly that there are a lot of smokers about. But the health effects of eating something considered food are so difficult to discern that the statistics of all such studies are worthless.

A common technique is meta-analysis: aggregate a large number of studies, none of which individually may be considered conclusive, or even worth a damn, but then conclude that altogether a causal conclusion may be drawn. However appealing the rationale, the technique is bogus from a statistical standpoint; yet it remains a popular ruse.

“Very few areas of science are uncontaminated by the pseudo-certainty of statistical conclusions describing individually uncertain events,” observed American biomechanist Steven Vogel.

The more flexible a scientific field is in its definitions, experimental designs, analytic modes, and outcomes, the less likely that research conclusions are reliable. Biology, psychology, sociology, and economics are exemplary fields where empirical problems loom large.

Getting good data is the 1st hurdle, and is where many studies falter, often without the acknowledgement of those involved. Once amassed, data is then subject to statistical interpretation. If a study has not already been invalidated for lack of decent data, it is in this step that results readily go awry.

Even when performed properly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless scientific conclusions that make the news are erroneous. “A lot of scientists don’t understand statistics. And they don’t understand statistics because the statistics don’t make sense,” said American epidemiologist Steven Goodman.

**Probability Value**

In the 1920s and 1930s, English statistician and biologist Ronald Fisher mathematically combined Mendelian genetics with Darwin’s hypothesis of natural selection, creating what became known as the modern evolutionary synthesis, thereby establishing evolution as biology’s primary paradigm. Fisher’s work revolutionized experimental design and the use of statistical inference.

In his approach, Fisher expressly wanted to avoid the subjectivity involved in Bayesian inference, which became popular in the 1980s and remains so. Bayes’ theorem is now badly abused in science, medicine, and law to conclude causation when this shaggy approach at best suggests conditional plausibility when critical data is missing.

Fisher statistically assessed significance using a probability value (p-value). The p-value is the probability of seeing a result at least as extreme as the one observed if there were actually no effect (the null hypothesis); it is not the probability that a proposed hypothesis is true. “The problem is that the p-value by itself is not of particular interest. What scientists want is a measure of the credibility of their conclusions, based on observed data. The p-value neither measures that nor is it part of the formula that provides it,” explained Goodman.

Scientists now use p-value as a backhanded way of determining whether their data and attendant conclusions are valid. This is a fundamental misconception. “This pernicious error creates the illusion that the p-value alone measures the credibility of a conclusion, which opens the door to the mistaken notion that the dividing line between scientifically justified and unjustified claims is set by whether the p-value has crossed a ‘bright line’ of significance, to the exclusion of external considerations like prior evidence, understanding of mechanism, or experimental design and conduct,” said Goodman. “Random variation alone can easily lead to large disparities in p-values,” noted Swiss zoologist Valentin Amrhein.

Fisher used “significance” only to suggest whether an observation was worth following up on. “This is in stark contrast to the modern practice of making claims based on a single documentation of statistical significance,” observed Goodman.

“p-values used in the conventional, dichotomous way decide whether a result refutes or supports a scientific hypothesis. Bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different. The false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards, whereas statistically non-significant estimates are biased downwards. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs – thereby invalidating conclusions,” explained Amrhein.
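The upward bias of significant estimates can be shown with a rough simulation. Everything in this sketch is an assumption made for illustration: a small true effect of 0.2, studies that each estimate it with normally distributed noise of standard error 0.2, and the conventional two-sided cutoff of 1.96 standard errors.

```python
import random
import statistics

rng = random.Random(3)   # fixed seed so the run is repeatable
TRUE_EFFECT = 0.2        # hypothetical real effect size
STD_ERROR = 0.2          # hypothetical standard error of each study's estimate
STUDIES = 20_000
Z_CRIT = 1.96            # conventional two-sided 5% significance threshold

significant, non_significant = [], []
for _ in range(STUDIES):
    estimate = rng.gauss(TRUE_EFFECT, STD_ERROR)
    # A study is 'significant' when its estimate is far enough from zero.
    if abs(estimate / STD_ERROR) > Z_CRIT:
        significant.append(estimate)
    else:
        non_significant.append(estimate)

mean_sig = statistics.mean(significant)
mean_nonsig = statistics.mean(non_significant)
print(f"true effect:                  {TRUE_EFFECT}")
print(f"mean significant estimate:    {mean_sig:.3f}")
print(f"mean non-significant estimate: {mean_nonsig:.3f}")
```

Under these assumptions, the studies that clear the threshold overstate the true effect on average, while those that fall short understate it. Every simulated study measured the same underlying effect.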

○○○

“Claimed research findings may often be simply accurate measures of the prevailing bias,” observed American epidemiologist John Ioannidis.

As with all endeavors involving pecuniary interest: to find the fraud, follow the money. Corporate-funded scientific research is inherently untrustworthy for this reason. People are paid to find a desired result.

Further, the hotter a scientific subject is, the less likely that research findings are reliable: more teams are involved, and de facto under implicit competitive pressure to produce results. But the converse also presents the same problem. “The smaller the number of studies conducted in a scientific field, the less likely the research findings are to be true,” noted Ioannidis.

Researchers in a noncompetitive field need to produce noteworthy results to have any hope of continuing their work. The temptation for a little fiddling to sustain one’s livelihood is strong.

◊ ◊ ◊

Statistical inference is a nuanced mathematical art related to correlation which is widely misused as a yardstick of causality. Further, misunderstanding the concepts of statistics has meant falsely designating or denigrating significance – conclusions based on experiment design and measurement without proper accounting of actuality. As ubiquitously exercised, the employment of statistics in science is mostly delusion. “Most claimed research findings are false,” concluded Ioannidis.

○○○

If you liked this essay, you’ll enjoy the entire statistical population, presented in *The Echoes of the Mind*.