The Echoes of the Mind (50) Statistics


There are 3 kinds of lies: lies, damned lies, and statistics. ~ American author and humorist Mark Twain

Death inspired the science of statistics. King Henry VIII’s fear of the plague probably prompted his publishing reports on deaths, beginning in 1532. Mortality tables started ~1662 with the work of English demographer John Graunt, who saw patterns in the statistics. This was followed by English polymath Edmund Halley, who published an article on life annuities in 1693, thus providing the mathematical grounding for insurance via actuarial science.

Probability tells the likelihood of an event. A graph may be made by collating sample outcomes. This results in a probability curve. Carl Friedrich Gauss derived an equation for the probability curve and analyzed its properties.

In 1749, German philosopher Gottfried Achenwall quantitatively characterized government statistical data and coined the term statistic. It had been called political arithmetic in England.

A bell-curve figure shows an idealized symmetrical probability curve with a normal distribution: a continuous probability spread. The mean, which in a normal distribution coincides with the median and the mode, is its apex: the value with the peak number of occurrences, and therefore the most likely point (expected value).

Events to the left and right of the apex are less likely. They represent, respectively, less and more of whatever is being measured in a sample than the value at the median.

A vast body of statistical theory and methods presupposes a normal distribution. ~ American statistician Leroy Folks

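The central limit theorem states that the sum (or average) of a large number of independent random events tends toward a normal distribution, whatever the distribution of the individual events, provided their variance is finite. In practice this becomes the working assumption that aggregated data are normally distributed. Logically it is a flagrant fallacy to blithely apply to specific instances in the present what has been generally found in the past. Nevertheless, assumed continuity is a central tenet of statistics.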
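The tendency of averages toward a bell curve can be checked by simulation. The following is a minimal sketch, not a proof: it averages draws from a flat (uniform) distribution, which looks nothing like a bell curve, and then checks what fraction of those averages fall within one standard deviation of their center. The function name `sample_means`, the sample sizes, and the seed are illustrative assumptions.

```python
import random
import statistics

def sample_means(n_per_sample, n_samples, seed=0):
    """Means of n_samples samples, each of n_per_sample uniform draws on [0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.random() for _ in range(n_per_sample))
        for _ in range(n_samples)
    ]

means = sample_means(n_per_sample=30, n_samples=10_000)
center = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution, roughly 68% of values lie within
# one standard deviation of the mean.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(f"center={center:.3f}  spread={spread:.3f}  within 1 sd={within_one_sd:.1%}")
```

If the averages are close to normally distributed, the last figure should land near 68%, even though the individual draws are uniformly spread.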

In charting statistics, such as a mortality table, the shape of a probability curve may be skewed. In the exemplary curve, illustrating age of death, demise is spread over a greater range before reaching the midpoint, after which folks start dropping like flies.

The standard deviation is a reliable measure of dispersion. ~ American mathematician Thomas Pirnot

The standard deviation measures how far data points typically lie from the mean; it is the square root of the variance. Another view, from a predictive context, is that the standard deviation represents a measure of errors from the expected norm. A low standard deviation betokens data points clustered close to the central tendency (mean). A high standard deviation indicates a greater dispersion.
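A small worked example may help. The two data sets below are invented for illustration; they share the same mean but differ in dispersion. The sketch uses the population formulas from Python's standard `statistics` module.

```python
import statistics

# Hypothetical data: both sets have a mean of 10.
tight = [9, 10, 10, 10, 11]   # clustered near the mean
wide = [2, 6, 10, 14, 18]     # spread far from the mean

for name, data in (("tight", tight), ("wide", wide)):
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data)  # average squared distance from the mean
    sd = statistics.pstdev(data)           # square root of the variance
    print(f"{name}: mean={mean:.1f} variance={variance:.2f} sd={sd:.2f}")
```

The mean alone cannot tell the two sets apart; the standard deviation can.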

Descriptive statistics is the discipline of quantifying a sample, which is a set of collated data. In contrast, inferential statistics draws predictive conclusions from statistical information about the population that is sampled.

Inferential statistics is used to test hypotheses and to forecast from sample data. Because a random sample only approximates the population it is drawn from, statistical inference comes with a confidence interval: a range of values indicating the uncertainty surrounding an estimate. Confidence intervals represent how “good” an estimate is.

Confidence intervals are frequently misunderstood, even by scientists who should know better. A confidence interval expresses the reliability of the estimation procedure, not the probability that the sought-after population parameter lies within the particular interval computed. Once an interval has been calculated, the parameter either is inside it or is not.

Confidence intervals are given in the form of estimate ± margin of error, where estimate is the sample value (the center of the interval) standing in for the unknown population parameter, and margin of error is the half-width of the interval: how far the estimate may plausibly lie from the true value at the stated confidence level.

Attached to every confidence interval is a confidence level, which is the probability (%) indicating how much certainty should be attributed to a confidence interval. In general, the higher the confidence level, the wider the confidence interval.
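As a rough illustration of estimate ± margin of error, the sketch below computes an interval for a mean at two confidence levels. It assumes a normal approximation (a z-value times the standard error), which is only one of several ways to build such an interval and is crude for a sample this small. The orange weights and the function name `mean_confidence_interval` are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval for a mean, returned as (estimate, margin)."""
    estimate = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    # Two-sided z-value: 0.95 confidence leaves 2.5% in each tail.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, z * standard_error

# Hypothetical weights (grams) of ten oranges.
weights = [148, 152, 155, 158, 160, 162, 163, 166, 171, 150]

estimate, margin_80 = mean_confidence_interval(weights, 0.80)
_, margin_99 = mean_confidence_interval(weights, 0.99)
print(f"80% level: {estimate:.1f} ± {margin_80:.1f}")
print(f"99% level: {estimate:.1f} ± {margin_99:.1f}")
```

The estimate is the same at both levels; only the margin changes, widening as the confidence level rises.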

The confidence level is not a statement about the population or about any single interval. It is instead the long-run success rate of the procedure used to construct confidence intervals. For example, intervals built at a confidence level of 80% will, over many repeated samples, miss the true population parameter 1 out of every 5 times.
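The long-run reading of the confidence level can be checked by simulation. The sketch below draws many samples from a population whose true mean is known (something never available in practice), builds an 80% interval from each, and counts how often the interval covers the true mean. The population parameters, sample size, and function name `coverage_rate` are assumptions made for illustration, and the interval again uses a normal approximation.

```python
import math
import random
import statistics

def coverage_rate(confidence, true_mean=50.0, true_sd=10.0,
                  n=40, trials=5_000, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        estimate = statistics.fmean(sample)
        margin = z * statistics.stdev(sample) / math.sqrt(n)
        if estimate - margin <= true_mean <= estimate + margin:
            hits += 1
    return hits / trials

rate = coverage_rate(0.80)
print(f"intervals covering the true mean: {rate:.1%}")
```

The rate should come out close to 80%, which is to say roughly 1 interval in 5 misses. Any single interval either covers the true mean or does not; the 80% describes the procedure, not that one interval.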

Confidence intervals were introduced into statistics by Polish mathematician and statistician Jerzy Neyman in 1937.

Although the field of statistics is rooted in mathematics, and mathematics is exact, the use of statistics to describe complex phenomena is not exact. ~ American economist Charles Wheelan

◊ ◊ ◊

Children are taught the mathematics of certainty: algebra, trigonometry, geometry and the like. That’s beautiful but often useless. We should be taught uncertainty. ~ German psychologist Gerd Gigerenzer

Statistics is difficult for many because it is an abstruse abstraction. The mind is naturally inclined to think in terms of outcomes having causes.

Probability is relatively easy because it is in the realm of causality: how likely an event is. Even babies detect patterns that form the base of probability.

Correlation doesn’t tell you anything about causation, but it’s a mistake even researchers make. ~ American statistician and psychologist Rand Wilcox

Statistics is essentially acausal, which is a paradigmatic shift that has no everyday application. The very concept of statistics is a significant step away from how the mind appreciates the world.

There’s order in the form of correlations. ~ English physicist David Jennings

The closest statistics comes to everyday application is averaging. The mind can rather readily suss the average of a tangible set of something (e.g., the average size of 4 oranges sitting there). But how representative a sample is of a larger, unseen population seriously stretches the imagination; and how confident one should be about the randomness or representativeness of such a sample is a mathematical chimera.

The difficulty is with distributions, which are crucial to understanding statistical reasoning. The idea that individual phenomena can be independently distributed is comprehensible, but that the “distribution” of a large collection of random events can be mathematically characterized with regularity is at best a mystery wrapped in an enigma.

The notion of independence within distributions is at the root of the problem. People expect samples, even small ones, to be representative. Consequently, reconciling independence with an abstract distribution is a mental challenge.

A common misunderstanding of independence and distribution is exemplified by a statement about coin tossing: “if 10 heads have been thrown in a row, the next few tosses have to be tails for the results to represent the distribution.” (Each toss is independent, so past outcomes place no obligation on future ones. The stated confusion is between the long-run distribution of outcomes and the probability of a single toss, which for a fair coin remains 0.5 for heads no matter what came before.)
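The independence of tosses can be illustrated with a simulated fair coin. The sketch below looks only at tosses that immediately follow a run of heads and measures how often they also come up heads. It uses a run of 5 rather than 10 so that enough such runs occur in a modest simulation; the function name `heads_rate_after_streak` and the parameters are illustrative assumptions.

```python
import random

def heads_rate_after_streak(streak_len=5, tosses=200_000, seed=2):
    """Share of heads among tosses that follow at least streak_len heads in a row."""
    rng = random.Random(seed)
    run = 0          # current count of consecutive heads
    followups = 0    # tosses made right after a qualifying streak
    heads_after = 0  # how many of those tosses were heads
    for _ in range(tosses):
        heads = rng.random() < 0.5
        if run >= streak_len:
            followups += 1
            heads_after += heads
        run = run + 1 if heads else 0
    return heads_after / followups, followups

rate, followups = heads_rate_after_streak()
print(f"{followups} tosses followed a streak; {rate:.1%} of them were heads")
```

If tosses "owed" tails after a streak, the share of heads would fall well below half. In the simulation it should stay near 50%.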