**Data Quality**

Statistics begins with data. ~ David Hand

Statistics aims at a specific, quantified characterization of a population. The data behind a statistical analysis is almost always a sample of a population rather than the entire population. Hence, statistics are almost always a rough sketch, not a complete picture, and so are inherently of uncertain quality. Still, they often serve as good approximations, which is the best a fact can ever be.

Garbage in, garbage out. ~ an early use appeared in a newspaper article about the US Internal Revenue Service computerizing their data (1 April 1963)

The quality of a sample determines the quality of the statistics associated with it. For statistics to be decent, the sample upon which they are based must be representative of the population being examined.

A population is the entire set of objects about which information is wanted. A parameter is a quantitative characteristic of a population; it is a fixed but usually unknown number. The goal of a statistical exercise is gaining insight into one or more parameters.

In statistics, characteristics are commonly called variables, with each object having a value for each variable.

◊ ◊ ◊

Raw data, like raw potatoes, usually require cleaning before use. ~ American statistician Ronald Thisted

Data provides a window to the world. The problem is getting a clear view. That requires good data and unbiased examination.

A sample is data about a subset of the target population. A statistic is a numeric characteristic of a sample.
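The distinction between a parameter and a statistic can be made concrete with a small sketch. The population below is invented purely for illustration (10,000 simulated heights); the names `population`, `parameter_mean`, and `statistic_mean` are assumptions, not terms from the text.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in centimetres.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: a characteristic of the whole population. It is fixed, but in
# real work it is usually unknown because the full population is unavailable.
parameter_mean = statistics.mean(population)

# Statistic: the same characteristic computed from a sample of 100 objects.
sample = rng.sample(population, 100)
statistic_mean = statistics.mean(sample)

print(f"parameter (population mean): {parameter_mean:.2f}")
print(f"statistic (sample mean):     {statistic_mean:.2f}")
```

The two printed numbers should be close but not identical: the statistic is an estimate of the parameter, and a different sample would give a slightly different estimate.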

There are 2 basic types of statistical studies: observational and experimental. In observational situations, data is captured without interference in the process. In contrast, experimental studies consist of manipulating the objects measured. The quality of experimental data is directly related to the design of the experiment.

A sampling frame is the source material or method by which a sample is selected. Sampling frames must be designed to yield representative data; once amassed, the data must be cleaned as necessary to serve that goal. Sample size is also a critical aspect of data quality.

The law of large numbers is a theorem relating to sample quality. It states that the sample average tends to come closer to the expected value as the sample size, or the number of repetitions of an experiment, grows.
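A simulation gives a feel for the theorem. The sketch below, which assumes a fair six-sided die (expected value 3.5) as the experiment, averages an increasing number of rolls; the helper `mean_of_rolls` is a hypothetical name introduced only for this example.

```python
import random

def mean_of_rolls(n, rng):
    """Return the average of n rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

rng = random.Random(42)  # fixed seed so the sketch is repeatable
expected = 3.5           # expected value of one fair die roll

for n in (10, 100, 10_000, 100_000):
    avg = mean_of_rolls(n, rng)
    print(f"n={n:>7}  mean={avg:.3f}  |error|={abs(avg - expected):.3f}")
```

The error does not shrink on every single step, but it tends to shrink as the number of rolls grows, which is all the theorem promises.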

A random sample is one whose objects are picked at random for examination. The happy thought and fond hope is that random selection will result in population representativeness. Many times, sampling, though intended as random, is no such thing. This is because certain members of a population are more accessible than others, and so more likely to be chosen.
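The effect of unequal accessibility can be sketched with a simulation. Everything here is an assumption made for illustration: the population, the rule that members with above-average values are three times easier to reach, and the sample size of 1,000.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of (value, accessibility) pairs, in which members
# with higher values happen to be easier to reach.
population = []
for _ in range(20_000):
    value = rng.gauss(50, 10)
    accessibility = 0.9 if value > 50 else 0.3  # assumed chance of being reachable
    population.append((value, accessibility))

true_mean = statistics.mean(v for v, _ in population)

# Simple random sample: every member is equally likely to be chosen.
random_sample = [v for v, _ in rng.sample(population, 1_000)]

# Convenience sample: candidates are drawn at random, but only those who
# turn out to be reachable end up in the sample.
convenience_sample = []
while len(convenience_sample) < 1_000:
    value, accessibility = rng.choice(population)
    if rng.random() < accessibility:
        convenience_sample.append(value)

print(f"population mean:         {true_mean:.2f}")
print(f"random sample mean:      {statistics.mean(random_sample):.2f}")
print(f"convenience sample mean: {statistics.mean(convenience_sample):.2f}")
```

Both samples have the same size, yet the convenience sample overstates the mean because accessible members are overrepresented. Drawing more objects the same way would not remove that bias.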

Market research long used landline phones to survey consumers. The problems of obtaining a representative sample, once mainly limited to the demographics of geography and income/wealth, have been compounded in recent decades by the fact that many people now use cell phones exclusively, and that phone books are no longer the population compendium they once were.

Data is evidence. In scientific experiments, phenomena are characterized via data. Data quality is a problem in every sort of analysis.

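The larger the data set, the more hands involved in its compilation, and the more processing stages involved, the more likely errors are to creep in. The reassurance offered by the law of large numbers may then be a mirage: the law assumes that added observations are sound, and errors that push consistently in one direction do not average away.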
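A little arithmetic shows how quickly errors accumulate. The sketch below rests on a simplifying assumption that is not in the text: each processing stage independently corrupts any given record with the same small probability. The function name `expected_corrupted` and the 0.1% error rate are hypothetical.

```python
def expected_corrupted(n_records, n_stages, error_rate):
    """Expected number of records hit by at least one error, assuming each
    stage independently corrupts a record with probability error_rate."""
    p_clean = (1 - error_rate) ** n_stages  # record survives every stage intact
    return n_records * (1 - p_clean)

# Assumed per-stage error rate of 0.1%, chosen only for illustration.
for n_records, n_stages in [(1_000, 1), (1_000, 5), (1_000_000, 5)]:
    bad = expected_corrupted(n_records, n_stages, error_rate=0.001)
    print(f"{n_records:>9} records, {n_stages} stage(s): ~{bad:,.0f} corrupted")
```

Under these assumptions, adding stages raises the share of damaged records, and adding records raises their absolute number, so a large, heavily processed data set can carry thousands of errors even when each stage is quite reliable.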

Too many cooks spoil the broth. ~ proverb