‘Fake results’ in scientific studies

By Dr Huw Llewelyn

Image credit: Edited STATS1_P-VALUE originally by fickleandfreckled. CC BY 2.0 via Flickr.

There is concern that scientific studies are being replicated less often than expected, so a high proportion are suspected of being invalid. The blame is often put on confusion surrounding the ‘P value’, which is used to assess the effect of chance on scientific observations. A ‘P value’ is calculated by first assuming that the ‘true result’ is disappointing (e.g. that the outcomes of giving a treatment and a placebo would be exactly the same in an ideally large number of patients). This disappointing true result is called a ‘null hypothesis’ (read this freely available chapter).

A ‘P value’ of 0.025 means that if the ‘null hypothesis’ were true, there would be only a 2.5% chance of getting the real observed difference between treatment and placebo, or even a greater difference, in an actual study based on a smaller number of patients. This clumsy concept does not tell us the probability of getting a ‘true’ difference in an idealized study, based on the result of a real study.
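As a rough illustration of this definition, the sketch below computes a one-sided ‘P value’ using the ‘normal’ or Gaussian distribution. The function name and the example figure (an observed difference 1.96 standard errors above ‘no difference’) are assumptions chosen for illustration and do not come from any real study.

```python
import math


def one_sided_p_value(z):
    """Chance, if the null hypothesis were true, of seeing a difference
    at least z standard errors above 'no difference' (Gaussian model)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical study: the observed difference between treatment and
# placebo is 1.96 standard errors above zero.
p = one_sided_p_value(1.96)
print(round(p, 3))  # prints 0.025
```

An observed difference of zero would give a one-sided ‘P value’ of 0.5, since under the null hypothesis a result at or above ‘no difference’ is as likely as one below it.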

Because it is based on a random sampling model, a ‘P value’ implies that the probability of a treatment being truly better in a large idealized study is very near to ‘1 – P’, provided that: it is calculated using the ‘normal’ or Gaussian distribution; the study is described accurately so that someone else can repeat it in exactly the same way; the study is performed with no hidden biases; and there are no other study results that contradict it.

It should also be borne in mind that ‘truly better’ in this context includes differences of just greater than ‘no difference’, so that ‘truly better’ may not necessarily mean a big difference. However, if the above conditions of accuracy etc. are not met then the probability of the treatment being truly better than placebo in an idealized study will be lower (i.e. it will range from an upper limit of ‘1 – P’ [e.g. 1 – 0.025 = 0.975] down to zero). This is so because the possible outcomes of a very large number of random samples are always equally probable, this being a special property of the random sampling process.
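The ‘1 – P’ reasoning can be checked informally with a small simulation. The sketch below is only an illustration under stated assumptions: every possible true difference (in standard-error units, over an arbitrary range of -10 to 10) is treated as equally probable, the study adds Gaussian sampling error, and there are no biases. It then asks how often the true difference is greater than zero among simulated studies whose observed result lands close to 1.96 standard errors. The function name, range, window and number of trials are all hypothetical choices.

```python
import random


def simulate_prob_truly_better(z_observed, window=0.05, trials=400_000, seed=0):
    """Estimate the proportion of simulated studies with a true difference > 0,
    among those whose observed difference is close to z_observed."""
    rng = random.Random(seed)
    kept = 0
    truly_better = 0
    for _ in range(trials):
        # Assumption: all true differences in this range are equally probable.
        true_diff = rng.uniform(-10.0, 10.0)
        # Assumption: unbiased study with Gaussian sampling error of one standard error.
        observed = true_diff + rng.gauss(0.0, 1.0)
        if abs(observed - z_observed) < window:
            kept += 1
            if true_diff > 0:
                truly_better += 1
    return truly_better / kept


estimate = simulate_prob_truly_better(1.96)
print(round(estimate, 2))  # should land near 1 - P, i.e. roughly 0.975
```

If hidden biases or contradictory results were built into the simulation, the estimate would be expected to fall below this upper limit, in line with the range described above.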

The current confusion about ‘P values’ arises because it is overlooked that random sampling only ‘sees’ the proportion within each group, and it is wrongly assumed that a difference in the size of the source populations affects the sampling process. A scientist would be interested in the possible long-term outcome of an idealized study, not in the various proportions in the unknown source population.

A more in-depth version of this article was previously published on the OUPblog.

Dr Huw Llewelyn is a general physician with a special interest in endocrinology and acute medicine, who has had a career-long interest in the mathematical representation of the thought processes used by doctors in their day-to-day work during clinical practice, teaching and research. He has also been an honorary fellow in mathematics at Aberystwyth University for many years and has had wide experience in different medical settings: general practice, teaching hospital departments with international reputations for excellence, and district general hospitals in urban and rural areas.

His insight is reflected in the Oxford Handbook of Clinical Diagnosis, which is based on his teaching notes from when he was a consultant physician at King’s College Hospital in London, and in the mathematical models, in the form of new theorems, on which the book’s content is based.

The Oxford Handbook of Clinical Diagnosis is available in print and online.
