
Research design

Janet L. Peacock and Philip J. Peacock


PRINTED FROM OXFORD MEDICINE ONLINE. © Oxford University Press, 2021. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Medicine Online for personal use (for details see Privacy Policy and Legal Notice).

date: 26 October 2021


It is important to understand the main issues involved in study design in order to be able to critically appraise existing work and to design new studies. In this chapter we describe the main features of the design of interventional and observational studies and the differences and similarities between research and audit. We discuss when a sample size calculation is needed, describe the main principles of the calculations, and outline the steps involved in preparing a study protocol. Most sections are illustrated with examples and we give particular attention to the statistical issues that arise in designing and appraising research.

Introduction to research

Engaging with research

At any one time, a clinician or medical student who is engaging with quantitative research may be doing so for one or more of the following reasons:

  • To critically appraise research reported by others

  • To conduct primary research that aims to answer a specific question or questions, and thus generate new knowledge or extend existing knowledge

  • To gain research skills and experience, often as part of an educational programme

  • To test the feasibility of a particular research design or technique

The following issues are important for all of these:

  • What is the study question or aim?

  • What design is appropriate to answer the question(s)?

  • What statistics are appropriate for the study?

Conducting and appraising primary research

Primary research requires rigorous methods so that the design, data, and analysis provide sound results that stand up to scrutiny and add to current knowledge. Similarly, when critically appraising research, it is important to have a solid understanding of good research methodology.

Conducting research as part of an educational programme

When research is conducted purely for educational purposes, such as with a medical student project, the main purpose is not to generate new knowledge but instead to provide practical training in research that will equip the individual to conduct sound primary research at a later stage.

It is important that, as far as possible, research projects conducted within an educational programme are carried out rigorously. However, since these research projects usually face constraints, such as a narrow time frame and a limited budget, it may not be possible to fully meet the high standards set for primary research. For example, it may not be possible to recruit sufficient subjects to satisfy standard sample size calculations in the time given for a student project. If the purpose of the research is truly educational and not primarily to further knowledge, and this is made clear in any reporting, then this is not a problem.

Publishing research conducted as part of an educational programme

Although student projects are often limited in scope, they may be sufficiently novel and of a high enough standard to be published. This is to be encouraged to provide further experience of the publication process and to encourage high standards. For examples of student projects that have been published, see Peacock and Peacock (2006), Peacock et al. (2009), and Thomas et al. (2017).


Peacock PJ, Peacock JL. Emergency call work-load, deprivation and population density: an investigation into ambulance services across England. J Public Health (Oxf) 2006; 28:111–15.

Peacock PJ, Peters TJ, Peacock JL. How well do structured abstracts reflect the articles they summarize? Eur Sci Editing 2009; 35:3–5.

Thomas E, Peacock PJ, Bates SE. Variation in the management of SSRI-exposed babies across England. BMJ Paediatr Open 2017; 1:e000060.

Research questions


Research aims to establish new knowledge around a particular topic. The topic might arise out of the researcher’s own experience or interest, or from that of a mentor or senior, or it may be a topic commissioned by a funding body. Sometimes a research study follows on directly from a previous study, either by the researcher themselves or by another researcher, and on other occasions it may be a completely new topic.

As the research idea grows, the researcher generates a specific question or set of questions that he/she wants to pursue. It can be quite difficult to focus down on specific questions if the topic is broad and there are many things that are interesting to explore. The scope of the study will determine how many questions can be investigated—an individual with no research funds may only be able to centre on a single question, whereas one with a funded programme of research can investigate a number of related questions.

Even when a particular study investigates many questions, it is important that each question is tightly framed so that the right data can be collected and the appropriate analyses conducted. If questions are too vague or too general then the study will be difficult to design and may not ultimately be able to answer the real questions of interest.

Research questions

These should be:

  • Specific with respect to time/place/subjects/condition as appropriate

  • Answerable such that the relevant data are available or able to be collected

  • Novel in some sense so that the study either makes a contribution to knowledge or extends existing knowledge

  • Relevant to current medicine

Types of question

Most questions fall into one or more of the following categories:

  • Descriptive, for example, incidence/prevalence; trends/patterns; opinion/knowledge; life history of disease

  • Evaluative, for example, efficacy/safety of treatments or preventive programmes; may be comparative

  • Explanatory, for example, causes of disease; mechanisms for observed processes or actions or events


Examples

  • What is the prevalence of diabetes mellitus in the population?

    • This is a simple descriptive study

  • How effective is influenza vaccination in the community-based elderly?

    • This is a comparative study, comparing individuals who had vaccines with those who did not

  • Does lowering blood pressure reduce the risk of coronary heart disease?

    • This is an evaluative study, investigating the efficacy of lowering blood pressure

  • Is prognosis following stroke dependent on age at the time of the event?

    • This is an observational study

  • Why does smoking increase the risk of heart disease?

    • This is an explanatory study investigating the mechanism behind an observed relationship

  • What evidence is there for the effectiveness of antidepressants in treating depression?

    • This study is a meta-analysis of existing interventional studies

Translational medicine

What is it?

Translational medicine or translation research is often described as research that goes ‘from bench to bedside’ so that discoveries can be turned into treatments, devices, or programmes of care that improve patient health. It is sometimes described simply as ‘translating research into practice’ so that new treatments and new information lead to benefit for patients (Woolf 2008).

Why is it important?

Recent years have seen unprecedented advances in basic biomedical sciences including human genomics, other omics, stem cell biology, biomedical engineering, molecular biology, and immunology (Sung et al. 2003). These advances need to be translated into tangible benefits, such as:

  • Patient benefit: to ensure health research leads to improved patient outcomes

  • Innovation: to drive innovation in laboratory and clinical sciences

  • Return on investment: to ensure public funding is value for money

The translational pipeline

Figure 2.1 shows the five basic components of translational research displayed as a pipeline: basic science discoveries (phase 0), leading to early testing in healthy individuals (phase 1), then in patients (phase 2), leading to full testing in patients (phase 3), and finally leading to the adoption of effective interventions into patient care (phase 4).

Figure 2.1 The translational pipeline.

Challenges in translational medicine

Blockages in the pipeline

To maximize effectiveness and efficiency it is important to identify and remove blockages between phases. Particular problems arise where early discoveries are not carried through to testing, and where effective interventions are not moved into healthcare practice in a timely manner (Sung et al. 2003).

Implementing existing known best practice and information

This is a problem for existing interventions that are known to be effective but are either not implemented at all or their implementation is incomplete—some patients are given the best treatment and some are not. For example, it is well known that putting babies to sleep on their backs reduces the risk of cot death (Fleming et al. 1990), but many parents (and indeed some healthcare professionals) do not follow this practice.

Interdisciplinary working

One of the great benefits of translational medicine is the recognition of the importance of interdisciplinary working among professionals: life sciences, clinical sciences, social sciences, biostatistics, health psychology, health economics, and so on.

Quality of research

Robust research is transparent and reproducible, yet there has been growing recognition that poor-quality and/or inadequately reported research methods hamper this in both clinical sciences (Smith 2014) and life sciences (Editorial 2013; Masca et al. 2015).


Editorial. Reducing our irreproducibility. Nature 2013; 496:398.

Fleming PJ, Gilbert R, Azaz Y, Berry PJ, Rudd PT, Stewart A, Hall E. Interaction between bedding and sleeping position in the sudden infant death syndrome: a population based case-control study. BMJ 1990; 301:85–9.

Masca NG, Hensor EM, Cornelius VR, Buffa FM, Marriott HM, Eales JM, et al. RIPOSTE: a framework for improving the design and analysis of laboratory-based research. Elife 2015; 4:e05519.

Smith R. Medical research still a scandal. 2014.

Sung NS, Crowley WF Jr, Genel M, Salber P, Sandy L, Sherwood LM, et al. Central challenges facing the national clinical research enterprise. JAMA 2003; 289:1278–87.

Woolf SH. The meaning of translational research and why it matters. JAMA 2008; 299:211–13.

Interventional studies

Study designs

Intervention studies test the effect of a treatment or programme of care. The purpose is usually to test for efficacy, but in early drug trials safety and dosage are established first (see Phases of clinical trials, p. [link]).

No control group

  • Preliminary drug trials investigating safety and tolerance are often uncontrolled

Control group

  • It is highly desirable to have a control or comparison group in efficacy studies to be able to demonstrate superiority or inferiority

  • For example, it may be useful to know that a new drug lowers blood pressure, but it is more important to know how it compares to medications already in common use, especially as existing drugs are likely to be cheaper

Historical controls

  • Patients given a new treatment are compared with patients who were treated, assessed, and discharged under an existing treatment regimen before testing of the new treatment began

  • The comparison of the treatment group and the control group is not concurrent and may be problematic as other factors change over time, such as hospital staff and patient mix

  • Interpretation is difficult—it is impossible to be sure that any differences observed between the new treatment group and the control group are solely due to the treatments received

Randomization between intervention and control group

When randomization is not possible

  • It is hard to test the efficacy of a treatment that is widely used and accepted against no treatment or a placebo

  • For example, the use of adrenaline for cardiac arrest is generally accepted as effective. It would be difficult, if not impossible, to formally test this against a control treatment

Natural experiments

  • Individuals receive different interventions concurrently but in a non-randomized manner

Example 1

The effect of the fluoridation of drinking water may involve a comparison of subjects in areas where the water is subject to natural, artificial, or no fluoridation. Subjects are not allocated to the different types of fluoridation—this is determined by where they live.

Example 2

The effect of treatment may be compared in patients who choose conservative surgery for breast cancer rather than radical surgery. Patients are not randomized.

When intervention studies are unethical

  • It is not ethical to experiment on humans when the intervention is likely to cause harm

  • It is not ethical to test whether environmental agents cause harm, and so observational studies are used to determine effects

  • Natural experiments may allow a better comparison to be made of individuals who are exposed and unexposed than a cross-sectional analysis. For example, before and after studies have been used to compare health status before and after the introduction of bans on smoking in public places in the USA and Ireland (Eisner et al. 1998; Allwright et al. 2005). In this way, a reasonable assessment of the effect of passive smoke exposure was made

Design and analysis for non-randomized studies and natural experiments

  • Collect as much data as possible on the subjects’ key characteristics

  • Use statistical analysis to adjust for these differences

  • Note that, even with statistical adjustment, there may still be unknown differences between the groups, and so comparisons may still be biased without our knowing

  • Interpretation of non-randomized trials is difficult and firm conclusions are hard to draw (see Deducing causal effects, p. [link])


Allwright S, Paul G, Greiner B, Mullally BJ, Pursell L, Kelly A, et al. Legislation for smoke-free workplaces and health of bar workers in Ireland: before and after study. BMJ 2005; 331:1117.

Eisner MD, Smith AK, Blanc PD. Bartenders’ respiratory health after establishment of smoke-free bars and taverns. JAMA 1998; 280:1909–14.

Phases of clinical trials


Clinical trials are usually conducted in phases, as depicted in the translational research pipeline (see Figure 2.1, p. [link]). Each phase has specific aims as described in the following sections. The ‘early’ trials do not aim to give a firm conclusion about the effects of the intervention but rather to see whether the intervention shows promise with respect to improving patient outcome, and whether there are any concerns about safety.

Phase 1 trials

These are early trials where a new drug or treatment is tested on a small group of people. They usually set out to establish safety and/or tolerance and may seek to determine the appropriate dose in drug trials where this is not known. There are usually no control subjects.

Although these trials are small, the design may use complex statistical methods, particularly for dose-finding studies that may use statistical modelling throughout the study to identify when the dose should change as patients go through, and to identify when the optimum dose can be established. The design of these studies is an emerging area of research. At the time of writing, the UK National Institute for Health Research (NIHR) Statistics Group includes a research section on early phase trials and has published recommendations for dose-finding studies (Love et al. 2017). An example of a safety trial is given in Box 2.1 (Petrof et al. 2015).

See Petrof G et al. Potential of systemic allogeneic mesenchymal stromal cell therapy for children with recessive dystrophic epidermolysis bullosa. J Invest Dermatol 2015; 135:2319–21.

A more complex phase 1 dose-finding study is described in Chapter 14 (see Bayesian methods in early phase trials: example, p. [link]).

Phase 2 trials

These are also early trials but typically use a larger sample size than the corresponding phase 1 trial. They are usually controlled, with allocation to active or control intervention being randomized (see Randomization in RCTs, p. [link]). Their main aim is usually to assess effectiveness and safety. Like phase 1 trials, phase 2 trials are not designed to be definitive but rather to guide decision-making as to whether a new intervention is sufficiently promising to warrant testing in a full trial. Sample size can be based on probability methods (Piantadosi 2005) or a power-based method can be used with a power lower than the usual 80% or 90% (see Sample size for comparative studies, p. [link]).
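The power-based approach can be sketched with the standard normal-approximation sample size formula for comparing two means. The numbers below (a 5 mmHg difference with SD 12 mmHg) are hypothetical, chosen only to show how lowering the power shrinks the required sample size.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two means, using the normal
    approximation: n = 2 * (z_{1-alpha/2} + z_power)^2 * (sigma/delta)^2."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sigma / delta) ** 2)

# Hypothetical effect: detect a 5 mmHg blood pressure difference, SD 12 mmHg
print(n_per_group(5, 12, power=0.90))  # usual power for a definitive trial -> 122
print(n_per_group(5, 12, power=0.70))  # reduced power for a phase 2 trial  -> 72
```

Dropping the power from 90% to 70% roughly halves the number of patients per group, which is one reason phase 2 trials can be much smaller than definitive phase 3 trials.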

As with phase 1 trials, the designs of these trials can be complex. They tend to estimate the intervention effect with a 95% confidence interval but do not perform a significance test. An example is shown in Box 2.2 (Khan et al. 2016).

See Khan MS et al. A single-centre early phase randomised controlled three-arm trial of open, robotic, and laparoscopic radical cystectomy (CORAL). Eur Urol 2016; 69:613–21.

Phase 3 trials

These trials are designed to be definitive, i.e. to be large enough to detect the smallest difference in outcome that is clinically meaningful. They should also be large enough to estimate the incidence of adverse effects with reasonable precision.

Phase 4

These are studies that are carried out after the previous three phases to collect information on side effects and adverse effects associated with long-term use.


Khan MS, Gan C, Ahmed K, Ismail AF, Watkins J, Summers JA, et al. A single-centre early phase randomised controlled three-arm trial of open, robotic, and laparoscopic radical cystectomy (CORAL). Eur Urol 2016; 69:613–21.

Love SB, Brown S, Weir CJ, Harbron C, Yap C, Gaschler-Markefski B, et al. Embracing model-based designs for dose-finding trials. Br J Cancer 2017; 117:332–9.

Petrof G, Lwin SM, Martinez-Queipo M, Abdul-Wahab A, Tso S, Mellerio JE, et al. Potential of systemic allogeneic mesenchymal stromal cell therapy for children with recessive dystrophic epidermolysis bullosa. J Invest Dermatol 2015; 135:2319–21.

Piantadosi S. Clinical trials: a methodologic perspective. Chichester: Wiley, 2005.

Adaptive designs


Adaptive design trials are clinical trials that change the design, the analysis, or both on the basis of emerging data or outcomes. They aim to do one or more of the following:

  • Increase efficiency by reducing the number of patients included

  • Reduce the time required for the trial to reach a conclusion

  • Increase the likelihood of demonstrating an effect if one exists

  • Provide more useful information on the dose–response relationship

Early phase adaptive designs

These aim to determine whether an intervention is safe and, if the intervention is a drug, what the best dose is. They seek to allocate a higher proportion of participants to treatments or doses that are effective and fewer to those that are not. Where the best dose is unknown they explore a range of doses in order to determine the maximum tolerated dose (MTD). The designs often involve complex statistics including Bayesian methods (see Chapter 14), and there may be several possible acceptable designs (see Phase 1 trials, p. [link]).

Phase 3 adaptive designs

These seek to make pre-planned changes to the future conduct of the trial on the basis of emerging data, while maintaining the statistical integrity of the conclusions. Examples of adaptive designs include:

  • Designs that allow ‘seamless’ transition between phases 2 and 3

  • Designs that permit sample size recalculation with either blinded or unblinded data

  • Group sequential designs that allow early stopping for efficacy, futility, or harm

  • Population enrichment designs that remove treatment groups or other subgroups in which the intervention is less effective

As for early phase adaptive designs, these may be statistically complex, including the use of Bayesian methods (see Chapter 14).

Further reading

Bhatt DL, Mehta C. Adaptive designs for clinical trials. N Engl J Med 2016; 375:65–74.

Piantadosi S. Clinical trials: a methodologic perspective. Chichester: Wiley, 2005.

Biomarker designs


These are designs that seek to discover whether a specific patient characteristic, or biomarker, can be used to identify a subgroup of participants in which an intervention is more effective. Typically patients are randomized to either a biomarker-led strategy of care or standard care. For the patients in the biomarker arm, their care is directed according to their biomarker status. An example of a biomarker trial in patients following kidney transplant is given in Box 2.3 (Dorling et al. 2014).

See Dorling A et al. Can a combined screening/treatment programme prevent premature failure of renal transplants due to chronic rejection in patients with HLA antibodies: study protocol for the multicentre randomised controlled OuTSMART trial. Trials 2014; 15:30.

Further reading

Wason J, Marshall A, Dunn J, Stein RC, Stallard N. Adaptive designs for clinical trials assessing biomarker-guided treatment strategies. Br J Cancer 2014; 110:1950–7.


Dorling A, Rebollo-Mesa I, Hilton R, Peacock JL, Vaughan R, Gardner L, et al. Can a combined screening/treatment programme prevent premature failure of renal transplants due to chronic rejection in patients with HLA antibodies: study protocol for the multicentre randomised controlled OuTSMART trial. Trials 2014; 15:30.

Pilot and feasibility studies


Pilot and feasibility studies are preliminary studies conducted in preparation for a full study. They aim to test the process and protocols to make sure that the study will run as planned and achieve its aims.

Overall study aims

Definitions of pilot and feasibility studies vary but there is general agreement that:

  • Pilot studies are a small-scale version of the intended full study

  • Feasibility studies test the practicalities of conducting the study

Key objectives of pilot and feasibility trials

(See Lancaster et al. 2004; Lancaster 2015.)

  • Test the integrity of the study protocol for the future trial

  • Gain initial estimates for sample size calculations

  • Test data collection forms or questionnaires

  • Test the randomization procedure

  • Estimate rates of recruitment and consent

  • Determine the acceptability of the intervention

  • Identify the most appropriate primary outcome

What pilot and feasibility trials are not

Pilot and feasibility studies are not designed (or powered) to test the effectiveness of an intervention or treatment. Effectiveness is tested in a full trial. Hence hypothesis tests are not the main focus and may not be needed at all. If any treatment comparisons are carried out then these are reported with confidence intervals but not P values.
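This ‘confidence intervals, not P values’ reporting style can be sketched as follows. The data are invented for illustration, and for a genuinely small pilot a t-distribution multiplier would be more appropriate than the normal z used in this simplified sketch.

```python
import math
from statistics import NormalDist, mean, stdev

def diff_in_means_ci(group_a, group_b, level=0.95):
    """Difference in means with a confidence interval (normal approximation,
    unpooled standard error). No P value is computed: pilot comparisons are
    reported as estimates with intervals, not hypothesis tests."""
    d = mean(group_a) - mean(group_b)
    se = math.sqrt(stdev(group_a) ** 2 / len(group_a) +
                   stdev(group_b) ** 2 / len(group_b))
    z = NormalDist().inv_cdf((1 + level) / 2)
    return d, (d - z * se, d + z * se)

# Invented pilot data (e.g. symptom scores in two small groups)
d, ci = diff_in_means_ci([12, 15, 11, 14, 13], [10, 13, 9, 12, 11])
print(f"difference = {d:.1f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The output is an estimated difference with a range of plausible values, which is the appropriate summary when the study was never powered for a formal test.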

Sometimes a trial is described as a pilot because it is small even though the aim is to test effectiveness. However, this is not a true pilot study.

Pilot or feasibility study?

Exact definitions of pilot and feasibility studies vary, and so to resolve this, Eldridge and colleagues (2016) developed a conceptual framework of definitions. They used multiple methods among a wide range of trialists to reach the following consensus:

  • Pilot studies are a subset of feasibility studies; they are not mutually exclusive

  • A feasibility study asks whether something can be done, should we proceed with it, and if so, how?

  • A pilot study asks the same questions but also has a special design feature: in a pilot study, a future study, or part of a future study, is conducted on a smaller scale

Eldridge and colleagues (2016) recommended that:

  • These studies should be identified using the term ‘pilot’ or ‘feasibility’ in the title or abstract of publications

  • Researchers should report the study objectives and methods related to feasibility

  • Researchers should clearly state that the study is in preparation for a future full trial designed to assess the effect of an intervention

Sample size for pilot and feasibility studies

The sample size for these studies needs to be sufficient to answer the aims. Various authors have made recommendations, such as Julious (2005), but the important thing is that the sample size is justified appropriately.

Applying for grants for pilot and feasibility studies

The Medical Research Council (2006) and NIHR (2016) have each stated their working definitions of pilot and feasibility trials. As these are slightly different, researchers should check that their design fits with the funding stream they are applying to. The NIHR sometimes requires a published pilot study prior to an application for funding a definitive randomized controlled trial (RCT).


Eldridge SM, Lancaster GA, Campbell MJ, Thabane L, Hopewell S, Coleman CL, Bond CM. Defining feasibility and pilot studies in preparation for randomised controlled trials: development of a conceptual framework. PLoS One 2016; 11:e0150205.

Julious S. Sample size of 12 per group rule of thumb for a pilot study. Pharm Stat 2005; 4:287–91.

Lancaster GA. Pilot and feasibility studies come of age! Pilot Feasibility Stud 2015; 1:1.

Lancaster GA, Dodd S, Williamson PR. Design and analysis of pilot studies: recommendations for good practice. J Eval Clin Pract 2004; 10:307–12.

Medical Research Council. Developing and evaluating complex interventions: new guidance. 2006.

National Institute for Health Research. Pilot studies. 2016.

Randomized controlled trials


An RCT is an intervention study in which subjects are randomly allocated to treatment options. RCTs are the accepted ‘gold standard’ of individual research studies. They provide sound evidence about treatment efficacy which is only bettered when several RCTs are pooled in a meta-analysis.

Choice of comparison group

  • The choice of the comparison group affects how we interpret evidence from a trial

  • A comparison of an active agent with an inert substance or placebo is likely to give a more favourable result than comparison with another active agent

  • Comparison of an active agent against a placebo when an existing active agent is available is generally regarded as unethical (see Box 2.4, Declaration of Helsinki, item 32)

  • For example, it would not be ethical to test a new anticholesterol drug against a placebo; any comparison of new therapy would have to be against the currently proven therapy, statins

Comparison with ‘usual care’

When an intervention is a programme of care, for example, an integrated care pathway for the management of stroke, it is common practice for the comparison group to receive the usual or standard care.

Declaration of Helsinki

The Declaration of Helsinki was first developed in 1964 by the World Medical Association to provide guidance about ethical principles for research involving human subjects. It has had multiple revisions since, with the latest full version published in 2008 with additional protections for research in children added in 2012. Although not legally binding of itself, many of its principles are contained in laws governing research in individual countries, and the declaration is widely accepted as an authoritative document on human research ethics.

The declaration addresses issues such as:

  • Duties of those conducting research involving humans

  • Importance of a research protocol

  • Research involving disadvantaged or vulnerable persons

  • Considering risks and benefits

  • Importance of informed consent

  • Maintaining confidentiality

  • Informing participants of the research findings

The full 35-point declaration is available online.

Randomization in RCTs

Why randomize?

  • Randomization ensures that the subjects’ characteristics do not affect which treatment they receive. The allocation to treatment is unbiased

  • In this way, the treatment groups are balanced for subject characteristics in the long run, and differences between the groups in the trial outcome can be attributed to the treatments alone

  • This provides a fair test of efficacy for the treatments, which is not confounded by patient characteristics

  • Randomization makes blindness possible (see Blinding in RCTs, p. [link])

Randomizing between treatment groups

The usual way to do random allocation is by using a computer program based on random numbers. The random allocation process may work in two different ways:

  • The program is interactive and provides the allocation code for each patient as he/she is entered into the trial. This may be a code referring to a treatment, to maintain blindness; if the treatment cannot be blinded (e.g. with a technology), it will be the name of the actual intervention

  • A computer-generated list of sequential random allocations is produced and administered by someone who is independent of the team that is recruiting patients to the trial. In this way, there is no bias in recruitment or allocation. In drug trials, the pharmacy may conduct the randomization and provide numbered containers to which it holds the code, so that the researcher and the patient can be kept blind to the actual allocation
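The computer-generated allocation list described above can be sketched in a few lines. The fixed seed is an illustrative assumption so that the list is reproducible; in practice the list would be generated and held by someone independent of recruitment.

```python
import random

def allocation_list(n, treatments=("A", "B"), seed=20210101):
    """Simple (unrestricted) randomization: each subject is allocated
    independently at random. The seed is fixed only for reproducibility."""
    rng = random.Random(seed)
    return [rng.choice(treatments) for _ in range(n)]

allocations = allocation_list(12)
print(allocations)
```

Note that simple randomization is unbiased but does not guarantee similar group sizes at any given point in the trial; stratification and blocking, described later in this topic, address that.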

Audit trail

It is important to have an audit trail of the recruitment and randomization process including keeping a log of the recruited patients. This information is needed for later reporting of the trial and assists with checking that the trial is being conducted according to the protocol.

Non-random allocation

Alternate allocation, or a method based on patient identifiers such as hospital number or date of birth, is not a random method and is not recommended because such methods are open and, in the case of alternate allocation, predictable. They make blinding difficult and leave room for the researcher to change the allocation or to recruit according to the treatment that is to be received (e.g. to give a sicker patient the new treatment).

Stratification for prognostic factors

If there are important prognostic factors that need to be accounted for in a particular trial, the random allocation can be stratified so that the treatment groups are balanced for the prognostic factors. For example, in trials of treatment for heart disease, the random allocation may be stratified by gender so that there are similar numbers of men and women receiving each treatment.


Minimization is another method of allocating subjects to treatment groups while allowing for important prognostic factors (Pocock 1983; Altman and Bland 2005). The allocation takes place in a way that best maintains balance in these factors. At all stages of recruitment, the next patient is allocated to the treatment which minimizes the overall imbalance in prognostic factors. For a worked example, see Altman and Bland (2005) or Pocock (1983). ‘Minim’ software to do minimization is available free from Martin Bland’s website.
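A minimal sketch of the minimization rule follows (after Pocock 1983; Altman and Bland 2005). The factor names and patients are hypothetical. For each candidate treatment we imagine allocating the new patient to it, total the resulting imbalance across the patient's own prognostic factor levels, and choose the treatment giving the smallest total, breaking ties at random.

```python
import random

def minimize_allocate(new_patient, allocated, factors, treatments=("A", "B")):
    """Allocate the next patient by minimization: choose the treatment that
    minimizes total imbalance over the patient's own prognostic factor levels.
    `allocated` is a list of (patient_dict, treatment) pairs; each patient is
    a dict mapping factor name -> level, e.g. {"sex": "M"} (hypothetical)."""
    def total_imbalance(candidate):
        total = 0
        for f in factors:
            level = new_patient[f]
            # per-arm counts of existing patients sharing this factor level
            counts = {t: sum(1 for p, pt in allocated if pt == t and p[f] == level)
                      for t in treatments}
            counts[candidate] += 1  # hypothetically add the new patient here
            total += max(counts.values()) - min(counts.values())
        return total

    scores = {t: total_imbalance(t) for t in treatments}
    best = min(scores.values())
    return random.choice([t for t, s in scores.items() if s == best])

# Hypothetical trial: one male patient already on A, so a new male goes to B
trial = [({"sex": "M"}, "A")]
print(minimize_allocate({"sex": "M"}, trial, factors=["sex"]))  # -> B
```

In practice a random element is often retained even when one treatment is the clear minimizer (allocating to it with high but not certain probability), so that the scheme is not fully predictable.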


Blocking is used to ensure that the number of subjects in each group is very similar at any time during the trial. The random allocation is determined in discrete groups or blocks so that within each block there are equal numbers of subjects allocated to each treatment.

Example using blocks of size 4 and two treatments A, B

There are six possible blocks or arrangements of A and B, which give equal numbers of As and Bs:

AABB, ABAB, ABBA, BAAB, BABA, BBAA

We randomly choose blocks, so say the first two chosen blocks are:

BBAA, AABB

Then the first eight subjects will be allocated B, B, A, A, A, A, B, B.

The running totals of subjects on (A, B) as subjects 1 to 8 are recruited will be:

(0,1), (0,2), (1,2), (2,2), (3,2), (4,2), (4,3), (4,4).

Hence, at all times, the total on A and the total on B will only differ by a maximum of 2 and so the treatment numbers will always be very similar and the numbers will be exactly balanced after every fourth subject is randomized.

Further extensions of ‘blocking’ are available with a mixture of different block sizes, whereby random combinations of blocks are selected.
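The blocking scheme above can be sketched directly. A minimal illustration (the block size and seed are invented for the example):

```python
import itertools
import random

def blocked_randomization(n, block_size=4, seed=3):
    """Build an allocation list from randomly chosen balanced blocks of
    A and B, so running totals never differ by more than block_size / 2."""
    rng = random.Random(seed)
    half = block_size // 2
    # for block size 4 this yields the six blocks: AABB, ABAB, ABBA, BAAB, BABA, BBAA
    blocks = sorted(set(itertools.permutations("A" * half + "B" * half)))
    allocation = []
    while len(allocation) < n:
        allocation.extend(rng.choice(blocks))
    return allocation[:n]

alloc = blocked_randomization(8)
```

After every complete block the two groups are exactly balanced, and at any interim point they differ by at most half the block size.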

Further reading

Altman DG, Bland JM. Statistics notes: treatment allocation in controlled trials: why randomise? BMJ 1999; 318:1209.

Altman DG, Bland JM. Statistics notes: how to randomise. BMJ 1999; 319:703–4.

Altman DG, Bland JM. Treatment allocation by minimisation. BMJ 2005; 330:843.

Pocock SJ. Clinical trials: a practical approach. Chichester: Wiley, 1983.

Patient consent in research studies


It is generally accepted that all subjects participating in research give their prior informed consent. The Declaration of Helsinki (item 24) states the following:

In medical research involving competent human subjects, each potential subject must be adequately informed of the aims, methods, sources of funding, any possible conflicts of interest, institutional affiliations of the researcher, the anticipated benefits and potential risks of the study and the discomfort it may entail, and any other relevant aspects of the study. The potential subject must be informed of the right to refuse to participate in the study or to withdraw consent to participate at any time without reprisal. Special attention should be given to the specific information needs of individual potential subjects as well as to the methods used to deliver the information. After ensuring that the potential subject has understood the information, the physician or another appropriately qualified individual must then seek the potential subject’s freely-given informed consent, preferably in writing. If the consent cannot be expressed in writing, the non-written consent must be formally documented and witnessed.

(Declaration of Helsinki, item 24)

Informed consent

  • This requires giving patients a detailed description of the study aims, what participation is required, and any risks they may be exposed to

  • Consent must be voluntary

  • Consent is confirmed in writing and a cooling-off period is provided to allow subjects to change their minds

  • Consent must be obtained for all patients recruited to an RCT

  • Giving or withholding consent must not affect patient treatment or access to services

  • For questionnaire surveys, consent is often implicit if the subject returns the questionnaire where it is clear in the accompanying information that participation is voluntary

  • Consent may not be required if the study involves anonymized analyses of patient data only

When consent may be withheld

In some situations, obtaining patient consent to a study may be problematic.

Example 1

An example is where the intervention is so desirable that patients would not want to risk being randomized to the control group. This is particularly so when it is not possible to mask the intervention, such as where the intervention is a programme of care and the control treatment is ‘usual care’. Subjects may not be willing to enter the trial and risk not getting the new intervention, or they may enter the trial but drop out if they are allocated to the control group.

One solution in situations like these is for the researcher to decide in advance to offer the intervention to all control group subjects after the trial has finished, assuming that the intervention proves to be effective. For example, in exercise therapy trials, control group subjects may be offered the exercise regimen at the end of the trial if it has been shown to work. Such an approach is stated in the Declaration of Helsinki (item 33; Box 2.5) and would need to be costed into the trial.

Example 2

Patients may be reluctant to agree to enter a trial of a new therapy when there is an existing treatment which is known to work. In such situations, assuming that there is equipoise, it is the responsibility of the clinician to explain the study clearly enough to allow the patient to make an informed choice of whether or not to take part.

Further discussion of patient consent is beyond the scope of this book, but the General Medical Council UK website has detailed guidance.

Blinding in RCTs

Concealing the allocation

  • Blinding is when the treatment allocation is concealed from either the subject or assessor or both

  • It is done to avoid conscious or unconscious bias in reported outcomes

  • A trial is double blind if neither the subject nor the assessor knows which treatment is being given

  • A trial is single blind if the treatment allocation is concealed from either the subject or the assessor but not both

  • ▶ Note that randomization makes blinding possible, and this is one of its most important roles


A subject who knows that he is receiving a new treatment for pain which he expects to be beneficial may perceive or actually feel less pain than he would do if he thought he was receiving the old treatment.

An assessor who knows that a subject is receiving the new steroid treatment for chronic obstructive pulmonary disease, which he expects to work better than the old one, may tend to round up measurements of lung function.

If the treatment allocation is concealed, then both the patient and assessor will make unbiased assessments of the effects of the treatments being tested.


  • A placebo is an inert treatment that is indistinguishable from the active treatment

  • In drug trials it is often possible to use a placebo drug for the control which looks and tastes exactly like the active drug

  • The use of a placebo makes it possible for both the subject and assessor to be blinded

When blinding is not possible

In some situations blinding is not possible, such as in trials of technologies where concealment is impossible. For example, in trials comparing different types of ventilator, it is impossible to blind the clinician, and similarly in trials of surgery versus chemotherapy.

Possible solutions are the use of sham treatments, such as sham surgery, but this may not be ethically acceptable. Trials of the effectiveness of acupuncture have used sham acupuncture for the control group to maintain blindness (Scharf et al. 2006) and trials involving injections sometimes use saline injections in the control group, although this may raise ethical objections.

Sometimes ingenuity can be employed to address blindness, such as in a trial of electrical stimulation in non-healing fractures, where patients in the control group also received an electric current of non-therapeutic power but sufficient to interfere with radio reception in the same way as the active coil did (Simonis et al. 2003).

Double placebo (double dummy)

If a trial involves two active treatments that have different modes of treatment, for example, a tablet versus a cream, a double placebo (‘double dummy’) can be used whereby each patient receives two treatments. In the example given, patients would receive either the active tablet plus a placebo cream, or a placebo tablet plus an active cream. A double dummy can also be used if the timing of treatment is different for the two drugs being tested, for example, if one drug is given once a day in the morning (drug A) and the other is given twice a day, morning and evening (drug B). In this case, one group of patients would receive the active drug A in the morning and placebo drug B both morning and evening and the other would receive the placebo drug A in the morning and active drug B both morning and evening.

Active placebo

Trials may use an active placebo, which mimics the treatment in some way to maintain blindness. For example, some treatments give patients a dry mouth and so the presence or absence of this side effect may indicate to the patient which treatment they are on.


In a trial of dextromethorphan and memantine to treat neuropathic pain, patients in the placebo group were given low-dose lorazepam to mimic the side effects of dextromethorphan and memantine and thus help conceal the treatment allocation (Sang et al. 2002).


Sang CN, Booher S, Gilron I, Parada S, Max MB. Dextromethorphan and memantine in painful diabetic neuropathy and postherpetic neuralgia: efficacy and dose-response trials. Anesthesiology 2002; 96:1053–61.

Scharf HP, Mansmann U, Streitberger K, Witte S, Kramer J, Maier C, et al. Acupuncture and knee osteoarthritis: a three-armed randomized trial. Ann Intern Med 2006; 145:12–20.

Simonis RB, Parnell EJ, Ray PS, Peacock JL. Electrical treatment of tibial non-union: a prospective, randomised, double-blind trial. Injury 2003; 34:357–62.

RCTs: parallel groups and crossover designs

Two or more parallel groups

  • This is a trial with a head-to-head comparison of two or more treatments

  • Subjects are allocated at random to a single treatment or a single treatment programme for the duration of the trial

  • Usually, the aim is to allocate equal numbers to each treatment group, although unequal allocation is possible

  • The groups are independent of each other

Crossover trials

  • This involves a single group study where each patient receives two or more treatments in turn

  • Each patient therefore acts as their own control and comparisons of treatments are made within patients

  • The two or more treatments are given to each patient in random order

  • Crossover trials are useful for chronic conditions such as pain relief in long-term illness or the control of high blood pressure where the outcome can be assessed relatively quickly

  • They may not be feasible for treatments for short-term illnesses or acute conditions that once treated are cured, for example, antibiotics for infections

  • It is important to avoid any carry-over effect of one treatment into the period in which the next treatment is given. This is usually achieved by having a gap, or washout period, between treatments so that the effects of the first treatment have worn off before the next treatment starts

  • The simplest design is a two-treatment comparison in which each patient receives each of the two treatments in random order with a washout period of non-treatment in between

  • There are some particular statistical issues that may arise in crossover trials which are related to the washout period and carry-over effects, and how and whether to include patients who do not complete both periods. Senn (2002) gives a full discussion of the issues and possible solutions.

Example: crossover trial

A randomized, double-blind, placebo-controlled crossover study tested the effectiveness of valproic acid to relieve pain in patients with painful polyneuropathy. Thirty-one patients were randomized to receive either valproic acid (1500 mg daily) and then placebo, or placebo followed by valproic acid. Each treatment lasted for 4 weeks. No significant difference in total pain or individual pain rating was found between treatment periods on valproic acid and placebo (total pain (median) = 5 in the valproic acid period versus 6 in the placebo period; P = 0.24) (Otto et al. 2004).

Choice of design: parallel group or crossover?


Otto M, Bach FW, Jensen TS, Sindrup SH. Valproic acid has no effect on pain in polyneuropathy: a randomized, controlled trial. Neurology 2004; 62:285–8.

Senn S. Cross-over trials in clinical research. Chichester: Wiley, 2002.

Zelen randomized consent design


This design can be used when comparing a new treatment programme with usual care and attempts to address problems with patient consent (see Patient consent in research studies, p. [link]).

Allocation to treatments

  • Subjects are randomly allocated to treatment or usual care

  • Only those subjects who are allocated to treatment are invited to participate and to give their consent

  • Subjects allocated to usual care (control) are not asked to give their consent

  • Among the treatment group, some subjects will refuse and so this design results in three treatment groups (Zelen 1979, 1990):

    1. Usual care (allocated)

    2. Intervention

    3. Usual care (but allocated to intervention)

  • The analysis is performed with patients analysed in the original randomized groups, that is, 1 versus 2 + 3 (see Intention-to-treat analysis, p. [link])

Double randomized consent

  • Patients are randomized to intervention or control and then their consent is sought, whichever group they are allocated to

  • Patients are allowed to choose either the treatment they are allocated to or the other treatment

  • The analysis is performed with patients analysed in the original randomized groups, whichever treatment they chose or received (see Intention-to-treat analysis, p. [link])


The single randomized Zelen design has been criticized as being unethical since some subjects are not informed that they are in a trial. However, it is generally agreed that some trials could not take place without the use of this design because in some situations patients would not wish to take part if they were allocated to the control group. It could be argued that this therefore justifies its use (Torgerson and Roland 1998).


Torgerson DJ, Roland M. Understanding controlled trials: what is Zelen’s design? BMJ 1998; 316:606.

Zelen M. A new design for randomized clinical trials. N Engl J Med 1979; 300:1242–5.

Zelen M. Randomized consent designs for clinical trials: an update. Stat Med 1990; 9:645–56.

Zelen randomized consent design (continued)


In this trial the investigators sought to determine whether an intervention using postcards could reduce the number of episodes of repeated deliberate self-poisoning (Carter et al. 2005). Potentially eligible patients were identified from a database of patients who had presented at the emergency department with poisoning. They were randomized to either receive the postcard intervention or to the control group, but the allocation was hidden until recruitment. After randomization, patients were screened for eligibility and consent was sought from those randomized to the postcard intervention. Figure 2.2 shows the flow chart.

Figure 2.2 Flow chart of participants through a Zelen design trial.


Reproduced from BMJ, Carter GL (2005) “Postcards from the EDge project: randomised controlled trial of an intervention using postcards to reduce repetition of hospital treated deliberate self-poisoning”, 331(7520): 374–375 with permission from BMJ Publishing Group Ltd.

The primary analysis was by intention to treat comparing the proportions with repeated attendance at the emergency department with self-poisoning in the postcard versus the control group, pooling those who did and did not consent to the intervention.

The primary analysis was not statistically significant but a secondary outcome, the number of repetitions, was significantly reduced in the postcard group. The design necessarily meant that some intervention patients did not receive the intervention through withheld consent. This is likely to have reduced the difference between the groups, but the authors argued that this design was suited to this study and clinical population.


Carter GL. Postcards from the EDge project: randomised controlled trial of an intervention using postcards to reduce repetition of hospital treated deliberate self poisoning. BMJ 2005; 331:374–5.

Superiority and equivalence trials

Superiority trials

  • These seek to establish that one treatment is better than another

  • When the trial is designed, the sample size is set so that there is high statistical power to detect a clinically meaningful difference between the two treatments

  • For such a trial a statistically significant result is interpreted as showing that one treatment is more effective than the other

Equivalence trials

  • These seek to test if a new treatment is similar in effectiveness to an existing treatment

  • They are appropriate if the new treatment has certain benefits such as fewer side effects, being easier to use, or being cheaper

  • The trial is designed to be able to demonstrate that, within given acceptable limits, the two treatments are equally effective

  • Equivalence is defined by a pre-set maximum difference between treatments such that, if the observed difference is less than this, the two treatments are regarded as equivalent

  • The limits of equivalence need to be set to be appropriate clinically

  • The tighter the limits of equivalence are set, the larger the sample size that will be required

  • If the condition under investigation is serious, then tighter limits for equivalence are likely to be needed than if the condition is less serious

  • The calculated sample size tends to be bigger for equivalence trials than superiority trials
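One common way to operationalize pre-set limits of equivalence is to check whether the confidence interval for the treatment difference lies entirely within the margin. A minimal sketch for two proportions, using the normal approximation; all numbers (success rates, sample sizes, margin) are invented for illustration:

```python
import math

def equivalent(p1, n1, p2, n2, margin, z=1.96):
    """Return the 95% CI for the difference in proportions and whether
    it lies entirely within +/- margin (the pre-set limits of equivalence)."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lo, hi = diff - z * se, diff + z * se
    return (lo, hi), (-margin < lo and hi < margin)

# e.g. success rates 70% vs 68% with 400 patients per arm, margin of 10 points
(lo, hi), is_equiv = equivalent(0.70, 400, 0.68, 400, margin=0.10)
```

Tightening the margin forces the confidence interval to be narrower before equivalence can be declared, which is why tighter limits demand a larger sample.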

Non-inferiority trials

  • This is a special case of the equivalence trial where the researchers only want to establish if a new treatment is no worse than an existing treatment

  • In this situation the analysis is by nature one-sided (see Tests of statistical significance, p. [link])


  • In general, the design and implementation of equivalence trials are less straightforward than for superiority trials

  • If patients are lost to follow-up or fail to comply with the trial protocol, then any differences between the treatments are likely to be reduced and so equivalence may be incorrectly inferred

  • Equivalence trials therefore need especially strict management and good patient follow-up to minimize these problems

  • It is often helpful to include a secondary analysis where subjects are analysed according to the treatment they actually received, ‘per protocol’ analysis


  • Is atorvastatin more effective at reducing blood cholesterol levels than simvastatin?

This is an example of a superiority trial

  • Are angiotensin receptor blockers (e.g. valsartan) as effective at reducing blood pressure in hypertensive patients as angiotensin-converting enzyme inhibitors (e.g. ramipril)?

This is an example of an equivalence trial

  • Does biomarker-led care reduce the risk of graft failure in renal transplant patients?

This trial uses both superiority of biomarker-led care in biomarker-positive patients and non-inferiority of screening for biomarker status overall. For further details, see Biomarker designs, p. [link], Dorling et al. (2014), and the following text

Superiority and equivalence

  • It is important to distinguish between superiority and equivalence when designing a trial

  • The choice depends on the purpose of the trial

  • A trial designed for one purpose may not be able to adequately fulfil the other

  • In general, equivalence trials tend to need larger samples

  • A trial designed to test superiority is unlikely to be able to draw the firm conclusion that two treatments which are not significantly different can be regarded as equivalent

For further details of equivalence trials, see the books on clinical trials by Matthews (2006) and Girling and colleagues (2003).


Dorling A, Rebollo-Mesa I, Hilton R, Peacock JL, Vaughn R, Gardner L, et al. Can a combined screening/treatment programme prevent premature failure of renal transplants due to chronic rejection in patients with HLA antibodies: study protocol for the multicenter randomised controlled OuTSMART trial. Trials 2014; 15:30.

Girling DJ, Parmar MKB, Stenning SP, Stephens RJ, Stewart LA. Clinical trials in cancer: principles and practice. Oxford: Oxford University Press, 2003.

Matthews JNS. Introduction to randomized controlled clinical trials, 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2006.

Cluster trials


In most randomized trials, individual participants are allocated to an intervention. In a cluster randomized trial, a group of individuals, or ‘cluster’, is allocated to receive the same intervention. So, if there are two interventions A and B, some clusters will receive A and others will receive B.

Cluster trials are sometimes used in primary care studies where it would be difficult to allocate individual patients in a general practice to different treatments. They are also sometimes used in hospital studies where, for example, a whole ward or clinic is the ‘cluster’.

Why randomize clusters?

To avoid contamination

When individuals are in a natural grouping such as a general practice, they may have contact with other patients in the trial who receive the same or a different intervention. This might affect their compliance and response to the intervention.


Some treatments are naturally administered to groups of individuals, for example, if the intervention is an exercise class. Others would be difficult to administer to individuals simply because of the complexity of an intervention, for example, if the intervention was a programme of care.

Consequences of allocating clusters

  • Two individuals in the same cluster are more alike than two individuals in different clusters. This clustering needs to be accounted for in the analysis (see Cluster samples: analysis, p. [link])

  • A cluster trial needs a larger sample than the equivalent trial randomized at the individual level and so the clustering needs to be considered in the sample size calculations. These calculations use a measure called the ‘intraclass correlation coefficient’ or ‘ICC’, which quantifies the extent to which individuals within the same cluster are more alike than those in different clusters (see Sample size in cluster trials, p. [link])
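The sample size inflation for clustering is usually computed with the design effect, 1 + (m − 1) × ICC, where m is the cluster size. A small sketch; the cluster size, ICC, and individually randomized sample size are invented for illustration:

```python
import math

def design_effect(cluster_size, icc):
    """Design effect: how much a cluster trial's sample size must be
    inflated relative to individual randomization."""
    return 1 + (cluster_size - 1) * icc

def cluster_sample_size(n_individual, cluster_size, icc):
    """Per-arm sample size after inflating for clustering (rounded up)."""
    return math.ceil(n_individual * design_effect(cluster_size, icc))

# e.g. 200 per arm under individual randomization, clusters of 21 patients,
# ICC = 0.05: the design effect is 2.0, so 400 per arm are needed
n = cluster_sample_size(200, 21, 0.05)
```

Note how even a modest ICC doubles the required sample here, because the design effect grows with cluster size.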

Challenges with cluster trials

  • Number of clusters: the number of clusters required depends partly on the number of individuals available within each cluster (see Sample size in cluster trials, p. [link]). If the number of clusters is small, there is a greater chance of imbalance in baseline characteristics between treatment groups. In addition, there needs to be a reasonable number of clusters for the analyses to be valid. Eldridge and Kerry (2012) give a full discussion of the choice of number of clusters

  • If a whole cluster drops out for some reason, the impact on power and balance between the arms is greater than if an individual drops out in an individually randomized trial


Two examples are given in a later chapter (see Cluster samples: analysis, p. [link]).

Further reading

Kerry SM, Bland JM. Analysis of a trial randomised in clusters. BMJ 1998; 316:54.

Kerry SM, Bland JM. Sample size in cluster randomisation. BMJ 1998; 316:549.

Eldridge S, Kerry SM. A practical guide to cluster randomised trials in health services research. Chichester: Wiley, 2012.

Intention-to-treat analysis


The statistical analysis of RCTs is relatively straightforward where there are complete data. The primary analysis is a direct comparison of the treatment groups, and this is performed with subjects being included in the group to which they were originally allocated. This is known as analysing according to the intention to treat (ITT) and is the only way in which there can be certainty about the balance of the treatment groups with respect to baseline characteristics of the subjects. ITT analysis therefore provides an unbiased comparison of the treatments.

Change of treatment

If patients change treatment they should still be analysed together with patients in their original, randomly allocated group, since a change of treatment may be related to the treatment itself. If a patient’s data are analysed as if they were in their new treatment group, the balance in patient characteristics which was present at random allocation will be lost. A per protocol analysis, where patients are analysed according to the treatment they have actually received, may be useful in addition to the ITT analysis if some patients have stopped or changed treatment.

Complier average causal effect (CACE) methods can be used to disentangle the effects of treatment in compliers; for more details see Ye et al. (2014).

Missing data

Missing data are unfortunately common in all research studies, particularly where data are collected at several time points. Where there are missing data, it may not be possible to include a particular individual in the analysis, and clearly if there are a lot of missing data, the validity of the results is called into question.

Where possible, all subjects should be included in the analysis. In a trial with follow-up it may be possible to include subjects with no final data if they have some interim data available, either by using the interim data directly or by statistical modelling. These issues should be addressed through careful design of outcome data and strategies to minimize loss to follow-up.

All subjects recruited should be accounted for at all stages so that a detailed account can be given of how the trial was conducted and what happened to all subjects. This is particularly important for the interpretation of the findings and so is included when the study is written up.

A fuller discussion of missing data is given elsewhere (see Missing data, p. [link]).

ITT and missing data

  • Analyse subjects in the groups they were originally allocated to even if they change treatment or don’t comply

  • This provides an unbiased comparison of the treatments

  • Per protocol analysis may be useful but only in addition to ITT and not as the primary analysis (see following example)

  • Keep a record of all subjects to be able to account for their treatment and for any subjects who withdraw


RCT of introduction of allergenic foods in breastfed infants

This trial evaluated the early introduction of allergenic foods in the diet of breastfed infants to test the hypothesis that early introduction provided protection against the development of immunoglobulin E-mediated food allergy. The results showed that the early introduction group had a non-significant reduction in allergy at age 3 years in the intention-to-treat analysis (relative risk (RR) 0.80; 95% confidence interval (CI) 0.51, 1.25).

The researchers had observed some non-compliance with early introduction of foods and so a per protocol analysis was conducted. This showed that in those who complied with early introduction, there was a significantly lower risk of allergy at age 3 (RR 0.33; 95% CI 0.13, 0.83).

The researchers were unable to draw firm conclusions about the benefits of early introduction but noted no evidence of harm and a suggestion of efficacy in those that complied.

See Perkin MR, et al. (2016). Randomized trial of introduction of allergenic foods in breast-fed infants. N Engl J Med 2016; 374:1733–43.

Further reading

Matthews JNS. Introduction to randomized controlled clinical trials. Boca Raton, FL: Chapman & Hall/CRC, 2006.

Piantadosi S. Clinical trials: a methodologic perspective. Chichester: Wiley, 2005.

Pocock SJ. Clinical trials: a practical approach. Chichester: Wiley, 1983.

Perkin MR, Logan K, Tseng A, Raji B, Ayis S, Peacock J, et al. Randomized trial of introduction of allergenic foods in breast-fed infants. N Engl J Med 2016; 374:1733–43.

Ye C, Beyene J, Browne G, Thabane L. Estimating treatment effects in randomised controlled trials with non-compliance: a simulation study. BMJ Open 2014; 4:e005362.

Case–control studies

Observational studies

In observational studies, the subjects receive no additional intervention beyond what would normally constitute usual care. Subjects are therefore observed in their natural state.

Case–control study

  • This study investigates causes of disease, or factors associated with a condition

  • It starts with the disease (or condition) of interest and selects patients with that disease for inclusion, the ‘cases’

  • A comparison group without the disease is then selected, ‘controls’, and cases and controls are compared to identify possible causal factors

  • Case–control studies are usually retrospective in that the data relating to risk factors are collected after the disease has been identified. This has consequences, which are discussed later in this section

When to use a case–control design

  • To investigate risk factors for a rare disease where a prospective study would take too long to identify sufficient cases—for example, for Creutzfeldt–Jakob disease

  • To investigate an acute outbreak in order to identify causal factors quickly—for example, where an answer is needed about the causes of an outbreak of food poisoning, or an outbreak of Legionnaires’ disease

Choice of controls

As with intervention studies, the choice of controls affects the comparison that is made. Common choices include:

  • Patients in the same hospital but with unrelated diseases or conditions

  • Controls one-to-one matched to cases for key prognostic factors such as age and sex

  • A random sample of the population from which the cases come

Clearly the best control group is the third option, but this is rarely possible. For this reason, some case–control studies include more than one control group for robustness.

Matched controls

Matching is popular but needs to be carefully specified; for example, ‘age matched within 2 years’ gives the range within which matching can be made. It is not usually possible to match for many factors, as a suitable match may not exist. In a matched design, the statistical analysis should take account of the matching, and factors used for matching cannot themselves be investigated because of the design. Where one subject in a matched pair has missing data, both subjects are omitted from the statistical analysis.

Sample size for controls

It is common to choose the sample size so that there is the same number of cases as controls. For a given total sample size this gives the greatest statistical power, that is, the greatest possibility of detecting a true effect. If the number of available cases is limited, then it is possible to increase the power by choosing more controls than cases. However, the gain in power diminishes quickly so that it is rarely worth choosing more than three controls per case (Taylor 1986).
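The diminishing return from extra controls can be seen from the standard efficiency ratio c/(c + 1): the efficiency of a design with c controls per case, relative to the (unattainable) limit of infinitely many controls per case. A quick sketch:

```python
def relative_efficiency(c):
    """Efficiency of a case-control study with c controls per case,
    relative to the limit of infinitely many controls: c / (c + 1)."""
    return c / (c + 1)

efficiencies = {c: relative_efficiency(c) for c in (1, 2, 3, 4, 5)}
# 1:1 matching gives 0.5 and 3:1 gives 0.75; the gain beyond
# three controls per case is small, as Taylor (1986) discusses
```

Moving from one to two controls per case buys a third more efficiency, but moving from three to four buys very little, which is why more than three controls per case is rarely worthwhile.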

Collecting data on risk factors

Since case–control studies start with cases that already have the disease, data about their exposure to possible risk factors prior to diagnosis are collected retrospectively. This is both an advantage and a disadvantage. The advantage is that the exposure has already happened and so the data simply need to be collected; no follow-up period is needed. The disadvantage relates to the quality of the data. Data taken from clinical notes may contain errors that cannot be rectified or gaps that cannot be filled. Data obtained directly from subjects about their past are susceptible to recall bias because cases may have different recall of past events, usually better, than controls. For example, a case with a gastrointestinal condition may be more conscious of what they have eaten in the past than a healthy control, who may simply have forgotten.


Taylor JM. Choosing the number of controls in a matched case–control study: some sample size, power and efficiency considerations. Stat Med 1986; 5:29–36.

Case–control studies (continued)

Limitations of design

  • The choice of control group affects the comparisons between cases and controls

  • Data on exposure to risk factors are usually collected retrospectively and may be incomplete, inaccurate, or biased

  • If the process that leads to the identification of cases is related to a possible risk factor, interpretation of results will be difficult (‘ascertainment bias’). For example, suppose the cases are young women with high blood pressure recruited from a contraception clinic. In this situation, a possible risk factor, the oral contraceptive (OC) pill, is linked to the recruitment of cases and so OC use may be more common among cases than population controls for this reason alone.

  • Time-course relationships need careful interpretation since changes in biological quantities may precede the disease or result from the disease itself. For example, a raised serum troponin level is associated with myocardial infarction but is only raised after the event; a case–control study may therefore find that high troponin levels are associated with myocardial infarction even though raised troponin cannot be a risk factor for it

  • Risks associated with exposures cannot be estimated directly because the case and control groups are not representative samples of their respective target populations, so estimates of risk are biased. This has implications for the statistical analysis and the interpretation of results. Effects are therefore usually estimated using odds and ratios of odds, and these only approximate risks and ratios of risks when the disease under investigation is rare

  • This limitation can be overcome with certain designs, for example, where a case–control study is nested in a cohort study in which all cases and controls are identified prospectively and a truly random sample of controls is available (see Cohort studies, p. [link]). In this situation, the relative risk can be calculated directly

Example of a case–control study

A study investigated the association between genitourinary infections in the month before conception to the end of the first trimester, and gastroschisis (Feldkamp et al. 2008). The subjects were 505 babies with gastroschisis (the ‘cases’), and 4924 healthy liveborn infants as controls.

The study reported data (Table 2.1) showing a positive relationship between exposure to genitourinary infections and gastroschisis (odds ratio = 2.02; 95% CI 1.54, 2.63).

Table 2.1 Genitourinary infections in the month before conception to the end of the first trimester, and gastroschisis

Exposed to infection?    Cases (gastroschisis)    Controls
Yes                      81/505 (16%)             425/4924 (9%)
No                       424/505 (84%)            4499/4924 (91%)

Source: data from Feldkamp ML, Reefhuis J, Kucik J, Krikov S, Wilson A, Moore CA, et al. Case–control study of self-reported genitourinary infections and risk of gastroschisis: findings from the national birth defects prevention study, 1997–2003. BMJ 2008; 336(7658):1420–3.
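The unadjusted odds ratio can be reproduced from the counts in Table 2.1; the simple Woolf (log-based) confidence interval below is a crude version of the published interval, which was adjusted for other factors:

```python
import math

# 2x2 counts from Table 2.1 (cases = gastroschisis)
a, b = 81, 425      # exposed: cases, controls
c, d = 424, 4499    # unexposed: cases, controls

# Odds ratio = cross-product ratio
odds_ratio = (a * d) / (b * c)

# Woolf 95% confidence interval on the log-odds scale
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lower = math.exp(math.log(odds_ratio) - 1.96 * se)
upper = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
# OR = 2.02
```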


Feldkamp ML, Reefhuis J, Kucik J, Krikov S, Wilson A, Moore CA, et al. Case–control study of self-reported genitourinary infections and risk of gastroschisis: findings from the national birth defects prevention study, 1997–2003. BMJ 2008; 336:1420–3.

Cohort studies


A cohort study is an observational study that aims to investigate causes of disease or factors related to a condition but, unlike a case–control study, it is longitudinal and starts with an unselected group of individuals who are followed up for a set period of time. Cohort studies are sometimes used to confirm the findings of case–control studies, such as happened when Doll and Hill (1950) observed a relationship between smoking and lung cancer in a case–control study and subsequently established the longitudinal study of doctors in the UK (Doll et al. 2004).

Design of a cohort study

  • This starts with an unselected group of ‘healthy’ individuals

  • The subjects are followed up to monitor the disease or condition of interest and potential risk factors

  • The length of follow-up is chosen to allow sufficient subjects to develop the disease and the risk factors to be explored

  • In the simplest case, where there is a single risk factor that is either present or absent, the incidence of disease can be related directly to the presence of the risk factor

  • It is usually prospective, with the risk factor data being recorded before the disease is confirmed

  • It can be retrospective but requires that full risk factor data are obtained on all individuals with and without the disease of interest using data that were recorded prospectively
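With a single binary risk factor, the risks and their ratio follow directly from the incidence in each exposure group. A sketch with hypothetical counts:

```python
# Direct risk estimation in a cohort study (counts hypothetical).
exposed_cases, exposed_total = 30, 1000
unexposed_cases, unexposed_total = 10, 1000

risk_exposed = exposed_cases / exposed_total        # incidence if exposed
risk_unexposed = unexposed_cases / unexposed_total  # incidence if unexposed

# Relative risk, computed via the cross-product to keep the
# arithmetic exact in floating point
relative_risk = (exposed_cases * unexposed_total) / (exposed_total * unexposed_cases)
print(relative_risk)  # 3.0
```

This direct calculation is what a case–control study cannot do, because its sampling fixes the ratio of cases to controls rather than the incidence.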

When to use a cohort study design

  • When precise estimates of risk associated with particular factors are required, for example, when a case–control study has established that an association exists but is unable to provide estimates of the risk

  • When information on past risk factors in individuals with disease is unavailable or too unreliable to use

  • When the time-course of a risk factor is of interest, for example, with smoking, where cohort studies have been able to demonstrate the cumulative adverse effects of long-term smoking and the potential benefits of quitting after smoking for different lengths of time (Doll et al. 2004)

  • When resources and time are sufficient to support a lengthy study

Difficulties with cohort studies

  • A large number of subjects are needed to obtain enough individuals who get the disease or condition, particularly if it is uncommon

  • The length of follow-up needed to accrue enough diseased individuals may be substantial, so a cohort study is not feasible for rare diseases

  • There is difficulty in maintaining contact with subjects, particularly if the follow-up is lengthy

  • The resources required may be very high

Example of a cohort study: body mass index and all-cause mortality

A cohort study examined the relationship between body mass index (BMI) and all-cause mortality in 527,265 US men and women in the National Institutes of Health–AARP cohort who were 50–71 years old at enrolment in 1995–1996 (Adams et al. 2006). BMI was calculated from self-reported weight and height.

The study found that among those who had never smoked, excess body weight during midlife was associated with a higher risk of death. Table 2.2 shows the results for men who had never smoked.

Table 2.2 Relative risk of death in men aged 50–71 at enrolment by BMI

BMI at age 50

Relative risk
All relative risks were adjusted for confounding factors. The reference category for BMI is shown in bold.

Source: data from Adams KF, Schatzkin A, Harris TB, Kipnis V, Mouw T, Ballard-Barbash R, Hollenbeck A, Leitzmann MF. Overweight, obesity, and mortality in a large prospective cohort of persons 50 to 71 years old. N Engl J Med 2006; 355:763–78.


Adams KF, Schatzkin A, Harris TB, Kipnis V, Mouw T, Ballard-Barbash R, et al. Overweight, obesity, and mortality in a large prospective cohort of persons 50 to 71 years old. N Engl J Med 2006; 355:763–78.

Doll R, Hill AB. Smoking and carcinoma of the lung; preliminary report. Br Med J 1950; 2:739–48.

Doll R, Peto R, Boreham J, Sutherland I. Mortality in relation to smoking: 50 years' observations on male British doctors. BMJ 2004; 328:1519.

Cohort studies (continued)

Mixed designs

Larger programmes of study may involve a mixture of designs such as cohort and case–control, a cross-sectional study being extended to become a cohort study, and so on. Trial populations may be followed up after the trial part has ended, simply as a cohort of like individuals.

Cohort study with a nested case–control study

In a cohort study, it may be worthwhile to identify all individuals with a disease and then retrospectively select a sample of the non-diseased individuals for comparison. This design may be desirable if:

  • The resource implications of collecting data on all non-diseased individuals are too high

  • All information was available but unprocessed

  • Biological samples were collected but not analysed

This study is known as a nested case–control study and provides an efficient way of investigating particular factors once the outcomes from the cohort have been established.

Bias in risk factor data

  • In a nested case–control study such as this, the risk factor data should be less biased than in a conventional case–control study, since they were collected prospectively

  • There is a potential problem if there is differential loss to follow-up as this would reduce the availability of true controls and bias the comparisons

Example: cohort study

UK National Child Development Study (NCDS)

  • All babies born 3–9 March 1958 in Great Britain were studied to investigate and document perinatal mortality

  • The subjects were followed into childhood and further assessments made at ages 7, 11, 16, 23, 33, 41–42, 44–46, and 49–50 years

  • The study aims broadened over the years to monitor physical, educational, social, and economic development in the subjects

  • The recent sweeps have obtained measures of ill health and biomedical risk factors to address a range of hypotheses

  • Data are available from the UK Data Archive

  • While follow-up has been careful, the reduction in numbers at each sweep can be seen in Table 2.3

Table 2.3 Numbers of subjects at different follow-ups in the NCDS (longitudinal achieved sample)

Prognostic studies


Prognostic studies aim to investigate the relationship between patient outcomes and potential predictive biomarkers. Prognostic research includes three broad types of study (Hemingway et al. 2013):

  • Studies that identify single biomarkers associated with outcome

  • Studies that develop statistical models to predict future outcome based on known biomarkers

  • Studies that identify biomarkers that predict how patients respond to specific treatments

Sound prognostic research is critical for evidence-based medical practice and has attracted considerable methodological attention in recent years. We give some very general points about prognostic studies and an example of a prognostic model. For fuller details and guidance, see the PROGRESS Partnership website and its publications.

Some key points in conducting and reviewing prognostic studies

  • Is there a clear protocol that sets out beforehand the research questions, methods, and analyses to be done?

  • Does this study identify new biomarkers that may need a confirmatory study or confirm the relevance of previously reported ones?

  • Is the population studied clearly described?

  • If a prognostic model is developed, has it been validated in a separate sample? Is there external validation?

  • Are the design and statistics sound?
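One routine check of a prognostic model’s discrimination in a validation sample is the c-statistic: the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it. A minimal sketch on hypothetical model predictions:

```python
from itertools import product

# Hypothetical predicted risks from a prognostic model in a
# validation sample, split by whether the outcome occurred
risks_with_outcome = [0.8, 0.6, 0.7, 0.4]
risks_without_outcome = [0.2, 0.5, 0.3, 0.6]

# c-statistic: proportion of with/without pairs ranked correctly,
# counting tied predictions as half
concordant = 0.0
for r1, r0 in product(risks_with_outcome, risks_without_outcome):
    if r1 > r0:
        concordant += 1
    elif r1 == r0:
        concordant += 0.5
c_statistic = concordant / (len(risks_with_outcome) * len(risks_without_outcome))
print(c_statistic)  # 0.84375 (0.5 = no discrimination, 1 = perfect)
```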


See Box 2.6 (Bernal et al. 2016).

Source: data from Bernal W, Wang Y, Maggs J, Willars C, Sizer E, Auzinger G, et al. Development and validation of a dynamic outcome prediction model for paracetamol-induced acute liver failure: a cohort study. Lancet Gastroenterol Hepatol 2016; 1:217–25.

Further reading

See publications on the PROGRESS Partnership website.


Bernal W, Wang Y, Maggs J, Willars C, Sizer E, Auzinger G, et al. Development and validation of a dynamic outcome prediction model for paracetamol-induced acute liver failure: a cohort study. Lancet Gastroenterol Hepatol 2016; 1:217–25.

Hemingway H, Croft P, Perel P, Hayden JA, Abrams K, Timmis A, et al. Prognosis research strategy (PROGRESS) 1: a framework for researching clinical outcomes. BMJ 2013; 346:e5595.

Cross-sectional studies


In a cross-sectional study, a sample is chosen and data on each individual are collected at one point in time. Note that this may not be exactly the same time point for each subject. For example, a survey of primary care consultations may be conducted over a week: each patient completes the survey once, but different subjects complete it on different days depending on when they attend the surgery.

When to use a cross-sectional study

  • Surveys of prevalence, such as a survey to ascertain the prevalence of asthma

  • Surveys of attitudes or views, such as studies of patient satisfaction or patient/professional knowledge; or studies of behaviour, such as alcohol use and sexual behaviour

  • When inter-relationships between variables are of interest, for example, in a study to determine the characteristics of heavy drinkers, a cross-sectional study allows comparisons by sex, age, and so on

Cautions in interpreting cross-sectional study data

Temporal effects

Since the data on each individual are collected at one time point, care is needed in inferring temporal effects unless the exposure is constant, such as with a congenital or genetic factor (e.g. blood groups). For example, if a relationship is observed between a disease and blood group then we can safely assume that this is a true association since the blood group of the subjects would not be changed by the disease process. The same could not be assumed if a cross-sectional study showed an association between a disease and blood pressure since the disease might have led to the rise in blood pressure rather than the other way around.

Repeated cross-sectional studies

Sometimes cross-sectional studies are repeated at different times and/or in different places to look at the variability in findings. For example, many cross-sectional studies have estimated the prevalence of asthma in schoolchildren. Comparisons of prevalence in different places are straightforward, but comparisons at different times are less so because each survey is likely to have included a slightly different sample of children, so interpretation of changes over time must be made cautiously.

Cross-sectional studies that appear to be longitudinal

Cross-sectional studies can be misinterpreted as if they were longitudinal studies. For example, consider a cross-sectional study of a sample of fetuses whose gestational ages span a range, say 22–28 weeks. Some researchers have used such data to estimate growth trends. This is dubious because each fetus is measured just once, so the trend is estimated from different fetuses, and differences between fetuses are likely to contribute to some of the differences observed by gestational age.

Example: cross-sectional study

A study investigated differences in cardiovascular risk in British South Asian and in British white children in ten towns (Whincup et al. 2002). The study included 73 South Asian and 1287 white children and measured fasting glucose levels as a measure of insulin resistance, plus a number of other markers of cardiovascular risk. Each child was assessed just once and so this is a cross-sectional study.


Whincup PH, Gilg JA, Papacosta O, Seymour C, Miller GJ, Alberti KG, et al. Early evidence of ethnic differences in cardiovascular risk: cross sectional comparison of British South Asian and white children. BMJ 2002; 324:635.

Case study and series

Differences in aims

A case study or case report is like a case series but it includes only one individual:

  • The aim is to describe a single and unusual incident or case

A case series is a descriptive study involving a group of patients who all have the same disease or condition:

  • The aim is to describe common and differing characteristics of a particular group of individuals


For both a case study and a case series:

  • The aim is not to draw general conclusions

  • It is not a true research study

  • It may provide useful indications for further research

Example: a case study

An article published in The Lancet described the case of an 80-year-old woman who presented with episodes of unconsciousness and disorientation over several years (Wiesli et al. 2002). During a subsequent episode she was found to have a blood glucose level of 1.5 mmol/L (normal range 3.5–5.5 mmol/L fasting). Routine blood tests were normal and a 72-hour fast produced no symptoms of hypoglycaemia (low blood sugar).

Further investigations led to the discovery of an insulin-secreting tumour in the body of the pancreas. The tumour was producing excess insulin in response to glucose, therefore causing glucose-induced hypoglycaemia.

Example: a case series

An article published in Brain described a series of patients with pneumococcal meningitis (Kastenbauer and Pfister 2003). The paper reported the symptoms, complications, and outcome in 87 consecutive meningitis patients seen in a particular neurology department. The authors stated that their analysis can help doctors identify prognostic factors in patients, and can guide the design of future research studies.


Kastenbauer S, Pfister HW. Pneumococcal meningitis in adults: spectrum of complications and prognostic factors in a series of 87 cases. Brain 2003; 126:1015–25.

Wiesli P, Spinas GA, Pfammatter T, Krahenbuhl L, Schmid C. Glucose-induced hypoglycaemia. Lancet 2002; 360:1476.

Deducing causal effects

Association and causation

Observational studies frequently reveal associations. In interpreting such associations, it is important to consider whether they are likely to represent true causal effects.

  • Causal effects can only be firmly concluded from randomized controlled trials (RCTs). In other words, it is only when a study has randomized subjects to treatments that researchers can deduce that differences observed between treatment groups are due to the treatment alone

  • Observational studies often reveal relationships between a disease and a risk factor. However, we cannot be sure that the risk factor caused the disease. It may be that another factor that was related to both the disease and the risk factor was in fact the causal factor, and that the relationship observed was due to confounding

  • Cigarette smoking is a common confounder since the characteristics of smokers and non-smokers differ in many ways, some of which may be related to disease simply because of their association with smoking. In such cases, when smoking is controlled for in the analysis, the associations diminish or disappear

Example of confounding in an observational study

A study of factors affecting birthweight observed that on average pregnant women with low blood folate levels had smaller babies. The data were analysed further and showed no evidence for this relationship in women who were non-smokers although the relationship was seen in the women who smoked. It was further discovered that women who smoked had lower mean folate levels than women who did not smoke.

Further multifactorial analysis was conducted and the effect of folate on birthweight became non-significant after controlling for smoking whereas the effect of smoking remained significant after adjusting for folate level.

It was concluded that the ‘folate effect’ observed was simply due to smoking. In other words, women with low folate levels had smaller babies because of their smoking and not because of their folate levels: the apparent folate effect was the result of confounding by smoking, not a direct causal effect.
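The folate example can be mimicked numerically: because folate status and smoking are associated, a crude comparison shows a ‘folate effect’ even when birthweight depends only on smoking. All numbers below are hypothetical:

```python
# Hypothetical illustration of confounding by smoking.
# Mean birthweight (g) depends only on smoking status:
bw_smoker, bw_nonsmoker = 3000, 3300

# ...but smoking is associated with folate: per 10 women, suppose
# 8 smokers in the low-folate group and 2 in the normal-folate group
low_folate_mean = (8 * bw_smoker + 2 * bw_nonsmoker) / 10      # 3060.0
normal_folate_mean = (2 * bw_smoker + 8 * bw_nonsmoker) / 10   # 3240.0

# Crude comparison suggests a 180 g 'folate effect', even though
# within each smoking stratum the folate effect is exactly zero
crude_diff = normal_folate_mean - low_folate_mean
print(crude_diff)  # 180.0
```

Comparing within smoking strata (or adjusting for smoking in a regression model) removes the spurious difference entirely.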

Controlling for confounding

This can be done using multifactorial statistical analyses. Further details of the methods that may be used are given in Chapter 12 (multiple regression, logistic regression, and propensity scores). More complex multifactorial modelling methods, such as structural equation modelling, fall under the broad term causal inference. Details of this broad and expanding area are beyond the scope of this book.

The Bradford Hill criteria for causation

The British medical statistician, Austin Bradford Hill, published a set of criteria for causation (Hill 1965). The criteria are conditions which, if fulfilled, allow causation to be more confidently inferred from an observational study. They are:

  • Strength of association

  • Consistency in different studies, settings, etc.

  • Specificity of association of risk factor with a particular disease

  • Temporal relationship—exposure precedes disease

  • Dose–response relationship

  • Biological plausibility for causality

  • Coherence—association is consistent with current knowledge

  • Experimental evidence for causality

  • Existence of analogous evidence between a similar exposure and disease

Mendelian randomization

Mendelian randomization can potentially provide stronger evidence for causality in observational studies than direct adjustment for confounding alone. It is useful because the determination of a particular genotype in reproduction is effectively randomized. Mendelian randomization combines this with the knowledge that certain risk factor exposures are associated with certain genotypes to provide a pseudo-randomized setting. Whereas exposure may change over time, perhaps in direct response to disease, the genotype does not change. Hence, in some situations genotype can be used as a proxy for exposure when adjusting for confounding. When this is possible, causality can be attributed with greater certainty than by adjusting for the measured exposure itself. See Davey Smith and Ebrahim (2003, 2008) for further details.

Further reading

Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol 2003; 32:1–22.

Davey Smith G, Ebrahim S. Mendelian randomization: genetic variants as instruments for strengthening causal inference in observational studies. In: Vaupel JW, Weinstein M, Wachter KW (eds), Biosocial surveys, pp. 366–86. Washington, DC: National Academies Press, 2008.

Hill AB. The environment and disease: association or causation? Proc R Soc Med 1965; 58:295–300.

Quality improvement


Quality improvement (QI) processes in healthcare have been defined as ‘systematic, data-guided activities designed to bring about immediate improvements in health care delivery in particular settings’ (Lynn et al. 2007). Clinical audit is a QI process used to monitor and improve patient care. It always begins with an evaluation of current practice before any needed change is introduced. Doctors routinely contribute to clinical audits and undertake QI projects, and participation in these is a fundamental part of medical training.

Quality improvement projects

QI projects are not research projects but still need to be properly designed and conducted for their findings to be valid. Specifically, they need to consider:

  • What is the topic for the project and who are the patients and/or events being studied?

  • How many subjects or events are needed to reach firm conclusions?

  • What data should be collected and in what format to ensure it is representative and reliable?

  • How should the data be analysed and presented?

National quality improvement audit projects

There are many national audits of specific conditions that seek to monitor activities and practice against known best-evidence standards. For example, the Sentinel Stroke National Audit Programme administered by King’s College London measures the quality of care received by stroke patients in England, Wales, and Northern Ireland.

Clinical audit

Clinical audit is ‘a quality improvement process that seeks to improve patient care and outcomes through systematic review of care against explicit criteria and the implementation of change. Aspects of the structures, processes and outcomes of care are selected and systematically evaluated against explicit criteria. Where indicated, changes are implemented at an individual, team, or service level and further monitoring is used to confirm improvement in healthcare delivery.’ (Lynn et al. 2007)

Audit cycle

The aim of audit is to monitor clinical practice against agreed best practice standards and to remedy problems. Where problems in practice are identified, attempts are made to resolve these and then clinical practice is re-audited against the agreed standards—this is the audit cycle (Figure 2.3).

Further reading

Royal College of Paediatrics and Child Health. QI Central.


King’s College London. Sentinel Stroke National Audit Programme (SSNAP).

Lynn J, Baily MA, Bottrell M, Jennings B, Levine RJ, Davidoff F, et al. The ethics of using quality improvement methods in health care. Ann Intern Med 2007; 146:666–73.

Designing a clinical audit

Choosing a suitable topic

Audits are designed to monitor and improve clinical practice. The choice of topic is guided by indications of areas where improvement is needed in addition to local and national requirements. The following criteria help guide the choice of topics in general.

Possible topics

  • Areas where a problem has been identified (e.g. an infection outbreak)

  • High-volume practice (e.g. prescribing antibiotics in general practice)

  • High-risk practice (e.g. major surgery)

  • High cost (e.g. in vitro fertilization)

  • Areas of clinical practice where guidelines or firm evidence exists (e.g. National Institute for Health and Care Excellence (NICE) guidelines or government targets)

Aims of audit

  • This defines the overall purpose and can be a question or statement

  • The focus is on improvement in clinical practice

  • The organization carrying out the audit should have the ability to make changes based on the findings. For example, there would be no point in a hospital auditing the number of referrals received from GPs unless it could influence the practice of the referring GPs

Determining the standard

  • This is the best currently available clinical practice based on best evidence

  • It must be measurable

Data collection: retrospective

  • Can be used to investigate acute events

  • Useful when resources—time, cost, and human resources—are limited

  • Tends to use routine data, thus may provide limited information

Data collection: prospective

  • Provides current data

  • Allows a choice of data to be collected

  • Requires forward planning

  • Has resource implications—time, cost, and human resources

Census or sample?

  • A census is needed if outcome is critical (e.g. death rates after surgery)

  • A sample is okay if a snapshot will suffice

  • A sample may be dependent on a fixed number or a length of time

  • The sample size needs to be large enough to provide robust information for the key aims of the audit; use standard sample size calculations to ensure this (see Choosing a sample size, p. [link])

  • The sampling strategy needs to give a sample representative of the target population (see Sampling strategies, p. [link])

  • Beware of seasonal effects when choosing a sample

  • Use random samples if possible, or representative consecutive samples
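Where the audit will estimate a proportion, the usual normal-approximation formula n = z²p(1 − p)/d² gives the sample size needed to estimate p to within ± d. A sketch, with hypothetical target values:

```python
import math

# Sample size to estimate a proportion p to within +/- d with 95%
# confidence, using the normal approximation (values hypothetical)
def n_for_proportion(p, d, z=1.96):
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

# Worst case p = 0.5, precision +/- 5 percentage points
print(n_for_proportion(0.5, 0.05))  # 385
```

Taking p = 0.5 is the conservative choice, since p(1 − p) is largest there; a rough prior estimate of p gives a smaller requirement.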

Potential problems with audits

  • Doctors may feel pressured to do statistical testing when descriptive results would suffice

  • Small audit samples give non-significant P values that may be wrongly interpreted as indicating ‘no difference’ or ‘no change’

  • Data collected are not generalizable outside the audit setting due to the sampling method and/or the patient group/clinical setting

Further help

Most hospitals have clinical audit (or QI) departments, which can provide support for clinicians designing and conducting clinical audits.

Further reading

The Healthcare Quality Improvement Partnership website has many useful resources, including booklets that can be downloaded.

Data collection in audit

Data forms

  • Consider how the data will be analysed when designing the form

  • Design the form in advance—standard forms or example forms may be available

  • If audit is new to you, discuss the draft form with an experienced colleague

  • Pilot the data collection on a few cases to check for feasibility and usability of the form, and so on

Outcomes measured

These may take one of several forms:

  • A direct outcome (e.g. death, infection, or re-admission)

  • A process (e.g. whether or not cholesterol was measured in patients admitted with cardiovascular disease)

  • A surrogate outcome (e.g. spirometry as a measure of lung function)

Data analysis

In general, the same methods of statistical analysis are used for audit as for research, although complicated statistical methods may not be needed. In particular:

  • Simple descriptive analyses may be sufficient to answer audit questions

  • Summary statistics should always be calculated first, such as percentages for frequencies, and mean, standard deviation, median, and range for continuous data

  • Graphical display may be helpful

  • Where the size of an estimate is critical, it should be accompanied by a 95% confidence interval to show how precise it is (see 95% confidence interval for a proportion, p. [link])

  • Comparisons of proportions or means can be done using standard significance tests, as described later in this book (see Tests of statistical significance, p. [link])
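For a proportion, the normal-approximation 95% confidence interval is p ± 1.96√(p(1 − p)/n). A sketch with hypothetical audit counts:

```python
import math

# Normal-approximation 95% CI for a proportion (counts hypothetical)
def proportion_ci(successes, n, z=1.96):
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# e.g. 40 of 200 audited patients did not meet the standard
p, lower, upper = proportion_ci(40, 200)
print(f"{p:.2f} (95% CI {lower:.3f} to {upper:.3f})")
```

For the hypothetical counts above this gives 0.20 with a 95% CI of roughly 0.145 to 0.255; more exact intervals (e.g. Wilson) are preferable when counts are small.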

Examples of audit topics

  • Are all hospital patients seen by a doctor every day?

  • How many inpatients have acquired meticillin-resistant Staphylococcus aureus (MRSA) in hospital?

  • Is there adherence to antibiotic protocols?

  • What proportion of patients in an emergency department stay longer than 4 hours?

Research versus audit


The main difference between research and audit is in the aim of the study. A clinical research study aims to determine what practice is best, whereas an audit checks to see that best practice is being followed. In this way audit and research may follow each other in a cycle whereby research leads to new best practice which needs to be audited and audits lead to new questions which require investigating in research studies.

Common features of research and audit

  • Both address a particular question related to best clinical practice

  • Both consider and collect the appropriate data required to fulfil the aims of the study

  • Both usually involve samples and a determination of the appropriate type and size of sample

  • Both require data checking and data analysis

  • Both require scientific rigour appropriate to the aims of the study

Grey areas

It is difficult to classify some studies as either wholly audit or wholly research. It is best to get local advice in such situations. Examples include:

  • Patient surveys that seek views and attitudes about clinical practice

  • ‘Service evaluations’ of a modified or new service that seek to determine whether it is effective

Data collection: sources of data

New data

This is when data collection is designed specifically for the study and the data are newly collected.

Advantages


  • Researcher has control over what data are collected (i.e. fit for purpose)

  • Current

Disadvantages

  • Cost

  • Time to collect and process

  • Possibility of unknown quantity of missing data due to refused participation, subjects lost, and so on

Routine data

This refers to data collected for another purpose, often unrelated to research, such as monitoring.

Advantages


  • Relatively quick to obtain, particularly if computerized

  • May be already processed and/or computerized

  • Usually much lower cost than primary data collection

Disadvantages

  • No control over data available

  • Limited control over missing data and ability to fill gaps and resolve queries

  • Data may not be in required format

Patient notes

These may be in hand-written or computerized format.

Advantages


  • Relatively quick to obtain

  • Usually much lower cost than primary data collection

Disadvantages

  • No control over data available

  • Limited control over missing data, missing records, and ability to fill gaps and resolve queries

  • Hand-written notes may be unformatted, difficult to search, and hard to read

Secondary data

These are data collected and recorded for another research study, which are available for use.

Advantages

  • Relatively quick to obtain

  • Usually already processed so that minimal checking and data cleaning is required

  • Usually much lower cost than primary data collection

Disadvantages

  • No control over data available

  • Limited control over missing data and ability to fill gaps and resolve queries

  • Data may not be in required or desirable format

  • May be out of date


Example: using routine and secondary data

A study investigated the association between deprivation and use of the emergency ambulance service across England. Deprivation scores for each district in the country were obtained from the Office for National Statistics. The number of ‘999’ calls to each ambulance service over the course of a given year were obtained from the Department of Health. Information on which districts were covered by each ambulance service in England was obtained from individual ambulance services. These data were used to investigate the relationship between deprivation and ambulance service usage. No new data were collected for the study (Peacock and Peacock 2006).


Peacock PJ, Peacock JL. Emergency call work-load, deprivation and population density: an investigation into ambulance services across England. J Public Health (Oxf) 2006; 28:111–15.



Registers

Registers are databases of observational patient data that allow patients’ clinical care and subsequent outcomes to be followed over time. They tend to include patients with a specific condition (e.g. cancer). Other registers are set up to evaluate interventions, as in the ‘Commissioning through Evaluation’ (CtE) programme established by NHS England to evaluate new or untested interventions in a limited set of patients.


Advantages

  • Large longitudinal data sets are available that reflect clinical practice

  • May use routine electronic health records to populate register (therefore, data are more easily obtainable at lower cost)

  • Can explore effects of exposures/behaviours and treatments on multiple outcomes at different time points and in different subgroups

Disadvantages

  • Difficult to obtain complete follow-up on all patients

  • May be difficult to obtain missing data retrospectively and/or resolve queries

  • Lack of comparator group may be a problem if evaluating an intervention


Estimating intervention effects

One of the main difficulties with the use of registers or other observational data to evaluate interventions is in identifying a suitable control or comparator group. If a comparator is not available within the register then historical data may be used or comparisons of outcomes may be the best that can be done.

Even if a comparator group is available, comparator patients may differ from intervention patients in demographic characteristics and clinical features such that a simple comparison of outcomes is likely to be biased. In this case, differences need to be accounted for in the analysis using methods as outlined in Research design Deducing causal effects, p. [link], but be aware that residual confounding may remain.

Real-world data (‘big data’)

The terms real-world data or big data are used in health research to describe patient data that are collected in routine clinical practice. Real-world data can be used to estimate treatment effects, explore prognostic factors, and identify new biomarkers. Real-world data are more readily available and far less expensive than equivalent data from RCTs or new registers or cohorts. It therefore makes sense to try to use real-world data in situations where no evidence exists and/or a randomized trial would not be feasible or ethical.

The use of real-world data requires robust statistical thinking and expert statistical input to ensure that meaningful, unbiased results are obtained. This is a growing area worldwide—see ‘Further reading’ for examples from the UK, the USA, and China of the huge drive to use the data we have to answer important questions.

Further reading

Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 2016; 375:1216–19.

UK Department for Business, Energy & Industrial Strategy. Industrial strategy: building a Britain fit for the future. 2017.

Zhang L, Wang H, Li Q, Zhao MH, Zhan QM. Big data and medical research in China. BMJ 2018; 360:j5910.

Data collection: outcomes

General principles

  • In an intervention study, the main or primary outcome is critical as it is used to determine the efficacy of the treatment under investigation

  • In most trials only one primary outcome is chosen and other important outcomes are regarded as secondary

  • Sample size calculations use the primary outcome to ensure the study is big enough to detect a clinically important difference

  • Choice of a single outcome is not always straightforward because a similar outcome may be measurable in more than one way, for example, using capillary blood glucose readings compared with glycated haemoglobin (HbA1c)

Composite outcomes

In some situations there are multiple ways of assessing a trial outcome, for example, in trials in cardiology where possible outcomes include subsequent cardiac event, hospitalization, and death. In such cases researchers may choose a primary outcome which is a composite of two or more outcomes, such that the composite outcome is positive if one or more of the component outcomes have happened. Many composite outcomes include ‘death’ as one of the possible events.


Composite outcomes have several advantages:

  • They allow several outcomes to be combined in settings where different outcomes are of similar importance but reflect different clinical events. For example, in a trial of treatment for gestational diabetes, the primary outcome was a composite measure of serious perinatal complications, defined as one or more of fetal death, shoulder dystocia, bone fracture, and nerve palsy (Crowther et al. 2005)

  • The main advantage of using a composite outcome is the gain in statistical power: where individual events are uncommon, a large sample is required to demonstrate conclusive differences. Using a composite increases the event rate and allows trials to recruit a smaller sample
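This power gain can be sketched with the standard two-proportion sample-size formula (90% power, 5% two-sided significance). All the numbers below are hypothetical illustrations, not figures from the text: two component outcomes with control-arm rates of 2% and 3%, assumed independent, with the treatment cutting each rate by a quarter.

```python
import math
from statistics import NormalDist

nd = NormalDist()
z_alpha = nd.inv_cdf(1 - 0.05 / 2)  # two-sided 5% significance
z_beta = nd.inv_cdf(0.90)           # 90% power

def n_per_group(p1, p2):
    """Sample size per group to compare two proportions (standard formula)."""
    return math.ceil((z_alpha + z_beta) ** 2
                     * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

# Hypothetical control-arm event rates for two component outcomes
p_a, p_b = 0.02, 0.03
# Composite event rate, assuming independent components for illustration
p_composite = 1 - (1 - p_a) * (1 - p_b)

rr = 0.75  # treatment assumed to reduce every event rate by a quarter
n_single = n_per_group(p_a, p_a * rr)
n_composite = n_per_group(p_composite, p_composite * rr)
assert n_composite < n_single  # the composite needs far fewer subjects
```

Because the composite roughly doubles the event rate here, the required sample size falls by well over half.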


There are some difficulties with the choice and use of composite outcomes:

  • It may be hard to determine the minimum clinically important difference for the composite; this requires an estimate of the incidence of the composite itself, and not just the incidence of the individual components, as well as clinical judgement about what constitutes an important change in rate

  • The interpretation of results may be difficult—it is important that the separate component effect sizes are each reported as well as the combined effect size, to allow clinical interpretation

  • If the effect sizes (e.g. relative risks) vary among the components then overall interpretation of the findings is difficult, for example, if a new treatment reduces subsequent adverse events but increases death rates (Freemantle et al. 2003; Montori et al. 2005; Freemantle and Calvert 2007; Ross 2007)

Surrogate outcomes

In studies where the outcome of interest is very rare or requires a long follow-up period to observe, a surrogate outcome is often used to increase statistical power and efficiency. Surrogate outcomes should be chosen and used with care:

  • A surrogate outcome, such as a biomarker or process variable, should be closely related to the clinical outcome of interest

  • Examples include CD4 count for acquired immune deficiency syndrome (AIDS) morbidity and mortality, cholesterol level for cardiovascular disease, and length of stay for hospital-based treatments

  • Where surrogate outcomes are only weakly associated with the clinical outcome of interest, the benefit in using them is offset by the difficulty in interpreting the results


Crowther CA, Hiller JE, Moss JR, McPhee AJ, Jeffries WS, Robinson JS. Effect of treatment of gestational diabetes mellitus on pregnancy outcomes. N Engl J Med 2005; 352:2477–86.

Freemantle N, Calvert M, Wood J, Eastaugh J, Griffin C. Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA 2003; 289:2554–9.

Freemantle N, Calvert M. Composite and surrogate outcomes in randomised controlled trials. BMJ 2007; 334:756–7.

Montori VM, Permanyer-Miralda G, Ferreira-Gonzalez I, Busse JW, Pacheco-Huergo V, Bryant D, et al. Validity of composite end points in clinical trials. BMJ 2005; 330:594–6.

Ross S. Composite outcomes in randomized clinical trials: arguments for and against. Am J Obstet Gynecol 2007; 196:119e1–6.

Dichotomization of outcomes: P values


In clinical medicine and in medical research, it is fairly common to categorize a biological measure into two groups, either to aid diagnosis or to classify an outcome. For example, blood cholesterol level is measured in millimoles per litre (mmol/L) but may be classified into two groups defined as less than or equal to 5.8 mmol/L (‘normal’) or greater than 5.8 mmol/L (‘high’). It is often useful to categorize a measurement in this way to guide decision-making and/or to summarize the data, but doing so discards information, which in turn has statistical consequences.

Example: what happens when we categorize data

Suppose in a study of infants their birthweights are recorded. Suppose then that the birthweight data, which are continuous, are categorized as ‘low birthweight’ (<2500g) or ‘normal birthweight’ (≥2500g). This means that each birthweight value is effectively replaced by a 0 or 1 (Table 2.4) and much data are discarded.

Effects of categorization on statistical significance

  • Categorizing continuous data into two groups discards much data

  • For statistical tests, the P value will be larger than if we had analysed the data as a continuous variable

  • Thus statistical tests are less likely to find a significant difference (Table 2.5)

Table 2.4 Part of a dataset showing birthweight in grams and birthweight dichotomized as low birthweight yes/no

Subject no.    Birthweight (g)    Low birthweight (<2500: no = 0, yes = 1)

…              …                  …

Table 2.5 Mean birthweight (BW) and the percentage of low birthweight (LBW) babies (BW <2500 g) by the mothers’ smoking status during pregnancy

                    Non-smokers    Smokers
                    n = 156        n = 114       P value

BW mean (SD) (g)    3360 (535)     3192 (483)    0.008

LBW % (n)           4.5% (7)       7.0% (8)      0.370

Example: effects of categorization on statistical significance

  • Using mean birthweight (i.e. a continuous variable), the difference between non-smokers and smokers is significant with P = 0.008

  • Using birthweight in two groups, low birthweight and normal birthweight, the difference between non-smokers and smokers is not significant with P = 0.370

  • In the same dataset, categorization of birthweight into two groups has discarded information and gives a less significant (bigger) P value

  • Hence, when data are categorized there is less statistical power to detect a difference (see Sample size for comparative studies, p. [link])
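The two P values above can be reproduced approximately from the summary figures in Table 2.5 using large-sample z-tests. This is a sketch: the chapter’s P = 0.008 for the means will have come from a t-test, which gives a very similar value here.

```python
import math

def two_sided_p(z):
    """Two-sided P value from a standard Normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2))

# Summary figures from Table 2.5 (non-smokers vs smokers)
n1, mean1, sd1 = 156, 3360, 535
n2, mean2, sd2 = 114, 3192, 483
lbw1, lbw2 = 7, 8  # numbers of low birthweight babies in each group

# Continuous outcome: z-test for the difference in mean birthweight
se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
p_continuous = two_sided_p((mean1 - mean2) / se_diff)

# Dichotomized outcome: z-test for the difference in % LBW
prop1, prop2 = lbw1 / n1, lbw2 / n2
pooled = (lbw1 + lbw2) / (n1 + n2)
se_props = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
p_binary = two_sided_p((prop1 - prop2) / se_props)

# Same dataset: the dichotomized analysis gives a far larger P value
assert p_continuous < 0.01 and p_binary > 0.3
```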

Dichotomization of outcomes: sample size

Effects of categorization on sample size

  • Categorizing continuous data into two groups discards much data

  • If a continuous variable is used for analysis in a research study, a substantially smaller sample size will be needed than if the same variable is categorized into two groups

Example: effects of categorization on sample size

See Sample size for comparative studies, p. [link].

Table 2.6 shows the sample size needed to detect a difference using means and the corresponding difference using proportions to illustrate the effects on required sample size when a continuous variable is analysed in two groups.

The calculations use standard formulae and were done using the statistical program nQuery (Statistical Solutions). It is assumed that birthweight follows a Normal distribution with a mean of 3500 g and a standard deviation (SD) of 500 g. Power is 90% and the significance level is 5%.

Table 2.6 Sample size needed to detect a difference in mean birthweight (BW) between two groups and the corresponding sample size (SS) needed to detect an equivalent difference in percentage of low birthweight (<2500 g, LBW)

Difference in BW    SS (means)    Difference in % LBW    SS (proportions)

50 g                …             …                      …

100 g               …             …                      …

150 g               …             …                      …

200 g               …             …                      …

250 g               …             …                      …
This example illustrates that, for the same size of difference, categorization increases required sample size considerably.
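The pattern behind Table 2.6 can be reproduced with the standard per-group sample-size formulae, under the assumptions stated in the text (birthweight Normal with mean 3500 g and SD 500 g, 90% power, 5% significance). This is a sketch; exact numbers may differ slightly from nQuery’s, for example through continuity corrections.

```python
import math
from statistics import NormalDist

nd = NormalDist()
z = nd.inv_cdf(1 - 0.05 / 2) + nd.inv_cdf(0.90)  # 5% two-sided, 90% power

def n_means(diff, sd):
    """Per-group sample size to detect a difference in means."""
    return math.ceil(2 * z ** 2 * sd ** 2 / diff ** 2)

def n_props(p1, p2):
    """Per-group sample size to detect a difference in proportions."""
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

mean_bw, sd_bw, cutoff = 3500, 500, 2500
for diff in (50, 100, 150, 200, 250):
    p1 = nd.cdf((cutoff - mean_bw) / sd_bw)           # % LBW, group 1
    p2 = nd.cdf((cutoff - (mean_bw - diff)) / sd_bw)  # % LBW, group 2
    # Dichotomizing always demands a considerably larger sample
    assert n_props(p1, p2) > n_means(diff, sd_bw)
```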

Dichotomization: dilemma and solution

  • Researchers and doctors dichotomize outcomes to help decision-making and to make outcomes clinically meaningful

  • The distributional approach helps solve the problem by providing a dual approach so that outcomes can be reported in both their continuous and dichotomous forms without loss of power (Peacock et al. 2012)

  • The following example shows how this approach can improve the clinical meaningfulness in a research study

Example: using the distributional approach to aid clinical interpretation

Effects of type of ventilation in very preterm babies on their lung function in adolescence

  • These data come from an RCT where extremely preterm babies who needed respiratory help at birth received either conventional or oscillatory ventilation

  • The children were assessed following birth and in infancy and no differences in outcome were found by type of ventilation

  • When assessed at age 11–14 years, a small, statistically significant difference in mean lung function was found but the clinical importance of this was unclear (forced expiratory flow at 75% of the expired vital capacity (FEF75) mean z-score difference = 0.23)

  • Using the distributional approach, the difference in means was also shown as the equivalent dichotomized outcome, the proportion with abnormal lung function, and showed a difference of 10 percentage points, 37% versus 47%, in favour of oscillation

  • This dual presentation of results as a difference in means and a difference in the proportion with abnormal lung function provided greater clarity of the clinical meaning of the findings

  • See also Zivanovic et al. (2014) and the following ‘Further reading’

Further reading

Peacock JL, Sauzet O, Ewings SM, Kerry SM. Dichotomising continuous data while retaining statistical power using a distributional approach. Stat Med 2012; 31:3089–103.

Sauzet O. Software.

Sauzet O, Breckenkamp J, Borde T, Brenne S, David M, Razum O, Peacock JL. A distributional approach to obtain adjusted comparisons of proportions of a population at risk. Emerg Themes Epidemiol 2016; 13:8.

Sauzet O, Peacock JL. Estimating dichotomised outcomes in two groups with unequal variances: a distributional approach. Stat Med 2014; 33:4547–59.

Statistical Solutions. nQuery advisor: sample size and power calculations.

Zivanovic S, Peacock J, Alcazar-Paris M, Lo JW, Lunt A, Marlow N, et al. Late outcomes of a randomized trial of high-frequency oscillation in neonates. N Engl J Med 2014; 370:1121–30.

Regression to the mean

What is it?

Regression to the mean is a statistical phenomenon that has important consequences for the design, analysis, and interpretation of research. It works like this:

  • On average, individuals with extreme values at a first measurement have less extreme values when measured again

  • Regression to the mean is observed due to natural variation in measurement levels, irrespective of any intervention or treatment effect


  • When individuals are chosen because they have extreme levels within their population, they will tend to have less extreme values when measured again

  • For example, patients with high blood pressure will tend to have lower blood pressure when measured again

  • This affects how we interpret changes in measurements over time

Example: regression to the mean

See Rees et al. (2013).

Vitamin D supplementation and upper respiratory tract infection

  • We look at changes in vitamin D level from baseline to 1 year in the placebo group. (Note: we wouldn’t expect any change in mean levels)

  • The placebo group is categorized into quartiles according to vitamin D level at baseline

  • Figure 2.4 shows mean vitamin D levels in the four quartile groups at baseline and 1 year. We see that:

    • Those with a high starting mean value (upper quartile) have a lower mean at 1 year

    • Those with a low starting mean value (lower quartile) have a higher mean at 1 year

    • In other words, both group means have moved towards the overall mean when measured again—they are less extreme

    • Means in the middle two groups have hardly changed but moved slightly nearer the overall mean

Figure 2.4 Regression to the mean in an untreated (placebo) group.

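The pattern in Figure 2.4 can be reproduced with a small simulation in which nothing changes between the two measurements except random measurement error. All the numbers here (a true level of 50 with SD 8, measurement error SD 5) are hypothetical.

```python
import random
import statistics

random.seed(1)
N = 10_000

# Each subject has a stable true level; baseline and follow-up measurements
# add independent random measurement error (all values hypothetical).
true_level = [random.gauss(50, 8) for _ in range(N)]
baseline = [t + random.gauss(0, 5) for t in true_level]
follow_up = [t + random.gauss(0, 5) for t in true_level]

# Rank subjects by their *baseline* value and take the extreme quarters
ranked = sorted(zip(baseline, follow_up))
k = N // 4
bottom, top = ranked[:k], ranked[-k:]

mean = statistics.fmean
# With no treatment at all, the extreme groups drift back towards the mean
assert mean(f for _, f in top) < mean(b for b, _ in top)
assert mean(f for _, f in bottom) > mean(b for b, _ in bottom)
```

The extreme baseline groups are partly selected for extreme measurement error, which does not recur at follow-up, so their follow-up means are closer to the overall mean.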

Implications of regression to the mean

  • Designing RCTs—need for control group. When investigating the effect of a treatment on change from baseline, we need a control (placebo) group. We compare change in the treatment group with change in the control group to remove the effect of regression to the mean

  • Appraising evidence—beware of changes in ‘extreme’ groups. Changes in extreme groups may be due to regression to the mean. For example, failing schools ‘improve’, great schools get worse, crime rates go up/down in good/bad areas, etc.—beware!

  • For further reading, see Bland and Altman (1994a, 1994b) and Bland (2015, chapter 11)


Bland JM, Altman DG. Regression towards the mean. BMJ 1994a; 308:1499.

Bland JM, Altman DG. Some examples of regression towards the mean. BMJ 1994b; 309:780.

Bland M. An introduction to medical statistics, 4th ed. Oxford: Oxford University Press, 2015.

Rees JR, Hendricks K, Barry EL, Peacock JL, Mott LA, Sandler RS, Bresalier RS, et al. Vitamin D3 supplementation and upper respiratory tract infections in a randomized, controlled trial. Clin Infect Dis 2013; 57:1384–92.

Collecting additional data

Descriptive, predictive, and exposure data

Similar principles apply to these as apply to the selection and recording of main outcomes:

  • Continuous variables are preferable from a statistical viewpoint, since they will give more precision to analyses

  • If the data are obtained from notes or from direct enquiry, then they should be recorded with adequate precision

  • If the data will be from a self-completed questionnaire, then subjects may prefer to tick boxes rather than give exact numbers and the tension between accuracy and completeness will come into play (see Questions and questionnaires, p. [link])

How much data to collect?

Research studies require certain specific data which must be collected to fulfil the aims of the study, such as the primary and secondary outcomes and main factors related to them. Beyond these data, there are often other data that could be collected and it is important to weigh the costs and consequences of not collecting data that will be needed later against the disadvantages of collecting too much data.

  • Too little data: data that are not collected may be impossible to obtain on a later occasion, so it is important to decide at the outset what key data are needed

  • Too much data: collecting too much data is likely to add to the time and cost of data collection and processing, and may threaten the completeness and/or quality of the dataset as a whole, including the key data items. For example, if a questionnaire is overly long, respondents may leave some questions out or refuse to fill it in at all

Further reading

Chapter 3 gives much more information on collecting data.

Sampling strategies


Whenever a sample is used to provide information about a wider population, we have to consider how the sample is to be chosen. There are two key properties of samples which impinge on a study. First is the size of the sample, which affects the precision of the analyses. We will address this issue elsewhere in this section. Second is the choice of sample, which needs to be representative of the underlying population of interest for the results to be generalizable to that population.

Convenience sample

Many studies use a sample of patients available at a particular time/place, for example, patients who attend an asthma clinic may be recruited into a survey of the use of spirometers. The results of this study will apply to the population from which this sample is drawn and may not apply to other populations because patients’ attendance at a clinic may be due to their response to treatment or their use of spirometers. Hence, they may not be representative of all patients using spirometers.

It is important when using a convenience sample to collect and report information about the baseline characteristics of the sample so that the generalizability of this sample can be deduced.

Quota sample

In choosing a quota sample, the researcher aims to identify a representative sample by choosing subjects in proportion to their numbers in the population of interest. For example, if age, marital status, sex, and employment status were important characteristics, then the researcher would select a number of subjects with each combination of these characteristics so that the overall proportions with the characteristics reflected the proportions in the population. Quota sampling is often used in market research but is less common in medical research. The difficulty with quota sampling is that subjects recruited may differ from those not recruited in subtle ways, for example, if the sample is obtained by knocking on doors or by approaching people in the street or by telephoning, certain sections of the populations will be excluded. Therefore, a quota sample provides no estimate of the true response rate and may not be representative of the desired population.

Random sample (simple random sample)

A random sample is chosen so that each member of the population has an equal chance of being chosen and so the selection is completely independent of patient characteristics. In order to draw a random sample, a list of the population is needed: the sampling frame. A random sample will be representative of the population from which it was chosen because the characteristics of the individuals are not considered when the selection is made. Random sampling can be done using computer programs.
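For example, a simple random sample can be drawn from a sampling frame in a couple of lines (the patient identifiers below are hypothetical):

```python
import random

random.seed(42)  # for a reproducible selection
# Hypothetical sampling frame listing every member of the population
sampling_frame = [f"patient_{i:04d}" for i in range(1, 1201)]

# Simple random sample: every member has an equal chance of selection,
# independent of any patient characteristics
sample = random.sample(sampling_frame, k=60)

assert len(set(sample)) == 60                    # 60 distinct subjects
assert all(s in sampling_frame for s in sample)  # all drawn from the frame
```

`random.sample` draws without replacement, so no patient can be selected twice.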

Stratified sample

Stratified samples are used when fixed numbers are needed from particular sections or strata of the population in order to achieve balance across certain important factors. For example, a study designed to estimate the prevalence of diabetes in different ethnic groups may choose a random sample with equal numbers of subjects in each ethnic group to provide a set of estimates with equal precision for each group. If a simple random sample is used rather than a stratified sample, then estimates for minority ethnic groups may be based on small numbers and have poor precision. In terms of efficiency, a stratified sample gives the most precise overall (weighted) estimate, where the overall estimate is weighted according to the fractions sampled in each stratum.

Cluster sample

Cluster samples may be chosen where individuals fall naturally into groups or clusters. For example, patients on a hospital ward or patients in a GP practice. If a sample is needed of these patients, it may be easier to list the clusters and then to choose a random sample of clusters, rather than to choose a random sample of the whole population. (In fact, it may be impossible to list the whole population.) Having chosen the clusters, the researcher can either select all subjects in the cluster or take a random sample within the cluster. Cluster sampling is less efficient statistically than simple random sampling and so needs to be accounted for in the sample size calculations and subsequent analyses (see Cluster samples: units of analysis, p. [link]).
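The loss of statistical efficiency is conventionally quantified by the design effect, DEFF = 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intracluster correlation (a standard result, not derived in this section). The figures below are hypothetical, purely for illustration:

```python
import math

# Design effect for cluster sampling (standard result):
# DEFF = 1 + (m - 1) * ICC, m = cluster size, ICC = intracluster correlation.
# Hypothetical figures: 300 subjects suffice under simple random sampling;
# we instead sample 20 patients per GP practice, with ICC = 0.05.
n_srs, m, icc = 300, 20, 0.05

deff = 1 + (m - 1) * icc   # 1.95: clustering nearly halves the information
n_needed = n_srs * deff    # about 585 subjects in total
practices = math.ceil(n_needed / m)

assert round(n_needed) == 585
assert practices == 30
```

Even a modest intracluster correlation of 0.05 almost doubles the required sample here, which is why clustering must be reflected in the sample size calculation.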

Choosing a sample size

Samples and populations

For pragmatic reasons, research studies nearly always use samples from populations rather than the entire population. Sample estimates will therefore be an imperfect representation of the entire population since they are based on only a subset of the population. As stated previously, when the sample is unbiased and is large enough, then the sample will provide useful information about the population. As well as considering how representative a sample is, it is important also to consider the size of the sample. A sample may be unbiased and therefore representative, but too small to give reliable estimates.

Consequences of too small a sample: studies producing estimates

Prevalence estimates from small samples will be imprecise and therefore may be misleading. For example, suppose we wish to investigate the prevalence of a condition for which studies in other settings have reported a prevalence of 10%. A small sample of, say, 20 people, would be insufficient to produce a reliable estimate since only 2 would be expected to have the condition and a decrease or increase of 1 person would change the estimate considerably (2/20 = 10%, 1/20 = 5%, 3/20 = 15%). Such a study needs a large sample to give a stable estimate.

  • When estimating quantities from a sample such as a proportion or mean, we use the 95% confidence interval to show how precise the estimate is (see 95% confidence interval for a proportion, p. [link])

  • If the confidence interval is narrow, then the estimate is precise and conversely, if the interval is wide, then the estimate is imprecise

  • Sample size calculations determine the number of subjects needed to give a sufficiently narrow confidence interval
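These points can be illustrated numerically with the usual approximate formula for the width of a 95% confidence interval for a proportion, 2 × 1.96 × √(p(1 − p)/n), applied to the 10% prevalence example above (the sample size of 500 is an arbitrary illustration):

```python
import math

def ci_width(p, n, z=1.96):
    """Approximate total width of the 95% confidence interval
    for a proportion p estimated from a sample of size n."""
    return 2 * z * math.sqrt(p * (1 - p) / n)

# With the 10% prevalence used in the text:
assert ci_width(0.10, 20) > 0.25   # n = 20: estimate is hopelessly imprecise
assert ci_width(0.10, 500) < 0.06  # a large sample pins the estimate down
```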

Consequences of too small a sample: studies making comparisons

When we compare two groups we use a significance test to calculate the P value and if possible, we calculate the difference and a confidence interval for the difference. For example, when we compare mean blood pressure in patients given two different treatments for hypertension, we can calculate the difference in means between the two groups and a 95% confidence interval for the difference. The result of the significance test may be statistically significant or non-significant, depending on the size of the P value. The P value is affected by the sample size and if the sample is too small, there may not be enough data to draw a firm conclusion about any differences. If the sample is small then, in general, the observed difference needs to be larger to be statistically significant. As a consequence, small but important differences may be statistically non-significant in small samples. Hence, if there is a true difference between groups in the target population, the study must be big enough to give a significant result; otherwise incorrect conclusions may be drawn.

  • Statistical comparisons are made using significance tests which give a P value (see P values, p. [link])

  • If the sample is too small, a true difference may be missed

Calculating sample size

There are formulae for calculating sample size, and the simplest and most commonly used are given in following sections. Computer programs can be used such as the specialist sample size programs nQuery advisor (Statistical Solutions) and PASS (NCSS Statistical Software), which do a wide range of sample size calculations. Some general statistical analysis programs such as Stata (Stata Corporation) also perform sample size calculations for a wide range of situations. G*Power is a free sample size package that covers a good range of situations (University of Düsseldorf; Faul et al. 2009).

The books by Chow et al. (2008) and Machin et al. (2008), listed in the references, also give tables for the calculation of sample size.


Sample size calculations for studies estimating a mean or proportion, and for studies comparing two means or two proportions are shown in the following sections in this chapter. Before the calculations can be done, certain information is needed. This is listed, described, and discussed with examples of sample size calculations using the programs nQuery, PASS, and Stata.


Chow SC, Shao J, Wang H. Sample size calculations in clinical research, 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2008.

Faul F, Erdfelder E, Buchner A, Lang AG. Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses. Behav Res Methods 2009; 41:1149–60.

Machin D, Campbell MJ, Tang S-B, Huey S. Sample size tables for clinical studies, 3rd ed. London: BMJ Books, Wiley, 2008.

NCSS Statistical Software. PASS: power analysis and sample size software.

Stata Corporation. Stata: data analysis and statistical software.

Statistical Solutions. nQuery advisor: sample size and power calculations.

University of Düsseldorf. G*Power.

Sample size for estimation studies: means

Estimating a mean with a specified precision

The following information is required:

  • The standard deviation (SD) of the measure being estimated

  • The desired width of the confidence interval (d)

  • The confidence level

The standard deviation is needed because the sample size depends partly on the variability of the measure being estimated. The greater the variability of a measure, the greater the number of subjects needed in the sample to estimate it precisely.

The standard deviation can be estimated from previously published studies on the same topic, from contact with another worker in the field, or from a small pilot study.

The desired width of the confidence interval, d, indicates the precision of the mean and is decided by the researcher.

The confidence level is usually set at 95%, giving a sample confidence interval that contains the true population mean with probability 95%. Other values such as 90% or 99% can be used, but are unusual in practice.

Assuming that the confidence level is 95%, the sample size, n, is then given by:

n = 1.96² × 4 × SD²/d²

To change the confidence level, change the multiplier ‘1.96²’ as follows:

95% confidence level: n = 1.96² × 4 × SD²/d²

90% confidence level: n = 1.64² × 4 × SD²/d²

99% confidence level: n = 2.58² × 4 × SD²/d²

where 1.96, 1.64, and 2.58 are the two-sided 5%, 10%, and 1% points, respectively, of the Normal distribution.


Suppose we wish to estimate mean systolic blood pressure in a patient group with a 10 mmHg-wide 95% confidence interval, that is, 5 mmHg either side of the mean. Previous work suggested using a standard deviation of 11.4 mmHg.

  • The standard deviation (SD) of the measure being estimated = 11.4

  • The desired width of the confidence interval (d) = 10

  • The confidence level = 95%

n = 1.96² × 4 × SD²/d²

n = 15.37 × 11.4²/10²

n = 20

Suppose we reduce the width of the confidence interval to 5 mmHg:

n = 1.96² × 4 × 11.4²/5²

n = 80

So doubling the precision leads to a quadrupling of the sample size.
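As a check on the arithmetic, the formula can be wrapped in a short Python function (rounding to the nearest whole number, as the worked example does; in practice sample sizes are usually rounded up):

```python
def n_to_estimate_mean(sd, width, z=1.96):
    """Sample size for a 95% CI of the given total width around a mean,
    rounded to the nearest whole number as in the worked example."""
    return round(z ** 2 * 4 * sd ** 2 / width ** 2)

# Worked example from the text: SD = 11.4 mmHg
assert n_to_estimate_mean(11.4, 10) == 20  # 10 mmHg-wide 95% CI
assert n_to_estimate_mean(11.4, 5) == 80   # halving the width quadruples n
```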

Sample size for estimation studies: proportions

Estimating a proportion with a specified precision

The following information is required:

  • The expected population proportion, p

  • The desired width of the confidence interval, d

  • The confidence level

The expected population proportion is the best guess of what the value will be. This need not be exact; an approximate figure, such as 0.02 (2%), 0.05 (5%), or 0.10 (10%), is sufficient. The guess can be obtained from previously published studies on the same topic, from contact with another worker in the field, or from a small pilot study. In most cases the researcher will have an idea of what the value will be; if no guess is possible, then use 0.50.

It may appear counterintuitive to need a ‘guess’ of the proportion in the sample size calculation for a study designed to estimate it. It is needed because the variability of a proportion, which enters the calculation, depends on the proportion itself. When estimating a mean, by contrast, the variability (estimated by the standard deviation) is independent of the mean.

The desired width of the confidence interval, d, indicates the precision of the proportion and is decided by the researcher. The confidence level is usually set at 95%, giving a sample confidence interval that contains the true population proportion with probability 95%.

Assuming that the confidence level is 95%, the sample size, n, is then given by:

n = 1.96² × 4 × p(1 − p)/d²

Note that this formula uses the proportion, not the percentage. Although the two convey the same information, the formula is only valid with p expressed as a proportion.

To change the confidence level, change the multiplier ‘1.96²’ as follows:

95% confidence level: n = 1.96² × 4 × p(1 − p)/d²

90% confidence level: n = 1.64² × 4 × p(1 − p)/d²

99% confidence level: n = 2.58² × 4 × p(1 − p)/d²


Suppose we wish to estimate the prevalence of asthma in an adult population with a 95% confidence interval of width 0.10, that is, an accuracy of ±0.05. An estimate of the prevalence of asthma is 0.10 (10%).

  • The expected population proportion, p = 0.10

  • The desired width of the confidence interval, d = 0.10

  • The confidence level = 95%

n = 1.96² × 4 × p(1 − p)/d²

n = 15.37 × 0.1 × (1 − 0.1)/0.10²

n = 138

If we choose to double the accuracy, giving a 95% confidence interval of width 0.05:

n = 1.96² × 4 × 0.1 × (1 − 0.1)/0.05²

n = 553

Again, doubling the precision leads to a quadrupling of the sample size.
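The same calculation for a proportion can be sketched as follows; the function name and the rounding to the nearest whole number are mine, chosen to reproduce the worked figures.

```python
def n_for_proportion(p, d, z=1.96):
    """Sample size to estimate a proportion with a confidence interval of total width d.

    Implements n = z^2 x 4 x p(1 - p) / d^2; p must be a proportion, not a percentage.
    """
    return z ** 2 * 4 * p * (1 - p) / d ** 2

# Asthma prevalence example: p = 0.10, 95% CI width 0.10 and then 0.05
print(round(n_for_proportion(0.10, 0.10)))  # 138
print(round(n_for_proportion(0.10, 0.05)))  # 553
```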

Sample size for comparative studies

Significance tests: type 1, type 2 errors

A significance test to compare two groups in a sample may lead us to an incorrect conclusion about the target population in two different ways:

  • Type 1 error: we conclude that there is a difference between the groups in the target populations when in fact there is not. This is actually the significance level of the test and so when we use 0.05 or 5% as the cut-off for statistical significance, then the probability of a type 1 error is 5%. This is often denoted by ‘α‎’

  • Type 2 error: we conclude that there is no difference between the groups in the target population when in fact a real difference of a given size does exist. The type 2 error is often denoted by ‘β‎’ and 1 − β‎ is the power of the study

Note that this means that the power of a study is the ability of the study to detect a difference if one exists.

In calculating the required sample size for a study, we want to minimize both type 1 and type 2 errors, that is, to avoid spurious statistical significance and to avoid missing a real difference. By convention, the significance level is usually kept at 5%, and the power is set high, at 80% at least and preferably 90% or more.

Minimum clinically important difference

The minimum clinically important difference (MCID) is needed in the sample size calculations. This is the smallest size of difference that the researcher considers to be so important that they would not want their study to miss it. In other words, this size of difference is considered to be clinically meaningful. If the study is too small to detect this size of difference, and it exists, the comparison will be non-significant and the study will therefore be inconclusive.

The choice of a clinically important difference is not a statistical one, but relates to the context of the study. It can be difficult to decide how big a difference would be important in a given context. The literature and/or discussions with colleagues may help decide what size of difference is important.

Pre-determined sample size

In some situations, the sample size is fixed, either because of the limited availability of subjects or because of time or financial constraints. In such cases, sample size calculations should still be done to see how big a difference could be detected with the given sample size. If the available sample size is sufficient to achieve the aims of the study then the study can go ahead; if it is not, it is questionable whether to proceed. It is better to know in advance that the sample size is too small, and choose not to do the study, than to conduct a study that turns out to be too small and therefore inconclusive.

Some statisticians consider that it is unethical to carry out research which is likely to be inconclusive due to small sample size: it wastes resources and patients’ time, and can lead to the wrong interpretation that there is no real difference (i.e. a type 2 error). Others argue that small studies are justified if they add to the pool of evidence and can be combined with other small studies in a meta-analysis (see Chapter 13).

Sample size for comparative studies: means

In a comparative study, we choose the sample size to have a high probability of detecting a difference of a given size if it exists but also to have a low probability of finding a significant difference when no real difference exists. In other words, we want to have high power (and hence low type 2 error) and a low significance level (low type 1 error). The formula used for comparing means and comparing proportions balances these probabilities and allows us to calculate sample sizes given certain information. The following information is required:

  • The standard deviation (SD) of the measure being compared

  • The minimum difference (d) that is clinically important (MCID)

  • The significance level (α‎)

  • The power of the test (1 − β‎)

The standard deviation is estimated from previously published studies on the same topic, from contact with another worker in the field or from a small pilot study.

The minimum difference that is clinically important is decided beforehand by the researcher.

The significance level, α‎ is the maximum acceptable type 1 error rate and is usually set at 5%.

The power of the test, 1 − β‎, is the probability of getting a significant result when the true difference between the means is d and is set at 80% or more, preferably 90%.

To compare the two means we need the following number of patients in each group:

n = 2K × SD²/d²

The total sample size is 2n. K is a multiplier that depends on the significance level and power and comes from the Normal distribution. Details of the formula and the multipliers (see Table 2.7, p. [link]) are given in Bland (2015, chapter 18).
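The multipliers can also be computed directly; this sketch (function name mine) uses Python’s statistics.NormalDist and assumes the standard relationship K = (z(1 − α/2) + z(power))², which reproduces the values quoted in the examples that follow.

```python
from statistics import NormalDist

def multiplier_K(alpha, power):
    """Multiplier K = (z(1 - alpha/2) + z(power))^2 from the Normal distribution."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2

print(round(multiplier_K(0.05, 0.90), 1))  # 10.5
print(round(multiplier_K(0.05, 0.80), 1))  # 7.8
```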

Table 2.7 Multipliers for studies comparing two means or two proportions

Power (1 − β)   Significance level (α)
                5%      1%
80%             7.8     11.7
90%             10.5    14.9
95%             13.0    17.8

(1) A study of the effects of smoking on birthweight should be able to show a difference between smokers and non-smokers of 200 g with high power. SD for birthweight is 500 g. We will use a significance level of 5% and power of 90%, giving K = 10.5 from Table 2.7.

  • The standard deviation (SD) of the measure being compared = 500

  • The minimum difference (d) that is clinically important = 200

  • The significance level (α‎) = 5%

  • The power of the test (1 − β‎) = 90%

n = 2K × SD²/d²

n = 2 × 10.5 × 500²/200²

n = 131 in each group

(2) Suppose we choose a 5% significance level and 80% power. This gives K = 7.8.

n = 2 × 7.8 × 500²/200²

n = 98 in each group

(3) Suppose we could only recruit 50 in each group: what size of difference could be detected with 80% power and a 5% significance level (K = 7.8)?

n = 2K × SD²/d²

Rearrange to give:

d² = 2K × SD²/n

d² = 2 × 7.8 × 500²/50 = 78,000

d = 280

Under these circumstances, the study will have a high probability of detecting differences of 280 g or more, but an observed difference of 200 g will not be statistically significant. In this situation, it may be decided that the study is unlikely to be conclusive and is not worthwhile.
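The two calculations above, and the rearrangement for a fixed sample size, can be sketched as follows (function names are mine):

```python
import math

def n_per_group(sd, d, K):
    """Number per group to detect a difference d: n = 2 x K x SD^2 / d^2."""
    return 2 * K * sd ** 2 / d ** 2

def detectable_difference(sd, n, K):
    """Rearranged for a fixed n per group: d = sqrt(2 x K x SD^2 / n)."""
    return math.sqrt(2 * K * sd ** 2 / n)

# Birthweight example: SD = 500 g, MCID = 200 g
print(round(n_per_group(500, 200, 10.5), 2))  # 131.25, i.e. 131 per group at 90% power
print(round(n_per_group(500, 200, 7.8), 2))   # 97.5, i.e. 98 per group at 80% power
# With only 50 per group (80% power), the detectable difference:
print(detectable_difference(500, 50, 7.8))    # about 279.3, i.e. roughly 280 g
```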


Bland M. An introduction to medical statistics, 4th ed. Oxford: Oxford University Press, 2015.

Sample size for comparative studies: proportions

To calculate the sample size for a study comparing two proportions, the following information is required:

  • The expected population proportion in group 1, P1

  • The expected population proportion in group 2, P2

  • The significance level (α‎)

  • The power of the test (1 − β‎)

The expected population proportion in group 1 and the expected population proportion in group 2 are the best estimates of what these values will be. The difference therefore reflects the minimum anticipated change in the proportion which would be regarded as clinically important (MCID).

The significance level, α‎, is the type 1 error and is usually set at 5%.

The power of the test, 1 − β, is the probability of getting a significant result when the true difference between the proportions is P1 − P2, and is set at 80% or more, preferably 90%.

n = K[P1(1 − P1) + P2(1 − P2)]/(P1 − P2)²

where n is the number in each group as before.


A study is planned to compare patient outcomes following the current form of surgery and a new method. It is expected that the new surgery will have fewer complications. The proportion of patients who develop complications after the current surgery is 15%, and it is expected that the new form of surgery will have a 5% complication rate.

Assuming a significance level of 5% and power of 90% gives K = 10.5 from Table 2.7:

  • The expected population proportion in group 1, P1 = 0.15

  • The expected population proportion in group 2, P2 = 0.05

  • The significance level (α‎) = 0.05

  • The power of the test (1 − β‎) = 0.90

n = K[P1(1 − P1) + P2(1 − P2)]/(P1 − P2)²

n = 10.5 × [0.15(1 − 0.15) + 0.05(1 − 0.05)]/(0.15 − 0.05)²

n = 184 in each group
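A minimal sketch of the two-proportion formula (function name mine); it returns the unrounded number per group:

```python
def n_per_group_proportions(p1, p2, K):
    """n = K x [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2 per group."""
    return K * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# Surgery example: complication rates 15% vs 5%, K = 10.5 (5% significance, 90% power)
print(round(n_per_group_proportions(0.15, 0.05, 10.5), 2))  # 183.75
```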

Sample size calculations: further issues

Assumptions of sample size formulae for means and proportions

  • There is no attrition, that is, the total number of patients successfully recruited and who complete the study is equal to the number required

  • For comparative studies, assume equal numbers of subjects per group

  • Samples are simple random samples; any randomization is at the individual level. Sample size calculations are different for cluster samples or cluster randomization and the usual calculations will give too few subjects (see Sample size in cluster trials, p. [link])

  • For comparative studies, a simple comparison of two groups only will be made. Multiple regression or logistic regression (see Multiple regression, p. [link] and Logistic regression, p. [link]) is not planned

  • The samples are large enough to use large sample methods for the analysis (see 95% confidence interval for a proportion, p. [link])

Sample size calculation in other situations

  • When attrition is expected: if there are likely to be losses, inflate total to allow for estimated attrition. For example, if the calculated required sample size is 80 in total and it is anticipated that 20% of those recruited will not complete, then 100 patients should be recruited to ensure that 80 will complete

  • Unequal numbers in the groups: unequal numbers in the groups can be dealt with in all good software packages such as nQuery (Statistical Solutions), PASS (NCSS Statistical Software), Stata (Stata Corporation), and G*Power (University of Düsseldorf)

  • Multifactorial analyses are planned: here the sample size calculations are difficult. The statistical power needs to be higher than for a two-group comparison. The calculations can be done if the correlation between the variables is available, but often this is not known. In such circumstances, a rule of thumb is to increase the sample size by 10% for every extra variable added. (Note that for a categorical variable, the number of variables here is the total number of categories minus 1.) Simulations can be used to estimate required sample sizes for multifactorial analyses; these are beyond the scope of this book

  • Small sample situations: if the calculated sample size is small, say, fewer than 50 per group, then large sample methods may not be possible for the statistical analysis and so the sample size calculations may need adjusting. This may be handled by the sample size program but it is best to check and seek statistical advice if in doubt

  • Survival analysis: if you are comparing the proportion of deaths in two groups at a fixed point and there is no censoring, then the sample size calculations for the comparison of two proportions can be used. If a log rank test is to be used to compare the survival curves, then these calculations are not suitable. nQuery, PASS, or Stata will do the calculations; the formulae are given in Collett (2014, chapter 10)

  • Equivalence trials (see Superiority and equivalence trials, p. [link]): sample size calculations for these need specialized formulae which take into account the limits of equivalence that are acceptable in the trial. These can be done in nQuery and in PASS
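The attrition adjustment in the first bullet can be sketched as follows (function name mine):

```python
import math

def recruit_allowing_for_attrition(n_complete, attrition):
    """Number to recruit so that n_complete remain after the expected attrition rate.

    Inflates by 1 / (1 - attrition) and rounds up.
    """
    return math.ceil(n_complete / (1 - attrition))

# 80 completers needed, 20% expected attrition -> recruit 100
print(recruit_allowing_for_attrition(80, 0.20))  # 100
```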

When to do replicate measurements

In some situations, measurements are hard to make or are variable and so it is best if several measurements are taken. We give some suggestions:

  • For quantities that are hard to measure accurately, such as skinfold thickness, take three values and use the mean

  • For quantities that depend on patient effort (e.g. peak flow rate), take three values and use the maximum

  • For quantities that vary, such as blood pressure which varies across the day and is subject to ‘white coat syndrome’, it may be necessary to take several measurements over a period of time to get an accurate assessment

  • For quantities that vary due to external factors, such as blood sugar levels which vary with food intake, alternative measures may be needed (e.g. HbA1c level as a surrogate for blood sugar)

Are sample size calculations as described here always needed?

If the study is a descriptive survey, then sample size calculations may be difficult. However, it is important to ensure there are sufficient subjects to achieve the aims of the study. For example, in a survey of satisfaction in two patient groups, there will need to be adequate numbers in the two groups to be able to compare satisfaction. It is useful in such situations to list the main cross tabulations that will be needed and to ensure that total numbers will give adequate numbers in the individual table cells.


Collett D. Modelling survival data in medical research. Boca Raton, FL: Chapman & Hall/CRC, 2014.

NCSS Statistical Software. PASS: power analysis and sample size software.

Stata Corporation. Stata: data analysis and statistical software.

Statistical Solutions. nQuery advisor: sample size and power calculations.

University of Düsseldorf. G*Power.

Sample size in cluster trials

Trials randomized by cluster

When individuals are allocated to treatments in whole groups or clusters rather than as individuals, the sample size calculations are different. This is because individuals within the same cluster are more similar to each other than individuals in different clusters. These differences are quantified by the intraclass correlation coefficient (ICC) which is used in the calculation of sample size.

Intraclass correlation coefficient

The ICC summarizes the correlation between individuals within the same cluster. It is the ratio of the between-cluster variation to the total variation (between-cluster plus within-cluster), that is:

ICC = variation between clusters/(variation between clusters + variation within clusters)

Hence, the ICC summarizes the extent of the ‘clustering effect’: if there were no variability between clusters, the ICC would be zero.

Design effect

The ICC is used to inflate the sample size to allow for clustering: the calculation is first done ignoring the clustering, and the result is then increased using the design effect formula:

Design effect = 1 + (k − 1) × ICC

where k is the number of subjects per cluster.

Hence the steps taken to calculate the sample size for a cluster trial are:

  • Estimate the ICC from other studies or a pilot

  • Calculate the sample size ignoring the clustering (see Sample size for comparative studies: means, p. [link])

  • Decide on feasible number of subjects per cluster (k)

  • Calculate the total cluster trial sample size: design effect × simple total


  • Note that there is no unique combination of numbers of clusters and subjects per cluster (Figure 2.5)

  • Calculations assume equal numbers of subjects per cluster

  • Even apparently small ICCs can have a marked effect on the sample size and should not be ignored

  • Seek statistical advice/collaboration for cluster trial designs
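The steps above can be sketched as follows; the ICC of 0.05, cluster size of 20, and simple total of 196 are illustrative values I have assumed, not figures from the text.

```python
import math

def cluster_total(simple_total, k, icc):
    """Total sample size for a cluster trial: design effect x simple total, rounded up.

    design effect = 1 + (k - 1) x ICC, where k = subjects per cluster.
    """
    design_effect = 1 + (k - 1) * icc
    return math.ceil(simple_total * design_effect)

# Assumed illustration: simple total 196, 20 subjects per cluster, ICC = 0.05
print(round(1 + (20 - 1) * 0.05, 2))  # design effect = 1.95
print(cluster_total(196, 20, 0.05))   # 383 subjects in total
```

Note how even a modest ICC of 0.05 almost doubles the required total, illustrating why apparently small ICCs should not be ignored.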

Figure 2.5 Design of a cluster trial.

Further example

The analysis of a cluster randomized trial is outlined in Cluster samples: analysis, p. [link].

Further reading

Donner A, Klar N. Design and analysis of cluster randomization trials in health research. London: Arnold, 2000.

Eldridge S, Kerry SM. A practical guide to cluster randomised trials in health services research. Chichester: Wiley, 2012.

Kerry SM, Bland JM. Sample size in cluster randomisation. BMJ 1998; 316:549.

Using a statistical program to do the calculations

The following examples show the same sample size calculations in nQuery (Statistical Solutions), in PASS (NCSS Statistical Software), and in Stata (Stata Corporation). The same information was input into each program to give the required sample size per group.

The study was to compare lung function in two groups of infants. Power was set at 90% and significance level at 5%. A difference of 0.5 standard deviations was considered to be clinically worthwhile. Equal numbers were to be in each group.



Stata (Stata Corporation) is a command-driven program, so the actual commands need to be typed and then the calculations are done. The command is shown below in bold, followed by the results that the program gives. The sample size is given as 86 per group.

power twomeans 2.6 2.1, power(0.9) sd(1.0) nratio(1.0)

Performing iteration ...

Estimated sample sizes for a two-sample means test

t test assuming sd1 = sd2 = sd

Ho: m2 = m1 versus Ha: m2 != m1

Study parameters:

        alpha =   0.0500
        power =   0.9000
        delta =  -0.5000
           m1 =   2.6000
           m2 =   2.1000
           sd =   1.0000

Estimated sample sizes:

            N =      172
  N per group =       86


nQuery (Statistical Solutions) is a menu-driven program where the user chooses commands from menus provided. The data are entered into a table on the screen and when all fields are complete, the number per group is automatically calculated. This is shown as follows in bold.

Two group t-test of equal means (unequal n’s)

Test significance level, a           0.050
1 or 2 sided test?                   2
   Group 1 mean, m1                  2.600
   Group 2 mean, m2                  2.100
   Difference in means, m1 − m2      0.500
   Common standard deviation, s      1.000
Effect size, d = |m1 − m2| / s       0.500
Power ( % )                          90
n1                                   86
n2                                   86
Ratio: n2 / n1                       1.000
N = n1 + n2                          172

Like nQuery, PASS (NCSS Statistical Software) is a menu-driven program. The data are entered into a table on the screen and when all fields are complete, the number per group is automatically calculated. This is shown in bold.

Two-Sample T-Test Power Analysis

Numeric Results for Two-Sample T-Test

Null Hypothesis: Mean1=Mean2. Alternative Hypothesis: Mean1<>Mean2

The standard deviations were assumed to be known and unequal.
Other examples of using sample size programs are given in Presenting medical statistics from proposal to publication (Peacock et al. 2017).


NCSS Statistical Software. PASS: power analysis and sample size software.

Peacock J, Kerry SM, Balise RR. Presenting medical statistics from proposal to publication, 2nd ed. Oxford: Oxford University Press, 2017.

Stata Corporation. Stata: data analysis and statistical software.

Statistical Solutions. nQuery advisor: sample size and power calculations.

Research study documents


Various documents are required when conducting a research study to ensure that the study is designed and conducted according to best methodological standards. Those that specifically include statistical aspects are the:

  • Research protocol

  • Data monitoring charter

  • Statistical analysis plan

Research protocol

The protocol is a written document that summarizes the proposed study. It is useful because it focuses ideas about the research question and sets the aims in the context of work already done. It documents the design, sample size, and the planned statistical analysis, and provides a timetable for the study. It therefore provides a good working document/template for applications for ethical approval and funding.

The research protocol should include the following items:

  • Title

  • Abstract

  • Aim of study

  • Background

  • Study design

  • Sample size (if relevant)

  • Plan of the statistical analyses

  • Ethical issues (if relevant)

  • Costs

  • Timetable

  • Staffing/resources

Clinical protocol

  • Guidelines to describe good practice in different clinical situations, for example, to describe how patients should be managed

  • May be part of research protocol

Operational protocol

This will be more detailed than the research protocol as it gives full details of how the study will be carried out and the guidelines for specific situations.

Example of a published research protocol

Cools and colleagues published a study protocol for an individual patient data meta-analysis of elective high-frequency oscillatory ventilation in preterm infants with respiratory distress syndrome (Cools et al. 2009). The protocol is too long (13 pages) to reproduce here, but key sections are summarized below. The full protocol can be obtained free from the BMC website.

Background—this section described:

  • The clinical problem and the reason why the study was needed

  • The limitations of an aggregate data meta-analysis

  • The benefits given by the proposed individual patient data meta-analysis

Methods and design—this section described:

  • The objectives of the new study

  • How the individual studies for the meta-analysis were identified and the inclusion/exclusion criteria

  • Data management

  • The data items obtained from the individual trialists

  • Planned statistical analyses including primary/secondary outcomes

  • The planned subgroup analyses

  • The planned sensitivity analyses

  • Additional analyses

  • Ethical considerations

  • Project management including the roles of the core group, the trialist group, and the advisory group

  • Funding obtained and competing interests

  • Publication policy

Data Monitoring Committee documents

The role and functioning of the Data Monitoring Committee are described in detail elsewhere (see Formal data monitoring, p. [link]). This committee operates according to documented terms of reference (‘the DMC Charter’).


Cools F, Askie LM, Offringa M. Elective high-frequency oscillatory ventilation in preterm infants with respiratory distress syndrome: an individual patient data meta-analysis. BMC Pediatr 2009; 9:33.

Statistical analysis plan


The statistical analysis plan (SAP) is a formal document for clinical trials. It describes the statistical analyses that are planned and is required in order to avoid post hoc decisions being made about what analyses to do.

Guidelines were published in 2017 that suggest what items of information should be included in a SAP (Gamble et al. 2017). The following items were included:

  • Administrative information including trial registration details and roles and responsibilities

  • Background study details including its objectives

  • Study methods including design, randomization, sample size, any interim analyses, and assessment times

  • Statistical issues including multiplicity, protocol deviations, and analysis populations

  • Trial population details including eligibility and baseline characteristics

  • Analysis including the definition of the outcomes, type of analysis planned, how missing data are handled, how adverse event data are to be reported, and software

The full list of items is given in the publication, which can be accessed on the EQUATOR website.

Further details and explanations are given in a longer document (DeMets et al. 2017).

Publishing SAPs

Researchers sometimes publish their SAPs for trials, such as Lo et al. (2016).

Statistical analysis plans for observational studies

Observational studies may be exploratory and so there can be a reluctance to have a plan of analysis. However, it is good practice to have a prior strategy for the analysis to provide transparency and reproducibility and prevent data dredging (Thomas and Peterson 2012).


DeMets DL, Cook TD, Buhr KA. Guidelines for statistical analysis plans. JAMA 2017; 318:2301–3.

Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Dore C, et al. Guidelines for the content of statistical analysis plans in clinical trials. JAMA 2017; 318:2337–43.

Lo JW, Bunce C, Charteris D, Banerjee P, Phillips R, Cornelius VR. A phase III, multi-centre, double-masked randomised controlled trial of adjunctive intraocular and peri-ocular steroid (triamcinolone acetonide) versus standard treatment in eyes undergoing vitreoretinal surgery for open globe trauma (ASCOT): statistical analysis plan. Trials 2016; 17:383.

Thomas L, Peterson ED. The value of statistical analysis plans in observational research: defining high-quality research from the start. JAMA 2012; 308:773–4.